
🌟 What is this article about?

In order to use cram_bandit(), users must supply a matrix of action selection probabilities $\pi_t(X_j, A_j)$ for each combination of policy update $t$ and context $j$ in the historical dataset.

While some environments log these probabilities directly, many contextual bandit libraries (such as contextual) only store policy parameters (e.g., regression coefficients) without explicit probability tracking.

This article explains how the Cram Bandit Helpers reconstruct $\pi_t(X_j, A_j)$ from these parameters for common policies:

| Policy Type | Class Name |
|---|---|
| Epsilon-Greedy | BatchContextualEpsilonGreedyPolicy |
| LinUCB Disjoint with $\varepsilon$-greedy exploration | BatchLinUCBDisjointPolicyEpsilon |
| Thompson Sampling | BatchContextualLinTSPolicy |

Both theoretical formulas and practical code snippets are provided.


🛠️ Policy parameters explained

When using linear bandit algorithms like Epsilon-Greedy, LinUCB, or Thompson Sampling, each arm $k$ maintains summary statistics (parameters) to estimate the expected reward:

  • $A_k$ is the Gram matrix:
    $$A_k = X_k^T X_k$$
    where $X_k$ is the matrix of feature vectors (contexts) for all rounds where arm $k$ was selected.
    Interpretation: $A_k$ captures the amount of information (and correlation structure) about the features for arm $k$. It plays the role of a “feature covariance matrix.”

  • $b_k$ is the response vector:
    $$b_k = X_k^T y_k$$
    where $y_k$ are the observed rewards for arm $k$.
    Interpretation: $b_k$ captures the relationship between the observed rewards and the contexts for arm $k$.

These sufficient statistics allow the policy to compute the Least Squares estimate for the reward model:

$$\theta_k = A_k^{-1} b_k$$

where:

  • $\theta_k$ is the estimated coefficient vector that predicts the expected reward of arm $k$ as a function of the context.

Thus:

  • $A_k$ tells us how confident we are about $\theta_k$ (smaller eigenvalues of $A_k$ imply more uncertainty).
  • $b_k$ provides the observed signal (reward-weighted context features) to fit $\theta_k$.

The policy selects an action based on the $\theta_k$ of each arm $k$ and then observes the reward associated with this choice, which is used to update the policy parameters $A_k$ and $b_k$.
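To make this concrete, here is a minimal R sketch with toy numbers (the object names are illustrative, not part of the cramR API) showing how $A_k$ and $b_k$ are formed and how they yield $\theta_k$:

```r
# Toy sufficient statistics for one arm k (numbers are made up)
X_k <- matrix(c(1, 0.5,
                1, 1.2,
                1, 0.3),
              ncol = 2, byrow = TRUE)   # contexts of rounds where arm k was played
y_k <- c(0.2, 1.0, 0.1)                 # rewards observed on those rounds

A_k <- t(X_k) %*% X_k                   # Gram matrix A_k = X_k' X_k
b_k <- t(X_k) %*% y_k                   # response vector b_k = X_k' y_k

theta_k <- solve(A_k, b_k)              # least squares estimate theta_k = A_k^{-1} b_k

x_t <- c(1, 0.8)                        # a new context
as.numeric(crossprod(x_t, theta_k))     # estimated expected reward X_t' theta_k
```

Using solve(A_k, b_k) rather than explicitly inverting $A_k$ is the standard, numerically stable way to obtain the least squares estimate.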


✨ Epsilon-Greedy Policy

🤖 Theoretical computation

In Epsilon-Greedy, with exploration rate $\varepsilon$, the probability of selecting one of the best arms is:

$$P(A_t \mid X_t) = (1 - \varepsilon) \times \frac{1}{\#\text{best arms}} + \varepsilon \times \frac{1}{K}$$

While the probability of selecting an arm that is not among the best arms is:

$$P(A_t \mid X_t) = \varepsilon \times \frac{1}{K}$$

where:

  • Best arms are those with maximal estimated rewards.
  • $K$ is the total number of available arms.

We define the least squares estimate as:

$$\theta_k = A_k^{-1} b_k \quad \text{(Least Squares estimate)}$$

where:

  • $A_k$ is the Gram matrix $X_k^T X_k$
  • $b_k$ is the vector $X_k^T y_k$

Best arms are identified via the estimated expected reward:

$$\text{Expected reward} = X_t^T \theta_k$$

📊 Code helper

In cramR, this is implemented by:

get_proba_c_eps_greedy(eps, A_list, b_list, contexts, chosen_arms)

This function:

  • Computes $\theta_k$ for each arm
  • Calculates the expected rewards $X_t^T \theta_k$
  • Identifies the best arms
  • Applies the above formula (a simplified sketch follows below)
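The following is a simplified single-round sketch of that computation, assuming the per-arm parameters are stored as lists (A_list, b_list); the names and structure are illustrative, while the cramR helper operates on the full logged history at once:

```r
# Probability that `chosen_arm` was selected under epsilon-greedy,
# given per-arm sufficient statistics A_list[[k]] and b_list[[k]]
# and the context x_t of that round.
prob_eps_greedy_one <- function(eps, A_list, b_list, x_t, chosen_arm) {
  K <- length(A_list)
  # Estimated expected reward x_t' theta_k for each arm
  scores <- vapply(seq_len(K), function(k) {
    theta_k <- solve(A_list[[k]], b_list[[k]])
    sum(x_t * theta_k)
  }, numeric(1))
  best <- which(scores >= max(scores) - 1e-12)   # best arms (allowing numerical ties)
  if (chosen_arm %in% best) {
    (1 - eps) / length(best) + eps / K
  } else {
    eps / K
  }
}

# Example call (toy objects):
# prob_eps_greedy_one(eps = 0.1, A_list, b_list, x_t = c(1, 0.8), chosen_arm = 2)
```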

🔢 LinUCB Disjoint Policy with ε\varepsilon-Greedy

🤖 Theoretical computation

LinUCB selects arms based on Upper Confidence Bounds (UCBs):

$$\text{UCB}_k(X_t) = \mu_k(X_t) + \alpha\, \sigma_k(X_t)$$

where:

  • $\mu_k(X_t) = X_t^T \theta_k$ is the estimated expected reward
  • $\sigma_k(X_t) = \sqrt{X_t^T A_k^{-1} X_t}$ measures uncertainty
  • $\alpha$ controls the exploration strength

The action probabilities follow the same structure as Epsilon-Greedy, but with UCB scores in place of plain expected rewards, i.e., the probability of selecting one of the best arms is:

$$P(A_t \mid X_t) = (1 - \varepsilon) \times \frac{1}{\#\text{best arms}} + \varepsilon \times \frac{1}{K}$$

While the probability of selecting an arm that is not among the best arms is:

$$P(A_t \mid X_t) = \varepsilon \times \frac{1}{K}$$

where “best arms” are those with the highest UCB scores.

📊 Code helper

In cramR, this is implemented by:

get_proba_ucb_disjoint(alpha, eps, A_list, b_list, contexts, chosen_arms)

This function:

  • Computes $\theta_k$
  • Computes $\mu_k(X_t)$ and $\sigma_k(X_t)$
  • Identifies the arms maximizing $\text{UCB}_k(X_t)$
  • Applies the Epsilon-Greedy selection formula (a simplified sketch follows below)
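A single-round sketch in the same spirit (illustrative only, not the package implementation): compared with the Epsilon-Greedy version above, only the per-arm score changes, from $X_t^T \theta_k$ to the UCB score.

```r
# Probability under LinUCB-disjoint with epsilon-greedy exploration
prob_ucb_eps_one <- function(alpha, eps, A_list, b_list, x_t, chosen_arm) {
  K <- length(A_list)
  scores <- vapply(seq_len(K), function(k) {
    A_inv   <- solve(A_list[[k]])
    theta_k <- A_inv %*% b_list[[k]]
    mu_k    <- sum(x_t * theta_k)                    # x_t' theta_k
    sigma_k <- sqrt(drop(t(x_t) %*% A_inv %*% x_t))  # sqrt(x_t' A_k^{-1} x_t)
    mu_k + alpha * sigma_k                           # UCB score
  }, numeric(1))
  best <- which(scores >= max(scores) - 1e-12)       # arms with the highest UCB
  if (chosen_arm %in% best) (1 - eps) / length(best) + eps / K else eps / K
}
```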

🤓 Thompson Sampling (LinTS)

🤖 Theoretical computation

In Thompson Sampling, a reward score is drawn from each arm's posterior and the arm with the maximum sampled value is chosen. The probability that arm $A_t$ is selected is therefore:

$$P(A_t \mid X_t) = \mathbb{P}\left( \theta_{A_t}^T X_t > \theta_{k}^T X_t \quad \forall k \neq A_t \right)$$

where $\theta_k \sim \mathcal{N}(A_k^{-1} b_k,\ \sigma^2 A_k^{-1})$.

This requires computing a multivariate probability, which we approximate via adaptive numerical integration.

📊 Code helper

In cramR, this is implemented by:

get_proba_thompson(sigma, A_list, b_list, contexts, chosen_arms)

This function:

  • Computes posterior means and variances
  • Integrates over the space where the chosen arm $A_t$ has the highest sampled reward
  • Returns clipped probabilities for numerical stability (a Monte Carlo sketch of the idea follows below)
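To convey the idea behind this probability without the adaptive integration, here is a simple Monte Carlo approximation (names are illustrative; get_proba_thompson itself uses numerical integration and additionally clips the result):

```r
# Monte Carlo estimate of P(chosen arm attains the highest sampled score).
# Because the arms' posteriors are independent and X_t' theta_k is univariate
# normal, it is enough to draw one normal score per arm.
prob_thompson_mc <- function(sigma, A_list, b_list, x_t, chosen_arm,
                             n_draws = 10000) {
  K <- length(A_list)
  sampled <- sapply(seq_len(K), function(k) {
    A_inv  <- solve(A_list[[k]])
    mean_k <- drop(A_inv %*% b_list[[k]])               # posterior mean A_k^{-1} b_k
    mu     <- sum(x_t * mean_k)                         # mean of X_t' theta_k
    v      <- sigma^2 * drop(t(x_t) %*% A_inv %*% x_t)  # variance of X_t' theta_k
    rnorm(n_draws, mean = mu, sd = sqrt(v))
  })
  # Fraction of draws in which the chosen arm is the argmax
  mean(sampled[, chosen_arm] >= apply(sampled, 1, max))
}
```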

👨‍💻 Practical Workflow

When using your bandit policy in practice:

  1. Record action choices, contexts, and policy parameters (e.g., $A_k$, $b_k$)
  2. Calculate the action selection probabilities. If your policy is one of those presented above, feel free to rely on our helper functions to build $\pi$.
  3. Feed pi, arm, and reward into cram_bandit() to evaluate your policy (a schematic example follows below).
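Putting the pieces together, a schematic end-to-end example for an epsilon-greedy logging policy might look as follows. Object names and shapes are assumptions made for illustration; see the package documentation for the exact interface of cram_bandit().

```r
# Schematic workflow (object shapes are assumptions, not the exact API)

# 1. Logged during the experiment:
#    contexts       -- one context vector per round (rows of a matrix)
#    chosen_arms    -- the arm selected at each round
#    rewards        -- the observed rewards
#    A_list, b_list -- the per-arm policy parameters recorded along the way

# 2. Reconstruct the action selection probabilities, here for an
#    epsilon-greedy logging policy with exploration rate 0.1:
pi <- get_proba_c_eps_greedy(
  eps         = 0.1,
  A_list      = A_list,
  b_list      = b_list,
  contexts    = contexts,
  chosen_arms = chosen_arms
)

# 3. Evaluate the logging policy with Cram:
result <- cram_bandit(pi = pi, arm = chosen_arms, reward = rewards)
```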

🧪 Estimand Calculation in cram_bandit_sim()

The following only concerns the simulation framework we implemented for benchmarking purposes.

Once the policies are reconstructed, we compute their true expected value — referred to as the estimand — by applying the learned policy to independent contexts and evaluating it against the known reward function used in the simulation.

This is done via:

compute_estimand(data_group, list_betas, policy, policy_name, batch_size, bandit)

Accurately computing the estimand is critical for properly assessing the bias and confidence interval coverage of the Cram estimate in our simulations.
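Conceptually, the estimand of a learned policy is the average of the simulator's true mean reward over the actions that policy takes on fresh contexts. The following is a schematic illustration only, not the implementation of compute_estimand(); policy_choice() and true_mean_reward() are placeholders, not cramR functions.

```r
# Schematic: true value (estimand) of a learned policy, evaluated on
# fresh contexts against the simulator's known mean-reward function.
estimand_of_policy <- function(policy_choice, true_mean_reward, fresh_contexts) {
  values <- apply(fresh_contexts, 1, function(x) {
    k <- policy_choice(x)          # arm the learned policy picks for context x
    true_mean_reward(x, k)         # known expected reward of that arm
  })
  mean(values)                     # Monte Carlo estimate of the policy value
}
```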


🌟 Acknowledgments

These helper functions were designed to faithfully reconstruct action probabilities for the policies implemented in contextual, while enabling reproducible Cram-based evaluation.