🌟 What is this article about?
In order to use `cram_bandit()`, users must supply a matrix $\pi$ of action selection probabilities for each combination of policy update and context in the historical dataset. While some environments log these probabilities directly, many contextual bandit libraries (such as `contextual`) only store policy parameters (e.g., regression coefficients) without explicit probability tracking.
This article explains how the Cram Bandit Helpers reconstruct these probabilities $\pi$ from the stored policy parameters for common policies:
| Policy Type | Class Name |
|---|---|
| Epsilon-Greedy | `BatchContextualEpsilonGreedyPolicy` |
| LinUCB Disjoint with ε-greedy exploration | `BatchLinUCBDisjointPolicyEpsilon` |
| Thompson Sampling | `BatchContextualLinTSPolicy` |
Both theoretical formulas and practical code snippets are provided.
🛠️ Policy parameters explained
When using linear bandit algorithms like Epsilon-Greedy, LinUCB, or Thompson Sampling, each arm maintains summary statistics (parameters) to estimate the expected reward:
$A_k$ is the Gram matrix:

$$A_k = X_k^\top X_k$$

where $X_k$ is the matrix of feature vectors (contexts) for all rounds where arm $k$ was selected.

➔ Interpretation: $A_k$ captures the amount of information (and correlation structure) about the features for arm $k$. It plays the role of a “feature covariance matrix.”

$b_k$ is the response vector:

$$b_k = X_k^\top r_k$$

where $r_k$ is the vector of observed rewards for arm $k$.

➔ Interpretation: $b_k$ captures the relationship between the observed rewards and the contexts for arm $k$.
These sufficient statistics allow the policy to compute the least squares estimate for the reward model:

$$\hat{\theta}_k = A_k^{-1} b_k$$

where:

- $\hat{\theta}_k$ is the estimated coefficient vector that predicts the expected reward of arm $k$ as a function of the context.

Thus:

- $A_k$ tells us how confident we are about $\hat{\theta}_k$ (smaller eigenvalues of $A_k$ imply more uncertainty).
- $b_k$ provides the observed signal (reward-weighted context features) used to fit $\hat{\theta}_k$.
The policy selects an action based on the estimated expected reward $x_t^\top \hat{\theta}_k$ of each arm and then observes the reward associated with this choice, which is used to update the parameters $A_k$ and $b_k$ of the policy.
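To make these quantities concrete, here is a minimal base-R sketch of the computations described above; the data and variable names are purely illustrative, and this is not the package's internal code.

```r
# Minimal sketch: sufficient statistics and least squares estimate for one arm.
set.seed(42)
d <- 3                           # context dimension
X <- matrix(rnorm(20 * d), 20)   # contexts of the rounds where this arm was played
r <- rnorm(20)                   # rewards observed on those rounds

A <- crossprod(X)                # Gram matrix       A_k = X_k' X_k
b <- crossprod(X, r)             # response vector   b_k = X_k' r_k

theta_hat <- solve(A, b)         # least squares estimate theta_hat_k = A_k^{-1} b_k

x_t    <- rnorm(d)                                # a new context
mu_hat <- as.numeric(crossprod(x_t, theta_hat))   # estimated expected reward x_t' theta_hat_k
mu_hat
```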
✨ Epsilon-Greedy Policy
🤖 Theoretical computation
In Epsilon-Greedy, with exploration rate $\epsilon$, the probability of selecting one of the best arms is:

$$\pi_t(a \mid x_t) = \frac{1 - \epsilon}{|\mathcal{B}_t|} + \frac{\epsilon}{K}$$

While the probability of selecting an arm that is not among the best arms is:

$$\pi_t(a \mid x_t) = \frac{\epsilon}{K}$$

where:

- Best arms $\mathcal{B}_t$ are those with maximal estimated rewards.
- $K$ is the total number of available arms.

We define the least squares estimate as:

$$\hat{\theta}_k = A_k^{-1} b_k$$

where:

- $A_k = X_k^\top X_k$ is the Gram matrix
- $b_k = X_k^\top r_k$ is the response vector

Best arms are identified via the estimated expected reward:

$$\hat{\mu}_k(x_t) = x_t^\top \hat{\theta}_k$$
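As an illustration of this rule, the following base-R sketch reconstructs the selection probabilities for a single context; the arm estimates are made-up values and the code is not the exported helper itself.

```r
# Sketch: epsilon-greedy action probabilities for one context x_t,
# given per-arm least squares estimates theta_hat (illustrative values).
epsilon   <- 0.1
theta_hat <- list(c(0.2, -0.1), c(0.5, 0.3), c(0.5, 0.3))   # one vector per arm (ties allowed)
x_t       <- c(1, 2)
K         <- length(theta_hat)

mu_hat <- vapply(theta_hat, function(th) sum(x_t * th), numeric(1))   # x_t' theta_hat_k
best   <- which(mu_hat == max(mu_hat))                                # set of best arms B_t

pi_t       <- rep(epsilon / K, K)                         # every arm gets epsilon / K
pi_t[best] <- pi_t[best] + (1 - epsilon) / length(best)   # best arms share the remaining 1 - epsilon
pi_t
sum(pi_t)   # probabilities sum to 1
```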
🔢 LinUCB Disjoint Policy with ε-Greedy
🤖 Theoretical computation
LinUCB selects arms based on Upper Confidence Bounds (UCBs):

$$\mathrm{UCB}_k(x_t) = x_t^\top \hat{\theta}_k + \alpha \sqrt{x_t^\top A_k^{-1} x_t}$$

where:

- $\sqrt{x_t^\top A_k^{-1} x_t}$ measures uncertainty
- $\alpha$ controls the exploration strength

The action probabilities follow the same structure as Epsilon-Greedy but with UCB scores instead of plain expected rewards, i.e. the probability of selecting one of the best arms is:

$$\pi_t(a \mid x_t) = \frac{1 - \epsilon}{|\mathcal{B}_t|} + \frac{\epsilon}{K}$$

While the probability of selecting an arm that is not among the best arms is:

$$\pi_t(a \mid x_t) = \frac{\epsilon}{K}$$

where the “best arms” $\mathcal{B}_t$ are those with the highest UCB scores.
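A corresponding sketch (again with illustrative data, not the exported helper) simply replaces the plain estimates by UCB scores before applying the same probability rule:

```r
# Sketch: UCB scores per arm, then the epsilon-greedy probability rule on them.
set.seed(7)
alpha   <- 1.0
epsilon <- 0.1
x_t     <- c(1, 0.5)
K       <- 3

# Illustrative per-arm sufficient statistics A_k and b_k
A_list <- replicate(K, crossprod(matrix(rnorm(10 * 2), 10)), simplify = FALSE)
b_list <- replicate(K, rnorm(2), simplify = FALSE)

ucb <- mapply(function(A, b) {
  theta_hat <- solve(A, b)
  as.numeric(crossprod(x_t, theta_hat)) +                   # exploitation: x_t' theta_hat_k
    alpha * sqrt(as.numeric(t(x_t) %*% solve(A) %*% x_t))   # exploration bonus
}, A_list, b_list)

best       <- which(ucb == max(ucb))
pi_t       <- rep(epsilon / K, K)
pi_t[best] <- pi_t[best] + (1 - epsilon) / length(best)
pi_t
```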
🤓 Thompson Sampling (LinTS)
🤖 Theoretical computation
In Thompson Sampling, actions are sampled according to posterior draws and the action associated with the maximum value is chosen. The probability that arm $k$ is optimal is:

$$\pi_t(k \mid x_t) = \mathbb{P}\left( x_t^\top \tilde{\theta}_k \geq x_t^\top \tilde{\theta}_j \;\; \forall j \neq k \right)$$

where $\tilde{\theta}_j \sim \mathcal{N}\!\left(\hat{\theta}_j, \sigma^2 A_j^{-1}\right)$.
This requires computing a multivariate probability, which we approximate via adaptive numerical integration.
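Because the posterior draws are independent across arms, this probability reduces to a one-dimensional integral of arm $k$'s density times the product of the other arms' CDFs. The sketch below illustrates that idea with base R's `integrate()`; it is a simplified stand-in for the package's own integration routine, and the means and standard deviations are made-up values.

```r
# Sketch: probability that arm k yields the largest posterior draw x_t' theta_tilde,
# with independent draws x_t' theta_tilde_j ~ N(m_j, v_j^2) across arms.
m <- c(0.4, 0.3, 0.1)    # posterior means  m_j = x_t' theta_hat_j                (illustrative)
v <- c(0.2, 0.3, 0.25)   # posterior sds    v_j = sigma * sqrt(x_t' A_j^{-1} x_t) (illustrative)

prob_arm_optimal <- function(k, m, v) {
  integrand <- function(u) {
    out <- dnorm(u, m[k], v[k])           # density of arm k's draw at u
    for (j in seq_along(m)[-k]) {
      out <- out * pnorm(u, m[j], v[j])   # all other arms' draws fall below u
    }
    out
  }
  integrate(integrand, lower = -Inf, upper = Inf)$value
}

pi_t <- vapply(seq_along(m), prob_arm_optimal, numeric(1), m = m, v = v)
pi_t
sum(pi_t)   # close to 1 up to numerical error
```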
👨‍💻 Practical Workflow
When using your bandit policy in practice:
- Record action choices, contexts, and policy parameters (e.g., $A_k$, $b_k$)
- Calculate the action selection probabilities. If your policy is one of those presented above, feel free to rely on our helper functions to build $\pi$.
- Feed `pi`, `arm`, and `reward` into `cram_bandit()` for evaluation of your policy.
🧪 Estimand Calculation in cram_bandit_sim()
The following only concerns the simulation framework we implemented for benchmarking purposes.
Once the policies are reconstructed, we compute their true expected value — referred to as the estimand — by applying the learned policy to independent contexts and evaluating it against the known reward function used in the simulation.
This is done via:
`compute_estimand(data_group, list_betas, policy, policy_name, batch_size, bandit)`
Accurately computing the estimand is critical for properly assessing the bias and confidence interval coverage of the Cram estimate in our simulations.
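Conceptually, and leaving aside the batching and policy-reconstruction details that `compute_estimand()` takes care of, the estimand is a Monte Carlo average of the true expected reward of the learned policy over fresh contexts. The sketch below illustrates this for an epsilon-greedy policy with made-up true and learned coefficients; it is not the package's implementation.

```r
# Conceptual sketch: estimand of a learned epsilon-greedy policy,
# evaluated against the simulation's true reward model.
set.seed(1)
d <- 2; K <- 3; n_eval <- 1e5; eps <- 0.1

theta_true <- matrix(rnorm(K * d), nrow = K)                      # true coefficients (known in simulation)
theta_hat  <- theta_true + 0.1 * matrix(rnorm(K * d), nrow = K)   # coefficients learned by the policy

X_new   <- matrix(rnorm(n_eval * d), ncol = d)   # independent evaluation contexts
mu_hat  <- X_new %*% t(theta_hat)                # estimated rewards      (n_eval x K)
mu_true <- X_new %*% t(theta_true)               # true expected rewards  (n_eval x K)

best   <- max.col(mu_hat)                        # best arm per context
pi_mat <- matrix(eps / K, n_eval, K)
pi_mat[cbind(seq_len(n_eval), best)] <- pi_mat[cbind(seq_len(n_eval), best)] + (1 - eps)

estimand <- mean(rowSums(pi_mat * mu_true))      # E_x[ sum_a pi(a | x) * mu_true(a, x) ]
estimand
```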
📂 Useful Links
-
contextual
package: original framework -
cram_bandit()
: Cram evaluation for contextual bandits -
cram_bandit_sim()
: Full simulation engine with automatic pi estimation
🌟 Acknowledgments
These helper functions were designed to faithfully reconstruct action probabilities for the policies implemented in `contextual`, while enabling reproducible Cram-based evaluation.