Cram Bandit: On-policy Statistical Evaluation in Contextual Bandits
Source: R/cram_bandit.R
Performs the Cram method for on-policy statistical evaluation in contextual bandits.
Arguments
- pi
An array of shape (T × B, T, K) or (T × B, T), where T is the number of learning steps (or policy updates), B is the batch size, K is the number of arms, and T × B is the total number of contexts. If 3D, pi[j, t, a] gives the probability that policy pi_t assigns arm a to context X_j. If 2D, pi[j, t] gives the probability that policy pi_t assigns arm A_j (the arm actually chosen for X_j in the history) to context X_j; a sketch showing how to derive the 2D format from the 3D one follows this argument list. Please see the vignette for more details.
- arm
A vector of length T × B indicating which arm was selected in each context.
- reward
A vector of observed rewards of length T × B.
- batch
(Optional) A vector or integer. If a vector, it gives the batch assignment of each context. If an integer, it is interpreted as the batch size, and contexts are assigned to batches in the order of the dataset. Default is 1.
- alpha
Significance level for the confidence interval on the policy value estimate. Default is 0.05 (95% confidence).
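If the full (T × B, T, K) array is available, the 2D format of pi can be derived from it by keeping, for each context, only the probability of the arm that was actually chosen. Below is a minimal, self-contained sketch with small illustrative sizes; the names pi3d, pi2d, T_steps and n are illustrative and not part of the package API.
# Sketch: converting the 3D pi format into the 2D format
set.seed(1)
T_steps <- 5   # number of learning steps (illustrative)
B <- 1         # batch size (illustrative)
K <- 3         # number of arms (illustrative)
n <- T_steps * B   # total number of contexts
pi3d <- array(runif(n * T_steps * K, 0.1, 1), dim = c(n, T_steps, K))
# Normalize probabilities across arms, as in the Examples below
for (t in 1:T_steps) {
  for (j in 1:n) {
    pi3d[j, t, ] <- pi3d[j, t, ] / sum(pi3d[j, t, ])
  }
}
arm <- sample(1:K, n, replace = TRUE)   # arms actually chosen in the history
pi2d <- matrix(NA_real_, nrow = n, ncol = T_steps)
for (j in 1:n) {
  pi2d[j, ] <- pi3d[j, , arm[j]]   # probability of the chosen arm A_j under each policy pi_t
}
# pi2d has shape (T x B, T) and can be passed to cram_bandit() in place of pi3d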
Value
A list containing:
- raw_results
A data frame summarizing key metrics: the Policy Value Estimate, its Standard Error, and the lower and upper bounds of its confidence interval, as shown in the example output below; see the access sketch after this list.
- interactive_table
An interactive table summarizing the same key metrics in a user-friendly interface.
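For programmatic access, individual metrics can be pulled out of raw_results by filtering on its Metric column. A minimal sketch, assuming result holds the list returned by cram_bandit as in the Examples below:
# Sketch: extracting metrics from raw_results by name
estimate <- result$raw_results$Value[result$raw_results$Metric == "Policy Value Estimate"]
ci_lower <- result$raw_results$Value[result$raw_results$Metric == "Policy Value CI Lower"]
ci_upper <- result$raw_results$Value[result$raw_results$Metric == "Policy Value CI Upper"]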
Examples
# Example with batch size of 1
# Set random seed for reproducibility
set.seed(42)
# Define parameters
T <- 100 # Number of timesteps
K <- 4 # Number of arms
# Simulate a 3D array pi of shape (T, T, K)
# - First dimension: Individuals (context Xj)
# - Second dimension: Time steps (pi_t)
# - Third dimension: Arms (depth)
pi <- array(runif(T * T * K, 0.1, 1), dim = c(T, T, K))
# Normalize probabilities so that each row sums to 1 across arms
for (t in 1:T) {
  for (j in 1:T) {
    pi[j, t, ] <- pi[j, t, ] / sum(pi[j, t, ])
  }
}
# Simulate arm selections (randomly choosing an arm)
arm <- sample(1:K, T, replace = TRUE)
# Simulate rewards (assume normally distributed rewards)
reward <- rnorm(T, mean = 1, sd = 0.5)
result <- cram_bandit(pi, arm, reward, batch = 1, alpha = 0.05)
result$raw_results
#> Metric Value
#> 1 Policy Value Estimate 0.67621
#> 2 Policy Value Standard Error 0.04394
#> 3 Policy Value CI Lower 0.59008
#> 4 Policy Value CI Upper 0.76234
result$interactive_table
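A second, analogous sketch with a batch size larger than 1; parameter values are illustrative, and the only structural change is that the first dimension of pi and the lengths of arm and reward become T x B.
# Example sketch with batch size of 2
set.seed(123)
T <- 50   # Number of learning steps (policy updates)
B <- 2    # Batch size
K <- 4    # Number of arms
n <- T * B   # Total number of contexts
# Simulate a 3D array pi of shape (T x B, T, K) and normalize across arms
pi <- array(runif(n * T * K, 0.1, 1), dim = c(n, T, K))
for (t in 1:T) {
  for (j in 1:n) {
    pi[j, t, ] <- pi[j, t, ] / sum(pi[j, t, ])
  }
}
arm <- sample(1:K, n, replace = TRUE)
reward <- rnorm(n, mean = 1, sd = 0.5)
# Passing the batch size as an integer; a batch assignment vector such as
# rep(1:T, each = B) should be equivalent here, since contexts are assigned
# to batches in the order of the dataset
result_b2 <- cram_bandit(pi, arm, reward, batch = B, alpha = 0.05)
result_b2$raw_results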