This function runs on-policy simulations of contextual bandit algorithms and uses the Cram method to evaluate the statistical properties of the resulting policy value estimates (empirical bias, relative error, RMSE, and confidence interval coverage).

Usage

cram_bandit_sim(
  horizon,
  simulations,
  bandit,
  policy,
  alpha = 0.05,
  do_parallel = FALSE,
  seed = 42
)

Arguments

horizon

An integer specifying the number of timesteps (rounds) per simulation.

simulations

An integer specifying the number of independent Monte Carlo simulations to perform.

bandit

A contextual bandit environment object that generates contexts (feature vectors) and observed rewards for the chosen arms. The interface this object and the policy are expected to expose is sketched after this argument list.

policy

A policy object that takes in a context and selects an arm (action) at each timestep.

alpha

Significance level of the confidence intervals used to compute empirical coverage. Defaults to 0.05 (95% confidence).

do_parallel

Whether to parallelize the simulations. Defaults to FALSE. We recommend keeping this set to FALSE unless necessary; see the vignette for details.

seed

An optional integer to set the random seed for reproducibility. If NULL, no seed is set.
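The bandit and policy objects are assumed to follow the R6 conventions used by cramR's bundled classes (which mirror the 'contextual' package): the bandit exposes get_context() and get_reward(), and the policy exposes get_action() and set_reward(). The toy classes below are a minimal sketch of that assumed interface, not the package's internal code; check the class documentation for the exact signatures.

library(R6)

# Toy bandit: draws a context and returns a noisy reward for the chosen arm
ToyBandit <- R6Class("ToyBandit",
  public = list(
    k = 2L,  # number of arms
    d = 1L,  # number of context features
    get_context = function(t) {
      list(k = self$k, d = self$d, X = rnorm(self$d))
    },
    get_reward = function(t, context, action) {
      list(reward = rnorm(1, mean = action$choice))
    }
  )
)

# Toy policy: picks an arm uniformly at random and ignores feedback
ToyPolicy <- R6Class("ToyPolicy",
  public = list(
    get_action = function(t, context) {
      list(choice = sample.int(context$k, 1))
    },
    set_reward = function(t, context, action, reward) {
      invisible(self)
    }
  )
)

# One simulated round of the loop that cram_bandit_sim repeats `horizon` times
bandit <- ToyBandit$new()
policy <- ToyPolicy$new()
ctx <- bandit$get_context(1)
act <- policy$get_action(1, ctx)
rew <- bandit$get_reward(1, ctx, act)
policy$set_reward(1, ctx, act, rew)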

Value

A list containing:

estimates

A table containing the detailed history of estimates and errors for each simulation.

raw_results

A data frame summarizing key metrics: Empirical Bias on Policy Value, Average relative error on Policy Value, RMSE using relative errors on Policy Value, Empirical Coverage of Confidence Intervals.

interactive_table

An interactive table summarizing the same key metrics in a user-friendly interface.
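For reference, the four metrics reported above can be reproduced from per-simulation results along the lines below. This is a hedged sketch, not the package's own computation; the column names estimate, true_value, ci_lower, and ci_upper are hypothetical placeholders for whatever the estimates table actually contains.

# Sketch of the summary metrics; column names are hypothetical placeholders
summarize_metrics <- function(df) {
  rel_err <- (df$estimate - df$true_value) / df$true_value
  data.frame(
    Metric = c("Empirical Bias on Policy Value",
               "Average relative error on Policy Value",
               "RMSE using relative errors on Policy Value",
               "Empirical Coverage of Confidence Intervals"),
    Value = c(
      mean(df$estimate - df$true_value),          # empirical bias
      mean(rel_err),                              # average relative error
      sqrt(mean(rel_err^2)),                      # RMSE of relative errors
      mean(df$ci_lower <= df$true_value &
             df$true_value <= df$ci_upper)        # CI coverage
    )
  )
}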

Examples

# \donttest{
# Number of time steps
horizon       <- 500L

# Number of simulations
simulations   <- 100L

# Number of arms
k <- 4

# Number of context features
d <- 3

# Reward beta parameters of linear model (the outcome generation models,
# one for each arm, are linear with arm-specific parameters betas)
list_betas <- cramR::get_betas(simulations, d, k)

# Define the contextual linear bandit, where sigma is the scale
# of the noise in the outcome linear model
bandit        <- cramR::ContextualLinearBandit$new(k = k,
                                                    d = d,
                                                    list_betas = list_betas,
                                                    sigma = 0.3)

# Define the policy object (choose between Contextual Epsilon Greedy,
# UCB Disjoint and Thompson Sampling)
policy <- cramR::BatchContextualEpsilonGreedyPolicy$new(epsilon=0.1,
                                                         batch_size=5)
# policy <- cramR::BatchLinUCBDisjointPolicyEpsilon$new(alpha=1.0,epsilon=0.1,batch_size=1)
# policy <- cramR::BatchContextualLinTSPolicy$new(v = 0.1, batch_size=1)


sim <- cram_bandit_sim(horizon, simulations,
                       bandit, policy,
                       alpha=0.05, do_parallel = FALSE)
#> Simulation horizon: 500
#> Number of simulations: 101
#> Number of batches: 1
#> Starting main loop.
#> Finished main loop.
#> Completed simulation in 0:00:04.844
#> Computing statistics.

sim$summary_table
#>                                       Metric   Value
#> 1             Empirical Bias on Policy Value 0.01049
#> 2     Average relative error on Policy Value 0.03564
#> 3 RMSE using relative errors on Policy Value 0.31144
#> 4 Empirical Coverage of Confidence Intervals 0.97000
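
# The remaining components documented under "Value" can be inspected as well
# (assuming those list element names; adjust if your cramR version differs):
# head(sim$estimates)    # detailed history of estimates and errors per simulation
# sim$raw_results        # data frame of the summary metrics
# sim$interactive_table  # interactive table of the same metrics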
# }