This function computes the crammed variance estimate of the policy value estimate in the contextual bandit on-policy evaluation setting.
Arguments
- pi
An array of shape (T × B, T, K) or (T × B, T), where T is the number of learning steps (policy updates), B is the batch size, and K is the number of arms, so that T × B is the total number of contexts. If 3D, pi[j, t, a] gives the probability that policy pi_t assigns arm a to context X_j. If 2D, pi[j, t] gives the probability that policy pi_t assigns arm A_j (the arm actually chosen for X_j in the history) to context X_j. See the vignette for more details.
- reward
A vector of observed rewards of length T × B.
- arm
A vector of length T × B indicating which arm was selected in each context.
- batch
(Optional) A vector or integer. If a vector, it gives the batch assignment for each context. If an integer, it is interpreted as the batch size, and contexts are assigned to batches in dataset order. Default is 1.
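A minimal sketch of how the argument shapes fit together, written in Python/NumPy for illustration (all variable names here are hypothetical, not part of the function's interface):

```python
import numpy as np

rng = np.random.default_rng(0)

T, B, K = 5, 4, 3   # learning steps, batch size, number of arms
n = T * B           # total number of contexts

# 3D form: pi3d[j, t, a] = probability that policy pi_t assigns arm a
# to context X_j. Each (j, t) row is a distribution over the K arms.
pi3d = rng.dirichlet(np.ones(K), size=(n, T))   # shape (T*B, T, K)

arm = rng.integers(0, K, size=n)     # arm actually selected per context
reward = rng.normal(size=n)          # observed rewards, length T*B

# Equivalent 2D form: pi2d[j, t] = probability that pi_t assigns the
# arm A_j that was actually chosen for context X_j in the history.
pi2d = pi3d[np.arange(n), :, arm]    # shape (T*B, T)

# Integer batch argument of B is equivalent to this explicit vector:
# contexts are grouped into batches of size B in dataset order.
batch = np.repeat(np.arange(T), B)   # batch index per context
```

With these shapes, row j of pi3d sums to 1 across arms for every policy index t, and pi2d is just pi3d evaluated at the logged arms.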