Cram ML
In the article “Introduction & Cram Policy Part 1”, we
introduced the Cram method, which enables simultaneous learning and
evaluation of a binary policy. In this section, we extend the framework
to machine learning tasks through the cram_ml()
function.
Output of Cram ML
Cram ML outputs the Expected Loss Estimate, which refers to the following statistical quantity:

$$R(\hat{\theta}) = \mathbb{E}_{Z \sim \mathcal{P}}\left[\ell(Z; \hat{\theta})\right]$$

The Expected Loss Estimate represents the average loss $\ell$ that would be incurred if a model $\hat{\theta}$, trained on a given data sample, were deployed across the entire population. In the Cram framework, this corresponds to estimating how the learned model generalizes to unseen data, i.e., how it performs on new observations $Z$ drawn from the true data-generating distribution $\mathcal{P}$, independently of the training data.
This expected loss serves as the population-level performance metric (analogous to a policy value in policy learning), and Cram provides a consistent, low-bias estimate of this quantity by combining models trained on sequential batches and evaluating them on held-out observations.
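Schematically, writing $\hat{\theta}_k$ for the model trained on the first $k$ of $T$ batches, the Cram estimate rests on the telescoping decomposition of the expected loss (a sketch of the underlying idea, not the package's exact estimator; see the Cram references for details):

$$R(\hat{\theta}_T) = R(\hat{\theta}_1) + \sum_{k=2}^{T} \left[ R(\hat{\theta}_k) - R(\hat{\theta}_{k-1}) \right],$$

where each increment $R(\hat{\theta}_k) - R(\hat{\theta}_{k-1})$ is estimated only on batches that neither $\hat{\theta}_k$ nor $\hat{\theta}_{k-1}$ was trained on.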
Built-in Model
To illustrate the use of cram_ml(), we begin by generating a synthetic dataset for a regression task. The data consists of three independent covariates and a continuous outcome.
library(cramR)   # provides cram_ml()
library(caret)   # provides trainControl(), used below

set.seed(42)
X_data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
Y_data <- rnorm(100)
data_df <- data.frame(X_data, Y = Y_data)
This section illustrates how to use cram_ml()
with
built-in modeling options available through the cramR
package. The function integrates with the caret
framework,
allowing users to specify a learning algorithm, a loss function, and a
batching strategy to evaluate model performance.
Beyond caret, cram_ml() also supports fully custom model training, prediction, and loss functions, making it suitable for virtually any machine learning task, including regression and classification.
The cram_ml()
function offers extensive flexibility
through its loss_name
and caret_params
arguments.
loss_name argument
The loss_name
argument specifies the performance metric
used to evaluate the model at each batch. Note that Cram needs to calculate individual losses (i.e., map each data point and its prediction to a loss value), which are then internally averaged across batches and observations to form the Expected Loss Estimate.
Depending on the task, losses are interpreted as follows:
We denote by $Z_i = (X_i, Y_i)$ a data point and by $\hat{\theta}_k$ a model trained on the first $k$ batches of data, to illustrate how the individual losses are computed using the built-in loss names of the package.
Regression Losses
- Squared Error ("se"): measures the squared difference between predicted and actual outcomes, $\ell(Z_i; \hat{\theta}_k) = (\hat{Y}_i - Y_i)^2$, where $\hat{Y}_i$ denotes the model's prediction at $X_i$.
- Absolute Error ("ae"): captures the magnitude of the prediction error, regardless of direction, $\ell(Z_i; \hat{\theta}_k) = |\hat{Y}_i - Y_i|$.
Classification Losses
- Accuracy ("accuracy"): strictly a performance metric rather than a loss; Cram allows you to estimate any performance metric you define, and accuracy is a built-in example. The metric is 1 for a correct prediction and 0 for an incorrect one, $\ell(Z_i; \hat{\theta}_k) = \mathbf{1}\{\hat{Y}_i = Y_i\}$.
- Logarithmic Loss ("logloss"): measures how well predicted class probabilities align with the true class labels; it applies to both binary and multiclass classification tasks. For a given observation $i$, let $y_i$ be the true class label and $\hat{p}_{i, y_i}$ be the predicted probability the model assigns to class $y_i$. The individual log loss is computed as $\ell(Z_i; \hat{\theta}_k) = -\log(\hat{p}_{i, y_i})$, that is, the negative log of the probability assigned to the true class.
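To make the mapping from a prediction to an individual loss concrete, the following base-R snippet computes each built-in loss by hand on toy values (illustrative only; these per-observation values are what cram_ml() averages internally, not the package's actual code):

y <- c(1.2, -0.5, 0.3)                     # true outcomes (regression)
y_hat <- c(1.0, 0.0, 0.1)                  # model predictions
se <- (y_hat - y)^2                        # "se": one squared error per observation
ae <- abs(y_hat - y)                       # "ae": one absolute error per observation

labels <- c(1, 0, 1)                       # true classes (classification)
pred <- c(1, 0, 0)                         # predicted hard labels
acc <- as.numeric(pred == labels)          # "accuracy": 1 if correct, 0 otherwise

p1 <- c(0.8, 0.3, 0.4)                     # predicted probability of class 1
p_true <- ifelse(labels == 1, p1, 1 - p1)  # probability assigned to the true class
logloss <- -log(p_true)                    # "logloss": one value per observation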
caret_params argument
The caret_params
list defines how the model should be
trained using the caret
package. It can include any argument supported by
caret::train(), allowing full control over model specification and tuning. Common components include:
- method: the machine learning algorithm (e.g., "lm" for linear regression, "rf" for random forest, "xgbTree" for XGBoost, "svmLinear" for support vector machines)
- trControl: the resampling strategy (e.g., trainControl(method = "cv", number = 5) for 5-fold cross-validation, or "none" for training without resampling)
- tuneGrid: a grid of hyperparameters for tuning (e.g., expand.grid(mtry = c(2, 3, 4)))
- metric: the model selection metric used during tuning (e.g., "RMSE" or "Accuracy")
- preProcess: optional preprocessing steps (e.g., centering, scaling)
- importance: logical flag to compute variable importance (useful for tree-based models)
Refer to the full documentation at caret model training and tuning for the complete list of supported arguments and options.
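Before the concrete run below, here is an illustrative (not executed) specification combining several of these components, assuming a random forest tuned by 5-fold cross-validation:

# Illustrative only: a random forest with a cross-validated tuning grid
caret_params_rf_tuned <- list(
  method = "rf",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = expand.grid(mtry = c(2, 3)),
  metric = "RMSE"
)

The example that follows keeps things minimal: a linear model trained without resampling.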
caret_params_lm <- list(
method = "lm",
trControl = trainControl(method = "none")
)
result <- cram_ml(
data = data_df,
formula = Y ~ .,
batch = 5,
loss_name = "se",
caret_params = caret_params_lm
)
print(result)
#> $raw_results
#> Metric Value
#> 1 Expected Loss Estimate 0.86429
#> 2 Expected Loss Standard Error 0.73665
#> 3 Expected Loss CI Lower -0.57952
#> 4 Expected Loss CI Upper 2.30809
#>
#> $interactive_table
#>
#> $final_ml_model
#> Linear Regression
#>
#> 100 samples
#> 3 predictor
#>
#> No pre-processing
#> Resampling: None
Case of categorical target variable
The cram_ml()
function can also be used for
classification tasks, whether predicting hard labels or
class probabilities. This is controlled via the classify
argument and loss_name. Below, we demonstrate two typical use cases.
Also note that all data inputs need to be numeric; hence, when Y is categorical, it should contain numeric values representing the class of each observation. There is no need to use the factor type with cram_ml().
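For example, if your outcome is currently stored as a factor, you can recode it into numeric class codes before calling cram_ml() (a small illustrative snippet):

# Illustrative: recode a factor outcome into numeric class codes (0, 1, ...)
Y_factor <- factor(c("spam", "ham", "spam"))
Y_numeric <- as.numeric(Y_factor) - 1  # "ham" -> 0, "spam" -> 1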
Case 1: Predicting Class Labels
In this case, the model outputs hard predictions (labels, e.g. 0, 1, 2 etc.), and the metric used is classification accuracy—the proportion of correctly predicted labels.
- Use loss_name = "accuracy"
- Set classProbs = FALSE in trainControl
- Set classify = TRUE in cram_ml()
set.seed(42)
# Generate binary classification dataset
X_data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
Y_data <- rbinom(nrow(X_data), 1, 0.5)
data_df <- data.frame(X_data, Y = Y_data)
# Define caret parameters: predict labels (default behavior)
caret_params_rf <- list(
method = "rf",
trControl = trainControl(method = "none")
)
# Run CRAM ML with accuracy as loss
result <- cram_ml(
data = data_df,
formula = Y ~ .,
batch = 5,
loss_name = "accuracy",
caret_params = caret_params_rf,
classify = TRUE
)
print(result)
#> $raw_results
#> Metric Value
#> 1 Expected Loss Estimate 0.48750
#> 2 Expected Loss Standard Error 0.43071
#> 3 Expected Loss CI Lower -0.35668
#> 4 Expected Loss CI Upper 1.33168
#>
#> $interactive_table
#>
#> $final_ml_model
#> Random Forest
#>
#> 100 samples
#> 3 predictor
#> 2 classes: 'class0', 'class1'
#>
#> No pre-processing
#> Resampling: None
Case 2: Predicting Class Probabilities
In this setup, the model outputs class probabilities, and the loss is evaluated using logarithmic loss (logloss), a standard metric for probabilistic classification.
- Use loss_name = "logloss"
- Set classProbs = TRUE in trainControl
- Set classify = TRUE in cram_ml()
set.seed(42)
# Generate binary classification dataset
X_data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
Y_data <- rbinom(nrow(X_data), 1, 0.5)
data_df <- data.frame(X_data, Y = Y_data)
# Define caret parameters for probability output
caret_params_rf_probs <- list(
method = "rf",
trControl = trainControl(method = "none", classProbs = TRUE)
)
# Run CRAM ML with logloss as the evaluation loss
result <- cram_ml(
data = data_df,
formula = Y ~ .,
batch = 5,
loss_name = "logloss",
caret_params = caret_params_rf_probs,
classify = TRUE
)
print(result)
#> $raw_results
#> Metric Value
#> 1 Expected Loss Estimate 0.93225
#> 2 Expected Loss Standard Error 0.48118
#> 3 Expected Loss CI Lower -0.01085
#> 4 Expected Loss CI Upper 1.87534
#>
#> $interactive_table
#>
#> $final_ml_model
#> Random Forest
#>
#> 100 samples
#> 3 predictor
#> 2 classes: 'class0', 'class1'
#>
#> No pre-processing
#> Resampling: None
Together, these arguments allow users to apply cram_ml() to a wide variety of built-in machine learning models and losses. For users who need to go beyond these built-in choices, the next section presents a workflow for specifying custom models and losses with cram_ml().
Custom Model
In addition to using built-in learners via caret, cram_ml() also supports fully custom model workflows. You can specify your own:
- Model fitting function (custom_fit)
- Prediction function (custom_predict)
- Loss function (custom_loss)
This offers maximum flexibility, allowing Cram to evaluate any learning model with any performance criterion, including regression, classification, or even unsupervised losses such as clustering distance (a sketch is given at the end of this section).
1. custom_fit(data, ...)
This function takes a data frame and returns a fitted model. You may define additional arguments such as hyperparameters or training settings.
- data: A data frame that includes both predictors and the outcome variable Y.
Example: A basic linear model fit on three predictors:
custom_fit <- function(data) {
lm(Y ~ x1 + x2 + x3, data = data)
}
2. custom_predict(model, data)
This function generates predictions from the fitted model on new data. It returns a numeric vector of predicted outcomes.
- model: The fitted model returned by custom_fit()
- data: A data frame of new observations (typically including all original predictors)
Example: Extract predictors and apply a standard
predict()
call:
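custom_predict <- function(model, data) {
  # Keep only the predictor columns (drop the outcome Y), then predict
  predictors <- data[, setdiff(names(data), "Y"), drop = FALSE]
  predict(model, newdata = predictors)
}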
3. custom_loss(predictions, data)
This function defines the loss metric used to evaluate model
predictions. It should return a numeric vector of individual
losses, one per observation. These are internally aggregated by
cram_ml()
to compute the overall performance.
- predictions: A numeric vector of predicted values from the model
- data: The data frame containing the true outcome values (Y)
Example: Define a custom loss function using Squared Error (SE)
custom_loss <- function(predictions, data) {
actuals <- data$Y
se_loss <- (predictions - actuals)^2
return(se_loss)
}
4. Use cram_ml()
with Custom Functions
Once you have defined your custom training, prediction, and loss functions, you can pass them directly to cram_ml() as shown below. Note that caret_params and loss_name, which were used for the built-in functionality, are left as NULL:
set.seed(42)
X_data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
Y_data <- rnorm(100)
data_df <- data.frame(X_data, Y = Y_data)
result <- cram_ml(
data = data_df,
formula = Y ~ .,
batch = 5,
custom_fit = custom_fit,
custom_predict = custom_predict,
custom_loss = custom_loss
)
print(result)
#> $raw_results
#> Metric Value
#> 1 Expected Loss Estimate 0.86429
#> 2 Expected Loss Standard Error 0.73665
#> 3 Expected Loss CI Lower -0.57952
#> 4 Expected Loss CI Upper 2.30809
#>
#> $interactive_table
#>
#> $final_ml_model
#>
#> Call:
#> lm(formula = Y ~ x1 + x2 + x3, data = data)
#>
#> Coefficients:
#> (Intercept) x1 x2 x3
#> 0.031503 0.057754 0.008829 -0.031611
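Finally, to illustrate the earlier remark that the custom interface also accommodates unsupervised criteria, here is a sketch (not run, and purely illustrative of the fit/predict/loss pattern) in which an observation's individual loss is its squared distance to the nearest k-means centroid:

# Illustrative sketch of an unsupervised loss: squared distance to the
# nearest k-means centroid (mirrors the custom interface described above)
custom_fit_kmeans <- function(data) {
  kmeans(data[, c("x1", "x2", "x3")], centers = 3)
}
custom_predict_kmeans <- function(model, data) {
  X <- as.matrix(data[, c("x1", "x2", "x3")])
  # squared distance of each row to every centroid
  d2 <- apply(model$centers, 1, function(mu) rowSums(sweep(X, 2, mu)^2))
  apply(d2, 1, min)  # distance to the nearest centroid
}
custom_loss_kmeans <- function(predictions, data) {
  predictions  # already one squared distance per observation
}

Since the individual losses here are produced directly by the prediction step, custom_loss_kmeans simply passes them through; whether such a setup requires a placeholder outcome column depends on cram_ml()'s formula interface.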