Introduction to cvms

Ludvig Renbo Olsen

2019-09-29

Abstract

This vignette is an introduction to the package cvms (Cross-Validation for Model Selection).
 
 
Contact author at r-pkgs@ludvigolsen.dk  
 

Main functions

The main functions of cvms are cross_validate(), cross_validate_fn(), evaluate(), baseline(), and combine_predictors(), each covered in the sections below.

The difference between cross_validate() and cross_validate_fn()

Originally, cvms only provided the option to cross-validate Gaussian and binomial regression models, fitting the models internally with the lm(), lmer(), glm(), and glmer() functions. The cross_validate() function has thus been designed specifically to work with those functions.

To allow cross-validation of custom model functions like support-vector machines, neural networks, etc., the cross_validate_fn() function has been added. You provide a model function and (if defaults fail) a predict function, and it does the rest (see examples below).

Examples

Attach packages
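The examples below assume a setup along these lines (this exact set of packages is an assumption; groupdata2 supplies fold() and partition(), and dplyr supplies the pipe, arrange(), and mutate()):

library(cvms)
library(groupdata2)  # fold(), partition()
library(dplyr)       # %>%, arrange(), mutate()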

Load data

The dataset participant.scores comes with cvms.
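We assign it to a new name so the original stays intact:

data <- participant.scores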

Fold data

Create a grouping factor for subsetting the data into folds using groupdata2::fold(), then order the dataset by the folds.
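A minimal sketch of the folding step (the seed and k = 4 are choices matching the four folds shown below; cat_col balances the diagnosis levels across folds, and id_col keeps all rows from the same participant in a single fold):

set.seed(7)  # arbitrary seed, for reproducibility
data <- fold(data, k = 4,
             cat_col = "diagnosis",
             id_col = "participant") %>%
  arrange(.folds)

The first rows of the folded data: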

participant   age   diagnosis   score   session   .folds
          9    34           0      33         1        1
          9    34           0      53         2        1
          9    34           0      66         3        1
          8    21           1      16         1        1
          8    21           1      32         2        1
          8    21           1      44         3        1
          2    23           0      24         1        2
          2    23           0      40         2        2
          2    23           0      67         3        2
          1    20           1      10         1        2
          1    20           1      24         2        2
          1    20           1      45         3        2
          6    31           1      14         1        2
          6    31           1      25         2        2
          6    31           1      30         3        2

Cross-validate a single model

Gaussian
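A sketch of the call; the argument names follow the text above (models and family):

CV1 <- cross_validate(data,
                      models = "score ~ diagnosis",
                      fold_cols = ".folds",
                      family = "gaussian")

The output contains the overall results plus nested tibbles with the predictions, per-fold results, coefficients, and any warnings and messages: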

Results:

RMSE       MAE        r2m       r2c       AIC       AICc      BIC       Dependent   Fixed
16.35261   13.75772   0.270991  0.270991  194.6218  195.9276  197.9556  score       diagnosis

Predictions (first rows):

Fold Column   Fold   Target   Prediction
.folds        1      33       51.00000
.folds        1      53       51.00000
.folds        1      66       51.00000
.folds        1      16       30.66667
.folds        1      32       30.66667
.folds        1      44       30.66667

Results per fold:

Fold Column   Fold   RMSE       MAE        r2m         r2c         AIC        AICc       BIC
.folds        1      12.56760   10.72222   0.2439198   0.2439198   209.9622   211.1622   213.4963
.folds        2      16.60767   14.77778   0.2525524   0.2525524   182.8739   184.2857   186.0075
.folds        3      15.97355   12.87037   0.2306104   0.2306104   207.9074   209.1074   211.4416
.folds        4      20.26162   16.66049   0.3568816   0.3568816   177.7436   179.1554   180.8772

Coefficients:

term          estimate    std.error   statistic   p.value     Fold   Fold Column
(Intercept)   51.00000    5.901264    8.642216    0.0000000   1      .folds
diagnosis     -20.33333   7.464574    -2.723978   0.0123925   1      .folds
(Intercept)   53.33333    5.718886    9.325826    0.0000000   2      .folds
diagnosis     -19.66667   7.565375    -2.599563   0.0176016   2      .folds
(Intercept)   49.77778    5.653977    8.804030    0.0000000   3      .folds
diagnosis     -18.77778   7.151778    -2.625610   0.0154426   3      .folds
(Intercept)   49.55556    5.061304    9.791065    0.0000000   4      .folds
diagnosis     -22.30556   6.695476    -3.331437   0.0035077   4      .folds

Warnings and messages:

Folds   Fold Columns   Convergence Warnings   Singular Fit Messages   Other Warnings   Warnings and Messages                                                                              Family
4       1              0                      0                       0                list(Fold Column = character(0), Fold = integer(0), Type = character(0), Message = character(0))   gaussian

Binomial
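Only the formula and family change (again a sketch):

CV2 <- cross_validate(data,
                      models = "diagnosis ~ score",
                      fold_cols = ".folds",
                      family = "binomial")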

Results:

Balanced Accuracy   F1          Sensitivity   Specificity   Pos Pred Value   Neg Pred Value   AUC         Lower CI    Upper CI
0.7361111           0.8205128   0.8888889     0.5833333     0.7619048        0.7777778        0.7685185   0.5962701   0.9407669

Kappa       MCC         Detection Rate   Detection Prevalence   Prevalence
0.4927536   0.5048268   0.5333333        0.7                    0.6

ROC curve (first rows):

Sensitivities   Specificities
1.0000000       0.0000000
1.0000000       0.0833333
0.9444444       0.0833333
0.9444444       0.1666667
0.9444444       0.2500000
0.8888889       0.2500000

Confusion matrix:

Fold Column   Prediction   Target   Pos_0   Pos_1   N
.folds        0            0        TP      TN      7
.folds        1            0        FN      FP      5
.folds        0            1        FP      FN      2
.folds        1            1        TN      TP      16

Cross-validate multiple models
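cross_validate() accepts multiple model formulas in one call and returns one row of results per model, which makes them easy to compare. A sketch (the second formula is just an illustrative choice):

models <- c("score ~ diagnosis", "score ~ age")
CV3 <- cross_validate(data,
                      models = models,
                      fold_cols = ".folds",
                      family = "gaussian")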

Repeated cross-validation

Let’s first add some extra fold columns. We use the num_fold_cols argument to add 3 unique fold columns, and tell fold() to keep the existing fold column and simply add the extra ones. We could instead remove the existing fold column, for instance if we were changing the number of folds (k). Note that the original fold column will be renamed to “.folds_1”.
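A sketch of the call (handle_existing_fold_cols = "keep" is what tells fold() to keep the existing fold column):

set.seed(2)  # arbitrary seed
data <- fold(data, k = 4,
             cat_col = "diagnosis",
             id_col = "participant",
             num_fold_cols = 3,
             handle_existing_fold_cols = "keep")

The first rows of the data, now with four fold columns: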

participant   age   diagnosis   score   session   .folds_1   .folds_2   .folds_3   .folds_4
         10    32           0      29         1          4          4          3          1
         10    32           0      55         2          4          4          3          1
         10    32           0      81         3          4          4          3          1
          2    23           0      24         1          2          3          1          2
          2    23           0      40         2          2          3          1          2
          2    23           0      67         3          2          3          1          2
          4    21           0      35         1          3          2          4          4
          4    21           0      50         2          3          2          4          4
          4    21           0      78         3          3          2          4          4
          9    34           0      33         1          1          1          2          3
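We then pass all four fold columns to cross_validate(), which averages the metrics over the fold columns. A sketch, reusing the binomial model from before:

CV4 <- cross_validate(data,
                      models = "diagnosis ~ score",
                      fold_cols = paste0(".folds_", 1:4),
                      family = "binomial")

Results per fold column: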
Fold Column   Balanced Accuracy   F1          Sensitivity   Specificity   Pos Pred Value   Neg Pred Value   AUC
.folds_1      0.7361111           0.8205128   0.8888889     0.5833333     0.7619048        0.7777778        0.7685185
.folds_2      0.7361111           0.8205128   0.8888889     0.5833333     0.7619048        0.7777778        0.7777778
.folds_3      0.7083333           0.7894737   0.8333333     0.5833333     0.7500000        0.7000000        0.7476852
.folds_4      0.7361111           0.8205128   0.8888889     0.5833333     0.7619048        0.7777778        0.7662037

Cross-validating custom model functions

cross_validate_fn() works with regression (gaussian), binary classification (binomial), and multiclass classification (multinomial).

SVM

Let’s cross-validate a support-vector machine using the svm() function from the e1071 package. First, we will create a model function. You can do anything you want in it, as long as it takes the arguments train_data and formula and returns the fitted model object.

For the svm() function, the default predict function and settings within cross_validate_fn() work, so we don’t have to specify a predict function. In many cases, it’s safer to supply a predict function anyway, so you are sure everything is correct. We will see how in the naive Bayes example below, but first, let’s cross-validate the model function. Note that some of the arguments have changed names compared to cross_validate() (models -> formulas, family -> type).
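A sketch of the model function and the cross-validation call; the kernel and type settings for svm() are assumptions, and any settings that return a fitted model object will do:

svm_model_fn <- function(train_data, formula){
  # Fit a support-vector machine on the training fold
  e1071::svm(formula = formula,
             data = train_data,
             kernel = "linear",
             type = "C-classification")
}

CV5 <- cross_validate_fn(data,
                         model_fn = svm_model_fn,
                         formulas = "diagnosis ~ score",
                         fold_cols = ".folds",
                         type = "binomial")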

Naive Bayes

The naive Bayes classifier requires us to supply a predict function, so we will go through that next. First, let’s create the model function.
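A sketch, mirroring the svm model function above:

nb_model_fn <- function(train_data, formula){
  # Fit a naive Bayes classifier on the training fold
  e1071::naiveBayes(formula = formula,
                    data = train_data)
}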

Now, we will create a predict function. This will usually wrap stats::predict() and make sure the predictions have the correct format. When type is binomial, the predictions should be a vector, or a one-column matrix or data frame, with the probabilities of the second class (alphabetically). That is, if we have the classes 0 and 1, it should contain the probabilities of the observations being in class 1. The help file, ?cross_validate_fn, describes the formats for the other types (gaussian and multinomial).

The predict function should take the arguments test_data, model, and formula. You do not need to use the formula within your function.
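A sketch of the predict function; type = "raw" makes predict() return a matrix of class probabilities, and the second column holds the probabilities of the second class, as required:

nb_predict_fn <- function(test_data, model, formula){
  # Probabilities of the second class (here: class 1)
  stats::predict(object = model,
                 newdata = test_data,
                 type = "raw")[, 2]
}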

With both functions specified, we are ready to cross-validate our naive Bayes classifier.
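A sketch of the call:

CV6 <- cross_validate_fn(data,
                         model_fn = nb_model_fn,
                         predict_fn = nb_predict_fn,
                         formulas = "diagnosis ~ score",
                         fold_cols = ".folds",
                         type = "binomial")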

Evaluating predictions

With evaluate(), we can evaluate predictions from a model trained outside cvms. It works with regression (gaussian), binary classification (binomial), and multiclass classification (multinomial). The following is an example of multinomial evaluation.

Multinomial

Create a dataset with 3 predictors and a target column. Partition it with groupdata2::partition() to create a training set and a validation set. multiclass_probability_tibble() is a simple helper function for generating random tibbles.
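A sketch of this step; the seed, sizes, and number of classes are arbitrary choices. multiclass_probability_tibble() is used here to generate the three predictor columns (apply_softmax = FALSE keeps them as plain random values), while the target column is sampled separately:

set.seed(1)
class_names <- paste0("class_", 1:4)

# Random dataset: 3 predictor columns plus a sampled target column
multiclass_data <- multiclass_probability_tibble(
    num_classes = 3,  # used as the number of predictors here
    num_observations = 100,
    apply_softmax = FALSE,
    FUN = rnorm,
    class_name = "predictor_") %>%
  mutate(class = sample(class_names, size = 100, replace = TRUE))

# 75% training set, 25% validation set, balanced on class
partitions <- partition(multiclass_data, p = 0.75, cat_col = "class")
multiclass_train_set <- partitions[[1]]
multiclass_test_set <- partitions[[2]]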

Train multinomial model using the nnet package and get the predicted probabilities.
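A sketch, assuming the nnet package is installed:

mn_model <- nnet::multinom(
  class ~ predictor_1 + predictor_2 + predictor_3,
  data = multiclass_train_set)

# Predicted probabilities (one column per class) plus the targets
predictions <- predict(mn_model, multiclass_test_set,
                       type = "probs") %>%
  as_tibble() %>%
  mutate(target = multiclass_test_set[["class"]])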

Perform the evaluation. This will create one-vs-all binomial evaluations and summarize the results.
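A sketch of the call:

ev <- evaluate(data = predictions,
               target_col = "target",
               prediction_cols = class_names,
               type = "multinomial")
ev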

The class level results (i.e., the one-vs-all evaluations) are also included, and would usually be reported alongside the above results.
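Assuming the name of the nested column, they can be pulled out directly:

ev$`Class Level Results`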

Baseline evaluations

Create baseline evaluations of a test set.

Gaussian

Approach: The baseline model (y ~ 1), where 1 is simply the intercept (i.e., the mean of y), is fitted on n random subsets of the training set and evaluated on the test set. We also perform an evaluation of the model fitted on the entire training set.

Start by partitioning the dataset.
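A sketch; the partition size is a choice, while cat_col and id_col keep the split balanced on diagnosis and prevent a participant from appearing in both sets:

set.seed(1)
partitions <- partition(participant.scores, p = 0.7,
                        cat_col = "diagnosis",
                        id_col = "participant")
train_set <- partitions[[1]]
test_set <- partitions[[2]]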

Create the baseline evaluations:

baseline(test_data = test_set, train_data = train_set,
         n = 100, dependent_col = "score", family = "gaussian")
#> $summarized_metrics
#> # A tibble: 9 x 9
#>   Measure   RMSE    MAE   r2m   r2c   AIC  AICc   BIC `Training Rows`
#>   <chr>    <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>           <dbl>
#> 1 Mean     19.7  15.8       0     0  87.0  89.5  87.4            9.63
#> 2 Median   19.2  15.5       0     0  83.3  85.3  83.7            9   
#> 3 SD        1.05  0.759     0     0  28.9  27.6  29.6            3.22
#> 4 IQR       1.16  0.264     0     0  45.9  44.3  47.0            5   
#> 5 Max      24.1  19.4       0     0 137.  138.  138.            15   
#> 6 Min      18.9  15.5       0     0  42.0  48.0  41.2            5   
#> 7 NAs       0     0         0     0   0     0     0              0   
#> 8 INFs      0     0         0     0   0     0     0              0   
#> 9 All_rows 19.1  15.5       0     0 161.  162.  163.            18   
#> 
#> $random_evaluations
#> # A tibble: 100 x 13
#>     RMSE   MAE   r2m   r2c   AIC  AICc   BIC Predictions Coefficients
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list>      <list>      
#>  1  20.0  16.3     0     0  72.5  74.9  72.7 <tibble [1… <tibble [1 …
#>  2  19.0  15.5     0     0 137.  138.  138.  <tibble [1… <tibble [1 …
#>  3  20.2  15.7     0     0  61.3  64.3  61.2 <tibble [1… <tibble [1 …
#>  4  20.0  15.7     0     0  97.7  99.2  98.5 <tibble [1… <tibble [1 …
#>  5  19.3  15.6     0     0  73.3  75.7  73.5 <tibble [1… <tibble [1 …
#>  6  20.4  15.9     0     0  44.4  50.4  43.6 <tibble [1… <tibble [1 …
#>  7  19.0  15.5     0     0 118.  120.  119.  <tibble [1… <tibble [1 …
#>  8  19.4  15.5     0     0  93.3  95.1  94.0 <tibble [1… <tibble [1 …
#>  9  20.7  16.2     0     0  71.2  73.6  71.3 <tibble [1… <tibble [1 …
#> 10  20.8  17.1     0     0  43.7  49.7  42.9 <tibble [1… <tibble [1 …
#> # … with 90 more rows, and 4 more variables: `Training Rows` <int>,
#> #   Family <chr>, Dependent <chr>, Fixed <chr>

Binomial

Approach: n random sets of predictions are evaluated against the dependent variable in the test set. We also evaluate a set of all 0s and a set of all 1s.

Create the baseline evaluations:

baseline(test_data = test_set, n = 100, 
         dependent_col = "diagnosis", family = "binomial")
#> $summarized_metrics
#> # A tibble: 10 x 15
#>    Measure `Balanced Accur…     F1 Sensitivity Specificity `Pos Pred Value`
#>    <chr>              <dbl>  <dbl>       <dbl>       <dbl>            <dbl>
#>  1 Mean               0.502  0.495       0.478       0.525            0.498
#>  2 Median             0.5    0.5         0.5         0.5              0.500
#>  3 SD                 0.147  0.159       0.215       0.210            0.194
#>  4 IQR                0.167  0.252       0.333       0.333            0.200
#>  5 Max                0.833  0.833       0.833       1                1    
#>  6 Min                0.167  0.182       0           0                0    
#>  7 NAs                0      4           0           0                0    
#>  8 INFs               0      0           0           0                0    
#>  9 All_0              0.5   NA           0           1              NaN    
#> 10 All_1              0.5    0.667       1           0                0.5  
#> # … with 9 more variables: `Neg Pred Value` <dbl>, AUC <dbl>, `Lower
#> #   CI` <dbl>, `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>, `Detection
#> #   Rate` <dbl>, `Detection Prevalence` <dbl>, Prevalence <dbl>
#> 
#> $random_evaluations
#> # A tibble: 100 x 19
#>    `Balanced Accur…    F1 Sensitivity Specificity `Pos Pred Value`
#>               <dbl> <dbl>       <dbl>       <dbl>            <dbl>
#>  1            0.417 0.364       0.333       0.5              0.4  
#>  2            0.5   0.5         0.5         0.5              0.5  
#>  3            0.417 0.364       0.333       0.5              0.4  
#>  4            0.667 0.6         0.5         0.833            0.75 
#>  5            0.583 0.667       0.833       0.333            0.556
#>  6            0.667 0.6         0.5         0.833            0.75 
#>  7            0.25  0.308       0.333       0.167            0.286
#>  8            0.5   0.4         0.333       0.667            0.500
#>  9            0.25  0.182       0.167       0.333            0.20 
#> 10            0.417 0.222       0.167       0.667            0.333
#> # … with 90 more rows, and 14 more variables: `Neg Pred Value` <dbl>,
#> #   AUC <dbl>, `Lower CI` <dbl>, `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>,
#> #   `Detection Rate` <dbl>, `Detection Prevalence` <dbl>,
#> #   Prevalence <dbl>, Predictions <list>, ROC <list>, `Confusion
#> #   Matrix` <list>, Family <chr>, Dependent <chr>

Multinomial

Approach: Creates one-vs-all (binomial) baseline evaluations for n sets of random predictions against the dependent variable, along with sets of “all class x,y,z,…” predictions.

Create the baseline evaluations:

multiclass_baseline <- baseline(
  test_data = multiclass_test_set, n = 100,
  dependent_col = "class", family = "multinomial")

# Summarized metrics
multiclass_baseline$summarized_metrics
#> # A tibble: 12 x 16
#>    Measure `Overall Accura… `Balanced Accur…      F1 Sensitivity
#>    <chr>              <dbl>            <dbl>   <dbl>       <dbl>
#>  1 Mean              0.250            0.501   0.283       0.252 
#>  2 Median            0.231            0.494   0.280       0.243 
#>  3 SD                0.0841           0.0567  0.0737      0.0853
#>  4 IQR               0.115            0.0795  0.0920      0.121 
#>  5 Max               0.538            0.786   0.667       1     
#>  6 Min               0.0769           0.262   0.111       0     
#>  7 NAs              NA                0      61           0     
#>  8 INFs             NA                0       0           0     
#>  9 All_cl…           0.192            0.5    NA           0.25  
#> 10 All_cl…           0.269            0.5    NA           0.25  
#> 11 All_cl…           0.269            0.5    NA           0.25  
#> 12 All_cl…           0.269            0.5    NA           0.25  
#> # … with 11 more variables: Specificity <dbl>, `Pos Pred Value` <dbl>,
#> #   `Neg Pred Value` <dbl>, AUC <dbl>, `Lower CI` <dbl>, `Upper CI` <dbl>,
#> #   Kappa <dbl>, MCC <dbl>, `Detection Rate` <dbl>, `Detection
#> #   Prevalence` <dbl>, Prevalence <dbl>

# Summarized class level results for class 1
multiclass_baseline$summarized_class_level_results %>% 
  dplyr::filter(Class == "class_1") %>%
  tidyr::unnest(Results)
#> # A tibble: 10 x 16
#>    Class Measure `Balanced Accur…     F1 Sensitivity Specificity
#>    <chr> <chr>              <dbl>  <dbl>       <dbl>       <dbl>
#>  1 clas… Mean               0.514  0.284       0.28       0.748 
#>  2 clas… Median             0.529  0.286       0.2        0.762 
#>  3 clas… SD                 0.102  0.106       0.191      0.0979
#>  4 clas… IQR                0.124  0.182       0.2        0.0952
#>  5 clas… Max                0.786  0.526       1          0.952 
#>  6 clas… Min                0.262  0.125       0          0.524 
#>  7 clas… NAs                0     18           0          0     
#>  8 clas… INFs               0      0           0          0     
#>  9 clas… All_0              0.5   NA           0          1     
#> 10 clas… All_1              0.5    0.323       1          0     
#> # … with 10 more variables: `Pos Pred Value` <dbl>, `Neg Pred
#> #   Value` <dbl>, AUC <dbl>, `Lower CI` <dbl>, `Upper CI` <dbl>,
#> #   Kappa <dbl>, MCC <dbl>, `Detection Rate` <dbl>, `Detection
#> #   Prevalence` <dbl>, Prevalence <dbl>

# Random evaluations
# Note that the class level results for each repetition
# are available as well
multiclass_baseline$random_evaluations
#> # A tibble: 100 x 21
#>    Repetition `Overall Accura… `Balanced Accur…      F1 Sensitivity
#>         <dbl>            <dbl>            <dbl>   <dbl>       <dbl>
#>  1          1            0.154            0.445 NaN           0.171
#>  2          2            0.269            0.518 NaN           0.279
#>  3          3            0.192            0.460   0.195       0.193
#>  4          4            0.385            0.591   0.380       0.386
#>  5          5            0.154            0.430 NaN           0.143
#>  6          6            0.154            0.438 NaN           0.157
#>  7          7            0.154            0.445 NaN           0.171
#>  8          8            0.346            0.574   0.341       0.364
#>  9          9            0.308            0.541   0.315       0.314
#> 10         10            0.308            0.536   0.322       0.3  
#> # … with 90 more rows, and 16 more variables: Specificity <dbl>, `Pos Pred
#> #   Value` <dbl>, `Neg Pred Value` <dbl>, AUC <dbl>, `Lower CI` <dbl>,
#> #   `Upper CI` <dbl>, Kappa <dbl>, MCC <dbl>, `Detection Rate` <dbl>,
#> #   `Detection Prevalence` <dbl>, Prevalence <dbl>, Predictions <list>,
#> #   `Confusion Matrix` <list>, `Class Level Results` <list>, Family <chr>,
#> #   Dependent <chr>

Plot results

There is currently a small set of plots for quick visualization of the results. The output is designed so that it is easy to extract the information needed to create your own plots. If you lack access to any information, or have other requests or ideas, feel free to open an issue.

Generate model formulas

Instead of manually typing all possible model formulas for a set of fixed effects (including the possible interactions), combine_predictors() can do it for you (with some constraints).
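A minimal sketch:

# All allowed formulas with the three fixed effects,
# with and without interactions; returns a character
# vector of model formulas
combine_predictors(dependent = "score",
                   fixed_effects = c("diagnosis", "age", "session"))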

When including interactions, more than 200,000 formulas have been precomputed for up to 8 fixed effects, with a maximum interaction size of 3 and a maximum of 5 fixed effects per formula. It is possible to further limit the generated formulas, as shown in the sketch at the end of this section.

We can also append a random effects structure to the generated formulas, as shown below.

If two or more fixed effects should not be in the same formula, like an effect and its log-transformed version, we can provide them as sublists.
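A sketch of these options; max_interaction_size is the limiting argument assumed here, log_age is a hypothetical log-transformed version of age, and the random effects string is appended verbatim:

# Limit the interactions and append a random effects structure
combine_predictors(dependent = "score",
                   fixed_effects = c("diagnosis", "age", "session"),
                   max_interaction_size = 2,
                   random_effects = "(1|participant)")

# Sublists: age and its (hypothetical) log-transformed version
# never appear in the same formula
combine_predictors(dependent = "score",
                   fixed_effects = list("diagnosis",
                                        list("age", "log_age")),
                   random_effects = "(1|participant)")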