Get model predictions and plot them with ggplot2

Stefano Coretta

2019-06-17

While plot_smooths() offers a streamlined way of plotting predicted smooths from a GAM (see vignette("plot-smooths", package = "tidymv")), it is too constrained for more complex cases.

The most general solution is to get the predicted values of the outcome variable according to all the combinations of terms in the model and use this dataframe for plotting. This method grants the user maximum control over what can be plotted and how to transform the data (if necessary).

I will illustrate how to use the function predict_gam() to create a prediction dataframe and how this dataframe can be used for plotting different cases.

Let’s load the necessary packages.

library(ggplot2)
theme_set(theme_bw())
library(dplyr)
library(mgcv)
library(tidymv)

Smooths

First of all, let’s generate some simulated data and fit a GAM with a factor "by" variable (fac).

library(mgcv)
set.seed(10)
data <- gamSim(4, 400)
#> Factor `by' variable example

model <- gam(
  y ~
    fac +
    s(x2, by = fac),
  data = data
)

summary(model)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> y ~ fac + s(x2, by = fac)
#> 
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   1.3427     0.1564   8.585 2.34e-16 ***
#> fac2         -1.8593     0.2280  -8.154 5.14e-15 ***
#> fac3          1.9861     0.2313   8.588 2.29e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>              edf Ref.df      F p-value    
#> s(x2):fac1 6.647  7.765  2.748 0.00574 ** 
#> s(x2):fac2 2.096  2.612 65.024 < 2e-16 ***
#> s(x2):fac3 7.203  8.218 31.293 < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.627   Deviance explained = 64.4%
#> GCV = 3.6623  Scale est. = 3.4888    n = 400

We can extract the predicted values with predict_gam(). The predicted values of the outcome variable are in the column fit, while se.fit reports the standard error of the predicted values.

model_p <- predict_gam(model)
model_p
#> # A tibble: 150 x 4
#>    fac        x2     fit se.fit
#>    <fct>   <dbl>   <dbl>  <dbl>
#>  1 1     0.00134 -0.0223  0.669
#>  2 2     0.00134 -3.38    0.472
#>  3 3     0.00134 -1.05    0.973
#>  4 1     0.0217   0.425   0.524
#>  5 2     0.0217  -3.29    0.434
#>  6 3     0.0217  -0.123   0.732
#>  7 1     0.0421   0.859   0.425
#>  8 2     0.0421  -3.20    0.398
#>  9 3     0.0421   0.812   0.551
#> 10 1     0.0625   1.25    0.388
#> # … with 140 more rows

Now plotting can be done with ggplot2. The convenience function geom_smooth_ci() can be used to plot the predicted smooths with confidence intervals.

model_p %>%
  ggplot(aes(x2, fit)) +
  geom_smooth_ci(fac)
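The interval drawn around each smooth can also be computed explicitly from fit and se.fit. The sketch below assumes a normal approximation (fit ± z × se.fit, with z ≈ 1.96 for a 95% interval), which may not match the settings used by geom_smooth_ci(). add_ci() and example_p are hypothetical names, and the three rows are copied from the printed output of model_p above.

```r
# Hypothetical helper: add lower and upper interval bounds
# to a prediction dataframe with columns fit and se.fit
add_ci <- function(pred, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)  # about 1.96 when level = 0.95
  pred$lower <- pred$fit - z * pred$se.fit
  pred$upper <- pred$fit + z * pred$se.fit
  pred
}

# First three rows of model_p, as printed in the text
example_p <- data.frame(
  fac = factor(c(1, 2, 3)),
  x2 = 0.00134,
  fit = c(-0.0223, -3.38, -1.05),
  se.fit = c(0.669, 0.472, 0.973)
)

add_ci(example_p)
```

The resulting lower and upper columns can then be mapped to ymin and ymax in ggplot2's geom_ribbon() if full control over the interval is needed.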

Surface smooths

Now let’s plot a model that has a tensor product interaction term (ti()).

model_2 <- gam(
  y ~
    s(x2) +
    s(f1) +
    ti(x2, f1),
  data = data
)

summary(model_2)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> y ~ s(x2) + s(f1) + ti(x2, f1)
#> 
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)    1.315      0.151   8.711   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>            edf Ref.df     F p-value   
#> s(x2)     1.00  1.000 0.002 0.96890   
#> s(f1)     1.00  1.000 4.462 0.03528 * 
#> ti(x2,f1) 1.99  2.394 4.422 0.00615 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.0372   Deviance explained = 4.68%
#> GCV = 9.1241  Scale est. = 9.0103    n = 400

Let’s get the prediction dataframe and produce a contour plot. We can adjust labels and aesthetics using the usual ggplot2 methods.

model_2_p <- predict_gam(model_2)
model_2_p
#> # A tibble: 2,500 x 4
#>         x2      f1    fit se.fit
#>      <dbl>   <dbl>  <dbl>  <dbl>
#>  1 0.00134 0.00125 -1.21   0.656
#>  2 0.0217  0.00125 -1.13   0.636
#>  3 0.0421  0.00125 -1.05   0.616
#>  4 0.0625  0.00125 -0.975  0.597
#>  5 0.0829  0.00125 -0.898  0.578
#>  6 0.103   0.00125 -0.821  0.559
#>  7 0.124   0.00125 -0.744  0.540
#>  8 0.144   0.00125 -0.667  0.522
#>  9 0.164   0.00125 -0.590  0.504
#> 10 0.185   0.00125 -0.513  0.487
#> # … with 2,490 more rows

model_2_p %>%
  ggplot(aes(x2, f1, z = fit)) +
  geom_raster(aes(fill = fit)) +
  geom_contour(colour = "white") +
  scale_fill_continuous(name = "y") +
  theme_minimal() +
  theme(legend.position = "top")

Smooths at specified values of a continuous predictor

To plot the smooths across a few values of a continuous predictor, we can use the values argument in predict_gam().

predict_gam(model_2, values = list(f1 = c(0.5, 1, 1.5))) %>%
  ggplot(aes(x2, fit)) +
  geom_smooth_ci(f1)
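Conceptually, fixing a continuous predictor at a few values amounts to replacing its full grid with those values only. The following sketch illustrates this with mgcv alone. It is an illustration of the idea and not how predict_gam() is implemented; toy_data, toy_model_2 and grid are assumed names, and 50 grid points for x2 is an assumption.

```r
library(mgcv)

# Simulate the data and refit the tensor product interaction model
set.seed(10)
toy_data <- gamSim(4, 400, verbose = FALSE)
toy_model_2 <- gam(y ~ s(x2) + s(f1) + ti(x2, f1), data = toy_data)

# x2 varies over a 50-point grid; f1 is held at three chosen values
grid <- expand.grid(
  x2 = seq(min(toy_data$x2), max(toy_data$x2), length.out = 50),
  f1 = c(0.5, 1, 1.5)
)

pred <- predict(toy_model_2, newdata = grid, se.fit = TRUE)
grid$fit <- as.numeric(pred$fit)
grid$se.fit <- as.numeric(pred$se.fit)

# Number of predicted points for each value of f1
table(grid$f1)
```

Each value of f1 then yields one predicted curve over x2, which is what the plot above shows as separate smooths.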

Exclude terms (like random effects)

It is possible to exclude terms when predicting values by means of the exclude_terms argument. This can be useful when there are random effects, like in the following model.

data_re <- data %>%
  mutate(rand = rep(letters[1:4], each = 100), rand = as.factor(rand))

model_3 <- gam(
  y ~
    s(x2) +
    s(x2, rand, bs = "fs", m = 1),
  data = data_re
)
#> Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has
#> repeated 1-d smooths of same variable.

summary(model_3)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> y ~ s(x2) + s(x2, rand, bs = "fs", m = 1)
#> 
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)   1.2804     0.4151   3.085  0.00219 **
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>              edf Ref.df     F p-value
#> s(x2)       4.20  4.941 1.611   0.178
#> s(x2,rand) 12.22 35.000 0.481   0.109
#> 
#> R-sq.(adj) =  0.0621   Deviance explained = 10.1%
#> GCV = 9.1764  Scale est. = 8.7768    n = 400

exclude_terms takes a character vector of term names, as they appear in the output of summary() (rather than as they are specified in the model formula). For example, to remove the term s(x2, rand, bs = "fs", m = 1), "s(x2,rand)" should be used, since this is how the summary output reports this term. The output still contains the columns of the variables in the excluded terms. The predicted values of the outcome variable are not affected by the value of the excluded terms (the predicted values are repeated for each value of the excluded terms). In other words, the coefficients for the excluded terms are set to 0 when predicting. We can filter the predicted dataset to get unique predicted values by choosing any value or level of the excluded terms. (Alternatively, we can use slice(): group_by(a) %>% slice(1). See ?slice.)

predict_gam(model_3, exclude_terms = "s(x2,rand)") %>%
  filter(rand == "a") %>%
  ggplot(aes(x2, fit)) +
  geom_smooth_ci()
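The claim that predictions no longer depend on the excluded term can be checked directly. The sketch below uses the exclude argument of mgcv's predict() method for gam objects, which sets the named smooth terms to zero. Whether predict_gam() relies on the same mechanism internally is an assumption, and toy_data, toy_model_3 and grid are illustrative names.

```r
library(mgcv)

# Simulate the data and add a four-level grouping factor, as in the text
set.seed(10)
toy_data <- gamSim(4, 400, verbose = FALSE)
toy_data$rand <- factor(rep(letters[1:4], each = 100))

# Fitting emits a warning about repeated smooths of x2, as in the text
toy_model_3 <- gam(
  y ~ s(x2) + s(x2, rand, bs = "fs", m = 1),
  data = toy_data
)

# The grid must still contain rand, even though its term is excluded
grid <- expand.grid(
  x2 = seq(min(toy_data$x2), max(toy_data$x2), length.out = 50),
  rand = levels(toy_data$rand)
)

# The term name matches the label printed by summary()
grid$fit <- as.numeric(
  predict(toy_model_3, newdata = grid, exclude = "s(x2,rand)")
)

# Compare the predicted curves for two levels of rand
fit_by_level <- split(grid$fit, grid$rand)
all.equal(fit_by_level$a, fit_by_level$b)
```

Because the curves coincide across levels, keeping a single level of rand loses no information.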

To speed up the calculation of the predictions when excluding terms, it is helpful to select a single value for the unnecessary terms using the values argument, rather than filtering with filter(). As with filter(), any value of the excluded variable can be used. If the value is NULL, the first value/level of the term is automatically selected (in the example below, values = list(rand = NULL) and values = list(rand = "a") would be equivalent).

predict_gam(model_3, exclude_terms = "s(x2,rand)", values = list(rand = NULL)) %>%
  ggplot(aes(x2, fit)) +
  geom_smooth_ci()

Of course, it is possible to plot the predicted values of random effects if we wish to do so. In the following example, the random effect rand is not excluded when predicting, and it is used to facet the plot.

predict_gam(model_3) %>%
  ggplot(aes(x2, fit)) +
  geom_smooth_ci() +
  facet_wrap(~rand)