Note that this vignette is adapted from the arXiv paper


Suppose we fit a predictive model on a training set and predict on a test set. Dataset shift (Quionero-Candela et al. 2009; Moreno-Torres et al. 2012; Kelly, Hand, and Adams 1999), also known as data or population drift, occurs when training and test distributions are not alike. This is essentially a sample mismatch problem. Some regions of the data space are either too sparse or absent during training and gain importance at test time. We want methods to alert us to the presence of unexpected inputs in the test set (Rabanser, Günnemann, and Lipton 2019). To do so, a measure of divergence between training and test set is required. Can we not simply use the many modern off-the-shelf multivariate tests of equal distributions for this?

One reason for moving beyond tests of equal distributions is that they are often too strict. They require high fidelity between training and test set everywhere in the input domain. However, not all changes in distribution are a cause for concern – some changes are benign. Practitioners distrust these tests because of false alarms. Polyzotis et al. (2019) comment:

statistical tests for detecting changes in the data distribution […] are too sensitive and also uninformative for the typical scale of data in machine learning pipelines, which led us to seek alternative methods to quantify changes between data distributions.

Even when the difference is small or negligible, tests of equal distributions reject the null hypothesis of no difference. An alarm should only be raised