Advanced Usage of the Model-Free Knockoff Filter

2017-09-28

The function MFKnockoffs.filter is a wrapper around several simpler functions that

1. Construct knockoff variables (various functions with prefix MFKnockoffs.create)
2. Compute the test statistic $$W$$ (various functions with prefix MFKnockoffs.stat)
3. Compute the threshold for variable selection (MFKnockoffs.threshold)

These functions may be called directly if desired. The purpose of this vignette is to illustrate the flexibility of this package with some examples.

set.seed(1234)
library(MFKnockoffs)

Creating an artificial problem

Let us begin by creating some synthetic data.

# Problem parameters
n = 1000         # number of observations
p = 1000         # number of variables
k = 60           # number of variables with nonzero coefficients
amplitude = 7.5  # signal amplitude (for noise level = 1)

# Generate the variables from a multivariate normal distribution
mu = rep(0,p); Sigma = diag(p)
X = matrix(rnorm(n*p),n)

# Generate the response from a logistic model and encode it as a factor.
nonzero = sample(p, k)
beta = amplitude * (1:p %in% nonzero) / sqrt(n)
invlogit = function(x) exp(x) / (1+exp(x))
y.sample = function(x) rbinom(n, prob=invlogit(x %*% beta), size=1)
y = factor(y.sample(X), levels=c(0,1), labels=c("A","B"))

Looking inside the knockoff filter

Instead of using MFKnockoffs.filter directly, we can run the filter manually by calling its main components one by one.

The first step is to generate the knockoff variables for the true Gaussian distribution of the variables.

X_k = MFKnockoffs.create.gaussian(X, mu, Sigma)

Then, we compute the knockoff statistics using 10-fold cross-validated lasso

W = MFKnockoffs.stat.glmnet_coef_difference(X, X_k, y, nfolds=10, family="binomial")

Now we can compute the rejection threshold

thres = MFKnockoffs.threshold(W, q=0.15, method='knockoff+')

The final step is to select the variables

selected = which(W >= thres)
print(selected)
##     3  10  61  76 108 148 172 173 182 210 248 273 297 329 334 378 426
##  428 443 471 494 510 557 563 595 596 602 631 648 668 708 736 787 814
##  843 844 931 953 959 965

The false discovery proportion is

fdp = function(selected) sum(beta[selected] == 0) / max(1, length(selected))
fdp(selected)
##  0.375

Performing numerical simulations

We show how to manually run the knockoff filter multiple times and compute average quantities. This is particularly useful to estimate the FDR (or the power) for a particular configuration of the knockoff filter on artificial problems.

# Optimize the parameters needed for generating Gaussian knockoffs,
# by solving as SDP to minimize correlations with the original variables.
# This calculation requires only the model parameters mu and Sigma,
# not the observed variables X. Therefore, there is no reason to perform it
# more than once for our simulation.

diag_s = MFKnockoffs.knocks.solve_sdp(Sigma)

# Compute the fdp over 20 iterations
nIterations = 20
fdp_list = sapply(1:nIterations, function(it) {
# Run the knockoff filter manually, using the pre-computed value of diag_s
X_k = MFKnockoffs.create.gaussian(X, mu, Sigma, diag_s=diag_s)
W = MFKnockoffs.stat.glmnet_lambda_signed_max(X, X_k, y, family="binomial")
t = MFKnockoffs.threshold(W, q=0.15, method='knockoff+')
selected = which(W >= t)
# Compute and store the fdp
fdp(selected)
})
# Estimate the FDR
mean(fdp_list)
##  0.05531649