The **wsrf** package is a parallel implementation of the Weighted Subspace Random Forest algorithm (wsrf) of Xu et al. (2012). A novel variable weighting method is used for variable subspace selection in place of the traditional approach of random variable sampling. This new approach is particularly useful in building models for high dimensional data — often consisting of thousands of variables. Parallel computation is used to take advantage of multi-core machines and clusters of machines to build random forest models from high dimensional data with reduced elapsed times.

Currently, **wsrf** requires R (>= 3.3.0) and **Rcpp** (>= 0.10.2) (Eddelbuettel and François 2011; Eddelbuettel 2013). Multi-threading additionally requires a C++ compiler that supports C++11 threads. To install the latest stable version of the package, run the following from within R:

`install.packages("wsrf")`

or the latest development version:

`devtools::install_github("simonyansenzhao/wsrf")`

Versions of R before 3.3.0 do not fully support C++11, so earlier releases of **wsrf** provided alternative installation options. Support for those options was dropped in version 1.6.0. If needed, their usage is described in the documentation of previous versions.
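The version requirements listed above can be checked from within R before installing. The following is a minimal sketch using only base R functions (`getRversion()`, `requireNamespace()`, `packageVersion()`); the variable names are illustrative.

```r
# Minimum versions as stated in the requirements above.
r_ok <- getRversion() >= "3.3.0"

# Rcpp may not be installed yet, so test for it before asking for its version.
rcpp_ok <- requireNamespace("Rcpp", quietly = TRUE) &&
  packageVersion("Rcpp") >= "0.10.2"

cat("R version sufficient:   ", r_ok, "\n")
cat("Rcpp version sufficient:", rcpp_ok, "\n")
```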

This section demonstrates how to use **wsrf**, especially on a cluster of machines.

The example uses the small dataset *iris* that ships with R. See its help page (`?iris`) for more details. Its basic properties are shown below.

```
ds <- iris
dim(ds)
```

`## [1] 150 5`

`names(ds)`

`## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"`

Before building the model we need to prepare the training dataset. First we specify the target variable.

```
target <- "Species"
vars <- names(ds)
```

Next we deal with missing values, using `na.roughfix()` from **randomForest** to take care of them.

```
library("randomForest")
if (sum(is.na(ds[vars]))) ds[vars] <- na.roughfix(ds[vars])
ds[target] <- as.factor(ds[[target]])
(tt <- table(ds[target]))
```

```
## Species
## setosa versicolor virginica
## 50 50 50
```

We construct the formula that describes the model which will predict the target based on all other variables.

`(form <- as.formula(paste(target, "~ .")))`

`## Species ~ .`
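The dot on the right-hand side stands for every column other than the response. A short sketch of how to inspect that expansion with base R's `terms()` follows; the name `predictors` is illustrative.

```r
ds <- iris
form <- as.formula(paste("Species", "~ ."))

# terms() expands the dot against the columns of `data`, leaving out the response.
predictors <- attr(terms(form, data = ds), "term.labels")
print(predictors)
```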

Finally we create the randomly selected training and test datasets, setting a seed so that the results can be exactly replicated.

```
seed <- 42
set.seed(seed)
length(train <- sample(nrow(ds), 0.7*nrow(ds)))
```

`## [1] 105`

`length(test <- setdiff(seq_len(nrow(ds)), train))`

`## [1] 45`
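As a sanity check on the split, the sketch below rebuilds the two index vectors and confirms that they are disjoint and together cover every row. It repeats the sampling calls from above so that it runs on its own; the final `table()` call is an optional look at class balance in the training set.

```r
ds <- iris
set.seed(42)
train <- sample(nrow(ds), 0.7 * nrow(ds))
test <- setdiff(seq_len(nrow(ds)), train)

# No row may appear in both sets, and no row may be left out.
stopifnot(length(intersect(train, test)) == 0)
stopifnot(setequal(c(train, test), seq_len(nrow(ds))))

# Class counts in the training set (random sampling does not guarantee balance).
print(table(ds$Species[train]))
```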

The function to build a weighted random forest model in **wsrf** is:

`wsrf(formula, data, ...)`

and

```
wsrf(x,
     y,
     mtry=floor(log2(length(x))+1),
     ntree=500,
     weights=TRUE,
     parallel=TRUE,
     na.action=na.fail,
     importance=FALSE,
     nodesize=2,
     clusterlogfile,
     ...)
```
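The default for `mtry` in the signature above can be evaluated by hand. The sketch below assumes that `x` is the data frame of predictors (target column excluded), so that `length(x)` counts columns; the helper `default_mtry()` is hypothetical and only mirrors the expression in the signature.

```r
# Hypothetical helper mirroring the default expression floor(log2(length(x)) + 1).
default_mtry <- function(n_vars) floor(log2(n_vars) + 1)

# iris has four predictors once Species is removed.
x <- iris[, setdiff(names(iris), "Species")]
mtry_iris <- default_mtry(length(x))

# The subspace grows only logarithmically with the number of variables.
mtry_1000 <- default_mtry(1000)

print(c(iris = mtry_iris, thousand_vars = mtry_1000))
```

For *iris* this gives 3, which agrees with the "No. of variables tried at each split" value reported when the model is printed.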

We use the training dataset to build a random forest model. Parameters other than `formula` and `data` have default values: `500` for `ntree` (the number of trees); `TRUE` for `weights` (build a weighted subspace random forest rather than a plain random forest); `TRUE` for `parallel` (use multi-threading or other options); and so on. The call below overrides only `parallel`, setting it to `FALSE`.

```
library("wsrf")
model.wsrf.1 <- wsrf(form, data=ds[train, vars], parallel=FALSE)
print(model.wsrf.1)
```

```
## A Weighted Subspace Random Forest model with 500 trees.
##
## No. of variables tried at each split: 3
## Minimum size of terminal nodes: 2
## Out-of-Bag Error Rate: 0.08
## Strength: 0.84
## Correlation: 0.10
##
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 37 1 0 0.03
## versicolor 0 33 2 0.06
## virginica 0 5 27 0.16
```

`print(model.wsrf.1, 1) # Print tree 1.`

```
## Tree 1 has 4 tests (internal nodes), with OOB error rate 0.1000:
##
## 1) Petal.Width <= 0.5 [setosa] (1 0 0) *
## 1) Petal.Width > 0.5
## .. 2) Petal.Width <= 1.6
## .. .. 3) Petal.Length <= 4.9 [versicolor] (0 1 0) *
## .. .. 3) Petal.Length > 4.9 [versicolor] (0 0.5 0.5) *
## .. 2) Petal.Width > 1.6
## .. .. 4) Petal.Length <= 4.8 [versicolor] (0 1 0) *
## .. .. 4) Petal.Length > 4.8 [virginica] (0 0 1) *
```

Then, `predict` the classes of the test data.

```
cl <- predict(model.wsrf.1, newdata=ds[test, vars], type="class")$class
actual <- ds[test, target]
(accuracy.wsrf <- mean(cl == actual, na.rm=TRUE))
```

`## [1] 0.9555556`

Thus, we have built a model that is around 96% accurate on unseen testing data.
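Overall accuracy hides which classes are confused with each other. The following sketch shows how a confusion matrix and per-class error rates can be derived from predicted and actual labels with base R. The label vectors here are small hypothetical stand-ins so that the block runs without a fitted model; in the walkthrough they would be `cl` and `actual`.

```r
# Hypothetical labels for illustration only.
lv <- c("setosa", "versicolor", "virginica")
actual <- factor(c("setosa", "setosa", "versicolor",
                   "versicolor", "virginica", "virginica"), levels = lv)
cl <- factor(c("setosa", "setosa", "versicolor",
               "virginica", "virginica", "virginica"), levels = lv)

# Rows are true classes, columns are predicted classes.
cm <- table(actual = actual, predicted = cl)

# Accuracy is the share of cases on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class error: share of each true class that was misclassified.
class_error <- 1 - diag(cm) / rowSums(cm)

print(cm)
print(accuracy)
print(class_error)
```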

Using a different random seed, we obtain another model.

```
set.seed(seed+1)
# Here we build another model without weighting.
model.wsrf.2 <- wsrf(form, data=ds[train, vars], weights=FALSE, parallel=FALSE)
print(model.wsrf.2)
```

```
## A Weighted Subspace Random Forest model with 500 trees.
##
## No. of variables tried at each split: 3
## Minimum size of terminal nodes: 2
## Out-of-Bag Error Rate: 0.07
## Strength: 0.85
## Correlation: 0.08
##
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 38 0 0 0.00
## versicolor 0 33 2 0.06
## virginica 0 5 27 0.16
```

We can also derive a subset of the forest from the model or a combination of multiple forests.

```
submodel.wsrf <- subset.wsrf(model.wsrf.1, 1:150)
print(submodel.wsrf)
```

```
## A Weighted Subspace Random Forest model with 150 trees.
##
## No. of variables tried at each split: 3
## Minimum size of terminal nodes: 2
## Out-of-Bag Error Rate: 0.09
## Strength: 0.84
## Correlation: 0.10
##
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 36 2 0 0.05
## versicolor 0 33 2 0.06
## virginica 0 5 27 0.16
```

```
bigmodel.wsrf <- combine.wsrf(model.wsrf.1, model.wsrf.2)
print(bigmodel.wsrf)
```

```
## A Weighted Subspace Random Forest model with 1000 trees.
##
## No. of variables tried at each split: 3
## Minimum size of terminal nodes: 2
## Out-of-Bag Error Rate: 0.08
## Strength: 0.84
## Correlation: 0.08
##
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 37 1 0 0.03
## versicolor 0 33 2 0.06
## virginica 0 5 27 0.16
```

Next, we build the model on a cluster of servers.

```
servers <- paste0("node", 31:40)
model.wsrf.3 <- wsrf(form, data=ds[train, vars], parallel=servers)
```

All we need is a character vector specifying the hostnames of the nodes to use, or a named integer vector whose values give the number of threads to use for model building, in other words, how many trees are built simultaneously. More detailed descriptions of **wsrf** are presented in the manual.
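The named integer form can be built with base R's `setNames()`, as sketched below. The sketch assumes that the vector's names are the hostnames; the thread count of 2 per node is an arbitrary illustrative choice. The `wsrf()` call is left commented out because it needs reachable cluster nodes and the objects prepared earlier.

```r
# Hostnames as in the example above.
hosts <- paste0("node", 31:40)

# Assumed layout: names are hostnames, values are threads per host (2 is illustrative).
servers <- setNames(rep(2L, length(hosts)), hosts)
print(servers)

# Requires a working cluster plus `form`, `ds`, `train` and `vars` from above:
# model.wsrf.3 <- wsrf(form, data=ds[train, vars], parallel=servers)
```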

Eddelbuettel, Dirk. 2013. *Seamless R and C++ Integration with Rcpp*. New York: Springer.

Eddelbuettel, Dirk, and Romain François. 2011. “Rcpp: Seamless R and C++ Integration.” *Journal of Statistical Software* 40 (8): 1–18. https://doi.org/10.18637/jss.v040.i08.

Xu, Baoxun, Joshua Zhexue Huang, Graham Williams, Qiang Wang, and Yunming Ye. 2012. “Classifying Very High-Dimensional Data with Random Forests Built from Small Subspaces.” *International Journal of Data Warehousing and Mining (IJDWM)* 8 (2): 44–63.