Overview of loose.rock

André Veríssimo

2018-08-31

loose rock

Set of Functions to Use in Survival Analysis and in Data Science

A collection of functions to improve workflows in survival analysis and data science. Among its many features are the generation of balanced datasets, the live retrieval of protein-coding genes from two public databases, and the generation of random matrices based on a covariance matrix.

This work has been mainly supported by two grants: FCT SFRH/BD/97415/2013 and the EU Commission under the SOUND project, contract number 633974.

Install

The only prerequisite is to install the biomaRt Bioconductor package, as it cannot be installed automatically from CRAN.

All other dependencies should be installed when running the install command.
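The installation steps above might look like the following sketch. It assumes that biomaRt is installed through the BiocManager package and that loose.rock itself comes from CRAN; neither command appears in the text, so both are left commented out and should be treated as assumptions. The live part of the snippet only reports which of the two packages are still missing.

```r
# Packages discussed above; biomaRt is hosted on Bioconductor rather than CRAN.
needed <- c("biomaRt", "loose.rock")

# Check which ones can already be loaded, without installing anything.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install <- needed[!installed]

message(
  "Missing packages: ",
  if (length(to_install) > 0) paste(to_install, collapse = ", ") else "none"
)

# Assumed install commands, to be run interactively:
# install.packages("BiocManager")
# BiocManager::install("biomaRt")   # the Bioconductor prerequisite
# install.packages("loose.rock")    # assumed to pull the remaining dependencies
```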

Overview

Libraries required for this vignette

Get a current list of protein coding genes

Showing only a random sample of 15

ensembl_gene_id external_gene_name
ENSG00000130338 TULP4
ENSG00000066422 ZBTB11
ENSG00000267710 EDDM13
ENSG00000120952 PRAMEF2
ENSG00000172818 OVOL1
ENSG00000167968 DNASE1L2
ENSG00000272297 AC018709.1
ENSG00000174469 CNTNAP2
ENSG00000162654 GBP4
ENSG00000215641 TRIM27
ENSG00000134461 ANKRD16
ENSG00000277885 KIR2DS2
ENSG00000162461 SLC25A34
ENSG00000198746 GPATCH3
ENSG00000284695 AC108941.2

Balanced test/train dataset

This is especially relevant for survival or binary outcomes where one category has few cases, as those cases need to be well distributed among the test/train datasets or across cross-validation folds.

The example below assigns 90% of the data to the training set. As the samples are already divided into two sets (set1 and set2), it performs the 90% split on each set separately and then joins the results (with the option join.all = T).
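As a rough illustration of the per-set split described above, here is a minimal base R sketch. It is not the package's own implementation: the helper split_one and the contents of set1 and set2 are hypothetical, chosen so that one category is much rarer than the other.

```r
set.seed(1985)

set1 <- 1:20     # hypothetical indices of the rare category
set2 <- 21:120   # hypothetical indices of the common category

# Sample a fixed proportion of one set for training; the remainder is test.
split_one <- function(ix, train.perc) {
  n.train <- round(length(ix) * train.perc)
  train <- ix[sample.int(length(ix), n.train)]
  list(train = train, test = setdiff(ix, train))
}

# Split each set on its own so both keep the same 90/10 proportion ...
parts <- lapply(list(set1, set2), split_one, train.perc = 0.9)

# ... then join the per-set results into a single train and a single test set.
train <- sort(unlist(lapply(parts, `[[`, "train")))
test  <- sort(unlist(lapply(parts, `[[`, "test")))
```

Splitting each set separately guarantees that the rare category contributes 18 of its 20 samples to training and 2 to testing, which a single split over all 120 samples would not ensure.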

Generate synthetic matrix with covariance

#> Using .2^|i-j| to generate co-variance matrix
#> X generated
#>            X1         X2         X3         X4           X5
#> 1   0.1944384  0.7614053  1.5732604  1.0969908  0.516623623
#> 2  -0.2088722  0.3299644  0.6771205  0.9606580 -1.693842188
#> 3   0.5282128  0.3816847 -0.9694817  0.2070983 -0.145626644
#> 4   0.6035690 -0.6545657 -0.3682524 -2.0477890 -1.138906864
#> 5   0.1594990 -0.2468407 -1.6687048  0.2038405  1.113790270
#> 6   0.3644722 -1.8633723 -0.3614212  1.1336067 -0.091376129
#> 7   1.7395142  1.5068401  0.4833638 -0.5649658 -0.007276351
#> 8  -2.0126534  0.4070373 -0.7969888 -0.6071440 -0.762008597
#> 9  -0.4556821  0.5761596  0.4132342  0.3332174  1.476525617
#> 10 -0.9124980 -1.1983127  1.0178702 -0.7155129  0.732097263
#> cov(X)
#>       X1    X2   X3    X4     X5
#> 1 1.0000 0.200 0.04 0.008 0.0016
#> 2 0.2000 1.000 0.20 0.040 0.0080
#> 3 0.0400 0.200 1.00 0.200 0.0400
#> 4 0.0080 0.040 0.20 1.000 0.2000
#> 5 0.0016 0.008 0.04 0.200 1.0000

#> Using .75^|i-j| to generate co-variance matrix (plotting correlation)
#> X generated
#>            X1         X2          X3          X4          X5
#> 1   0.4208010  0.9286722  1.45831161  1.67371415  1.17461275
#> 2  -1.1038121 -1.2636549 -0.07741947  0.41873089  1.06764561
#> 3   1.7356855  0.4526740 -0.37879089 -0.02327963  0.30411766
#> 4  -1.0318495 -0.6538382 -1.18185378 -0.08377025 -0.08410934
#> 5   0.1145984 -0.2168866 -0.21616338 -0.26522464 -1.74803648
#> 6   1.0900261  1.1550673  1.12052081 -0.12683280  0.60808321
#> 7  -0.2929580  1.1762793  0.21909445 -0.13175354 -0.07821635
#> 8   0.8276833  0.7144643  1.11215892  1.37228251  0.75499461
#> 9  -0.9345230 -1.1943438 -0.46819589 -1.49889929 -1.43608946
#> 10 -0.8256519 -1.0984336 -1.58766238 -1.33496740 -0.56300222
#> cov(X)
#>          X1       X2     X3       X4        X5
#> 1 1.0000000 0.750000 0.5625 0.421875 0.3164062
#> 2 0.7500000 1.000000 0.7500 0.562500 0.4218750
#> 3 0.5625000 0.750000 1.0000 0.750000 0.5625000
#> 4 0.4218750 0.562500 0.7500 1.000000 0.7500000
#> 5 0.3164062 0.421875 0.5625 0.750000 1.0000000
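
The covariance structure named in the output above, rho^|i-j|, can be reproduced in base R as sketched below. This is only an assumption about how such data could be generated, using a Cholesky factor to correlate standard normal draws; the package may proceed differently, and the variable names are illustrative.

```r
set.seed(1985)

n.obs  <- 10    # rows of X
n.vars <- 5     # columns of X
rho    <- 0.2   # base of the rho^|i-j| covariance

# Target covariance: sigma[i, j] = rho^|i - j|
idx   <- seq_len(n.vars)
sigma <- rho^abs(outer(idx, idx, "-"))

# chol() returns the upper triangular R with t(R) %*% R = sigma, so rows of
# z %*% R have covariance sigma when rows of z are independent standard normals.
z <- matrix(rnorm(n.obs * n.vars), nrow = n.obs)
x <- z %*% chol(sigma)
colnames(x) <- paste0("X", idx)

print(round(sigma, 4))
```

With only 10 observations, the sample covariance cov(x) will only approximate the target matrix sigma.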

Save in cache

Uses a cache to save and retrieve results. The cache is automatically created from the arguments and the source code of the function, so that if any of these change, the cache is regenerated.

Caution: Files are not deleted, so the cache directory can become rather big.
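
The idea of a cache derived from the arguments and the function source can be sketched in base R as follows. This is a simplified illustration, not the package's implementation: cached_call, cache_key and cache.dir are hypothetical names, and the MD5 of the deparsed function and arguments stands in for whatever digest the package actually uses.

```r
# Hypothetical cache location inside the session's temporary directory.
cache.dir <- file.path(tempdir(), "run-cache-sketch")
dir.create(cache.dir, showWarnings = FALSE, recursive = TRUE)

# Build a key from the function source and its arguments, so that a change
# in either one leads to a different cache file.
cache_key <- function(fun, args) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(c(deparse(fun), deparse(args)), tmp)
  unname(tools::md5sum(tmp))
}

# Return the stored result when the key is known, otherwise compute and store.
cached_call <- function(fun, ...) {
  args <- list(...)
  path <- file.path(cache.dir, paste0(cache_key(fun, args), ".rds"))
  if (file.exists(path)) {
    message("Loading from cache")
    return(readRDS(path))
  }
  result <- do.call(fun, args)
  saveRDS(result, path)
  result
}

first  <- cached_call(sum, 1, 2)   # computed (or loaded if already on disk)
second <- cached_call(sum, 1, 2)   # read back from the cache file
```

As in the caution above, this sketch never removes files from cache.dir, so the directory grows with every new combination of function and arguments.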

Set a temporary directory to save all caches (optional)

Run the sum function twice

Run the rnorm function with an explicit seed (otherwise the cached call would return the same random numbers)

Proper

One such helper is the proper function, which capitalizes a string.
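
A base R sketch of this kind of capitalization is shown below. It assumes that "capitalizes" means lower-casing the string and then upper-casing the first letter of each word; title_case is a hypothetical stand-in and may not match the behaviour of the package's proper function exactly.

```r
# Lower-case everything, then upper-case the first letter of each word.
# \\b marks a word boundary and \\U (perl = TRUE) upper-cases the captured letter.
title_case <- function(x) {
  gsub("\\b([a-z])", "\\U\\1", tolower(x), perl = TRUE)
}

title_case("i saw a dEaD parrot")
```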

Custom colors and symbols

my.colors() and my.symbols() can be used to improve plot readability.