# Getting started with wrMisc

## Introduction

This package contains a collection of various (low-level) tools which may be of general interest. These functions were accumulated over a number of years of data-wrangling when treating high-throughput data from biomedical applications. Besides, these functions are further used/integrated in more specialized functions dedicated to specific applications in the packages wrProteo, wrGraph or wrTopDownFrag. All these packages are available on CRAN.

If you are not familiar with R you may find many introductory documents on the official R-site in contributed documents or under Documentation/Manuals. Of course, numerous other documents/sites with tutorials and courses exist, too.

### Dependencies and Compilation

One of the aims was to write a robust package easy to install (in respect to low system requirements).
All code is written in pure R and does not need any special compilers. The number of obligatory dependencies was kept to a minumum. This helps to reduce potential problems at installation with dependencies which fail to get installed themselves.

Most of additional packages used here at some point were declared as ‘suggested’ (ie not obligatory), to allow installation of wrMisc even if some these additional packages can’t be installed/compiled by the user’s instance. This means that, when a feature of one of the ‘suggested’ packages is about to be used, its presence/installation will be checked and, only if found as missing, the user will be propmted a message inviting to install specific package(s) before using these specific functions.

To get started, we need to install (if not yet installed) and load the package “wrMisc” available from CRAN.

## If not already installed, you'll have to install the package first.
## This is the basic installation commande in R
install.packages("wrMisc")

Since the functions in this vignette require a few more optional packages, let’s check if they are installed and add them (via a small function), if not yet installed.

packages <- c("knitr", "rmarkdown", "BiocManager", "kableExtra", "boot", "data.tree", "data.table", "fdrtool",
"RColorBrewer", "Rcpp", "wrMisc", "wrGraph", "wrProteo")
checkInstallPkg <- function(pkg) {       # install function
if(!requireNamespace(pkg, quietly=TRUE)) install.packages(pkg) }

## install if not yet present
sapply(packages, checkInstallPkg)

Finally, this package also uses the Bioconductor package limma which has to be installed differently :

## Installation of limma
BiocManager::install("limma")

The vignettes (like this one) are also accessible from R command-line:

## Now you can open this vignette out of R:
vignette("wrMiscVignette1", package="wrMisc")

Before using, we actually need to load the package first :

library("wrMisc")
library("knitr")

## This is 'wrMisc' version number :
packageVersion("wrMisc")
## [1] '1.7.0'

## Speed Optimized Functions in the Package wrMisc

In high-throughput experiments in biology (like transcriptomics, proteomics etc…) many different features get measured a number if times (different samples like patients or evolution of a disease). The resulting data typically contain many (independent) rows (eg >1000 different genes or proteins who’s abundance was measured) and much fewer columns that may get further organized in groups of replicates. As R is a versatile language, multiple options exist for assessing the global characteristics of such data, some are more efficient on a computational point of view. In order to allow fast treatment of very large data-sets some tools have been re-designed for optimal performance.

### Assessing basic information about variability (for matrix)

Many mesurement techniques applied in high throughput manner suffer from precision. This means, the same measurements taken twice in a row (ie repeated on the same subject) will very likely not give an identical result. For this reason it is common practice to make replicate measurements to i) estimate mean (ie representative) values and ii) asses the factors contributing to the variablity observed. Briefly, technical replicates represent the case where multiple read-outs of the very same sample are generated and the resulting variability is associated to technical issues during the process of taking measures. Biological replicates represent independant samples and reflect therefore the varibility a given parameter may have in a certain population of individuals. With the tools presented here, both technical and biological replicates can be dealt with. In several cases the interpretation of the resulting numbers should consider the experimental setup, though.

Let’s make a simple matrix as toy data:

grp1 <- rep(LETTERS[1:3], c(3,4,3))
sampNa1 <- paste0(grp1, c(1:3,1:4,1:3))
set.seed(2016); dat1 <- matrix(round(c(runif(50000) +rep(1:1000,50)),3),
ncol=10, dimnames=list(NULL,sampNa1))
dim(dat1)
## [1] 5000   10
head(dat1)
##         A1    A2    A3    B1    B2    B3    B4    C1    C2    C3
## [1,] 1.180 1.640 1.199 1.118 1.425 1.745 1.253 1.554 1.303 1.856
## [2,] 2.143 2.237 2.730 2.693 2.603 2.293 2.542 2.452 2.148 2.776
## [3,] 3.842 3.155 3.191 3.520 3.686 3.408 3.409 3.871 3.345 3.588
## [4,] 4.134 4.394 4.982 4.320 4.380 4.888 4.965 4.462 4.250 4.647
## [5,] 5.478 5.472 5.488 5.570 5.626 5.765 5.551 5.016 5.659 5.139
## [6,] 6.121 6.294 6.718 6.890 6.999 6.316 6.542 6.119 6.763 6.487

Now lets estimate the standard deviation (sd) for every row:

head(rowSds(dat1))
## [1] 0.2583693 0.2426026 0.2477899 0.3089102 0.2307722 0.3124493
system.time(sd1 <- rowSds(dat1))
##    user  system elapsed
##       0       0       0
system.time(sd2 <- apply(dat1,1,sd))
##    user  system elapsed
##    0.10    0.00    0.09

On most systems the equivalent calculation using apply() will run much slower.

Note, there is a minor issue with rounding :

table(round(sd1,13)==round(sd2,13))
##
## FALSE  TRUE
##     1  4999

Similarly we can easily calculate the CV (coefficient of variance) for every row using rowCVs :

system.time(cv1 <- rowCVs(dat1))
##    user  system elapsed
##    0.02    0.00    0.02
system.time(cv2 <- apply(dat1,1,sd)/rowMeans(dat1))
##    user  system elapsed
##    0.06    0.00    0.06
# typically the calculation using rowCVs is much faster
head(cv1)
## [1] 0.18101959 0.09855083 0.07076678 0.06800894 0.04213940 0.04788568
# results from the 'conventional' way
head(cv2)
## [1] 0.18101959 0.09855083 0.07076678 0.06800894 0.04213940 0.04788568

Note, these calculations will be very efficient as long as the number of rows is much higher (>>) than the number of columns.

### Data organized in (sub-)groups as sets of columns

Now, let’s assume our data is contains 3 initial samples measured as several replicates (already defined in grp1). Similarly, we can also calculate the sd or CV for each line while splitting into groups of replicates (functions rowGrpMeans, rowGrpSds and rowGrpCV):

# we already defined the grouping :
grp1
##  [1] "A" "A" "A" "B" "B" "B" "B" "C" "C" "C"
## the mean for each group and row
system.time(mean1Gr <- rowGrpMeans(dat1, grp1))
##    user  system elapsed
##       0       0       0
## Now the sd for each row and group
system.time(sd1Gr <- rowGrpSds(dat1, grp1))
##    user  system elapsed
##       0       0       0
# will give us a matrix with the sd for each group & line
head(sd1Gr)
##                A          B         C
## [1,] 0.260269732 0.27074758 0.2768917
## [2,] 0.315291928 0.17144557 0.3140531
## [3,] 0.386666523 0.13115989 0.2632534
## [4,] 0.434443706 0.33521970 0.1986530
## [5,] 0.008082904 0.09672297 0.3413156
## [6,] 0.307168249 0.31475851 0.3230934
# Let's check the results of the first line :
sd1Gr[1,] == c(sd(dat1[1,1:3]), sd(dat1[1,4:7]), sd(dat1[1,8:10]))
##    A    B    C
## TRUE TRUE TRUE
# The CV :
system.time(cv1Gr <- rowGrpCV(dat1, grp1))
##    user  system elapsed
##       0       0       0
head(cv1Gr)
##                A          B          C
## [1,] 0.194279471 0.19545033 0.17625186
## [2,] 0.133034569 0.06769147 0.12773308
## [3,] 0.113859400 0.03741279 0.07309886
## [4,] 0.096471585 0.07227288 0.04461104
## [5,] 0.001475162 0.01718603 0.06474939
## [6,] 0.048163108 0.04707197 0.05004286

#### Counting number of NAs per row and group of columns

Some data, like with quantitative proteomics measures, may contain an elevated number of NAs (see also the package wrProteo for further options for dealing with such data). Similar as above there is an easy way to count the number of NAs to get an overview how NAs are distributed.

Let’s assume we have measures from 3 groups/samples with 4 replicates each :

mat2 <- c(22.2, 22.5, 22.2, 22.2, 21.5, 22.0, 22.1, 21.7, 21.5, 22, 22.2, 22.7,
NA, NA, NA, NA, NA, NA, NA, 21.2,   NA, NA, NA, NA,
NA, 22.6, 23.2, 23.2,  22.4, 22.8, 22.8, NA,  23.3, 23.2, NA, 23.7,
NA, 23.0, 23.1, 23.0,  23.2, 23.2, NA, 23.3,  NA, NA, 23.3, 23.8)
mat2 <- matrix(mat2, ncol=12, byrow=TRUE)
## The definition of the groups (ie replicates)
gr4 <- gl(3, 4, labels=LETTERS[1:3])

Now we can easily count the number of NAs per row and set of replicates.

rowGrpNA(mat2,gr4)
##      A B C
## [1,] 0 0 0
## [2,] 4 3 4
## [3,] 1 1 1
## [4,] 1 1 2

### Fast NA-omit for very large objects

The function na.omit() from the package stats also keeps a trace of all omitted instances. This can be penalizing in terms of memory usage when handling very large vectors with a high content of NAs (eg >10000 NAs). If you don’t need to document precisely which elements got eliminated, the function naOmit() may offer smoother functioning for very large objects.

aA <- c(11:13,NA,10,NA)

str(naOmit(aA))
##  num [1:4] 11 12 13 10
# the 'classical' na.omit also stores which elements were NA
str(na.omit(aA))
##  num [1:4] 11 12 13 10
##  - attr(*, "na.action")= 'omit' int [1:2] 4 6

### Minimum distance/difference between values

If you need to find the closest neighbour(s) of a numeric vector, the function minDiff() will tell you the distance (“dif”,“ppm” or “ratio”) and index (“best”) of the closest neighbour. In case of multiple shortest distances the index if the first one is reported, and the column “nbest” will display a value of >1.

set.seed(2017); aa <- 10*c(0.1 +round(runif(20),2), 0.53, 0.53)
head(aa)
## [1] 10.2  6.4  5.7  3.9  8.7  8.7
minDiff(aa,ppm=FALSE)
##       index value  dif   rat ncur nbest best
##  [1,]     1  10.2 -0.2 0.981    1     1   19
##  [2,]     2   6.4  0.4 1.070    1     1   15
##  [3,]     3   5.7  0.3 0.950    2     1   15
##  [4,]     4   3.9  0.2 1.050    1     1   10
##  [5,]     5   8.7  0.5 1.060    2     1   18
##  [6,]     6   8.7  0.5 1.060    2     1   18
##  [7,]     7   1.4  0.1 1.080    1     1   13
##  [8,]     8   5.3  0.3 1.060    4     1   17
##  [9,]     9   5.7  0.3 0.950    2     1   15
## [10,]    10   3.7 -0.2 0.949    1     1    4
## [11,]    11   7.7 -0.5 0.939    1     1   18
## [12,]    12   1.0 -0.3 0.769    1     1   13
## [13,]    13   1.3 -0.1 0.929    1     1    7
## [14,]    14   5.3  0.3 1.060    4     1   17
## [15,]    15   6.0  0.3 1.050    1     2    9
## [16,]    16   4.9 -0.1 0.980    1     1   17
## [17,]    17   5.0  0.1 1.020    1     1   16
## [18,]    18   8.2  0.5 1.060    1     1   11
## [19,]    19  10.4  0.2 1.020    1     1    1
## [20,]    20   9.3  0.6 1.070    1     2    6
## [21,]    21   5.3  0.3 1.060    4     1   17
## [22,]    22   5.3  0.3 1.060    4     1   17

When you look at the first line, the value of 10.2 has one single closest value which is 10.4, which is located in line number 19 (the column ‘best’ gives the index of the best). Line number 19 points back to line number 1. You can see, that some elements (like 5.7) occure multiple times (line no 3 and 9), multiple occurences are counted in the column ncur. This is why column nbest for line 15 (value =6.0) indicates that it appears twice as closest value nbest.

## Working With Lists (and lists of lists)

### Partial unlist

When input from different places gets collected and combined into a list, this may give a collection of different types of data. The function partUnlist() will to preserve multi-column elements as they are (and just bring down one level):

bb <- list(fa=gl(2,2), ve=31:33, L2=matrix(21:28,ncol=2), li=list(li1=11:14,li2=data.frame(41:44)))
partUnlist(bb)
## $fa ## [1] 1 1 2 2 ## Levels: 1 2 ## ##$ve
## [1] 31 32 33
##
## $L2 ## [,1] [,2] ## [1,] 21 25 ## [2,] 22 26 ## [3,] 23 27 ## [4,] 24 28 ## ##$li
## [1] 11 12 13 14
##
## $li ## X41.44 ## 1 41 ## 2 42 ## 3 43 ## 4 44 partUnlist(lapply(bb,.asDF2)) ##$fa
##   as.character(z)
## 1               1
## 2               1
## 3               2
## 4               2
##
## $ve ## V1 ## 1 31 ## 2 32 ## 3 33 ## ##$L2
##   V1 V2
## 1 21 25
## 2 22 26
## 3 23 27
## 4 24 28
##
## $li ## V1 ## li1 11, 12, 13, 14 ## li2 41, 42, 43, 44 This won’t be possible using unlist(). head(unlist(bb, recursive=FALSE)) ##$fa1
## [1] 1
##
## $fa2 ## [1] 1 ## ##$fa3
## [1] 2
##
## $fa4 ## [1] 2 ## ##$ve1
## [1] 31
##
## $ve2 ## [1] 32 To uniform such data to obtain a list with one column only for each list-element, the function asSepList() provides help : bb <- list(fa=gl(2,2), ve=31:33, L2=matrix(21:28,ncol=2), li=list(li1=11:14,li2=data.frame(41:44))) asSepList(bb) ##$fa
## [1] 1 1 2 2
##
## [[2]]
## [1] 11 12 13 14
##
## [[3]]
## [1] NA
##
## $L2_L21 ## [1] 21 22 23 24 ## ##$L2_L22
## [1] 25 26 27 28

### rbind on lists

When a matrix (or data.frame) gets split into a list, like in the example using by(), as a reverse-function such lists can get joined using lrbind in an rbind-like fashion.

dat2 <- matrix(11:34, ncol=3, dimnames=list(letters[1:8],colnames=LETTERS[1:3]))
lst2 <- by(dat2, rep(1:3,c(3,2,3)), as.matrix)
lst2
## INDICES: 1
##    A  B  C
## a 11 19 27
## b 12 20 28
## c 13 21 29
## ------------------------------------------------------------
## INDICES: 2
##    A  B  C
## d 14 22 30
## e 15 23 31
## ------------------------------------------------------------
## INDICES: 3
##    A  B  C
## f 16 24 32
## g 17 25 33
## h 18 26 34
# join list-elements (back) into single matrix
lrbind(lst2)
##    A  B  C
## a 11 19 27
## b 12 20 28
## c 13 21 29
## d 14 22 30
## e 15 23 31
## f 16 24 32
## g 17 25 33
## h 18 26 34

### Fuse content of list-elements with redundant (duplicated) names

When list-elements have the same name, their content (of named numeric or character vectors) may get fused using fuseCommonListElem() according to the names of the list-elements :

val1 <- 10 +1:26
names(val1) <- letters
(lst1 <- list(c=val1[3:6], a=val1[1:3], b=val1[2:3] ,a=val1[12], c=val1[13]))
## $c ## c d e f ## 13 14 15 16 ## ##$a
##  a  b  c
## 11 12 13
##
## $b ## b c ## 12 13 ## ##$a
##  l
## 22
##
## $c ## m ## 23 ## here the names 'a' and 'c' appear twice : names(lst1) ## [1] "c" "a" "b" "a" "c" ## now, let's fuse all 'a' and 'c' fuseCommonListElem(lst1) ##$c
##  c  d  e  f  m
## 13 14 15 16 23
##
## $a ## a b c l ## 11 12 13 22 ## ##$b
##  b  c
## 12 13

### Filtering lines and/or columns for all list-elements of same size

In a numerb of cases the information stared in various list-elements is somehow related. Eg, in S3-objects produced by limma, or data produced using wrProteo several instances of matrix or data.frame are related. Some matrixes may conatain abundance data while a matrix or data.frame may contain the annotation realated to each line of the abundance data. So one wants to filter, ie remove some lines in from one of these elements, one should do the same with the other list-elements to maintain conventient 1:1 matching of lines.

The function filterLiColDeList() searches if other list-elements have suitable dimensions and will then run the same filtering as in the ‘target’ list-element. In consequence this can be used with the output of wrProteo to remove simultaneously the same lines and/or columns.

lst1 <- list(m1=matrix(11:18,ncol=2), m2=matrix(21:30,ncol=2), indR=31:34, m3=matrix(c(21:23,NA,25:27,NA),ncol=2))
filterLiColDeList(lst1, useLines=2:3)
##  -> filterLiColDeList : successfully filtered 'm1' and 'm3' from 4 to 2 lines
## $m1 ## [,1] [,2] ## [1,] 12 16 ## [2,] 13 17 ## ##$m2
##      [,1] [,2]
## [1,]   21   26
## [2,]   22   27
## [3,]   23   28
## [4,]   24   29
## [5,]   25   30
##
## $indR ## [1] 31 32 33 34 ## ##$m3
##      [,1] [,2]
## [1,]   22   26
## [2,]   23   27
filterLiColDeList(lst1, useLines="allNA", ref=3)
##  -> filterLiColDeList : It appears lst[[ref]] is not matrix (or data.frame) ! Trying to reformat ..
##  -> filterLiColDeList : 'useLines' seems empty, nothing to do ...
## $m1 ## [,1] [,2] ## [1,] 11 15 ## [2,] 12 16 ## [3,] 13 17 ## [4,] 14 18 ## ##$m2
##      [,1] [,2]
## [1,]   21   26
## [2,]   22   27
## [3,]   23   28
## [4,]   24   29
## [5,]   25   30
##
## $indR ## [,1] ## [1,] 31 ## [2,] 32 ## [3,] 33 ## [4,] 34 ## ##$m3
##      [,1] [,2]
## [1,]   21   25
## [2,]   22   26
## [3,]   23   27
## [4,]   NA   NA

### Replacements in list

The function listBatchReplace() works similar to sub() and allows to search & replace exact matches to a character string along all elements of a list.

(lst1 <- list(aa=1:4, bb=c("abc","efg","abhh","effge") ,cc=c("abdc","efg","efgh")))
## $aa ## [1] 1 2 3 4 ## ##$bb
## [1] "abc"   "efg"   "abhh"  "effge"
##
## $cc ## [1] "abdc" "efg" "efgh" listBatchReplace(lst1,search="efg",repl="EFG",silent=FALSE) ##$aa
## [1] "1" "2" "3" "4"
##
## $bb ## [1] "abc" "EFG" "abhh" "effge" ## ##$cc
## [1] "abdc" "EFG"  "efgh"

### Organize values into list and sort by names

Named numeric or character vectors can be organized into lists using listGroupsByNames(), based on their names (only the part before any extensions starting with a point gets considered). Of course, other separators may be defined using the argument sep.

ser1 <- 1:7; names(ser1) <- c("AA","BB","AA.1","CC","AA.b","BB.e","A")

listGroupsByNames(ser1)
## $AA ## AA AA.1 AA.b ## 1 3 5 ## ##$BB
##   BB BB.e
##    2    6
##
## $CC ## CC ## 4 ## ##$A
## A
## 7

If no names are present, the content of the vector itself will be used as name :

listGroupsByNames((1:10)/5)
##  -> listGroupsByNames :  no names found in 'x' !!
## $0 ## 0 0 0 0 ## 0.2 0.4 0.6 0.8 ## ##$1
##   1   1   1   1   1
## 1.0 1.2 1.4 1.6 1.8
##
## $2 ## 2 ## 2 ### Batch-filter list-elements In the view of object-oriented programming several methods produce results integrated into lists or S3-objects (eg limma). The function filterList() aims facilitating the filtering of all elements of lists or S3-objects. List-elements with inappropriate number of lines will be ignored. set.seed(2020); dat1 <- round(runif(80),2) list1 <- list(m1=matrix(dat1[1:40],ncol=8),m2=matrix(dat1[41:80],ncol=8),other=letters[1:8]) rownames(list1$m1) <- rownames(list1$m2) <- paste0("line",1:5) # Note: the list-element list1$other has a length different to that of filt. Thus, it won't get filtered.
filterList(list1, list1$m1[,1] >0.4) # filter according to 1st column of$m1 ...
## $m1 ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] ## line1 0.65 0.07 0.76 0.54 0.20 0.17 0.96 0.37 ## line3 0.62 0.39 0.83 0.65 0.82 0.75 0.96 0.93 ## line4 0.48 0.00 0.42 0.55 0.94 0.45 0.95 0.52 ## ##$m2
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## line1 0.99 0.57 0.58 0.00 0.21 0.61 0.61 0.30
## line3 0.86 0.70 0.90 0.22 0.23 0.58 0.39 0.06
## line4 0.88 0.80 0.52 0.54 0.42 0.65 0.47 0.67
##
## $other ## [1] "a" "b" "c" "d" "e" "f" "g" "h" filterList(list1, list1$m1 >0.4) 
## $m1 ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] ## line1 0.65 0.07 0.76 0.54 0.20 0.17 0.96 0.37 ## line3 0.62 0.39 0.83 0.65 0.82 0.75 0.96 0.93 ## line4 0.48 0.00 0.42 0.55 0.94 0.45 0.95 0.52 ## line5 0.14 0.62 0.41 0.27 0.88 0.56 0.00 0.22 ## ##$m2
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## line1 0.99 0.57 0.58 0.00 0.21 0.61 0.61 0.30
## line3 0.86 0.70 0.90 0.22 0.23 0.58 0.39 0.06
## line4 0.88 0.80 0.52 0.54 0.42 0.65 0.47 0.67
## line5 0.62 0.17 0.83 0.49 0.86 0.17 0.53 0.72
##
## $other ## [1] "a" "b" "c" "d" "e" "f" "g" "h" ### Transform columns of matrix to list of vectors (mat1 <- matrix(1:12, ncol=3, dimnames=list(letters[1:4],LETTERS[1:3]))) ## A B C ## a 1 5 9 ## b 2 6 10 ## c 3 7 11 ## d 4 8 12 str(matr2list(mat1)) ## List of 3 ##$ A: Named num [1:4] 1 2 3 4
##   ..- attr(*, "names")= chr [1:4] "A.a" "A.b" "A.c" "A.d"
##  $B: Named num [1:4] 5 6 7 8 ## ..- attr(*, "names")= chr [1:4] "B.a" "B.b" "B.c" "B.d" ##$ C: Named num [1:4] 9 10 11 12
##   ..- attr(*, "names")= chr [1:4] "C.a" "C.b" "C.c" "C.d"

## Working with Arrays

Let’s get stared with a little toy-array:

(arr1 <- array(c(6:4,4:24), dim=c(4,3,2), dimnames=list(c(LETTERS[1:4]),
paste("col",1:3,sep=""),c("ch1","ch2"))))
## , , ch1
##
##   col1 col2 col3
## A    6    5    9
## B    5    6   10
## C    4    7   11
## D    4    8   12
##
## , , ch2
##
##   col1 col2 col3
## A   13   17   21
## B   14   18   22
## C   15   19   23
## D   16   20   24

### CV (coefficient of variance) with arrays

Now we can obtain the CV (coefficient of variance) by splitting along 3rd dimesion (ie this is equivalent to an apply along the 3rd dimension) using arrayCV:

arrayCV(arr1)
##         ch1       ch2
## A 0.3122499 0.2352941
## B 0.3779645 0.2222222
## C 0.4788934 0.2105263
## D 0.5000000 0.2000000
# this is equivalent to
cbind(rowCVs(arr1[,,1]), rowCVs(arr1[,,2]))
##        [,1]      [,2]
## A 0.3122499 0.2352941
## B 0.3779645 0.2222222
## C 0.4788934 0.2105263
## D 0.5000000 0.2000000

Similarly we can split along any other dimension, eg the 2nd dimension :

arrayCV(arr1, byDim=2)
##        col1      col2      col3
## A 0.5210260 0.7713892 0.5656854
## B 0.6698906 0.7071068 0.5303301
## C 0.8187552 0.6527140 0.4991342
## D 0.8485281 0.6060915 0.4714045

### Slice 3-dim array in list of matrixes (or arrays)

This procedure is similar to (re-)organizing an initial array into clusters, here we split along a user-defined factor/vector. If a clustering-algorithm produces the cluster assignments, this function can be used to organize the input data accordingly using cutArrayInCluLike.

cutArrayInCluLike(arr1, cluOrg=c(2,1,2,1))
## $2 ## , , ch1 ## ## col1 col2 col3 ## A 6 5 9 ## C 4 7 11 ## ## , , ch2 ## ## col1 col2 col3 ## A 13 17 21 ## C 15 19 23 ## ## ##$1
## , , ch1
##
##   col1 col2 col3
## B    5    6   10
## D    4    8   12
##
## , , ch2
##
##   col1 col2 col3
## B   14   18   22
## D   16   20   24

Let’s cut by filtering along the 3rd dimension for all lines where column ‘col2’ is >7, and then display only the content of columns ‘col1’ and ‘col2’ (using filt3dimArr):

filt3dimArr(arr1,displCrit=c("col1","col2"), filtCrit="col2", filtVal=7, filtTy=">")
## [[1]]
## col1 col2
##    4    8
##
## [[2]]
##   col1 col2
## A   13   17
## B   14   18
## C   15   19
## D   16   20

## Working with Redundant Data

Semantics : Please note, that there are two ways of interpreting the term ‘unique’ :

• In regular understanding one describes this way an event which occurs only once, and thus does not occur/happen anywhere else.

• The command unique() makes a vector with some redundant entries ‘unique’, ie in the resultant vector all values/content (values) occur only once. However, initially repeated values will still be present. Also, you cannot tell any more which ones were not unique initially ! Thus, the result of unique() does not tell which values were ‘unique’ in the first place.

In some applications (eg proteomics) initial identifiers (IDs) may occur multiple times in the data and we frequently need to isolate events/values that occur only once, as the first meaning of ‘unique’. This package provides functions to easily distinguish values occurring just once (ie unique) from those occurring multiple times. Furthermore, there are functions to rename/remove/combine replicated elements.

### Identify What is Repeated (and Where Repeated do Occur)

## some text toy data
tr <- c("li0","n",NA,NA,rep(c("li2","li3"),2),rep("n",4))

The function table() (from the package base) is very useful get some insights when working with smaller objects, but may be slow to handle very large objects. As mentioned, unique() will make everything unique, and afterwards you won’t know any more who was unique in the first place ! The function duplicated() (also from package base) helps us getting the information who is repeated.

table(tr)
## tr
## li0 li2 li3   n
##   1   2   2   5
unique(tr) 
## [1] "li0" "n"   NA    "li2" "li3"
duplicated(tr, fromLast=FALSE)
##  [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
aa <- c(11:16,NA,14:12,NA,14)
names(aa) <- letters[1:length(aa)]
aa
##  a  b  c  d  e  f  g  h  i  j  k  l
## 11 12 13 14 15 16 NA 14 13 12 NA 14

findRepeated() (from this package) will return the position/index (and content/value) of repeated elements. However, the output in form of a list is not very convenient to the human eye.

findRepeated(aa) 
## $12 ## [1] 2 10 ## ##$13
## [1] 3 9
##
## $14 ## [1] 4 8 12 firstOfRepeated() tells the index of the first instance of repeated elements, which elements you need to make the vector ‘unique’, and which elements get stripped off when making unique. Please note, that NA (no matter if they occure once or more times) are automatically in the part suggested to be removed. firstOfRepeated(aa) ##$indRepeated
## 12 13 14
##  2  3  4
##
## $indUniq ## a b c d e f ## 1 2 3 4 5 6 ## ##$indRedund
##  g  h  i  j  k  l
##  7  8  9 10 11 12
##  a  e  f
## 11 15 16
##
##        1
##  [1,]  1
##  [2,]  4
##  [3,]  7
##  [4,] 10
##  [5,]  2
##  [6,]  5
##  [7,]  8
##  [8,] 11
##  [9,]  3
## [10,]  6
## [11,]  9
## [12,] 12
##
## $b ## 2 ## [1,] 24 ## [2,] 21 ## [3,] 18 ## [4,] 15 ## [5,] 23 ## [6,] 20 ## [7,] 17 ## [8,] 14 ## [9,] 22 ## [10,] 19 ## [11,] 16 ## [12,] 13 ### Non-redundant Lines of matrix mat4 <- matrix(rep(c(1,1:3,3,1),2), ncol=2, dimnames=list(letters[1:6],LETTERS[1:2])) nonRedundLines(mat4) ## A B ## a 1 1 ## c 2 2 ## d 3 3 ## f 1 1 ### Filter for Unique Elements /2 # input: c and dd are repeated : filtSizeUniq(list(A="a", B=c("b","bb","c"), D=c("dd","d","ddd","c")), filtUn=TRUE, minSi=NULL) ## -> filtSizeUniq : 2 out of 8 peptides redundant ##$A
##   A
## "a"
##
## $B ## B.1 B.2 ## "b" "bb" ## ##$D
##   D.1   D.2   D.3
##  "dd"   "d" "ddd"
# here a,b,c and dd are repeated  :
filtSizeUniq(list(A="a", B=c("b","bb","c"), D=c("dd","d","ddd","c")), ref=c(letters[c(1:26,1:3)],
"dd","dd","bb","ddd"), filtUn=TRUE, minSi=NULL)   
##  -> filtSizeUniq : 8 out of 8 peptides redundant
## $A ## character(0) ## ##$B
## character(0)
##
## $D ## character(0) ### Make Non-redundant Matrix t3 <- data.frame(ref=rep(11:15,3),tx=letters[1:15], matrix(round(runif(30,-3,2),1),nc=2), stringsAsFactors=FALSE) # First we split the data.frame in list by(t3,t3[,1],function(x) x) ## t3[, 1]: 11 ## ref tx X1 X2 ## 1 11 a 0.4 -1.1 ## 6 11 f 0.6 1.0 ## 11 11 k 0.1 1.2 ## ------------------------------------------------------------ ## t3[, 1]: 12 ## ref tx X1 X2 ## 2 12 b 2.0 -0.4 ## 7 12 g -0.3 1.8 ## 12 12 l -1.4 0.3 ## ------------------------------------------------------------ ## t3[, 1]: 13 ## ref tx X1 X2 ## 3 13 c 0.8 -1.6 ## 8 13 h 0.8 -2.4 ## 13 13 m 0.9 1.8 ## ------------------------------------------------------------ ## t3[, 1]: 14 ## ref tx X1 X2 ## 4 14 d 1.7 0.7 ## 9 14 i -1.6 -2.6 ## 14 14 n 1.4 -1.1 ## ------------------------------------------------------------ ## t3[, 1]: 15 ## ref tx X1 X2 ## 5 15 e -0.9 0.4 ## 10 15 j -1.7 0.0 ## 15 15 o -1.2 -1.8 t(sapply(by(t3,t3[,1],function(x) x), summarizeCols, me="maxAbsOfRef")) ## [,1] [,2] [,3] [,4] ## 11 11 "a" 0.1 1.2 ## 12 12 "b" -0.3 1.8 ## 13 13 "c" 0.8 -2.4 ## 14 14 "d" -1.6 -2.6 ## 15 15 "e" -1.2 -1.8 (xt3 <- makeNRedMatr(t3, summ="mean", iniID="ref")) ## -> makeNRedMatr : Common summarization method 'mean', run as batch ## -> makeNRedMatr : Summarize redundant based on col 'ref' using method(s) : 'mean', 'mean', 'mean' and 'mean' yielding 4 cols ## ID ref tx X1 X2 nRedLi ## 11 11 11 a 0.3666667 0.3666667 3 ## 12 12 12 b 0.1000000 0.5666667 3 ## 13 13 13 c 0.8333333 -0.7333333 3 ## 14 14 14 d 0.5000000 -1.0000000 3 ## 15 15 15 e -1.2666667 -0.4666667 3 (xt3 <- makeNRedMatr(t3, summ=unlist(list(X1="maxAbsOfRef")), iniID="ref")) ## -> makeNRedMatr : Summarize redundant based on col 'ref' using method(s) : 'maxAbsOfRef' and col 'X1' yielding 4 cols ## ref tx X1 X2 nRedLi ## 11 11 "a" 0.6 1 3 ## 12 12 "b" 2 -0.4 3 ## 13 13 "c" 0.9 1.8 3 ## 14 14 "d" 1.7 0.7 3 ## 15 15 "e" -1.7 0 3 ### Combine/reduce Redundant Lines Based on Specified Column matr <- matrix(c(letters[1:6],"h","h","f","e",LETTERS[1:5]), ncol=3, dimnames=list(letters[11:15],c("xA","xB","xC"))) combineRedBasedOnCol(matr, colN="xB") ## xA xB xC ## 2 "a,d" "f" "A,D" ## 3 "b,c" "h" "B,C" ## 1 "e" "e" "E" combineRedBasedOnCol(rbind(matr[1,],matr), colN="xB") ## xA xB xC ## 2 "a,d" "f" "A,D" ## 3 "b,c" "h" "B,C" ## 1 "e" "e" "E" ### Convert Matrix (eg with Redundant) Row-names to data.frame x <- 1 dat1 <- matrix(1:10,ncol=2) rownames(dat1) <- letters[c(1:3,2,5)] ## as.data.frame(dat1) ... would result in an error convMatr2df(dat1) ## ID X1 X2 ## a a 1 6 ## b_1 b 2 7 ## c c 3 8 ## b_2 b 4 9 ## e e 5 10 convMatr2df(data.frame(a=as.character((1:3)/2), b=LETTERS[1:3], c=1:3)) ## ID a b c ## 1 1 0.5 A 1 ## 2 2 1.0 B 2 ## 3 3 1.5 C 3 tmp <- data.frame(a=as.character((1:3)/2), b=LETTERS[1:3],c=1:3, stringsAsFactors=FALSE) convMatr2df(tmp) ## ID a b c ## 1 1 0.5 A 1 ## 2 2 1.0 B 2 ## 3 3 1.5 C 3 tmp <- data.frame(a=as.character((1:3)/2), b=1:3, stringsAsFactors=FALSE) convMatr2df(tmp)  ## ID a b ## 1 1 0.5 1 ## 2 2 1.0 2 ## 3 3 1.5 3 ### Find and Combine Points Located Very Close in x/y Space set.seed(2013) datT2 <- matrix(round(rnorm(200)+3,1), ncol=2, dimnames=list(paste("li",1:100,sep=""), letters[23:24])) # (mimick) some short and longer names for each line inf2 <- cbind(sh=paste(rep(letters[1:4],each=26),rep(letters,4),1:(26*4),sep=""), lo=paste(rep(LETTERS[1:4],each=26), rep(LETTERS,4), 1:(26*4), ",", rep(letters[sample.int(26)],4), rep(letters[sample.int(26)],4), sep=""))[1:100,] ## We'll use this to test : head(datT2,n=10) ## w x ## li1 2.9 3.7 ## li2 3.8 3.3 ## li3 2.3 3.3 ## li4 4.4 3.1 ## li5 4.5 1.8 ## li6 0.4 2.4 ## li7 3.7 3.3 ## li8 3.3 4.0 ## li9 5.0 4.1 ## li10 1.6 1.1 ## let's assign to each pair of x & y values a 'cluster' (column _clu_, the column _combInf_ tells us which lines/indexes are in this cluster) head(combineOverlapInfo(datT2, disThr=0.03), n=10) ## w x combInf clu isComb ## li1 2.9 3.7 1+16+22+47+91 1 TRUE ## li2 3.8 3.3 2+7+48+54 2 TRUE ## li3 2.3 3.3 3+66+92 3 TRUE ## li4 4.4 3.1 4 52 FALSE ## li5 4.5 1.8 5 53 FALSE ## li6 0.4 2.4 6 54 FALSE ## li7 3.7 3.3 2+7+48+54 2 TRUE ## li8 3.3 4.0 8+100 4 TRUE ## li9 5.0 4.1 9 55 FALSE ## li10 1.6 1.1 10 56 FALSE ## it is also possible to rather display names (eg gene or protein-names) instead of index values head(combineOverlapInfo(datT2, suplI=inf2[,2], disThr=0.03), n=10) ## w x combInf clu isComb ## li1 2.9 3.7 AA1+AP16+AV22+BU47+DM91 1 TRUE ## li2 3.8 3.3 AB2+AG7+BV48+CB54 2 TRUE ## li3 2.3 3.3 AC3+CN66+DN92 3 TRUE ## li4 4.4 3.1 AD4,ww 52 FALSE ## li5 4.5 1.8 AE5,aj 53 FALSE ## li6 0.4 2.4 AF6,nl 54 FALSE ## li7 3.7 3.3 AB2+AG7+BV48+CB54 2 TRUE ## li8 3.3 4.0 AH8+DV100 4 TRUE ## li9 5.0 4.1 AI9,ic 55 FALSE ## li10 1.6 1.1 AJ10,ee 56 FALSE ### Bin and Summarize values according to their names dat <- 11:19 names(dat) <- letters[c(6:3,2:4,8,3)] ## Here the names are not unique. ## Thus, the values can be binned by their (non-unique) names and a representative values calculated. ## Let's make a 'datUniq' with the mean of each group of values : datUniq <- round(tapply(dat, names(dat),mean),1) ## now we propagate the mean values to the full vector getValuesByUnique(dat, datUniq) ## f e d c b c d h c ## 11.0 12.0 15.0 16.3 15.0 16.3 15.0 18.0 16.3 cbind(ini=dat,firstOfRep=getValuesByUnique(dat, datUniq), indexUniq=getValuesByUnique(dat, datUniq,asIn=TRUE)) ## ini firstOfRep indexUniq ## f 11 11.0 5 ## e 12 12.0 4 ## d 13 15.0 3 ## c 14 16.3 2 ## b 15 15.0 1 ## c 16 16.3 2 ## d 17 15.0 3 ## h 18 18.0 6 ## c 19 16.3 2 ### Regrouping Simultaneaously by Two Factors For example, if you wish to create group-labels considering the eye- and hair-color of a small group students (supposed a sort of controlled vocabulary was used), the function combineByEitherFactor() will help. So basically, this is an empiric segmentation-approach for two categorical variables. Please note, that with large data-sets and very disperse data this approach will not provide great results. In the example below we’ll attempt to ‘cluster’ according to columns nn and qq, the resultant cluster number can be found in column grp. nn <- rep(c("a","e","b","c","d","g","f"),c(3,1,2,2,1,2,1)) qq <- rep(c("m","n","p","o","q"),c(2,1,1,4,4)) nq <- cbind(nn,qq)[c(4,2,9,11,6,10,7,3,5,1,12,8),] ## Here we consider 2 columns 'nn' and 'qq' whe trying to regroup common values ## (eg value 'a' from column 'nn' and value 'o' from 'qq') combineByEitherFactor(nq,1,2,nBy=FALSE) ## nn qq grp ## m2 "a" "m" "1" ## q2 "f" "q" "3" ## q1 "d" "q" "3" ## o2 "b" "o" "2" ## p "e" "p" "4" ## o4 "c" "o" "2" ## q4 "g" "q" "3" ## n "a" "n" "1" ## m1 "a" "m" "1" ## q3 "g" "q" "3" ## o1 "b" "o" "2" ## o3 "c" "o" "2" The argument nBy simply allows adding an additional column with the group/cluster-number. ## the same, but including n by group/cluster combineByEitherFactor(nq,1,2,nBy=TRUE) ## nn qq grp nGrp ## m2 "a" "m" "1" "3" ## q2 "f" "q" "3" "4" ## q1 "d" "q" "3" "4" ## o2 "b" "o" "2" "4" ## p "e" "p" "4" "1" ## o4 "c" "o" "2" "4" ## q4 "g" "q" "3" "4" ## n "a" "n" "1" "3" ## m1 "a" "m" "1" "3" ## q3 "g" "q" "3" "4" ## o1 "b" "o" "2" "4" ## o3 "c" "o" "2" "4" ## Not running further iterations works faster, but you may not reach 'convergence' immediately combineByEitherFactor(nq,1,2,nBy=FALSE) ## nn qq grp ## m2 "a" "m" "1" ## q2 "f" "q" "3" ## q1 "d" "q" "3" ## o2 "b" "o" "2" ## p "e" "p" "4" ## o4 "c" "o" "2" ## q4 "g" "q" "3" ## n "a" "n" "1" ## m1 "a" "m" "1" ## q3 "g" "q" "3" ## o1 "b" "o" "2" ## o3 "c" "o" "2" ## another example mm <- rep(c("a","b","c","d","e"),c(3,4,2,3,1)) pp <- rep(c("m","n","o","p","q"),c(2,2,2,2,5)) combineByEitherFactor(cbind(mm,pp),1,2, con=FALSE, nBy=TRUE) ## -> combineByEitherFactor : did not reach convergence at 2nd pass ## mm pp grp nGrp ## m1 "a" "m" "1" "4" ## m2 "a" "m" "1" "4" ## n1 "a" "n" "1" "4" ## n2 "b" "n" "1" "4" ## o1 "b" "o" "2" "4" ## o2 "b" "o" "2" "4" ## p1 "b" "p" "2" "4" ## p2 "c" "p" "2" "4" ## q1 "c" "q" "3" "5" ## q2 "d" "q" "3" "5" ## q3 "d" "q" "3" "5" ## q4 "d" "q" "3" "5" ## q5 "e" "q" "3" "5" ### Batch Replacing of Values or Character-strings The function multiCharReplace() facilitates multiple replacements in a vector, matrix or data.frame. # replace character content x1 <- c("ab","bc","cd","efg","ghj") multiCharReplace(x1, cbind(old=c("bc","efg"), new=c("BBCC","EF"))) ## [1] "ab" "BBCC" "cd" "EF" "ghj" # works also on matrix and/or to replace numeric content : x3 <- matrix(11:16, ncol=2) multiCharReplace(x3, cbind(12:13,112:113)) ## [,1] [,2] ## [1,] 11 14 ## [2,] 112 15 ## [3,] 113 16 Sometimes data get imported using different encoding for what should be interpreted as FALSE and TRUE : # replace and return logical vactor x2 <- c("High","n/a","High","High","Low") multiCharReplace(x2,cbind(old=c("n/a","Low","High"), new=c(NA,FALSE,TRUE)), convTo="logical") ## [1] TRUE NA TRUE TRUE FALSE ### Multi-to-multi Matching of (Concatenated) Terms The function allows to split (if necessary, using strsplit()) two vectors and compare each isolated tag (eg identifyer) from the 1st vector/object against each isolated tag from the second vector/object. This runs like a loop of one to many comparisons. The basic output is a list with indexes of which element of the 1st vector/object has matches in the 2nd vector/object. Since this is not convenient to the human reader, tabular output can be created, too. aa <- c("m","k","j; aa","m; aa; bb; o","n; dd","aa","cc") bb <- c("aa","dd","aa; bb; q","p; cc") ## result as list of indexes (bOnA <- multiMatch(aa, bb, method="asIndex")) # match bb on aa ##$1
## named integer(0)
##
## $2 ## named integer(0) ## ##$3
## aa aa
##  1  3
##
## $4 ## aa aa bb ## 1 3 3 ## ##$5
## dd
##  2
##
## $6 ## aa aa ## 1 3 ## ##$7
## cc
##  4
## more convenient to the human reader
(bOnA <- multiMatch(aa, bb))                     # match bb on aa
##              x x.Ind TagBest y.IndBest y.IndAll   y.Match        y.Adj
## 1            m     1    <NA>        NA     <NA>      <NA>         <NA>
## 2            k     2    <NA>        NA     <NA>      <NA>         <NA>
## 3        j; aa     3      aa         1     1; 3        aa        j; aa
## 4 m; aa; bb; o     4      aa         3  1; 3; 3 aa; bb; q m; aa; bb; o
## 5        n; dd     5      dd         2        2        dd        n; dd
## 6           aa     6      aa         1     1; 3        aa           aa
## 7           cc     7      cc         4        4     p; cc           cc
(bOnA <- multiMatch(aa, bb, method="matchedL"))  # match bb on aa
## $1 ## named integer(0) ## ##$2
## named integer(0)
##
## $3 ## aa aa ## 1 3 ## ##$4
## aa aa bb
##  1  3  3
##
## $5 ## dd ## 2 ## ##$6
## aa aa
##  1  3
##
## $7 ## cc ## 4 ## Search For Similar (Numeric) Values This section addresses values that are not truly identical but may differ only in the very last digit(s) and thus may be in a pragmatic view get considered and treated as ‘about the same’. The simplest approach would be to round values and then look for identical values. The functions presented here (like checkSimValueInSer) offer this type of search in a convenient way. va1 <- c(4:7,7,7,7,7,8:10)+(1:11)/28600 checkSimValueInSer(va1) ## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE cbind(va=va1, simil=checkSimValueInSer(va1)) ## va simil ## [1,] 4.000035 0 ## [2,] 5.000070 0 ## [3,] 6.000105 0 ## [4,] 7.000140 1 ## [5,] 7.000175 1 ## [6,] 7.000210 1 ## [7,] 7.000245 1 ## [8,] 7.000280 1 ## [9,] 8.000315 0 ## [10,] 9.000350 0 ## [11,] 10.000385 0 ### Find similar numeric values of two columns of a matrix The search for similar values may be preformed as absolute distance or as ‘ppm’ (as it is eg usual in proteomics when comparing measured and theoretically expected mass). aA <- c(11:17); bB <- c(12.001,13.999); cC <- c(16.2,8,9,12.5,15.9,13.5,15.7,14.1,5) (cloMa <- findCloseMatch(x=aA, y=cC, com="diff", lim=0.5, sor=FALSE))  ##$x2
##  y4
## 0.5
##
## $x3 ## y4 y6 ## -0.5 0.5 ## ##$x4
##   y6   y8
## -0.5  0.1
##
## $x6 ## y1 y5 y7 ## 0.2 -0.1 -0.3 The result of findCloseMatch() is a list organized by each ‘x’, telling all instances of ‘y’ found within the distance tolerance given by lim. Using closeMatchMatrix() the result obtained above, can be presented in a more convenient format for the human eye. # all matches (of 2d arg) to/within limit for each of 1st arg ('x'); 'y' ..to 2nd arg = cC # first let's display only one single closest/best hit (maAa <- closeMatchMatrix(cloMa, aA, cC, lim=TRUE)) # ## id.aA aA id.cC cC disToPred ppmToPred nByGrp isMin nBest ## [1,] 2 12 4 12.5 -0.5 -40000.0 1 1 1 ## [2,] 3 13 4 12.5 0.5 40000.0 2 1 2 ## [3,] 3 13 6 13.5 -0.5 -37037.0 2 1 2 ## [4,] 4 14 8 14.1 -0.1 -7092.2 2 1 1 ## [5,] 6 16 5 15.9 0.1 6289.3 3 1 1 Using the argument limitToBest=FALSE we can display all distances within the limits imposed, some values/points may occur multiple times. For example, value number 4 of ‘cC’ (=12.5) or value number 3 of ‘aA’ (=13) now occur multiple times… (maAa <- closeMatchMatrix(cloMa,aA, cC, lim=FALSE,origN=TRUE)) # ## id.aA aA id.cC cC disToPred ppmToPred nByGrp isMin nBest ## [1,] 2 12 4 12.5 -0.5 -40000.0 1 1 1 ## [2,] 3 13 4 12.5 0.5 40000.0 2 1 2 ## [3,] 3 13 6 13.5 -0.5 -37037.0 2 1 2 ## [4,] 4 14 6 13.5 0.5 37037.0 2 0 0 ## [5,] 4 14 8 14.1 -0.1 -7092.2 2 1 1 ## [6,] 6 16 7 15.7 0.3 19108.0 3 0 0 ## [7,] 6 16 5 15.9 0.1 6289.3 3 1 1 ## [8,] 6 16 1 16.2 -0.2 -12346.0 3 0 0 (maAa <- closeMatchMatrix(cloMa, cbind(valA=81:87,aA), cbind(valC=91:99,cC), colM=2, colP=2,lim=FALSE)) ## -> closeMatchMatrix : reset argument 'origNa' to FALSE since names of 'predMatr' and/or 'measMatr' result of formula and would be too long ## id.pred valA aA id.meas valC cC disToPred ppmToPred nByGrp isMin nBest ## [1,] 2 82 12 4 94 12.5 -0.5 -40000.0 1 1 1 ## [2,] 3 83 13 4 94 12.5 0.5 40000.0 2 1 2 ## [3,] 3 83 13 6 96 13.5 -0.5 -37037.0 2 1 2 ## [4,] 4 84 14 6 96 13.5 0.5 37037.0 2 0 0 ## [5,] 4 84 14 8 98 14.1 -0.1 -7092.2 2 1 1 ## [6,] 6 86 16 7 97 15.7 0.3 19108.0 3 0 0 ## [7,] 6 86 16 5 95 15.9 0.1 6289.3 3 1 1 ## [8,] 6 86 16 1 91 16.2 -0.2 -12346.0 3 0 0 (maAa <- closeMatchMatrix(cloMa, cbind(aA,valA=81:87), cC, lim=FALSE, deb=TRUE)) # ## .. xxidentToMatr2a ## .. xxidentToMatr2c ## -> closeMatchMatrix : reset argument 'origNa' to FALSE since names of 'predMatr' and/or 'measMatr' result of formula and would be too long ## .. xxidentToMatr2d ## .. xxidentToMatr2e ## .. xxidentToMatr2f ## id.pred aA valA id.meas measMatr disToPred ppmToPred nByGrp isMin nBest ## [1,] 2 12 82 4 12.5 -0.5 -40000.0 1 1 1 ## [2,] 3 13 83 4 12.5 0.5 40000.0 2 1 2 ## [3,] 3 13 83 6 13.5 -0.5 -37037.0 2 1 2 ## [4,] 4 14 84 6 13.5 0.5 37037.0 2 0 0 ## [5,] 4 14 84 8 14.1 -0.1 -7092.2 2 1 1 ## [6,] 6 16 86 7 15.7 0.3 19108.0 3 0 0 ## [7,] 6 16 86 5 15.9 0.1 6289.3 3 1 1 ## [8,] 6 16 86 1 16.2 -0.2 -12346.0 3 0 0 a2 <- aA; names(a2) <- letters[1:length(a2)]; c2 <- cC; names(c2) <- letters[10+1:length(c2)] (cloM2 <- findCloseMatch(x=a2, y=c2, com="diff", lim=0.5, sor=FALSE))  ##$b
##   n
## 0.5
##
## $c ## n p ## -0.5 0.5 ## ##$d
##    p    r
## -0.5  0.1
##
## $f ## k o q ## 0.2 -0.1 -0.3 (maA2 <- closeMatchMatrix(cloM2, predM=cbind(valA=81:87,a2), measM=cbind(valC=91:99,c2), colM=2, colP=2, lim=FALSE, asData=TRUE))  ## -> closeMatchMatrix : reset argument 'origNa' to FALSE since names of 'predMatr' and/or 'measMatr' result of formula and would be too long ## id.pred valA a2 id.meas valC c2 disToPred ppmToPred nByGrp isMin nBest ## b b 82 12 n 94 12.5 -0.5 -40000.0 1 1 1 ## c_1 c 83 13 n 94 12.5 0.5 40000.0 2 1 2 ## c_2 c 83 13 p 96 13.5 -0.5 -37037.0 2 1 2 ## d_1 d 84 14 p 96 13.5 0.5 37037.0 2 0 0 ## d_2 d 84 14 r 98 14.1 -0.1 -7092.2 2 1 1 ## f_1 f 86 16 q 97 15.7 0.3 19108.0 3 0 0 ## f_2 f 86 16 o 95 15.9 0.1 6289.3 3 1 1 ## f_3 f 86 16 k 91 16.2 -0.2 -12346.0 3 0 0 (maA2 <- closeMatchMatrix(cloM2, cbind(id=names(a2),valA=81:87,a2), cbind(id=names(c2), valC=91:99,c2), colM=3, colP=3, lim=FALSE, deb=FALSE))  ## -> closeMatchMatrix : reset argument 'origNa' to FALSE since names of 'predMatr' and/or 'measMatr' result of formula and would be too long ## id.pred valA a2 id.meas valC c2 disToPred ppmToPred nByGrp ## b "b" "82" "12" "n" "94" "12.5" "-0.5" "-40000" "1" ## c "c" "83" "13" "n" "94" "12.5" "0.5" "40000" "2" ## c "c" "83" "13" "p" "96" "13.5" "-0.5" "-37037" "2" ## d "d" "84" "14" "p" "96" "13.5" "0.5" "37037" "2" ## d "d" "84" "14" "r" "98" "14.1" "-0.0999999999999996" "-7092.2" "2" ## f "f" "86" "16" "q" "97" "15.7" "0.300000000000001" "19108" "3" ## f "f" "86" "16" "o" "95" "15.9" "0.0999999999999996" "6289.3" "3" ## f "f" "86" "16" "k" "91" "16.2" "-0.199999999999999" "-12346" "3" ## isMin nBest ## b "1" "1" ## c "1" "2" ## c "1" "2" ## d "0" "0" ## d "1" "1" ## f "0" "0" ## f "1" "1" ## f "0" "0" ### Find Similar Numeric Values from Two Vectors/Matrixes For comparing two sets of data one may use findSimilarFrom2sets. aA <- c(11:17); bB <- c(12.001,13.999); cC <- c(16.2,8,9,12.5,12.6,15.9,14.1) aZ <- matrix(c(aA,aA+20), ncol=2, dimnames=list(letters[1:length(aA)],c("aaA","aZ"))) cZ <- matrix(c(cC,cC+20), ncol=2, dimnames=list(letters[1:length(cC)],c("ccC","cZ"))) findCloseMatch(cC,aA,com="diff",lim=0.5,sor=FALSE) ##$x1
##   y6
## -0.2
##
## $x4 ## y2 y3 ## -0.5 0.5 ## ##$x5
##  y3
## 0.4
##
## $x6 ## y6 ## 0.1 ## ##$x7
##   y4
## -0.1
findSimilFrom2sets(aA,cC)
##      aA predMatr[, ] cC measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,]  2           12  4     12.5      -0.5  -40000.0      1     1     1
## [2,]  3           13  4     12.5       0.5   40000.0      2     0     0
## [3,]  3           13  5     12.6       0.4   31746.0      2     1     1
## [4,]  4           14  7     14.1      -0.1   -7092.2      1     1     1
## [5,]  6           16  6     15.9       0.1    6289.3      2     1     1
## [6,]  6           16  1     16.2      -0.2  -12346.0      2     0     0
findSimilFrom2sets(cC,aA)
##      cC predMatr[, ] aA measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,]  1         16.2  6       16       0.2   12500.0      1     1     1
## [2,]  4         12.5  2       12       0.5   41667.0      2     1     2
## [3,]  4         12.5  3       13      -0.5  -38462.0      2     1     2
## [4,]  5         12.6  3       13      -0.4  -30769.0      1     1     1
## [5,]  6         15.9  6       16      -0.1   -6250.0      1     1     1
## [6,]  7         14.1  4       14       0.1    7142.9      1     1     1
findSimilFrom2sets(aA,cC,best=FALSE)
##      aA predMatr[, ] cC measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,]  2           12  4     12.5      -0.5  -40000.0      1     1     1
## [2,]  3           13  4     12.5       0.5   40000.0      2     0     0
## [3,]  3           13  5     12.6       0.4   31746.0      2     1     1
## [4,]  4           14  7     14.1      -0.1   -7092.2      1     1     1
## [5,]  6           16  6     15.9       0.1    6289.3      2     1     1
## [6,]  6           16  1     16.2      -0.2  -12346.0      2     0     0
findSimilFrom2sets(aA,cC,comp="ppm",lim=5e4,deb=TRUE)
## xxfindSimilFrom2sets2
## .. xxidentToMatr2a
## .. xxidentToMatr2c
## .. xxidentToMatr2d
## .. xxidentToMatr2e
## .. xxidentToMatr2f
##      aA predMatr[, ] cC measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,]  2           12  4     12.5      -0.5  -40000.0      1     1     1
## [2,]  3           13  4     12.5       0.5   40000.0      2     0     0
## [3,]  3           13  5     12.6       0.4   31746.0      2     1     1
## [4,]  4           14  7     14.1      -0.1   -7092.2      1     1     1
## [5,]  6           16  6     15.9       0.1    6289.3      2     1     1
## [6,]  6           16  1     16.2      -0.2  -12346.0      2     0     0
## [7,]  7           17  1     16.2       0.8   49383.0      1     1     1
findSimilFrom2sets(aA,cC,comp="ppm",lim=9e4,bestO=FALSE)
##       aA predMatr[, ] cC measMatr disToPred ppmToPred nByGrp isMin nBest
##  [1,]  2           12  4     12.5      -0.5  -40000.0      2     1     1
##  [2,]  2           12  5     12.6      -0.6  -47619.0      3     0     0
##  [3,]  3           13  4     12.5       0.5   40000.0      2     0     0
##  [4,]  3           13  5     12.6       0.4   31746.0      3     1     1
##  [5,]  3           13  7     14.1      -1.1  -78014.0      3     0     0
##  [6,]  4           14  7     14.1      -0.1   -7092.2      1     1     1
##  [7,]  5           15  7     14.1       0.9   63830.0      3     1     2
##  [8,]  5           15  6     15.9      -0.9  -56604.0      3     1     2
##  [9,]  5           15  1     16.2      -1.2  -74074.0      2     0     0
## [10,]  6           16  6     15.9       0.1    6289.3      3     0     0
## [11,]  6           16  1     16.2      -0.2  -12346.0      2     0     0
## [12,]  7           17  6     15.9       1.1   69182.0      2     1     0
## [13,]  7           17  1     16.2       0.8   49383.0      2     1     2
# below: find fewer 'best matches' since search window larger (ie more good hits compete !)
findSimilFrom2sets(aA,cC,comp="ppm",lim=9e4,bestO=TRUE)      
##      aA predMatr[, ] cC measMatr disToPred ppmToPred nByGrp isMin nBest
## [1,]  2           12  4     12.5      -0.5  -40000.0      2     1     1
## [2,]  3           13  5     12.6       0.4   31746.0      3     1     1
## [3,]  4           14  7     14.1      -0.1   -7092.2      1     1     1
## [4,]  5           15  7     14.1       0.9   63830.0      3     1     2
## [5,]  5           15  6     15.9      -0.9  -56604.0      3     1     2
## [6,]  7           17  6     15.9       1.1   69182.0      2     1     0
## [7,]  7           17  1     16.2       0.8   49383.0      2     1     2

### Fuse Previously Identified Pairs to ‘Clusters’

When you have already identified the closest neighbour of a set of values, you may want to re-organize/fuse such pairs to a given number of total clusters (using fusePairs).

(daPa <- matrix(c(1:5,8,2:6,9), ncol=2))
##      [,1] [,2]
## [1,]    1    2
## [2,]    2    3
## [3,]    3    4
## [4,]    4    5
## [5,]    5    6
## [6,]    8    9
fusePairs(daPa, maxFuse=4)
## 1 2 3 4 4 5 6 8 9
## 1 1 1 1 2 2 2 3 3

### Eliminate Close (Overlapping) Points (in Bivariate x & y Space)

When visualizing larger data-sets in an x&y space one may find many points overlapping when their values are almost the same.
The function elimCloseCoord() aims to do reduce a bivariate data-set to ‘non-overlapping’ points, somehow similar to human perception.

da1 <- matrix(c(rep(0:4,5),0.01,1.1,2.04,3.07,4.5), ncol=2); da1[,1] <- da1[,1]*99; head(da1)
##      [,1] [,2]
## [1,]    0    0
## [2,]   99    1
## [3,]  198    2
## [4,]  297    3
## [5,]  396    4
## [6,]    0    0
elimCloseCoord(da1)
##  -> elimCloseCoord :  reducing 'x' from 15 to 7 lines
##    [,1] [,2]
## 1     0  0.0
## 2    99  1.0
## 3   198  2.0
## 4   297  3.0
## 5   396  4.0
## 12   99  1.1
## 15  396  4.5

### Mode of (Continuous) Data

Looking for the mode is rather easy with counting data. Looking at the result of table() will get you there quickly. However, with continuous data what people intuitively consider the mode is rather a result of a density estimation and may be more tricky to identify. The most frequent value may be quite distant to the most dense region of data. The function stableMode() presented here has different modes of operation, at this point there is no clear rule which mode may perform most satisfactory on different data.

set.seed(2012); dat <- round(c(rnorm(120,0,1.2), rnorm(80,0.8,0.6), rnorm(25,-0.6,0.05), runif(200)),3)
dat <- dat[which(dat > -2 & dat <2)]
stableMode(dat)
##  -> stableMode : Method='density',  length of x =406, 'bandw' has been set to 28
##   221
## 0.477

Now we can try to show on a plot :

layout(1:2)
plot(1:length(dat), sort(dat), type="l", main="sorted values", las=1)
abline(h=stableMode(dat), lty=2,col=2)
##  -> stableMode : Method='density',  length of x =406, 'bandw' has been set to 28
plot(density(dat, kernel="gaussian", adjust=0.7))
abline(v=stableMode(dat, method="dens"), lty=2, col="red", lwd=2)
##  -> stableMode : Method='density',  length of x =406, 'bandw' has been set to 28
abline(v=stableMode(dat, method="binning"), lty=2, col="green")
##  -> stableMode : Method='binning', value of 'bandw'=21 may be too high for good functioning !
abline(v=stableMode(dat, method="BBmisc"), lty=2, col="blue")  
## Loading required namespace: BBmisc
abline(v=stableMode(dat, method="allModes"), lty=2, col="grey55")  

Please note that plotting data modelled via a Kernell function also relies on strong hypothesis which may not be justified in a number of cases. Looking for the most frequent exact value is not a perfect choice for continous data. Rather due to the rounding to 3 digits in this example the method ‘allModes’ gave partially usable results (dashed grey lines), as in the example above it may give ties.

### Most Frequently Occuring Value (traditional mode)

The function stableMode() can also be used to locate the most frequently occuring exact value of numeric or character vectors. The argument method=“allModes” allows finding all ties (if any).

set.seed(2021)
x <- sample(letters, 50000, replace=TRUE)
stableMode(dat, method="mode")
## [1] 0.173
stableMode(dat, method="allModes")
## [1] 0.173 0.629 0.676

## Working with Regressions

### Best Starting Point For Linear Regressions

In many types of measurments the very low level measures are delicate. Especially when the readout starts with a baseline signal before increasing amounts of the analyte start producing a linear relationship. In such cases some of the very lowest levels of the analyte are masked by the (random) baseline signal. The function linModelSelect() presented here allows omitting some of the lowest analyte measures to focus on the linear part of the dose-response relationship.

li1 <- rep(c(4,3,3:6),each=3) + round(runif(18)/5,2)
names(li1) <- paste0(rep(letters[1:5], each=3), rep(1:3,6))
li2 <- rep(c(6,3:7), each=3) + round(runif(18)/5, 2)
dat2 <- rbind(P1=li1, P2=li2)
exp2 <- rep(c(11:16), each=3)
exp4 <- rep(c(3,10,30,100,300,1000), each=3)

## Check & plot for linear model
linModelSelect("P1", dat2, expect=exp2)
##  -> linModelSelect :  best slope pVal starting at level no 3

## $coef ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -9.523 0.24178213 -39.38670 2.659588e-12 ## conc 0.974 0.01662528 58.58547 5.098121e-14 ## ##$name
## [1] "P1"
##
## $startLev ## [1] 3 linModelSelect("P2", dat2, expect=exp2) ## -> linModelSelect : best slope pVal starting at level no 2 ##$coef
##              Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept) -9.028667 0.099481734 -90.75703 1.320255e-19
## conc         1.006000 0.007069859 142.29421 3.838234e-22
##
## $name ## [1] "P2" ## ##$startLev
## [1] 2

### High Throughput Testing For Linear Regressions

Once we have run multiple linear regressions on differt parts of the data we might wat to compare them in a single plot. Below, we construct 10 series of data that get modeled the same way, ideally one would obtain a slope close to 1.0. We still allow omitting some starting points, if the resulting model would fit better.

set.seed(2020)
x1 <- matrix(rep(c(2,2:5),each=20) + runif(100) +rep(c(0,0.5,2:3,5),20), byrow=FALSE, ncol=10, dimnames=list(LETTERS[1:10],NULL))
## just the 1st regression :
summary(lm(b~a,data=data.frame(b=x1[,1],a=rep(1:5,each=2))))
##
## Call:
## lm(formula = b ~ a, data = data.frame(b = x1[, 1], a = rep(1:5,
##     each = 2)))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.3811 -0.6719 -0.5001  1.3683  2.6876
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.7850     1.3668   2.038   0.0759 .
## a             0.5545     0.4121   1.346   0.2153
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.843 on 8 degrees of freedom
## Multiple R-squared:  0.1846, Adjusted R-squared:  0.08263
## F-statistic: 1.811 on 1 and 8 DF,  p-value: 0.2153
## all regressions
x1.lmSum <- t(sapply(lapply(rownames(x1), linModelSelect, dat=x1, expect=rep(1:5,each=2), silent=TRUE, plotGraph=FALSE),
function(x) c(x$coef[2,c(4,1)], startFr=x$startLev)))
x1.lmSum <- cbind(x1.lmSum, medQuantity=apply(x1,1,median))
x1.lmSum[,1] <- log10(x1.lmSum[,1])
head(x1.lmSum)
##    Pr(>|t|)  Estimate startFr medQuantity
## A -4.298797 0.7837628       1    3.781966
## B -4.828403 1.1542815       2    3.756802
## C -5.873269 0.6638477       1    5.883383
## D -6.518075 0.7793624       1    6.703049
## E -5.288792 0.8599269       1    8.725195
## F -5.031322 0.9901120       2    3.286851

Now we can try to plot

wrGraphOK <- requireNamespace("wrGraph", quietly=TRUE)      # check if package is available
if(wrGraphOK) wrGraph::plotW2Leg(x1.lmSum, useCol=c("Pr(>|t|)","Estimate","medQuantity","startFr"), legendloc="topleft", txtLegend="start at")

## Combinatorics Issues

### All Pairwise Ratios

ratioAllComb() calculates all possible pairwise ratios between all individual calues of x and y.

set.seed(2014); ra1 <- c(rnorm(9,2,1),runif(8,1,2))

Let’s assume there are 2 parts of ‘x’ for which we would like to know the representative ratio : The ratio of medians does not well reflect the typical ratio (if each element has the same chance to be picked).

median(ra1[1:9])/median(ra1[10:17])
## [1] 1.327086

Instead, we’ll build all possible ratios and summarize then.

summary( ratioAllComb(ra1[1:9],ra1[10:17]))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.359   1.142   1.274   1.290   1.506   2.777
boxplot(list(norm=ra1[1:9],unif=ra1[10:17],rat=ratioAllComb(ra1[1:9],ra1[10:17])))

## Import/Export

Some programs produce a series of csv files, where a large experiment/data-set was split in multiple files. The function readCsvBatch() was designed for reading multiple csv files of exactly the same layout and to join their content. As output a list with the content of each file can be produced (one matrix per file), or the data may be fused into an array, as shown below.

path1 <- system.file("extdata", package="wrMisc")
fiNa <-  c("pl01_1.csv","pl01_2.csv","pl02_1.csv","pl02_2.csv")
datAll <- readCsvBatch(fiNa, path1)
##  -> readCsvBatch :  based on 'excludeFiles': excluding 0 files (out of 4)
##  -> readCsvBatch :  ready to start reading 4 files;  expecting header(s) : TRUE
##  -> readCsvBatch :  csv-format used 'Europe'  with 96 lines & 3 cols
##  -> readCsvBatch :  ..reading file C:/Users/wraff/AppData/Local/Temp/RtmpwpZunV/Rinst3a146fe84fde/wrMisc/extdata/pl01_1.csv
## -> readCsvBatch -> .inspectHeader -> convToNum : length(x) 96   class(x) character    mode(x) character   head pl01_1 pl01_1 pl01_1 pl01_1 pl01_1 pl01_1
## -> readCsvBatch -> .inspectHeader -> convToNum : length(x) 96   class(x) character    mode(x) character   head A01 B01 C01 D01 E01 F01
## -> readCsvBatch -> .inspectHeader -> convToNum : length(x) 96   class(x) character    mode(x) character   head 158808 174272 183176 175752 49272 39888
## -> readCsvBatch -> .removeEmptyCol :  columns no 1, 2 were considered empty and removed
##  -> readCsvBatch :  ..reading file C:/Users/wraff/AppData/Local/Temp/RtmpwpZunV/Rinst3a146fe84fde/wrMisc/extdata/pl01_2.csv
##  -> readCsvBatch :  ..reading file C:/Users/wraff/AppData/Local/Temp/RtmpwpZunV/Rinst3a146fe84fde/wrMisc/extdata/pl02_1.csv
##  -> readCsvBatch :  ..reading file C:/Users/wraff/AppData/Local/Temp/RtmpwpZunV/Rinst3a146fe84fde/wrMisc/extdata/pl02_2.csv
str(datAll)
##  num [1:96, 1:4, 1] 158808 174272 183176 175752 49272 ...
##  - attr(*, "dimnames")=List of 3
##   ..$: chr [1:96] "A01" "B01" "C01" "D01" ... ## ..$ : chr [1:4] "1_1.csv" "1_2.csv" "2_1.csv" "2_2.csv"
##   ..$: chr "StainA" When setting the first argument fileNames to NULL, you can read all files of a given path. ## batch reading of all csv files in specified path : datAll2 <- readCsvBatch(fileNames=NULL, path=path1, silent=TRUE) ### Batch-Reading of Tabulated Files The function readTabulatedBatch() allows fast batch reading of tabulated files. All files specified (or all files from a given directory) will be read into separate data.frames of a list. Default options are US-style comma, automatic testing for head in case the package data.table is available (otheriwse : no header). Furthermore it is possible to design a given (numeric) column and directly filter for all lines passing a given threshold, allowing to get smaller objects. path1 <- system.file("extdata", package="wrMisc") fiNa <- c("a1.txt","a2.txt") allTxt <- readTabulatedBatch(fiNa, path1) str(allTxt) ## List of 2 ##$ a1.txt:'data.frame':  33 obs. of  3 variables:
##   ..$V1: int [1:33] 3697 3626 732 388503 10747 1564 3699 256394 345 3950 ... ## ..$ V2: num [1:33] 4 6.24 6.63 6.71 8 ...
##   ..$V3: num [1:33] 0.621 0.507 0.575 0.502 0.525 ... ##$ a2.txt:'data.frame':  35 obs. of  3 variables:
##   ..$V1: int [1:35] 6414 57381 8404 10580 79611 4739 10252 221395 4256 4811 ... ## ..$ V2: num [1:35] 1.73 5.83 6.71 7.48 9.49 ...
##   ..$: chr [1:5] "Location" "Names" "Names_2" "Names_3" ... ### Converting url for Reading Tabulated Data from GitHub GitHub allows sharing code and (to a lower degree) data. In order to properly read tabulated (txt, tsv or csv) data directly from a given url, the user should switch to the ‘Raw’ view. The function gitDataUrl allows to conventiently switch any url (on git) to the format from ‘Raw view’, suitable for directly reading the data using read.delim() , read.table() or read.csv() etc ). ## An example url with tabulated data : url1 <- "https://github.com/bigbio/proteomics-metadata-standard/blob/master/annotated-projects/PXD001819/PXD001819.sdrf.tsv" gitDataUrl(url1) ## [1] "https://raw.githubusercontent.com/bigbio/proteomics-metadata-standard/master/annotated-projects/PXD001819/PXD001819.sdrf.tsv" dataPxd <- read.delim(gitDataUrl(url1), sep='\t', header=TRUE) str(dataPxd) ## 'data.frame': 27 obs. of 24 variables: ##$ source.name                          : chr  "Sample 1" "Sample 1" "Sample 1" "Sample 2" ...
##  $characteristics.Organism. : chr "Saccharomyces cerevisiae" "Saccharomyces cerevisiae" "Saccharomyces cerevisiae" "Saccharomyces cerevisiae" ... ##$ characteristics.organism.part.       : chr  "not available" "not available" "not available" "not available" ...
##  $characteristics.disease. : chr "not available" "not available" "not available" "not available" ... ##$ characteristics.cell.type.           : chr  "not applicable" "not applicable" "not applicable" "not applicable" ...
##  $characteristics.mass. : chr "2 mg" "2 mg" "2 mg" "2 mg" ... ##$ characteristics.spiked.compound.     : chr  "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=125 amol;CN=UPS1;CV=Standards Research Group" ...
##  $characteristics.biological.replicate.: int 1 1 1 1 1 1 1 1 1 1 ... ##$ material.type                        : chr  "lysate" "lysate" "lysate" "lysate" ...
##  $assay.name : chr "run 1" "run 2" "run 3" "run 4" ... ##$ technology.type                      : chr  "proteomic profiling by mass spectrometry" "proteomic profiling by mass spectrometry" "proteomic profiling by mass spectrometry" "proteomic profiling by mass spectrometry" ...
##  $comment.label. : chr "AC=MS:1002038;NT=label free sample" "AC=MS:1002038;NT=label free sample" "AC=MS:1002038;NT=label free sample" "AC=MS:1002038;NT=label free sample" ... ##$ comment.instrument.                  : chr  "AC=MS:1001742;NT=LTQ Orbitrap Velos" "AC=MS:1001742;NT=LTQ Orbitrap Velos" "AC=MS:1001742;NT=LTQ Orbitrap Velos" "AC=MS:1001742;NT=LTQ Orbitrap Velos" ...
##  $comment.precursor.mass.tolerance. : chr "5 ppm" "5 ppm" "5 ppm" "5 ppm" ... ##$ comment.fragment.mass.tolerance.     : chr  "0.8 Da" "0.8 Da" "0.8 Da" "0.8 Da" ...
##  $comment.cleavage.agent.details. : chr "NT=trypsin/P;AC=MS:1001313" "NT=trypsin/P;AC=MS:1001313" "NT=trypsin/P;AC=MS:1001313" "NT=trypsin/P;AC=MS:1001313" ... ##$ comment.modification.parameters.     : chr  "NT=Carbamidomethyl;TA=C;MT=fixed;AC=UNIMOD:4" "NT=Carbamidomethyl;TA=C;MT=fixed;AC=UNIMOD:4" "NT=Carbamidomethyl;TA=C;MT=fixed;AC=UNIMOD:4" "NT=Carbamidomethyl;TA=C;MT=fixed;AC=UNIMOD:4" ...
##  $comment.modification.parameters..1 : chr "NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35" "NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35" "NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35" "NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35" ... ##$ comment.modification.parameters..2   : chr  "NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable" "NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable" "NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable" "NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable" ...
##  $comment.technical.replicate. : int 1 2 3 1 2 3 1 2 3 1 ... ##$ comment.fraction.identifier.         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $comment.file.uri. : chr "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/12/PXD001819/UPS1_12500amol_R1.raw" "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/12/PXD001819/UPS1_12500amol_R2.raw" "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/12/PXD001819/UPS1_12500amol_R3.raw" "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/12/PXD001819/UPS1_125amol_R1.raw" ... ##$ comment.data.file.                   : chr  "UPS1_12500amol_R1.raw" "UPS1_12500amol_R2.raw" "UPS1_12500amol_R3.raw" "UPS1_125amol_R1.raw" ...
##  factor.value.spiked.compound. : chr "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=125 amol;CN=UPS1;CV=Standards Research Group" ... ## Normalization The main reason of normalization is to remove variability in the data which is not directly linked to the (original/biological) concept of a given experiment. High throughput data from real world measurements may easily contain various deformations due to technical reasons, eg slight temperature variations, electromagnetic interference, instability of reagents etc. In particular, transferring constant amounts of liquids/reagents in highly repeated steps over large experiments is often also very challenging, small variations of the amounts of liquid (or similar) are typically addressed by normalization. However, applying aggressive normalization to the data also brings considerable risk of starting to loose some of the effects one intended to study. At some point it may rather be better to eliminate a few samples or branches of an experiment to avoid too invasive intervention. This shows that quality control can be tightly linked to decisions about data-normalization. In conclusion, normalization may be far more challenging than simply running some algorithms.. In general the use has to assume/define some hypothesis to justify intervention. Sometimes specific elements of an experiment are known to be not affected and can therefore be used to normalize the rest. Eg, if you observe growth of trees in a forest, big blocks of rock on the floor are assumed no to change their location. So one could use them as alignment-marks to superpose pictures taken at slightly different positions. The hypothesis of no global changes is very common : During the course of many biological experiments (eg change of nutrient) one assumes that only a small portion of the elements measured (eg the abundance of all different gene-products) do change, since many processes of a living cell like growth, replication and interaction with neighbour-cells are assumed not to be affected. So, if one assumes that there are no global changes one normalizes the input-data in a way that the average or median across each experiment will give the same value. In analogy, if one takes photographs on a partially cloudy day, most cameras will adjust light settings (sun r clouds) so that global luminosity stays the same. However, if too many of the measured elements are affected, this normalization approach will lead to (additional) loss of information. I suggest the user also to include graphical representation(s) of the data helping to visualize variability in various samples along a given experiment and how different normalization procedures affect these outcomes. Before jumping into normalization it may be quite useful to filter the data first. The overall idea is, that most high-throughput experiments do produce some non-meaningful data (artefacts) and it may be wise to remove such ‘bad’ data first as they may effect normalization (in particular extreme values). A special case of such problematic data concerns NA-values. ### Filter Lines of Matrix to Reduce Content of NAs Frequent NA-values may represent another potential issue. With NA-values there is no general optimal advice. To get started, you should try to investigate how and why NA-values occurred to check if there is a special ‘meaning’ to them. For example, on some measurement systems values below detection limit may be simply reported as NAs. If the lines of your data represent different features quantified (eg proteins), than lines with mostly NA-values represent features that may not be well exploited anyway. Therefore many times one tries to filter away lines of ‘bad’ data. Of course, if there is a column (sample) with an extremely high content of NAs, one should also investigate what might be particular with this column (sample), to see if one might be better of to eliminate the entire column. The function presenceFilt() allows to eliminate lines containing too many NA-values. dat1 <- matrix(1:56,ncol=7) dat1[c(2,3,4,5,6,10,12,18,19,20,22,23,26,27,28,30,31,34,38,39,50,54)] <- NA dat1; presenceFilt(dat1, gr=gl(3,3)[-(3:4)], maxGr=0) ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] ## [1,] 1 9 17 25 33 41 49 ## [2,] NA NA NA NA NA 42 NA ## [3,] NA 11 NA NA 35 43 51 ## [4,] NA NA NA NA 36 44 52 ## [5,] NA 13 21 29 37 45 53 ## [6,] NA 14 NA NA NA 46 NA ## [7,] 7 15 NA NA NA 47 55 ## [8,] 8 16 24 32 40 48 56 ## 1-2 1-3 2-3 ## [1,] TRUE TRUE TRUE ## [2,] FALSE FALSE FALSE ## [3,] FALSE TRUE TRUE ## [4,] FALSE TRUE TRUE ## [5,] TRUE TRUE TRUE ## [6,] FALSE FALSE FALSE ## [7,] TRUE TRUE FALSE ## [8,] TRUE TRUE TRUE presenceFilt(dat1, gr=gl(2,4)[-1], maxGr=1, ratM=0.1) ## -> presenceFilt : correcting 'maxGrpMiss' for group(s) 1 and 2 due to ratMaxNA=0.1 ## 1-2 ## [1,] TRUE ## [2,] FALSE ## [3,] FALSE ## [4,] FALSE ## [5,] TRUE ## [6,] FALSE ## [7,] FALSE ## [8,] TRUE presenceFilt(dat1, gr=gl(2,4)[-1], maxGr=2, rat=0.5) ## 1-2 ## [1,] TRUE ## [2,] FALSE ## [3,] TRUE ## [4,] TRUE ## [5,] TRUE ## [6,] TRUE ## [7,] TRUE ## [8,] TRUE mat3 <- matrix(c(19,20,30, 18,19,28, 16,14,35),ncol=3) cleanReplicates(mat3,nOutl=1) ## -> cleanReplicates : rownames of 'x' either NULL or not unique, replacing by row-numbers ## -> cleanReplicates : removing 1 entries in lines 2 ## [,1] [,2] [,3] ## 1 19 18 16 ## 2 20 19 NA ## 3 30 28 35 Please note, that imputing NA-values represents another option instead of filtering, multiple other packages address this in detail. All decisions of which approach to use should be data-driven. ### The Function normalizeThis() In biological high-throughput data columns typically represent different samples, which may be organized as replicates. During high-throughput experiments thousands of (independent) elements are measured (eg abundance of gene-products), they are represented by rows. Note, that some experiments may produce a considerable amount of missing data (NAs) which require special attention (dedicated developments exist in other R-packages eg in wrProteo). My general advice is to first carefully look where such missing data is observed and to pay attention to replicate measurements where a given element once was measured with a real numeric value and once as missing information (NA). Of course, graphical representations (PCA, MA-plots, etc) are frequently extremely important to identifying abnormalities and potential problems. The package wrGraph offers also complementary options useful in the context of normalization. set.seed(2015); rand1 <- round(runif(300) +rnorm(300,0,2),3) dat1 <- cbind(ser1=round(100:1+rand1[1:100]), ser2=round(1.2*(100:1+rand1[101:200])-2), ser3=round((100:1+rand1[201:300])^1.2-3)) dat1 <- cbind(dat1, ser4=round(dat1[,1]^seq(2,5,length.out=100)+rand1[11:110],1)) dat1[dat1 <1] <- NA ## Let's get a quick overview of the data summary(dat1) ## ser1 ser2 ser3 ser4 ## Min. : 2.00 Min. : 1.00 Min. : 1.0 Min. : 37.5 ## 1st Qu.: 26.75 1st Qu.: 28.00 1st Qu.: 50.0 1st Qu.: 67210.0 ## Median : 49.50 Median : 59.00 Median :109.0 Median : 332524.0 ## Mean : 51.14 Mean : 58.79 Mean :115.1 Mean : 542279.4 ## 3rd Qu.: 76.25 3rd Qu.: 89.50 3rd Qu.:173.5 3rd Qu.: 925759.5 ## Max. :100.00 Max. :121.00 Max. :263.0 Max. :2123191.7 ## NA's :1 ## some selected lines (indeed, the 4th column appears always much higher) dat1[c(1:5,50:54,95:100),] ## ser1 ser2 ser3 ser4 ## [1,] 100 121 251 10000.6 ## [2,] 100 117 244 11500.2 ## [3,] 99 120 263 12948.1 ## [4,] 99 120 242 14885.1 ## [5,] 97 114 236 16382.3 ## [6,] 51 60 111 892534.1 ## [7,] 48 58 109 812490.4 ## [8,] 49 55 108 982907.4 ## [9,] 50 56 107 1188787.7 ## [10,] 45 47 102 915343.6 ## [11,] 3 6 5 206.4 ## [12,] 5 2 7 2570.0 ## [13,] 8 1 3 27125.8 ## [14,] 6 4 5 6975.9 ## [15,] 3 3 1 237.0 ## [16,] 2 1 NA 37.5 Our toy data may be normalized by a number of different criteria. In real applications the nature of the data and the type of deformation detected/expected will largely help deciding which normalization might be the ‘best’ choice. Here we’ll try first normalizing by the mean, ie all columns will be forced to end up with the same column-mean. The trimmed mean does not consider values at extremes (as outliers are frequently artefacts and display extreme values). When restricting even stronger which values to consider one will eventually end up with the median (3rd method used below). no1 <- normalizeThis(dat1, refGrp=1:3, meth="mean") no2 <- normalizeThis(dat1, refGrp=1:3, meth="trimMean", trim=0.4) no3 <- normalizeThis(dat1, refGrp=1:3, meth="median") no4 <- normalizeThis(dat1, refGrp=1:3, meth="slope", quantFa=c(0.2,0.8)) It is suggested to verify normalization results by plots. Note, that Box plots may not be appropriate in some cases (eg multimodal distributions), for more details you may consider using Violin-Plots from packages vioplot or wrGraph, another option might be a (cumulated) frequency plot (eg in package wrGraph). You can see clearly, that the 4th data-set has a problem of range. So we’ll see if some proportional normalization may help to make it more comparable to the other ones. ### Matrix Coordinates of Values/points According to Filtering Sometimes one needs to obtain the coordinates of values/points of a matrix according to a given filtering condition. The standard approach using which() gives only a linearized index but not row/column, which is sufficient for replacing indexed values. If you need to know the true row/column indexes, you may use coordOfFilt(). ## [1] 2 3 6 7 9 14 26 ## row col ## [1,] 2 1 ## [2,] 3 1 ## [3,] 3 2 ## [4,] 1 3 ## [5,] 3 3 ## [6,] 2 5 ## [7,] 2 9 ## Statistical Testing ### Normal Random Number Generation with Close Fit to Expected mean and sd When creating random values to an expected mean and sd, the standard function rnorm() may deviate somehow from the expected mean and sd, in particular with low n. To still produce random values fitting closely to the expected mean and sd you may use the function rnormW(). In case of n=2, there is only one possible result, thus there is no random-component in the vector of 2 values. Otherwise, the random component can be fixed usin the argument seed. ## some sample data : x1 <- (11:16)[-5] mean(x1); sd(x1) ## [1] 13.2 ## [1] 1.923538 ## the standard way for gerenating normal random values ra1 <- rnorm(n=length(x1), mean=mean(x1), sd=sd(x1)) ## In particular with low n, the random values deviate somehow from expected mean and sd : mean(ra1) -mean(x1)  ## [1] -1.103347 sd(ra1) -sd(x1) ## [1] 0.3920622 ## random numbers with close fit to expected mean and sd : ra2 <- rnormW(length(x1), mean(x1), sd(x1)) mean(ra2) -mean(x1)  ## [1] 0 sd(ra2) -sd(x1) # much closer to expected value ## [1] -4.440892e-16 ### Moderated Pair-wise t-Test from limma If you are not familiar with the way data is handled in the Bioconductor package limma and you would like to use some of the tools for running moderated t-tests therein, this will provide easy access using moderTest2grp : set.seed(2017); t8 <- matrix(round(rnorm(1600,10,0.4),2), ncol=8, dimnames=list(paste("l",1:200), c("AA1","BB1","CC1","DD1","AA2","BB2","CC2","DD2"))) t8[3:6,1:2] <- t8[3:6,1:2]+3 # augment lines 3:6 for AA1&BB1 t8[5:8,5:6] <- t8[5:8,5:6]+3 # augment lines 5:8 for AA2&BB2 (c,d,g,h should be found) t4 <- log2(t8[,1:4]/t8[,5:8]) fit4 <- moderTest2grp(t4, gl(2,2)) ## now we'll use limma's topTable() function to look at the 'best' results if("list" %in% mode(fit4)) { # if you have limma installed we can look further library(limma) topTable(fit4, coef=1,n=5) # effect for 3,4,7,8 fit4in <- moderTest2grp(t4, gl(2,2), testO="<") if("list" %in% mode(fit4in)) topTable(fit4in, coef=1,n=5) } ## -> moderTest2grp : testing alternative hypothesis: true difference in means is less than 0 (ie focus on 101 results with A less than B) ## logFC AveExpr t P.Value adj.P.Val B ## l 7 -0.4975806 -0.2436786 -8.712092 3.994695e-17 7.989390e-15 30.668381 ## l 4 0.4020373 0.1890232 7.039234 1.000000e+00 1.000000e+00 17.723883 ## l 8 -0.3735170 -0.2259811 -6.539873 9.417239e-11 9.417239e-09 14.392733 ## l 3 0.3508834 0.1488240 6.143585 1.000000e+00 1.000000e+00 11.923522 ## l 27 -0.1348878 -0.1011609 -2.361738 9.333949e-03 6.222633e-01 -3.878176 ### Multiple Moderated Pair-wise t-Tests from limma If you want to make multiple pair-wise comparisons using moderTestXgrp : grp <- factor(rep(LETTERS[c(3,1,4)],c(2,3,3))) set.seed(2017); t8 <- matrix(round(rnorm(208*8,10,0.4),2), ncol=8, dimnames=list(paste(letters[],rep(1:8,each=26),sep=""), paste(grp,c(1:2,1:3,1:3),sep=""))) t8[3:6,1:2] <- t8[3:6,1:2] +3 # augment lines 3:6 (c-f) t8[5:8,c(1:2,6:8)] <- t8[5:8,c(1:2,6:8)] -1.5 # lower lines t8[6:7,3:5] <- t8[6:7,3:5] +2.2 # augment lines ## expect to find C/A in c,d,g, (h) ## expect to find C/D in c,d,e,f ## expect to find A/D in f,g,(h) test8 <- moderTestXgrp(t8, grp) head(test8p.value, n=8) 
##             A-C          A-D          C-D
## a1 8.736828e-02 6.776543e-02 9.397304e-01
## b1 4.384118e-01 5.400019e-01 8.205610e-01
## c1 1.094834e-19 6.344497e-01 2.571471e-21
## d1 2.671725e-13 9.915692e-01 2.858699e-13
## e1 1.802454e-03 2.413137e-08 9.735465e-16
## f1 3.188362e-01 2.527208e-32 2.226490e-22
## g1 1.166242e-29 6.410057e-33 5.484445e-01
## h1 1.141181e-05 1.943795e-05 5.674938e-01

### Transform p-values to local false dicovery rate (lfdr)

To get an introduction into local false discovery rate estimations you may read Strimmer 2008. An overview of R packages in this context was prepared by the strimmerlab. A convenient way to get lfdr values is available suing the function pVal2lfdr.

Note, that the toy-example used below is too small for estimating really meaningful lfdr values. For this reason the function fdrtool() from package fdrtool will issue warnings.

set.seed(2017); t8 <- matrix(round(rnorm(160,10,0.4),2), ncol=8, dimnames=list(letters[1:20],
c("AA1","BB1","CC1","DD1","AA2","BB2","CC2","DD2")))
t8[3:6,1:2] <- t8[3:6,1:2]+3   # augment lines 3:6 (c-f) for AA1&BB1
t8[5:8,5:6] <- t8[5:8,5:6]+3   # augment lines 5:8 (e-h) for AA2&BB2 (c,d,g,h should be found)
head(pVal2lfdr(apply(t8, 1, function(x) t.test(x[1:4],x[5:8])$p.value))) ## Warning in fdrtool::fdrtool(z, statistic = "pvalue", plot = FALSE, verbose ## = !silent): There may be too few input test statistics for reliable FDR ## calculations! ## a b c d e f ## 1.0000000 0.5753562 0.5753562 1.0000000 1.0000000 1.0000000 ### Extract Groups of Replicates from Pair-wise Column-Names When running multiple pairwise tests (using moderTestXgrp()) the column-names are concatenated group-names. To get the index of which group has been used in which pair-wise set you may use the function matchSampToPairw(), as shown below. ## make example if limma is not installed if(!requireNamespace("limma", quietly=TRUE)) test8 <- list(FDR=matrix(1,nrow=2,ncol=3,dimnames=list(NULL,c("A-C","A-D","C-D")))) matchSampToPairw(unique(grp), colnames(test8$FDR)) 
##     le ri
## A-C  2  1
## A-D  2  3
## C-D  1  3

### Extract Numeric Part of Column-Names

When running multiple pairwise tests (using moderTestXgrp()) the results will be in adjacent columns and the group-names reflected in the column-names. In the case measurements from multiple levels of a given variable are compared it is useful to extract the numeric part, the function numPairDeColNames() provides support to do so. When extracting just the numeric part, unit names will get lost, though. Note, if units used are not constant (eg seconds and milliseconds mixed) the extracted numeric values do not reflect the real quantitative context any more.

mat1 <- matrix(1:8, nrow=2, dimnames=list(NULL, paste0(1:4,"-",6:9)))
numPairDeColNames(mat1)
##  -> numPartDeColNames : PROBLEM ? : 'stripTxt' does REMOVE the separator 'sep' ! Select a different separator or 'stripTxt' strategy to resolve pairwise combinations !
##      index log2rat conc1 conc2
## [1,]     1   2.585     1     6
## [2,]     2   1.807     2     7
## [3,]     3   1.415     3     8
## [4,]     4   1.170     4     9

## Working With Clustering

Multiple concepts for clustering have been deeveloped, most of them allow extracting a vector with the cluster-numbers. Here some functions helping to work with the output of such clustering results are presented.

### Prepare Data for Clustering

The way how to prepare data for clustering may be as important as the choice of the actual clustering-algorithm …

Many clustering algorithms are available in R (eg see also CRAN Task View: Cluster Analysis & Finite Mixture Models), many of them require the input data to be standardized. The regular way of standardizing sets all elements to mean=0 and sd=1. To do so, the function scale() may be used.

dat <- matrix(2*round(runif(100),2), ncol=4)
mean(dat); sd(dat)
## [1] 1.034
## [1] 0.5377882
datS <- scale(dat)
apply(datS, 2, sd)
## [1] 1 1 1 1
# each column was teated separately
mean(datS); sd(datS); range(datS)
## [1] 2.202828e-18
## [1] 0.9847319
## [1] -2.018697  1.814272
# the mean is almost 0.0 and the sd almost 1.0

datB <- scale(dat, center=TRUE, scale=FALSE)
mean(datB); sd(datB); range(datB)              # mean is almost 0
## [1] 4.438724e-18
## [1] 0.5291826
## [1] -0.9696  0.9904

However, if you want the entire data-set and not each column sparately, you may use standardW(). Thus, relative differences visible within a line will be conserved. Furthermore, in case of 3-dim arrays, this function returns also the same dimensions as the input.

datS2 <- standardW(dat)
apply(datS2, 2, sd)
## [1] 1.0490828 1.1363627 0.8732244 0.9163109
summary(datS2)
##        V1                V2                 V3                 V4
##  Min.   :-1.7367   Min.   :-1.88550   Min.   :-1.69955   Min.   :-1.1417
##  1st Qu.:-1.1417   1st Qu.:-1.25328   1st Qu.:-0.50949   1st Qu.:-0.5467
##  Median :-0.3979   Median :-0.10041   Median : 0.04835   Median : 0.4202
##  Mean   :-0.2298   Mean   :-0.08256   Mean   : 0.06322   Mean   : 0.2492
##  3rd Qu.: 0.6434   3rd Qu.: 0.86651   3rd Qu.: 0.68057   3rd Qu.: 0.9037
##  Max.   : 1.4987   Max.   : 1.75906   Max.   : 1.64749   Max.   : 1.7591
mean(datS2); sd(datS2)
## [1] -4.904497e-17
## [1] 1
datS3 <- standardW(dat, byColumn=TRUE)
apply(datS3, 2, sd)
## [1] 0.9532136 0.8800007 1.1451811 1.0913326
summary(datS3)
##        V1                V2                 V3                V4
##  Min.   :-1.5782   Min.   :-1.58659   Min.   :-2.2316   Min.   :-2.3296
##  1st Qu.:-0.5325   1st Qu.:-0.62115   1st Qu.:-0.9753   1st Qu.:-1.1323
##  Median : 0.1234   Median :-0.01571   Median :-0.1874   Median :-0.3815
##  Mean   : 0.2191   Mean   : 0.07265   Mean   :-0.0724   Mean   :-0.2719
##  3rd Qu.: 1.0450   3rd Qu.: 0.83518   3rd Qu.: 0.9199   3rd Qu.: 0.6737
##  Max.   : 1.8958   Max.   : 1.62062   Max.   : 1.9420   Max.   : 1.6478
mean(datS3); sd(datS3)
## [1] -0.01314899
## [1] 1.035144

Sometimes it is sufficient to only set the minimum and maximum to a given range.

datR2 <- apply(dat, 2, scaleXY, 1, 100)
summary(datR2); sd(datR2)
##        V1               V2               V3               V4
##  Min.   :  1.00   Min.   :  1.00   Min.   :  1.00   Min.   :  1.00
##  1st Qu.: 19.21   1st Qu.: 18.17   1st Qu.: 36.20   1st Qu.: 21.31
##  Median : 41.97   Median : 49.49   Median : 52.70   Median : 54.31
##  Mean   : 47.11   Mean   : 49.97   Mean   : 53.14   Mean   : 48.47
##  3rd Qu.: 73.83   3rd Qu.: 75.76   3rd Qu.: 71.40   3rd Qu.: 70.81
##  Max.   :100.00   Max.   :100.00   Max.   :100.00   Max.   :100.00
## [1] 29.74361

### Characterize Clustering Results

Here very basic clustering example…

nGr <- 3
irKm <- stats::kmeans(iris[,1:4], nGr, nstart=nGr*4)             # no need to standardize
table(irKm$cluster, iris$Species)
##
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         48        14
##   3      0          2        36
   #wrGraph::plotPCAw(t(as.matrix(iris[,1:4])), sampleGrp=irKm,colBase=irKm$cluster,useSymb=as.numeric(as.factor(iris$Species)))

Using the function reorgByCluNo() we can now ‘apply’ the clustering result to the initial data to obtain other information.

## sort results by cluster number
head(reorgByCluNo(iris[,-5], irKm$cluster)) ## Sepal.Length Sepal.Width Petal.Length Petal.Width index geoMean cluNo ## 118 7.7 3.8 6.7 2.2 118 4.557146 1 ## 110 7.2 3.6 6.1 2.5 110 4.458884 1 ## 132 7.9 3.8 6.4 2.0 132 4.427465 1 ## 136 7.7 3.0 6.1 2.3 136 4.242945 1 ## 119 7.7 2.6 6.9 2.3 119 4.221922 1 ## 106 7.6 3.0 6.6 2.1 106 4.216232 1 tail(reorgByCluNo(iris[,-5], irKm$cluster))
##    Sepal.Length Sepal.Width Petal.Length Petal.Width index  geoMean cluNo
## 23          4.6         3.6          1.0         0.2    23 1.349033     3
## 33          5.2         4.1          1.5         0.1    33 1.337272     3
## 38          4.9         3.6          1.4         0.1    38 1.253593     3
## 10          4.9         3.1          1.5         0.1    10 1.228605     3
## 13          4.8         3.0          1.4         0.1    13 1.191578     3
## 14          4.3         3.0          1.1         0.1    14 1.091429     3

Let’s calculate the median and sd values for each cluster:

## median an CV
ir2 <- reorgByCluNo(iris[,-5],irKm\$cluster,retList=TRUE)

sapply(ir2, function(x) apply(x,2,median))
##                       1         2         3
## Sepal.Length   6.700000  5.900000  5.000000
## Sepal.Width    3.000000  2.800000  3.400000
## Petal.Length   5.650000  4.500000  1.500000
## Petal.Width    2.100000  1.400000  0.200000
## index        124.000000 83.500000 25.500000
## geoMean        3.971677  3.212628  1.506499
## cluNo          1.000000  1.000000  1.000000
sapply(ir2, colSds)
##                       1          2          3
## Sepal.Length  0.4941550  0.4664101  0.3524897
## Sepal.Width   0.2900924  0.2962841  0.3790644
## Petal.Length  0.4885896  0.5088950  0.1736640
## Petal.Width   0.2798725  0.2974997  0.1053856
## index        19.8860910 25.7669866 14.5773797
## geoMean       0.2601698  0.3173265  0.2102904
## cluNo         0.0000000  0.0000000  0.0000000

Besides, we have alreaady seen in section ‘Working with Arrays’ the function cutArrayInCluLike().

## Tree-Like Structures

### Filter Lists of Connected Nodes, Extension of Networks as ‘Sandwich’

When interogating network-databases (like String for proteins or the coexpressionDB for gene co-expression) typically a (semi-)quantitatve value is supplied with the connection of node ‘A’ to node ‘B’.
In many cases, it may be useful to filter the initial query-output to retain only strong interactions. Furthermore, it may be of interest to expand such networks by nodes allowing to (further) inter-connect initial query-nodes (so called ‘Sandwich’ nodes as they are surouded by initial nodes), for such nodes a separate (eg even more stringent) threshold can be applied.

Here let’s suppose nodes have 3-digit names (ie numbers). 7 nodes of an initial query gave 1 to 7 conected nodes, the results are presented as list of data.frames where the 1st column is the connected node and the 2nd column the quality score of the connection (edge). Furthemore, let’s assume that here lower scores are better.

lst2 <- list('121'=data.frame(ID=as.character(c(141,221,228,229,449)),11:15),
'131'=data.frame(ID=as.character(c(228,331,332,333,339)),11:15),
'141'=data.frame(ID=as.character(c(121,151,229,339,441,442,449)),c(11:17)),
'151'=data.frame(ID=as.character(c(449,141,551,552)),11:14),
'161'=data.frame(ID=as.character(171),11),
'171'=data.frame(ID=as.character(161),11),
'181'=data.frame(ID=as.character(881:882),11:12) )

Now, we’d like to keep the core network consisting of all (dirctly) interconnected nodes with scores below 20 :

(nw1 <- filterNetw(lst2, limInt=20, sandwLim=NULL, remOrphans=FALSE))
##  -> filterNetw : 2 element(s) had no data remaining after filtering ...
## -> filterNetw -> .filterNetw : Removing 3 (reverse) redundant mappings
##   Node1 Node2 edgeScore toSandw
## 1   121   141        11   FALSE
## 2   141   151        12   FALSE
## 3   161   171        11   FALSE

In the resulting output the 1st column now represents the query-nodes, the 2nd column all connected nodes based on filtering scores for edges, and the 3rd colum the score for the edges.

Let’s also remove all nodes not connected to a backbone at least 3 nodes long, ie remove orphan pairs of nodes :

(nw2 <- filterNetw(lst2, limInt=20, sandwLim=NULL, remOrphans=TRUE))
##  -> filterNetw : 2 element(s) had no data remaining after filtering ...
## -> filterNetw -> .filterNetw : Removing 3 (reverse) redundant mappings
##   Node1 Node2 edgeScore toSandw
## 1   121   141        11   FALSE
## 2   141   151        12   FALSE

If you want to expand this network by nodes allowing to further interconnect the nodes from above, we can add all ‘sandwich’ nodes (let’s use a threshold of inferior/equal to 14 which will use only the better ‘sandwich’-edges) :

(nw3 <- filterNetw(lst2, limInt=20, sandwLim=14, remOrphans=TRUE))
##  -> filterNetw : 1 element(s) had no data remaining after filtering ...
## -> filterNetw -> .filterNetw : Removing 3 (reverse) redundant mappings
##   Node1 Node2 edgeScore toSandw
## 1   121   141        11   FALSE
## 2   121   228        13    TRUE
## 3   121   229        14    TRUE
## 4   131   228        11    TRUE
## 5   141   151        12   FALSE
## 6   141   229        13    TRUE

### Characterize Individual Contribution of Single Edges in Tree-Structures

path1 <- matrix(c(17,19,18,17, 4,4,2,3), ncol=2,
dimnames=list(c("A/B/C/D","A/B/G/D","A/H","A/H/I"), c("sumLen","n")))
contribToContigPerFrag(path1)
##   sumLe n.frag len.rat
## A    19      4   1.000
## B    19      4   1.000
## C    17      4   0.895
## D    19      4   1.000
## G    19      4   1.000
## H    18      2   0.947
## I    17      3   0.895

### Count Same Start- and End- Sites of Edges (or Fragments)

If you have a set of fragments from a common ancestor and the fragment’s start- and end-sites are marked by index-positions (integers), you can make a simple graphical display :

frag1 <- cbind(beg=c(2,3,7,13,13,15,7,9,7, 3,3,5), end=c(6,12,8,18,20,20,19,12,12, 4,5,7))
rownames(frag1) <- letters[1:nrow(frag1)]
simpleFragFig(frag1)

Now we can make a matrix telling if some fragments do start or end at exactely the same position.

countSameStartEnd(frag1)
##   beg end beg.n beg.rat end.n end.rat
## a   2   6    NA      NA    NA      NA
## b   3  12     3  0.2500     3  0.2500
## c   7   8     3  0.2500    NA      NA
## d  13  18     2  0.1667    NA      NA
## e  13  20     2  0.1667     2  0.1667
## f  15  20    NA      NA     2  0.1667
## g   7  19     3  0.2500    NA      NA
## h   9  12    NA      NA     3  0.2500
## i   7  12     3  0.2500     3  0.2500
## j   3   4     3  0.2500    NA      NA
## k   3   5     3  0.2500    NA      NA
## l   5   7    NA      NA    NA      NA

## Support for Graphical Output

### Convenient paste-collapse

The function pasteC allows adding quotes and separating the last element by specific text (eg ‘and’).

pasteC(1:4)
## [1] "1, 2, 3 and 4"

### Transform Numeric Values to Color-Gradient

By default most color-gradients end with a color very close to the beginning.

set.seed(2015); dat1 <- round(runif(15),2)
plot(1:15, dat1, pch=16, cex=2, las=1, col=colorAccording2(dat1),
main="Color gradient according to value in y")

# Here we modify the span of the color gradient
plot(1:15, dat1, pch=16, cex=2, las=1,
col=colorAccording2(dat1,nStartO=0,nEndO=4,revCol=TRUE), main="blue to red")

# It is also possible to work with scales of transparency
plot(1:9, pch=3, las=1)
points(1:9,1:9,col=transpGraySca(st=0,en=0.8,nSt=9,trans=0.3), cex=42, pch=16)

### Assign New Transparency to Given Colors

For this purpose you may use convColorToTransp.

col0 <- c("#998FCC","#5AC3BA","#CBD34E","#FF7D73")
col1 <- convColorToTransp(col0,alph=0.7)
layout(1:2)
pie(rep(1,length(col0)), col=col0, main="no transparency")
pie(rep(1,length(col1)), col=col1, main="new transparency")

## Other Convenience Functions

### Writing Compact Dates (more options …)

Many times it may be useful to add the date to filenames when saving data or plots as files. The built-in functions date(), Sys.Date() and Sys.Time() are a good way to start.

Generally I like to use abbreviated month-names since the order of writing the month is different in Europe compared to the USA, so this may help avoiding mis-interpreting dates insetad of writing the number of the Month. For example, 2021-03-05 means in Europe March 5th while in other places it means May 3rd.

The R-functions mentioned above use local language settings, so I finally made the function sysDate to produce compact versions of current the date, independent to local language settings (or not -if you prefer), ie locale-specific, (yes, in some languages like French the first 3 letters of the month do give ambiguous results !) and to avoid white space ’ ’ (which I prefer to avoid in file-names). Please look at the help-page for all available options.

## To get started
Sys.Date()
## [1] "2021-10-28"
## Compact English names, no matter what your local settings are :
sysDate() 
## [1] "28oct21"

The table below shows a number of options to write the date in English or using local month-names :

tabD <- cbind(paste0("univ",1:6), c(sysDate(style="univ1"), sysDate(style="univ2"), sysDate(style="univ3"),
sysDate(style="univ4"), as.character(sysDate(style="univ5")), sysDate(style="univ6")),
paste0("   local",1:6), c(sysDate(style="local1"), sysDate(style="local2"), sysDate(style="local3"),
sysDate(style="local4"), sysDate(style="local5"), sysDate(style="local6")))
kable(tabD, caption="Various ways of writing current date")
 univ1 28oct21 local1 28oct21 univ2 28Oct21 local2 28Oct21 univ3 28October2021 local3 28Oct.2021 univ4 28october2021 local4 28oct.2021 univ5 2021-10-28 local5 28-oct.-2021 univ6 2021-301 local6 2021oct.28

## Session-Info

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C                   LC_CTYPE=French_France.1252
## [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
## [5] LC_TIME=French_France.1252
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] limma_3.48.3 knitr_1.36   wrMisc_1.7.0
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.7        plyr_1.8.6        bslib_0.3.1       compiler_4.1.1
##  [5] pillar_1.6.3      jquerylib_0.1.4   highr_0.9         tools_4.1.1
##  [9] digest_0.6.28     jsonlite_1.7.2    evaluate_0.14     lifecycle_1.0.1
## [13] tibble_3.1.4      checkmate_2.0.0   gtable_0.3.0      pkgconfig_2.0.3
## [17] rlang_0.4.11      wrGraph_1.3.0     DBI_1.1.1         yaml_2.2.1
## [21] xfun_0.26         fastmap_1.1.0     dplyr_1.0.7       stringr_1.4.0
## [25] generics_0.1.0    vctrs_0.3.8       sass_0.4.0        tidyselect_1.1.1
## [29] grid_4.1.1        qvalue_2.24.0     glue_1.4.2        data.table_1.14.2
## [33] R6_2.5.1          fansi_0.5.0       fdrtool_1.2.16    rmarkdown_2.11
## [37] reshape2_1.4.4    purrr_0.3.4       ggplot2_3.3.5     magrittr_2.0.1
## [41] splines_4.1.1     backports_1.2.1   BBmisc_1.11       scales_1.1.1
## [45] htmltools_0.5.2   ellipsis_0.3.2    assertthat_0.2.1  colorspace_2.0-2
## [49] utf8_1.2.2        stringi_1.7.4     munsell_0.5.0     crayon_1.4.1