# R Packages

# Primary Packages #
library(tcpl)
library(tcplfit2)
# Data Formatting Packages #
library(data.table)
library(dplyr)
library(magrittr)
library(reshape2)
# Plotting Packages #
library(ggplot2)
library(gridExtra)
library(RColorBrewer)
library(colorspace)
library(viridis)
# Table Packages #
library(htmlTable)
library(kableExtra)

# Introduction

The first section of this vignette explains how to upload and process newly registered assay data through the data analysis pipeline, using a small subset of ToxCast data. The tcpl R package provides three functions for adding new data:

tcplRegister – to register a new assay element or chemical

tcplUpdate – to change or add additional information for existing assay or chemical ids

tcplWriteLvl0 – to load formatted source data

Before writing any data to the tcpl database, the user has to register the assay and chemical information, i.e. assay identifiers (and minimal assay annotations) and chemical sample identifiers need to be available in the database before writing level 0 data to the database. All processing occurs by assay component or assay endpoint, depending on the processing type (single-concentration or multiple-concentration) and level. No data is stored at the assay or assay source level. The “assay” and “assay_source” tables store annotations to help in the processing and down-stream understanding of the data. Additional details for registering each assay element and updating annotations are provided within the assay registration vignette.

## Chemicals

With the minimal assay information registered, the next step is to register the necessary chemical and sample information with tcplRegister. The “chdat” dataset included in the package contains the sample and chemical information for the data that will be loaded. The following shows an example of how to load chemical information. As with registering assay information, order matters: the user must first register chemicals, then register the samples that map to the corresponding chemicals.

## Obtain the Data ##
# Load the 'chdat' data from the package.
data(chdat, package = "tcpl")
# Convert 'chdat' object to a data.table.
setDT(chdat)
# View the first 6 rows of the table.
head(chdat)
##            spid       casn                        chnm dsstox_substance_id
## 1: Tox21_400088    80-05-7                 Bisphenol A       DTXSID7020182
## 2: Tox21_303655   521-18-6  5alpha-Dihydrotestosterone       DTXSID9022364
## 3: Tox21_110011   150-30-1               Phenylalanine       DTXSID9023463
## 4: Tox21_400081 22224-92-6                  Fenamiphos       DTXSID3024102
## 5:         DMSO    67-68-5          Dimethyl sulfoxide       DTXSID2021735
## 6: Tox21_400037    95-83-0 4-Chloro-1,2-diaminobenzene       DTXSID5020283
##         code  chid
## 1:    C80057 20182
## 2:   C521186 22364
## 3:   C150301 23463
## 4: C22224926 24102
## 5:    C67685 21735
## 6:    C95830 20283
## Register the Chemicals ##
# Obtain chemicals already registered in the database.
cmap <- tcplLoadChem()
# Find chemicals in 'chdat' that are not registered yet.
chdat.register <- chdat[!(chdat$code %in% cmap$code)]
# Register the chemicals not yet in the database.
tcplRegister(what = "chid",
             flds = chdat.register[,
                                   unique(.SD),
                                   .SDcols = c("casn", "chnm", "dsstox_substance_id", "code", "chid")])
## [1] TRUE

The “chdat” dataset contains a map of sample to chemical information, but chemical and sample information have to be registered separately because a chemical could potentially have multiple samples. Registering chemicals only takes a chemical CAS registry number ($$\mathit{casn}$$) and name ($$\mathit{chnm}$$). In the above example, only the unique chemicals were loaded. The $$\mathit{casn}$$ and $$\mathit{chnm}$$ fields have unique constraints; trying to register multiple chemicals with the same name or CAS registry number is not possible and will result in an error. With the chemicals registered and loaded, the samples can be registered by mapping the sample ID ($$\mathit{spid}$$) to the chemical ID. Note, the user needs to load the chemical information to get the chemical IDs then merge the new chemical IDs with the sample IDs from the original file by chemical name or CASRN.

## Register Sample IDs (spids) ##
tcplRegister(what = "spid",
             flds = merge(chdat[, list(spid, casn)],
                          chdat.register[, list(casn, chid)],
                          by = "casn")[, list(spid, chid)])
## [1] TRUE

Optionally, the user can subdivide the chemical IDs into different groups or libraries. For illustration, the chemical IDs will be arbitrarily divided into two chemical libraries, with the even numbered chemical IDs in group 1 and the odd numbered chemical IDs in group 2.

## Register Chemical Libraries ##
# Subdivide even numbered chemicals in group 1.
grp1 <- cmap[chid %% 2 == 0, unique(chid)]
# Subdivide odd numbered chemicals in group 2.
grp2 <- cmap[chid %% 2 == 1, unique(chid)]
# Register the group 1 chemicals.
tcplRegister(what = "clib",
             flds = list(clib = "group_1", chid = grp1))
# Register the group 2 chemicals.
tcplRegister(what = "clib",
             flds = list(clib = "group_2", chid = grp2))

Chemical IDs can belong to more than one library, and will be listed as separate entries when loading chemical library information. The tcplLoadClib function provides more information about the ToxCast chemical library used for sample generation, and is only relevant to the MySQL version of invitrodb.

tcplRegister(what = "clib",
             flds = list(clib = "other", chid = 1:2))
tcplLoadClib(field = "chid", val = 1:2)

## Source Data

After registering the chemical and assay information, the data can be loaded into the tcpl local directory. The package includes two datasets from the ToxCast program, “scdat” and “mcdat”, with a subset of single- and multiple-concentration data, respectively. The single- and multiple-concentration processing require the same level 0 fields; more information about level 0 pre-processing is in the Introduction vignette.

## Obtain the Data ##
# Load the multi-concentration data from the package.
data(mcdat, package = 'tcpl')
# Convert 'mcdat' to a data.table.
setDT(mcdat)

After being pre-processed into the standard format as described in the tcpl introductory vignette, the data are now ready to be loaded into the database with the tcplWriteLvl0 function. The type argument is used throughout the package to distinguish the type of data/processing required: “sc” indicates single-concentration; “mc” indicates multiple-concentration.

# Write/load the 'mcdat' into the database.
tcplWriteLvl0(dat = mcdat, type = "mc")

The tcplLoadData function can be used to load the data from the MySQL database into the current R session. Furthermore, the tcplPrepOtpt function can be used in conjunction with tcplLoadData to prepare the data into a readable format as well as provide additional chemical and assay annotation information by querying the database. We refer the reader to the tcpl data retrieval vignette for further details.

# Load the level 0 data from the database to R.
tcplLoadData(lvl=0, fld="acid", val=1, type = "mc")

Notice in the loaded data, the $$\mathit{acsn}$$ is replaced by the correct $$\mathit{acid}$$ and the $$\mathit{m0id}$$ field is added. The “m#” fields in the multiple-concentration data are the primary keys for each level of data. These primary keys link the various levels of data. All of the keys are auto-generated and will change anytime data are reloaded or processed. Note, the primary keys only change for the levels affected, e.g. if the user reprocesses level 1, the level 0 keys will remain the same.

# Data Processing

This section is intended to help the user understand the general aspects of how the data are processed before diving into the specifics of each processing level for both screening paradigms (i.e. processing types). The details of the two screening paradigms are provided in later sections.

All processing in the tcpl package occurs at the assay component or assay endpoint level. There is no capability within either screening paradigm to do any processing which combines data from multiple assay components or assay endpoints. Any combining of data must occur before or after the pipeline processing. For example, a ratio of two values could be processed through the pipeline if the user calculated the ratio during the level 0 pre-processing and uploaded a single “component”.

Once the data are uploaded, data processing occurs through the tcplRun function for both single- and multiple-concentration screening. The tcplRun function can either take a single ID ($$\mathit{acid}$$ or $$\mathit{aeid}$$, depending on the processing type and level) or an $$\mathit{asid}$$. If given an $$\mathit{asid}$$, the tcplRun function will attempt to process all corresponding components/endpoints. When processing by $$\mathit{acid}$$ or $$\mathit{aeid}$$, the user must know which ID to give for each level (Table 1).

The processing is sequential, and every level of processing requires successful processing at the antecedent level. Any processing changes will cause a “delete cascade,” removing any subsequent data affected by the processing change to ensure complete data fidelity at any given time. For example, processing level 3 data for multiple-concentration data will cause the data from levels 4 through 6 to be deleted for the corresponding IDs. Level 6 relates to inspecting and flagging curve fits; although this processing step is not currently included in tcpl v3.0, it may be included in future versions. Changing any method assignments will also trigger a delete cascade for any corresponding data (more on method assignments below).

The user must give a start and end level when using the tcplRun function. If processing more than one assay component or endpoint, the function will not stop if one component or endpoint fails. If a component or endpoint fails while processing multiple levels, the function will not attempt to process the failed component/endpoint in subsequent levels. When finished processing, the tcplRun function returns a list indicating the processing success of each id. For each level processed, the list will contain two elements: (1) “l#” a named Boolean vector where TRUE indicates successful processing, and (2) “l#_failed” containing the names of any ids that failed processing, where “#” is the processing level.

The processing functions print messages to the console indicating the four steps of the processing. These steps include (1) data for the given assay component ID are loaded, (2) the data are processed, (3) data for the same ID in subsequent levels are deleted, and (4) the processed data is written to the database. The ‘outfile’ parameter in the tcplRun function gives the user the option of printing all the output text to a file.

The tcplRun function can execute processing of multiple assay components/endpoints simultaneously, through internal use of the mclapply function from the parallel package. Parallel processing is done by ID. Depending on the system environment and memory constraints, the user may wish to use more or less processing power. For users pipelining on a Windows operating system, the default is $$mc.cores = 1$$, unless otherwise specified. For users pipelining on a Unix-based operating system, the default is $$mc.cores = NULL$$, indicating the use of all cores except one, which is reserved for ‘overhead’ processes. The user can adjust the processing power by setting the “mc.cores” parameter in the tcplRun function. Note, the specification should meet the following criteria: $$1 \space \leq \space \mathit{mc.cores} \space \leq \space \mathit{detectCores()}-1$$.
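For instance, a conservative way to choose a value meeting this criterion could look like the following (a minimal sketch; the variable name n_cores is arbitrary):

```r
# Pick a value for tcplRun's 'mc.cores' parameter satisfying
# 1 <= mc.cores <= detectCores() - 1: use all cores but one.
dc <- parallel::detectCores()
if (is.na(dc)) dc <- 2L  # detectCores() can return NA on some systems
n_cores <- max(1L, dc - 1L)
```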

Table 1: Processing checklist.

| Type | Level | Input ID | Method ID |
|------|-------|----------|-----------|
| SC   | Lvl 1 | acid     | aeid      |
| SC   | Lvl 2 | aeid     | aeid      |
| MC   | Lvl 1 | acid     | N/A       |
| MC   | Lvl 2 | acid     | acid      |
| MC   | Lvl 3 | acid     | aeid      |
| MC   | Lvl 4 | aeid     | N/A       |
| MC   | Lvl 5 | aeid     | aeid      |
| MC   | Lvl 6 | aeid     | aeid      |

The Input ID column indicates the ID used for each processing step; Method ID indicates the ID used for assigning methods for data processing, when necessary. SC = single-concentration; MC = multiple-concentration. Level 6 processing will not be possible in tcpl v3.0, but may be included in future versions.

The processing requirements vary by screening paradigm and level, which later sections cover in details. However, in general, many of the processing steps require specific methods to accommodate different experimental designs or data processing approaches.

Notice from Table 1 that level 1 single-concentration processing (SC1) requires an $$\mathit{acid}$$ input (Table 1), but the methods are assigned by $$\mathit{aeid}$$. The same is true for MC3 processing. SC1 and MC3 are the normalization steps and convert $$\mathit{acid}$$ to $$\mathit{aeid}$$. (Only MC2 has methods assigned by $$\mathit{acid}$$.) The normalization process is discussed in the following section.

## Methods

To promote reproducibility, all method assignments must occur through the database and should come from the available list of methods for each processing level. In general, methods data are stored in the “_methods” and “_id” tables that correspond to the data-storing tables. For example, the “sc1” table is accompanied by the “sc1_methods” table which stores the available methods for SC1, and the “sc1_aeid” table which stores the method assignments and execution order.

The tcpl package provides four functions to easily modify and load the method assignments for the given assay components or endpoints:

tcplMthdAssign – assigns methods to specified id(s)

tcplMthdClear – clears method assignments to specified id(s)

tcplMthdLoad – queries database and returns the method assignments for specified id(s)

tcplMthdList – queries database and returns available methods at specified level(s)

The following code blocks will give some examples of how to use the method-related functions.

## Methods Assignment ##
# For illustrative purposes, assign level 2 MC methods to ACIDs 97, 98, and 99.
# First check for available methods.
mthds <- tcplMthdList(lvl = 2, type = "mc")
mthds[1:2]
# Assign some methods to ACID 97, 98, and 99.
tcplMthdAssign(lvl = 2,
               id = 97:99,
               mthd_id = c(3, 4, 2),
               ordr = 1:3,
               type = "mc")
# Check the assigned methods for ACID 97, 98, and 99 in the database.
tcplMthdLoad(lvl = 2, id = 97:99, type = "mc")

# Methods can be cleared one at a time for the given id(s)
tcplMthdClear(lvl = 2, id = 99, mthd_id = 2, type = "mc")
# Check the assigned methods for the single id updated, namely ACID 99.
tcplMthdLoad(lvl = 2, id = 99, type = "mc")

# Or all methods can be cleared for the given id(s)
tcplMthdClear(lvl = 2, id = 97:98, type = "mc")
# Check the assigned methods for the all updated ids, namely ACID 97 and 98.
tcplMthdLoad(lvl = 2, id = 97:98, type = "mc")

Later sections in this vignette provide examples for level specific methods assignment and details on the methods assigned prior to data processing with tcplRun at a particular level. It should be noted that most examples reflect “default” methods assigned, but one should consider the data at hand and all methods available for a specific level prior to assigning methods.

## Data Normalization

Data normalization occurs in both single- and multiple-concentration processing paradigms, at levels 1 and 3, respectively. While the two paradigms use different methods, the normalization approach is the same for both single- and multiple-concentration processing. Data normalization does not have to occur within the package, and normalized data can be loaded into the database at level 0. However, data must be zero-centered.

The tcpl package supports fold-change and percent-of-control approaches to normalization. All data must be zero-centered; therefore, fold-change data are generally log-transformed. Log-scale transformations for fold-change data are typically base 2 ($$log_2$$), but in some circumstances other bases may be more appropriate.

Normalizing to a control requires three normalization methods:
1. one to define the baseline value,
2. one to define the control value, and
3. one to calculate percent of control (“resp.pc”).

Normalizing to fold-change also requires three methods:
1. one to define the baseline value,
2. one to calculate the fold-change, and
3. one to log-transform the fold-change values (“resp.fc”).

The corrected response value for a test well is denoted $$\mathit{cval}$$. Methods defining a baseline value ($$\mathit{bval}$$) have the “bval” prefix, methods defining the control value ($$\mathit{pval}$$) have the “pval” prefix, and methods that calculate or modify the final response value have the “resp” prefix. For example, “resp.log2” does a log-transformation of the response value using a base value of 2. The formulae for calculating the percent of control and fold-change response values are listed in equations 1 and 2, respectively.

$resp.pc = \frac{cval - bval}{pval - bval}*100$

$resp.fc = \frac{cval}{bval}$
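As a quick numeric illustration of equations 1 and 2 (all values invented for this example):

```r
# Invented corrected response values for three test wells.
cval <- c(55, 70, 85)
bval <- 10    # baseline value, e.g. median of neutral control wells
pval <- 110   # positive control value

# Equation 1: percent of control.
resp.pc <- (cval - bval) / (pval - bval) * 100  # 45, 60, 75

# Equation 2: fold-change, typically log2-transformed afterwards.
resp.fc <- cval / bval
```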

Order matters when assigning normalization methods. The $$\mathit{bval}$$, and $$\mathit{pval}$$ if normalizing as a percent of control, need to be calculated prior to calculating the response value. Table 2 shows some possible normalization schemes.

Table 2: Example normalization method assignments (methods listed in execution order).

| Scheme   | Fold-Change | %Control |
|----------|-------------|----------|
| Scheme 1 | bval.apid.nwlls.med; resp.fc; resp.log2; resp.multneg1 | bval.apid.lowconc.med; pval.apid.pwlls.med; resp.pc; resp.multneg1 |
| Scheme 2 | bval.apid.lowconc.med; resp.fc; resp.log2 | bval.spid.lowconc.med; pval.apid.mwlls.med; resp.pc |
| Scheme 3 | none; resp.log10; resp.blineshift.50.spid | none; resp.multneg1 |

If the data does not require any normalization, the “none” method must be assigned for normalization. The “none” method simply copies the input data to the response field. Without assigning “none”, the response field will not get generated and the processing will not complete.

The tcpl_v2 package only modeled responses in the positive direction. Therefore, a signal in the negative direction needed to be transformed to the positive direction during normalization. Creating multiple endpoints for one component was one way to enable multiple normalization approaches. Multiple normalization approaches were necessary when the assay component detected a signal in both positive and negative directions. Negative direction data was inverted by multiplying the final response values by $${-1}$$ (see the “resp.multneg1” methods in Table 2). tcpl_v3 onward utilizes the tcplfit2 package, which allows for bidirectional fitting, so the “resp.multneg1” method is now only required in special use cases.

In addition to the required normalization methods, the user can add additional methods to transform the normalized values. For example, the third fold-change example in Table 2 includes “resp.blineshift.50.spid”, which corrects for baseline deviations by $$\mathit{spid}$$. A complete list of available methods, by processing type and level, can be listed with tcplMthdList . More information is available in the package documentation, and can be found by running ??tcpl::Methods .

## Single-concentration Screening Data

This section will cover the tcpl process for handling single-concentration data. The goal of single-concentration processing is to identify potentially active compounds from a broad screen at a single concentration. After the data is loaded into the tcpl database, the single-concentration processing consists of two levels (Table 3).

Table 3: Summary of the tcpl single-concentration pipeline.

| Level | Description |
|-------|-------------|
| Lvl 0 | Pre-processing: Vendor/dataset-specific pre-processing to organize heterogeneous raw data to the uniform format for processing by the tcpl package† |
| Lvl 1 | Normalize: Apply assay endpoint-specific normalization listed in the ‘sc1_aeid’ table to the raw data to define response |
| Lvl 2 | Activity Call: Collapse replicates by median response, define the response cutoff based on methods in the ‘sc2_aeid’ table, and determine activity |

† Level 0 pre-processing occurs outside the package and specifics are covered in the Introduction vignette.

### Level 1

Level 1 processing converts the assay component to assay endpoint(s) and defines the normalized-response value field ($$\mathit{resp}$$), logarithm-concentration field ($$\mathit{logc}$$), and optionally, the baseline value ($$\mathit{bval}$$) and positive control value ($$\mathit{pval}$$) fields. The purpose of level 1 is to normalize the raw values to either the percentage of a control or the fold-change from baseline. The normalization process is discussed in greater detail in the Data Normalization section.

#### Methods Assignment

Before assigning the methods below, the user needs to register the data for the single-concentration assay, as shown in the previous Register and Upload New Data section.

Before beginning the normalization process, all wells with well quality ($$\mathit{wllq}$$) equal to 0 are removed.

The first step in beginning the processing is to identify which assay endpoints stem from the assay component(s) being processed.

# Load the 'aeid' values for 'acid = 2'.
tcplLoadAeid(fld = "acid", val = 2)

With the corresponding endpoints identified, the appropriate methods can be assigned.

# Assign the level 1 methods to AEID 1 & 2.
tcplMthdAssign(lvl = 1,   # processing level
               id = 1:2,  # assay endpoint ID's to assign methods
               mthd_id = c(1, 11, 13), # method(s) to be assigned
               ordr = 1:3, # order the method(s) should be applied
               type = "sc") # the data/processing type

Above, methods 1, 11, and 13 were assigned for both endpoints. The method assignments instruct the processing to: (1) calculate $$\mathit{bval}$$ for each assay plate ID by taking the median of all data where the well type equals “n”; (2) calculate a fold-change with respect to the $$\mathit{bval}$$ (i.e. $$\frac{resp}{\mathit{bval}}$$); (3) log-transform the fold-change values with base 2.
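On a toy plate, the three assigned methods correspond to the following calculations (well types and values invented for illustration; within tcpl these steps are executed by the assigned methods during processing, not by hand):

```r
library(data.table)
# Toy plate: three neutral control wells ("n") and three treatment wells ("t").
dat <- data.table(apid = "PLATE01",
                  wllt = c("n", "n", "n", "t", "t", "t"),
                  cval = c(10, 12, 11, 22, 44, 88))
# (1) bval per assay plate ID: median of wells with well type "n".
dat[, bval := median(cval[wllt == "n"]), by = apid]
# (2) fold-change with respect to bval.
dat[, resp := cval / bval]
# (3) log2-transform the fold-change.
dat[, resp := log2(resp)]
dat[wllt == "t", resp]  # 1, 2, 3
```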

If a user needs to add a method to the end of a normalization sequence, as shown above, then the user can use a second method assignment statement. For example, suppose for AEID 2 only we need to indicate a change of directionality in the response and multiply all response values by $$-1$$. Then, the following code can be executed. Reminder, the order of methods assignment matters, particularly in the normalization step.

# Assign a fourth step to the normalization processing - for AEID 2 only.
tcplMthdAssign(lvl = 1,  # processing level
               id = 2,   # assay endpoint ID's to assign methods
               mthd_id = 16, # method(s) to be assigned
               ordr = 1, # order the method(s) should be applied
               type = "sc") # the data/processing type

For a complete list of normalization methods see tcplMthdList(lvl = 1, type = "sc") or ?SC1_Methods.

Note, though level 1 methods are assigned at the assay endpoint ID, processing takes an assay component ID as the input ID. As mentioned previously, the user must assign normalization methods by assay endpoint, then do the processing by assay component. We will demonstrate in a later section how to run the processing step. When level 1 processing is performed tcpl will attempt to process all endpoints in the database for a given component. If one endpoint fails for any reason (e.g., does not have the appropriate methods assigned), the processing for the entire component fails.

### Level 2

Level 2 processing defines the baseline median absolute deviation ($$\mathit{bmad}$$), collapses any replicates by sample ID, and determines the activity.

#### Methods Assignment

Before the data are collapsed by sample ID, the $$\mathit{bmad}$$ is calculated as the median absolute deviation of all treatment wells ($$\mathit{wllt} = t$$ - default option) or blank wells ($$\mathit{wllt} = n$$). The calculation to define $$\mathit{bmad}$$ is done once across the entire assay endpoint. If additional data is added to the database for an assay component, the $$\mathit{bmad}$$ values for all associated assay endpoints will change. Note, this $$\mathit{bmad}$$ definition is different from the $$\mathit{bmad}$$ definition used for multiple-concentration screening.

$bmad_{sc} = 1.4826*median(\big | y_{i} - \tilde{y} \big |)$

Where $$y_i$$ is the $$i^{th}$$ observation of all wells within a well type in the assay component and $$\tilde{y}$$ is the median across all $$y_i$$’s (see Table 4 for further details). The constant value, $$1.4826$$, is the default adjustment value used in the underlying R function to ensure $$bmad$$ is a consistent estimator of the standard deviation ($$\sigma$$) assuming the sample size ($$N$$) of the observations is large and they are normally distributed (i.e. Gaussian), see mad() in R and unbiased mad for further details.
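The formula matches R's mad() with its default constant, which can be verified on simulated data:

```r
set.seed(42)
y <- rnorm(1000)  # simulated baseline responses (illustrative only)
# bmad as defined above ...
bmad_manual <- 1.4826 * median(abs(y - median(y)))
# ... equals R's mad(), whose default constant is 1.4826.
bmad_r <- mad(y)
all.equal(bmad_manual, bmad_r)  # TRUE
```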

Table 4: Single concentration baseline estimation by method assignment in Level 2 processing.

| Method | Description | Observations | ID Details |
|--------|-------------|--------------|------------|
| 1 | Median absolute deviation (MAD) of all treatment wells across the assay component (acid). | $$y_{i} = y_{(s,w)}$$, $$s \in \{1,...,n_{acid}\}$$, $$w = t$$ | $$s$$ indicates the sample id within an ‘acid’ & $$w$$ indicates the well type |
| 2 | Median absolute deviation (MAD) of all blank wells across the assay component (acid). | $$y_{i} = y_{(s,w)}$$, $$s \in \{1,...,n_{acid}\}$$, $$w = n$$ | $$s$$ indicates the sample id within an ‘acid’ & $$w$$ indicates the well type |

To collapse the data by sample ID, the median response value of replicates within a concentration is calculated. (Note, in most cases each $$\mathit{spid}$$ will only have a single concentration level.) Data are then further collapsed by taking the maximum of those median values ($$\mathit{max\_med}$$).

Once the data are collapsed, such that each assay endpoint-sample pair only has one value, the activity is determined. For a sample to get an active hit call, the $$\mathit{max\_med}$$ must be greater than a specified efficacy cutoff. The efficacy cutoff is determined by the level 2 methods. The efficacy cutoff value ($$\mathit{coff}$$) is defined as the maximum of all values given by the assigned level 2 methods. Failing to assign a level 2 method will result in every sample being called active. For a complete list of level 2 methods, see tcplMthdList(lvl = 2, type = "sc") or ?SC2_Methods.

# Assign a cutoff value of log2(1.2).
tcplMthdAssign(lvl = 2, # processing level
               id = 1,  # assay endpoint ID's to assign methods
               mthd_id = 3, # method(s) to be assigned
               type = "sc") # the data/processing type

In the example above, we are assigning level 2 methods such that the cutoff value is $$log_2(1.2)$$. Thus, if the maximum median value ($$\mathit{max\_med}$$) is greater than or equal to the efficacy cutoff ($$\mathit{coff} = log_2(1.2)$$), then the sample ID is considered active and the hit call ($$\mathit{hitc}$$) is set to 1.
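The collapse and hit-call logic can be sketched on invented data (tcpl performs these steps internally during level 2 processing):

```r
library(data.table)
# Invented normalized responses: three replicates of one sample at one concentration.
dat <- data.table(spid = "SAMPLE01",
                  conc = 10,
                  resp = c(0.10, 0.40, 0.35))
# Collapse replicates: median response within each concentration ...
med <- dat[, .(med_resp = median(resp)), by = .(spid, conc)]
# ... then take the maximum of those medians per sample.
sc2 <- med[, .(max_med = max(med_resp)), by = spid]
# Efficacy cutoff from level 2 method 3: log2(1.2).
coff <- log2(1.2)
# Active (hitc = 1) if max_med >= coff.
sc2[, hitc := as.integer(max_med >= coff)]
sc2$hitc  # 1
```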

## Multiple-concentration Screening Data

This section will cover the tcpl process for handling multiple-concentration data. The goal of multiple-concentration processing is to estimate the activity, potency, efficacy, and other parameters for sample-assay pairs. After the data is loaded into the tcpl database, the multiple-concentration processing consists of six levels (Table 5).

Table 5: Summary of the tcpl multiple-concentration pipeline.

| Level | Description |
|-------|-------------|
| Lvl 0 | Pre-processing: Vendor/dataset-specific pre-processing to organize heterogeneous raw data to the uniform format for processing by the tcpl package† |
| Lvl 1 | Index: Define the replicate and concentration indices to facilitate all subsequent processing |
| Lvl 2 | Transform: Apply assay component (acid) specific transformations listed in the ‘mc2_acid’ table to the raw data to define the corrected data |
| Lvl 3 | Normalize: Apply assay endpoint (aeid) specific normalization listed in the ‘mc3_aeid’ table to the corrected data to define response |
| Lvl 4 | Fit: Model the concentration-response data utilizing ten objective curve-fitting functions from tcplfit2: (1) constant, (2) hill, (3) gain-loss, (4) polynomial-linear, (5) polynomial-quadratic, (6) power, (7) exponential-2, (8) exponential-3, (9) exponential-4, (10) exponential-5 |
| Lvl 5 | Model Selection/Activity Call: Select the winning model, define the response cutoff based on methods in the ‘mc5_aeid’ table, and determine activity |
| Lvl 6 | Flag: Flag potential false positive and false negative findings based on methods in the ‘mc6_aeid’ table‡ |

† Level 0 pre-processing occurs outside the package and specifics are covered in the Introduction vignette. ‡ Level 6 processing will not be possible in tcpl v3.0, but may be included in future versions.

### Level 1

Level 1 processing defines the replicate and concentration index fields to facilitate downstream processing. Because of cost, availability, physicochemical, and technical constraints, screening-level efforts utilize numerous experimental designs and test compound (sample) stock concentrations. The resulting data may contain an inconsistent number of concentration groups, concentration values, and technical replicates. To enable quick and uniform processing, level 1 processing explicitly defines concentration and replicate indices, giving integer values $$1 \dots N$$ to increasing concentrations and technical replicates, where $$1$$ represents the lowest concentration or first technical replicate.

To assign replicate and concentration indices, we assume one of two experimental designs. The first design assumes samples are plated in multiple concentrations on each assay plate, such that the concentration series all falls on a single assay plate. The second design assumes samples are plated in a single concentration on each assay plate, such that the concentration series falls across many assay plates.

For both experimental designs, data are ordered by source file ($$\mathit{srcf}$$), assay plate ID ($$\mathit{apid}$$), column index ($$\mathit{coli}$$), row index ($$\mathit{rowi}$$), sample ID ($$\mathit{spid}$$), and concentration ($$\mathit{conc}$$). Concentration is rounded to three significant figures to correct for potential rounding errors. After ordering the data, we create a temporary replicate ID, identifying an individual concentration series. For test compounds in experimental designs with the concentration series on a single plate and all control compounds, the temporary replicate ID consists of the sample ID, well type ($$\mathit{wllt}$$), source file, assay plate ID, and concentration. The temporary replicate ID for test compounds in experimental designs with concentration series that span multiple assay plates is defined similarly, but does not include the assay plate ID.

Once the data are ordered, and the temporary replicate ID is defined, the data are scanned from top to bottom and increment the replicate index ($$\mathit{repi}$$) every time a replicate ID is duplicated. Then, for each replicate, the concentration index ($$\mathit{cndx}$$) is defined by ranking the unique concentrations, with the lowest concentration starting at 1.
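A simplified sketch of the index definitions on a toy concentration series (the real level 1 processing builds temporary replicate IDs as described above; the frank() call here only illustrates the cndx/repi outcome):

```r
library(data.table)
# Invented concentration series: three concentrations, two replicates each.
dat <- data.table(spid = "SAMPLE01",
                  conc = c(0.1, 0.1, 1, 1, 10, 10))
# tcpl first rounds concentrations to three significant figures.
dat[, conc := signif(conc, 3)]
# cndx: dense rank of the concentrations, lowest concentration = 1.
dat[, cndx := frank(conc, ties.method = "dense"), by = spid]
# repi: increments each time the same concentration repeats.
dat[, repi := seq_len(.N), by = .(spid, conc)]
dat$cndx  # 1 1 2 2 3 3
dat$repi  # 1 2 1 2 1 2
```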

The resulting level 1 data can be loaded, after processing is complete, to check the processing was done appropriately.

## Evaluate Level 1 Indexing ##
# Load the level 1 data from the database.
m1dat <- tcplLoadData(lvl = 1,
                      fld = "acid",
                      val = 1,
                      type = "mc")
# Prepare the data into a readable format.
m1dat <- tcplPrepOtpt(m1dat)
# Sort the data based on the concentration and replicate indices.
setkeyv(m1dat, c("repi", "cndx"))
# Display the 'cndx' and 'repi' values.
m1dat[chnm == "Bisphenol A",
      list(chnm, conc, cndx, repi)]

The package also contains a function, namely tcplPlotPlate , for visualizing the data at the assay plate level. This function can be used to visualize the data at levels 1 to 3.

tcplPlotPlate(dat = m1dat, apid = "TP0001915")

In the generated figure, the row and column indices are printed along the respective edges of the plate, with the raw observed values in each well represented by color. While the plate does not give sample ID information, the letter/number codes in the wells indicate the well type and concentration index, respectively. The plate display also marks wells with poor quality (as defined by the well quality, $$\mathit{wllq}$$, field at level 0) with an “X.” When plotting plates at subsequent levels, wells with poor quality will appear empty. The title of the plate display lists the assay component/assay endpoint and the assay plate ID ($$\mathit{apid}$$).

### Level 2

Level 2 processing removes data where the well quality ($$\mathit{wllq}$$) equals 0 and defines the corrected value ($$\mathit{cval}$$) field. This level of processing also allows for transformations of the raw values at the assay component level. Transformation methods range from basic logarithmic transformations to complex spatial noise reduction algorithms. Currently, the tcpl package includes only basic transformations, but this list could be expanded in future releases. Level 2 processing does not include normalization methods; normalization is assigned and occurs during level 3 processing.

#### Methods Assignment

Every assay component needs at least one transformation method assigned to complete level 2 processing. For the example data used in this vignette, no transformations are necessary at level 2. To allow level 2 processing to complete without performing any transformation, assign the “none” method to the assay component(s).

## Methods Assignment ##
# Assign the level 2 transformation method 'none' to ACID 1.
tcplMthdAssign(lvl = 2,     # processing level
               id = 1,      # assay component ID(s) to assign methods
               mthd_id = 1, # method(s) to be assigned
               ordr = 1,    # order in which the method(s) should be applied
               type = "mc") # the data/processing type

For the complete list of level 2 transformation methods currently available, see tcplMthdList(lvl = 2, type = "mc") or ?MC2_Methods for more details. The coding methodology used to implement the methods is beyond the scope of this vignette. In brief, the method names in the database correspond to function names in the list of functions returned by mc2_mthds() (the mc2_mthds function is not exported and not intended for direct use). Each function in that list returns expression objects that the processing function called by tcplRun executes in the local function environment, avoiding additional copies of the data in memory. We encourage suggestions for new methods.

### Level 3

Level 3 processing converts the assay component to assay endpoint(s) and defines the normalized-response value field ($$\mathit{resp}$$); log-scale concentration field ($$\mathit{logc}$$); and optionally, the baseline value ($$\mathit{bval}$$) and positive control value ($$\mathit{pval}$$) fields. The purpose of level 3 processing is to normalize the corrected values to either the percentage of a control or to fold-change from baseline. The normalization process is discussed in greater detail in the Data Normalization section. A primary distinction between level 2 and level 3 processing is that level 2 processes assay components and level 3 processes assay endpoints. The user must assign level 3 processing to assay endpoint identifiers.

#### Methods Assignment

The user first needs to check which assay endpoints stem from the assay component queued for processing.

# Look at the assay endpoints for acid 1.
tcplLoadAeid(fld = "acid", val = 1)

Now that we have the corresponding assay endpoints listed, the normalization methods can be assigned.

## Methods Assignment ##
# Assign the baseline calculation and normalization methods to AEID's 1 & 2.
tcplMthdAssign(lvl = 3,               # processing level
               id = 1:2,              # assay endpoint ID(s) to assign methods
               mthd_id = c(17, 9, 7), # method(s) to be assigned
               ordr = 1:3,            # order the method(s) should be applied
               type = "mc")           # the data/processing type

Above, methods 17, 9, and 7 were assigned for both endpoints. The method assignments instruct the level 3 processing to: (1) calculate $$\mathit{bval}$$ for each assay plate ID by taking the median of all data where the well type equals “n” or the well type equals “t” and the concentration index is 1 or 2; (2) calculate a fold-change over $$\mathit{bval}$$; (3) log-transform the fold-change values with base 2.
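As a minimal numeric sketch of these three steps (toy values, not tcpl's internal code; cval denotes the corrected values from level 2):

```r
# Toy corrected values for one assay plate: neutral-control wells and
# test wells at the two lowest concentration indices define baseline.
cval_baseline <- c(100, 98, 102, 101, 99)  # wllt == "n", or wllt == "t" & cndx %in% 1:2
cval_test     <- c(105, 150, 240, 410)     # treated wells across the series

# (1) bval: median of the baseline wells on the plate (method 17).
bval <- median(cval_baseline)

# (2) Fold-change over baseline (method 9).
fc <- cval_test / bval

# (3) Log-transform the fold-change with base 2 (method 7).
resp <- log2(fc)
round(resp, 3)
```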

For a complete list of normalization methods, see tcplMthdList(lvl = 3, type = "mc") or ?MC3_Methods .

Notice that level 3 processing takes an assay component ID, not an assay endpoint ID, as its input. As mentioned in previous sections, the user must assign the normalization methods by assay endpoint, but the processing itself is done by assay component. Level 3 processing will attempt to process all endpoints in the database for a given component; if one endpoint fails for any reason (e.g., it does not have appropriate methods assigned), processing for the entire component fails.

NOTE:

If the starting level (slvl) is less than 4, tcplRun interprets ‘id’ as an assay component ID (acid). When slvl is greater than or equal to 4, ‘id’ is interpreted as an assay endpoint ID (aeid). The user must provide either an assay source ID (asid) or ‘id’. If an ‘id’ fails, no results are loaded into the database and the ‘id’ is not included in the queue for subsequent processing levels.

### Level 4

Level 4 processing splits the data into concentration-response series by sample and assay endpoint, then models the activity (i.e. response) of each series. Each series is modeled using the bi-directional fitting methods available in the tcplFit2 R package (Sheffield et al., 2021). Bi-directional fitting directly models the activity in the concentration series without performing curve inversion (i.e. observed responses producing a ‘negative/decreasing’ trend with increasing concentration groups are multiplied by $$-1$$ to induce a ‘positive/increasing’ trend) as used in previous versions of tcpl .

#### Pre-Modeling Processes

Level 4 processing starts with the removal of well types with only one concentration (e.g. ‘wllt = b’; empty/blank wells). Next, a noise-band is established for the assay endpoint using the baseline median absolute deviation ($$\mathit{bmad}$$). Here, the $$\mathit{bmad}$$ is calculated from the baseline response values, which are assumed to be either untreated control wells (e.g. ‘wllt = n’; neutral solvent wells like DMSO) or test samples from the two lowest concentrations (i.e. ‘wllt = t’ & concentration index is 1 or 2). The calculation to define $$\mathit{bmad}$$ is done once across the entire assay endpoint. If additional data is added to the database for an assay component, the $$\mathit{bmad}$$ values for all associated assay endpoints will change.

$bmad_{mc} = 1.4826 \times median(\big | y_{i} - \tilde{y} \big |),$

where $$y_{i}$$ is the $$i^{th}$$ baseline observation, defined by the assigned method (see Table 6 for details), and $$\tilde{y}$$ is the median of all baseline observations. The constant, $$1.4826$$, is the default adjustment value used in the underlying R function to ensure $$bmad$$ is a consistent estimator of the standard deviation ($$\sigma$$), assuming the sample size ($$N$$) of the baseline observations is large and the observations are normally distributed (i.e. Gaussian); see mad() in R and unbiased mad for further details.
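For intuition, the calculation reduces to R's mad() applied to the pooled baseline responses (a sketch with toy values; in the pipeline the baseline set is selected by the assigned level 4 method):

```r
# Pooled baseline responses across the assay endpoint (toy values).
baseline <- c(-0.10, 0.05, 0.00, 0.12, -0.07, 0.03, -0.02, 0.08)

# mad() applies the 1.4826 consistency constant by default.
bmad <- mad(baseline)

# Equivalent explicit form of the equation above.
bmad_manual <- 1.4826 * median(abs(baseline - median(baseline)))

all.equal(bmad, bmad_manual)
```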

From tcpl v3 onward, the new mc4 method “onesd.aeid.lowconc.twells” must also be assigned in order to calculate one standard deviation of the baseline response, which is necessary for calculating the benchmark response (BMR) used in both the curve-fitting and hit-calling functions from tcplFit2 (i.e. tcplfit2_core and tcplhit2_core , respectively). One standard deviation of the baseline response for an assay endpoint is estimated using the two lowest concentration groups of test samples across all chemicals. It should be noted that the benchmark response (BMR) as defined will change for a single sample in a given assay endpoint as additional screening data are pipelined; thus, the benchmark dose (BMD) estimate may change over time.

Table 6: Multiple-concentration baseline variability estimation by method assignment in level 4 processing.

| ID | Method Description | Observations | Details |
|----|--------------------|--------------|---------|
| 1 | Median absolute deviation (MAD) of all observations in the lowest two concentrations across samples (spid) in the assay endpoint (aeid). | $$y_{i} = y_{(s,w,d)}$$; $$s \in \{1,...,n_{aeid}\}$$, $$w = t$$, $$d \in \{1,2\}$$ | $$s$$ indicates the sample ID within an ‘aeid’, $$w$$ the well type, & $$d$$ the concentration group index |
| 2 | Median absolute deviation (MAD) of all solvent/untreated control observations across samples (spid) in the assay endpoint (aeid). | $$y_{i} = y_{(s,w)}$$; $$s \in \{1,...,n_{aeid}\}$$, $$w = n$$ | $$s$$ indicates the sample ID within an ‘aeid’, $$w$$ the well type |
| 3 | Standard deviation (SD) of all observations in the lowest two concentrations across samples (spid) in the assay endpoint (aeid). | $$y_{i} = y_{(s,w,d)}$$; $$s \in \{1,...,n_{aeid}\}$$, $$w = t$$, $$d \in \{1,2\}$$ | $$s$$ indicates the sample ID within an ‘aeid’, $$w$$ the well type, & $$d$$ the concentration group index |

Before the model parameters are estimated, a set of summary values are calculated for each concentration series:

• Minimum and maximum observed responses (resp_min & resp_max, respectively).
• Minimum and maximum concentrations (on the log-scale) (logc_min & logc_max, respectively).
• The total number of concentration groups (nconc).
• Total number of observed responses (i.e. data points in the concentration series) (npts).
• Number of replicates in concentration groups (nrep).
• The maximum mean and median responses along with the concentration at which they occur (max_mean, max_med, max_mean_conc, & max_med_conc, respectively).
• The number of median responses greater than $$3\mathit{bmad}$$ (nmed_gtbl).

When referring to the concentration series, the “mean” and “median” response values are defined as the mean or median of all observed response values at each concentration. In other words, the maximum median is the maximum of all median response values across the concentration groups in the series.
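A small sketch of how max_med and max_med_conc follow from this definition (toy data; tcpl computes these internally during level 4 processing):

```r
# One concentration series: responses grouped by concentration.
series <- data.frame(
  conc = rep(c(0.1, 1, 10, 100), each = 3),
  resp = c(0.0, 0.1, -0.1,  0.4, 0.5, 0.6,  1.1, 1.0, 1.2,  2.0, 2.2, 2.1)
)

# Median response per concentration group.
med_by_conc <- aggregate(resp ~ conc, data = series, FUN = median)

# max_med is the largest group median; max_med_conc is where it occurs.
max_med      <- max(med_by_conc$resp)
max_med_conc <- med_by_conc$conc[which.max(med_by_conc$resp)]
c(max_med = max_med, max_med_conc = max_med_conc)
```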

The following code demonstrates how a user, when connected to a database with tcpl pipelined data, can load the level 3 multiple concentration data for a single assay endpoint (aeid).

## Evaluate the Level 3 Data ##
# Load the level 3 data from the database.
mc3 <- tcplLoadData(lvl = 3,
                    type = 'mc',
                    fld = 'aeid',
                    val = 80)
# Prepare the data into a readable format.
mc3 <- tcplPrepOtpt(mc3)
# Display the data.
mc3

For demonstration purposes, we provide the mc_vignette R data object within the tcpl package, since the vignette is not directly connected to such a database. The mc_vignette object contains a subset of data from levels 3 through 5 from the most recent version of the invitroDB database. The following code demonstrates how to load the example data object and access/store the level 3 data in the mc3 object.

# Load the example data from the tcpl package.
data(mc_vignette,package = 'tcpl')

# Allocate the level 3 example data to mc3.
mc3 <- mc_vignette[['mc3']]

Below, we plot the concentration-response series for one of the example spids in the mc3 object, along with the summary estimates listed above.

## Obtain Data ##
# Obtain the level 4 example data.
mc4 <- mc_vignette[["mc4"]]
# Obtain the minimum observed response and its 'logc' group - 'resp_min'.
level3_min <- mc3 %>%
  dplyr::group_by(spid, chnm) %>%
  dplyr::filter(resp == min(resp)) %>%
  dplyr::filter(spid == "01504209")
# Obtain the maximum observed response and its 'logc' group - 'resp_max'.
level3_max <- mc3 %>%
  dplyr::group_by(spid, chnm) %>%
  dplyr::filter(resp == max(resp)) %>%
  dplyr::filter(spid == "01504209")
# Obtain the level 3 data and 'center' estimates for responses per 'logc' group.
level3_summary <- mc3 %>%
  dplyr::filter(spid == "01504209") %>%
  dplyr::select(spid, chnm, logc, resp) %>%
  dplyr::group_by(spid, chnm, logc) %>%
  dplyr::summarise(mean_resp = mean(resp), med_resp = median(resp))

## Generate Individual Summary Plots ##
# Plot the mean responses for each log-concentration group.
A <- mc3 %>%
  dplyr::filter(spid == "01504209") %>%
  ggplot(data = ., aes(logc, resp)) +
  geom_point(pch = 1, size = 2) +
  geom_point(data = level3_summary,
             aes(x = logc, y = mean_resp, col = 'mean responses'),
             alpha = 0.75, size = 2) +
  scale_color_manual(values = 'paleturquoise3', aesthetics = 'col') +
  labs(lty = "", colour = "") +
  xlab(expression(paste(log[10], "(Concentration) ", mu, "M"))) +
  ylab(expression(paste(log[2], "(Fold Induction)"))) +
  ggtitle("Mean Responses") +
  theme_bw() +
  theme(legend.position = 'bottom')
# Plot the median responses for each log-concentration group.
B <- mc3 %>%
  dplyr::filter(spid == "01504209") %>%
  ggplot(data = ., aes(logc, resp)) +
  geom_point(pch = 1, size = 2) +
  geom_point(data = level3_summary,
             aes(x = logc, y = med_resp, col = 'median response'),
             alpha = 0.75, size = 2) +
  scale_color_manual(values = 'hotpink', aesthetics = 'col') +
  labs(lty = "", colour = "") +
  xlab(expression(paste(log[10], "(Concentration) ", mu, "M"))) +
  ylab(expression(paste(log[2], "(Fold Induction)"))) +
  ggtitle("Median Responses") +
  theme_bw() +
  theme(legend.position = 'bottom')
# Plot the maximum mean response at the related log-concentration -
#   'max_mean' & 'max_mean_conc'.
C <- mc3 %>%
  dplyr::filter(spid == "01504209") %>%
  ggplot(data = ., aes(logc, resp)) +
  geom_point(pch = 1, size = 2) +
  geom_point(data = dplyr::filter(mc4, spid == "01504209"),
             aes(x = max_mean_conc, y = max_mean,
                 col = 'maximum mean response'),
             alpha = 0.75, size = 2) +
  scale_color_manual(values = 'paleturquoise3', aesthetics = 'col') +
  labs(lty = "", colour = "") +
  xlab(expression(paste(log[10], "(Concentration) ", mu, "M"))) +
  ylab(expression(paste(log[2], "(Fold Induction)"))) +
  ggtitle(label = "Maximum Mean Response") +
  theme_bw() +
  theme(legend.position = 'bottom')
# Plot the maximum median response at the related log-concentration -
#   'max_med' & 'max_med_conc'.
D <- mc3 %>%
  dplyr::filter(spid == "01504209") %>%
  ggplot(data = ., aes(logc, resp)) +
  geom_point(pch = 1, size = 2) +
  geom_point(data = dplyr::filter(mc4, spid == "01504209"),
             aes(x = max_med_conc, y = max_med,
                 col = "maximum median response"),
             alpha = 0.75, size = 2) +
  scale_color_manual(values = 'hotpink', aesthetics = 'col') +
  labs(lty = "", colour = "") +
  xlab(expression(paste(log[10], "(Concentration) ", mu, "M"))) +
  ylab(expression(paste(log[2], "(Fold Induction)"))) +
  ggtitle(label = "Maximum Median Response") +
  theme_bw() +
  theme(legend.position = 'bottom')
# Plot the minimum & maximum observed responses.
E <- mc3 %>%
  dplyr::filter(spid == "01504209") %>%
  ggplot(data = ., aes(logc, resp)) +
  geom_point(pch = 1, size = 2) +
  geom_point(data = level3_min,
             aes(x = logc, y = resp, col = "minimum response"),
             alpha = 0.75, size = 2) +
  geom_point(data = level3_max,
             aes(x = logc, y = resp, col = "maximum response"),
             alpha = 0.75, size = 2) +
  scale_color_manual(values = c('red', 'blue'), aesthetics = 'col') +
  labs(lty = "", colour = "") +
  xlab(expression(paste(log[10], "(Concentration) ", mu, "M"))) +
  ylab(expression(paste(log[2], "(Fold Induction)"))) +
  ggtitle(label = "Minimum & Maximum\nResponses") +
  theme_bw() +
  theme(legend.position = 'bottom')
# Plot the minimum & maximum experimental log-concentration groups -
#   'logc_min' & 'logc_max'.
G <- mc3 %>%
  dplyr::filter(spid == "01504209") %>%
  ggplot(data = ., aes(logc, resp)) +
  geom_point(pch = 1, size = 2) +
  geom_vline(data = dplyr::filter(mc4, spid == "01504209"),
             aes(xintercept = logc_min, col = 'minimum concentration'),
             lty = "dashed") +
  geom_vline(data = dplyr::filter(mc4, spid == "01504209"),
             aes(xintercept = logc_max, col = 'maximum concentration'),
             lty = "dashed") +
  scale_color_manual(values = c('red', 'blue'), aesthetics = 'col') +
  labs(lty = "", colour = "") +
  xlab(expression(paste(log[10], "(Concentration) ", mu, "M"))) +
  ylab(expression(paste(log[2], "(Fold Induction)"))) +
  ggtitle(label = "Minimum & Maximum\nConcentrations") +
  theme_bw() +
  theme(legend.position = 'bottom')

## Compile Summary Plots in One Figure ##
gridExtra::grid.arrange(
  A, B, C, D, E, G,
  nrow = 3, ncol = 2,
  top = mc4[which(mc4[, spid] == "01504209"), aenm]
)

Figure 1: The plots in this figure illustrate the summary estimates calculated as part of the level 4 processing, which occurs prior to dose-response modeling. Each of the plots depicts the observed concentration-response data with white circles, where the x-axis is base 10 log-transformed concentration values. The mean response values for each concentration group are depicted as turquoise circles in the upper left plot. Similarly, the upper right plot depicts the median response values for each concentration group with hot-pink circles. The middle two plots again depict mean and median responses, but show only the maximum mean ($$\mathit{max\_mean}$$) and median ($$\mathit{max\_med}$$) response estimates (left and right plots, respectively). The bottom left plot depicts the minimum observed response value ($$\mathit{resp\_min}$$) with a blue circle and the maximum observed response value ($$\mathit{resp\_max}$$) with a red circle. Finally, the minimum ($$\mathit{logc\_min}$$) and maximum ($$\mathit{logc\_max}$$) log10-scale concentrations are depicted, respectively, with blue and red vertical dashed lines.

#### Concentration-Response Modeling Details

After the summary values are obtained for each concentration-response series, all ten parametric models available in tcplFit2 are used to fit each series. See Table 7 for model details.

Table 7: tcplfit2 model details.

| Model | Abbreviation | Equation | Output Parameters | Details |
|-------|--------------|----------|-------------------|---------|
| Constant | cnst | $$f(x) = 0$$ | | Parameters always equal ‘er’. |
| Linear | poly1 | $$f(x) = ax$$ | a (y-scale) | |
| Quadratic | poly2 | $$f(x) = a(\frac{x}{b}+(\frac{x}{b})^{2})$$ | a (y-scale), b (x-scale) | |
| Power | pow | $$f(x) = ax^p$$ | a (y-scale), p (power) | |
| Hill | hill | $$f(x) = \frac{tp}{1 + (\frac{ga}{x})^{p}}$$ | tp (top), ga (gain AC50), p (gain-power) | Concentrations are converted internally to log10 units and optimized with $$f(x) = tp/(1 + 10^{p(ga - x)})$$, then ga and ga_sd are converted back to regular units before returning. |
| Gain-Loss | gnls | $$f(x) = \frac{tp}{(1 + (\frac{ga}{x})^{p} )(1 + (\frac{x}{la})^{q} )}$$ | tp (top), ga (gain AC50), p (gain power), la (loss AC50), q (loss power) | Concentrations are converted internally to log10 units and optimized with $$f(x) = tp/[(1 + 10^{p(ga - x)})(1 + 10^{q(x - la)})]$$, then ga, la, ga_sd, and la_sd are converted back to regular units before returning. |
| Exponential 2 | exp2 | $$f(x) = a*(exp(\frac{x}{b}) - 1)$$ | a (y-scale), b (x-scale) | |
| Exponential 3 | exp3 | $$f(x) = a*(exp((\frac{x}{b})^{p}) - 1)$$ | a (y-scale), b (x-scale), p (power) | |
| Exponential 4 | exp4 | $$f(x) = tp*(1-2^{\frac{-x}{ga}})$$ | tp (top), ga (AC50) | |
| Exponential 5 | exp5 | $$f(x) = tp*(1-2^{-(\frac{x}{ga})^{p}})$$ | tp (top), ga (AC50), p (power) | |

Model descriptions are pulled from the tcplFit2 manual at https://cran.r-project.org/package=tcplfit2/tcplfit2.pdf.

Each of these models assumes the background response is zero and that the absolute (or initial) response is increasing. Upon completion of the model fitting, each model receives a success designation: 1 if the model optimization converges, 0 if the optimization fails, and NA if ‘nofit’ was set to TRUE within the tcplfit2_core function from tcplFit2 . Similarly, if the Hessian matrix was successfully inverted, 1 is returned to indicate a successful covariance calculation (cov); otherwise 0 is returned. Finally, in cases where ‘nofit’ was set to TRUE (within tcplFit2::tcplfit2_core ) or the model fit failed, the Akaike information criterion (aic), root mean squared error (rme), model-estimated responses (modl), model parameters (parameters), and the standard deviations of the model parameters (parameter sds) are set to NA. A complete list of model output parameters is provided in Table 8 below.

Table 8: Model output descriptions.

| Fit Parameter | Description |
|---------------|-------------|
| er | Error term |
| success | Success of fit/model convergence |
| cov | Success of covariance calculation |
| aic | Akaike information criterion |
| rme | Root mean squared error |
| modl | Vector of model-estimated values at the given concentrations |
| parameters | Model parameter values |
| parameters sds | Standard deviations of the model parameter values |
| pars | Vector of parameter names |
| sds | Vector of parameter standard deviation names |

Model output descriptions are pulled from the tcplFit2 manual at https://cran.r-project.org/package=tcplfit2/tcplfit2.pdf.

All models assume the error follows a Student’s t-distribution with four degrees of freedom. Heavier (i.e. wider) tails in the t-distribution diminish the influence of outlier values, and produce more robust estimates than the more commonly used normal distribution. Robust model fitting removes the need to eliminate potential outliers prior to fitting. Maximum likelihood estimation is utilized in the model fitting algorithm to estimate model parameters for all available models in tcplFit2 .

Let $$t(z,\nu)$$ be the Student’s t-distribution with $$\nu$$ degrees of freedom, $$y_{i}$$ be the observed response at the $$i^{th}$$ observation, and $$\mu_{i}$$ be the estimated response at the $$i^{th}$$ observation. We calculate $$z_{i}$$ as:

$z_{i} = \frac{y_{i} - \mu_{i}}{exp(\sigma)},$

where $$\sigma$$ is the scale term. Then the log-likelihood is

$\sum_{i=1}^{n} [\ln\left(t(z_{i}, 4)\right) - \sigma]\mathrm{,}$

where $$n$$ is the number of observations.
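The objective can be written directly in R (a sketch of the likelihood only, not the tcplFit2 optimizer):

```r
# Log-likelihood of observed responses y given model predictions mu,
# Student's t errors with 4 degrees of freedom, and log-scale term sigma.
tll <- function(y, mu, sigma) {
  z <- (y - mu) / exp(sigma)
  sum(dt(z, df = 4, log = TRUE) - sigma)
}

# Toy check: the likelihood is larger for predictions closer to the data.
y    <- c(0.1, 0.4, 1.0, 2.1)
good <- c(0.0, 0.5, 1.0, 2.0)
bad  <- c(2.0, 2.0, 0.0, 0.0)
tll(y, good, sigma = -1) > tll(y, bad, sigma = -1)
```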

The following set of plots provides simulated concentration-response curves to illustrate the general curve shapes captured by the models included in tcplFit2. It should be noted that when fitting ‘real-world’ experimental data, the resulting curves minimize the error between the observed data and the concentration-response curve; thus, the shape of each model fit may or may not reflect what is illustrated in Figure 2.

## Example Data ##
# Example fit concentration series.
ex_conc <- seq(10^(min(mc3[, logc])), 10^(max(mc3[, logc])), length.out = 100)

## Obtain the Continuous Fit of Level 4 Model Estimates ##
fits <- data.frame(
  # log-scale concentrations
  logc = log10(ex_conc),
  # parametric model fits from tcplfit2
  constant = tcplfit2::cnst(ps = c(er = 0.1), ex_conc),
  poly1 = tcplfit2::poly1(ps = c(a = 3.5, er = 0.1), x = ex_conc),
  poly2 = tcplfit2::poly2(ps = c(a = 0.13, b = 2, er = 0.1), x = ex_conc),
  power = tcplfit2::pow(ps = c(a = 1.23, p = 1.45, er = 0.1), x = ex_conc),
  hill = tcplfit2::hillfn(ps = c(tp = 750, ga = 5, p = 1.76, er = 0.1),
                          x = ex_conc),
  gnls = tcplfit2::gnls(ps = c(tp = 750, ga = 15, p = 1.45, la = 50, q = 1.34,
                               er = 0.1),
                        x = ex_conc),
  exp2 = tcplfit2::exp2(ps = c(a = 0.45, b = 13.5, er = 0.1),
                        x = ex_conc),
  exp3 = tcplfit2::exp3(ps = c(a = 1.67, b = 12.5, p = 0.87, er = 0.1),
                        x = ex_conc),
  exp4 = tcplfit2::exp4(ps = c(tp = 895, ga = 15, er = 0.1), x = ex_conc),
  exp5 = tcplfit2::exp5(ps = c(tp = 793, ga = 6.25, p = 1.25, er = 0.1),
                        x = ex_conc)
) %>%
  reshape2::melt(data = ., measure.vars = c(
    "constant",
    "poly1", "poly2", "power",
    "hill", "gnls", "exp2", "exp3", "exp4", "exp5"
  ))

## Updated Colors ##
fit_cols <-
  # choose 10 distinct colors
  viridis::magma(n = 10, direction = 1) %>%
  # darken the original colors to make them more visible
  colorspace::darken(., amount = 0.2)

## Plot ##
fits %>%
  ggplot() +
  geom_line(aes(x = logc, y = value, lty = variable, colour = variable)) +
  facet_wrap(facets = "variable") +
  theme_bw() +
  labs(lty = "Models", colour = "Models") +
  scale_colour_manual(values = fit_cols) +
  ggtitle("General Shape of Models Included in tcplfit2") +
  xlab(expression(paste(log[10], "(Concentration) ", mu, "M"))) +
  ylab("Response")

Figure 2: This figure contains simulated concentration-response curves to illustrate the general underlying curve shape covered by each of the models included in the tcplFit2 package and used on the back-end of the level 4 data processing in tcpl. Each sub-plot in the figure corresponds to a single parametric model included in the model fitting process and has a corresponding color and line type to accompany it. All sub-plots are plotted such that the x-axis represents the log-transformed concentration ($$base=10$$) and the y-axis represents the response values.

#### Methods Assignment

Level 4 requires only the baseline variability estimation methods described above; once these are assigned, the user only needs to run the tcplRun function. Level 4 processing and all subsequent processing is done by assay endpoint, not assay component. The previous section showed how to find the assay endpoints for an assay component using the tcplLoadAeid function.

The example we include in this vignette for demonstrating the assignment of level 4 methods instructs the processing to (1) estimate the $$\mathit{bmad}$$ (mthd_id = 1) and (2) estimate one standard deviation [required] (mthd_id = 3), mentioned in the pre-modeling process section, for the specified set of assay endpoints. Here, method ‘1’ specifies that the BMAD is estimated using the two lowest concentration groups of treatment wells.

## Methods Assignment ##
# Assign the level 4 processing methods to AEID 80.
tcplMthdAssign(
  lvl = 4,           # processing level
  id = 80,           # assay endpoint ID(s) to assign method(s)
  mthd_id = c(1, 3), # method(s) to be assigned
  ordr = 1:2,        # order the method(s) should be applied
  type = "mc")       # the data/processing type

After the level 4 data are processed, the user can load the model fit information from the database.

## Explore the Level 4 Data ##
# Load the level 4 data.
mc4 <- tcplLoadData(lvl = 4,
                    type = 'mc',
                    fld = 'aeid',
                    val = 80)
# Prepare the data into a readable format.
mc4 <- tcplPrepOtpt(mc4)
# Display the data.
mc4

A subset of this data is already available within the mc_vignette object.

# Allocate the level 4 data in mc_vignette to the mc4 object.
mc4 <- mc_vignette[["mc4"]]

The level 4 data includes fields for each of the ten model fits as well as the ID fields. A complete list of level 4 fields is available in the Introduction vignette. The level 4 data fields with model fit information are prefaced by the model abbreviations (e.g. $$\mathit{cnst}$$, $$\mathit{hill}$$, $$\mathit{gnls}$$, $$\mathit{poly1}$$, etc.). The fields ending in $$\mathit{success}$$ indicate the convergence status of the model: a value of 1 means the model converged, 0 means it did not, and NA indicates the fitting algorithm did not attempt to fit the model.

In Figure 3 we display the smoothed model fit results from level 4 with respect to the observed concentration-response series.

## Create a Sequence of Concentration Values within the Observed Range ##
X <- seq(
  10^(mc4[which(mc4[, spid] == "01504209"), logc_min]),
  10^(mc4[which(mc4[, spid] == "01504209"), logc_max]),
  length.out = 100
)

## Obtain the Continuous Fit of Level 4 Model Estimates ##
# Apply each model fit to the continuous concentration values (X) using the
# parameters estimated by 'tcplfit2'.
estDR <- mc4 %>%
  dplyr::filter(spid == "01504209") %>%
  dplyr::summarise(
    cnst  = tcplfit2::cnst(.[, c(cnst_er)], X),
    poly1 = tcplfit2::poly1(.[, c(poly1_a, poly1_er)], X),
    poly2 = tcplfit2::poly2(.[, c(poly2_a, poly2_b, poly2_er)], X),
    power = tcplfit2::pow(.[, c(pow_a, pow_p, pow_er)], X),
    hill  = tcplfit2::hillfn(.[, c(hill_tp, hill_ga, hill_p)], X),
    gnls  = tcplfit2::gnls(.[, c(gnls_tp, gnls_ga, gnls_p, gnls_la, gnls_q, gnls_er)],
                           x = X),
    exp2  = tcplfit2::exp2(.[, c(exp2_a, exp2_b, exp2_er)], x = X),
    exp3  = tcplfit2::exp3(.[, c(exp3_a, exp3_b, exp3_p, exp3_er)], x = X),
    exp4  = tcplfit2::exp4(.[, c(exp4_tp, exp4_ga, exp4_er)], x = X),
    exp5  = tcplfit2::exp5(.[, c(exp5_tp, exp5_ga, exp5_p, exp5_er)], x = X)
  )
# Format data into a data.frame for ease of plotting.
estDR <- cbind.data.frame(X, estDR) %>%
  reshape2::melt(data = ., measure.vars = c(
    "cnst", "poly1", "poly2", "power", "hill", "gnls", "exp2", "exp3", "exp4", "exp5"
  ))

## Updated Colors ##
fit_cols <-
  # choose 10 distinct colors
  viridis::magma(n = 10, direction = 1) %>%
  # darken the original colors to make them more visible
  colorspace::darken(., amount = 0.2)

## Plot the Model Fits from Level 4 ##
mc3 %>%
  dplyr::filter(spid == "01504209") %>%
  ggplot(., aes(x = logc, y = resp)) +
  geom_point(pch = 1, size = 2) +
  geom_line(data = estDR,
            aes(x = log10(X), y = value, colour = variable, lty = variable)) +
  labs(colour = "Models", lty = "Models") +
  scale_colour_manual(values = fit_cols) +
  xlab(expression(paste(log[10], "(Concentration) ", mu, "M"))) +
  ylab(expression(paste(log[2], "(Fold Induction)"))) +
  ggtitle(
    label = paste("Level 4 Model Fits",
                  mc4[which(mc4[, spid] == "01504209"), dsstox_substance_id],
                  sep = "\n"),
    subtitle = paste("Assay Endpoint: ",
                     mc4[which(mc4[, spid] == "01504209"), aenm])) +
  theme_bw()

Figure 3: This figure illustrates the results from the level 4 analyses in the tcpl pipeline. The plot depicts the observed concentration-response data with white circles, where the x-axis is base 10 log-transformed concentration values. All ten model fits are displayed and distinguished by color and line-type.

Each of the models in Figure 3 has two model fit summary estimates: the Akaike information criterion (AIC) and the root mean squared error (RMSE or RME).

For the AIC, let $$log(\mathcal{L}(\hat{\theta}, y))$$ be the log-likelihood of the model $$\hat{\theta}$$ given the observed values $$y$$, and $$K$$ be the number of parameters in $$\hat{\theta}$$, then,

$\mathrm{AIC} = -2\log(\mathcal{L}(\hat{\theta}, y)) + 2K\mathrm{.}$

The RMSE is given by

$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (y_{i} - \mu_{i})^2}{N}}\mathrm{,}$

where $$N$$ is the number of observations, and $$\mu_{i}$$ and $$y_{i}$$ are the estimated and observed values at the $$i^{th}$$ observation, respectively.
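Both summary statistics are straightforward to compute from a fit (a sketch; tcplFit2 reports these per model as aic and rme):

```r
# AIC from a log-likelihood and parameter count.
aic <- function(loglik, k) -2 * loglik + 2 * k

# RMSE from observed and model-estimated responses.
rmse <- function(y, mu) sqrt(mean((y - mu)^2))

# Toy observed and fitted values.
y  <- c(0.1, 0.4, 1.0, 2.1)
mu <- c(0.0, 0.5, 1.0, 2.0)
c(aic = aic(loglik = -1.5, k = 2), rmse = rmse(y, mu))
```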

### Level 5

Level 5 processing determines the winning model and activity for the concentration series, bins all of the concentration series into categories, and calculates point-of-departure estimates based on the activity cutoff.

#### Methods Assignment

The model with the lowest AIC value is selected as the winning model ($$\mathit{modl}$$) and is used to determine the activity (or hit call) for the concentration series. If two models have equal AIC values, the simpler model (i.e. the model with fewer parameters) wins. Additionally, if the constant model describes the response best of all the models fit, then $$\mathit{modl}$$ is reported as ‘none’. Table 9 provides a side-by-side comparison of the AIC values from each of the ten model fits for the concentration-response series included in the example dataset mc_vignette.

Table 9: AIC values from all fits in level 4. Bolded values indicate the winning model for the compound.

| dsstox_id | cnst | hill | gnls | poly1 | poly2 | pow | exp2 | exp3 | exp4 | exp5 |
|-----------|------|------|------|-------|-------|-----|------|------|------|------|
| DTXSID80379721 | 4.35 | -18.979 | -14.979 | -15.349 | -20.872 | -21.059 | **-21.271** | -19.128 | -7.326 | -19.01 |
| DTXSID10379991 | -10.12 | -26.4 | -22.4 | -25.132 | -27.887 | **-28.378** | -28.368 | -26.377 | -18.233 | -26.4 |
| DTXSID7021106 | -15.77 | -41.915 | -37.915 | **-42.866** | -40.889 | -41.092 | -40.887 | -39.03 | -37.163 | -41.915 |
| DTXSID1026081 | 0.014 | -27.292 | -23.292 | -9.634 | -22.211 | **-29.313** | -29.301 | -27.315 | -1.248 | -27.305 |
| DTXSID9032589 | -6.489 | -28.194 | -24.194 | -27.803 | -30.579 | -30.406 | **-30.773** | -29.095 | -19.544 | -28.27 |
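The selection rule itself reduces to a minimum over the per-model AIC values; the sketch below uses the AIC values from the first row of Table 9 (note that which.min() returns the first minimum, whereas tcpl's tie-breaking explicitly prefers the model with fewer parameters):

```r
# AIC values for one concentration series (smaller is better).
aics <- c(cnst = 4.35, hill = -18.979, gnls = -14.979, poly1 = -15.349,
          poly2 = -20.872, pow = -21.059, exp2 = -21.271, exp3 = -19.128,
          exp4 = -7.326, exp5 = -19.01)

# Winning model: lowest AIC.
modl <- names(aics)[which.min(aics)]
modl  # "exp2"
```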

The estimated parameters from the winning model are stored in their respective parameter fields at level 5. The activity of each concentration-response series is determined by a continuous hit call, computed as the product of three probabilities: (1) the probability that at least one median response is greater than the efficacy cutoff, (2) the probability that the top (i.e. the maximal change in the fitted response) is greater than the cutoff, and (3) the probability that the AIC of the winning model is lower than that of the constant model.

The efficacy cutoff value ($$\mathit{coff}$$) is defined as the maximum of all values given by the methods assigned at level 5. When two or more methods (i.e. cutoff values) are applied for processing, the largest cutoff value is always selected as the cutoff for the endpoint. In the event only one method is applied, then that will serve as the efficacy cutoff for the endpoint. Failing to assign a level 5 method will result in every concentration series being called active. For a complete list of level 5 methods, see tcplMthdList(lvl = 5) or ?MC5_Methods .

The example we include in this vignette for demonstrating the assignment of level 5 methods specifies three different efficacy cutoff estimates for consideration. These efficacy cutoff estimates include $$3*\mathit{bmad}$$, $$log_2(1.2)$$, and $$5*\mathit{bmad}$$, which correspond to $$\mathit{mthd\_id}$$ assignments 1, 3, and 5 respectively. As mentioned previously, the largest of these three values will be selected as the cutoff for processing the endpoint.

## Methods Assignment ##
# Assign the Level 5 Processing Methods to AEID 80.
tcplMthdAssign(lvl = 5,            # processing level
               id = 80,            # assay endpoint ID(s) to assign method(s)
               mthd_id = c(1,3,5), # method(s) to be assigned
               ordr = 1:3,         # order the method(s) should be assigned
               type = "mc")        # the data/processing type
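Conceptually, the cutoff selection described above reduces to taking the maximum of the candidate values produced by the assigned methods. A minimal sketch in base R, where the $$\mathit{bmad}$$ value is a hypothetical number chosen for illustration (the pipeline computes this internally during level 5 processing):

```r
# Sketch only: cutoff selection across the three assigned level 5 methods.
bmad <- 0.08              # hypothetical baseline MAD for this endpoint
candidates <- c(3 * bmad, # mthd_id 1
                log2(1.2),# mthd_id 3
                5 * bmad) # mthd_id 5
coff <- max(candidates)   # the largest candidate becomes the cutoff
coff
```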

As described previously, the continuous hitcall is the product of three probability values, so the resulting value falls between 0 and 1. The closer the hitcall is to 1, the more probable it is that the concentration-response series indicates true biological activity in the measured response (i.e. an 'active' hit).
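Outside of the database pipeline, the same three-probability hitcall can be computed directly with tcplfit2, on which tcpl's level 4 and 5 processing is built. A minimal sketch with simulated data; the cutoff and onesd values here are assumptions for illustration only:

```r
library(tcplfit2)

# Simulated concentration-response series (three replicates per concentration).
conc <- rep(c(0.03, 0.1, 0.3, 1, 3, 10, 30, 100), each = 3)
set.seed(42)
resp <- 1.5 * conc / (conc + 5) + rnorm(length(conc), sd = 0.1)

cutoff <- 0.25 # assumed efficacy cutoff for this sketch
onesd  <- 0.15 # assumed one standard deviation of the baseline response

# Fit the candidate models, then compute the continuous hitcall for the winner.
fits <- tcplfit2_core(conc, resp, cutoff)
out  <- tcplhit2_core(fits, conc, resp, cutoff, onesd)
out$hitcall # product of the three probabilities; a value in [0, 1]
```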

For each concentration series, several point-of-departure (POD) estimates are calculated for the winning model. The major estimates include: (1) the activity concentration at the specified benchmark response (BMR) ($$\mathit{bmd}$$), (2) the activity concentration at $$50\%$$ of the maximal response ($$\mathit{ac50}$$), (3) the activity concentration at the efficacy cutoff ($$\mathit{acc}$$), (4) the activity concentration at $$10\%$$ of the maximal response ($$\mathit{ac10}$$), and (5) the activity concentration at $$5\%$$ of the maximal response ($$\mathit{ac5}$$). Though several other potency estimates are calculated as part of the level 5 pipeline, these five are the major POD estimates. The POD estimates mentioned here are summarized in Figure 4.
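For a winning model with known parameters, these potency estimates can be recovered by inverting the fitted curve; tcplfit2 exposes this through acy(). The hill-model parameters and cutoff below are hypothetical values chosen for illustration:

```r
library(tcplfit2)

# Hypothetical hill-model parameters: top (tp), gain AC50 (ga), gain power (p).
modpars <- list(tp = 1.2, ga = 3.5, p = 1.8)

# Activity concentrations at fixed fractions of the maximal response.
ac50 <- acy(0.50 * modpars$tp, modpars, type = "hill")
ac10 <- acy(0.10 * modpars$tp, modpars, type = "hill")
ac5  <- acy(0.05 * modpars$tp, modpars, type = "hill")

# Activity concentration at an assumed efficacy cutoff (ACC).
acc <- acy(0.25, modpars, type = "hill")
```

Note that in the pipeline the BMD is obtained by inverting the model at the BMR rather than at a fixed fraction of the top.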

After the level 5 data are processed, the user can load the model fit information from the database.

## Explore the Level 5 Data ##
# Load the level 5 data from the database.
mc5 <- tcplLoadData(lvl = 5,      # data level
                    type = 'mc',  # data type
                    fld = 'aeid', # field to query on
                    val = 80,     # value for the query field
                    add.fld = TRUE) # return all level 5 fields
# Prepare the data into a readable format.
mc5 <- tcplPrepOtpt(mc5)
# Display the data.
mc5

A subset of this data is already available within the mc_vignette object.

# Allocate the level 5 data in mc_vignette to the mc5 object.
mc5 <- mc_vignette[["mc5"]]

The level 5 data includes fields for the best model fit, the POD estimates, other estimates from the best model fit, as well as the ID fields. A complete list of level 5 fields is available in the Introduction vignette.
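For users connected to a database, the fields available in a given tcpl table can also be listed directly with the package's table-listing helper; a quick sketch (requires a configured tcpl database connection):

```r
library(tcpl)
# List the column names of the level 5 multiple-concentration table.
tcplListFlds(tbl = "mc5")
```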

The user can visualize the model fitting results with the tcplPlot function when connected to a database. The resulting figure is an interactive plot displaying the concentration-response series data, the ten curve fits from level 4 with the best model distinguished by color, and the AC50 estimate.

## Plot the Interactive Level 5 Figure ##
# Obtain the m4id for AEID 80 and spid 01504209.
my_id <- tcplLoadData(lvl = 5,
                      fld = c('aeid','spid'),
                      val = list(80,'01504209'))$m4id
# Report the m4id.
my_id
# Generate the interactive concentration-response plot.
tcplPlot(lvl = 5, val = my_id, output = 'console')

In Figure 4, we provide an alternative visualization of the model fitting results from level 4, the best model, and the potency estimates from the level 5 data.

## Obtain Data ##
# First, obtain the subset of data related to spid = "01504209",
# which is our example spid.
mc3_ss <- mc3 %>% dplyr::filter(spid == "01504209") # Level 3 - conc-resp series
mc4_ss <- mc4 %>% dplyr::filter(spid == "01504209") # Level 4 - model fits
mc5_ss <- mc5 %>% dplyr::filter(spid == "01504209") # Level 5 - best fit & est.
# Next, flag the smooth curve estimates for the best model found in the
# level 5 analyses. (Here `estDR` holds the smoothed concentration-response
# predictions for each level 4 model in long format, with columns `X`,
# `variable`, and `value`.)
estDR <- estDR %>%
  dplyr::mutate(best_modl = ifelse(variable == mc5_ss[, modl],
                                   yes = "best model", no = NA))

## Generate a Base Concentration-Response Plot ##
basePlot <- mc3_ss %>%
  # Observed concentration-response data.
  ggplot() +
  geom_point(aes(x = logc, y = resp), pch = 1, size = 2) +
  # Cutoff band.
  geom_rect(data = mc5_ss,
            aes(xmin = logc_min, xmax = logc_max, ymin = -coff, ymax = coff),
            alpha = 0.15, fill = "skyblue") +
  # Best model fit.
  geom_line(data = dplyr::filter(estDR, variable == mc5_ss[, modl]),
            aes(x = log10(X), y = value, color = mc5_ss[, modl])) +
  scale_colour_manual(values = c("royalblue3"), aesthetics = "color") +
  # Other model fits.
  geom_line(data = dplyr::filter(estDR, variable != mc5_ss[, modl]),
            aes(x = log10(X), y = value, lty = variable),
            alpha = 0.3, show.legend = TRUE) +
  # Legend information.
  labs(lty = "Other Models", color = "Best Fit") +
  # Titles and labels.
  xlab(expression(paste(log[10], "(Concentration) ", mu, "M"))) +
  ylab(expression(paste(log[2], "(Fold Induction)"))) +
  ggtitle(
    label = paste("Level 5 Best Model Fit",
                  mc4[which(mc4[, spid] == "01504209"), dsstox_substance_id],
                  sep = "\n"),
    subtitle = paste("Assay Endpoint: ",
                     mc4[which(mc4[, spid] == "01504209"), aenm])) +
  # Background plot theme.
  theme_bw()

## Potency Estimate Layers ##
# First, we need to obtain/assign colors for the potency estimates to be
# displayed.
potency_cols <-
# choose 5 distinct colors
viridis::plasma(n = 5,direction = -1) %>%
# darken the original colors to make them more visible
colorspace::darken(.,amount = 0.1)

## Compile the Full Level 5 Plot ##
linePlot <-
  # Start with the basePlot object.
  basePlot +
  # Next, add the various potency layers.
  # BMD: horizontal line at the BMR, vertical segment at the BMD.
  geom_hline(data = mc5_ss,
             aes(yintercept = bmr),
             col = potency_cols[1]) +
  geom_segment(data = mc5_ss,
               aes(x = log10(bmd), xend = log10(bmd), y = -0.5, yend = bmr),
               col = potency_cols[1]) +
  # ACC: horizontal line at the cutoff, vertical segment at the ACC.
  geom_hline(data = mc5_ss,
             aes(yintercept = coff),
             col = potency_cols[2]) +
  geom_segment(data = mc5_ss,
               aes(x = log10(acc), xend = log10(acc), y = -0.5, yend = coff),
               col = potency_cols[2]) +
  # AC50: horizontal line at 50% of the maximal response, segment at the AC50.
  geom_hline(data = mc5_ss,
             aes(yintercept = max_med * 0.5),
             col = potency_cols[3]) +
  geom_segment(data = mc5_ss,
               aes(x = log10(ac50), xend = log10(ac50),
                   y = -0.5, yend = max_med * 0.5),
               col = potency_cols[3]) +
  # AC10: horizontal line at 10% of the maximal response, segment at the AC10.
  geom_hline(data = mc5_ss,
             aes(yintercept = max_med * 0.1),
             col = potency_cols[4]) +
  geom_segment(data = mc5_ss,
               aes(x = log10(ac10), xend = log10(ac10),
                   y = -0.5, yend = max_med * 0.1),
               col = potency_cols[4]) +
  # AC5: horizontal line at 5% of the maximal response, segment at the AC5.
  geom_hline(data = mc5_ss,
             aes(yintercept = max_med * 0.05),
             col = potency_cols[5]) +
  geom_segment(data = mc5_ss,
               aes(x = log10(ac5), xend = log10(ac5),
                   y = -0.5, yend = max_med * 0.05),
               col = potency_cols[5])

# Create a data table for the potency estimate points.
mc5_points <- mc5_ss %>%
  select(bmd, acc, ac50, ac10, ac5) %>%
  tidyr::pivot_longer(everything(), names_to = "Potency Estimates") %>%
  mutate(x = log10(value)) %>%
  mutate(mc_color = potency_cols) %>%
  mutate(`Potency Estimates` = toupper(`Potency Estimates`))
yvals <- mc5_ss %>%
  select(bmr, coff, max_med) %>%
  tidyr::pivot_longer(everything()) %>%
  select(value) %>%
  mutate(reps = c(1, 1, 3)) %>%
  tidyr::uncount(reps) %>%
  mutate(y = value * c(1, 1, .5, .1, .05)) %>%
  select(y)
mc5_points <- mc5_points %>% cbind(yvals)

# Add the potency estimate points and set colors.
fullPlot <- linePlot +
  geom_point(data = mc5_points,
             aes(x = x, y = y, fill = `Potency Estimates`),
             shape = 21, cex = 2.5) +
  scale_fill_manual(values = mc5_points %>%
                      arrange(`Potency Estimates`) %>%
                      pull(mc_color))

## Display the Compiled Plot ##
fullPlot

Figure 4: This figure illustrates the results from the level 5 analyses in the tcpl pipeline, including the best model fit and the subsequent point-of-departure (POD) estimates. The light-blue shaded region represents the estimated efficacy cutoff ($$\mathit{coff}$$). Each of the concentration-response models fit in level 4 is included in the plot, where the blue curve indicates the best model fit for the observed data (white circles) and the rest are depicted by gray curves. The horizontal lines show the activity responses from which the potency estimates of interest are defined, and the vertical lines show the corresponding POD estimates. The black point shows the AC5 (concentration producing $$5\%$$ of the maximal response), the purple point shows the AC10 (concentration producing $$10\%$$ of the maximal response), the yellow point shows the BMD (benchmark dose), the orange point shows the ACC (concentration producing a response at the efficacy cutoff), and the pink point shows the AC50 (concentration producing $$50\%$$ of the maximal response).

In tcpl v3.0.0, there are no level 6 (mc6) methods to apply, since the introduction of new curve-fitting models requires further investigation into behaviors that may warrant caution flags. Caution flags may be developed in future versions of tcpl if such behaviors are identified.

## ‘tcpl’ Compiled Processing Examples

In this section we provide practical step-by-step examples for single- and multiple-concentration data, from methods assignment through data processing.

### Single-concentration Data

After the level 0 (pre-processed) data are loaded into the database, we need to assign the data processing methods for each step. Once the methods for level 1 and 2 processing are assigned, we can run the pipeline to process the data and store the results in the database.

In the following code chunk, we demonstrate how to assign the methods for each step.

## Methods Assignment ##
# Level 1 - Normalization
tcplMthdAssign(lvl = 1,                # data level
               id = 1:2,               # assay component
               mthd_id = c(1, 11, 13), # method(s) to be applied
               ordr = 1:3,             # order sequence for applied methods
               type = "sc")            # data type - single concentration

# Level 2 - Bioactivity Hit Call
tcplMthdAssign(lvl = 2,     # data level
               id = 1,      # assay component
               mthd_id = 3, # method(s) to be applied
               type = "sc") # data type - single concentration

Now that we have assigned methods for each level, we can pipeline the data with tcpl.

## Pipeline Assay Components (acid's) from level 0 to 2.
sc_res <- tcplRun(id = 1, # assay component id to pipeline
type = "sc", # type of data being processed
slvl = 1, # level to start pipelining on
elvl = 2) # level to end pipelining on

### Multiple-concentration Data

After the level 0 multiple-concentration data are loaded into the database, we need to assign the data processing methods for each step. Once the methods for levels 1-5 are assigned, we can run the pipeline to process the data and store the results in the database. Note that if higher levels of processing are included in future versions of the pipeline, these should also be assigned before running the pipeline.

In the following code chunk, we demonstrate how to assign the methods for each step.

## Methods Assignment.
# Level 1 - Indexing
#   * No methods need to be assigned.

# Level 2 - Data Transformation
tcplMthdAssign(lvl = 2,            # data level
               id = 49,            # assay component
               mthd_id = c(3,4,2), # method(s) to be applied
               ordr = 1:3,         # the order method(s) are to be applied
               type = "mc")        # data type - multiple concentration

# Level 3 - Normalization
tcplMthdAssign(lvl = 3,     # data level
               id = 80,     # assay endpoint
               mthd_id = 1, # method(s) to be applied
               ordr = 1,    # the order method(s) are to be applied
               type = "mc") # data type - multiple concentration

# Level 4 - Concentration-Response Modeling
tcplMthdAssign(lvl = 4,          # data level
               id = 80,          # assay endpoint
               mthd_id = c(1,3), # method(s) to be applied
               ordr = 1:2,       # the order method(s) are to be applied
               type = "mc")      # data type - multiple concentration

# Level 5 - Best Model Fit
tcplMthdAssign(lvl = 5,            # data level
               id = 80,            # assay endpoint
               mthd_id = c(1,3,5), # method(s) to be applied
               ordr = 1:3,         # the order method(s) are to be applied
               type = "mc")        # data type - multiple concentration

Now that we have assigned methods for each level we can pipeline the data with tcpl. This can be done in two ways (a) from start to finish (i.e. Level 0 to 5) with the assay component ID (acid) or (b) Level 0 to 3 with the assay component ID (acid) and Level 4 to 5 with the assay endpoint ID (aeid). Option A may be helpful in scenarios when the data are new to the pipeline, whereas Option B or derivatives of Option B may be helpful in making pipeline updates or corrections to data already pipelined and stored in the database.

Option A:

## Assign the Number of Processing Cores.
mycores <- 1 # DO NOT leverage parallel computing.
# Users that want to leverage parallel computing should set this to a value
# greater than 1, but less than the total number of cores (i.e. at least one
# core should be left free), e.g.:
# mycores <- parallel::detectCores() - 1 # Maximum allowed number of cores.

## Pipeline Assay Components (acid's) from level 0 to 5.
mc_res <- tcpl::tcplRun(
id = atg.aeid[,acid], # assay component id to pipeline
type = "mc", # type of data being processed
slvl = 0L, # level to start pipelining on
elvl = 5L, # level to end pipelining on
mc.cores = mycores # number of computing cores
)

Option B:

## Assign the Number of Processing Cores.
mycores <- 1 # DO NOT leverage parallel computing.
# Users that want to leverage parallel computing should set this to a value
# greater than 1.

## Pipeline Assay Components (acid's) from level 0 to 3.
mc_res_0.3 <- tcpl::tcplRun(
id = atg.aeid[,acid], # assay component id to pipeline
type = "mc",
slvl = 0L, # level to start pipelining on
elvl = 3L,  # level to end pipelining on
mc.cores = mycores # number of computing cores
)

## Pipeline Assay Endpoints (aeid's) from level 4 to 5.
mc_res_4.5 <- tcpl::tcplRun(
id = atg.aeid[,aeid], # assay endpoint id to pipeline
type = "mc",
slvl = 4L, # level to start pipelining on
elvl = 5L,  # level to end pipelining on
mc.cores = mycores # number of computing cores
)

1. This section assumes a working knowledge of the concepts covered in the Data Processing and Data Normalization sections↩︎

2. This section assumes a working knowledge of the concepts covered in the Data Processing and Data Normalization sections↩︎