Introduction to sendigR

2022-08-16

The purpose of the sendigR package is to extract data from a set of nonclinical studies stored in CDISC SEND format in a database to be used for cross study analysis. It can either be done in a script by execution of a set of functions from the package to extract data for further processing or by execution an encapsulated R Shiny application.

The package supports currently these types of databases:

• SQLite It’s possible to access an existing SEND database or build a database with SEND data imported from SAS transport/xpt files, which then can be accessed.
• Oracle It’s possible to access an already existing SEND database

This vignettes introduces how to use sendigR to:

• initiate the use of the functions.
• build and maintain a SQLite database of SEND data.
• filter and extract SEND data from a database.
• execute the R Shiny application to browse the SEND control data stored in the database.

Familiarity with the CDISC SEND data model and at least basic knowledge about animal studies is necessary to follow this vignette and to use the package.

Getting started

Before we are ready to use the functions in the package, we must ensure that a minimum of prerequisites are fulfilled.

Package dependencies

Beside the required packages listed in the Imports section of DESCRIPTION, one or more additional package(s) must be installed:

• If the SEND databases is stored in an Oracle database:
• ROracle
• If a SEND database are to be build using this package’s functions
• SASxport
• sjlabelled

Database

If an existing SEND database is to be accessed, it must contain a set of tables and columns representing the domains and variables described in the SEND IG version 3.0 or 3.1 - or a union of the two versions.

It is required that all names and basic data types (i.e. numerical or character) are as described the the SEND IG.

Study data to import

If study data are to be imported into a SQLite database, data for each study must be located in a separate folder as SAS transport files - also konw as XPT files.

Multiple studies may be located together in a folder hierarchy as described below.

SEND controlled terminology

Several of the function in the packages uses CDISC SEND terminology code list to verify validity of the SEND data. The package includes the values for the relevant code lists extracted from the newest CT version at build time. If it is relevant to use another version - newer or older - this version must be downloaded in Excel format from NCI or CDISC library archives and saved in any folder which is accessible from the R script(s) using this package.

Function overview.

The package contains 5 sets of functions covering different areas - each of these are described below, including detailed examples.

Initiation and closure of SEND environment

Before any use of the package’s function, a database must be opened and the database must be closed again as the final action.

Function Description
initEnvironment Initialize the environment. It must be executed as the absolute first function before any other package functions. It opens an existing or create a new empty database and returns a handle to the database, which must be used in all following function calls.
disconnectDB Close the open database. It must be called as the absolute last function in a session.

The call of initEnvironment returns a handle with a reference to the open database and pointers to low level functions accessing the specific type of database. This handle must be given as the first input parameter in the execution of all other function in the package. It is possible to open and access multiple databases - also of different types - in the same R session. Just execute initEnvironment for each relevant database and save the returned db handles in each distinct variable.

Examples in the following sections demonstrates the use of these functions in a complete work flow.

Build a SQLite SEND database

It is possible to build an SQLite database with SEND data imported from a set of studies.

Function Description
dbCreateSchema Create a SEND schema, i.e. all the tables to represent the domains documented in SEND IG, in an open and empty database
dbCreateIndexes Create a set of indexes on the tables in a SEND database to optimize performance of extraction of data
dbImportOneStudy Import SEND study data in SAS xport format into a SEND database from a single study folder
dbImportStudies Import SEND study data in SAS xport format into a SEND database from a hierarchy study folders
dbDeleteStudies Delete one or more studies in SEND database

It should be possible to import data into an existing SQLite SEND database using dbImportOneStudy and dbImportStudies, even though the tables are not created by dbCreateSchema. It is possible to delete studies in any SQLite SEND database dbDeleteStudies - whether or not the tables are created by dbCreateSchema.

To make it possible to import data for a set of studies with dbImportStudies, the input data should be saved in folder structure like this:

/path/to/SEND/datasets
+-- study01
|   +-- ts.xpt
|   +-- dm.xpt
|   +-- ex.xpt
|   +-- etc.
+-- study02
|   +-- ts.xpt
|   +-- dm.xpt
|   +-- ex.xpt
|   +-- etc.
+-- proj1234
+-- study11
|   +-- ts.xpt
|   +-- dm.xpt
|   +-- ex.xpt
|   +-- etc.
+-- study12
|   +-- ts.xpt
|   +-- dm.xpt
|   +-- ex.xpt
|   +-- etc.
+-- etc.

It is recommended to download DB Browser for SQLite to look at the data imported into the database.

Examples

Create a new database, create all the SEND domain tables and import data for a set of studies XPT files saved in a folder hierarchy - create a set of indexes to optimize data extraction performance for the sendigR functions:

# The database is created by the call of  initEnvironment with the parameter dbCreate = TRUE:
dbToken <- initEnvironment(dbType='sqlite',
dbPath='/path/to/db/send.db',
dbCreate=TRUE)

# The tables must be created before any study can be imported
dbCreateSchema(dbToken)

status <- dbImportStudies(dbToken, '/path/to/SEND/datasets',
# Print contiously the status for import each study:
verbose = TRUE,
# sand save the status in a log file:
logFilePath = '/path/to/log file')

# Create a set of indexes to increase query performance for the data extraction functions
# - they may be created before of after import of data
dbCreateteIndexes(dbToken)

disconnectDB(dbToken)

Open an existing database and import data for a single study, allow it to replace potential existing data for the the same study in the database. Delete another study from the database:

dbToken <- initEnvironment(dbType='sqlite', dbPath='/path/to/db/send.db')

dbImportOneStudy(dbToken, '/path/to/SEND/datasets/proj1234/study11', overWrite=TRUE)

dbDeleteStudies(dbToken, 'study99')

disconnectDB(dbToken)

Extract and filter SEND data

These functions are divided into different groups:

• Fetch rows from data A single function genericQuery gives the possibility to execute a SQL query against the database, i.e.it executes any user defined select statement and returns the selected rows. It is used internally by the functions described below, but can also be used directly to fetcg rows which ar e not possible with any of the specific extraction functions described below.
• Extract sets of studies
Each of these functions extracts, filter and return studies for different kinds of conditions.
They can all be called with or without a set of studies given as input. In the former case, the extraction/filtering is based on the TS subset of input studies. In the latter case, the extraction/filtering is based on the full TS table.
The naming convention of these function are getStudiesTSPARMCD, where TSPARMCD is the specific TSPARMCD value handled by the function - e.g. STSTDTC.
• Extract full set of control animals
One function, getControlSubj, which extracts and return the set of control animals for a given input set of study id values.
Only ‘negative’ control animals are included in the set of control animals - i.e. animals which are not treated with a compound but only vehicle.
• Extract subsets of animals
Each of these functions take a set of animals as input, extracts and return a subset fulfilling a given condition - e.g. for species or route of administration.
It is possible to execute each of these functions with no conditions for filtering - it will then just return the set of input rows with additional variables for the actual kind of attributes, e.g. sex or species/strain.
The naming convention of these function are getSubjattributes where attributes is the specific attribute handled by the function - e.g. Sex.
• Extract subject level data
One function, getSubjData, which extracts and return all rows for a given input set of animals from a specified subject level table - it can be any SEND table including USUBJID column.
• Extract subsets of findings
Each of these functions take a set of finding rows as input, extracts and return a subset fulfilling a given condition - e.g. for a specific study phase.
It is possible to execute each of these functions with no conditions for filtering - it will then just return the set of input rows with additional variables for the actual kind of attributes, e.g. the study phase.
The naming convention of these function are getFindingsattributes where attributes is the specific attribute handled by the function - e.g. Phase.

The functions rely on the data.table package - and all input/output sets of data are data tables.

The complete list of functions:

Function Description
genericQuery Execute database query and returns fetched rows
getStudiesSDESIGN Extract a list of SEND studies with a specified study design
getStudiesSTSTDTC Extract a list of SEND studies with study start date within a specified interval
getControlSubj Extract a list of control animals for a list of studies
getSubjRoute Extract the set of animals of the specified route of administration - or just add actual route of administration for each animal.
getSubjSpeciesStrain Extract the set of animals of the specified species and strain - or just add the species and strain for each animal.
getSubjSex Extract the set of animals of the specified sex - or just add the sex of each animal.
getSubjData Extract data from a subject level domain.
getFindingsPhase Extract a set of findings for a specified study phase - or just add phase for each animal
getFindingsSubjAge Add the subject age at finding time - and optionally extract the set of findings within a specified range of ages
gen_vocab Create json file for vocabulary mappings
standardize_file Standardizes SEND xpt files using CDISC controlled terminologies

All the functions have an input parameter called inclUncertain, default value is FALSE.
If a function which is doing any kind of filtering is executed with:

• inclUncertain = FALSE
only rows where the filtering condition can be confidently matched are included in the output data set.
E.g. if getSubjSpeciesStrain cannot decide the species for an animal because the value does not match a value in the CDISC SEND CT code list ‘value SPECIES, the animal is excluded even though it could be a simple misspelling like ’RAAT’ instead of ‘RAT’
• inclUncertain = TRUE
both rows where the filtering condition can be confidently matched and rows where it’s not possible to decide that they does not match the filter condition, because the value cannot be decided with certainty, are included.
In the species example above, the animal with species ‘RAAT’ are included.
A variable named UNCERTAIN_MSG is included in the output data set - it contains an explanation of the uncertainty for the uncertain rows.
The format of such a message is: function id:explanation, where function id indicates from which function the message originates from, e.g. SpeciesStrain for function getSubjSpeciesStrain. Multiple messages from same function are separated by &.
If the input data set contains an UNCERTAIN_MSG variable, uncertain messages are added to this variable. Messages from multiple functions are concatenated and separated by |.

As mentioned above, some of the functions may be executed with no filter conditions just to add the relevant attributes to a set of data, e.g. a SEX variable is added to the output data set by execution of getSubjSex. In these cases, the additional variable is populated with the found value for all rows, also if the value cannot be confidently identified - e.g. when a DM.SEX value like ‘Male’ does not match any value in the CDISC SEND CT code list value SEX.
These functions have an input parameter called noFilterReportUncertain, which is equivalent to the inclUncertain parameter described above.
If one of these functions is executed with no filter condition and noFilterReportUncertain = FALSE (default), no explanation of whether the identified values are confident or not is included.
If a function is executed with no filter condition and noFilterReportUncertain = TRUE, a NOT_VALID_MSG variable is included in the output data set with an explanation of the reason for not confident values which may for found.
The format of this variable is equivalent to the UNCERTAIN_MSG variable described above.

The intention with these functions are to build a work flow or pipeline to extract set(s) of animals, findings or other kinds of subject level data to be used for further analysis or similar - see an example below.

Example

Extract data to make this analysis:

What are the most common microscopic findings in control Sprague-Dawley male rats 10 to 14 weeks old at finding time that were dosed by oral or oral gavage. Include parallel studies after 2016 and findings from the treatment phase. Include uncertain records.

db <- initEnvironment(dbType='sqlite',
dbPath='/path/to/db/send.db')

# Extract the relevant set of studies into data.table 'studies'
studies <-
getStudiesSDESIGN(db,
studyDesignFilter = 'PARALLEL',
inclUncertain = TRUE) %>%
getStudiesSTSTDTC(db, .,
fromDTC = '2016',
inclUncertain = TRUE)

# Extract the complete set of control animals for the set of extracted studies
# into data.table 'controlAnimals':
controlAnimals <-
getControlSubj(db, studies,
inclUncertain = TRUE)

# Extract all
#  1: the males
#  2: Sprague-Dawley rats
#  3: animals dosed by oral or oral gavage
# from the set of control animals into data.table 'animals':
animals <-
getSubjSex(db, controlAnimals,
sexFilter = 'M',
inclUncertain = TRUE) %>%
getSubjSpeciesStrain(db, .,
speciesFilter = 'RAT',
strainFilter = 'Sprague-Dawley',
inclUncertain = TRUE) %>%
getSubjRoute(db, .,
routeFilter = c('ORAL', 'ORAL GAVAGE'),
inclUncertain = TRUE)

# Extract microscopic findings
#  1: all MI rows for the extrated set of animals
#  2: keep the MI rows from the treatment phase
#  3: keep the MI rows where the animals are between 10 and 14 weeks at the time
#     of finding
# into data.table 'dataMI':
dataMI <-
getSubjData(db, animals, 'MI') %>%
getFindingsPhase(db, .,
phaseFilter = "treatment",
inclUncertain = TRUE) %>%
getFindingsSubjAge(db, .,
fromAge = '10w',
toAge = '14w',
inclUncertain = TRUE)

disconnectDB(db)

The resulting data.table dataMI contains all the extracted findings fulfilling the wanted conditions and can be used for further analysis.

It includes both findings

• for animals which may have been identified as uncertain by one of the study or subject level functions - this level of uncertainties are described in the animal.UNCERTAIN_MSG column
• and findings which may have been identified as uncertain by the findings level functions- this level of uncertainties are described in the dataMI.UNCERTAIN_MSG column.

Execute sendDashboard app

The sendigR package contains an encapsulated Shiny app - SendDashboard. The app enables the user to search and extract historical control data in a SEND database.

Function Description
execSendDashboard Starts the SendDashboard app

Example

Connect to the database, start the app. Remember to disconnect from database when leaving the app.

library(sendigR)
dbToken <- initEnvironment(dbType='sqlite', dbPath='/path/to/db/send.db')

execSendDashboard(dbToken)

disconnectDB(dbToken)