explore

Roland Krasser

2019-06-16

Interactive data exploration with one line of code or use a easy to remember set of tidy functions for exploratory data analysis. Introduces two main verbs. describe() to describe a variable or table, explore() to grafically explore a variable or table.

explore package on Github: https://github.com/rolkra/explore

As the explore-functions fits well into the tidyverse, we load the dplyr-package as well.

library(dplyr)
library(explore)

Interactive data exploration

Explore your dataset (in this case the iris dataset) in one line of code:

explore(iris)

A shiny app is launched, you can inspect individual variable, explore their relation to a binary target, grow a decision tree or create a fully automated report of all variables with a few “mouseclicks”.

You can choose each variable containng 0/1, FALSE/TRUE or “no”/“yes” as a target. As the iris dataset doesn’t contain a binary target, we create one:

iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris %>% explore()

Report variables

Create a rich HTML report of all variables with one line of code:

# report of all variables
iris %>% report(output_file = "report.html", output_dir = tempdir())

Or you can simply add a target and create the report:

# report of all variables and their relationship with a binary target
iris$is_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris %>% 
  report(output_file = "report.html", 
         output_dir = tempdir(),
         target = is_versicolor)

Grow a decision tree

Grow a decision tree with one line of code:

iris %>% select(-Species) %>% explain_tree(target = is_versicolor)

You can control the growth of the tree using the parameters maxdepth, minsplit and cp.

Explore variables

Explore a variable with one line of code. You don’t have to care if a variable is numerical or categorical.

iris %>% explore(Species)

iris %>% explore(Sepal.Length)

Explore variables with a target

Explore a variable and its relationship with a binary target with one line of code. You don’t have to care if a variable is numerical or categorical.

iris %>% explore(Species, target = is_versicolor)

iris %>% explore(Sepal.Length, target = is_versicolor)

Explore multiple variables

iris %>% 
  select(Sepal.Length, Sepal.Width) %>% 
  explore_all()

iris %>% 
  select(Sepal.Length, Sepal.Width, is_versicolor) %>% 
  explore_all(target = is_versicolor)

iris %>% 
  select(Sepal.Length, Sepal.Width, is_versicolor) %>% 
  explore_all(target = is_versicolor, density = FALSE)

Explore correlation between two variables

Explore correlation between two variables with one line of code:

iris %>% explore(Sepal.Length, Petal.Length)

You can add a target too:

iris %>% explore(Sepal.Length, Petal.Length, target = is_versicolor)

Other options

If you use explore to explore a variable and want to set lower and upper limits for values, you can use the min_val and max_val parameters. All values below min_val will be set to min_val. All values above max_val will be set to max_val.

iris %>% explore(Sepal.Length, min_val = 4.5, max_val = 7)

explore uses auto-scale by default. To deactivate it use the parameter auto_scale = FALSE

You can force the explore-function to use bars with the parameter density = FALSE

iris %>% explore(Sepal.Length, density = FALSE, auto_scale = FALSE)

Describing data

Describe your data in one line of code:

iris %>% describe()
#>        variable type na na_pct unique min mean max
#> 1  Sepal.Length  dou  0      0     35 4.3 5.84 7.9
#> 2   Sepal.Width  dou  0      0     23 2.0 3.06 4.4
#> 3  Petal.Length  dou  0      0     43 1.0 3.76 6.9
#> 4   Petal.Width  dou  0      0     22 0.1 1.20 2.5
#> 5       Species  fct  0      0      3  NA   NA  NA
#> 6 is_versicolor  dou  0      0      2 0.0 0.33 1.0

The result is a data-frame, where each row is a variable of your data. You can use filter from dplyr for quick checks:

# show all variables that contain less than 5 unique values
iris %>% describe() %>% filter(unique < 5)
#>        variable type na na_pct unique min mean max
#> 1       Species  fct  0      0      3  NA   NA  NA
#> 2 is_versicolor  dou  0      0      2   0 0.33   1
# show all variables contain NA values
iris %>% describe() %>% filter(na > 0)
#> [1] variable type     na       na_pct   unique   min      mean     max     
#> <0 rows> (or 0-length row.names)

You can use describe for describing variables too. You don’t need to care if a variale is numerical or categorical. The output is a text.

# describe a numerical variable
iris %>% describe(Species)
#> variable = Species 
#> type     = factor
#> na       = 0 of 150 (0%)
#> unique   = 3
#>  setosa     = 50 (33.3%)
#>  versicolor = 50 (33.3%)
#>  virginica  = 50 (33.3%)
# describe a categorical variable
iris %>% describe(Sepal.Length)
#> variable = Sepal.Length 
#> type     = double 
#> na       = 0 of 150 (0%)
#> unique   = 35
#> min|max  = 4.3 | 7.9
#> q05|q95  = 4.6 | 7.3
#> q25|q75  = 5.1 | 6.4
#> median   = 5.8 
#> mean     = 5.8

Data Dictionary

Create a Data Dictionary of a dataset (Markdown File data_dict.md)

iris %>% data_dict_md(output_dir = tempdir())

Add title, detailed descriptions and change default filename

description <- data.frame(
                  variable = c("Species"), 
                  description = c("Species of Iris flower"))
data_dict_md(iris, 
             title = "iris flower data set", 
             description =  description, 
             output_file = "data_dict_iris.md",
             output_dir = tempdir())

Basic data cleaning

To clean a variable you can use clean_var. With one line of code you can rename a variable, replace NA-values and set a minimum and maximum for the value.

iris %>% 
  clean_var(Sepal.Length, 
            min_val = 4.5, 
            max_val = 7.0, 
            na = 5.8, 
            name = "sepal_length") %>% 
  describe()
#>        variable type na na_pct unique min mean max
#> 1  sepal_length  dou  0      0     26 4.5 5.81 7.0
#> 2   Sepal.Width  dou  0      0     23 2.0 3.06 4.4
#> 3  Petal.Length  dou  0      0     43 1.0 3.76 6.9
#> 4   Petal.Width  dou  0      0     22 0.1 1.20 2.5
#> 5       Species  fct  0      0      3  NA   NA  NA
#> 6 is_versicolor  dou  0      0      2 0.0 0.33 1.0

Connecting to a datawarehouse

The explore package comes with a set easy to remember function to connect, read and write from/to a datawarehouse (dwh) using odbc.

# connect to a dwh(odbc DSN must be defined)
dwh <- dwh_connect("DWH_DSN")

# if you need to pass user and password
dwh <- dwh_connect("DWH_DSN", 
                    user = "myuser",
                    pwd = rstudioapi::askForPassword()
                  )

# read table from a dwh
data <- dwh_read_table(dwh, "db.tablename")

# read data from a dwh using sql
data <- dwh_read_data(dwh, sql = "select * from db.tablename")

# disconnect from dwh
dwh_disconnect(dwh)

To write large data to a dwh you can use dwh_fastload(). It connects to a dwh, writes the data and disconnects.

# connect to a dwh(odbc DSN must be defined)
data  %>% dwh_fastload("DWH_DSN", "db.tablename")