Pivot Table

David Gerbing

Overview

Descriptive Statistics

Descriptive summarize characteristics of the sample data values for one or more variables. The lessR pivot() function serves as a single source for a wide variety and types of descriptive statistics for one or more variables. Statistics are computed either for the entire data set at once or separately for different groups of data. MS Excel refers to the resulting statistical summaries as a pivot table.

Aggregation: Form groups of data according to the levels of one or more categorical variables, then compute some statistical value of a numeric variable, such as a mean, for each group.

As an example of aggregation, compute the mean Salary for each combination of the levels of Gender and Dept (department) of employment. Alternatively, compute the mean Salary for all of the available data.

This pivot() function computes statistics for three classes of variables:

The following table lists many available statistical functions that summarize data. One should be aware of the present and missing (not available) data that underlies each computed statistic, so pivot() automatically provides the count of each group in the aggregation.

Statistic Meaning
sum sum
mean arithmetic mean
median median
min minimum
max maximum
sd standard deviation
var variance
skew skew
kurtosis kurtosis
IQR inter-quartile range
mad mean absolute deviation

The statistics skew and kurtosis have no counterparts in base R, so are provided by lessR . Computations of all other statistics follow from base R functions.

The quantile and table computations return multiple values.

Statistic Meaning
quantile min, quartiles, max
table cell counts or proportions

The table computation applies to an aggregated variable that consists of discrete categories, such as the numbers 1 through 5 for responses to a 5-pt Likert scale. The result is a table of frequencies or proportions, referred to for two or more variables as either a contingency table, a cross-tabulation table, or a joint frequency distribution. Only the table computation applies to non-numeric as well as numeric variables, though only meaningful if the aggregated variable consists of a relatively small set of discrete values, character strings or numeric.

The default quantiles for quantile are quartiles. Specify a custom number of quantiles with the q_num parameter, which has the default value of 4 for quartiles.

Parameters

The following pivot() parameters specify the data, one or more statistics to compute for the aggregation, the variables over which to aggregate, and the corresponding groups that contain the aggregated values. The first three parameter values listed below are required: the data frame, at least one statistic to compute, and at least one variable for which to compute the statistic(s).

  1. data: The data frame that includes the variables of interest.
  2. compute: One or more functions that specify the corresponding statistics to compute.
  3. variable: One or more numerical variables to summarize, either by aggregation over groups or the entire sample as a single group.
  4. by: Specify the optional aggregation according to the categorical variable(s) that define the groups.
  5. by_cols: The optional categorical variable(s) that define the groups or cells for which to compute the aggregated values, listed as columns in a two-dimensional table.

For the given data, compute at least one statistic for at least one variable for each group specified by by and possibly by_cols.

If no categorical (by) variables are selected to define groups, then one or more statistics are computed over multiple variables over the entire data set defined as a single group. For categorical variables that define groups, with by and optionally by_cols, can choose either computer multiple statistics or multiple variable(s), but not both.

Key Idea: Select any two of the three possibilities for multiple parameter values: Multiple compute functions, multiple variables over which to compute, and multiple categorical variables by which to define groups for aggregation.

Specify multiple descriptive statistics to compute, multiple values for which to do the computation, and multiple categorical variables to define groups as vectors such as with the base R c() function.

Output

The output of pivot() is a two-dimensional table, rows and columns. The output table can have multiple rows and multiple columns according to the choice of parameter values. For each numerical variable in the analysis, pivot() displays both the corresponding sample size as n_ and amount of missing or not available data as na_. Prevent displaying sample size information by setting parameter show_n to FALSE.

The output follows one of three general forms.

Examples

To illustrate, use the 37-row Employee data set included with lessR, here read into the d data frame.

d <- Read("Employee")
## 
## >>> Suggestions
## Details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  15 ... 1  2  10
##  2    Gender character     37       0       2   M  M  M ... F  F  M
##  3      Dept character     36       1       5   ADMN  SALE  SALE ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   53788.26  94494.58 ... 56508.32  57562.36
##  5    JobSat character     35       2       3   med  low  low ... high  low  high
##  6      Plan   integer     37       0       3   1  1  3 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  96 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  97 ... 90  71  87
## ------------------------------------------------------------------------------------------

Two categorical variables in the d data frame are Dept and Gender. Continuous variables include Years worked at the company and annual Salary.

Output as a Long-Form Data Frame

Multiple Grouping Variables

Create the long-form pivot table as a data frame that expresses the mean of Salary aggregated across all combinations of Dept and Salary.

This example includes the parameter names (in red) for emphasis, but if the parameters are entered in this order, listing their names is not necessary. The name of the computed mean of Years by default is Years_mn.

##    Dept Gender n_Years na_Years Years_mn
## 1  ACCT      F       3        0    4.667
## 2  ADMN      F       4        0    7.500
## 3  FINC      F       1        0    7.000
## 4  MKTG      F       5        0    8.200
## 5  SALE      F       5        0    6.600
## 6  ACCT      M       2        0    7.000
## 7  ADMN      M       2        0   15.500
## 8  FINC      M       3        0   11.333
## 9  MKTG      M       1        0   18.000
## 10 SALE      M       9        1   12.333

With no by_cols variables, the output of pivot() is a data frame of the aggregated variables. This output can be saved for further analysis, as the data frame a in this example. Next, perform the same analysis, but with the variable Salary, and list the parameter values in order without the parameter names.

a <- pivot(d, mean, Salary, c(Dept, Gender))
a
##    Dept Gender n_Salary na_Salary Salary_mn
## 1  ACCT      F        3         0  63237.16
## 2  ADMN      F        4         0  81434.00
## 3  FINC      F        1         0  57139.90
## 4  MKTG      F        5         0  64496.02
## 5  SALE      F        5         0  64188.25
## 6  ACCT      M        2         0  59626.19
## 7  ADMN      M        2         0  80963.35
## 8  FINC      M        3         0  72967.60
## 9  MKTG      M        1         0  99062.66
## 10 SALE      M       10         0  86150.97

Visualize the aggregation with a bar chart generated by the lessR function BarChart(). The function can do the aggregation internally, but another option provides a data table that consists of categories with each category paired with a numerical value, that is, the output of pivot(). Each category defines a bar with its height based on the value of the numerical variable (see the lessR vignette for BarChart() for more details).

In this example, plot the pivot table from the analysis of the mean of Salary across levels of Dept and Gender. By default, BarChart() also displays the pivot table from which the bar chart is created.

BarChart(Dept, Salary_mn, by=Gender, data=a)

## Summary Table of Salary_mn 
## -------------------------- 
##  
##        Dept 
## Gender          ACCT        ADMN      FINC       MKTG       SALE 
##   F        63237.163   81434.003 57139.900  64496.022  64188.254 
##   M        59626.195   80963.345 72967.600  99062.660  86150.970

Using the lessR function Read(), can read the original data table from which the pivot table was constructed, such as in the form of an Excel worksheet. For many analyses, easier to read the Excel data into R and do the analysis in R than in Excel. The result can also be written back into an Excel file with the lessR function Write().

In this example, create an Excel file called MyPivotTable.xlsx from the pivot table stored in the a data frame. To avoid creating this file in this example, the function call is commented out with the # symbol in the first column.

#Write("MyPivotTable", data=a, format="Excel")

The abbreviation wrt_x() for the function name simplifies the preceding expression, with the format parameter dropped.

Aggregate with Multiple Statistics

In this next example, specify multiple statistics for which to aggregate for each group for the specified value variable Salary. For each group, compare the mean to the median, and the standard deviation to the interquartile range. By default, each column of an aggregated statistic is the variable name, here Salary, followed by a “_”, then either the name of the statistic or an abbreviation. The respective abbreviations for mean and median are mn and md.

pivot(d, c(mean, median, sd, IQR), Salary, Dept)
##   Dept n_Salary na_Salary Salary_mn Salary_md Salary_sd Salary_IQR
## 1 ACCT        5         0  61792.78  69547.60  12774.61   21379.23
## 2 ADMN        6         0  81277.12  71058.60  27585.15   36120.57
## 3 FINC        4         0  69010.68  61937.62  17852.50   16034.81
## 4 MKTG        6         0  70257.13  61658.99  19869.81   26085.69
## 5 SALE       15         0  78830.07  77714.85  23476.84   28810.28

Also have available two functions that are not part of base R: skew() and kurtosis().

pivot(d, c(mean,sd,skew,kurtosis), Salary, Dept, digits_d=3)
##   Dept n_Salary na_Salary Salary_mn Salary_sd Salary_sk Salary_kt
## 1 ACCT        5         0  61792.78  12774.61    -0.623    -3.032
## 2 ADMN        6         0  81277.12  27585.15     0.835    -1.185
## 3 FINC        4         0  69010.68  17852.50     1.689     2.752
## 4 MKTG        6         0  70257.13  19869.81     0.859    -1.458
## 5 SALE       15         0  78830.07  23476.84     0.863     0.856

Can also specify the variable names of the aggregated statistics with the out_names parameter. Here calculate the mean and median Salary for each group defined by each combination of levels for Gender and Dept.

pivot(d, c(mean, median), Salary, c(Gender,Dept), out_names=c("MeanSalary", "MedianSalary"))
##    Gender Dept n_Salary na_Salary MeanSalary MedianSalary
## 1       F ACCT        3         0   63237.16     71084.02
## 2       M ACCT        2         0   59626.19     59626.19
## 3       F ADMN        4         0   81434.00     71058.60
## 4       M ADMN        2         0   80963.35     80963.35
## 5       F FINC        1         0   57139.90     57139.90
## 6       M FINC        3         0   72967.60     66312.89
## 7       F MKTG        5         0   64496.02     61356.69
## 8       M MKTG        1         0   99062.66     99062.66
## 9       F SALE        5         0   64188.25     56508.32
## 10      M SALE       10         0   86150.97     82442.74

Aggregate Multiple Variables

The pivot() function can also aggregate over multiple variables. Here, aggregate Years and Salary. Round the numerical aggregated results to the nearest integer with the digits_d parameter, which specifies the number of decimal digits in the output. Different variables can have different amounts of missing data, so the sample size, n, and number missing, the number of values Not Available, na, are listed separately for each aggregated variable.

pivot(d, mean, c(Years, Salary), c(Dept, Gender), digits_d=0)
##    Dept Gender n_Years na_Years Years_mn n_Salary na_Salary Salary_mn
## 1  ACCT      F       3        0        5        3         0     63237
## 2  ADMN      F       4        0        8        4         0     81434
## 3  FINC      F       1        0        7        1         0     57140
## 4  MKTG      F       5        0        8        5         0     64496
## 5  SALE      F       5        0        7        5         0     64188
## 6  ACCT      M       2        0        7        2         0     59626
## 7  ADMN      M       2        0       16        2         0     80963
## 8  FINC      M       3        0       11        3         0     72968
## 9  MKTG      M       1        0       18        1         0     99063
## 10 SALE      M       9        1       12       10         0     86151

By default, the names of the aggregated variables are the same as the original variables with the added notation that identifies the aggregated statistic. Customize these names with the out_names parameter. If more than one variable, list the custom names in the same order as the vector of variable names. Here also turn off the display of the sample size and number of missing values for each group.

pivot(d, mean, c(Years, Salary), Dept, digits_d=2,
     out_names=c("YearsMean", "SalaryMean"), show_n=FALSE)
##   Dept YearsMean SalaryMean
## 1 ACCT      5.60   61792.78
## 2 ADMN     10.17   81277.12
## 3 FINC     10.25   69010.68
## 4 MKTG      9.83   70257.13
## 5 SALE     10.29   78830.06

Compute over All Data

Aggregation computes one or more statistics for one or more variables across groups defined by the possible combinations of the levels of one or more categorical variables. A related computation computes the variables for each statistic for all the data. To compute over all the rows of data, do not specify groups, that is, drop the by parameter.

Get the grand mean of Years, that is, for all the data.

pivot(d, mean, Years)
##            n_Years na_Years Years_mn
## Grand_Mean      36        1    9.389

Get the grand mean of Years and Salary. Specify custom names for the results.

pivot(d, mean, c(Years, Salary), digits_d=2, out_names=c("MeanYear", "MeanSalary"))
##            n_Years na_Years MeanYear n_Salary na_Salary MeanSalary
## Grand_Mean      36        1     9.39       37         0   73795.56

Consider an example with more variables. Analyze the 6-pt Likert scale responses to the Mach IV scale that assesses Machiavellianism. Items are scored from 0 to 5, Strongly Disagree to Strongly Agree. The data are included with lessR as the Mach4 data file.

Suppress output when reading by setting quiet to TRUE. Calculate the mean, standard deviation, skew, and kurtosis for all of the data for each of the 20 items on the scale. With this specification, the form of a data frame is statistics in the columns and the variables in the rows. The result are the specified summary statistics for the specified variables over the entire data set.

d <- Read("Mach4", quiet=TRUE)
pivot(d, c(mean,sd,skew,kurtosis), m01:m20)
##       n na  mean    sd   skew kurtosis
## m01 351  0 1.279 1.286  0.984    0.356
## m02 351  0 1.746 1.480  0.475   -0.784
## m03 351  0 2.900 1.450 -0.363   -0.829
## m04 351  0 3.339 1.174 -0.855    0.457
## m05 351  0 2.234 1.583  0.076   -1.095
## m06 351  0 3.074 1.478 -0.454   -0.923
## m07 351  0 2.775 1.473 -0.065   -1.150
## m08 351  0 2.100 1.456  0.133   -1.078
## m09 351  0 4.225 1.155 -1.745    2.743
## m10 351  0 3.991 1.138 -1.154    0.932
## m11 351  0 1.641 1.395  0.641   -0.393
## m12 351  0 1.801 1.625  0.422   -1.116
## m13 351  0 1.385 1.368  0.836   -0.126
## m14 351  0 1.954 1.304  0.201   -0.923
## m15 351  0 2.123 1.367 -0.088   -0.995
## m16 351  0 2.177 1.782  0.237   -1.355
## m17 351  0 2.407 1.604  0.085   -1.142
## m18 351  0 2.915 1.326 -0.559   -0.326
## m19 351  0 1.157 1.425  1.290    0.872
## m20 351  0 0.895 1.351  1.499    1.224

Frequency Tables

Aggregating a statistical computation of a continuous variable over groups with pivot(), such as computing the mean for each combination of Dept and Gender, by default includes the tabulation for each group (cell). A tabulation can be requested with no analysis of a numerical variable, instead only a counting of the available levels of the specified categorical variables.

The table value of compute specifies to compute the frequency table for a categorical aggregated variable across all combinations of the by variables. Specify one categorical variable with the remaining categorical variables specified with the by parameter. Missing values for each combination of the levels of the grouping variables are displayed.

Begin with a one-way frequency table computed over the entire data set.

pivot(d, table, m06)
##   m06 Freq  Prop
## 1   0   18 0.051
## 2   1   47 0.134
## 3   2   63 0.179
## 4   3   44 0.125
## 5   4  121 0.345
## 6   5   58 0.165

In this example, compute a two-way cross-tabulation table with the levels of variable m06 as columns and the levels of the by categorical variable m07 as rows.

pivot(d, table, m06, m07)
##   m07 n_m06 na_m06 0  1  2  3  4  5
## 1   0    17      0 4  3  2  3  3  2
## 2   1    64      0 7 24  7  6 18  2
## 3   2    87      0 4 14 30 13 24  2
## 4   3    43      0 2  1 10 16 12  2
## 5   4    93      0 0  3 13  5 56 16
## 6   5    47      0 1  2  1  1  8 34

To output the data in long form, one tabulation per row, set the table_long parameter to TRUE.

If interested in the inferential analysis of the cross-tabulation table, access the lessR function Prop_test() to obtain both the descriptive and inferential results, though limited to a one- or two-way table.

The default data table is d, but included explicitly in the following example to illustrate the data parameter.

Prop_test(m06, by=m07, data=d)
## variable: m06 
## by: m07 
## 
## >>> Pearson's Chi-squared test  <<< 
## 
## >>> Description
## 
##      m06
## m07     0   1   2   3   4   5 Sum
##   0     4   3   2   3   3   2  17
##   1     7  24   7   6  18   2  64
##   2     4  14  30  13  24   2  87
##   3     2   1  10  16  12   2  43
##   4     0   3  13   5  56  16  93
##   5     1   2   1   1   8  34  47
##   Sum  18  47  63  44 121  58 351
## 
##  Cramer's V: 0.380 
## 
##  Row Col Observed Expected Residual Stnd Res
##    1   1        4    0.872    3.128    3.526
##    1   2        3    2.276    0.724    0.528
##    1   3        2    3.051   -1.051   -0.681
##    1   4        3    2.131    0.869    0.652
##    1   5        3    5.860   -2.860   -1.496
##    1   6        2    2.809   -0.809   -0.542
##    2   1        7    3.282    3.718    2.330
##    2   2       24    8.570   15.430    6.263
##    2   3        7   11.487   -4.487   -1.616
##    2   4        6    8.023   -2.023   -0.844
##    2   5       18   22.063   -4.063   -1.182
##    2   6        2   10.575   -8.575   -3.192
##    3   1        4    4.462   -0.462   -0.259
##    3   2       14   11.650    2.350    0.853
##    3   3       30   15.615   14.385    4.634
##    3   4       13   10.906    2.094    0.782
##    3   5       24   29.991   -5.991   -1.558
##    3   6        2   14.376  -12.376   -4.119
##    4   1        2    2.205   -0.205   -0.151
##    4   2        1    5.758   -4.758   -2.274
##    4   3       10    7.718    2.282    0.968
##    4   4       16    5.390   10.610    5.216
##    4   5       12   14.823   -2.823   -0.967
##    4   6        2    7.105   -5.105   -2.238
##    5   1        0    4.769   -4.769   -2.615
##    5   2        3   12.453   -9.453   -3.357
##    5   3       13   16.692   -3.692   -1.164
##    5   4        5   11.658   -6.658   -2.432
##    5   5       56   32.060   23.940    6.092
##    5   6       16   15.368    0.632    0.206
##    6   1        1    2.410   -1.410   -1.002
##    6   2        2    6.293   -4.293   -1.976
##    6   3        1    8.436   -7.436   -3.037
##    6   4        1    5.892   -4.892   -2.315
##    6   5        8   16.202   -8.202   -2.705
##    6   6       34    7.766   26.234   11.071
## 
## >>> Inference
## 
## Chi-square statistic: 253.103 
## Degrees of freedom: 25 
## Hypothesis test of equal population proportions: p-value = 0.000

Can also aggregate other statistics simultaneously in addition to the frequency table, though, of course, only meaningful if the aggregated variable is numerical. Here, create a 3-way cross-tabulation table with responses to variable m06 in the column and responses to by variables m07 and m10 in the rows, plus the mean and standard deviation of each combination of m07 and m10 across levels of m06.

pivot(d, c(mean,sd,table), m06, c(m07, m10))
##    m07 m10 n_m06 na_m06 m06_mn m06_sd 0 1 2 3  4  5
## 1    0   0     1      0  0.000     NA 1 0 0 0  0  0
## 2    1   0     1      0  1.000     NA 0 1 0 0  0  0
## 3    2   0     0      0     NA     NA 0 0 0 0  0  0
## 4    3   0     1      0  2.000     NA 0 0 1 0  0  0
## 5    4   0     1      0  2.000     NA 0 0 1 0  0  0
## 6    5   0     0      0     NA     NA 0 0 0 0  0  0
## 7    0   1     0      0     NA     NA 0 0 0 0  0  0
## 8    1   1     4      0  0.750  0.500 1 3 0 0  0  0
## 9    2   1     2      0  1.500  0.707 0 1 1 0  0  0
## 10   3   1     1      0  2.000     NA 0 0 1 0  0  0
## 11   4   1     0      0     NA     NA 0 0 0 0  0  0
## 12   5   1     0      0     NA     NA 0 0 0 0  0  0
## 13   0   2     2      0  2.000  1.414 0 1 0 1  0  0
## 14   1   2     9      0  1.222  1.202 2 5 1 0  1  0
## 15   2   2    15      0  2.000  0.845 0 4 8 2  1  0
## 16   3   2     3      0  3.000  0.000 0 0 0 3  0  0
## 17   4   2     3      0  3.333  1.155 0 0 1 0  2  0
## 18   5   2     0      0     NA     NA 0 0 0 0  0  0
## 19   0   3     5      0  1.200  1.304 2 1 1 1  0  0
## 20   1   3    16      0  2.250  1.390 2 3 4 3  4  0
## 21   2   3    13      0  2.769  0.832 0 0 6 4  3  0
## 22   3   3     7      0  2.000  1.528 2 0 2 2  1  0
## 23   4   3     9      0  3.333  1.323 0 1 2 0  5  1
## 24   5   3     0      0     NA     NA 0 0 0 0  0  0
## 25   0   4     7      0  2.714  1.799 1 1 1 1  2  1
## 26   1   4    19      0  2.895  1.449 0 6 1 2  9  1
## 27   2   4    29      0  2.690  1.417 2 4 8 4  9  2
## 28   3   4    18      0  3.278  1.074 0 1 3 6  6  2
## 29   4   4    32      0  3.781  1.008 0 0 6 2 17  7
## 30   5   4     5      0  4.400  0.548 0 0 0 0  3  2
## 31   0   5     2      0  4.500  0.707 0 0 0 0  1  1
## 32   1   5    15      0  2.133  1.685 2 6 1 1  4  1
## 33   2   5    28      0  2.571  1.372 2 5 7 3 11  0
## 34   3   5    13      0  3.154  0.801 0 0 3 5  5  0
## 35   4   5    48      0  3.854  0.922 0 2 3 3 32  8
## 36   5   5    42      0  4.452  1.234 1 2 1 1  5 32

Can also express the frequencies as proportions. To convert the frequencies into proportions, invoke the table_prop parameter. The value of "all" computes cell frequencies across the entire table. The values of "row" and "col" compute the proportions with either row sums or column sums.

In this example of a two-way cross-tabulation table, convert the table counts to row proportions, that is, the proportion of each level of m06 for each combination of levels for m07. The sum of the proportions in each row is 1.0.

pivot(d, table, m06, m07, table_prop="row")
## 
## Proportions computed over row cells
##   m07 n_m06 na_m06     0     1     2     3     4     5
## 1   0    17      0 0.235 0.176 0.118 0.176 0.176 0.118
## 2   1    64      0 0.109 0.375 0.109 0.094 0.281 0.031
## 3   2    87      0 0.046 0.161 0.345 0.149 0.276 0.023
## 4   3    43      0 0.047 0.023 0.233 0.372 0.279 0.047
## 5   4    93      0 0.000 0.032 0.140 0.054 0.602 0.172
## 6   5    47      0 0.021 0.043 0.021 0.021 0.170 0.723

Return to the Employee data set for the remaining examples.

d <- Read("Employee", quiet=TRUE)

Quantiles

One way to understand the characteristics of a distribution of data values of a continuous variable is to sort the values and then split into equal-sized groups. The simplest example is the median, which splits a distribution into two groups, the bottom lowest values and the top highest values. Quartiles divide the distribution into four groups. The first quartile is the smallest 25% of the data values, etc.

Quantiles: Divide a distribution of sorted data values into n groups.

By default, calling the quantile function computes quartiles. Here calculate the quartiles for Years aggregated across levels of Dept and Gender.

pivot(d, quantile, Years, c(Dept, Gender))
##    Dept Gender n_ na_ Years_0 Years_25 Years_50 Years_75 Years_100
## 1  ACCT      F  3   0       2     2.50      3.0     6.00         9
## 2  ADMN      F  4   0       2     3.50      5.0     9.00        18
## 3  FINC      F  1   0       7     7.00      7.0     7.00         7
## 4  MKTG      F  5   0       1     4.00      8.0    10.00        18
## 5  SALE      F  5   0       2     3.00      8.0    10.00        10
## 6  ACCT      M  2   0       5     6.00      7.0     8.00         9
## 7  ADMN      M  2   0       7    11.25     15.5    19.75        24
## 8  FINC      M  3   0      10    10.00     10.0    12.00        14
## 9  MKTG      M  1   0      18    18.00     18.0    18.00        18
## 10 SALE      M  9   1       5     9.00     13.0    14.00        21

To compute other than quantiles, invoke the q_num parameter, the number of quantile intervals. The default value is 4 for quartiles. In the following example, compute the quintiles for Years and Salary, plus the mean and standard deviation. No specification of by, so these descriptive statistics are computed over the entire data set for both specified variables.

pivot(d, c(mean,sd,quantile), c(Years,Salary), q_num=5, digits_d=2)
##         n na     mean       sd      p_0     p_20     p_40     p_60     p_80    p_100
## Years  36  1     9.39     5.72     1.00     4.00     8.00    10.00    14.00     24.0
## Salary 37  0 73795.56 21799.53 46124.97 55737.86 63701.93 72430.04 92415.42 134419.2

Other Features

Drill-Down

One data analysis strategy examines the values of a variable, such as Sales for a business or Mortality for an epidemiology study, at progressively finer levels of detail. Examine by Country or State or City or whatever level of granularity is appropriate.

Data drill down: Examine the values of a variable when holding the values of one or more categorical variables constant.

If drilling down into the data, pivot() indicates the drill-down with a display of all categorical variables with unique values that precedes the primary output. Initiate the drill-down by a previous subset of the data frame, or by pivot() directly. As with other lessR analysis functions, the rows parameter specifies a logical condition for which to subset rows of the data frame for analysis.

In this example, compute the mean of Salary for each level of Dept for just those rows of data with the value of Gender equal to “F”.

pivot(d, mean, Salary, Dept, rows=(Gender=="F"))
## Gender: F
##   Dept n_Salary na_Salary Salary_mn
## 1 ACCT        3         0  63237.16
## 2 ADMN        4         0  81434.00
## 3 FINC        1         0  57139.90
## 4 MKTG        5         0  64496.02
## 5 SALE        5         0  64188.25

The parentheses for the rows parameter are not necessary, but does enhance readability.

Can also drill down by subsetting the data frame with a logical condition directly in the data parameter in the call to pivot(). Here drill down with base R Extract[ ] in conjunction with the lessR function .() to simplify the syntax (explained in the vignette Subset a Data Frame). The Extract[ ] function specifies the rows of the data frame to extract before the comma, and the columns to extract after the comma. Here select only those rows of data with Gender declared as Female. There is no information after the comma, so no columns are specified, which means to retain all columns, the variables in the data frame.

pivot(d[.(Gender=="F"),], mean, Salary, Dept)
## Gender: F
##   Dept n_Salary na_Salary Salary_mn
## 1 ACCT        3         0  63237.16
## 2 ADMN        4         0  81434.00
## 3 FINC        1         0  57139.90
## 4 MKTG        5         0  64496.02
## 5 SALE        5         0  64188.25

Sort Output

Specify the sort as part of the call to pivot() with the parameter sort. This internal sort works for a single value variable, by default the last column in the output data frame. Set to "-" for a descending sort. Set to "+" for an ascending sort.

pivot(d, mean, Salary, c(Dept, Gender), sort="-")
##    Dept Gender n_Salary na_Salary Salary_mn
## 9  MKTG      M        1         0  99062.66
## 10 SALE      M       10         0  86150.97
## 2  ADMN      F        4         0  81434.00
## 7  ADMN      M        2         0  80963.35
## 8  FINC      M        3         0  72967.60
## 4  MKTG      F        5         0  64496.02
## 5  SALE      F        5         0  64188.25
## 1  ACCT      F        3         0  63237.16
## 6  ACCT      M        2         0  59626.19
## 3  FINC      F        1         0  57139.90

Specify the sort_var parameter to specify the name or the column number of the variable to sort.

pivot(d, c(mean, median), Salary, c(Gender,Dept), out_names=c("MeanSalary", "MedianSalary"),
      sort="-", sort_var="MeanSalary")
##    Gender Dept n_Salary na_Salary MeanSalary MedianSalary
## 8       M MKTG        1         0   99062.66     99062.66
## 10      M SALE       10         0   86150.97     82442.74
## 3       F ADMN        4         0   81434.00     71058.60
## 4       M ADMN        2         0   80963.35     80963.35
## 6       M FINC        3         0   72967.60     66312.89
## 7       F MKTG        5         0   64496.02     61356.69
## 9       F SALE        5         0   64188.25     56508.32
## 1       F ACCT        3         0   63237.16     71084.02
## 2       M ACCT        2         0   59626.19     59626.19
## 5       F FINC        1         0   57139.90     57139.90

Because the output of pivot() with no by_cols variables is a standard R data frame, the external call to the lessR function Sort() is available for custom sorting by one or more variables. Sort in the specified direction with the direction parameter.

a <- pivot(d, mean, Salary, c(Dept, Gender))
Sort(a, by=Salary_mn, direction="-")
## 
## Sort Specification
##   5 -->  descending
##    Dept Gender n_Salary na_Salary Salary_mn
## 9  MKTG      M        1         0  99062.66
## 10 SALE      M       10         0  86150.97
## 2  ADMN      F        4         0  81434.00
## 7  ADMN      M        2         0  80963.35
## 8  FINC      M        3         0  72967.60
## 4  MKTG      F        5         0  64496.02
## 5  SALE      F        5         0  64188.25
## 1  ACCT      F        3         0  63237.16
## 6  ACCT      M        2         0  59626.19
## 3  FINC      F        1         0  57139.90

Specify multiple variables to sort with a vector of variable names, and a corresponding vector of "+" and "-" signs of the same length for the directions parameter.

Pipe Operator

The following illustrates as of R 4.1.0 the base R pipe operator |> with pivot(). The pipe operator by default inserts the object on the left-hand side of an expression into the first parameter value for the function on the right-hand side.

In this example, input the d data frame into the first parameter of pivot(), the data parameter. Then direct the output to the data frame a with the standard R assignment statement, though written pointing to the right hand side of the expression.

To avoid problems installing this version of lessR from source with a previous version of R, the code is commented out with a # sign in the first column.

#d |> pivot(mean, Salary, c(Dept, Gender)) -> a
#a

Output as a 2-D Table

Specify up to two by_cols categorical variables to create a two-dimensional table with the specified columns. Specifying one or two categorical variables as by_cols variables moves them from their default position in the rows to the columns, which changes the output structure from a long-form data frame to a cross-tabulation table with categorical variables in the rows and columns.

In this example, specify by by_cols variable, Gender.

pivot(d, mean, Salary, Dept, Gender)
Table: mean of Salary 

 Gender  F          M               
 Dept                        
-------  ---------  ---------
   ACCT   63237.16   59626.20
   ADMN   81434.00   80963.34
   FINC   57139.90   72967.60
   MKTG   64496.02   99062.66
   SALE   64188.25   86150.97

Here two by_cols variables, specified as a vector. There is much missing data for this three-way classification as there is not much data in each group, with many groups having no data.

pivot(d, mean, Salary, Dept, c(Gender, Plan))
Table: mean of Salary 

 Gender  F                                M                                      
 Plan    1          2          3          1          2           3             
 Dept                                                                     
-------  ---------  ---------  ---------  ---------  ----------  ---------
   ACCT         NA   63237.16         NA   69547.60          NA   49704.79
   ADMN         NA   67724.21   122563.4   53788.26   108138.43         NA
   FINC         NA   57139.90         NA   61937.62          NA   95027.55
   MKTG   56772.95   66426.79         NA         NA          NA   99062.66
   SALE   60941.54   66352.73         NA   89393.40    82442.74   80131.91

Missing Data

Several possibilities exist when missing data are present. There are different types of missing data in the aggregation: missing values of one or more by variables and missing values for one or more aggregated variables.

Missing Data Values for Aggregated Variables

The pivot table follows with the listed missing data for each group. The variable Years has one missing value, which occurs in the sales department.

The na_remove parameter specifies the value of the base R parameter na.rm for computations such as for the mean. See ?mean for its definition.

Here, include all the cells, even those with a missing aggregated value. The default value is TRUE, so the na_remove parameter need not be specified if this is the appropriate value.

pivot(d, mean, Years, Dept)
##   Dept n_Years na_Years Years_mn
## 1 ACCT       5        0    5.600
## 2 ADMN       6        0   10.167
## 3 FINC       4        0   10.250
## 4 MKTG       6        0    9.833
## 5 SALE      14        1   10.286

Set na_remove to FALSE to not remove any missing data in a cell with values to be aggregated. The resulting aggregated value will be missing if any of the constituent data values are missing.

pivot(d, mean, Years, Dept, na_remove=FALSE)
##   Dept n_Years na_Years Years_mn
## 1 ACCT       5        0    5.600
## 2 ADMN       6        0   10.167
## 3 FINC       4        0   10.250
## 4 MKTG       6        0    9.833
## 5 SALE      14        1       NA

Missing Data Values for by Variables

To account for missing values of the categorical by variables, first use base R Extract[] with lessR .() to filter some missing data by dropping the one male in the Sales department. Save the result into the dd data frame.

dd <- d[.(!(Gender=="M" & Dept=="SALE")), ]

Explicitly set na_by_show to TRUE, the default value. The group for male sales employees is shown with the values of the computed variables n_ and na_ set to 0 and the values of all other variables necessarily missing.

pivot(dd, c(mean,median), Years, c(Dept, Gender), na_by_show=TRUE)
##    Dept Gender n_Years na_Years Years_mn Years_md
## 1  ACCT      F       3        0    4.667      3.0
## 2  ADMN      F       4        0    7.500      5.0
## 3  FINC      F       1        0    7.000      7.0
## 4  MKTG      F       5        0    8.200      8.0
## 5  SALE      F       5        0    6.600      8.0
## 6  ACCT      M       2        0    7.000      7.0
## 7  ADMN      M       2        0   15.500     15.5
## 8  FINC      M       3        0   11.333     10.0
## 9  MKTG      M       1        0   18.000     18.0
## 10 SALE      M       0        0       NA       NA

Drop the groups from the output with missing data for a by variable. To do so, set na_by_show to FALSE. Now the group for the non-existent male sales employee does not show.

pivot(dd, c(mean,median), Years, c(Dept, Gender), na_by_show=FALSE)
##   Dept Gender n_Years na_Years Years_mn Years_md
## 1 ACCT      F       3        0    4.667      3.0
## 2 ADMN      F       4        0    7.500      5.0
## 3 FINC      F       1        0    7.000      7.0
## 4 MKTG      F       5        0    8.200      8.0
## 5 SALE      F       5        0    6.600      8.0
## 6 ACCT      M       2        0    7.000      7.0
## 7 ADMN      M       2        0   15.500     15.5
## 8 FINC      M       3        0   11.333     10.0
## 9 MKTG      M       1        0   18.000     18.0

lessR pivot() vs Base R aggregate()

The lessR pivot() function relies upon the base R function aggregate() for aggregation. By default, except for the table computation, pivot() generates a long-form data frame pivot table (Excel terminology), which can then be directly input into analysis and visualization functions as a standard data frame. The levels across all the by grouping variables are listed in the rows. If there are specified column grouping variables according to by_cols, pivot() relies upon base R reshape() to form a 2-d table for direct viewing instead of a data table to input into further analysis functions.

pivot() provides additional features than aggregate() provides.

  1. For each value over which to aggregate, the sample size and number of missing values for each group is provided.
  2. Multiple statistical functions can be selected for which to compute the aggregated value for each group.
  3. Extends beyond aggregation to compute statistics over the entire data set instead of groups of data.
  4. Missing data analysis by cell or by the value aggregated.
  5. Aggregation not necessary, so can compute the specified statistic(s) for each variable across the entire data set.
  6. The aggregated computations can be displayed as a 2-d table, not just a long-form data frame.
  7. by variables of type Date, character and integer retain the same variable type in the summary table instead of each converted to a factor (set factors=TRUE to get factors instead).
  8. The list of parameters lists the data parameter first, which facilitates the use of the pipe operator, such as from base R as of Version 4.1.0 or the magrittr package.
  9. Any non-numeric variables with unique values in the submitted data are listed with their corresponding data value, which identifies when drilling down into the data to study relevant rows.

Although the pivot() function does considerably extend the functionality of base R aggregate(), pivot() does rely upon the base R function for most of its computations.