Descriptive analysis of statistical associations with GDAtools

Nicolas Robette

2021-05-31

 

The GDAtools package provides functions dedicated to the description of statistical associations between variables. They are based on effect size measures (also called association measures).

All these measures are built from simple concepts (correlation, proportion of variance explained). They are bounded (between -1 and 1, or between 0 and 1) and are not sensitive to the number of observations.

The measures of global association are the following.

  • For the relationship between two categorical variables: Cramér’s V which, unlike the chi-square statistic, is not sensitive to the number of observations or to the number of categories of the variables. It varies between 0 (no association) and 1 (perfect association). Squared, it can be interpreted as the share of variation common to the two variables.

  • For the relationship between two numerical variables: Kendall’s (tau) or Spearman’s (rho) rank correlations, which detect monotonic relationships between variables, and not only linear ones as is the case with Pearson’s linear correlation. They vary between -1 and 1. An absolute value of 0 indicates no association, an absolute value of 1 a perfect association. The sign indicates the direction of the relationship.

  • For the relationship between a categorical variable and a numerical variable: the square of the correlation ratio (eta²). It expresses the proportion of the variance of the numerical variable “explained” by the categorical variable and varies between 0 and 1.
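As a rough illustration of these three global measures, the base R sketch below computes them by hand. The helper names `cramers_v` and `eta_squared`, the small table of counts and the vectors `x` and `y` are assumptions made for this example; they are not part of GDAtools. The built-in `iris` data set is used for the correlation ratio.

```r
# Cramér's V from a contingency table (chi-square without continuity correction).
cramers_v <- function(tab) {
  n <- sum(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / n
  chi2 <- sum((tab - expected)^2 / expected)
  sqrt(chi2 / (n * (min(dim(tab)) - 1)))
}

# Hypothetical 3 x 2 table of counts.
tab <- matrix(c(30, 10,
                20, 20,
                10, 30), nrow = 3, byrow = TRUE)

v_small <- cramers_v(tab)
v_large <- cramers_v(tab * 10)  # same proportions, ten times more observations

# Rank correlations on a monotonic but non-linear relationship.
x <- 1:20
y <- exp(x / 4)
rho <- cor(x, y, method = "spearman")
tau <- cor(x, y, method = "kendall")
r   <- cor(x, y, method = "pearson")

# Squared correlation ratio: between-group sum of squares over total sum of squares.
eta_squared <- function(num, cat) {
  group_means <- ave(num, cat)  # mean of each observation's own group
  sum((group_means - mean(num))^2) / sum((num - mean(num))^2)
}
eta2 <- eta_squared(iris$Sepal.Length, iris$Species)
```

In this sketch, `v_small` and `v_large` are equal although the chi-square is ten times larger in the second case, and `rho` and `tau` both reach 1 while Pearson's `r` stays below 1. The value of `eta2` coincides with the R² of a one-way analysis of variance of `Sepal.Length` on `Species`.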

In addition to measures of global association, we also use measures of local association, i.e. at the level of the categories of the variables.

  • For the relationship between two categorical variables: the phi coefficient measures the attraction or repulsion in a cell of a contingency table. It varies between -1 and 1. An absolute value of 0 indicates an absence of association, an absolute value of 1 a perfect association. There is attraction if the sign is positive, repulsion if the sign is negative. Squared, phi is interpreted as the proportion of variance shared by the two binary variables associated with the categories studied. Unlike the test value, phi is not sensitive to the sample size.

  • For the relationship between a categorical variable and a numerical variable: the point biserial correlation measures the magnitude of the difference between the means of the numerical variable according to whether or not one belongs to the category studied. It varies between -1 and 1. An absolute value of 0 indicates no association, an absolute value of 1 a perfect association. The sign indicates the direction of the relationship. When squared, point biserial correlation can be interpreted as the proportion of variance of the numerical variable “explained” by the category of the categorical variable.

Note that if we code the categories of the categorical variables as binary variables with values of 0 or 1, the phi coefficient and the point biserial correlation are equivalent to Pearson’s correlation coefficient.
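The equivalence with Pearson’s correlation can be checked with a short base R sketch. The simulated variables `group`, `label` and `score` are hypothetical and only serve the illustration; the formulas are the usual textbook definitions of phi and of the point biserial correlation, not code taken from GDAtools.

```r
set.seed(1)

# Hypothetical data: a 3-level factor, a 2-level factor and a numerical variable.
group <- factor(sample(c("a", "b", "c"), 200, replace = TRUE))
label <- factor(ifelse(runif(200) < ifelse(group == "a", 0.7, 0.3), "yes", "no"))
score <- rnorm(200, mean = ifelse(group == "a", 1, 0))

# 0/1 indicators for the categories under study.
is_a   <- as.numeric(group == "a")
is_yes <- as.numeric(label == "yes")

# Phi for the cell (a, yes), from the 2 x 2 table of the two indicators.
tab <- table(is_a, is_yes)
phi <- (tab["1", "1"] * tab["0", "0"] - tab["1", "0"] * tab["0", "1"]) /
  sqrt(prod(rowSums(tab)) * prod(colSums(tab)))

# Point biserial correlation between score and membership of category "a".
m1 <- mean(score[is_a == 1])
m0 <- mean(score[is_a == 0])
p  <- mean(is_a)
sd_pop <- sqrt(mean((score - mean(score))^2))  # standard deviation with divisor n
r_pb <- (m1 - m0) / sd_pop * sqrt(p * (1 - p))

# Both coincide with Pearson's correlation computed on the indicators.
phi_as_pearson  <- cor(is_a, is_yes)
r_pb_as_pearson <- cor(score, is_a)
```

Up to floating-point error, `phi` equals `phi_as_pearson` and `r_pb` equals `r_pb_as_pearson`.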

For more details on these effect size measurements, see: Rakotomalala R., « Comprendre la taille d’effet (effect size) »
 

In some functions of GDAtools, association measures can be complemented by permutation tests, which belong to combinatorial inference and constitute a nonparametric alternative to the significance tests of frequentist inference. A permutation test is performed in several steps.

  1. A measure of association between the two variables under study is computed.

  2. The same measure of association is calculated from a “permuted” version of the data, i.e. by randomly “mixing” the values of one of the variables, in order to “break” the relationship between the variables.

  3. Step 2 is repeated a large number of times. This gives an empirical distribution (as opposed to the theoretical distribution used in frequentist inference) of the measure of association under the null hypothesis H0 of no relationship between the two variables.

  4. The result of step 1 is compared with the distribution obtained in step 3. The p-value of the permutation test is the proportion of values of the H0 distribution that are at least as extreme as the measure of association observed in step 1.

If all possible permutations are performed, the permutation test is called “exact”. In practice, the computation time required is often too long and only a subset of the possible permutations is performed, resulting in an “approximate” test. In the following examples, the number of permutations is set to 100 to reduce the computation time, but it is advisable to increase this number to obtain more accurate and reliable results (for example nperm=1000).
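The four steps can be sketched in base R as follows. This is a minimal approximate permutation test written for illustration: the functions `cramers_v` and `perm_test` and the simulated factors `x` and `y` are assumptions of this example, and the way GDAtools computes its permutation p-values internally may differ.

```r
set.seed(42)

# Association measure used in this sketch: Cramér's V of two factors.
cramers_v <- function(x, y) {
  tab <- table(x, y)
  n <- sum(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / n
  chi2 <- sum((tab - expected)^2 / expected)
  sqrt(chi2 / (n * (min(dim(tab)) - 1)))
}

# Hypothetical data: y is more often "yes" when x is "a".
n <- 300
x <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
y <- factor(ifelse(runif(n) < ifelse(x == "a", 0.8, 0.3), "yes", "no"))

perm_test <- function(x, y, stat, nperm = 1000) {
  observed <- stat(x, y)                             # step 1
  permuted <- replicate(nperm, stat(x, sample(y)))   # steps 2 and 3: shuffle y, recompute
  p_value  <- mean(permuted >= observed)             # step 4
  list(observed = observed, permuted = permuted, p.value = p_value)
}

res <- perm_test(x, y, cramers_v, nperm = 1000)
res$observed
res$p.value
```

Because Cramér’s V is never negative, “extreme” simply means “large” here. For a signed measure such as a correlation, a two-sided version would compare absolute values instead.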
 

To illustrate the statistical association analysis functions of GDAtools, we will use data on cinema. This is a sample of 1000 films released in France in the 2000s, for which we know the budget, the genre, the country of origin, the “art et essai” label, the selection in a festival (Cannes, Berlin or Venice), the average rating of intellectual critics (according to Allociné) and the number of admissions. Some of these variables are numerical, others are categorical.

library(GDAtools)
data(Movies)
str(Movies)
'data.frame':   1000 obs. of  7 variables:
 $ Budget   : num  3.10e+07 4.88e+06 3.50e+06 1.63e+08 2.17e+07 ...
 $ Genre    : Factor w/ 9 levels "Action","Animation",..: 1 5 7 1 7 5 1 7 5 7 ...
 $ Country  : Factor w/ 4 levels "Europe","France",..: 4 2 2 1 2 2 4 4 2 4 ...
 $ ArtHouse : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 1 1 1 1 1 ...
 $ Festival : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ Critics  : num  3 1 3.75 3.75 3.6 2.75 1 1 1 3 ...
 $ BoxOffice: num  1013509 24241 39376 6996996 493416 ...

Relationship between two variables

The package offers several functions to study the statistical relationship between two variables, depending on the nature (categorical or numerical) of these variables.

Two categorical variables

The function assoc.twocat computes:

  • the contingency table (numbers)
  • the percentages, the row-percentages and the column-percentages
  • the theoretical numbers, i.e. in a situation of independence
  • the chi-square
  • the Cramér’s V and the p-value of the corresponding permutation test
  • the Pearson residuals
  • the phi coefficients and the p-values of the corresponding permutation tests
  • the global and local PEMs (Percentage of Maximum Deviation from Independence, see Cibois 1993)
  • a summary table of these results

assoc.twocat(Movies$Country, Movies$ArtHouse, nperm=100)
$freq
         No  Yes  Sum
Europe   39   33   72
France  212  393  605
Other     6   20   26
USA     257   40  297
Sum     514  486 1000

$prop
          No   Yes   Sum
Europe   3.9   3.3   7.2
France  21.2  39.3  60.5
Other    0.6   2.0   2.6
USA     25.7   4.0  29.7
Sum     51.4  48.6 100.0

$rprop
             No      Yes Sum
Europe 54.16667 45.83333 100
France 35.04132 64.95868 100
Other  23.07692 76.92308 100
USA    86.53199 13.46801 100
Sum    51.40000 48.60000 100

$cprop
               No        Yes   Sum
Europe   7.587549   6.790123   7.2
France  41.245136  80.864198  60.5
Other    1.167315   4.115226   2.6
USA     50.000000   8.230453  29.7
Sum    100.000000 100.000000 100.0

$expected
            No     Yes
Europe  37.008  34.992
France 310.970 294.030
Other   13.364  12.636
USA    152.658 144.342

$chi.squared
[1] 220.1263

$cramer.v
[1] 0.4691762

$permutation.pvalue
[1] 0

$pearson.residuals
               No        Yes
Europe  0.3274474 -0.3367479
France -5.6123445  5.7717531
Other  -2.0143992  2.0716146
USA     8.4449945 -8.6848595

$phi
                No         Yes
Europe  0.01541876 -0.01541876
France -0.40506773  0.40506773
Other  -0.09258656  0.09258656
USA     0.45688150 -0.45688150

$phi.perm.pval
                 No          Yes
Europe 2.970162e-01 2.970162e-01
France 9.125065e-38 0.000000e+00
Other  2.301299e-03 2.301299e-03
USA    0.000000e+00 1.563417e-52

$local.pem
          No   Yes
Europe   5.7  -5.7
France -51.6  51.6
Other  -55.1  55.1
USA     72.3 -72.3

$global.pem
[1] 59.3

$gather
    Var1 Var2 Freq  prop     rprop      cprop expected std.residuals         phi    perm.pval local.pem
1 Europe   No   39 0.039 0.5416667 0.07587549   37.008     0.3274474  0.01541876 2.970162e-01       5.7
2 France   No  212 0.212 0.3504132 0.41245136  310.970    -5.6123445 -0.40506773 9.125065e-38     -51.6
3  Other   No    6 0.006 0.2307692 0.01167315   13.364    -2.0143992 -0.09258656 2.301299e-03     -55.1
4    USA   No  257 0.257 0.8653199 0.50000000  152.658     8.4449945  0.45688150 0.000000e+00      72.3
5 Europe  Yes   33 0.033 0.4583333 0.06790123   34.992    -0.3367479 -0.01541876 2.970162e-01      -5.7
6 France  Yes  393 0.393 0.6495868 0.80864198  294.030     5.7717531  0.40506773 0.000000e+00      51.6
 [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
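To make the link between these outputs explicit, several of them can be recomputed in base R from the counts printed in $freq. The sketch below uses the usual textbook formulas (expected counts under independence, Pearson residuals, Cramér’s V, phi of the 2 x 2 table obtained by isolating a cell, and PEM as the deviation from independence divided by its maximum possible value given the margins). Variable names are chosen for this example, and the code is not taken from GDAtools.

```r
# Counts copied from the $freq output (margins excluded).
freq <- matrix(c( 39,  33,
                 212, 393,
                   6,  20,
                 257,  40),
               nrow = 4, byrow = TRUE,
               dimnames = list(c("Europe", "France", "Other", "USA"),
                               c("No", "Yes")))
n  <- sum(freq)
rs <- rowSums(freq)
cs <- colSums(freq)

# Expected counts under independence, Pearson residuals, chi-square, Cramér's V.
expected  <- outer(rs, cs) / n
residuals <- (freq - expected) / sqrt(expected)
chi2      <- sum(residuals^2)
cramer_v  <- sqrt(chi2 / (n * (min(dim(freq)) - 1)))

# Phi of each cell: correlation between "is in this row" and "is in this column".
phi <- (freq * n - outer(rs, cs)) / sqrt(outer(rs * (n - rs), cs * (n - cs)))

# Local PEM: deviation from independence as a percentage of its maximum,
# given the margins (largest possible count if positive, smallest if negative).
max_dev <- ifelse(freq >= expected,
                  outer(rs, cs, pmin) - expected,
                  expected - pmax(outer(rs, cs, "+") - n, 0))
local_pem <- 100 * (freq - expected) / max_dev

round(chi2, 4)
round(cramer_v, 7)
round(phi, 4)
round(local_pem, 1)
```

With these counts, the sketch gives a chi-square of about 220.13 and a Cramér’s V of about 0.469, and the phi and local PEM matrices agree with the $phi and $local.pem tables printed above.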


The function ggassoc_crosstab presents the contingency table in graphical form, with rectangles whose area corresponds to the cell counts and whose color gradient corresponds to the attractions/repulsions (phi coefficients). Cramér’s V can be displayed in a corner of the graph. Here, the “art et essai” label is clearly over-represented among French films and under-represented among American films.

ggassoc_crosstab(Movies, ggplot2::aes(x=Country, y=ArtHouse), max.phi=0.8)