Abstract

This vignette is an introduction to the package groupdata2.

groupdata2 is a set of methods for easy grouping, windowing, folding, partitioning, splitting and balancing of data.

We will go through dividing up a time series into windows.

For a more extensive description of groupdata2, please see Description of groupdata2.

Contact author at r-pkgs@ludvigolsen.dk

When working with time series, groupdata2 allows us to quickly divide them into groups / windows.

We will use the dataset **austres** for this vignette. It contains numbers (in thousands) of Australian residents measured *quarterly* from March 1971 to March 1994.

Let’s load the data and take a look at the first values.

```
# Attach the packages used throughout this vignette
library(groupdata2)
library(dplyr)  # provides %>%
library(knitr)  # provides kable()

timeSeriesFrame <- data.frame('residents' = austres)

# Show structure of data frame
str(timeSeriesFrame)
#> 'data.frame': 89 obs. of 1 variable:
#> $ residents: Time-Series from 1971 to 1993: 13067 13130 13198 13254 13304 ...

# Show head of data
timeSeriesFrame %>% head(12) %>% kable()
```

residents |
---|
13067.3 |
13130.5 |
13198.4 |
13254.2 |
13303.7 |
13353.9 |
13409.3 |
13459.2 |
13504.5 |
13552.6 |
13614.3 |
13669.5 |

A visualization of the data shows that the number of residents increases almost linearly with time.

Let’s say that instead of four measurements per year, we want one measurement every three years.

We can do this by making groups of 12 elements each with the ‘greedy’ method and using the mean of each group as our measurement.

When using the method ‘greedy’, we specify the desired group size. Every group, except the last, is guaranteed to have this size. The last group gets the elements that are left, so it is either smaller than the other groups or the same size.
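The greedy rule can also be sketched in base R, without groupdata2. This is only an illustration of the idea and not the package's implementation; the names `group_size`, `greedy_groups`, `group_sizes` and `group_means` are made up for this example.

```r
# Base R sketch of greedy grouping (illustrative only, not groupdata2's code)
residents <- as.numeric(austres)  # 89 quarterly observations
group_size <- 12                  # desired size of every group but the last

# Element i goes to group ceiling(i / group_size)
greedy_groups <- ceiling(seq_along(residents) / group_size)

# Size and mean of each group
group_sizes <- table(greedy_groups)
group_means <- tapply(residents, greedy_groups, mean)

print(group_sizes)
print(round(group_means, 2))
```

With 89 observations and a group size of 12, this gives seven full groups and a final group holding the remaining 5 elements.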

```
ts <- timeSeriesFrame %>%
  # Group data
  group(n = 12, method = 'greedy') %>%
  # Find means of each group
  dplyr::summarise(mean = mean(residents))

# Show new data
ts %>% kable()
```

.groups | mean |
---|---|
1 | 13376.45 |
2 | 13945.62 |
3 | 14418.36 |
4 | 15022.52 |
5 | 15663.29 |
6 | 16378.30 |
7 | 17151.38 |
8 | 17573.18 |

A visualization of the data.

This procedure has left us with fewer datapoints, which could be useful if we had a very large data frame to start with, or if we just wanted to describe the change in residents every 3rd year (or every year for that matter, by simply changing n to 4).

If we wanted to know which group had the largest increase in residents, we could find the range (difference between the max and min value) within each group instead of taking the mean.
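As a rough base R sketch of that idea (again not groupdata2's own code), `diff(range(x))` returns the difference between the largest and smallest value of `x`, and `tapply()` applies it to each group. The names `greedy_groups` and `group_ranges` are hypothetical and only used here.

```r
# Base R sketch: range (max - min) within each greedy group of 12
residents <- as.numeric(austres)
greedy_groups <- ceiling(seq_along(residents) / 12)

# diff(range(x)) is max(x) - min(x)
group_ranges <- tapply(residents, greedy_groups, function(x) diff(range(x)))

print(round(group_ranges, 1))

# Group with the largest within-group increase
print(names(which.max(group_ranges)))
```

Note that the last group holds fewer measurements than the others, so its range covers a shorter time span and is not directly comparable.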

```
ts <- timeSeriesFrame %>%
  # Group data
  group(n = 12, method = 'greedy') %>%
  # Find range of each group
  dplyr::summarise(range = diff(range(residents)))

# Show new data
ts %>% kable()
```

.groups | range |
---|---|
1 | 602.2 |
2 | 433.0 |
3 | 454.2 |
4 | 650.8 |
5 | 568.0 |
6 | 758.9 |
7 | 614.2 |
8 | 178.9 |

For the fun of it, let’s say we want to make staircased groups inside the greedy groups we just created.

When using the method ‘staircase’ we specify **step size**. Every group is 1 step larger than the previous group (e.g. with a step size of 5, group sizes would be 5,10,15,…).
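To make the staircase rule concrete, here is a small base R sketch. The helper `staircase_ids()` is hypothetical and written only for this illustration; it is not a groupdata2 function and does not claim to reproduce the package's internals.

```r
# Base R sketch of staircase group assignment (hypothetical helper)
staircase_ids <- function(n_elements, step_size) {
  # Group k has size k * step_size: step, 2 * step, 3 * step, ...
  sizes <- seq_len(n_elements) * step_size
  # Repeat each group id as many times as the group is large,
  # then keep only as many ids as there are elements
  ids <- rep(seq_along(sizes), times = sizes)
  ids[seq_len(n_elements)]
}

# 30 elements with a step size of 5 give groups of size 5, 10 and 15
sizes_step_5 <- table(staircase_ids(30, 5))
print(sizes_step_5)

# 12 elements with a step size of 2 give groups of size 2, 4 and 6
ids_step_2 <- staircase_ids(12, 2)
print(ids_step_2)
```

If the elements run out before a group is full, the last group is simply cut short, which is the situation the remainder check is meant to detect.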

By creating subgroups for every greedy group, the group size will ‘start over’ for each greedy group.

When using the staircase method, the last group might not have the size of the second-to-last group plus the step size. We want to make sure that it does, so we use the helper tool **%staircase%** to find a step size with a remainder of 0.

```
main_group_size <- 12

# Loop through the step sizes 1 to 30
for (step_size in 1:30) {
  # If the remainder is 0
  if (main_group_size %staircase% step_size == 0) {
    # Print the step size
    print(step_size)
  }
}
#> [1] 2
#> [1] 4
#> [1] 12
```

So our step size could be 2, 4 or 12. We pick a step size of 2, because it will yield the most subgroups for the example.
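To see why exactly these step sizes work, the remainder can be computed by hand in base R. The function `staircase_remainder()` below is a hypothetical stand-in written for this sketch; it assumes the remainder is what is left after the largest complete staircase that fits, and it is not the implementation behind **%staircase%**.

```r
# Base R sketch of a staircase remainder (hypothetical helper)
staircase_remainder <- function(size, step_size) {
  # Total elements used by 1, 2, 3, ... complete groups:
  # step, step + 2 * step, step + 2 * step + 3 * step, ...
  totals <- cumsum(seq_len(size) * step_size)
  fitting <- totals[totals <= size]
  # If not even the first group fits, everything is left over
  if (length(fitting) == 0) {
    return(size)
  }
  size - max(fitting)
}

# Step sizes between 1 and 30 that leave no remainder for 12 elements
zero_steps <- Filter(function(s) staircase_remainder(12, s) == 0, 1:30)
print(zero_steps)
```

For example, a step size of 2 uses 2 + 4 + 6 = 12 elements, and a step size of 4 uses 4 + 8 = 12, while a step size of 3 uses 3 + 6 = 9 and leaves 3 elements over.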

Now we will first make the greedy groups like before, then we will create subgroups with the staircase method.

In order not to overwrite the ‘.groups’ column from the first use of **group()**, we will use the **col_name** argument in **group()**.

We will also need to use dplyr’s **do()** when using **group()** on every greedy group inside the pipeline.

```
ts <- timeSeriesFrame %>%
  # Group data
  group(n = 12, method = 'greedy') %>%
  # Create subgroups
  do(group(., n = 2, method = 'staircase', col_name = '.subgroups'))
#> Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector

# Show head of new data
ts %>% head(24) %>% kable()
```

residents | .groups | .subgroups |
---|---|---|
13067.3 | 1 | 1 |
13130.5 | 1 | 1 |
13198.4 | 1 | 2 |
13254.2 | 1 | 2 |
13303.7 | 1 | 2 |
13353.9 | 1 | 2 |
13409.3 | 1 | 3 |
13459.2 | 1 | 3 |
13504.5 | 1 | 3 |
13552.6 | 1 | 3 |
13614.3 | 1 | 3 |
13669.5 | 1 | 3 |
13722.6 | 2 | 1 |
13772.1 | 2 | 1 |
13832.0 | 2 | 2 |
13862.6 | 2 | 2 |
13893.0 | 2 | 2 |
13926.8 | 2 | 2 |
13968.9 | 2 | 3 |
14004.7 | 2 | 3 |
14033.1 | 2 | 3 |
14066.0 | 2 | 3 |
14110.1 | 2 | 3 |
14155.6 | 2 | 3 |

Notice the warning in the previous code.

Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character

On some time series, the do() step converts the column ‘.subgroups’ from the type *factor* into the type *character* because of unequal factor levels. This is likely because the last greedy group contains fewer elements than the other groups, so fewer subgroups can be made from it. Let’s check the tail of the new data frame.

residents | .groups | .subgroups |
---|---|---|
16833.1 | 7 | 1 |
16891.6 | 7 | 1 |
16956.8 | 7 | 2 |
17026.3 | 7 | 2 |
17085.4 | 7 | 2 |
17106.9 | 7 | 2 |
17169.4 | 7 | 3 |
17239.4 | 7 | 3 |
17292.0 | 7 | 3 |
17354.2 | 7 | 3 |
17414.2 | 7 | 3 |
17447.3 | 7 | 3 |
17482.6 | 8 | 1 |
17526.0 | 8 | 1 |
17568.7 | 8 | 2 |
17627.1 | 8 | 2 |
17661.5 | 8 | 2 |

Sure enough, the last greedy group (8) is smaller. This means that there are only 2 subgroups instead of 3. To solve this we first convert .subgroups to an integer and then to a factor.

We could also get the means of each subgroup. To do this, we first group by .groups and then .subgroups. Then we find the mean number of residents for each subgroup. If we had just grouped by .subgroups, we would have taken the mean of all the datapoints in each subgroup level. This would have left us with (in this case) 3 means, instead of 1 per subgroup level per main group level.
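The difference between the two ways of grouping can be checked with a base R sketch. It rebuilds the main groups and subgroups by hand rather than with groupdata2, and the names `main`, `sub`, `means_by_sub` and `means_by_both` are assumptions made for this example.

```r
# Base R sketch: grouping by subgroup only vs. by main group and subgroup
residents <- as.numeric(austres)

# Greedy main groups of 12 elements
main <- ceiling(seq_along(residents) / 12)

# Staircase subgroups (step size 2) that start over within each main group
sub <- ave(seq_along(residents), main, FUN = function(i) {
  sizes <- seq_along(i) * 2
  rep(seq_along(sizes), times = sizes)[seq_along(i)]
})

# Grouping by subgroup level only pools elements across all main groups
means_by_sub <- tapply(residents, sub, mean)

# Grouping by both gives one mean per subgroup within each main group
means_by_both <- aggregate(residents,
                           by = list(.groups = main, .subgroups = sub),
                           FUN = mean)

print(length(means_by_sub))
print(nrow(means_by_both))
```

Grouping by subgroup alone yields 3 means, while grouping by both yields 23: three for each of the seven full main groups and two for the shorter last group.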

Now that we are at it, we might as well find the ranges for each subgroup as well.

```
ts_means <- ts %>%
  # Convert .subgroups to an integer and then to a factor
  mutate(.subgroups = as.integer(.subgroups),
         .subgroups = as.factor(.subgroups)) %>%
  # Group by first .groups, then .subgroups
  group_by(.groups, .subgroups) %>%
  # Find the mean and range of each subgroup
  dplyr::summarise(mean = mean(residents),
                   range = diff(range(residents)))

# Show head of new data
ts_means %>% head(9) %>% kable()
```

.groups | .subgroups | mean | range |
---|---|---|---|
1 | 1 | 13098.90 | 63.2 |
1 | 2 | 13277.55 | 155.5 |
1 | 3 | 13534.90 | 260.2 |
2 | 1 | 13747.35 | 49.5 |
2 | 2 | 13878.60 | 94.8 |
2 | 3 | 14056.40 | 186.7 |
3 | 1 | 14211.95 | 39.5 |
3 | 2 | 14341.92 | 115.1 |
3 | 3 | 14538.12 | 215.6 |

The differences in range follow the differences in the number of measurements per subgroup: larger subgroups span more quarters and therefore cover a larger increase.

Here is a visualization of the means per subgroup:

Well done, you made it to the end of this introduction to groupdata2! If you want to know more about the various methods and arguments, you can read the Description of groupdata2.

If you have any questions or comments on this vignette (tutorial) or groupdata2, please send them to me at r-pkgs@ludvigolsen.dk, or open an issue on the GitHub page https://github.com/LudvigOlsen/groupdata2 so I can make improvements.