In this vignette we demonstrate how to define new methods, with varying degrees of completeness.

We will generate a dataset comprising two clusters with a different intercept and slope. Firstly, we configure the default id, time, and response variable, so that these do not need to be provided to the other function calls.

Next, we generate the dataset, with 40 trajectories for cluster A and 60 trajectories for cluster B. Cluster A involves trajectories with a downward slope, whereas cluster B has an upward slope. We define a fixed mean of 1, such that all the cluster trajectories are shifted, placing cluster A at intercept 2, and cluster B at intercept 1.

```
set.seed(1)
casedata <- generateLongData(sizes = c(40, 60),
data = data.frame(Time = 0:10),
fixed = Y ~ 1,
fixedCoefs = 1,
cluster = ~ Time,
clusterCoefs = cbind(c(2, -.1), c(0, .05)),
random = ~ Time,
randomScales = cbind(c(.2, .02), c(.2, .02)),
noiseScales = .05
)
```

We plot the data to get a sense of different trajectories that have been generated. Visually, there is little group separation.

Since we generated the data, we have the luxury of looking at reference cluster trajectories, as stored in the `Mu.cluster`

column. Note that `Mu.fixed`

must be added to obtain the correct values.

Rather than starting with clustering, in some case studies there may be prior knowledge on how to sensibly stratify the trajectories. Either by an observed independent variable (e.g., age or gender), or a certain aspect of the observed trajectory (e.g., the intercept, slope, variability).

Letâ€™s presume in this example that domain knowledge suggests that stratifying by the intercept may provide a sensible grouping of the trajectories. This approach is further supported by the density plot of the trajectory intercepts, which shows a bimodal distribution.

Based on the density plot, we will assign trajectories with an intercept above 1.6 to cluster A, and the remainder to cluster B.

The approach we specified requires the first observation to be available, and is sensitive to noise. A more robust alternative would be to fit a linear model per trajectory, and to use the estimated intercept to stratify the trajectories on.

We can specify this stratification model by defining a function which takes the data of separate trajectories as input. This function should return a single cluster assignment for the respective trajectory. By returning a factor, we can pre-specify the cluster names.

```
stratfun <- function(data) {
int <- coef(lm(Y ~ Time, data))[1]
factor(int > 1.7,
levels = c(FALSE, TRUE),
labels = c("Low", "High"))
}
m2 <- lcMethodStratify(response = "Y", stratify = stratfun, center = mean)
model2 <- latrend(m2, casedata)
clusterProportions(model2)
#> Low High
#> 0.6 0.4
```

In case the linear regression step is time-intensive, a more efficient approach is to save the pre-computed trajectory intercepts as a column in the original data. This column can then be referred to in the expression of the stratification model.

We can take the approach involving the estimation of a linear model per trajectory one step further. Instead of using a pre-defined threshold on the intercept, we use a cluster algorithm on both the intercept and slope to automatically find clusters.

We first define the representation step, which estimates the model coefficients per trajectory, and outputs a `matrix`

with the coefficients per trajectory per row.

```
repStep <- function(method, data, verbose) {
coefdata <- data[, lm(Y ~ Time, .SD) %>% coef() %>% as.list(), keyby = Traj]
coefmat <- subset(coefdata, select = -1) %>% as.matrix()
rownames(coefmat) <- coefdata$Traj
return(coefmat)
}
```

The cluster step takes the coefficient matrix as input. A cross-sectional cluster algorithm can then be applied to the matrix. In this example, we apply \(k\)-means. The cluster step should output a `lcModel`

object.

The `lcModelCustom`

function creates a `lcModel`

object for a given vector of cluster assignments.

```
clusStep <- function(method, data, repMat, envir, verbose) {
km <- kmeans(repMat, centers = 3)
lcModelCustom(response = method$response,
method = method,
data = data,
trajectoryAssignments = km$cluster,
clusterTrajectories = method$center,
model = km)
}
```

We are now ready to create the `lcMethodTwoStep`

method.

```
model.twostep <- latrend(m.twostep, data = casedata)
summary(model.twostep)
#> Longitudinal cluster model using two-step clustering
#> id: "Traj"
#> time: "Time"
#> center: function (x) { mean(x, na.rm = TRUE)}
#> standardize: `scale`
#> clusterStep: `clusStep`
#> representationStep:`repStep`
#> response: "Y"
#>
#> Cluster sizes (K=3):
#> A B C
#> 35 (35%) 25 (25%) 40 (40%)
#>
#> Number of obs: 1100, strata (Traj): 100
#>
#> Scaled residuals:
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> NA NA NA NaN NA NA 1
```

The two-step model defined above is hard-coded for a given formula and a fixed number of clusters. In an exploratory setting, it is convenient to define a parameterized method. Here, we change the two functions to take arguments through the `lcMethod`

object in the `method`

variable.

Note that we can introduce new arguments which are not originally part of `lcMethodTwoStep`

(e.g., `nClusters`

) to enable the specification of the number of clusters in our method.

```
repStep.gen <- function(method, data, verbose) {
coefdata <- data[, lm(method$formula, .SD) %>% coef() %>% as.list(), keyby = c(method$id)]
# exclude the id column
coefmat <- subset(coefdata, select = -1) %>% as.matrix()
rownames(coefmat) <- coefdata[[method$id]]
return(coefmat)
}
clusStep.gen <- function(method, data, repMat, envir, verbose) {
km <- kmeans(repMat, centers = method$nClusters)
lcModelCustom(response = "Y",
method = method,
data = data,
trajectoryAssignments = km$cluster,
clusterTrajectories = method$center,
model = km)
}
```

We create a new `lcMethodTwoStep`

instance with the more generic functions. Defining values for `formula`

and `nClusters`

here makes these arguments values act as default values in a call of `latrend`

.

```
m.twostepgen <- lcMethodTwoStep(response = "Y",
representationStep = repStep.gen,
clusterStep = clusStep.gen)
```

However, because we omitted the specification of `formula`

and `nClusters`

, these need to be provided in the `latrend`

call.

```
model.twostepgen <- latrend(m.twostepgen, formula = Y ~ Time, nClusters = 2, casedata)
summary(model.twostepgen)
#> Longitudinal cluster model using two-step clustering
#> nClusters: 2
#> formula: Y ~ Time
#> id: "Traj"
#> time: "Time"
#> center: function (x) { mean(x, na.rm = TRUE)}
#> standardize: `scale`
#> clusterStep: `clusStep.gen`
#> representationStep:`repStep.gen`
#> response: "Y"
#>
#> Cluster sizes (K=2):
#> A B
#> 40 (40%) 60 (60%)
#>
#> Number of obs: 1100, strata (Traj): 100
#>
#> Scaled residuals:
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> NA NA NA NaN NA NA 1
```

The use of `lcMethodStratify`

and `lcMethodTwostep`

enables rapid prototyping. Once the desired model has been identified, it may be worthwhile to implement it as a standalone method in the framework. This way the model can be extended to output more representative cluster trajectories, or to extend the model with predictive capabilities such that it can be validated on external data.

To illustrate this, we will implement a basic group-based trajectory model (GBTM), also known as latent-class growth analysis. We make use of the implementation available in the `lcmm`

package through the `lcmm()`

function, which allows for a relatively concise illustration.

`lcMethod`

classFirstly, we create a new method class named `lcMethodSimpleGBTM`

, which extends the `lcMethod`

class.

Note that a `lcMethod`

class has a single slot `call`

, which stores all the arguments to the method. The other relevant aspects of a `lcMethod`

implementation are the `prepareData()`

and `fit()`

functions, which are called when passing a `lcMethod`

object to the `latrend()`

function or any other model estimation function.

```
lcMethodSimpleGBTM <- function(formula = Value ~ Time,
time = getOption("latrend.time"),
id = getOption("latrend.id"),
nClusters = 2,
nwg = FALSE) {
lcMethod.call("lcMethodSimpleGBTM", call = stackoverflow::match.call.defaults())
}
```

Next, we override the `getName()`

and `getShortName()`

functions in our method class to ensure that the model is easily distinguishable from other methods in the summary output. While we return a simple constant character sequence, the naming functions could be improved by generating a more detailed description of the model based on the arguments of the method object.

```
setMethod("getName",
signature("lcMethodSimpleGBTM"), function(object, ...) "simple group-based trajectory model")
setMethod("getShortName", signature("lcMethodSimpleGBTM"), function(object, ...) "sgbtm")
```

Implementing the `prepareData()`

function enables the `lcMethod`

to do data processing or transforming the data or other arguments in a structure which is suitable for the model fitting procedure. The prepare function takes a `lcMethod`

, the data as a `data.frame`

, and the verbosity level as inputs. The processing is passed onto the `fit`

function by returning an `environment`

with the relevant variables.

In this example, we need to ensure that the `Id`

column is an integer.

```
setMethod("prepareData", signature("lcMethodSimpleGBTM"), function(method, data, verbose, ...) {
envir <- new.env()
envir$data <- as.data.frame(data)
envir$data[[method$id]] <- factor(data[[method$id]]) %>% as.integer()
return(envir)
})
```

The `fit()`

function estimates the model and should return an object that extends the `lcModel`

class. In our implementation, we call the `lcmm()`

with the appropriate arguments, and we construct a `lcModelSimpleGBTM`

instance. The class is defined in the subsection below. The `envir`

argument contains the return value of the `prepare()`

function, which would be `NULL`

here.

```
setMethod("fit", signature("lcMethodSimpleGBTM"), function(method, data, envir, verbose, ...) {
args <- as.list(method, args = lcmm::lcmm)
args$data <- envir$data
args$fixed <- method$formula
if (method$nClusters > 1) {
args$mixture <- update(method$formula, NULL ~ .)
} else {
args$mixture <- NULL
}
args$subject <- method$id
args$ng <- method$nClusters
args$returndata <- TRUE
model <- do.call(lcmm::lcmm, args)
new("lcModelSimpleGBTM",
method = method,
data = data,
model = model,
clusterNames = LETTERS[seq_len(method$nClusters)])
})
```

`lcModel`

classWe start off by defining the `lcModelSimpleGBTM`

as an extension of the `lcModel`

class. Extending the base class also enables the addition of new fields to the class, although there is no need for this in the present model.

Our model inherits a couple of slots from the `lcModel`

class, namely:

```
slotNames("lcModelSimpleGBTM")
#> [1] "model" "method" "call" "data"
#> [5] "id" "time" "response" "label"
#> [9] "ids" "clusterNames" "date" "estimationTime"
#> [13] "tag"
```

The `"model"`

slot is free to be used to assign an arbitrary data structure that represents the internal model representation(s). In our implementation, this slot contains the `lcmm`

model. The other slots are assigned in the constructor of the `lcModel`

class, or by the `latrend()`

estimation function. It is best practice to use the relevant getter functions for obtaining these values (via e.g., `idVariable()`

, `getLcMethod()`

, `clusterNames()`

).

Typically, most effort of implementing the model interface goes to the `predict()`

function. While implementing the function is optional, it is used for obtaining the fitted values, fitted trajectories, residuals, and cluster trajectories. Without this function, there is little functionality from a model except for the partitioning of the fitted trajectories.

The default `fitted.lcModel()`

function returns the output from `predict(model, newdata=NULL)`

. As such, it is advisable to implement this special case. Here, we move the logic to the custom `fitted.lcModelSimpleGBTM`

function, as defined below.

```
predict.lcModelSimpleGBTM <- function(object, newdata = NULL, what = "mu", ...) {
if(is.null(newdata)) {
predMat <- fitted(object, clusters = NULL)
} else {
predMat <- lcmm::predictY(object@model, newdata = newdata)$pred %>%
set_colnames(clusterNames(object))
}
transformPredict(pred = predMat, model = object, newdata = newdata)
}
```

The `fitted()`

function should return the fitted values per trajectory per (requested) cluster. If not clusters are provided, the function is expected to return a matrix with the fitted values for each cluster. This logic is handled in the `transformFitted()`

function, which is available in the package.

```
fitted.lcModelSimpleGBTM <- function(object, clusters = trajectoryAssignments(object)) {
predNames <- paste0("pred_m", 1:nClusters(object))
predMat <- as.matrix(object@model$pred[predNames])
colnames(predMat) <- clusterNames(object)
transformFitted(pred = predMat, model = object, clusters = clusters)
}
```

In order for the model implementation to return cluster assignments through the `trajectoryAssignments()`

function, we need to implement the `postprob()`

function to return the posterior probability matrix for the fitted data.

```
setMethod("postprob", signature("lcModelSimpleGBTM"), function(object) {
as.matrix(object@model$pprob)[, c(-1, -2), drop = FALSE]
})
```

Lastly, we override the `converged()`

function to return the convergence code of the internal model.

```
m <- lcMethodSimpleGBTM(formula = Y ~ Time)
show(m)
#> lcMethodSimpleGBTM as "simple group-based trajectory model"
#> formula: Y ~ Time
#> time: getOption("latrend.time")
#> id: getOption("latrend.id")
#> nClusters: 2
#> nwg: FALSE
```

```
sgbtm <- latrend(m, casedata)
#> Be patient, lcmm is running ...
#> The program took 0.17 seconds
summary(sgbtm)
#> Longitudinal cluster model using simple group-based trajectory model
#> nwg: FALSE
#> nClusters: 2
#> id: "Traj"
#> time: "Time"
#> formula: Y ~ Time
#>
#> Cluster sizes (K=2):
#> A B
#> 40 (40%) 60 (60%)
#>
#> Number of obs: 1100, strata (Traj): 100
#>
#> Scaled residuals:
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -2.1126 -0.9295 0.4772 0.0000 0.8352 1.3507
```