This vignette^{1} describes the workflow of the {ino} package
for the likelihood optimization of an hidden Markov model (HMM). For
more technical details about HMMs see, e.g., Zucchini et al. (2016).

The example data set considered throughout this vignette covers a time series of share returns from the stock ‘Deutsche Bank’. The data set is freely accessible via Yahoo Finance.

```
library("fHMM")
<- tempfile()
file ::download_data(symbol = "DBK.DE", from = "2000-01-01", file = file)
fHMM#> Download successful.
#> * symbol: DBK.DE
#> * from: 2000-01-03
#> * to: 2022-09-28
#> * path: C:\Users\loelschlaeger\AppData\Local\Temp\RtmpcpEkAy\filee6c56577e4f
```

```
library("dplyr")
<- read.csv(file) %>%
db_data as_tibble() %>%
summarize(date = as.Date(Date, format = "%Y-%m-%d"),
obs = c(NA, diff(log(Close), lag=1) * 100)) %>%
filter(!is.na(obs)) %>%
print()
#> # A tibble: 5,815 × 2
#> date obs
#> <date> <dbl>
#> 1 2000-01-04 -5.16
#> 2 2000-01-05 -1.30
#> 3 2000-01-06 4.97
#> 4 2000-01-07 2.99
#> 5 2000-01-10 -2.61
#> 6 2000-01-11 0.350
#> 7 2000-01-12 -3.05
#> 8 2000-01-13 3.59
#> 9 2000-01-14 0.656
#> 10 2000-01-17 2.62
#> # … with 5,805 more rows
```

```
library("ggplot2")
ggplot(db_data, aes(x = date, y = obs)) +
geom_point() +
geom_line() +
ylab("log-returns [%]")
```

As the share returns are continuous and can take both negative and
positive values, we consider an HMM with Gaussian state-dependent
distributions. The (log-)likelihood of a Gaussian-HMM is implemented in
the function `f_ll_hmm()`

.

We consider an 2-state (`N = 2`

) Gaussian-HMM with six
parameters (`npar = 6`

) to be estimated:

- two for the transition probability matrix (tpm)
- two for the means of the state-dependent distribution
- two for the standard deviations of the state-dependent distribution

The argument `neg = TRUE`

indicates that we minimize the
negative log-likelihood, and `opt`

selects the optimizer
`nlm()`

.

```
<- setup_ino(
hmm_ino f = f_ll_hmm,
npar = 6,
data = db_data,
N = 2,
neg = TRUE,
opt = set_optimizer_nlm()
)
```

We first use randomly chosen starting values. As the first two starting values belong to the tpm, we sample from a normal distribution with mean -1.5 and standard deviation of 0.5 — as we use the multinomial logit link to ensure that the probabilities are between 0 and 1, a mean of -1.5 correspond to probabilities of staying in state 1 and 2 of about 0.81. For the two means, we draw two random numbers from the standard normal distribution, as the time series above indicates that the most of the returns are fairly close to zero. The starting values for the standard deviations are drawn from a uniform distribution between 0.5 and 2 (note that we exponentiate the standard deviations in the likelihood as they are constrained to be positive, and hence we log transform the starting values).

```
<- function() c(log(stats::runif(2, 0.1, 0.9)),
sampler ::rnorm(2),
statslog(stats::runif(2, 0.5, 2)))
<- random_initialization(hmm_ino, runs = 50, sampler = sampler) hmm_ino
```

For selecting fixed starting values, we consider values that lie in the ranges considered above:

```
<- list(c(-2, -2, 0, 0, log(2), log(3)),
starting_values c(-1.5, -1.5, -2, 2, log(1), log(2)),
c(-1, -1, -3, 3, log(2), log(2)))
for(val in starting_values)
<- fixed_initialization(hmm_ino, at = val) hmm_ino
```

To illustrate the subset initialization strategy, we fit our HMM to
the first 50% of the observation. The starting values for these subsets
are chosen randomly from the same distributions as considered above. The
function `subset_initialization()`

then fits the HMM again to
the entire sample using the estimates obtained from the subsets as
initial values.

```
<- subset_initialization(
hmm_ino arg = "data", how = "first", prop = 0.5,
hmm_ino, initialization = random_initialization(runs = 50, sampler = sampler)
)
```

As the time series of share returns consists of 5815 observations, we can even try to use fewer observations to fit our model. In the next step, we consider only the first 5% of the observations.

```
<- subset_initialization(
hmm_ino arg = "data", how = "first", prop = 0.05,
hmm_ino, initialization = random_initialization(runs = 50, sampler = sampler)
)
```

Selecting the starting values for HMMs is a well-known issue, as poor
starting values may likely result in local maxima. Other R packages
designed to fit HMMs discuss this topic in more detail (see, e.g., https://cran.r-project.org/package=moveHMM). We thus
first evaluate the optimizations by comparing the likelihood values at
convergence, which can be displayed using
`overview_optima()`

:

```
overview_optima(hmm_ino)
#> optimum frequency
#> 1 12877.85 142
#> 2 13774.87 10
#> 3 13454.33 1
```

The frequency table indicates that 142 out of 153 runs converged to the same likelihood value (1.287785^{4}), which appears to be the global optimum (note these are the negative log-likelihood values).

Using `summary`

, we can investigate the computation time
and the resulting optimum for all runs:

```
summary(hmm_ino)
#> # A tibble: 153 × 4
#> .strategy .time .optimum .optimizer
#> <chr> <drtn> <dbl> <chr>
#> 1 random 11.772112 secs 12878. stats::nlm
#> 2 random 9.919681 secs 12878. stats::nlm
#> 3 random 10.021004 secs 12878. stats::nlm
#> 4 random 9.735524 secs 12878. stats::nlm
#> 5 random 19.883488 secs 12878. stats::nlm
#> 6 random 11.398910 secs 12878. stats::nlm
#> 7 random 25.223363 secs 12878. stats::nlm
#> 8 random 20.918431 secs 12878. stats::nlm
#> 9 random 34.802511 secs 12878. stats::nlm
#> 10 random 8.948022 secs 13775. stats::nlm
#> # … with 143 more rows
```

To obtain the starting values that lead to the global optimum, we first check which of the optimization runs actually reached the global optimum:

```
which(summary(hmm_ino)$.optimum < 12878)
#> [1] 1 2 3 4 5 6 7 8 9 11 12 13 14 16 17 19 20 21
#> [19] 22 23 24 25 26 27 28 29 30 31 32 34 35 36 37 38 39 40
#> [37] 41 42 43 44 45 46 47 48 49 51 52 53 54 55 56 57 59 60
#> [55] 61 62 63 64 65 66 67 68 70 71 72 73 74 75 76 77 78 79
#> [73] 80 81 82 83 84 85 86 87 88 89 91 92 94 95 96 97 98 100
#> [91] 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118
#> [109] 119 120 121 122 123 124 125 126 127 129 130 131 132 133 134 135 136 137
#> [127] 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153
```

We can access the corresponding initial values of, for example, the first of these runs via:

```
get_vars(hmm_ino, runs = which(summary(hmm_ino)$.optimum < 12878)[1])[[1]]$.init
#> [1] -1.2160764 -1.5387768 -0.7074952 0.3645820 0.5118483 0.5282509
```

We use the `plot()`

function to investigate the
computation time. Intuitively, optimization runs with fixed starting
values should be faster than with random starting values, since we have
carefully chosen the fixed starting values by investigating the data.
The boxplots confirm this intuition:

`plot(hmm_ino, by = ".strategy", nrow = 1)`

Using the output provided by `summary()`

and some data
manipulation functions provided by the package {dplyr}, we can also
compute the average time per strategy:

```
summary(hmm_ino) %>%
group_by(.strategy) %>%
summarise(avg_time = mean(.time))
#> # A tibble: 4 × 2
#> .strategy avg_time
#> <chr> <drtn>
#> 1 fixed 8.693801 secs
#> 2 random 14.082786 secs
#> 3 subset(first,0.05) 5.418156 secs
#> 4 subset(first,0.5) 8.008285 secs
```

This comparison can however be considered somehow ‘unfair’, as not all optimization runs lead to the global optimum — optimization runs that lead to local optima may then have less iterations and hence lower computation time. We can first check which runs lead to the apparent global optimum of 1.287785^{4}. For that, we will again use functions from {dplyr}.

```
summary(hmm_ino) %>%
mutate(global_optimum = ifelse(.optimum < 12878, 1, 0)) %>%
group_by(.strategy) %>%
summarise(proportion_global_optimum = mean(global_optimum))
#> # A tibble: 4 × 2
#> .strategy proportion_global_optimum
#> <chr> <dbl>
#> 1 fixed 1
#> 2 random 0.9
#> 3 subset(first,0.05) 0.98
#> 4 subset(first,0.5) 0.9
```

While for the fixed starting values all runs converge to the global optimum, the random initialization and subset initialization strategies do sometimes get stuck in local maxima.

Let’s again compare the average computation time, but only for those runs that lead to the global optimum:

```
summary(hmm_ino) %>%
filter(.optimum < 12878) %>%
group_by(.strategy) %>%
summarise(mean_time = mean(.time))
#> # A tibble: 4 × 2
#> .strategy mean_time
#> <chr> <drtn>
#> 1 fixed 8.693801 secs
#> 2 random 14.833138 secs
#> 3 subset(first,0.05) 5.488308 secs
#> 4 subset(first,0.5) 8.375048 secs
```

For those runs that converged to the global optimum, we can again compare the computation time via boxplots:

```
summary(hmm_ino) %>%
filter(.optimum < 12878) %>%
ggplot(aes(x = "", y = .time)) +
scale_y_continuous() +
geom_boxplot() +
facet_wrap(".strategy", labeller = "label_both", nrow = 1) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
+
) ylab("optimization time")
```

While computation time is lowest for the fixed initialization strategy, subset initialization appears to be a promising approach. In particular, when considering the first 5% of the data in a first step, the computation time is most likely to be lower compared to the random initialization approach.

The vignette was build using R 4.2.1 with the {ino} 0.2.0 package.↩︎