The **spcosa**-package implements algorithms for spatial coverage sampling and for random sampling from compact geographical strata based on the *k*-means algorithm (de Gruijter *et al.*, 2006, Walvoort *et al.*, 2010). Spatial coverage sampling is known to be an efficient sampling method for model-based mapping (kriging). Random sampling from compact geographical strata is recommended for design-based estimation of spatial means, proportions, etc.. In this vignette, the usage of the package will be demonstrated by means of several examples. In addition to spatial coverage sampling also stratified simple random sampling of compact strata will be treated in this vignette.

The **spocsa**-package can be attached by means of

`library(spcosa)`

`Loading required package: rJava`

In addition, we will also attach the **ggplot2** package for modifying plots:

`library(ggplot2)`

Although the implemented optimisation algorithms are deterministic in nature, they use a user-specified number (`nTry`

, see examples below) of random initial configurations to reduce the risk of ending up in an unfavourable local optimum. In order to be able to reproduce the sampling patterns in a later stage, the pseudo random number generator of **R** has to be initialised first:

`set.seed(314)`

The **spcosa**-package depends on the **sp**-package (Pebesma & Bivand, 2005) for storing spatial information, and the **ggplot2**-package (Wickham, 2009) for visualisation. A basic knowledge of the **sp**-package is highly recommended. Knowledge of the **ggplot2**-package is only needed for fine-tuning **spcosa** graphics. Consult the superb ggplot2-website for details and illustrative examples.

The basic idea is to distribute sampling points evenly over the study area by selecting these points in compact spatial strata. Compact strata can be obtained by *k*-means clustering of the cells making up a fine grid representing the study area of interest. Two *k*-means algorithms have been implemented in the **spcosa**-package: a transfer algorithm and a swopping algorithm (Walvoort *et al*, 2010). The transfer algorithm obtains compact clusters (strata) by transferring cells from one cluster to the other, whereas the swopping algorithm achieves this by swopping cells between clusters. The first algorithm results in compact clusters, whereas the second algorithm results in compact clusters of equal size. For reasons of efficiency, both algorithms have been implemented in the Java language and communicate with **R** (R Core Team, 2015) by means of the **rJava**-package (Urbanek, 2013).

In this section, the **spcosa**-package will be demonstrated by means of several examples.

In the first example, spatial coverage sampling will be applied to map clay content in a study area by means of ordinary point kriging.

First, a vector or raster representation of the study area has to be loaded. As an example, the ASCII-grid `demoGrid.asc`

residing in the `maps`

directory of the **spcosa**-package will be read by means of the `readGDAL`

-function of the **rgdal**-package (Bivand *et al.*, 2015)

```
library(rgdal)
<- system.file("maps", "demoGrid.asc", package = "spcosa")
filename <- readGDAL(fname = filename, silent = TRUE) ascii_grid
```

To obtain a uniform distribution of sampling points over the study area, the sampling points will be selected at the centroids of compact spatial strata. Compact strata can be constructed by invoking the `stratify`

method:

`<- stratify(ascii_grid, nStrata = 75, nTry = 10) stratification `

In this example, the study area has been partitioned into 75 compact strata. The resulting stratification can be plotted by means of:

`plot(stratification)`

Each plot can be modified by adding **ggplot2**-functions to the `plot`

-method:

```
plot(stratification) +
scale_x_continuous(name = "Easting (km)") +
scale_y_continuous(name = "Northing (km)")
```

```
Scale for 'x' is already present. Adding another scale for 'x', which will
replace the existing scale.
```

```
Scale for 'y' is already present. Adding another scale for 'y', which will
replace the existing scale.
```

The `spsample`

-method, an overloaded method from the **sp**-package, can be used to select the centroid of each stratum:

`<- spsample(stratification) sampling_pattern `

The `plot`

-method can be used to visualise the resulting sampling pattern:

`plot(sampling_pattern)`

The sampling pattern can also be plotted on top of the stratification:

`plot(stratification, sampling_pattern)`

Sampling locations can be extracted by means of a simple type cast to class `data.frame`

:

```
<- as(sampling_pattern, "data.frame")
sampling_points head(sampling_points)
```

```
x y
1 79.971831 5.253521
2 13.703704 37.802469
3 11.338710 10.435484
4 26.808824 26.323529
5 4.090909 24.038961
6 25.298246 17.298246
```

Next, field work needs to be done to acquire data at these locations.

`Loading required package: gstat`

These data will be stored in a `data.frame`

:

`head(my_data)`

```
clay SOM
1 11.2 5.8
2 10.6 4.8
3 10.3 5.3
4 12.8 6.2
5 10.9 6.0
6 9.1 4.9
```

Model-based inference will be applied to predict clay contents. Model-based inference is not implemented in the **spcosa**-package, because other add-on packages already provide excellent support for it. See for instance, **gstat** (Pebesma, 2004), **spatial** (Venables & Ripley, 2002), and **geoR** (Diggle & Ribeiro, 2007).

As an example, we will use the **gstat** package for predicting clay contents by means of ordinary point kriging. First, the observations and sampling pattern will be combined in a single object of class `SpatialPointsDataFrame`

:

```
#sp_data <- data.frame(sampling_points, my_data)
#coordinates(sp_data) <- ~ x * y
<- SpatialPointsDataFrame(coords = sampling_points, data = my_data) sp_data
```

Next, the sample variogram for clay has to be estimated, and needs to be fitted by a permissible model:

```
<- variogram(clay ~ 1, sp_data)
sample_variogram <- fit.variogram(sample_variogram,
variogram_model model = vgm(psill = 1, model = "Exp", range = 5, nugget = 0))
plot(sample_variogram, model = variogram_model)
```

Finally, ordinary point kriging predictions can be obtained by means of:

```
<- gstat(id = "clay", formula = clay ~ 1, data = sp_data, model = variogram_model)
g <- predict(g, newdata = ascii_grid) y_hat
```

`[using ordinary kriging]`

The resulting predictions and variances of the prediction error are given in the figures below:

Sometimes, samples from previous sampling campaigns are available. In these situations, spatial infill sampling may be performed. This type of spatial coverage sampling aims to distribute new sampling points evenly over the study area, while taking the locations of existing sampling points into account. Suppose a `data.frame`

is available containing the coordinates of 50 existing sampling points:

`head(prior_points)`

```
x y
1 98.2 38.1
2 47.3 38.2
3 23.5 40.0
4 19.9 25.4
5 3.2 32.4
6 28.4 1.3
```

Twenty-five *new* points can be assigned to sparsely sampled regions by means of:

```
coordinates(prior_points) <- ~ x * y
<- stratify(ascii_grid, priorPoints = prior_points, nStrata = 75, nTry = 100)
stratification <- spsample(stratification) sampling_pattern
```

Note that the total number of strata, and therefore the total number of points, equals 50+25=75. In addition, also note that the `nTry`

argument has been set to 100. The algorithm will now use 100 random starting configurations and keeps the best solution to reduce the risk of ending up in an unfavourable local optimum.

`plot(stratification, sampling_pattern)`

Note that prior points and new points are represented by different symbols.

A map of clay contents can be obtained by means of ordinary point kriging. Suppose again, that data have been collected in the field and stored in `data.frame`

`sp_data`

:

Again, we will use **gstat** for model-based inference:

```
<- variogram(clay~1, sp_data)
sample_variogram <- fit.variogram(sample_variogram,
variogram_model model = vgm(psill = 1, model = "Exp", range = 1, nugget = 0))
plot(sample_variogram, model = variogram_model)
```

```
<- gstat(id = "clay", formula = clay ~ 1, data = sp_data, model = variogram_model)
g <- predict(g, newdata = ascii_grid) y_hat
```

`[using ordinary kriging]`

The resulting predictions and variances of the prediction error are given in the figures below:

In this section, the global mean clay and organic matter contents of the study area will be estimated by means of stratified simple random sampling. Again, we will use `ascii_grid`

as a representation of the study area. The study area will be partitioned into 25 compact strata:

`<- stratify(ascii_grid, nStrata = 25, nTry = 10) stratification `

`plot(stratification)`

The `spsample`

-method can be used to randomly sample two sampling units per stratum:

`<- spsample(stratification, n = 2) sampling_pattern `

`plot(stratification, sampling_pattern)`

Sampling points can be extracted by means of a type cast to class `data.frame`

:

```
<- as(sampling_pattern, "data.frame")
sampling_points head(sampling_points)
```

```
x y
1 33.83659 12.15027
2 28.81836 18.51965
3 84.69064 36.65839
4 79.47751 50.19131
5 27.47371 18.48114
6 25.62352 17.01825
```

Next, some field work has to be done. Again, the results will be stored in `data.frame`

`my_data`

:

The spatial mean clay and soil organic matter contents can be estimated by (de Gruijter *et al.*, 2006):

`estimate("spatial mean", stratification, sampling_pattern, my_data)`

```
clay SOM
9.55672 4.81517
```

In estimating the spatial mean, differences in surface area of the strata are taken into account. Note, that the spatial mean is estimated for all columns in `my_data`

. The standard error can be estimated in a similar way:

`estimate("standard error", stratification, sampling_pattern, my_data)`

```
clay SOM
0.2509970 0.1152259
```

The spatial cumulative distribution function (SCDF) (see deGruijter *et al.*, 2006) can be estimated by means of

`<- estimate("scdf", stratification, sampling_pattern, my_data) scdf `

The SCDFs are returned as a list of matrices, *i.e.* one matrix for each property:

`lapply(X = scdf, FUN = head, n = 4)`

```
$clay
value cumFreq
[1,] 5.0 0.0000
[2,] 5.2 0.0205
[3,] 6.2 0.0412
[4,] 6.6 0.0612
$SOM
value cumFreq
[1,] 2.3 0.0000
[2,] 2.9 0.0205
[3,] 3.3 0.0612
[4,] 3.5 0.1025
```

`lapply(X = scdf, FUN = tail, n = 4)`

```
$clay
value cumFreq
[33,] 13.9 0.9204
[34,] 14.7 0.9415
[35,] 14.9 0.9590
[36,] 15.0 0.9796
$SOM
value cumFreq
[26,] 7.1 0.9204
[27,] 7.2 0.9379
[28,] 7.4 0.9590
[29,] 7.6 0.9794
```

The SCDFs for clay and SOM are visualised below.

In this example, the aim is to estimate the global mean clay and organic matter contents of a field. To reduce laboratory costs, the soil aliquots collected at the sampling locations will be bulked into composite samples. The study area is a field near the village of Farmsum, in the North-East of the Netherlands. An ESRI shape file of this field is available in the `maps`

directory of the **spcosa**-package. It can be loaded by means of `readOGR`

, a function in the **rgdal** package (Bivand *et al.*, 2015):

```
<- system.file("maps", package = "spcosa")
directory <- readOGR(dsn = directory, layer = "farmsum", verbose = FALSE) shp_farmsum
```

First, the field will be stratified into, say, 20 compact strata of equal size. Strata of equal size are desirable to simplify fieldwork, *i.e.* equal supports of soil can be collected at the sampling locations (Brus *et al.*, 1999, de Gruijter *et al.*, 2006).

```
<- stratify(shp_farmsum, nStrata = 20, equalArea = TRUE, nTry = 10)
stratification plot(stratification)
```

Next, two sampling units will be selected at random in each stratum. At least two sampling units per stratum are required to estimate the sampling variance of the estimated mean.

```
<- spsample(stratification, n = 2, type = "composite")
sampling_pattern plot(stratification, sampling_pattern)
```

Sampling points can be extracted means of:

```
<- as(sampling_pattern, "data.frame")
sampling_points head(sampling_points)
```

```
composite x1 x2
1 1 259338.4 587256.1
2 2 259317.3 587245.8
3 1 259353.0 587297.2
4 2 259334.0 587289.3
5 1 259340.4 587376.9
6 2 259340.6 587362.0
```

Note that an extra column has been added specifying the sampling units to be bulked into each composite. A composite sample is formed by bulking one aliquot (sampling unit) per stratum. Field work now results in a composite sample of size two. These data will be stored in a `data.frame`

called `my_data`

:

The spatial mean and its standard error can be estimated by means of:

`estimate("spatial mean", stratification, sampling_pattern, my_data)`

```
clay SOM
10.70 5.35
```

`estimate("standard error", stratification, sampling_pattern, my_data)`

```
clay SOM
0.0100 0.0025
```

If we do not want to bulk soil aliquots, the same stratification can be used to select a sample of 20 \(\times\) 2 sampling locations:

```
<- spsample(stratification, n = 2)
sampling_pattern <- as(sampling_pattern, "data.frame")
sampling_points head(sampling_points)
```

```
x1 x2
1 259333.2 587266.0
2 259341.3 587258.7
3 259347.4 587303.3
4 259360.9 587285.2
5 259330.6 587387.2
6 259341.5 587359.6
```

`plot(stratification, sampling_pattern)`

In the examples above, the maps didn’t have projection attributes. At field scale, projection attributes are not really necessary. However, at continental and global scales, for example, projection attributes can’t be ignored. The *spcosa*-package is capable of handling projection attributes of class `CRS`

. More information on projections is available at the `PROJ.4’ project website and in the **rgdal**-package documentation (Bivand *et al.*, 2015).

To illustrate the effect of stratification on smaller spatial scales, consider two (relatively coarse) grids covering the surface of the earth:

```
<- expand.grid(
grd longitude = seq(from = -176, to = 180, by = 4),
latitude = seq(from = -86, to = 86, by = 4)
)gridded(grd) <- ~ longitude * latitude
<- grd
grd_crs proj4string(grd_crs) <- CRS("EPSG:4326")
```

Note that `grd`

is identical to `grd_crs`

, except that `grd_crs`

has projection attributes. If no projection attributes are available, the algorithms in the **spcosa**-package use squared Euclidean distances in the *k*-means algorithms. However, if projection attributes have been set, the coordinates will be transformed to lat/long format (latitude/longitude) and squared great circle distances will be used instead.

Both grids will be partitioned into 50 compact geographical strata:

```
<- stratify(grd, nStrata = 50)
strata <- stratify(grd_crs, nStrata = 50) strata_crs
```

`plot(strata)`

`plot(strata_crs)`

Note that `grd`

seems to have more compact strata near the geographic poles than `grd_crs`

. However, the contrary is true. This becomes evident when both stratifications are projected on a sphere:

Note that the spheres will only show up when you have the `rgl`

-package properly installed. The left figure is based on `grd`

, and the right figure on `grd_crs`

. The strata of `grd_crs`

are clearly more compact than those of `grd`

. In addition, `grd`

suffers from pronounced edge effects near the poles and at 180 degrees longitude. The strata are discontinuous at this meridian, *i.e.*, two points on opposite sides of the meridian are treated as very distant when squared Euclidean distances are used. The strata of `grd_crs`

, on the other hand, have been optimised on a sphere by using squared great circle distances, and don’t suffer from edge effects, *i.e.*, the great circle distance between two nearby points on opposite sides of this meridian is very small.

Although this package is about sampling from compact strata, it can also be used for simple random sampling, by setting `nStrata = 1`

:

```
<- readOGR(
shp_mijdrecht dsn = system.file("maps", package = "spcosa"),
layer = "mijdrecht", verbose = FALSE)
<- stratify(shp_mijdrecht, nStrata = 1, nGridCells = 5000)
stratification <- spsample(stratification, n = 30) sampling_pattern
```

`plot(stratification, sampling_pattern)`

In case of spatial coverage sampling, sampling the centroid of each cluster may become problematic in case of non-convex areas. A centroid may be situated well outside the area of interest. If this happens, the sampling point will be relocated to the nearest grid cell that is part of the target universe. This pragmatic solution usually gives reasonable results. However, in some extreme situations the solution may be less desirable. As an example, consider the ‘doughnut’-shaped field below.

```
<- expand.grid(s1 = -25:25, s2 = -25:25)
doughnut <- with(doughnut, sqrt(s1^2 + s2^2))
d <- doughnut[(d < 25) & (d > 15), ]
doughnut coordinates(doughnut) <- ~ s1 * s2
gridded(doughnut) <- TRUE
```

```
<- stratify(doughnut, nStrata = 2, nTry = 100)
stratification <- spsample(stratification)
sampling_pattern plot(stratification, sampling_pattern)
```

Note that this problem does not arise in random sampling from compact geographical strata.

```
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux
Matrix products: default
BLAS: /usr/lib/libblas.so.3.9.0
LAPACK: /usr/lib/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
[9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] gstat_2.0-6 rgdal_1.5-19 sp_1.4-4 ggplot2_3.3.3 spcosa_0.3-10
[6] rJava_0.9-13
loaded via a namespace (and not attached):
[1] zoo_1.8-8 tidyselect_1.1.0 xfun_0.20
[4] purrr_0.3.4 lattice_0.20-41 colorspace_2.0-0
[7] vctrs_0.3.6 generics_0.1.0 miniUI_0.1.1.1
[10] htmltools_0.5.0 spacetime_1.2-3 yaml_2.2.1
[13] rlang_0.4.10 manipulateWidget_0.10.1 later_1.1.0.1
[16] pillar_1.4.7 glue_1.4.2 withr_2.3.0
[19] lifecycle_0.2.0 stringr_1.4.0 munsell_0.5.0
[22] gtable_0.3.0 htmlwidgets_1.5.3 evaluate_0.14
[25] labeling_0.4.2 knitr_1.30 fastmap_1.0.1
[28] httpuv_1.5.4 crosstalk_1.1.0.1 xts_0.12.1
[31] Rcpp_1.0.5 xtable_1.8-4 promises_1.1.1
[34] scales_1.1.1 webshot_0.5.2 jsonlite_1.7.2
[37] FNN_1.1.3 farver_2.0.3 mime_0.9
[40] digest_0.6.27 stringi_1.5.3 dplyr_1.0.2
[43] shiny_1.5.0 grid_4.0.3 tools_4.0.3
[46] magrittr_2.0.1 rgl_0.103.5 tibble_3.0.4
[49] crayon_1.3.4 pkgconfig_2.0.3 ellipsis_0.3.1
[52] rmarkdown_2.6 R6_2.5.0 intervals_0.15.2
[55] compiler_4.0.3
```

Bivand, R., T. Keitt and B. Rowlingson (2015). rgdal: Bindings for the Geospatial Data Abstraction Library. R package version 0.9-2. https://cran.r-project.org/package=rgdal

Brus, D.J., L. E. E. M. Spätjens, J.J. de Gruijter (1999).A sampling scheme for estimating the mean extractable phosphorus concentration of fields for environmental regulation. Geoderma 89: 129-148

de Gruijter, J., D. Brus, M. Bierkens & M. Knotters (2006). Sampling for Natural Resource Monitoring. Springer Berlin

Diggle, P.J. & Ribeiro Jr, P.J. Model Based Geostatistics Springer, New York, 2007

Pebesma, E.J., 2004. Multivariable geostatistics in S: the gstat package. Computers & Geosciences, 30: 683-691.

Pebesma, E.J., R.S. Bivand, 2005. Classes and methods for spatial data in R. R News 5 (2), https://cran.r-project.org/doc/Rnews/Rnews_2005-2.pdf

R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.r-project.org/.

Urbanek, S. (2013). rJava: Low-level R to Java interface. R package version 0.9-6. https://cran.r-project.org/package=rJava

Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0

Walvoort, D. J. J., Brus, D. J., and de Gruijter, J. J. (2010) An R package for spatial coverage sampling and random sampling from compact geographical strata by k-means Computers & Geosciences 36: 1261-1267 Available online at https://dx.doi.org/10.1016/j.cageo.2010.04.005

Wickham, H (2009). ggplot2: elegant graphics for data analysis. Springer New York