# Data Preparation and Visualization

Archaeological data as it can be found “in the wild” rarely conforms to the formats suitable for any kind of statistical analysis. This vignette is intended as a way of demonstrating a possibility of data cleaning on a data set as it can frequently be found in archaeological literature, or put together during archaeological research. The data is prepared for use with the datplot-package. An analysis based on this data can be found as a case study in the paper “datplot: A new R-Package for the Visualization of Date Ranges in Archaeology” (Weissova and Steinmann, n.d.) by the authors of this package.

The inscriptions of Bithynia data set, that is included in this package, was compiled for a publication analyzing the settlement patterns and road systems in ancient Bithynia (Weissova 2019). The vignette is thus meant as guidance for archaeologists looking to employ the datplot-package, and finding themselves in need of reformatting their data without much experience in programming languages. It points to certain problems arising with data sets, which are often incarnated in the form of spreadsheets. The Bithynia data set is a typical example of the structure of data as it is used by many (classical) archaeologists. The process of cleaning it highlights solutions for the issues encountered with such spreadsheets, and may also be adapted to other data as well as fields.

First, we attach the packages to be used in the vignette:

library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.0.4
library(stringr)
library(forcats)
#> Warning: package 'forcats' was built under R version 4.0.4
library(ggplot2)
library(knitr)
library(datplot)
library(ggridges)
library(reshape2)

# The “Inscriptions of Bithynia” data set

The manually curated data that B. Weissova prepared for her dissertation was saved as an Excel-spreadsheet. As this is very often the case, we continue directly from this original file, which is available in this repositories “inst/extdata/”-sub-directory. It can be loaded into R without further complications using the read_excel()-function from the package readxl. Alternatively, the file is suited for storage as a *.tsv-file, which is also provided in the same directory for better long-term support.

inscriptions <- readxl::read_excel(system.file("extdata",
"Bithynia_Inscriptions.xlsx",
package = "datplot",
mustWork = TRUE))
summary(inscriptions)
#>      ikey             Location            Source          Chronological Frame
#>  Length:2878        Length:2878        Length:2878        Length:2878
#>  Class :character   Class :character   Class :character   Class :character
#>  Mode  :character   Mode  :character   Mode  :character   Mode  :character
#>    Language
#>  Length:2878
#>  Class :character
#>  Mode  :character

A large amount of the initial data was provided by the Searchable Greek Inscriptions Tool of the Packard Humanities Institute and the Epigraphische Datenbank Heidelberg. Those inscriptions (n = 2877) can be referenced in the databases with their ikey. The data set was supplemented with 323 inscriptions that were manually gathered from different sources. The relevant citations are to be found in the Source column. Additional information on the creation and curation of this data set can be found in the publication mentioned above (Weissova 2019).

The original file consists of five columns which each row representing a single inscription: ikey contains the reference to the Searchable Greek Inscriptions Tool of the Packard Humanities Institute, indicated via PH as a prefix, or to the Epigraphische Datenbank Heidelberg, indicated via HD as a prefix. Location refers to the find spot of the inscription. Source states the source of the data. Chronological Frame contains the dating in a verbose format, such as “Roman Imperial Period.” Language records the language in which the inscription was written, which can either be Latin, Greek, or both.

# Data Preparation, Cleaning and Reformatting

The data set is not yet suited for analysis, as some variables, especially the chronological frame, have many inconsistencies. For further processing, we should also be sure to include an identifier-column. As 331 inscriptions do not have an ikey-Value, which might have otherwise been a good candidate for identification, we chose to create a new automatically generated ID, so that every inscription can be individually identifiable.

inscriptions$ID <- paste("I_", 1:nrow(inscriptions), sep = "") Two of the variables of this data set are almost ready for further use, i.e. Location and Language. A look at their unique values reveals only small inconsistencies that can be easily fixed: unique(inscriptions$Location)
#>  [1] "Apamea"                    "Apollonia ad Rhyndacum"
#>  [3] "Caesarea Germanice"        "Chalcedon"
#>  [5] "Claudiopolis"              "Creteia Flaviopolis"
#> [11] "Nicaea"                    "Nicomedia"
#> [13] "Nicomedia (Dakibyza)"      "Prusias ad Hypium"
#> [17] "Prusias ad Olympum"        "Pylai"
#> [19] "Strobilos"                 "Tios"
#> [21] "unknown"                   "unknown (Bolu museum)"
#> [23] "unknown (Lamounia ?)"
unique(inscriptions$Language) #> [1] "Greek" "Gr/Lat" "Latin" "Gr / Lat" Using functions from the tidyverse package family, we can easily transform the columns. We rename the “Chronological Frame” to Dating, as shorter names without spaces are more convenient to work with, and add proper NA-values if there is no chronological assessment. With mutate() and replace() we also clear out the redundant variable values from Location and Language: inscriptions <- inscriptions %>% rename(Dating = Chronological Frame) %>% mutate(Dating = na_if(Dating, "---"), Language = replace(Language, Language == "Gr/Lat", "Greek/Latin"), Language = replace(Language, Language == "Gr / Lat", "Greek/Latin"), Language = factor(Language, levels = c("Greek", "Latin", "Greek/Latin")), Location = replace(Location, str_detect(Location, "unknown"), "unknown"), Location = replace(Location, Location == "Prusias ad Mare (Keramed)", "Prusias ad Mare"), Location = factor(Location)) This conversion leaves us with a more compact overview of the data sets contents: summary(inscriptions) #> ikey Location Source #> Length:2878 Nicaea :761 Length:2878 #> Class :character Nicomedia :501 Class :character #> Mode :character Prusias ad Olympum:344 Mode :character #> Claudiopolis :252 #> Chalcedon :200 #> Prusias ad Hypium :179 #> (Other) :641 #> Dating Language ID #> Length:2878 Greek :2724 Length:2878 #> Class :character Latin : 125 Class :character #> Mode :character Greek/Latin: 29 Mode :character #> #> #> #>  ## Cleaning up the Dating-variable Some of the values in the Dating-variable contain question signs, indicating an uncertainty of the chronological frames assessment. To keep this information, we store it in a new variable uncertain_dating, which contains TRUE if there was a question mark in the original assessment and FALSE if the dating was certain, so that if one wishes the additional information could later be used to select or exclude uncertain values. inscriptions$uncertain_dating <- FALSE
sel <- grep("\\?", inscriptions$Dating) inscriptions$uncertain_dating[sel] <- TRUE
inscriptions$Dating <- gsub("\\?", "", inscriptions$Dating)

### Creating a Concordance for Periods

The next step is sorting out values from the Dating variable that have to be manually entered, such as the “Roman Imperial Period.” We achieve this by excluding all values that contain a number, preparing a table in which we can manually enter the desired dating span and saving it as a .csv-file. After adding the values manually, we reload the file. The manually edited concordance can be found in the “inst/extdata/”-sub-directory of this repository in the file periods_edit.csv and corresponds with the chronological assessment of periods used in the original publication (Weissova 2019, 42).

sel <- grepl("[0-9]", inscriptions$Dating) periods <- data.frame("Dating" = unique(inscriptions$Dating[which(sel == FALSE)]))
periods$DAT_min <- NA periods$DAT_max <- NA
#write.csv(periods, file = "../data-raw/periods.csv", fileEncoding = "UTF-8")
# .... Manual editing of the resulting table, saving it as "periods_edit.csv".
join_dating <- read.csv(file = system.file('extdata', 'periods_edit.csv',
package = 'datplot',
mustWork = TRUE),
row.names = 1,
colClasses = c("character", "character",
"integer", "integer"),
encoding = "UTF-8")

It is of course possible to add the corresponding values in R. Since the process could hardly be automated, this way seemed more efficient to us, though it is - sadly - less transparent in this vignette. The values can be examined by the reader in the csv-Table mentioned above, or by loading the table via system.file() as seen in the code chunk.

### Reformatting of Partially Numerical Dating Values

There remains, however, a large amount of values that is not covered in this concordance. We can easily automate some of the conversions as a series of steps that we store in another data.frame called num_dating, encompassing all the unique values that contain some form of a numerical dating.

num_dating <- data.frame("Dating" = unique(inscriptions$Dating[which(sel == TRUE)])) num_dating$DAT_min <- NA
num_dating$DAT_max <- NA First, there is a number of inscriptions dated to a single year. We select these using a regular expression with grep()1 and can simply delete the character-part of the values so that only the specified year remains and is stored in both of the DAT_-variables. We do this separately for AD and BC-values since BC needs to be stored as a negative number. sel <- grep("^[0-9]{1,3} AD$", num_dating$Dating) num_dating$DAT_min[sel] <- gsub(" AD", "", num_dating$Dating[sel]) num_dating$DAT_max[sel] <- gsub(" AD", "", num_dating$Dating[sel]) sel <- grep("^[0-9]{1,3} BC$", num_dating$Dating) num_dating$DAT_min[sel] <- paste("-", gsub(" BC", "", num_dating$Dating[sel]), sep = "") num_dating$DAT_max[sel] <- paste("-", gsub(" BC", "", num_dating$Dating[sel]), sep = "") As a demonstration, this is the resulting table (num_dating) of up to this point converted values: Dating DAT_min DAT_max 136 196 AD 196 196 215 253 AD 253 253 258 114 AD 114 114 131 223 AD 223 223 98 400 BC -400 -400 257 142 AD 142 142 214 260 AD 260 260 265 241 BC -241 -241 12 585 AD 585 585 55 206 AD 206 206 Since we frequently check the values in num_dating to look for errors, we append the finished rows to join_dating, which we later use as a look-up-table for our data set, and remove the finished rows from num_dating. join_dating <- rbind(join_dating, num_dating[!is.na(num_dating$DAT_min), ])
num_dating <- num_dating[which(is.na(num_dating$DAT_min)), ] We have to convert the Dating variable from factor to character now, so we can use the strsplit()-function on the values: num_dating$Dating <- as.character(num_dating$Dating) As some of the values are in the format year-year, e.g. “150-160 AD,” we can easily grab both numbers from the string. To achieve this, we select all the rows containing the relevant format, then loop over each of these rows and split the character string along either “-” or spaces, and later “/,” as we chose to treat values of the format “198/198 AD” as “197 - 198 AD.” Selecting the numerical values from the resulting list according to their position gives us the desired values for DAT_min and DAT_max. We need to do the same for the BC-values to make sure they return as negative numbers. # Values like: 92-120 AD sel <- grep("^[0-9]{1,3}–[0-9]{1,3} AD", num_dating$Dating)
for (r in sel) {
split <- strsplit(x = num_dating$Dating[r], split = "–| ") num_dating$DAT_min[r] <- split[[1]][1]
num_dating$DAT_max[r] <- split[[1]][2] } # Values like: AD 92-120 sel <- grep("^AD [0-9]{1,3}–[0-9]{1,3}$", num_dating$Dating) for (r in sel) { split <- strsplit(x = num_dating$Dating[r], split = "–| ")
num_dating$DAT_min[r] <- split[[1]][2] num_dating$DAT_max[r] <- split[[1]][3]
}
# Values like: AD 92 - 120
sel <- grep("^AD [0-9]{1,3} - [0-9]{1,3}", num_dating$Dating) for (r in sel) { split <- strsplit(x = num_dating$Dating[r], split = " - | ")
num_dating$DAT_min[r] <- split[[1]][2] num_dating$DAT_max[r] <- split[[1]][3]
}
sel <- grep("^[0-9]{1,3}/[0-9]{1,3} AD", num_dating$Dating) for (r in sel) { split <- strsplit(x = num_dating$Dating[r], split = "/| ")
num_dating$DAT_min[r] <- split[[1]][1] num_dating$DAT_max[r] <- split[[1]][2]
}
# Values like: 525-75 BC
sel <- grep("^[0-9]{1,3}–[0-9]{1,3} BC", num_dating$Dating) for (r in sel) { split <- strsplit(x = num_dating$Dating[r], split = "–| ")
num_dating$DAT_min[r] <- 0 - as.numeric(split[[1]][1]) num_dating$DAT_max[r] <- 0 - as.numeric(split[[1]][2])
}

Another look at the data set can help us to check for possible errors.

Dating DAT_min DAT_max
276 AD 84 - 96 84 96
153 AD 150 - 200 150 200

Putting aside the finished values again makes it easier to spot errors in the process:

join_dating <- rbind(join_dating, num_dating[!is.na(num_dating$DAT_min), ]) num_dating <- num_dating[which(is.na(num_dating$DAT_min)), ]

### Reformatting of Inscriptions Dated to Complete Centuries

Next, we separate values that identify complete centuries. As we want to express the dating in absolute numbers, we convert “1st c. AD” to a time span ranging from 0 to 99, and “1st c. BC” to -99 to 0 respectively. The regular expression selects all values, where a single number in the beginning of the string is followed by two letters (i.e. 2nd, 3rd, 1st) and “c. AD” resp. “c. BC.” We subtract one and multiply the number by 100 to get the respective boundaries, again taking care to treat AD and BC differently.

sel <- grep("^[0-9]{1}[a-z]{2} c\\. AD$", num_dating$Dating)
for (r in sel) {
split <- strsplit(x = num_dating$Dating[r], split = "[a-z]{2} c\\.") split <- as.numeric(split[[1]][1]) num_dating$DAT_min[r] <- ((split - 1) * 100)
num_dating$DAT_max[r] <- ((split - 1) * 100) + 99 } sel <- grep("^[0-9]{1}[a-z]{2} c\\. BC$", num_dating$Dating) for (r in sel) { split <- strsplit(x = num_dating$Dating[r], split = "[a-z]{2} c\\.")
split <- as.numeric(split[[1]][1])
num_dating$DAT_min[r] <- 0-(split * 100) + 1 num_dating$DAT_max[r] <- 0-((split - 1) * 100)
}
Dating DAT_min DAT_max
42 6th c. AD 500 599
15 3rd c. BC -299 -200
18 2nd c. AD 100 199
11 5th c. AD 400 499
16 3rd c. AD 200 299
47 4th c. BC -399 -300
41 6th c. BC -599 -500
25 1st c. BC -99 0
116 4th c. AD 300 399
17 2nd c. BC -199 -100

Again, putting aside the finished values makes it easier to spot errors in further processing:

join_dating <- rbind(join_dating, num_dating[!is.na(num_dating$DAT_min), ]) num_dating <- num_dating[which(is.na(num_dating$DAT_min)), ]

### Reformatting of Imprecise Dating Values

For dates that are around a certain value, i.e. of the format “ca. 190 AD,” we are not able to make a more informed decision than guess about what the researchers providing this assessment had in mind. This might also change from inscription to inscription. While a closer look at the individual inscriptions may yield a more sensible estimate, this seems not feasible as part of the data preparation process. It seems that in most cases, if the dating can be as precise as a span of around 10 years, researchers tend to emphasize this. Therefore, we decided to take a total span of 20 years, i.e. 10 years before and 10 years after the mentioned value, reflecting some uncertainty on the precision and duration of the estimate. “ca. 190 AD” thus becomes “180–200 AD,” with the same mechanism for BC in negative values.

sel <- grep("^ca\\. [0-9]{1,3} AD$", num_dating$Dating)
for (r in sel) {
split <- strsplit(x = num_dating$Dating[r], split = " ") split <- as.numeric(split[[1]][2]) num_dating$DAT_min[r] <- split - 10
num_dating$DAT_max[r] <- split + 10 } sel <- grep("^ca\\. [0-9]{1,3} BC$", num_dating$Dating) for (r in sel) { split <- strsplit(x = num_dating$Dating[r], split = " ")
split <- 0-as.numeric(split[[1]][2])
num_dating$DAT_min[r] <- split - 10 num_dating$DAT_max[r] <- split + 10
}
Dating DAT_min DAT_max
5 ca. 200 AD 190 210
6 ca. 100 AD 90 110
63 ca. 130 AD 120 140
202 ca. 214 AD 204 224
260 ca. 360 BC -370 -350

### Creating a Second Concordance for Verbose Chronological Assessments

Again, saving the finished values in join_dating leaves us with the list of not yet converted values as seen in num_datings Dating-variable.

join_dating <- rbind(join_dating, num_dating[!is.na(num_dating$DAT_min), ]) num_dating <- num_dating[which(is.na(num_dating$DAT_min)), ]
unique(num_dating$Dating)[1:20] #> [1] "end of the 2nd c. AD" #> [2] "end of 1st c. BC – beg. of 1st c. AD" #> [3] "end 2nd c. BC" #> [4] "early 1st c. AD" #> [5] "after 212 AD" #> [6] "AD 531" #> [7] "5th/6th century AD" #> [8] "3rd/4th century AD" #> [9] "2nd – 3rd c. AD" #> [10] "2nd – 1st c. BC" #> [11] "1st – beg. of the 2nd c. AD" #> [12] "1st – 2nd c. AD" #> [13] "2nd century AD" #> [14] "late 3rd – early 2nd c. BC" #> [15] "5th – 6th c. AD" #> [16] "3rd quarter of the 6th c. BC" #> [17] "3rd – 2nd c. BC" #> [18] "340/339 BC" #> [19] "27 BC–14 AD" #> [20] "1st century AD" Due to the heterogeneous nature of these 99 values, we decided to convert them manually again. Our criteria for translating terms like “beginning of” are the following: “beginning of,” “end of,” “early,” “late” are – similarly to ca. – all translated to 20 years, as we assume that more information would have been given if the time span was greater than a quarter century. For the other values, we employed the same criteria as seen above in the automated process. We decided to measure the ‘beginning of a century’ at e.g. -199 for BC values and 100 for AD values, and accordingly identify the ‘end of a century’ with the values e.g. -100 (BC) / 199 (AD). Since the data is epigraphical, in the case of “before” or “after” (i.e. terminus ante/post quem) dates, we assume some connection to a datable event, and therefore add or subtract a span of 10 years, which is, however, still rather arbitrary. As the last step, we switch the – in some cases automatically assigned – year 0, so that it will be treated as either 1 or -1: join_dating$DAT_min[which(join_dating$DAT_min == 0)] <- 1 join_dating$DAT_max[which(join_dating$DAT_max == 0)] <- -1 We then have to reload the corrected data. The values that we put aside in the join_dating data.frame beforehand serve as the basis for our new data.frame that we use as a look-up table. As the variables were treated as character-strings, we need to convert them to a numeric format first. We append the newly loaded manually assigned values to this data.frame. Again, to keep this process as transparent as possible, we included the *.csv-Table in the “inst/extdata/” sub-directory as “num_dating_edit.csv.” #write.csv(num_dating, file = "../data-raw/num_dating.csv", # fileEncoding = "UTF-8") num_dating <- read.csv(file = system.file('extdata', 'num_dating_edit.csv', package = 'datplot', mustWork = TRUE), encoding = "UTF-8", row.names = 1, colClasses = c("character", "character", "integer", "integer")) join_dating <- join_dating %>% mutate(DAT_min = as.integer(DAT_min), DAT_max = as.integer(DAT_max)) %>% rbind(num_dating) ## Joining the Reformatted Data with the Original Data Set left_join() lets us add the DAT_-variables from the look-up table join_dating to our original data.frame. inscriptions <- left_join(inscriptions, join_dating, by = "Dating") The DAT_-variables – as we can now see – contain the desired information: ID Location Dating DAT_min DAT_max I_1350 Nicaea 2nd c. AD 100 199 I_2550 Prusias ad Olympum 2nd c. BC -199 -100 I_388 Claudiopolis 117–138 AD 117 138 I_2713 Prusias ad Olympum 1st – 2nd c. AD 1 199 I_751 Kaisarea Hadrianopolis 6th c. AD 500 599 I_726 Kaisarea Hadrianopolis mid. 3rd c. AD 240 260 I_1320 Nicaea 2nd c. AD 100 199 I_2616 Prusias ad Olympum 2nd c. AD 100 199 I_1702 Nicomedia 284–305 AD 284 305 I_1012 Nicaea Roman Imp. period -31 395 I_2597 Prusias ad Olympum 2nd c. AD 100 199 I_1418 Nicaea 202–205 AD 202 205 I_786 Kaisarea Hadrianopolis 292–305 AD 292 305 I_1117 Nicaea Byzantine period 395 799 I_938 Nicaea Roman Imp. period -31 395 ## Fixing mistakes We can now make a ‘test run’ using datplot with the stepsize-value set to 5 in order to save as much time as possible2 and check for errors first. We have to select exactly 4 columns in the correct order of: Identifier, Grouping Variable, Minimum Dating, Maximum Dating. We transform the output from the pipe operator to a data.frame before handing it to datstepts(), as datplot needs this format. inscr_steps <- inscriptions %>% select(ID, Location, DAT_min, DAT_max) %>% na.omit() %>% as.data.frame() %>% datplot::datsteps(stepsize = 5) #> Warning in datplot::datsteps(., stepsize = 5): Warning: Dating seems to be in #> wrong order at ID I_1162, I_2725 (Index: 637, 1458). Dates have been switched, #> but be sure to check your original data for possible mistakes. #> Warning in get.weights(DAT_mat[, "datmin"], DAT_mat[, "datmax"]): Warning: #> DAT_min and DAT_max at Index: 40, 44, 45, 76, 77, 113, 126, 131, 132, 133, 147, #> 154, 198, 201, 203, 204, 205, 218, 240, 242, 243, 292, 293, 323, 333, 336, 337, #> 338, 339, 340, 341, 343, 346, 349, 350, 351, 354, 355, 356, 640, 881, 882, 886, #> 892, 935, 940, 943, 944, 946, 947, 948, 955, 1024, 1025, 1026, 1031, 1034, 1035, #> 1036, 1037, 1044, 1049, 1050, 1051, 1060, 1065, 1066, 1117, 1120, 1121, 1126, #> 1128, 1133, 1154, 1175, 1183, 1255, 1260, 1456, 1480, 1487, 1494, 1498) have the #> same value! Is this correct? Please check the table for possible errors. #> Warning in create.sub.objects(DAT_mat, stepsize): stepsize is larger than the #> range of the closest dated object at Index = 40, 44, 45, 60, 75, 76, 77, 113, #> 126, 131, 132, 133, 141, 147, 154, 171, 196, 198, 199, 201, 203, 204, 205, #> 214, 218, 240, 242, 243, 269, 292, 293, 323, 333, 334, 335, 336, 337, 338, #> 339, 340, 341, 343, 346, 347, 348, 349, 350, 351, 354, 355, 356, 625, 627, #> 628, 640, 646, 725, 726, 727, 728, 729, 730, 731, 732, 733, 734, 735, 736, 879, #> 881, 882, 883, 884, 885, 886, 888, 889, 890, 892, 893, 935, 936, 939, 940, 941, #> 942, 943, 944, 945, 946, 947, 948, 949, 955, 956, 957, 987, 988, 1016, 1022, #> 1023, 1024, 1025, 1026, 1027, 1028, 1029, 1031, 1034, 1035, 1036, 1037, 1044, #> 1046, 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1057, 1059, 1060, 1065, 1066, #> 1106, 1112, 1114, 1115, 1116, 1117, 1120, 1121, 1126, 1128, 1133, 1154, 1172, #> 1174, 1175, 1183, 1185, 1188, 1189, 1255, 1256, 1260, 1456, 1480, 1487, 1494, #> 1498, 1499, 1500, 1502, 1514, 1515, 1516). For information see documentation of #> get.step.sequence(). There are indeed problems in the data. The first of the three warnings issued by datsteps() informs us that we may have assigned some values wrong: “Warning: Dating seems to be in wrong order at ID I_1162, I_2725 (Index: 637, 1458). Dates have been switched, but be sure to check your original data for possible mistakes.” If dates are in the wrong order, datplot will automatically switch them before proceeding. This, however, might not always be the correct way of handling the situation, as other errors might have occurred as well. Therefore, we should check the data again using the Index-Values or IDs provided by datplots warning: inscriptions %>% select(ID, Location, Dating, uncertain_dating, DAT_min, DAT_max) %>% na.omit() %>% slice(637, 1458) %>% kable() ID Location Dating uncertain_dating DAT_min DAT_max I_1162 Nicaea 62/3 AD FALSE 62 3 I_2725 Prusias ad Olympum 180–192 BC FALSE -180 -192 The problem here is twofold. In the first entry, our automated translation of the dating could not handle the format “62/3” and thus returned a wrong value for DAT_max. If we had a truly large data set, we should correct this in the original reformatting process. In this case, it is more efficient to fix this problem right now in the fastest possible way. inscriptions[which(inscriptions$ID == "I_1162"),"DAT_max"] <- 63

In the other case, the Dating-column provided the BC-values in the wrong order. We can also fix this quickly right here, which is basically the same way datplot would handle both cases internally:

inscriptions[which(inscriptions$ID == "I_2725"),c("DAT_min", "DAT_max")] <- inscriptions[which(inscriptions$ID == "I_2725"),c("DAT_max", "DAT_min")]

Note that datsteps() changes the date and proceeds, but warns you of the possibility of errors. It is therefore recommended to check the data mentioned in that warning. As we ‘fixed’ our data set, we should save it again, so that the same error will not occur twice:

#write.table(inscriptions, file = "../data-raw/inscriptions.csv",
#            fileEncoding = "UTF-8", sep = ";", row.names = FALSE)

## Storing the Prepared Data Set

For later use and publication, we add explanations of the variables and metadata to the R-object and store them in the *.rda file. As a backup, we also save the finished table as a .csv-file, suitable for archiving. The .csv-Table, however, does not contain the additional information added as attributes.

All files are available in this repository and as part of this package for further use by other researchers. We kindly ask you to cite Weissova (2019), the Packard Humanities Institute and this repository at https://github.com/lsteinmann/datplot as sources.

The dataset can be loaded into R by simply calling

#library(datplot)
data("Inscr_Bithynia")

when datplot is loaded.

# Selecting the Data for Further Analysis and using datplot

As our aim is to get an overview of the spatio-temporal distribution of the inscriptions in Bithynia, we can only analyze inscriptions with known Location and Dating. Thus, we have to remove all the rows from the data set that do not contain this information.

inscr_clean <- Inscr_Bithynia %>%
filter(Dating != "NA",
Location != "unknown") %>%
droplevels()

This means that we removed a total of 1380 rows, which did not contain the information needed. The data set suitable for the analysis can be summarized as follows:

summary(inscr_clean)
#>       ID                ikey                             Location
#>  Length:1498        Length:1498        Nicaea                :604
#>  Class :character   Class :character   Prusias ad Olympum    :279
#>  Mode  :character   Mode  :character   Nicomedia             :108
#>                                        Apamea                : 77
#>                                        Chalcedon             : 69
#>                                        (Other)               :274
#>     Source             Dating                 Language    uncertain_dating
#>  Length:1498        Length:1498        Greek      :1392   Mode :logical
#>  Class :character   Class :character   Latin      :  82   FALSE:1327
#>  Mode  :character   Mode  :character   Greek/Latin:  24   TRUE :171
#>
#>
#>
#>
#>     DAT_min          DAT_max           URL
#>  Min.   :-599.0   Min.   :-525.0   Length:1498
#>  1st Qu.: -31.0   1st Qu.: 199.0   Class :character
#>  Median : 100.0   Median : 266.5   Mode  :character
#>  Mean   : 100.3   Mean   : 286.4
#>  3rd Qu.: 200.0   3rd Qu.: 395.0
#>  Max.   : 680.0   Max.   : 799.0
#> 

# datplot

We can now begin using the datplot-package…

library(datplot)

…and can actually try to use datsteps(), first with a stepsize of 25 years:

inscr_steps <- inscr_clean %>%
select(ID, Location, DAT_min, DAT_max) %>%
as.data.frame() %>%
datplot::datsteps(stepsize = 25) %>%
datplot::scaleweight(var = 2)
#> Warning in get.weights(DAT_mat[, "datmin"], DAT_mat[, "datmax"]): Warning:
#> DAT_min and DAT_max at Index: 40, 44, 45, 76, 77, 113, 126, 131, 132, 133, 147,
#> 154, 198, 201, 203, 204, 205, 218, 240, 242, 243, 292, 293, 323, 333, 336, 337,
#> 338, 339, 340, 341, 343, 346, 349, 350, 351, 354, 355, 356, 640, 881, 882, 886,
#> 892, 935, 940, 943, 944, 946, 947, 948, 955, 1024, 1025, 1026, 1031, 1034, 1035,
#> 1036, 1037, 1044, 1049, 1050, 1051, 1060, 1065, 1066, 1117, 1120, 1121, 1126,
#> 1128, 1133, 1154, 1175, 1183, 1255, 1260, 1456, 1480, 1487, 1494) have the same
#> value! Is this correct? Please check the table for possible errors.
#> Warning in create.sub.objects(DAT_mat, stepsize): stepsize is larger than the
#> range of the closest dated object at Index = 6, 12, 14, 18, 19, 20, 21, 22, 39,
#> 40, 41, 44, 45, 60, 74, 75, 76, 77, 79, 80, 83, 110, 113, 114, 125, 126, 131,
#> 132, 133, 141, 145, 147, 154, 161, 162, 163, 167, 168, 171, 173, 174, 175, 194,
#> 196, 198, 199, 201, 203, 204, 205, 206, 207, 208, 212, 214, 217, 218, 223, 231,
#> 232, 233, 234, 235, 236, 237, 238, 240, 242, 243, 244, 245, 246, 265, 269, 271,
#> 272, 273, 276, 292, 293, 323, 324, 331, 332, 333, 334, 335, 336, 337, 338, 339,
#> 340, 341, 342, 343, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356,
#> 579, 582, 583, 584, 585, 617, 618, 619, 620, 621, 622, 623, 625, 627, 628, 629,
#> 630, 631, 632, 633, 634, 635, 637, 640, 646, 723, 724, 725, 726, 727, 728, 729,
#> 730, 731, 732, 733, 734, 735, 736, 873, 874, 875, 876, 877, 878, 879, 880, 881,
#> 882, 883, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 935, 936, 938, 939,
#> 940, 941, 942, 943, 944, 945, 946, 947, 948, 949, 950, 951, 952, 953, 954, 955,
#> 956, 957, 958, 959, 975, 976, 979, 980, 981, 982, 983, 984, 987, 988, 989, 1002,
#> 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023, 1024, 1025, 1026, 1027, 1028,
#> 1029, 1030, 1031, 1032, 1033, 1034, 1035, 1036, 1037, 1038, 1044, 1045, 1046,
#> 1047, 1048, 1049, 1050, 1051, 1052, 1053, 1054, 1055, 1057, 1058, 1059, 1060,
#> 1061, 1062, 1063, 1064, 1065, 1066, 1074, 1075, 1078, 1079, 1080, 1081, 1082,
#> 1083, 1084, 1085, 1086, 1087, 1106, 1107, 1112, 1113, 1114, 1115, 1116, 1117,
#> 1118, 1119, 1120, 1121, 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130,
#> 1131, 1132, 1133, 1147, 1148, 1149, 1153, 1154, 1172, 1174, 1175, 1176, 1183,
#> 1185, 1186, 1187, 1188, 1189, 1218, 1219, 1220, 1221, 1222, 1223, 1224, 1248,
#> 1250, 1251, 1252, 1253, 1255, 1256, 1260, 1382, 1456, 1457, 1458, 1459, 1460,
#> 1461, 1462, 1463, 1464, 1465, 1466, 1467, 1468, 1480, 1487, 1494, 1496, 1497,
#> 1498). For information see documentation of get.step.sequence().

datsteps() tells us that a number or objects are dated to one year, asking us if this is correct. We included this output to avoid errors occurring through faulty values, as objects dated to one year have a very high impact on the outcome. It might help to spot problems or discrepancies in datasets as well, as the warning about the order of dating above has already shown.

In a second step, we scale the weights according to our grouping variable (Location), so that the sum of weights in each group equals 1, which is important for displaying the weight correctly in a density plot.

# Visualizing the Output of datsteps()

To get a general impression of the data set and the possibilities of visualization, we explore and recommend different plot methods below. The simplest and fastest way to plot the output with base R is the plot() function in combination with the density() function:

plot(density(inscr_steps$DAT_step)) ## Using ggplot2 A somewhat crowded overview containing the Locations of all Inscriptions can easily be achieved when using ggplot2 and its geom_density()-method, which is based on the same procedure as the density()-function used above. A result may look like this: ggplot(data = inscr_steps, aes(x = DAT_step, fill = variable, weight = weight)) + geom_density(alpha = 0.3) Note that the output without the weights calculated by datsteps() is very different: ggplot(data = inscr_steps, aes(x = DAT_step, fill = variable)) + geom_density(alpha = 0.3) The weights calculated by datplot (internally using the get.weights()-function) are a way of assigning an importance to the objects in question. They relate to the range an object is dated to (more detailed information on the process is available in (Weissova and Steinmann, n.d.)). Therefore, any objects dated to large time spans as for example “Roman Imperial Period” ranging from 31 BC to 395 AD in our data set, are contributing to the curve significantly less than any object dated to one exact year. At times it can be very useful to look at both outputs and discuss them separately. We will look at a case study of a single city later on to clarify this. ## Using ggplot2 and ggridges Since the graph is very crowded in this layout, we actually recommend using geom_density_ridges from the ggridges-package. Since the package has no built in support for assigning weight, we have to use stat = "density", which will reference the density()-function to get the appropriate calculations: ggplot(data = inscr_steps, aes(x = DAT_step, y = fct_rev(as_factor(variable)), fill = variable, weight = weight)) + geom_density_ridges(aes(height = ..density..), stat = "density", alpha = 0.9) + scale_fill_discrete(guide = FALSE) Styling will help to make the plots more readable: bluegreen <- colorRampPalette(c("#8dae25", "#17365c")) ggplot(data = inscr_steps, aes(x = DAT_step, y = fct_rev(as_factor(variable)), fill = variable, weight = weight)) + geom_density_ridges(aes(height = ..density..), stat = "density", alpha = 0.9) + scale_x_continuous(breaks = seq(from = -800, to = 800, by = 100), limits = c(-800,800), name = "") + geom_vline(xintercept = 0, alpha = 0.5, lwd = 1) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5), panel.background = element_blank(), panel.grid.major.x = element_line(linetype = "dashed", color = "gray30"), panel.grid.minor.x = element_line(linetype = "dotted", color = "gray80")) + scale_fill_manual(guide=FALSE, values = bluegreen(length(unique(inscr_steps$variable)))) +
labs(title = "Epigraphic Evidence from Bithynia",
subtitle = "Spatio-temporal distribution",
caption = attributes(inscriptions)$source) ## Using ggplot2 and facet_wrap() Another option is to separate the variables with facet_wrap, which is not as condensed and does not support the option to scale the densities in the graph for maximum visibility, as geom_density_ridges() automatically does (see Documentation of ggridges). In our case, this leads to a worse visibility of smaller density curves: ggplot(data = inscr_steps, aes(x = DAT_step, fill = variable, weight = weight)) + geom_density(alpha = 0.9) + scale_fill_discrete(guide = FALSE) + facet_wrap(variable ~ ., ncol = 1) Styling also helps very much to improve the graphs readability: ggplot(data = inscr_steps, aes(x = DAT_step, fill = variable, weight = weight)) + geom_density(alpha = 0.9) + theme(panel.background = element_blank()) + scale_fill_manual(guide=FALSE, values = bluegreen(length(unique(inscr_steps$variable)))) +
scale_x_continuous(breaks = seq(from = -800, to = 800, by = 100),
limits = c(-800,800), name = "") +
facet_wrap(variable ~ ., ncol = 1) +
theme(strip.text.x = element_text(size=8),
strip.background = element_blank(),
axis.text.x = element_text(angle = 90, vjust = 0.5),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
panel.background = element_blank(),
panel.grid.major.x = element_line(linetype = "dashed",
color = "gray30"),
panel.grid.minor.x = element_line(linetype = "dotted",
color = "gray80")) +
labs(title = "Epigraphic Evidence from Bithynia",
subtitle = "Spatio-temporal distribution",
caption = attributes(inscriptions)\$source)

# How Does Datplot and Kernel Density Estimation Perform Compared to Histograms?

In order to compare this output to generally more common visualizations, we can prepare a histogram of the original data. Here we have two possible approaches. We first prepare a two-part histogram for the upper and lower boundaries of the respective dating for each object. We select only three locations to keep the visualization short:

inscr_clean %>%
select(ID, Location, DAT_min, DAT_max) %>%
filter(Location %in% c("Prusias ad Hypium", "Nicomedia", "Apamea")) %>%
reshape2::melt(id.vars = c("ID", "Location")) %>%
ggplot(aes(x = value, fill = variable)) +
geom_histogram(binwidth = 25, position = "dodge") +
facet_wrap(. ~ Location, ncol = 1) +
labs(title = "Distribution of Dated Inscriptions in Bithynia (selection)",
x = "Dating", y = "Number of Inscriptions") +
theme(legend.position = "top")
#> Warning: attributes are not identical across measure variables; they will be
#> dropped

Alternatively, we could display the mean of the Dating range:

inscr_clean %>%
transmute(ID, Location, Language, DAT_mean = ((DAT_min + DAT_max)/2)) %>%
filter(Location %in% c("Prusias ad Hypium", "Nicomedia", "Apamea")) %>%
reshape2::melt(id.vars = c("ID", "Location", "Language")) %>%
ggplot(aes(x = value, fill = Language)) +
geom_histogram(binwidth = 25) +
facet_wrap(. ~ Location, ncol = 1) +
labs(title = "Distribution of Dated Inscriptions in Bithynia (selection)",
x = "Dating", y = "Number of Inscriptions")

While this also gives us an impression of the distribution, it seems some information got lost on the way. Many inscriptions were dated to large time spans, and those have now been gathered into their mean value. The upside is that the histograms show the real count of inscriptions, which density graphs cannot.

We have shown in this vignette, how a data set as it is often encountered by archaeologists can be prepared and formatted in order to be processed by datplot. Furthermore, we have made some suggestions on visualization. This vignette hopes to help researchers that are not yet very familiar with the processes of data cleaning to solutions for their own projects.

# References

Weissova, Barbora. 2019. “Regional Economy, Settlement Patterns and the Road System in Bithynia (4th Century BC - 6th Century AD). Spatial and Quantitative Analysis.” Dissertation, Berlin: Freie Universität Berlin. https://refubium.fu-berlin.de/handle/fub188/23730.
Weissova, Barbora, and Lisa Steinmann. n.d. “Datplot: A New r-Package for the Visualization of Date Ranges in Archaeology.” Advances in Archaeological Practice tba: tba.

1. For information and guidance on regular expressions see e.g. https://regex101.com/. However, R has a slightly different handling of escape characters. Information can be found in the R-Documentation under ?regex or here.↩︎

2. Using different stepsizes is a bit of an artifact left over from a less efficient version of datplot, where is processing would take more than a minute if stepsize was set to 1, so I would usually start out with 100 or 25 just to look for errors and problems and to not lose much time. This, however, is technically not necessary in any way. The function now works faster, though there is a lot of room for improvement, as it may still take several seconds or longer, but generally not several minutes. We determined that the feature might still be useful, and therefore it was not removed.↩︎