errorlocate uses validation rules from the package validate to locate faulty values in observations (or, in database slang, erroneous fields in records).
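It follows the Fellegi-Holt principle: check whether a record satisfies the supplied validation rules and, if it does not, find the smallest (weighted) set of values that must be changed so that the record can satisfy all rules.
errorlocate does this by translating the problem into a mixed integer problem (see the other vignettes) and solving that problem.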
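To make the principle concrete, the following is a brute-force sketch in base R. It is not how errorlocate works internally (the package solves a mixed integer problem); the helper names `is_valid`, `candidates` and `minimal_error_set` are hypothetical, and feasibility is only tested against a small, assumed grid of replacement values. The rules and the record are the ones used in the example later in this vignette.

```r
# Rules written as a plain R predicate on one record
# (illustrative only, not part of the errorlocate API)
is_valid <- function(rec) {
  rec$age > 0 && (!(rec$income > 0) || rec$age > 16)
}

# Candidate replacement values used to test whether a record can be repaired
# (assumption: this coarse grid is rich enough for these toy rules)
candidates <- list(age = c(1, 18, 40), income = c(0, 1000))

minimal_error_set <- function(rec, weight) {
  vars <- names(weight)
  # Enumerate all subsets of variables, starting with the empty set
  subsets <- list(character(0))
  for (k in seq_along(vars)) {
    subsets <- c(subsets, combn(vars, k, simplify = FALSE))
  }
  # Total weight of each subset; cheaper subsets are tried first
  cost <- vapply(subsets, function(s) sum(weight[s]), numeric(1))
  for (s in subsets[order(cost)]) {
    if (length(s) == 0) {
      if (is_valid(rec)) return(s)   # record is already valid
      next
    }
    grid <- expand.grid(candidates[s])
    for (i in seq_len(nrow(grid))) {
      trial <- rec
      trial[s] <- as.list(grid[i, , drop = FALSE])
      if (is_valid(trial)) return(s) # changing only `s` can repair the record
    }
  }
  NULL
}

record2 <- list(age = 15, income = 2000)
equal_w   <- minimal_error_set(record2, c(age = 1, income = 1))
trust_age <- minimal_error_set(record2, c(age = 2, income = 1))
```

With equal weights both `age` and `income` are minimal solutions for this record; the sketch simply returns the first one it tries, whereas a solver may pick either. Giving `age` a higher weight makes `income` the unique cheapest choice.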
errorlocate has two main functions:
locate_errors for detecting errors
replace_errors for replacing faulty values with NA
Let’s start with a simple example:
We have a rule that age cannot be negative:

    library(validate)
    library(errorlocate)
    rules <- validator(age > 0)
And we have the following data set:

    "age, income
    -10, 0
    15, 2000
    25, 3000
    NA, 1000
    " -> csv
    d <- read.csv(textConnection(csv), strip.white = TRUE)

    le <- locate_errors(d, rules)
    summary(le)
    #> Variable:
    #>     name errors missing
    #> 1    age      1       1
    #> 2 income      0       0
    #> Errors per record:
    #>   errors records
    #> 1      0       3
    #> 2      1       1
summary(le) gives an overview of the errors found in this data set. The complete error listing can be found with:
    le$errors
    #>        age income
    #> [1,]  TRUE  FALSE
    #> [2,] FALSE  FALSE
    #> [3,] FALSE  FALSE
    #> [4,]    NA  FALSE
Which says that record 1 has a faulty value for age.
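The error matrix can also be queried with ordinary base R. The sketch below uses a hand-typed stand-in for `le$errors`, copied from the output printed above, so that it runs without the package; with a real result one would use `le$errors` directly.

```r
# Stand-in for le$errors, copied from the printed output above
errors <- matrix(
  c(TRUE, FALSE, FALSE, NA,       # age
    FALSE, FALSE, FALSE, FALSE),  # income
  ncol = 2,
  dimnames = list(NULL, c("age", "income"))
)

# Row/column positions of the flagged cells; which() skips NA entries
flagged <- which(errors, arr.ind = TRUE)

# Number of flagged values per record, ignoring missing values
n_flagged <- rowSums(errors, na.rm = TRUE)

# Name of the variable flagged in each located cell
flagged_vars <- colnames(errors)[flagged[, "col"]]
```

Note that the `NA` for record 4 marks a missing value, not a located error, which is why it is excluded from the counts.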
Suppose we expand our rules
    rules <- validator(
      r1 = age > 0,
      r2 = if (income > 0) age > 16
    )
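With validate::confront we can see that rule r2 is violated (record 2).

    summary(confront(d, rules))

| name | items | passes | fails | nNA | error | warning | expression |
|------|-------|--------|-------|-----|-------|---------|------------|
| r1 | 4 | 2 | 1 | 1 | FALSE | FALSE | age > 0 |
| r2 | 4 | 2 | 1 | 1 | FALSE | FALSE | !(income > 0) \| (age > 16) |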
What errors will be found by locate_errors?

    set.seed(1)
    le <- locate_errors(d, rules)
    le$errors
    #>        age income
    #> [1,]  TRUE  FALSE
    #> [2,]  TRUE  FALSE
    #> [3,] FALSE  FALSE
    #> [4,]    NA  FALSE
It now detects that age in observation 2 is also faulty, since it violates the second rule. Note that we use set.seed. This is needed because in this example either age or income can be considered faulty: changing either one is enough to satisfy r2, so the two solutions are equally good. set.seed ensures that the procedure is reproducible.
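With replace_errors we can remove the errors (the resulting gaps still need to be imputed).

    d_fixed <- replace_errors(d, le)
    summary(confront(d_fixed, rules))

| name | items | passes | fails | nNA | error | warning | expression |
|------|-------|--------|-------|-----|-------|---------|------------|
| r1 | 4 | 1 | 0 | 3 | FALSE | FALSE | age > 0 |
| r2 | 4 | 2 | 0 | 2 | FALSE | FALSE | !(income > 0) \| (age > 16) |

No rule fails any more, because replace_errors sets all faulty values to NA.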
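What the replacement amounts to for this example can be sketched in base R. This is only an illustration of the effect, not the package's implementation; the `errors` matrix is a hand-typed stand-in for `le$errors` from the two-rule example, and the data frame repeats the example data.

```r
# Example data and a stand-in for le$errors from the two-rule example
d <- data.frame(age = c(-10, 15, 25, NA), income = c(0, 2000, 3000, 1000))
errors <- matrix(
  c(TRUE, TRUE, FALSE, NA,        # age
    FALSE, FALSE, FALSE, FALSE),  # income
  ncol = 2,
  dimnames = list(NULL, names(d))
)

# Cells flagged as faulty; NA in the error matrix means "value was missing"
mask <- !is.na(errors) & errors

# Blank out the flagged cells, leaving all other values untouched
d_fixed <- d
d_fixed[mask] <- NA
```

After this step `age` is missing in records 1, 2 and 4, which matches the three `NA` evaluations reported for r1 in the summary above.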
locate_errors allows for supplying weights for the variables. It is common that the quality of the observed variables differs. When we have more trust in age, we can give it more weight, so that locate_errors chooses income when it has to decide between the two (record 2):

    set.seed(1) # good practice, although not needed in this example
    weight <- c(age = 2, income = 1)
    le <- locate_errors(d, rules, weight)
    le$errors
    #>        age income
    #> [1,]  TRUE  FALSE
    #> [2,] FALSE   TRUE
    #> [3,] FALSE  FALSE
    #> [4,]    NA  FALSE
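For weights there are three different options: not specifying them, in which case all variables get the same weight; a named numeric vector with a weight per variable, as above; or a data.frame or matrix with the same dimensions as the data, which gives a weight per record and variable.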
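A per-record weight specification can be built by expanding a per-variable vector and then adjusting individual cells. The sketch below uses base R only; the example data frame and the adjusted value are illustrative, and passing the resulting matrix as the weight argument of locate_errors is an assumption based on the package documentation rather than something shown in this vignette.

```r
d <- data.frame(age = c(-10, 15, 25, NA), income = c(0, 2000, 3000, 1000))

# Per-variable weights, as used in the example above
weight <- c(age = 2, income = 1)

# Expand to one row of weights per record, columns in the order of the data
w <- matrix(
  weight[names(d)],
  nrow = nrow(d), ncol = ncol(d), byrow = TRUE,
  dimnames = list(NULL, names(d))
)

# Hypothetical adjustment: for record 2 we trust income more than age
w[2, "income"] <- 3
```

Each row of `w` then carries the weights for the corresponding record, so trust in a variable can differ from one record to the next.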
locate_errors solves a mixed integer problem. When the number of interactions between validation rules is large, finding an optimal solution can become computationally intensive. Both locate_errors and replace_errors have a parallelization option, Ncpus, which makes use of multiple processors. The $duration property of the result gives the time in seconds spent on finding a solution for each record. This time can be restricted using the argument timeout (also in seconds).

    # duration is in seconds.
    le$duration
    #> [1] 0.003178596 0.002701521 0.000000000 0.002396584
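As a usage sketch, the two options can be combined in one call. The values `Ncpus = 2` and `timeout = 10` are arbitrary choices for illustration, and the argument names are taken from the package documentation as I understand it; the timings depend on the machine, so no output is shown.

```r
library(validate)
library(errorlocate)

rules <- validator(
  r1 = age > 0,
  r2 = if (income > 0) age > 16
)
d <- data.frame(age = c(-10, 15, 25, NA), income = c(0, 2000, 3000, 1000))

set.seed(1)
# Use two processors and allow at most 10 seconds per record (illustrative values)
le <- locate_errors(d, rules, Ncpus = 2, timeout = 10)

# One timing per record, in seconds
le$duration
```

For a data set this small the parallel overhead will likely outweigh any gain; the option is meant for large data sets or rule sets with many interacting rules.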