Linking survey data with SGICs (Subject Generated Identification-Codes)? Awesome! Just remember, you need to validate those IDs. That’s how you get clean data and make sure the link-up goes smoothly.
This vignette shows you:
How to perform plausibility checks on different SGIC components.
How to perform plausibility checks on non-SGIC variables that may serve as additional identifiers.
How to detect duplicate cases using a combination of variables as unique identifiers.
To check the plausibility of ID-related variables in a dataset, trustmebro provides several functions beginning with the prefix inspect. Every inspect function returns a logical value (TRUE or FALSE) indicating whether a value has passed or failed the plausibility check.
We'll start by loading trustmebro and dplyr:
library(trustmebro)
library(dplyr)
Data: sailor_students
The survey data we use is the trustmebro::sailor_students dataset. It contains fictional student assessment data from students of the Sailor Moon universe.
sailor_students
#> # A tibble: 12 × 6
#>    sgic        school class  gender  testscore_langauge testscore_calculus
#>    <chr>       <chr>  <chr>  <chr>                <dbl>              <dbl>
#>  1 "MUC__0308" 54321  "3-B " "Male"                 425                394
#>  2 "HÄT 2701"  22345  "2-A"  "???"                 4596                123
#>  3 "MUK3801"   22345  " 2-B" "Femal…               2456               9485
#>  4 "SAM10"     22345  "3-B"  "Femal…               2345                  3
#>  5 "T0601"     65432  "1-C"  "Femal…               1234                 NA
#>  6 " UIT3006 " 12345  "3-3"  NA                     123                394
#>  7 "@@@@@@"    NA     "3_2 " "Femal…                 56               2938
#>  8 NA          12345  "3@41" " Fe…                  986               3948
#>  9 " "         unkown NA     "Femal…                284                205
#> 10 "MOA2210"   12345  " "    "Femal…                105                 21
#> 11 "MUK3801"   22345  "2-B"  "Femal…               9586                934
#> 12 "T0601"     65432  "1-C"  "Femal…                 NA                764
SGIC Plausibility
The variable sgic stores SGICs created by students. Each SGIC is a seven-character string created according to the following instructions:
Characters 1-3 (letters):
First letter of given name (1st character)
Last letter of given name (2nd character)
First letter of family name (3rd character)
Characters 4-7 (digits):
Day of birth (4th and 5th characters)
Month of birth (6th and 7th characters)
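As a worked example of the instructions above, the following sketch assembles an SGIC from a name and a birth date. The helper make_sgic is hypothetical and not part of trustmebro; it only illustrates the coding scheme, assuming upper-case letters and zero-padded day and month.

```r
# Hypothetical helper (not part of trustmebro): builds an SGIC
# from a given name, a family name, and the day and month of birth.
make_sgic <- function(given_name, family_name, birth_day, birth_month) {
  given <- toupper(given_name)
  family <- toupper(family_name)

  # Characters 1-3: first and last letter of the given name,
  # then the first letter of the family name
  letters_part <- paste0(
    substr(given, 1, 1),
    substr(given, nchar(given), nchar(given)),
    substr(family, 1, 1)
  )

  # Characters 4-7: zero-padded day and month of birth
  digits_part <- sprintf("%02d%02d", as.integer(birth_day), as.integer(birth_month))

  paste0(letters_part, digits_part)
}

make_sgic("Usagi", "Tsukino", 30, 6)
#> [1] "UIT3006"
```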
Check Character IDs
We can use trustmebro::inspect_characterid to check if the provided SGICs adhere to the expected pattern of three letters followed by four digits. The expected structure can be defined using the regular expression "^[A-Za-z]{3}[0-9]{4}$", which we can then pass to the function using the pattern argument. For seamless integration into your data workflow, this function can be conveniently combined with dplyr::mutate:
sailor_students %>%
  mutate(structure_check =
           inspect_characterid(
             sgic, pattern = "^[A-Za-z]{3}[0-9]{4}$")) %>%
  select(sgic, structure_check)
#> # A tibble: 12 × 2
#> sgic structure_check
#> <chr> <lgl>
#> 1 "MUC__0308" FALSE
#> 2 "HÄT 2701" FALSE
#> 3 "MUK3801" TRUE
#> 4 "SAM10" FALSE
#> 5 "T0601" FALSE
#> 6 " UIT3006 " FALSE
#> 7 "@@@@@@" FALSE
#> 8 NA FALSE
#> 9 " " FALSE
#> 10 "MOA2210" TRUE
#> 11 "MUK3801" TRUE
#> 12 "T0601" FALSE
We created trustmebro::inspect_characterid with SGICs in mind, but any other string can be checked against a regular expression of your choice in the same way.
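To see how the pattern behaves on its own, here is a minimal base R sketch using grepl on a few of the SGICs from the dataset. It is not the package's implementation, only an illustration of the regular expression; the vector ids and the trimming step are assumptions added for the example.

```r
# Same pattern as above: three letters followed by four digits
pattern <- "^[A-Za-z]{3}[0-9]{4}$"
ids <- c("MUK3801", "SAM10", " UIT3006 ", "@@@@@@", NA)

# grepl() returns FALSE for NA, so missing IDs fail the check
structure_ok <- grepl(pattern, ids)
structure_ok
#> [1]  TRUE FALSE FALSE FALSE FALSE

# Surrounding whitespace is the only reason " UIT3006 " fails
trimmed_ok <- grepl(pattern, trimws(ids))
trimmed_ok
#> [1]  TRUE FALSE  TRUE FALSE FALSE
```

Whether to trim whitespace before or after checking is a design choice: checking the raw values first shows how many IDs were entered imperfectly.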
Check Birthdate-Components
Since the SGIC should end with a date of birth, you can verify the plausibility of this date using trustmebro::inspect_birthdaymonth. This function checks if a string contains exactly four digits representing a valid day and month of birth. As before, you can combine trustmebro::inspect_birthdaymonth with dplyr::mutate to generate a plausibility check variable:
sailor_students %>%
  mutate(birthdate_check =
           inspect_birthdaymonth(sgic)) %>%
  select(sgic, birthdate_check)
#> # A tibble: 12 × 2
#> sgic birthdate_check
#> <chr> <lgl>
#> 1 "MUC__0308" TRUE
#> 2 "HÄT 2701" TRUE
#> 3 "MUK3801" FALSE
#> 4 "SAM10" FALSE
#> 5 "T0601" TRUE
#> 6 " UIT3006 " TRUE
#> 7 "@@@@@@" FALSE
#> 8 NA FALSE
#> 9 " " FALSE
#> 10 "MOA2210" TRUE
#> 11 "MUK3801" FALSE
#> 12 "T0601" TRUE
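The logic behind such a check can be sketched in base R. The function check_day_month below is hypothetical and only approximates the behaviour described above: it keeps the digits of each string, requires exactly four of them, and compares the day against the number of days in the month. Allowing 29 February is an assumption made here because the SGIC carries no year; trustmebro may handle edge cases differently.

```r
# Hypothetical sketch (not the trustmebro implementation)
check_day_month <- function(x) {
  # Keep digits only; NA stays NA
  digits <- gsub("[^0-9]", "", x)
  has_four <- !is.na(digits) & nchar(digits) == 4

  day <- suppressWarnings(as.integer(substr(digits, 1, 2)))
  month <- suppressWarnings(as.integer(substr(digits, 3, 4)))

  # Months outside 1-12 are treated as missing
  month[!month %in% 1:12] <- NA

  # Assumption: February may have 29 days, since no year is known
  days_in_month <- c(31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
  max_day <- days_in_month[month]

  has_four & !is.na(max_day) & day >= 1 & day <= max_day
}

ids <- c("MUC__0308", "MUK3801", "SAM10", "T0601", "@@@@@@", NA)
birthdate_ok <- check_day_month(ids)
birthdate_ok
#> [1]  TRUE FALSE FALSE  TRUE FALSE FALSE
```

"MUK3801" fails because 38 is not a valid day, and "SAM10" fails because it contains only two digits.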
Some SGICs use only the day or only the month of birth. In that case, you can use trustmebro::inspect_birthday or trustmebro::inspect_birthmonth, respectively.
Non-SGIC variables’ plausibility
Besides an SGIC, other variables in a given dataset might be used to identify cases. As mentioned above, trustmebro::inspect_characterid can be used for any string that should follow a specific pattern. Furthermore, this package also provides functions for checking other data types beyond strings.
Check Numbers
We can use trustmebro::inspect_numberid to check if a number matches an expected length. In our dataset, school should be a five-digit number. Combined with dplyr::mutate, we can add a plausibility variable for the school number, just as we did before:
sailor_students %>%
  mutate(school_check =
           inspect_numberid(school, 5)) %>%
  select(school, school_check)
#> # A tibble: 12 × 2
#> school school_check
#> <chr> <lgl>
#> 1 54321 TRUE
#> 2 22345 TRUE
#> 3 22345 TRUE
#> 4 22345 TRUE
#> 5 65432 TRUE
#> 6 12345 TRUE
#> 7 NA FALSE
#> 8 12345 TRUE
#> 9 unkown FALSE
#> 10 12345 TRUE
#> 11 22345 TRUE
#> 12 65432 TRUE
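A comparable check can be sketched in base R with a regular expression that accepts only a fixed number of digits. The helper has_n_digits is hypothetical and is not how trustmebro necessarily implements inspect_numberid; it only illustrates the idea on a few values from the school column.

```r
# Hypothetical helper (not part of trustmebro): TRUE if a value
# consists of exactly n digits and nothing else.
has_n_digits <- function(x, n) {
  grepl(paste0("^[0-9]{", n, "}$"), as.character(x))
}

schools <- c("54321", NA, "unkown", "1234")
school_ok <- has_n_digits(schools, 5)
school_ok
#> [1]  TRUE FALSE FALSE FALSE
```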
Check the presence of a value within the recode map
When non-SGIC variables are used as identifiers, categorical data is often recoded to ensure consistency within a workflow. We can use trustmebro::inspect_valinvec to check if a value exists in a recode map. The recode map should be a named vector, where the names represent the keys. In our dataset, we want to inspect if all values in gender conform to this recode map:
recode_gender <- c(Male = "M", Female = "F")
The function checks if a value is present as a key. Combine it with dplyr::mutate to add a variable that contains the check results:
sailor_students %>%
  mutate(gender_check =
           inspect_valinvec(gender, recode_gender)) %>%
  select(gender, gender_check)
#> # A tibble: 12 × 2
#> gender gender_check
#> <chr> <lgl>
#> 1 "Male" TRUE
#> 2 "???" FALSE
#> 3 "Female" TRUE
#> 4 "Female " FALSE
#> 5 "Female" TRUE
#> 6 NA FALSE
#> 7 "Female" TRUE
#> 8 " Female" FALSE
#> 9 "Female" TRUE
#> 10 "Female" TRUE
#> 11 "Female" TRUE
#> 12 "Female" TRUE
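The idea of checking values against the keys of a recode map can be sketched in base R with %in% and names(). This is an illustration rather than the package's implementation; the vector gender below is a small made-up sample of the kinds of values found in the dataset.

```r
recode_gender <- c(Male = "M", Female = "F")
gender <- c("Male", "Female ", " Female", "???", NA)

# A value passes only if it exactly matches one of the keys
gender_ok <- gender %in% names(recode_gender)
gender_ok
#> [1]  TRUE FALSE FALSE FALSE FALSE

# Recoding by name lookup: values that are not keys become NA
recoded <- unname(recode_gender[gender])
```

This shows why the check is worth running before recoding: entries such as "Female " with a trailing space would otherwise silently turn into missing values.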
Identify Duplicate Cases
So far, we've checked if sgic, school, and gender contain plausible values. Finally, we want to ensure that these variables, when used together as identifiers, uniquely identify a single case, with no duplicate entries. trustmebro::find_dupes checks whether the combination of identifiers is unique and adds a has_dupes variable to the dataset. To find duplicates in your data, use it like this:
sailor_students %>%
  find_dupes(school, sgic, gender) %>%
  select(school, sgic, gender, has_dupes)
#> # A tibble: 12 × 4
#>    school sgic        gender    has_dupes
#>    <chr>  <chr>       <chr>     <lgl>
#>  1 54321  "MUC__0308" "Male"    FALSE
#>  2 22345  "HÄT 2701"  "???"     FALSE
#>  3 22345  "MUK3801"   "Female"  TRUE
#>  4 22345  "SAM10"     "Female " FALSE
#>  5 65432  "T0601"     "Female"  TRUE
#>  6 12345  " UIT3006 " NA        FALSE
#>  7 NA     "@@@@@@"    "Female"  FALSE
#>  8 12345  NA          " Female" FALSE
#>  9 unkown " "         "Female"  FALSE
#> 10 12345  "MOA2210"   "Female"  FALSE
#> 11 22345  "MUK3801"   "Female"  TRUE
#> 12 65432  "T0601"     "Female"  TRUE
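For comparison, the same kind of flag can be sketched in base R with duplicated(). This is not the trustmebro implementation; the small data frame students is made up for the example and reuses a few rows from the dataset.

```r
# Made-up sample reusing some identifier combinations from above
students <- data.frame(
  school = c("22345", "22345", "65432", "22345"),
  sgic = c("MUK3801", "SAM10", "T0601", "MUK3801"),
  gender = c("Female", "Female", "Female", "Female"),
  stringsAsFactors = FALSE
)

keys <- students[c("school", "sgic", "gender")]

# duplicated() alone flags only the later occurrence, so we also
# scan from the end to flag every row that shares its key
students$has_dupes <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
students$has_dupes
#> [1]  TRUE FALSE FALSE  TRUE
```

Flagging all rows of a duplicated combination, rather than only the repeats, makes it easier to inspect the affected cases side by side before deciding which one to keep.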