After registering with CHAMPS and downloading the data as described in Downloading CHAMPS Data, you can read the data into a format suitable for analysis in R using the `load_data()` function.

`load_data()` is called with the location of the folder containing the de-identified dataset csv files. It reads in each csv file and does some minimal processing to make data handling and analysis easier in R, while keeping the variables as true to the downloaded files as possible. It also provides alternate data structures for some of the tables that make certain analyses easier.
Suppose we have downloaded the csv files into the directory `~/Downloads/CHAMPS_de_identified_data_2020-08-01`. We can load these data and assign them to a variable called `d` (you can choose whatever variable name you wish) with the following:
    library(champs)

    d <- load_data("~/Downloads/CHAMPS_de_identified_data_2020-08-01")
    #> Reading CHAMPS_dataset_version.csv...
    #> Reading CHAMPS_deid_basic_demographics.csv...
    #> Reading CHAMPS_deid_decode_results.csv...
    #> Reading CHAMPS_deid_lab_results.csv...
    #> Reading CHAMPS_deid_tac_results.csv...
    #> Reading CHAMPS_deid_verbal_autopsy.csv...
    #> Reading CHAMPS_ICD_Mappings.csv...
    #> Reading CHAMPS_icd10_descriptions.csv...
    #> Reading CHAMPS_vocabulary.csv...
    #> CHAMPS De-Identified Dataset v4.1 (2020-08-01)
As output, `load_data()` returns a list of data frames, one for each of the input tables, along with long versions of the DeCoDe and TAC tables.
    d
    #> List of 11
    #>  $ version : tibble [1 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
    #>  $ dmg     : tibble [1,476 × 39] (S3: tbl_df/tbl/data.frame)
    #>  $ dcd     : tibble [1,476 × 66] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
    #>  $ lab     : tibble [1,474 × 185] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
    #>  $ tac     : tibble [1,473 × 325] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
    #>  $ va      : tibble [1,389 × 456] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
    #>  $ icd_map : tibble [649 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
    #>  $ icd_desc: tibble [14,337 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
    #>  $ vocab   : tibble [2,306 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
    #>  $ tac_long: tibble [477,252 × 6] (S3: tbl_df/tbl/data.frame)
    #>  $ dcd_long: tibble [44,280 × 5] (S3: tbl_df/tbl/data.frame)
    #>  - attr(*, "class")= chr [1:2] "champs_data" "list"
These data frames map to the input csv files or to each other as follows:

- `version`: represents the data in CHAMPS_dataset_version.csv
- `dmg`: represents the data in CHAMPS_deid_basic_demographics.csv
- `dcd`: represents the data in CHAMPS_deid_decode_results.csv
- `lab`: represents the data in CHAMPS_deid_lab_results.csv
- `tac`: represents the data in CHAMPS_deid_tac_results.csv
- `va`: represents the data in CHAMPS_deid_verbal_autopsy.csv
- `icd_map`: represents the data in CHAMPS_ICD_Mappings.csv
- `icd_desc`: represents the data in CHAMPS_icd10_descriptions.csv
- `vocab`: represents the data in CHAMPS_vocabulary.csv
- `tac_long`: a long-form representation of the data in `tac`
- `dcd_long`: a long-form representation of the data in `dcd`

The row and column counts may differ depending on the version of the dataset you are working with.
You can access each of these tables for further inspection or use in specific analyses with `d$dmg`, `d$dcd`, etc.
The object containing all of the tables, which here we named `d`, will be used as the primary argument to all of the computation functions, as described in this article.
We highly recommend that users new to CHAMPS data thoroughly read the dataset description that comes with the downloaded data to better understand what these different tables mean. We also recommend reading the Intro to CHAMPS data article.
`load_data()` details

The remainder of this article provides details about all of the processing and transformations that occur when calling `load_data()`.
At a high level, `load_data()` makes the following changes to the data as it reads in the csv files:

- Variable names are standardized to lowercase with words separated by underscores (`_`).
- Date variables are read in as character. `readr::read_csv()` defaults to reading them in as dates and has difficulty parsing them, so we have processed them as characters.
- New variables are added: `site`, `pmi_range`, `acquired24`, `acquired48`.
- Some variables are converted to factors: `age_group`, `age_group_subcat`, and `site` (see description below).

The demographic data provides ISO country site codes. We add a `site` variable built from `site_iso_code` that is a factor with levels: Bangladesh, Kenya, Mali, Mozambique, South Africa, Ethiopia, Sierra Leone.
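As a rough illustration of how a site factor like this could be built, here is a minimal base R sketch. It assumes, purely for illustration, that `site_iso_code` holds two-letter ISO 3166-1 codes; the lookup vector `site_lookup` and the toy input are hypothetical and are not taken from the package source.

```r
# Hypothetical lookup from two-letter ISO country codes to site names.
# Assumption: the real site_iso_code format may differ from this.
site_lookup <- c(
  BD = "Bangladesh", KE = "Kenya", ML = "Mali", MZ = "Mozambique",
  ZA = "South Africa", ET = "Ethiopia", SL = "Sierra Leone"
)

# Toy input standing in for d$dmg$site_iso_code
site_iso_code <- c("KE", "ZA", "BD", "KE")

# Map codes to names and fix the level order to match the article
site <- factor(
  unname(site_lookup[site_iso_code]),
  levels = unname(site_lookup)
)

levels(site)
table(site)
```

Fixing the levels explicitly keeps sites with no cases in a given subset visible in tables and plots.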
We also add a categorical postmortem interval range (`pmi_range`) variable based on the `calc_postmortem_hrs` variable. `pmi_range` has nine levels: “0 to 3”, “4 to 6”, “7 to 9”, “10 to 12”, “13 to 15”, “16 to 18”, “19 to 21”, “22 to 24”, “Over 24h”. It represents the time between death and the MITS procedure and collection of specimens.
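A binning of this kind can be sketched with base R's `cut()`. The exact boundary handling used by `load_data()` is not stated in this article, so the sketch below assumes right-closed 3-hour intervals (for example, exactly 3 hours falls in “0 to 3”); the input vector is toy data.

```r
# Toy postmortem intervals in hours (stand-in for d$dmg$calc_postmortem_hrs)
calc_postmortem_hrs <- c(0, 2.5, 3, 5, 12, 23.9, 24, 30, NA)

pmi_labels <- c(
  "0 to 3", "4 to 6", "7 to 9", "10 to 12", "13 to 15",
  "16 to 18", "19 to 21", "22 to 24", "Over 24h"
)

# Assumption: intervals are [0,3], (3,6], ..., (21,24], (24,Inf)
pmi_range <- cut(
  calc_postmortem_hrs,
  breaks = c(seq(0, 24, by = 3), Inf),
  labels = pmi_labels,
  include.lowest = TRUE,
  right = TRUE
)

table(pmi_range, useNA = "ifany")
```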
Finally, we add two new variables, `acquired24` and `acquired48`, that indicate whether the case's pathogens were acquired in the community or during a hospital stay, depending on the length of hospitalization before death. If the location of death was in the community, or the death was a stillbirth or occurred in the first 24 hours of life, or the hospital stay was less than 24 hours, then `acquired24 = "Community"`. Likewise, if the location of death was in the community, or the death was a stillbirth or occurred in the first 24 hours of life, or the hospital stay was less than 48 hours, then `acquired48 = "Community"`. All other values in both variables are marked “Facility”. A limitation of these variables is that they only consider a hospitalization immediately preceding death and do not capture hospitalizations with a discharge prior to death.
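The rule above can be expressed as a small helper. This is a sketch only: the function `classify_acquired()` and the input columns `death_in_community` and `hosp_hours` are hypothetical names introduced for illustration, and the actual variables that `load_data()` uses to derive these values are not described here.

```r
# Hypothetical helper implementing the rule described above.
# cutoff_hours is 24 for acquired24 and 48 for acquired48.
classify_acquired <- function(death_in_community, age_group, hosp_hours, cutoff_hours) {
  early_death <- age_group %in% c("Stillbirth", "Death in the first 24 hours")
  # Treat a missing hospital stay as "not a short stay"
  short_stay <- !is.na(hosp_hours) & hosp_hours < cutoff_hours
  ifelse(death_in_community | early_death | short_stay, "Community", "Facility")
}

# Toy cases (column names are assumptions, not CHAMPS variable names)
cases <- data.frame(
  death_in_community = c(TRUE, FALSE, FALSE, FALSE, FALSE),
  age_group = c(
    "Infant (28 days to less than 12 months)",
    "Stillbirth",
    "Late Neonate (7 to 27 days)",
    "Late Neonate (7 to 27 days)",
    "Child (12 months to less than 60 Months)"
  ),
  hosp_hours = c(NA, NA, 10, 30, 72),
  stringsAsFactors = FALSE
)

cases$acquired24 <- classify_acquired(
  cases$death_in_community, cases$age_group, cases$hosp_hours, 24
)
cases$acquired48 <- classify_acquired(
  cases$death_in_community, cases$age_group, cases$hosp_hours, 48
)

cases[, c("hosp_hours", "acquired24", "acquired48")]
```

The fourth toy case (a 30-hour stay) shows where the two variables disagree: it is “Facility” under the 24-hour cutoff but “Community” under the 48-hour cutoff.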
The demographic data has two age group variables: `age_group` and `age_group_subcat`. The `age_group` variable is converted to an ordered factor with levels following the natural order of age:
    levels(d$dmg$age_group)
    #> [1] "Stillbirth"
    #> [2] "Death in the first 24 hours"
    #> [3] "Early Neonate (1 to 6 days)"
    #> [4] "Late Neonate (7 to 27 days)"
    #> [5] "Infant (28 days to less than 12 months)"
    #> [6] "Child (12 months to less than 60 Months)"
The `age_group_subcat` variable provides a further partitioning of the “Infant” and “Early Neonate” categories found in `age_group`.

In the raw data, there are four values recorded for `age_group_subcat`, which we converted to descriptions similar to those used in `age_group`.
Original | New |
---|---|
infant_>6M | Infant (6 months to less than 12 months) |
infant_6M | Infant (28 days to less than 6 months) |
neonate_72h | Early Neonate (24-72 hours) |
neonate_>72h | Early Neonate (72+hrs to 6 days) |
All other values of `age_group_subcat` are missing in the raw data. We replace these missing values with the corresponding values provided in `age_group`. This gives us an updated `age_group_subcat` variable with the following levels:
    levels(d$dmg$age_group_subcat)
    #> [1] "Stillbirth"
    #> [2] "Death in the first 24 hours"
    #> [3] "Early Neonate (24-72 hours)"
    #> [4] "Early Neonate (72+hrs to 6 days)"
    #> [5] "Late Neonate (7 to 27 days)"
    #> [6] "Infant (28 days to less than 6 months)"
    #> [7] "Infant (6 months to less than 12 months)"
    #> [8] "Child (12 months to less than 60 Months)"
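The recode-then-fill step can be sketched in base R as follows. The mapping is copied from the table above; the toy vectors stand in for the raw columns, and this is an illustration of the idea rather than the package's actual implementation.

```r
# Mapping from the raw subcategory codes to the new descriptions
subcat_recode <- c(
  "infant_>6M"   = "Infant (6 months to less than 12 months)",
  "infant_6M"    = "Infant (28 days to less than 6 months)",
  "neonate_72h"  = "Early Neonate (24-72 hours)",
  "neonate_>72h" = "Early Neonate (72+hrs to 6 days)"
)

subcat_levels <- c(
  "Stillbirth", "Death in the first 24 hours",
  "Early Neonate (24-72 hours)", "Early Neonate (72+hrs to 6 days)",
  "Late Neonate (7 to 27 days)",
  "Infant (28 days to less than 6 months)",
  "Infant (6 months to less than 12 months)",
  "Child (12 months to less than 60 Months)"
)

# Toy data standing in for the raw age_group and age_group_subcat columns
age_group <- c(
  "Stillbirth", "Early Neonate (1 to 6 days)",
  "Infant (28 days to less than 12 months)", "Late Neonate (7 to 27 days)"
)
raw_subcat <- c(NA, "neonate_72h", "infant_>6M", NA)

# Step 1: recode the four raw values (missing values stay NA)
age_group_subcat <- unname(subcat_recode[raw_subcat])

# Step 2: fill the remaining missing values from age_group
missing <- is.na(age_group_subcat)
age_group_subcat[missing] <- age_group[missing]

# Step 3: convert to a factor with levels in age order
age_group_subcat <- factor(age_group_subcat, levels = subcat_levels)
age_group_subcat
```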
The `tac_long` and `dcd_long` data frames are long-format versions of the wide-format `tac` and `dcd` data frames, with additional description columns added to clarify the coded column names.
The `tac` data frame provided in the raw data contains one row per case and one column per TAC assay result. There are 325 columns with names such as `bld_abau_1`, `bld_adev_1`, etc., indicating the specimen type and assay.
    d$tac
    #> # A tibble: 1,473 x 325
    #>   champs_deid bld_abau_1 bld_adev_1 bld_bart_1 bld_bruc_1 bld_bups_1 bld_caal_1
    #>   <chr>       <chr>      <chr>      <chr>      <chr>      <chr>      <chr>
    #> 1 AB032461-9… Negative   Negative   Negative   Negative   Negative   Negative
    #> 2 46B7C6E7-2… Negative   Negative   Negative   Negative   Negative   Negative
    #> 3 3791BBA8-4… Negative   Negative   Negative   Negative   Negative   Positive
    #> # … with 1,470 more rows, and 318 more variables: bld_cchf_1 <chr>,
An alternate long-format representation of this data can be more convenient for various analyses. In the long format, we have one row for every case and TAC assay combination and columns providing information about the TAC assay, namely its `specimen_type`, `target`, `pathogen`, and the `result`.
    d$tac_long
    #> # A tibble: 477,252 x 6
    #>   champs_deid           name    result  specimen_type target pathogen
    #>   <chr>                 <chr>   <chr>   <chr>         <chr>  <chr>
    #> 1 AB032461-9D11-4391-A… bld_ab… Negati… Whole blood   ABAU_1 Acinetobacter baumann…
    #> 2 AB032461-9D11-4391-A… bld_ad… Negati… Whole blood   ADEV_1 Adenovirus
    #> 3 AB032461-9D11-4391-A… bld_ba… Negati… Whole blood   BART_1 Bartonella spp.
    #> # … with 477,249 more rows
Note that `specimen_type` is derived from the text preceding the first "_" in the original variable names. The `pathogen` value comes from a lookup table in the dataset description PDF file that comes with the data download, which is also available in this R package as the dataset `tac_table`.
    tac_table
    #> # A tibble: 128 x 3
    #>   code   assay                   gene_target
    #>   <chr>  <chr>                   <chr>
    #> 1 ABAU_1 Acinetobacter baumannii OXA-51
    #> 2 ADEV_1 Adenovirus              hexon
    #> 3 AD4X_1 Adenovirus 40/41        fiber protein
    #> # … with 125 more rows
The `pathogen` variable is often much more useful for searching for specific TAC results, as you can search by a pathogen name (e.g. “Acinetobacter baumannii”) rather than parsing out the TAC assay code (e.g. “ABAU_1”).
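To make the wide-to-long transformation concrete, here is a minimal base R sketch on toy data. It is not the package's implementation: the two-row `tac` table, the `specimen_lookup` vector (only the `bld` to “Whole blood” pair appears in the output above), and the two-entry `tac_lookup` taken from the first rows of `tac_table` are all stand-ins for illustration.

```r
# Toy wide table: one row per case, one column per assay result
tac <- data.frame(
  champs_deid = c("case-1", "case-2"),
  bld_abau_1  = c("Negative", "Positive"),
  bld_adev_1  = c("Negative", "Negative"),
  stringsAsFactors = FALSE
)

# Illustrative lookups (assumed subsets, not the full tables)
specimen_lookup <- c(bld = "Whole blood")
tac_lookup <- c(ABAU_1 = "Acinetobacter baumannii", ADEV_1 = "Adenovirus")

# Stack each assay column into (champs_deid, name, result) rows
assay_cols <- setdiff(names(tac), "champs_deid")
tac_long <- do.call(rbind, lapply(assay_cols, function(col) {
  data.frame(
    champs_deid = tac$champs_deid,
    name = col,
    result = tac[[col]],
    stringsAsFactors = FALSE
  )
}))

# specimen_type: text before the first "_"; target: the rest, upper-cased
prefix <- sub("_.*$", "", tac_long$name)
tac_long$specimen_type <- unname(specimen_lookup[prefix])
tac_long$target <- toupper(sub("^[^_]*_", "", tac_long$name))
tac_long$pathogen <- unname(tac_lookup[tac_long$target])
tac_long <- tac_long[order(tac_long$champs_deid, tac_long$name), ]

# Search by pathogen name instead of decoding column names
hits <- tac_long[
  tac_long$pathogen == "Acinetobacter baumannii" & tac_long$result == "Positive",
]
hits
```

With the real data, the same kind of filter can be applied directly to `d$tac_long`.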
Similar to the TAC dataset, the DeCoDe dataset in its raw form has one row per case and many columns.
    d$dcd
    #> # A tibble: 1,476 x 66
    #>   champs_deid immediate_cod ic_champs_group… underlying_cause uc_champs_group…
    #>   <chr>       <chr>         <chr>            <chr>            <chr>
    #> 1 56550401-2… P07.1         Neonatal preter… P36.1            Neonatal sepsis
    #> 2 BA74FEEB-7… P20           Perinatal asphy… P36.8            Neonatal sepsis
    #> # … with 1,474 more rows, and 61 more variables:
    #> #   main_maternal_disease_condition <chr>, morbid_condition_01 <chr>,
    #> #   morbid_cond_01_champs_group_desc <chr>, morbid_condition_02 <chr>,
    #> #   morbid_cond_02_champs_group_desc <chr>, morbid_condition_03 <chr>,
    #> #   morbid_cond_03_champs_group_desc <chr>, morbid_condition_04 <chr>,
    #> #   morbid_cond_04_champs_group_desc <chr>, morbid_condition_05 <chr>,
    #> #   morbid_cond_05_champs_group_desc <chr>, morbid_condition_06 <lgl>,
    #> #   morbid_cond_06_champs_group_desc <lgl>, morbid_condition_07 <lgl>,
    #> #   morbid_cond_07_champs_group_desc <lgl>, morbid_condition_08 <lgl>,
    #> #   morbid_cond_08_champs_group_desc <lgl>, other_maternal_condition_01 <chr>,
    #> #   other_maternal_condition_02 <chr>, other_maternal_condition_03 <chr>,
    #> #   other_maternal_condition_04 <chr>, other_significant_condition_01 <chr>,
    #> #   other_significant_condition_02 <chr>, other_significant_condition_03 <chr>,
    #> #   other_significant_condition_04 <chr>, other_significant_condition_05 <chr>,
    #> #   other_significant_condition_06 <chr>, other_significant_condition_07 <lgl>,
    #> #   other_significant_condition_08 <lgl>, other_significant_condition_09 <lgl>,
    #> #   other_significant_condition_10 <lgl>, immediate_cause_of_death_etiol1 <chr>,
    #> #   immediate_cause_of_death_etiol2 <chr>, immediate_cause_of_death_etiol3 <chr>,
    #> #   underlying_cause_factor_etiol1 <chr>, underlying_cause_factor_etiol2 <chr>,
    #> #   underlying_cause_factor_etiol3 <chr>, morbid_condition_01_etiol1 <chr>,
    #> #   morbid_condition_01_etiol2 <chr>, morbid_condition_01_etiol3 <chr>,
    #> #   morbid_condition_02_etiol1 <chr>, morbid_condition_02_etiol2 <chr>,
    #> #   morbid_condition_02_etiol3 <chr>, morbid_condition_03_etiol1 <chr>,
    #> #   morbid_condition_03_etiol2 <chr>, morbid_condition_03_etiol3 <chr>,
    #> #   morbid_condition_04_etiol1 <chr>, morbid_condition_04_etiol2 <lgl>,
    #> #   morbid_condition_04_etiol3 <lgl>, morbid_condition_05_etiol1 <chr>,
    #> #   morbid_condition_05_etiol2 <chr>, morbid_condition_05_etiol3 <chr>,
    #> #   morbid_condition_06_etiol1 <lgl>, morbid_condition_06_etiol2 <lgl>,
    #> #   morbid_condition_06_etiol3 <lgl>, morbid_condition_07_etiol1 <lgl>,
    #> #   morbid_condition_07_etiol2 <lgl>, morbid_condition_07_etiol3 <lgl>,
    #> #   morbid_condition_08_etiol1 <lgl>, morbid_condition_08_etiol2 <lgl>,
    #> #   morbid_condition_08_etiol3 <lgl>
For each case, the table provides information about the causes of death determined by the DeCoDe panel. These are available as CHAMPS group descriptions (ending with `_champs_group_desc`) and etiologies (ending with `_etiol` followed by a number), if applicable. Immediate and underlying causes of death, along with morbid conditions, are classified with ICD10 codes, CHAMPS group descriptions, and etiologies, depending on the cause. The other significant conditions are currently classified only with ICD10 codes in the dataset.
Similar to what was done with the TAC long table, a long version of the DeCoDe table can be convenient for many analyses. We pivoted the data to long format such that there is one row for each etiology (1-3) and “type” of variable (immediate, underlying, morbid_condition_0*), with the corresponding CHAMPS group description repeated across the etiology rows of each type.
Let’s take a glimpse at the data for one case in long form:
    dplyr::filter(d$dcd_long, champs_deid == d$dcd_long$champs_deid[1500])
    #> # A tibble: 30 x 5
    #>    champs_deid       champs_group_desc      type    etiol          etiol_num
    #>    <chr>             <chr>                  <chr>   <chr>              <int>
    #>  1 FF205E2D-7F81-4A… Neonatal sepsis        immedi… Escherichia c…         1
    #>  2 FF205E2D-7F81-4A… Neonatal sepsis        immedi… NA                     2
    #>  3 FF205E2D-7F81-4A… Neonatal sepsis        immedi… NA                     3
    #>  4 FF205E2D-7F81-4A… Lower respiratory inf… underl… Escherichia c…         1
    #>  5 FF205E2D-7F81-4A… Lower respiratory inf… underl… NA                     2
    #>  6 FF205E2D-7F81-4A… Lower respiratory inf… underl… NA                     3
    #>  7 FF205E2D-7F81-4A… Meningitis/Encephalit… morbid… Escherichia c…         1
    #>  8 FF205E2D-7F81-4A… Meningitis/Encephalit… morbid… NA                     2
    #>  9 FF205E2D-7F81-4A… Meningitis/Encephalit… morbid… NA                     3
    #> 10 FF205E2D-7F81-4A… NA                     morbid… NA                     1
    #> # … with 20 more rows
Here we see that this case has an immediate CHAMPS group description of “Neonatal sepsis” with an etiology of “Escherichia coli”, and an underlying CHAMPS group description of “Lower respiratory infections”. The long format provides us with a more convenient way to search for specific CHAMPS group descriptions or etiologies, in that we can search for a term in one column rather than across many columns.
Note that for a given case and type, whenever `champs_group_desc` and `etiol` are all `NA`, we omit that data.
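The benefit of searching a single column can be illustrated with a small toy table that mimics the column layout of `dcd_long`. The rows, the case IDs, and the `type` labels below are made up for illustration (the real `type` values are truncated in the output above), so treat this as a sketch of the search pattern only.

```r
# Toy long-format table with the same columns as d$dcd_long
dcd_long <- data.frame(
  champs_deid = c("case-1", "case-1", "case-1", "case-2", "case-2"),
  champs_group_desc = c(
    "Neonatal sepsis", "Neonatal sepsis", "Lower respiratory infections",
    "Neonatal sepsis", "Lower respiratory infections"
  ),
  # Assumed labels; the actual values of type may differ
  type = c("immediate", "immediate", "underlying", "underlying", "immediate"),
  etiol = c("Escherichia coli", NA, "Escherichia coli", "Klebsiella pneumoniae", NA),
  etiol_num = c(1L, 2L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Cases with a given etiology anywhere in the causal chain
is_ecoli <- !is.na(dcd_long$etiol) & dcd_long$etiol == "Escherichia coli"
ecoli_cases <- unique(dcd_long$champs_deid[is_ecoli])

# Cases with a given CHAMPS group description under any type
is_sepsis <- !is.na(dcd_long$champs_group_desc) &
  dcd_long$champs_group_desc == "Neonatal sepsis"
sepsis_cases <- unique(dcd_long$champs_deid[is_sepsis])

ecoli_cases
sepsis_cases
```

In the wide `dcd` table the same question would require checking every `_etiol` or `_champs_group_desc` column in turn.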
This package leverages many of the paradigms and packages of the Tidyverse. One of the most apparent is the use of tibbles, which the Tidyverse documentation describes as follows:
A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles also have an enhanced print() method, which makes them easier to use with large datasets containing complex objects.
If you are uncomfortable with the tibble print format, you can use the following code to convert all of the objects to plain data frames.
d <- lapply(d, data.frame)
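One thing to be aware of is that `lapply()` returns a plain list, so the `"champs_data"` class shown in the structure output earlier is dropped by the conversion. Whether the package's computation functions depend on that class is not stated here, so the following is only a precautionary sketch on a toy object; assigning into `d[]` is standard base R behavior for keeping an object's attributes.

```r
# Toy stand-in for the object returned by load_data()
d <- structure(
  list(dmg = data.frame(x = 1:2), dcd = data.frame(y = 3:4)),
  class = c("champs_data", "list")
)

# lapply() alone returns a plain list: the class attribute is lost
plain <- lapply(d, data.frame)
class(plain)

# Assigning into d[] replaces the elements but keeps the attributes
kept <- d
kept[] <- lapply(d, data.frame)
class(kept)
```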