In this article we illustrate in detail how to compute a common set of statistics of interest with CHAMPS data. This will help newcomers to CHAMPS data understand the nuances of the different tables and how they can be used together to answer questions of interest.
We will illustrate a computation that is provided by the calc_cc_detected_by_site_age() function. For a given condition (etiology), it compares the number of cases in which DeCoDe determined that death was due to the condition with the number of cases in which the corresponding pathogen was detected in the TAC results, broken down by site and age. This requires merging information from the demographics, DeCoDe, and TAC tables.
Before diving into the data, we assume that you have read the CHAMPS introduction and data articles.
In this example, we will compare the number of cases in which Group B Streptococcus was detected as a pathogen in the TAC results with the number of cases for which DeCoDe determined that Group B Streptococcus was in the causal chain.
For a quick refresh on the distinction between TAC and DeCoDe:
In TAC results, we are looking for detection of a pathogen in one or more specimen types. A pathogen is present if its result is “Positive”.
In DeCoDe, we are looking for detection of the condition that was in the causal chain of the cause of death. This is either provided as a CHAMPS group description or an etiology.
The distinction between counting things in DeCoDe vs. TAC can be thought of as comparing dying from a condition vs. dying with a pathogen.
To begin our calculation, we load the package and the de-identified data, which here we have downloaded to the location ~/Downloads/CHAMPS_de_identified_data.

    library(champs)
    library(dplyr)
    library(stringr)

    d <- load_data("~/Downloads/CHAMPS_de_identified_data")
We want to specify the condition and pathogen associated with Group B Streptococcus. Note that conditions and pathogens can overlap but may not always be expressed in the same way. You can call valid_conditions(d) to get a list of all conditions for which data exists (all unique values of etiologies or CHAMPS group descriptions), and you can call valid_pathogens() for a list of all valid pathogen values as detected by TAC.
Searching the results of these valid_*() functions, we see that the condition we are looking for is not coded as “Group B Streptococcus”, but as “Streptococcus agalactiae”. The pathogen we are looking for, on the other hand, is simply “Group B Streptococcus”.
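The kind of search described above can be sketched in base R. This is only an illustration: it assumes the valid_*() functions return character vectors, and the two vectors below are made-up stand-ins in which only the two Group B Streptococcus names come from this article.

```r
# Hypothetical stand-ins for the output of the valid_*() functions;
# only the two Group B Streptococcus names are taken from the article.
conditions <- c("Streptococcus agalactiae", "Streptococcus pneumoniae",
                "Klebsiella pneumoniae")
pathogens <- c("Group B Streptococcus", "Streptococcus pneumoniae",
               "Klebsiella pneumoniae")

# Case-insensitive search for several candidate spellings at once
pattern <- "group b|agalactiae"
cond_hits <- grep(pattern, conditions, ignore.case = TRUE, value = TRUE)
path_hits <- grep(pattern, pathogens, ignore.case = TRUE, value = TRUE)

cond_hits
path_hits
```

Searching for both the common name and the species name in one pattern makes it less likely that a differently coded value is missed.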
To start, let’s look at the TAC results to find cases where there is a positive result for “Group B Streptococcus”. We can use either the tac_long table or the tac table. Recall that the tac table looks like this:
    d$tac
    #> # A tibble: 1,473 x 325
    #>    champs_deid bld_abau_1 bld_adev_1 bld_bart_1 bld_bruc_1 bld_bups_1 bld_caal_1
    #>    <chr>       <chr>      <chr>      <chr>      <chr>      <chr>      <chr>
    #>  1 AB032461-9… Negative   Negative   Negative   Negative   Negative   Negative
    #>  2 46B7C6E7-2… Negative   Negative   Negative   Negative   Negative   Negative
    #>  3 3791BBA8-4… Negative   Negative   Negative   Negative   Negative   Positive
    #>  4 52662D06-B… <NA>       Negative   <NA>       <NA>       <NA>       <NA>
    #>  5 7C1C58CB-3… <NA>       Negative   <NA>       <NA>       <NA>       <NA>
    #>  6 3D6C22BB-D… Negative   Negative   Negative   Negative   Negative   Negative
    #>  7 05B7F6AD-7… Negative   Negative   Negative   Negative   Negative   Negative
    #>  8 4A520B0F-4… <NA>       Negative   <NA>       <NA>       <NA>       <NA>
    #>  9 C8F59F60-1… Negative   Negative   Negative   Negative   Negative   Negative
    #> 10 D2406FD8-D… <NA>       Negative   <NA>       <NA>       <NA>       <NA>
    #> # … with 1,463 more rows, and 318 more variables: bld_cchf_1 <chr>,
    #> #   bld_chic_1 <chr>, bld_cobu_1 <chr>, bld_crng_1 <chr>, bld_cymv_1 <chr>,
    #> #   bld_denv_1 <chr>, bld_ecsh_1 <chr>, bld_enfc_1 <chr>, bld_enfm_1 <chr>,
    #> #   bld_entv_4 <chr>, bld_gast_2 <chr>, bld_gbst_1 <chr>, bld_hepe_1 <chr>,
    #> #   bld_hflu_1 <chr>, bld_hiat_1 <chr>, bld_hitb_2 <chr>, bld_hpev_1 <chr>,
    #> #   bld_hsv1_1 <chr>, bld_hsv2_1 <chr>, bld_jenv_1 <chr>, bld_klpn_1 <chr>,
    #> #   bld_lass_1 <chr>, bld_lava_1 <chr>, bld_lavb_1 <chr>, bld_lavp_1 <chr>,
    #> #   bld_lept_1 <chr>, bld_limo_1 <chr>, bld_mevs_1 <chr>, bld_moca_1 <chr>,
    #> #   bld_mump_1 <chr>, bld_mytb_1 <chr>, bld_nego_1 <chr>, bld_nipv_1 <chr>,
    #> #   bld_nmen_1 <chr>, bld_orts_1 <chr>, bld_parv_1 <chr>, bld_plfa_1 <chr>,
    #> #   bld_plvi_1 <chr>, bld_psae_1 <chr>, bld_rick_1 <chr>, bld_rivf_1 <chr>,
    #> #   bld_rubv_2 <chr>, bld_salm_1 <chr>, bld_sals_1 <chr>, bld_sapa_1 <chr>,
    #> #   bld_saty_1 <chr>, bld_stau_2 <chr>, bld_stpn_2 <chr>, bld_stsu_1 <chr>,
    #> #   bld_toxo_1 <chr>, bld_trpa_1 <chr>, bld_urup_1 <chr>, bld_vazv_1 <chr>,
    #> #   bld_wenv_1 <chr>, bld_yefv_1 <chr>, bld_yers_1 <chr>, bld_zika_1 <chr>,
    #> #   bld_sp_abau_1 <chr>, bld_sp_adev_1 <chr>, bld_sp_bart_1 <chr>,
    #> #   bld_sp_bruc_1 <chr>, bld_sp_bups_1 <chr>, bld_sp_caal_1 <chr>,
    #> #   bld_sp_cchf_1 <chr>, bld_sp_chic_1 <chr>, bld_sp_cobu_1 <chr>,
    #> #   bld_sp_crng_1 <chr>, bld_sp_cymv_1 <chr>, bld_sp_denv_1 <chr>,
    #> #   bld_sp_ecsh_1 <chr>, bld_sp_enfc_1 <chr>, bld_sp_enfm_1 <chr>,
    #> #   bld_sp_entv_4 <chr>, bld_sp_gast_2 <chr>, bld_sp_gbst_1 <chr>,
    #> #   bld_sp_hepe_1 <chr>, bld_sp_hflu_1 <chr>, bld_sp_hiat_1 <chr>,
    #> #   bld_sp_hitb_2 <chr>, bld_sp_hpev_1 <chr>, bld_sp_hsv1_1 <chr>,
    #> #   bld_sp_hsv2_1 <chr>, bld_sp_jenv_1 <chr>, bld_sp_klpn_1 <chr>,
    #> #   bld_sp_lass_1 <chr>, bld_sp_lava_1 <chr>, bld_sp_lavb_1 <chr>,
    #> #   bld_sp_lavp_1 <chr>, bld_sp_lept_1 <chr>, bld_sp_limo_1 <chr>,
    #> #   bld_sp_mevs_1 <chr>, bld_sp_moca_1 <chr>, bld_sp_mump_1 <chr>,
    #> #   bld_sp_mytb_1 <chr>, bld_sp_nego_1 <chr>, bld_sp_nipv_1 <chr>,
    #> #   bld_sp_nmen_1 <chr>, bld_sp_orts_1 <chr>, bld_sp_parv_1 <chr>,
    #> #   bld_sp_plfa_1 <chr>, …
The tac_long table, in contrast, looks like this:
    d$tac_long
    #> # A tibble: 477,252 x 6
    #>    champs_deid          name    result specimen_type target pathogen
    #>    <chr>                <chr>   <chr>  <chr>         <chr>  <chr>
    #>  1 AB032461-9D11-4391-… bld_ab… Negat… Whole blood   ABAU_1 Acinetobacter bauma…
    #>  2 AB032461-9D11-4391-… bld_ad… Negat… Whole blood   ADEV_1 Adenovirus
    #>  3 AB032461-9D11-4391-… bld_ba… Negat… Whole blood   BART_1 Bartonella spp.
    #>  4 AB032461-9D11-4391-… bld_br… Negat… Whole blood   BRUC_1 Brucellaspp.
    #>  5 AB032461-9D11-4391-… bld_bu… Negat… Whole blood   BUPS_1 Burkholderia pseudo…
    #>  6 AB032461-9D11-4391-… bld_ca… Negat… Whole blood   CAAL_1 Candida albicans
    #>  7 AB032461-9D11-4391-… bld_cc… Negat… Whole blood   CCHF_1 Crimean-Congo Hemor…
    #>  8 AB032461-9D11-4391-… bld_ch… Negat… Whole blood   CHIC_1 Chikungunya virus
    #>  9 AB032461-9D11-4391-… bld_co… Negat… Whole blood   COBU_1 Coxiella burnettii
    #> 10 AB032461-9D11-4391-… bld_cr… Negat… Whole blood   CRNG_1 Cryptococcus neofor…
    #> # … with 477,242 more rows
To count the cases that have a positive Group B Streptococcus TAC result using the tac table, we would have to find all the column names that correspond to Group B Streptococcus and look for at least one positive result for each case (across all of the specimen types). This can get tedious and is not elegant to code.
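To make the wide-table approach concrete, here is a rough base R sketch on a made-up three-row table. The column names bld_gbst_1 and bld_sp_gbst_1 appear in the tac table printed above, but the assumption that every Group B Streptococcus column contains "_gbst_" has not been checked against the full data, and the rows are invented.

```r
# Toy wide table; the two gbst column names appear in the real tac table,
# but the rows and values here are made up for illustration.
tac_wide <- data.frame(
  champs_deid   = c("A", "B", "C"),
  bld_gbst_1    = c("Negative", "Positive", NA),
  bld_sp_gbst_1 = c("Negative", "Negative", NA),
  bld_klpn_1    = c("Positive", "Negative", "Negative"),
  stringsAsFactors = FALSE
)

# Assumption: every Group B Streptococcus column contains "_gbst_"
gbst_cols <- grep("_gbst_", names(tac_wide), value = TRUE)

# A case counts if any of its Group B Streptococcus columns is "Positive";
# NA (no result) is treated as not positive.
is_pos <- tac_wide[, gbst_cols, drop = FALSE] == "Positive"
any_pos <- rowSums(is_pos, na.rm = TRUE) > 0
ids_wide <- tac_wide$champs_deid[any_pos]

ids_wide
```

Even in this small example the logic depends on a naming convention for the columns, which is part of what makes this route fragile.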
As an alternative, we can use the tac_long table much more conveniently. To find cases with a positive result for Group B Streptococcus, we search for a match in the pathogen variable and then filter down to only positive results. There can be more than one positive result for any given case (because there are multiple specimen types), so we want to make sure we only count unique cases.
Below, we use dplyr’s filter() function to filter down to positive Group B Streptococcus results, and then use dplyr’s pull() function to extract the case identifiers (champs_deid), storing the unique values in a result named ids_tac. Note that we are using the pipe operator %>% to send the result of each line of code into the next.
    ids_tac <- d$tac_long %>%
      filter(pathogen == "Group B Streptococcus" & result == "Positive") %>%
      pull(champs_deid) %>%
      unique()

    length(ids_tac)
    #> [1] 207
We see that there are 207 unique cases that have a positive result for Group B Streptococcus, that is, cases that died with this pathogen.
Now let’s do a similar operation with the DeCoDe results. As with TAC, we will use the long format, dcd_long, to find cases.
Recall that dcd_long looks like this:
    d$dcd_long
    #> # A tibble: 44,280 x 5
    #>    champs_deid          champs_group_desc      type     etiol          etiol_num
    #>    <chr>                <chr>                  <chr>    <chr>              <int>
    #>  1 B6F81D28-CE23-4034-… Maternal HIV           immedia… Human Immunod…         1
    #>  2 B6F81D28-CE23-4034-… Maternal HIV           immedia… <NA>                   2
    #>  3 B6F81D28-CE23-4034-… Maternal HIV           immedia… <NA>                   3
    #>  4 56550401-228F-457C-… Neonatal preterm birt… immedia… <NA>                   1
    #>  5 56550401-228F-457C-… Neonatal preterm birt… immedia… <NA>                   2
    #>  6 56550401-228F-457C-… Neonatal preterm birt… immedia… <NA>                   3
    #>  7 BA74FEEB-7680-4ABA-… Perinatal asphyxia/hy… immedia… <NA>                   1
    #>  8 BA74FEEB-7680-4ABA-… Perinatal asphyxia/hy… immedia… <NA>                   2
    #>  9 BA74FEEB-7680-4ABA-… Perinatal asphyxia/hy… immedia… <NA>                   3
    #> 10 F773A228-9212-4670-… Neonatal preterm birt… immedia… <NA>                   1
    #> # … with 44,270 more rows
To find cases where Group B Streptococcus is in the causal chain, we want to search in either the champs_group_desc variable or the etiol variable for the condition of interest. In our case, we know that the condition of interest, “Streptococcus agalactiae”, is found in the etiol variable, so we search there.
    ids_dcd <- d$dcd_long %>%
      filter(etiol == "Streptococcus agalactiae") %>%
      pull(champs_deid) %>%
      unique()

    length(ids_dcd)
    #> [1] 45
We see that there are 45 unique cases that have Group B Streptococcus in the causal chain, that is, cases that died from this condition.
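When it is not known in advance whether a condition is recorded in champs_group_desc or in etiol, both variables can be checked at once. The following is a minimal base R sketch on a made-up table that borrows the column names of dcd_long; find_cases() is a hypothetical helper written for this illustration, not a function from the champs package.

```r
# Toy long-format DeCoDe table; column names follow dcd_long, rows are made up.
dcd_toy <- data.frame(
  champs_deid = c("A", "A", "B", "B", "C"),
  champs_group_desc = c("Maternal HIV", "Maternal HIV",
                        "Other group (made up)", "Other group (made up)",
                        "Other group (made up)"),
  etiol = c(NA, NA, "Streptococcus agalactiae", NA,
            "Streptococcus agalactiae"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: unique case IDs where the condition appears in either
# variable. %in% returns FALSE for NA, so missing etiologies never match.
find_cases <- function(dcd, condition) {
  hit <- dcd$champs_group_desc %in% condition | dcd$etiol %in% condition
  unique(dcd$champs_deid[hit])
}

ids_etiol <- find_cases(dcd_toy, "Streptococcus agalactiae")  # found in etiol
ids_group <- find_cases(dcd_toy, "Maternal HIV")  # found in champs_group_desc

ids_etiol
ids_group
```

Using %in% rather than == avoids NA values in the logical index, which would otherwise add NA entries to the result when subsetting in base R.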
Now, we want to tabulate these cases by site and age, which means that we need to look up the matching IDs for each set of cases in the demographics table. Suppose we only want to look at cases in Kenya, Mozambique, and Ethiopia. To do this for DeCoDe:
    sites <- c("Kenya", "Mozambique", "Ethiopia")

    dcd_dmg <- d$dmg %>%
      filter(champs_deid %in% ids_dcd & site %in% sites)
Now, since the demographics table contains one row per case, we can tabulate the cases by age group and site using xtabs(). Note that there are many ways this could be done (e.g. using group_by() and tally()), but xtabs() provides a nice output for displaying in this example.
    xtabs(~ age_group + site, data = dcd_dmg, drop.unused.levels = TRUE) %>%
      addmargins()
    #>                              site
    #> age_group                     Kenya Mozambique Ethiopia Sum
    #>   Stillbirth                      1          1        0   2
    #>   Death in the first 24 hours     1          3        0   4
    #>   Early Neonate (1 to 6 days)     0          1        1   2
    #>   Late Neonate (7 to 27 days)     1          0        0   1
    #>   Sum                             3          5        1   9
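For comparison, the group_by() and tally() route mentioned above might look like the following sketch. It runs on a small made-up demographics table (one row per case) rather than the real d$dmg, so the counts are illustrative only.

```r
library(dplyr)

# Toy demographics table (one row per case); all values are made up.
dmg_toy <- data.frame(
  champs_deid = c("A", "B", "C", "D"),
  site = c("Kenya", "Kenya", "Mozambique", "Kenya"),
  age_group = c("Stillbirth", "Stillbirth", "Stillbirth",
                "Late Neonate (7 to 27 days)"),
  stringsAsFactors = FALSE
)

# One row per observed age group / site combination, with the count in n
counts <- dmg_toy %>%
  group_by(age_group, site) %>%
  tally() %>%
  ungroup()

counts
```

The result is in long format and only contains combinations that actually occur, so zero cells and margins are not shown. That is convenient for further computation but less readable than the xtabs() layout.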
We can do the same for TAC:
    tac_dmg <- d$dmg %>%
      filter(champs_deid %in% ids_tac & site %in% sites)

    xtabs(~ age_group + site, data = tac_dmg, drop.unused.levels = TRUE) %>%
      addmargins()
    #>                                           site
    #> age_group                                  Kenya Mozambique Ethiopia Sum
    #>   Stillbirth                                   8          1        0   9
    #>   Death in the first 24 hours                  4          6        1  11
    #>   Early Neonate (1 to 6 days)                  3          1        1   5
    #>   Late Neonate (7 to 27 days)                  3          2        0   5
    #>   Infant (28 days to less than 12 months)     10          2        1  13
    #>   Child (12 months to less than 60 Months)    11          5        0  16
    #>   Sum                                         39         17        3  59
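Since the DeCoDe and TAC tabulations follow the same steps, they can be wrapped in a small helper. The sketch below is written in base R on made-up data; tabulate_cases() is a hypothetical name, and this is not the implementation of calc_cc_detected_by_site_age(), only an illustration of the steps walked through in this article.

```r
# Toy demographics table and ID sets; all values are made up.
dmg_toy <- data.frame(
  champs_deid = c("A", "B", "C", "D", "E"),
  site = c("Kenya", "Kenya", "Mozambique", "Ethiopia", "Site X"),
  age_group = c("Stillbirth", "Stillbirth", "Stillbirth",
                "Late Neonate (7 to 27 days)", "Stillbirth"),
  stringsAsFactors = FALSE
)
ids_tac_toy <- c("A", "B", "C", "D", "E")
ids_dcd_toy <- c("B", "D")
sites <- c("Kenya", "Mozambique", "Ethiopia")

# Hypothetical helper: tabulate a set of case IDs by age group and site.
tabulate_cases <- function(dmg, ids, sites) {
  keep <- dmg$champs_deid %in% ids & dmg$site %in% sites
  sub <- dmg[keep, ]
  # Fixed factor levels give every table the same rows and columns,
  # including cells with zero cases.
  sub$site <- factor(sub$site, levels = sites)
  sub$age_group <- factor(sub$age_group, levels = unique(dmg$age_group))
  addmargins(xtabs(~ age_group + site, data = sub))
}

tab_tac <- tabulate_cases(dmg_toy, ids_tac_toy, sites)
tab_dcd <- tabulate_cases(dmg_toy, ids_dcd_toy, sites)

tab_tac
tab_dcd
```

Unlike the calls above, this version keeps unused levels rather than dropping them, so the two tables have identical dimensions and can be compared cell by cell.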