Deep dive into a calc function

In this article we illustrate in detail how to compute a common set of statistics of interest with CHAMPS data. This will help newcomers to CHAMPS data understand the nuances of the different tables and how they can be used together to answer questions of interest.

We will illustrate the computation provided by calc_cc_detected_by_site_age(): we count the cases in which death was due to a given condition (etiology) according to DeCoDe, and compare that with the number of cases in which a given pathogen was detected in the TAC results, broken down by site and age. This requires merging information from the demographics, DeCoDe, and TAC tables.

Before diving into the data, we assume that you have read the CHAMPS introduction and data articles.

In this example, we compare the number of cases in which Group B Streptococcus was detected as a pathogen in the TAC results with the number of cases in which DeCoDe determined that Group B Streptococcus was in the causal chain.

For a quick refresh on the distinction between TAC and DeCoDe:

In TAC results, we are looking for detection of a pathogen in one or more specimen types. A pathogen is present if its result is “Positive”.

In DeCoDe, we are looking for a condition that was determined to be in the causal chain leading to death. The condition is recorded either as a CHAMPS group description or as an etiology.

The distinction between counting things in DeCoDe vs. TAC can be thought of as comparing dying from a condition vs. dying with a pathogen.

To begin our calculation, we load the package and the de-identified data, which here we have downloaded to a location ~/Downloads/CHAMPS_de_identified_data.

library(champs)
library(dplyr)
library(stringr)
d <- load_data("~/Downloads/CHAMPS_de_identified_data")

We want to specify the condition and pathogen associated with Group B Streptococcus. Note that conditions and pathogens can overlap but may not always be expressed in the same way. You can call valid_conditions(d) to get a list of all conditions for which data exists (all unique values of etiologies or CHAMPS group descriptions), and you can call valid_pathogens() for a list of all valid pathogen values as detected by TAC.

Searching the results of these valid_ functions, we see that the condition we are looking for is not coded as “Group B Streptococcus”, but as “Streptococcus agalactiae”. The pathogen we are looking for, on the other hand, is simply “Group B Streptococcus”.
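The search described above can be sketched with base R's grep(). This is only an illustration: the two short vectors are toy stand-ins for the output of valid_conditions(d) and valid_pathogens(), and apart from the two values named in the text their entries are made up for the example.

```r
# Toy stand-ins for the vectors returned by valid_conditions(d) and
# valid_pathogens(); the real lists are much longer.
conditions <- c("Escherichia coli", "Klebsiella pneumoniae",
                "Streptococcus agalactiae", "Streptococcus pneumoniae")
pathogens <- c("Escherichia coli", "Group B Streptococcus",
               "Streptococcus pneumoniae")

# Case-insensitive substring search; value = TRUE returns the matching
# strings rather than their positions.
strep_conditions <- grep("strep", conditions, ignore.case = TRUE, value = TRUE)
strep_pathogens <- grep("strep", pathogens, ignore.case = TRUE, value = TRUE)

print(strep_conditions)
print(strep_pathogens)
```

Searching on a short fragment such as "strep" rather than the full name makes it easier to spot that the same organism is labeled differently in the two lists.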

To start, let’s look at the TAC results to find cases where there is a positive result for “Group B Streptococcus”. We can either use the tac_long table or the tac table. Recall that the tac table looks like this:

d$tac
#> # A tibble: 1,473 x 325
#>    champs_deid bld_abau_1 bld_adev_1 bld_bart_1 bld_bruc_1 bld_bups_1 bld_caal_1
#>    <chr>       <chr>      <chr>      <chr>      <chr>      <chr>      <chr>     
#>  1 AB032461-9… Negative   Negative   Negative   Negative   Negative   Negative  
#>  2 46B7C6E7-2… Negative   Negative   Negative   Negative   Negative   Negative  
#>  3 3791BBA8-4… Negative   Negative   Negative   Negative   Negative   Positive  
#>  4 52662D06-B… <NA>       Negative   <NA>       <NA>       <NA>       <NA>      
#>  5 7C1C58CB-3… <NA>       Negative   <NA>       <NA>       <NA>       <NA>      
#>  6 3D6C22BB-D… Negative   Negative   Negative   Negative   Negative   Negative  
#>  7 05B7F6AD-7… Negative   Negative   Negative   Negative   Negative   Negative  
#>  8 4A520B0F-4… <NA>       Negative   <NA>       <NA>       <NA>       <NA>      
#>  9 C8F59F60-1… Negative   Negative   Negative   Negative   Negative   Negative  
#> 10 D2406FD8-D… <NA>       Negative   <NA>       <NA>       <NA>       <NA>      
#> # … with 1,463 more rows, and 318 more variables: bld_cchf_1 <chr>,
#> #   bld_chic_1 <chr>, bld_cobu_1 <chr>, bld_crng_1 <chr>, bld_cymv_1 <chr>,
#> #   bld_denv_1 <chr>, bld_ecsh_1 <chr>, bld_enfc_1 <chr>, bld_enfm_1 <chr>,
#> #   bld_entv_4 <chr>, bld_gast_2 <chr>, bld_gbst_1 <chr>, bld_hepe_1 <chr>,
#> #   bld_hflu_1 <chr>, bld_hiat_1 <chr>, bld_hitb_2 <chr>, bld_hpev_1 <chr>,
#> #   bld_hsv1_1 <chr>, bld_hsv2_1 <chr>, bld_jenv_1 <chr>, bld_klpn_1 <chr>,
#> #   bld_lass_1 <chr>, bld_lava_1 <chr>, bld_lavb_1 <chr>, bld_lavp_1 <chr>,
#> #   bld_lept_1 <chr>, bld_limo_1 <chr>, bld_mevs_1 <chr>, bld_moca_1 <chr>,
#> #   bld_mump_1 <chr>, bld_mytb_1 <chr>, bld_nego_1 <chr>, bld_nipv_1 <chr>,
#> #   bld_nmen_1 <chr>, bld_orts_1 <chr>, bld_parv_1 <chr>, bld_plfa_1 <chr>,
#> #   bld_plvi_1 <chr>, bld_psae_1 <chr>, bld_rick_1 <chr>, bld_rivf_1 <chr>,
#> #   bld_rubv_2 <chr>, bld_salm_1 <chr>, bld_sals_1 <chr>, bld_sapa_1 <chr>,
#> #   bld_saty_1 <chr>, bld_stau_2 <chr>, bld_stpn_2 <chr>, bld_stsu_1 <chr>,
#> #   bld_toxo_1 <chr>, bld_trpa_1 <chr>, bld_urup_1 <chr>, bld_vazv_1 <chr>,
#> #   bld_wenv_1 <chr>, bld_yefv_1 <chr>, bld_yers_1 <chr>, bld_zika_1 <chr>,
#> #   bld_sp_abau_1 <chr>, bld_sp_adev_1 <chr>, bld_sp_bart_1 <chr>,
#> #   bld_sp_bruc_1 <chr>, bld_sp_bups_1 <chr>, bld_sp_caal_1 <chr>,
#> #   bld_sp_cchf_1 <chr>, bld_sp_chic_1 <chr>, bld_sp_cobu_1 <chr>,
#> #   bld_sp_crng_1 <chr>, bld_sp_cymv_1 <chr>, bld_sp_denv_1 <chr>,
#> #   bld_sp_ecsh_1 <chr>, bld_sp_enfc_1 <chr>, bld_sp_enfm_1 <chr>,
#> #   bld_sp_entv_4 <chr>, bld_sp_gast_2 <chr>, bld_sp_gbst_1 <chr>,
#> #   bld_sp_hepe_1 <chr>, bld_sp_hflu_1 <chr>, bld_sp_hiat_1 <chr>,
#> #   bld_sp_hitb_2 <chr>, bld_sp_hpev_1 <chr>, bld_sp_hsv1_1 <chr>,
#> #   bld_sp_hsv2_1 <chr>, bld_sp_jenv_1 <chr>, bld_sp_klpn_1 <chr>,
#> #   bld_sp_lass_1 <chr>, bld_sp_lava_1 <chr>, bld_sp_lavb_1 <chr>,
#> #   bld_sp_lavp_1 <chr>, bld_sp_lept_1 <chr>, bld_sp_limo_1 <chr>,
#> #   bld_sp_mevs_1 <chr>, bld_sp_moca_1 <chr>, bld_sp_mump_1 <chr>,
#> #   bld_sp_mytb_1 <chr>, bld_sp_nego_1 <chr>, bld_sp_nipv_1 <chr>,
#> #   bld_sp_nmen_1 <chr>, bld_sp_orts_1 <chr>, bld_sp_parv_1 <chr>,
#> #   bld_sp_plfa_1 <chr>, …

While the tac_long table looks like this:

d$tac_long
#> # A tibble: 477,252 x 6
#>    champs_deid          name    result specimen_type target pathogen            
#>    <chr>                <chr>   <chr>  <chr>         <chr>  <chr>               
#>  1 AB032461-9D11-4391-… bld_ab… Negat… Whole blood   ABAU_1 Acinetobacter bauma…
#>  2 AB032461-9D11-4391-… bld_ad… Negat… Whole blood   ADEV_1 Adenovirus          
#>  3 AB032461-9D11-4391-… bld_ba… Negat… Whole blood   BART_1 Bartonella spp.     
#>  4 AB032461-9D11-4391-… bld_br… Negat… Whole blood   BRUC_1 Brucellaspp.        
#>  5 AB032461-9D11-4391-… bld_bu… Negat… Whole blood   BUPS_1 Burkholderia pseudo…
#>  6 AB032461-9D11-4391-… bld_ca… Negat… Whole blood   CAAL_1 Candida albicans    
#>  7 AB032461-9D11-4391-… bld_cc… Negat… Whole blood   CCHF_1 Crimean-Congo Hemor…
#>  8 AB032461-9D11-4391-… bld_ch… Negat… Whole blood   CHIC_1 Chikungunya virus   
#>  9 AB032461-9D11-4391-… bld_co… Negat… Whole blood   COBU_1 Coxiella burnettii  
#> 10 AB032461-9D11-4391-… bld_cr… Negat… Whole blood   CRNG_1 Cryptococcus neofor…
#> # … with 477,242 more rows

To count the cases that have a positive Group B Streptococcus TAC result using the tac table, we would have to find all the column names that correspond to Group B Streptococcus and look for at least one positive result for each case (across all of the specimen types). This can get tedious and is not elegant to code.
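To make the wide-table approach concrete, here is a minimal sketch on a toy data frame. It assumes that the columns holding Group B Streptococcus results are the ones whose names contain "_gbst_" (as in bld_gbst_1 and bld_sp_gbst_1 in the printout above); that mapping is an assumption for illustration, and the case identifiers are made up.

```r
# Toy wide table: one row per case, one column per specimen/target pair.
tac_toy <- data.frame(
  champs_deid   = c("case-1", "case-2", "case-3", "case-4"),
  bld_gbst_1    = c("Negative", "Positive", NA, "Negative"),
  bld_sp_gbst_1 = c("Negative", "Positive", "Positive", NA),
  bld_klpn_1    = c("Positive", "Negative", "Negative", "Negative"),
  stringsAsFactors = FALSE
)

# Assumption: column names containing "_gbst_" hold Group B Streptococcus results.
gbst_cols <- grep("_gbst_", names(tac_toy), value = TRUE)

# A case counts once if any of its matching columns is "Positive".
# Missing results (NA) are treated as not positive.
n_positive <- rowSums(tac_toy[, gbst_cols, drop = FALSE] == "Positive", na.rm = TRUE)
ids_wide <- tac_toy$champs_deid[n_positive > 0]

print(ids_wide)
```

Even in this small example the column matching depends on a naming convention, which is part of why the long format is easier to work with.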

As an alternative, we can use the tac_long table much more conveniently. To find cases with a positive result for Group B Streptococcus, we search for a match in the pathogen variable and then filter down to only positive results. There can be more than one positive result for any given case (because there are multiple specimen types), so we want to make sure we only count unique cases.

Below, we use dplyr’s filter() function to filter down to positive Group B Streptococcus results, and then use dplyr’s pull() function to select all of the case identifiers (champs_deid) and store the unique values in a result named ids_tac. Note that we are using the pipe operator %>% to send the result of each line of code into the next.

ids_tac <- d$tac_long %>%
  filter(pathogen == "Group B Streptococcus" & result == "Positive") %>%
  pull(champs_deid) %>%
  unique()

length(ids_tac)
#> [1] 207

We see that 207 unique cases have a positive result for Group B Streptococcus; in other words, 207 cases died with this pathogen.

Now let’s do a similar operation with the DeCoDe results. As with TAC, we will use the long format, dcd_long, to find cases.

Recall that dcd_long looks like this:

d$dcd_long
#> # A tibble: 44,280 x 5
#>    champs_deid          champs_group_desc      type     etiol          etiol_num
#>    <chr>                <chr>                  <chr>    <chr>              <int>
#>  1 B6F81D28-CE23-4034-… Maternal HIV           immedia… Human Immunod…         1
#>  2 B6F81D28-CE23-4034-… Maternal HIV           immedia… <NA>                   2
#>  3 B6F81D28-CE23-4034-… Maternal HIV           immedia… <NA>                   3
#>  4 56550401-228F-457C-… Neonatal preterm birt… immedia… <NA>                   1
#>  5 56550401-228F-457C-… Neonatal preterm birt… immedia… <NA>                   2
#>  6 56550401-228F-457C-… Neonatal preterm birt… immedia… <NA>                   3
#>  7 BA74FEEB-7680-4ABA-… Perinatal asphyxia/hy… immedia… <NA>                   1
#>  8 BA74FEEB-7680-4ABA-… Perinatal asphyxia/hy… immedia… <NA>                   2
#>  9 BA74FEEB-7680-4ABA-… Perinatal asphyxia/hy… immedia… <NA>                   3
#> 10 F773A228-9212-4670-… Neonatal preterm birt… immedia… <NA>                   1
#> # … with 44,270 more rows

To find cases where Group B Streptococcus is in the causal chain, we want to search in either the champs_group_desc variable or the etiol variable for the condition of interest. In our case, we know that the condition of interest, “Streptococcus agalactiae”, is found in the etiol variable, so we search there.

ids_dcd <- d$dcd_long %>%
  filter(etiol == "Streptococcus agalactiae") %>%
  pull(champs_deid) %>%
  unique()

length(ids_dcd)
#> [1] 45

We see that 45 unique cases have Group B Streptococcus in the causal chain; in other words, 45 cases died from this condition.
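When it is not known in advance whether a condition appears as a group description or as an etiology, both columns can be searched at once. The sketch below assumes a toy table with the same column names as dcd_long; find_condition_ids() is a hypothetical helper written for this illustration, not a function from the champs package, and the group description "Sepsis" and the case identifiers are made-up values.

```r
library(dplyr)

# Toy stand-in for d$dcd_long (only the columns needed here).
dcd_toy <- data.frame(
  champs_deid = c("case-1", "case-1", "case-2", "case-3", "case-4"),
  champs_group_desc = c("Sepsis", "Sepsis", "Sepsis", "Sepsis", "Maternal HIV"),
  etiol = c("Streptococcus agalactiae", NA, NA, "Klebsiella pneumoniae", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: unique case IDs where the condition matches either column.
# %in% returns FALSE (never NA) for missing values, so rows with a missing
# etiology are handled explicitly.
find_condition_ids <- function(dcd, condition) {
  dcd %>%
    filter(champs_group_desc %in% condition | etiol %in% condition) %>%
    pull(champs_deid) %>%
    unique()
}

print(find_condition_ids(dcd_toy, "Streptococcus agalactiae"))
print(find_condition_ids(dcd_toy, "Maternal HIV"))
```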

Now, we want to tabulate these cases by site and age, which means that we need to find the matching IDs for each in the demographics table. Suppose we only want to look at cases in Kenya, Mozambique, and Ethiopia. To do this for DeCoDe:

sites <- c("Kenya", "Mozambique", "Ethiopia")

dcd_dmg <- d$dmg %>%
  filter(champs_deid %in% ids_dcd & site %in% sites)

Now, since the demographics table contains one row per case, we can tabulate the cases by age group and site using xtabs(). Note that there are many ways this could be done (e.g. using group_by() and tally()), but xtabs() provides a nice output for displaying in this example.

xtabs(~ age_group + site, data = dcd_dmg, drop.unused.levels = TRUE) %>%
  addmargins()
#>                              site
#> age_group                     Kenya Mozambique Ethiopia Sum
#>   Stillbirth                      1          1        0   2
#>   Death in the first 24 hours     1          3        0   4
#>   Early Neonate (1 to 6 days)     0          1        1   2
#>   Late Neonate (7 to 27 days)     1          0        0   1
#>   Sum                             3          5        1   9
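For comparison, the group_by() and tally() route mentioned above can be sketched as follows. The data frame is a toy stand-in for dcd_dmg with made-up rows, so the counts are not CHAMPS results.

```r
library(dplyr)

# Toy stand-in for dcd_dmg: one row per case.
dmg_toy <- data.frame(
  champs_deid = paste0("case-", 1:5),
  site = c("Kenya", "Kenya", "Mozambique", "Mozambique", "Ethiopia"),
  age_group = c("Stillbirth", "Stillbirth", "Stillbirth",
                "Early Neonate (1 to 6 days)", "Early Neonate (1 to 6 days)"),
  stringsAsFactors = FALSE
)

# One row per observed age group / site combination, with the count in column n.
counts <- dmg_toy %>%
  group_by(age_group, site) %>%
  tally() %>%
  ungroup()

print(counts)
```

This returns the counts in long form, which is convenient for further processing or plotting. Unlike the xtabs() table, combinations with zero cases do not appear in the result.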

We can do the same for TAC:

tac_dmg <- d$dmg %>%
  filter(champs_deid %in% ids_tac & site %in% sites)

xtabs(~ age_group + site, data = tac_dmg, drop.unused.levels = TRUE) %>%
  addmargins()
#>                                           site
#> age_group                                  Kenya Mozambique Ethiopia Sum
#>   Stillbirth                                   8          1        0   9
#>   Death in the first 24 hours                  4          6        1  11
#>   Early Neonate (1 to 6 days)                  3          1        1   5
#>   Late Neonate (7 to 27 days)                  3          2        0   5
#>   Infant (28 days to less than 12 months)     10          2        1  13
#>   Child (12 months to less than 60 Months)    11          5        0  16
#>   Sum                                         39         17        3  59
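Because both steps produce plain vectors of case identifiers, the "dying from" and "dying with" groups can also be compared directly with base R set operations. The sketch below uses made-up toy IDs in place of ids_tac and ids_dcd, so its output says nothing about the real data.

```r
# Toy stand-ins for ids_tac and ids_dcd.
ids_tac_toy <- c("case-1", "case-2", "case-3", "case-4")
ids_dcd_toy <- c("case-2", "case-4", "case-5")

# Pathogen detected by TAC and condition in the causal chain.
both <- intersect(ids_tac_toy, ids_dcd_toy)
# Pathogen detected by TAC, but condition not in the causal chain.
tac_only <- setdiff(ids_tac_toy, ids_dcd_toy)
# Condition in the causal chain without a positive TAC result.
dcd_only <- setdiff(ids_dcd_toy, ids_tac_toy)

print(both)
print(tac_only)
print(dcd_only)
```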