
SARS-CoV-2 Wastewater Data, Enhanced

 2 years ago
source link: https://towardsdatascience.com/sars-cov-2-wastewater-data-enhanced-9717d8197f98

Combining CDC’s wastewater data with COVID-19 vaccinations, hospitalizations and deaths

Photo by Fusion Medical Animation on Unsplash

Background

There has been a lot of press coverage lately about tracking the SARS-CoV-2 virus, which causes the COVID-19 disease, in wastewater (sewer systems). A strong link has been established between spikes in wastewater virus and spikes in disease about a week later. As just one example, this chart shows the virus signal (dark blue) across various water treatment facilities in the United States versus known US COVID-19 cases (light blue).

Image from biobot.io/data/, used with permission

Recently the National Wastewater Surveillance System (NWSS) of the US CDC released public data about the wastewater information it gathers. Its dashboard provides a nice summary of the data and an introduction to the topic.

NWSS also makes available two detailed datasets that show every laboratory test performed on wastewater and its results. These datasets contain many columns of information, including:

  • Name and location of the water treatment facility
  • Date and time the sample was gathered
  • Counties and population that drain into that sewer
  • Types of tests performed on the sewage
  • The identity of the lab that did the tests
  • Numerical results of the tests for RNA fragments

These datasets are available to researchers who sign a data-use agreement to protect confidential personal information they might contain. (With a small sewer system it might be possible to infer the names of people who are sick.)

The two detailed datasets differ in that one, named raw, contains just the facts, while the other, analytical, has many columns of statistical results that may help scientists analyze the data.

A Problem

As valuable as the detailed CDC wastewater data is, it is lacking in one key area: correlation with additional real-world COVID-19 data such as vaccination and disease outcomes. The datasets do have information about known cases in the area, but simple case counts are notoriously unreliable. Many sick people never get a government-reported PCR test; the availability of testing varies month to month; people tend to rush out for testing when there are scary news reports; and home tests are not reported to government health departments. More accurate measures of disease severity are hospitalization rates, ICU admissions and mortality.

A simple approach would be to join wastewater data with vaccination rates and disease outcomes from the same day the wastewater sample was collected, but this would not be helpful. Vaccination offers no added disease resistance on the day of the jab. People who get sick rarely go to the hospital on the first day they have symptoms. When, unfortunately, someone dies of COVID-19, that usually occurs days after they enter a hospital.

To see how vaccination influences SARS-CoV-2 in wastewater, we need to look at various vaccination rates (first shot, full vaccine, booster) some number of days before the wastewater is collected. To see how wastewater predicts hospitalizations, we need to look at hospital data after the water test. To correlate wastewater with mortality, we need an even longer time delta.

Solution

A recent data engineering project I undertook addresses these issues. It generates an enhanced version of the NWSS detailed datasets with vaccination, hospitalization and death data for that area, where the date of each added fact is offset to fall before or after the water test as appropriate. For example, in a data row that describes a water sample gathered on September 15 in Arenac County, Michigan, there is a column added for the full vaccination rate in that county on September 5 and another added for ICU admits there on September 29.

The setback/set-ahead dates are easily adjustable, and it is quick to regenerate the dataset with, for example, a 21-day setback for first vaccination and a 28-day set-ahead for mortality.
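The setback and set-ahead idea can be illustrated with a minimal pandas sketch. This is not the project's actual code: the `OFFSETS` dictionary, the helper names `add_offset_dates` and `join_county_metric`, and the `date` column in the county table are all assumptions made for illustration. The date column names and values are borrowed from the sample row shown in this article.

```python
import pandas as pd

# Hypothetical offsets in days: negative = before the sample, positive = after.
OFFSETS = {"vax_date": -10, "cases_date": 7, "hosp_date": 14, "deaths_date": 21}


def add_offset_dates(samples: pd.DataFrame, offsets: dict = OFFSETS) -> pd.DataFrame:
    """Add one lookup-date column per offset, relative to the sample date."""
    out = samples.copy()
    collected = pd.to_datetime(out["sample_collect_date"])
    for column, days in offsets.items():
        out[column] = collected + pd.Timedelta(days=days)
    return out


def join_county_metric(samples, county_data, date_column, value_columns):
    """Left-join county values taken on the lookup date held in date_column."""
    right = county_data[["CountyFIPS", "date"] + value_columns]
    right = right.rename(columns={"date": date_column})
    return samples.merge(right, on=["CountyFIPS", date_column], how="left")


# One wastewater sample and a tiny county-level table (assumed layout).
samples = pd.DataFrame(
    {"CountyFIPS": ["08069"], "sample_collect_date": ["2022-01-20"]}
)
county_data = pd.DataFrame(
    {
        "CountyFIPS": ["08069", "08069"],
        "date": pd.to_datetime(["2022-01-10", "2022-02-03"]),
        "metrics.vaccinationsCompletedRatio": [0.656, None],
        "actuals.icuBeds.currentUsageCovid": [None, 13],
    }
)

enhanced = add_offset_dates(samples)
enhanced = join_county_metric(
    enhanced, county_data, "vax_date", ["metrics.vaccinationsCompletedRatio"]
)
enhanced = join_county_metric(
    enhanced, county_data, "hosp_date", ["actuals.icuBeds.currentUsageCovid"]
)
print(enhanced.T)
```

Changing a value in `OFFSETS` and rerunning is all it takes to try a different time delta in this sketch.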

Another problem solved by my dataset is that most US COVID-19 health data is organized at the county level (or state or country). However, water treatment plants often handle sewage from more than one county. My code reformats the NWSS detail files so that each row shows just one county, to make it easier to join the datasets with available health outcome data.
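As a rough sketch of that reshaping step, pandas can split a multi-county field and emit one row per county. The input column name `county_fips`, the comma separator, the helper name `one_row_per_county`, and the plant names and second concentration value are assumptions for illustration only; the real NWSS files and the project's code may differ.

```python
import pandas as pd


def one_row_per_county(
    samples: pd.DataFrame, fips_column: str = "county_fips", sep: str = ","
) -> pd.DataFrame:
    """Repeat each sample row once per county served by the treatment plant."""
    out = samples.copy()
    # Turn "08069,08123" into ["08069", "08123"], then one row per list item.
    out["CountyFIPS"] = out[fips_column].astype(str).str.split(sep)
    out = out.explode("CountyFIPS", ignore_index=True)
    # Normalize to 5-character FIPS strings so later joins match exactly.
    out["CountyFIPS"] = out["CountyFIPS"].str.strip().str.zfill(5)
    return out


# Illustrative input: plant_a serves two counties, plant_b serves one.
samples = pd.DataFrame(
    {
        "wwtp_name": ["plant_a", "plant_b"],
        "county_fips": ["08069,08123", "26011"],
        "pcr_target_avg_conc": [490473.59, 1000.0],
    }
)

per_county = one_row_per_county(samples)
print(per_county[["wwtp_name", "CountyFIPS", "pcr_target_avg_conc"]])
```

Each output row keeps the plant-level measurement, so a plant serving two counties contributes the same concentration to both county rows.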

Here is one enhanced row, showing a few key columns, turned vertically for readability:

CountyFIPS = 08069
pcr_gene_target = n1
sample_collect_date = 2022-01-20
pcr_target_units = copies/l wastewater
pcr_target_avg_conc = 490473.59
vax_date = 2022-01-10  # 10 days before water sample
metrics.vaccinationsInitiatedRatio = 0.724
metrics.vaccinationsCompletedRatio = 0.656
metrics.vaccinationsAdditionalDoseRatio = 0.324
cases_date = 2022-01-27  # 7 days after water sample
metrics.caseDensity100k = 155.5
metrics.testPositivityRatio = 0.214
hosp_date = 2022-02-03  # 14 days after water sample
actuals.icuBeds.currentUsageCovid = 13
actuals.hospitalBeds.currentUsageCovid = 91
deaths_date = 2022-02-10  # 21 days after water sample
actuals.newDeaths = 2

Larger anonymized samples of the enhanced detailed datasets are on my GitHub for the raw and analytical data. Each has 1000 random rows and all the columns.

Replicating This Result

To build your own enhanced NWSS detailed dataset:

  • Download the Python/pandas source code.
  • Apply to NWSS for access to the detailed datasets of wastewater samples. The contact information for the data owner is on the data description page. Note that this page describes the summarized public NWSS data, but the contact is the same.
  • Download COVID-19 vaccination and outcome data, and US county populations, as shown in the source code comments. These will be joined with the NWSS detailed datasets to make the enhanced versions.

The code and sample data are posted under the MIT license, which requires only attribution to reuse or modify the work.

Future Work

This project created the datasets described here, but did not perform any specific analysis on them. An obvious next step is to use the enhanced datasets to look for correlations between wastewater results and ICU admissions, to test whether the first vaccine dose or full vaccination has more effect on wastewater virus levels, or to find the optimum time delta for one of these events to predict another.
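One possible way to start on the time-delta question is to scan a range of lags and compare correlations. The sketch below is an assumption about how such an analysis might begin, not something this project performed. It uses synthetic data in which the outcome is, by construction, a copy of the signal delayed by 14 days, and the helper name `lagged_correlations` is hypothetical.

```python
import numpy as np
import pandas as pd


def lagged_correlations(
    signal: pd.Series, outcome: pd.Series, max_lag_days: int = 28
) -> pd.Series:
    """Pearson correlation of signal on day t with outcome on day t + lag."""
    results = {}
    for lag in range(max_lag_days + 1):
        # shift(-lag) moves the outcome from day t + lag back onto day t.
        results[lag] = signal.corr(outcome.shift(-lag))
    return pd.Series(results, name="correlation").rename_axis("lag_days")


# Synthetic daily series: the "ICU" outcome is the signal delayed by 14 days.
dates = pd.date_range("2022-01-01", periods=120, freq="D")
rng = np.random.default_rng(0)
wastewater = pd.Series(rng.random(120), index=dates)
icu = wastewater.shift(14)

correlations = lagged_correlations(wastewater, icu)
best_lag = int(correlations.idxmax())
print(f"Strongest correlation at a lag of {best_lag} days")
```

Real data would be far noisier, and samples are not collected daily at every site, so a serious analysis would need resampling and some care about confounders.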

The data discussed here is for the United States. There is also wastewater testing around the world for SARS-CoV-2 RNA. The sites doing this detection are shown on dashboards from University of California at Merced and the Global Water Pathogens Project. A valuable extension to my work would be to create a single global dataset that combines all the wastewater detection sites shown on the maps and to enhance that data with cases, hospitalizations, ICU and mortality. In other words, to create the same dataset presented above for the whole world.

Credit

Thank you to Amy Kirby and the NWSS data owner at CDC for helpful discussions and access to the detailed datasets, to Colleen Naughton and Claire Duvallet for encouragement and pointers to other research related to wastewater-based epidemiology, and to Mimi Alkattan for valuable review comments.
