COVID-19 cluster surveillance using exposure data collected from routine contact tracing: The genomic validation of a novel informatics-based approach to outbreak detection in England
Simon Packer, Piotr Patrzylas, Iona Smith, Cong Chen, Adrian Wensley, Olisaeloka Nsonwu, Kyle Dack, Charlie Turner, Charlotte Anderson, Rachel Kwiatkowska, Isabel Oliver, Obaghe Edeghere, Graham Fraser, Gareth Hughes

TL;DR
This study shows that enhanced contact tracing combined with genomic data can effectively detect real COVID-19 outbreaks in England.
Contribution
The first national validation of using contact tracing data to systematically detect outbreaks at a population level.
Findings
Over 269,470 exposure clusters were identified, with 25% genetically validated as real outbreaks.
81% of genetically valid clusters were not recorded in existing systems and were detected one day earlier.
Workplace and education clusters were most strongly associated with genetic validity.
Abstract
Contact tracing was used globally to prevent onwards transmission of COVID-19. Tracing contacts alone is unlikely to be sufficient in controlling community transmission, due to the pre-symptomatic, overdispersed and airborne nature of COVID-19 transmission. We describe and demonstrate the validity of a national enhanced contact tracing programme for COVID-19 cluster surveillance in England. Data on cases occurring between October 2020 and September 2021 were extracted from the national contact tracing system. Exposure clusters were identified algorithmically by matching ≥2 cases attending the same event, identified by matching postcode and event category within a 7-day rolling window. Genetic validity was defined as exposure clusters with ≥2 cases from different households with identical viral sequences. Exposure clusters were fuzzy matched to the national incident management system…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Fig 1
Fig 2Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsCOVID-19 Digital Contact Tracing · COVID-19 epidemiological studies · Data-Driven Disease Surveillance
Introduction
Globally, contact tracing was deployed during the COVID-19 pandemic to limit and prevent viral transmission through the identification and isolation of persons at greatest risk of developing disease [1–4]. It was recognised early in the pandemic that SARS-CoV-2 could be transmitted prior to symptom onset, and that transmission was overdispersed, with a minority of cases contributing to the majority of onward transmission events [5,6]. These observations suggested that conventional contact tracing alone, primarily focusing on the identification and isolation of named contacts, would have limited impact on the control of community transmission [2,7–13]. These observations were supported by several investigations of COVID-19 clusters where the primary cases and subsequent chains of transmission would not have been identified using traditional contact tracing methods alone [14–21]. While cluster surveillance based on confirmed cases can have significant utility (e.g. for the monitoring of continuing outbreaks in institutional settings), it provides at best indirect evidence for primary events responsible for transmission.
Backwards contact tracing (BCT) aims to identify the index case and other cases linked to the common source/setting of infection [22]. Modelling studies suggested that capturing cases’ exposure data during BCT could substantially increase contact tracing effectiveness [6,23]. This was supported by a prospective epidemiological study demonstrating the use and benefit of BCT among students in the Belgium [24]. A small number of countries, including Japan, adopted this approach early in the pandemic, leading to more timely recognition and termination of transmission chains [25,26].
In June 2020 the United Kingdom Scientific Advisory Group for Emergencies (SAGE) recommended that a bidirectional approach to contact tracing be developed and implemented [27]. The Public Health England (now United Kingdom Health Security Agency; UKHSA) Enhanced Contact Tracing Programme (ECT) was integrated with the conventional forwards national tracing programme of NHS Test and Trace (NHS T&T) in October 2020 and continued until the cessation of contact tracing of all cases in February 2022. The ECT programme deployed a cluster surveillance system based on case exposures during the pre-symptomatic period (3 to 7 days prior to symptom onset). Exposure data were collected on all cases during routine contact tracing and on a daily basis were algorithmically matched to data from other cases to define “exposure clusters”. These clusters were risk assessed by local public health teams to identify events and/or locations potentially associated with transmission [28]. We describe here the ECT programme in England, the epidemiology of clusters of case exposures of COVID-19 and provide evidence for the validity and operational utility of this approach for disease control through the early identification of transmission events and outbreaks.
Methods
COVID-19 contact tracing in England
The NHS T&T national contact tracing programme was launched in England in May 2020. The system received reports on all confirmed cases of SARS-CoV-2 identified through laboratory testing in England. Cases were initially invited by text message or email to self-complete a contact tracing questionnaire; those that couldn’t be contacted or who did not respond within a defined period were contacted by telephone [28,29]. Case information collected included demographic and clinical data, locations visited outside the home, and close contacts during the infectious period (defined as two days before symptoms or confirmatory laboratory test, to date of self-report of contact).
Enhanced contact tracing in england
Additional questions were added to the case questionnaire in October 2020 to collect information on events and activities outside the home during the period where infection was most likely to have occurred. This was defined as 3–7 days before symptom onset or date of positive test (Fig 1) [30]. Data collected included: event description, event category, attendance date, postcode, and proximity risk indicators (crowded, close contact, closed space). Event categories were defined at three levels: the first indicated the type of event (workplace/education, household or accommodation, or events/activities) while the second and third level categories provided an increasing level of detail regarding the type of activity and its location (S1 Table).
Periods of data collection for enhanced contact tracing.The backwards period of contact tracing was that likely to reflect probable exposure for the case (3–7 day period before symptom onset or date of the positive test). Data was collected on events and activities at workplace, education, household and other settings (such as hospitality and leisure).
Exposure clusters, 2-days window and same day event groupings
Exposure clusters were defined as instances where ≥2 cases reported attending an event with the same postcode and setting category (the location) and with attendance dates within a seven-day rolling period (for example, three exposure events would be linked together if cases attended the same location on the first, fourth and tenth day of the month). A 2-day event window was defined where matching events occurred within a two-day rolling window. Same day events were defined as where the matching events occurred on the same day.
Ethical statement
Ethical approval was not required as the work was part of the public health response to COVID-19. Consent was not required as all data were originally collected for contact tracing and health protection purposes and fall under Regulation 3 of the UK Health Service (Control of Patient Information) Regulations 2002.
Operational use of exposure cluster reports
Daily lists of exposure clusters were automatically processed into a PowerBI dashboard where they were accessed by local and regional public health teams and national incident managers, with regional-level access controlled using Microsoft security groups. Public health teams used exposure cluster information to identify, and risk assess clusters as possible outbreaks. Weekly surveillance reports on incident exposure clusters at local authority level were also made available to public health teams.
Data analysis
Exposure clusters
Exposure clusters were identified from event data collected from confirmed cases of COVID-19 referred to NHS T&T for contact tracing between 23 October 2020 and 1 August 2021. Cases were included if they had completed a case questionnaire (via digital self-report or call handler), had a residential address in England, and reported at least one event outside their home. Events were linked to an exposure cluster if they matched deterministically on event postcode and setting category and had attendance dates within 7 days of another matched event. Matches were not permitted between events reported by the same case, e.g., if a case attended a workplace across multiple days. The notification date for exposure clusters was the date of entry of the second case into the contact tracing system. Exposure cluster reports were derived through daily linkage of all events reported by cases in the backwards period with geographical information and an attendance date within the past 30 days. Common exposures with a postcode outside England were removed.
Descriptive epidemiology
National case numbers by specimen date and vaccination data were obtained from the Public Health England (PHE) Coronavirus dashboard [31]. Descriptive analysis included trend analysis of events per case and frequency of exposure clusters by setting, number of cases, distribution of cases over time, background incidence, median age, duration, cumulative 2^nd^ dose vaccination coverage and sex ratio of the exposure cluster. Background incidence (cases per 100,000 population) and 2^nd^ dose vaccine coverage were assigned to exposure clusters based on the earliest attendance date and upper tier local authority (a local government structure responsible for a range of services to the population of a defined area) of the setting. Descriptive statistics (mean, median, interquartile range) were calculated according to the type of data. Events and exposure clusters were grouped into time periods based on the national restrictions in place in England [32,33].
Validation of exposure clusters using genomics data
Contact tracing records were linked to their corresponding laboratory records and whole genome sequencing data as previously described [34]. Exposure clusters were included in validation analysis if ≥2 cases were successfully linked to genomics data. An exposure cluster was considered genetically valid if it included ≥2 cases from different households where sequences were zero single-nucleotide polymorphisms apart. Household sharing was determined using unique property reference number (UPRN) obtained from address matching using the Ordnance Survey Address Base [35].
Exposure clusters and reported outbreaks and incidents
Data on COVID-19 incidents and outbreaks notified to and/or managed by regional health protection teams (HPT) were obtained from HPZone, the national health protection case and incident management system. Exposure clusters were linked to reported incidents/outbreaks by postcode and a further fuzzy match on the free text description of the exposure cluster setting provided by cases during contact tracing and included on HPZone. A successful match was made where ≥70% of words (a pragmatic cut-off, irrespective of length) matched between the exposure cluster and HPZone free text description (i.e., 70% of the words within each description were also found in the other description). Valid links were defined as those where the exposure cluster report date was up to seven days before or after the date the situation was entered onto HPZone.
Multivariable analysis
Single variable and multivariable analyses were used to identify factors associated with genetically valid exposure clusters. Odds ratios (OR) and corresponding 95% confidence intervals (CI) were calculated. A forward approach was used to build a model with the contribution of variables assessed through reduction of Akaike information criterion (AIC) and significance of likelihood ratio test (p<0.05). Variable coefficients and p-values were assessed in the single variable analysis and sequentially added to the multivariable model in order of decreasing significance.
Characteristics of exposure clusters considered for inclusion were the total number of cases, setting, median age of cases, number of same day events, duration, and standard deviation of the sex ratio. A priori confounders (background COVID-19 incidence, cumulative 2^nd^ dose vaccination coverage, urban-rural classification, [36] Index of Multiple Deprivation [37] of exposure postcode) significant through single variable analysis (p<0.05) were considered for inclusion in the model containing exposure cluster characteristics and effects assessed for inclusion as above.
The model was assessed for influential variables (Cook’s distance via residuals versus leverage plot using a cut off 0.5), multi-collinearity (variance inflation factor >5), and the assumption of linearity for continuous variables (Local Polynomial Regression Fitting via graphical assessment of “Loess” line monotonicity). Model prediction of genetic validity was assessed by calculating predicted probabilities and using a receiver operator curve (ROC) statistic. All analysis was undertaken using R version 4.2.1 [38].
Role of the funding source
This work was conducted as part of the public health response to the COVID-19 pandemic in England.
Results
Contact tracing data and reported events
There were 4,628,798 confirmed cases referred for contact tracing during the study period, of which 89% (4,119,630) completed the contact tracing questionnaire. Of those, 57% (2,318,450) declared at least one event outside the home during the backwards period; these cases reported a total of 7,368,666 events (mean 1·6 events per case). Work or education events were most frequently reported with 4,474,540 events declared by 1,494,773 cases (average of 1·7 events per case; Table 1). The median interval between the earliest backwards event and symptom onset was 5 days, with a duration between onset and testing of 6 days and time to referral for contact tracing of 8 days (Table 1).
Table 1: Events declared by cases during contact tracing by direction relative to symptom onset and by event type.
Epidemiology of exposure clusters in England
The distribution and magnitude of exposure clusters varied in relation to changes in case incidence and the implementation of national non-pharmaceutical interventions (Fig 2). Overall, more than a quarter of a million exposure clusters (269,470) were identified during the study period; a median of 4,142 (IQR: 1,402–10,598) clusters each week in England. At the peak, 22,879 exposure clusters were identified in a single week (July 12–18 2021). Clusters were most frequently identified in education (19·8%), shopping (19·4%) and workplace (14·3%) settings (Fig 2).
COVID-19 incidence and events and exposure clusters reported to contact tracing.(A) incidence of new confirmed cases; (B) number of events by period of attendance; (C) exposure clusters by event type. Backwards events reflect those reported by cases during the likely exposure period (3–7 days before symptom onset or date of positive test), forwards events those reported after the case was likely infectious (from 2 days before symptom onset or date of positive test to the time of contact tracing). Data is shown relative to national restrictions in England from 23 October 2020 to 1 August 2021. National non pharmaceutical interventions: 1: second national lockdown; 2: third national lockdown; 3–6: roadmap out of restrictions.
At the start of the study period there was a rapid increase in exposure clusters with a peak at the beginning of November 2020, closely following the peak of the concurrent wave of COVID-19 in England. Most of these exposure clusters were in education, hospitality, and entertainment settings. The lockdown period that followed in November 2020 was associated with a sharp but slightly delayed decrease in exposure clusters, with clusters in education settings retaining a high frequency. Lifting of the lockdown in December 2020 led to a substantial increase in exposure clusters: exposure cluster incidence was high in education settings, but also increased substantially in shopping, workplace and hospitality and entertainment settings. The increase was sustained throughout December 2020, with the exception of educational settings where exposure cluster numbers decreased following school and university closures.
The start of the next national lockdown in January 2021 was associated with a decrease in exposure cluster incidence, with declines in hospitality and entertainment settings, but not in workplace and other settings. From February 2021 to the end of May 2021, case numbers fell markedly, and exposure clusters remained infrequent. In June 2021 during the final lifting of national restrictions, case numbers rose sharply and a concomitant increase in exposure clusters was observed in all settings, with high numbers identified in hospitality, entertainment, education, and workplace settings (Fig 2).
Factors associated with genetically valid exposure clusters
There were 13,058 (5·2%) exposure clusters eligible for inclusion in the analysis of genetic validity. Of these, 25% (3,306) were defined as genetically valid (Table 2). The proportion of genetically valid clusters varied over the study period: from 14% in November 2020 and July 2021 to 36% in April 2021. The proportion of genetically valid exposure clusters was highest in clusters of ≥10 cases (37%, 260/712), in education (35%, 1246/3528) or workplace (42%, 577/1371) settings and those containing greater than five instances where ≥2 cases reported attending on the same day (same day attendance) (43%, 470/1088) (Table 3). IMD and rural/urban classification were both not significantly associated with genetic validity in single variables analysis and were not included in multivariable modelling.
Table 2: Exposure cluster genetic validity and matching to managed incidents.
Table 3: Crude and adjusted associations between characteristics of exposure clusters and genetic validity.
The final model included 12,786 observations (267 exposure clusters excluded due to: postcodes outside of England, missing values for ≥1 variable or found to be highly influential on model fit) and had an area ROC of 0·71 (95% CI 0·70–0·72). Five influential observations were removed from the model which resulted in large percentage change in the association between two settings (personal care and custodial institutions) and genetic validity. No collinearity was observed between variables included in the final model. All continuous variables showed no substantial evidence of non-linearity through visual assessment.
Exposure clusters that included more cases, were shorter in duration, and contained a greater number of same day events, were more likely to indicate genetically valid transmission events (Table 3). There was a dose-response relationship between the number of events in an exposure cluster and likelihood of genetically linked cases. Clusters of longer duration were significantly less likely to represent genetically valid signals for outbreaks. An increased number of same day events within the cluster was associated with genetically linked cases, with odds increasing significantly (using the absence of same day events as the reference group) with the number of same day events included: two same day events (aOR 1·58 [95% CI 1·37–1·82]) and >5 same day events (aOR 3·57 [95% CI 2·89–4·41]) (Table 3).
Clusters in all settings other than those in custodial institutions were found to be independently associated with increased odds of genetic validity (using shopping as the reference group). Strong associations were observed for workplace settings (aOR 5·10 [95% CI 4·23–6·17]), education settings (aOR 3·72 [95% CI 3·08–4·49]), healthcare settings (aOR 3·09 [95% CI 2·27–4·19]), and hospitality settings (aOR 2·89 [95% CI 2·41–3·47]).
Genetically valid exposure clusters and reported incidents/outbreaks
Over 5% (13,494/248,864) of all exposure clusters identified during the study period were linked to incidents recorded on the national incident management system (HPZone). Of the exposure clusters eligible for inclusion in the genetic validity analysis (n = 13,008), 47% (622/1318) of HPZone matched exposure clusters were genetically valid compared to 23% (2684/11690) of those that were not matched. Genetically valid exposure clusters linked to situations on HPZone were identified a median of one day (IQR 0–4, range -7 to 7) earlier through ECT than the corresponding entry on HPZone (Table 2).
Discussion
In this study we have described the epidemiology of COVID-19 case exposure clusters identified by the ECT programme in England and provided evidence for their validity and utility for the rapid identification of outbreaks. Approximately 25% of exposure clusters detected through the ECT programme included ≥2 genetically indistinguishable SARS-CoV-2 infections. This proportion increased to >30% during low incidence periods, where the impact of early action by local public health teams to break transmission chains would be highest. We have also identified cluster characteristics independently associated with increased likelihood of genetic validity; these include clusters of larger size, including same day events, and those in particular settings (including healthcare and workplaces).
The ECT cluster surveillance system frequently detected outbreaks before they were recorded as managed incidents by local health protection services; approximately one half of exposure clusters linked to subsequently confirmed outbreaks were detected before registration on the national incident management system. These events frequently occurred outside of formal institutional settings and could represent important foci of community transmission. Exposure cluster settings included hospitality and mass gatherings, where contacts were likely to be unknown to each other and would not be rapidly identified, if at all, through conventional contact tracing. Community settings contribute significantly to onward spread of COVID-19 [14,17,18,39–42] and cluster identification provided corroborative and real-time information to support local risk assessment and management of outbreaks.
To our knowledge, the ECT programme in England was the only national programme using contact tracing information for systematic surveillance of COVID-19 clusters based on the exposures of cases during their pre-symptomatic period. A key consideration for any cluster surveillance system is achieving the optimal balance between sensitivity and specificity. The ECT exposure cluster algorithm was initially designed to prioritise sensitivity over specificity and used a broad time period and postcodes for linking case events. We have shown that clusters defined through shorter time period linkages (e.g., 2-day event window or same day events) are more likely to represent actual transmission events and can be used to improve specificity. Furthermore, the use of unique property reference numbers was introduced towards the end of the programme to increase the precision of geolocation. Improving specificity whilst maintaining sensitivity would be a key development for future cluster surveillance.
The strengths of this study stems from the secondary analysis of systematically collected national contact tracing data. Exposure data was collected from more than 85% of confirmed COVID-19 cases in England over the study period, providing comprehensive and representative coverage with considerable statistical power. Linkage to available genomics data provided a means to validate exposure clusters using a highly specific indicator of probable transmission. Although genomic sequencing coverage limited the proportion of cases which could be included in assessment of genomic validity, cases were largely selected randomly for sequencing (by geographically weighted sampling of community cases), with some oversampling of some high-risk groups (such as healthcare workers and international travellers). Strengthening the coverage and timeliness of genomic surveillance is critical for more effective cluster detection of this kind.
Limitations include the use of primary source data collected for operational purposes, and likely subject to a degree of heterogeneity and incompleteness in data collection. A significant proportion of exposure events may not have been recorded because cases were either unaware or deliberately chose not to report them, although the direction and potential size of any resulting bias is unclear. Additionally, the genetic validity investigations were based on a small proportion of all exposure clusters, this may have introduced representativity bias, the nature and direction of which cannot be determined.
The use of a highly specific definition for genetic validity means we have likely underestimated the true number of valid clusters. Minor variant genomes can emerge to dominance within an individual [43] with the potential for genetic compartmentalisation between the respiratory tract and gastrointestinal tract [44]. In addition, treatments for COVID-19 that interfere with viral replication can induce mutational signatures associated with greater sequence divergence between transmission pairs [45]. Such signals may be greater in certain population groups (e.g., older adults more likely to receive treatment).
Given that transmission of a genetically identical sequence is more likely to occur earlier during an infection [43], settings more likely to be associated with close to continuous exposure (such as households) are more likely to have been detected using our conservative methodology. The observation that longer clusters were less likely to be genetically valid may also be in part due to the accumulation of substitutions during longer transmission chains. Future work needs to assess the impact of these elements and evaluate the use of more relaxed genetic matches on cluster assessment and outbreak detection.
The ECT programme identified and communicated exposure clusters to local public health teams daily. Based on expert opinion and guidance, exposure clusters were risk assessed for the need for public health action. Without the availability of genetic validation during the response, exposure clusters lacked specificity. In future we recommend that a predictive modelling approach, which uses genomic validation, is used to help triage and prioritise clusters for risk assessment. The use of predictive modelling and genomic validation could enable real-time model calibration based on changes in background epidemiology of the virus. However, such an approach may be limited by the turnaround time for sequencing of isolates. Further work could use network analysis methods to combine exposure cluster data with other available transmission indicators to build a transmission network of extant links. These networks could be used to infer the setting/source of infection for all cases as the pandemic progresses, providing vital information on which settings are associated with transmission and to target interventions.
For this study we employed simplistic text matching methods for detection of exposure clusters and linkage to situations under public health management. This has exposed requirements for machine learning methods to improve text matching and exposure cluster detection from contact tracing data. Additionally, these is a further need to develop unsupervised machine learning models to provide timely predictions that exposure clusters are outbreaks. These three health protection requirements for future work are detailed in Table 4.
Table 4: Technological development needs for future studies.
Through analysis of routine contact tracing data collected in England, we have shown that systematic case exposure cluster surveillance is a feasible and valid tool for outbreak detection and situational awareness that can complement traditional methods. Although an evaluation of the effectiveness of such programmes to reduce transmission are required, exposure cluster surveillance should be considered for pandemics or epidemics where contact tracing is integral to the response. The methodology may be applicable across a range of infectious diseases, particularly those characterised by overdispersion of transmission and where transmission occurs across a variety of different settings.
Supporting information
S1 TableEvent categories used to classify forward and backward events reported by confirmed COVID-19 cases in the national contact tracing system in England.(DOCX)
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1World Health Organization. Contact tracing and quarantine in the context of COVID-19: interim guidance, 6 July 2022. Available from: https://www.who.int/publications/i/item/WHO-2019-n Co V-Contact_tracing_and_quarantine-2022.1.37184162 · pubmed ↗
- 2Eames KTD, Keeling MJ. Contact tracing and disease control. Proc R Soc B Biol Sci. 2003;270: 2565–2571. doi: 10.1098/rspb.2003.2554 14728778 PMC 1691540 · doi ↗ · pubmed ↗
- 3ECDC. Contact tracing: public health management of persons, including healthcare workers, who have had contact with COVID-19 cases in the European Union–third update. Available from: https://www.ecdc.europa.eu/sites/default/files/documents/covid-19-contact-tracing-public-health-management-third-update.pdf.
- 4Hale T, Angrist N, Goldszmidt R, Kira B, Petherick A, Phillips T, et al. A global panel database of pandemic policies (Oxford COVID-19 Government Response Tracker). Nat Hum Behav. 2021;5: 529–538. doi: 10.1038/s 41562-021-01079-8 33686204 · doi ↗ · pubmed ↗
- 5Ko YK, Furuse Y, Ninomiya K, Otani K, Akaba H, Miyahara R, et al. Secondary transmission of SARS-Co V-2 during the first two waves in Japan: Demographic characteristics and overdispersion. Int J Infect Dis. 2022;116:365–373. doi: 10.1016/j.ijid.2022.01.036 35066162 PMC 8772065 · doi ↗ · pubmed ↗
- 6Endo A, Abbott S, Kucharski AJ, Funk S. Estimating the overdispersion in COVID-19 transmission using outbreak sizes outside China. Wellcome Open Res. 2020;5:67. doi: 10.12688/wellcomeopenres.15842.3 32685698 PMC 7338915 · doi ↗ · pubmed ↗
- 7Greenhalgh T, Jimenez JL, Prather KA, Tufekci Z, Fisman D, Schooley R. Ten scientific reasons in support of airborne transmission of SARS-Co V-2. Lancet 2021;397:1603–1605. doi: 10.1016/S 0140-6736(21)00869-2 33865497 PMC 8049599 · doi ↗ · pubmed ↗
- 8Tang JW, Marr LC, Li Y, Dancer SJ. Covid-19 has redefined airborne transmission. BMJ. 2021; 373:1–2. doi: 10.1136/bmj.n 913 33853842 · doi ↗ · pubmed ↗
