2019-11-21

Introduction

Early outbreak context

  • within a few days / weeks of index case
  • limited data available
  • no or limited intervention
  • no depletion of susceptibles
  • urgent assessment needed to inform response

Data usually available

  • dates of symptom onset

  • contact data: exposure (who infected you?) and contact tracing (who could you have infected?)

  • dates of exposure / infection

  • dates of outcome: death / recovery

  • metadata on patients: age, gender, location, occupation, etc.

  • data from past outbreaks

Key questions

Disease-dependent, but generally include:

  • How fast is it growing?
  • What is driving the epidemic growth?
  • What is the case fatality ratio?
  • Who is most severely affected?
  • How many cases should we expect in the next days / weeks?

Refresher on statistics

Some basic definitions

  • population: set of all possible observations of a given process/entity
    example: all possible cases of cholera in location xxx
  • sample: subset of the population
    example: all cases of cholera in xxx reported last week
  • a statistic: quantity used to describe sample / population
    example: % of fatalities in cholera cases in xxx last week
  • inference: statement about population(s) from sample(s)
    example: % of fatalities in cholera cases in xxx is greater than in yyy

Population, sample, uncertainty

Uncertainty vs variability

Good statistical practices


  • when presenting estimates, show the associated uncertainty
  • always show the data when possible (not just a model)
  • account for the variability in the data

Bad practice example 1

Source: Ebola response epicell weekly presentation, Goma (DRC), 19 June 2019

Bad practice example 2

Source: Ebola response epicell weekly presentation, Goma (DRC), 29 May 2019

Bad practice example 3

Source: Ebola response epicell weekly presentation, Goma (DRC), 29 May 2019

Estimating key delays

The incubation period

Definition: time interval between the date of infection and the date of symptom onset.

The serial interval

Definition: time interval between onset of symptoms in primary and secondary cases.

The generation time

Definition: time interval between the dates of infection in primary and secondary cases.

Estimating the underlying distribution

From empirical distribution (data) to estimated distribution.

  • choose type of distribution (e.g. normal, Poisson, Gamma)

  • find the parameter values \(\theta\) which maximise \(p(x | \theta)\), i.e. the likelihood of the data \(x\)

  • visually: best fit between bars (data) and curve (distribution)
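
As a minimal sketch of the maximum-likelihood step above, the snippet below fits a Poisson distribution to a handful of hypothetical delays (in days) by scanning candidate values of its single parameter. The data, the function names and the grid of candidate rates are illustrative assumptions, not taken from any outbreak.

```python
import math

def poisson_log_likelihood(delays, rate):
    """Log-likelihood of integer delays under a Poisson(rate) model."""
    return sum(
        k * math.log(rate) - rate - math.lgamma(k + 1)
        for k in delays
    )

def fit_by_grid_search(delays, candidate_rates):
    """Return the candidate rate with the highest log-likelihood."""
    return max(
        candidate_rates,
        key=lambda rate: poisson_log_likelihood(delays, rate),
    )

# Hypothetical delays (days), for illustration only
delays = [2, 3, 3, 4, 5, 5, 6, 8]
candidate_rates = [i / 100 for i in range(1, 2001)]  # 0.01 to 20.00
best_rate = fit_by_grid_search(delays, candidate_rates)
```

For a Poisson model the maximum coincides with the sample mean (4.5 days for these values), which gives a quick check on the search. Distributions with more parameters would normally call for a numerical optimiser rather than a grid.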

Reminder: what is a likelihood?

A relative measure of fit between data and model

Discretising continuous distributions

Using continuous distributions to model discrete variables:

Fitting discretised Gamma distributions for delays

  • flexible distribution (many shapes possible)

  • 2 parameters: shape, scale

  • alternatively: mean, coefficient of variation

  • typical choice for delay distributions

  • needs to be discretised
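
The two parametrisations and the discretisation step listed above can be sketched as follows. This is a rough illustration using only the Python standard library; the function names are hypothetical, and the convention \(P(\text{delay} = k) = F(k + 1) - F(k)\) is one common choice assumed here, not necessarily the one used for the original slides.

```python
import math

def gamma_shape_scale(mean, cv):
    """Convert (mean, coefficient of variation) to (shape, scale)."""
    shape = 1.0 / cv ** 2
    scale = mean * cv ** 2
    return shape, scale

def gamma_cdf(x, shape, scale):
    """Gamma CDF, using the series expansion of the regularised
    lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    n = 1
    # Add terms until they no longer change the sum
    while term > total * 1e-16 and n < 100000:
        term *= z / (shape + n)
        total += term
        n += 1
    return total * math.exp(-z + shape * math.log(z) - math.lgamma(shape))

def discretised_gamma(mean, cv, max_delay):
    """P(delay = k) for k = 0..max_delay, assuming mass F(k + 1) - F(k)."""
    shape, scale = gamma_shape_scale(mean, cv)
    return [
        gamma_cdf(k + 1, shape, scale) - gamma_cdf(k, shape, scale)
        for k in range(max_delay + 1)
    ]
```

For example, `discretised_gamma(5.0, 0.5, 30)` gives daily probabilities for a delay with mean 5 days and coefficient of variation 0.5 (shape 4, scale 1.25).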

Estimating mortality

Case fatality ratio

Definition: the proportion of cases who die of the infection.


Associated uncertainty

  • CFR = mean number of deaths per case \(\rightarrow\) its sampling distribution is approximately Normal
  • standard error, with \(D\) deaths and \(R\) recoveries: \(s_{CFR} = \sqrt{\frac{CFR (1 - CFR)}{D + R}}\)
  • confidence interval with \(\alpha\) threshold: \[ CI_{(1-\alpha)\%} = CFR \pm s_{CFR} \times Q_{1-\alpha/2} \]

where \(Q\) is a Normal quantile (e.g. \(1.96\) for \(\alpha=0.05\))

Example: confidence intervals of CFR

  • CFR: 60%, N = 10: \[ s_{CFR} = \sqrt{\frac{0.6 \times 0.4}{10}} = 0.155 \]

\[ CI_{95\%} = 0.6 \pm 1.96 \times 0.155 = [0.30 ; 0.90] \]

  • CFR: 60%, N = 100: \[ s_{CFR} = \sqrt{\frac{0.6 \times 0.4}{100}} = 0.05 \]

\[ CI_{95\%} = 0.6 \pm 1.96 \times 0.05 = [0.50 ; 0.70] \]
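
The two worked examples above can be reproduced with a short function. This is a sketch under the same Normal approximation; the function name is hypothetical, and clipping the bounds to [0, 1] is an added convenience not stated in the formulas.

```python
import math
from statistics import NormalDist

def cfr_confidence_interval(deaths, recoveries, alpha=0.05):
    """CFR, its standard error and a Normal-approximation confidence interval."""
    n = deaths + recoveries              # cases with known outcome only
    cfr = deaths / n
    se = math.sqrt(cfr * (1 - cfr) / n)
    q = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    lower = max(0.0, cfr - q * se)       # clipping is an added assumption
    upper = min(1.0, cfr + q * se)
    return cfr, se, (lower, upper)

small = cfr_confidence_interval(6, 4)    # CFR 60%, N = 10
large = cfr_confidence_interval(60, 40)  # CFR 60%, N = 100
```

With 10 known outcomes the interval spans roughly 0.30 to 0.90; with 100 it narrows to roughly 0.50 to 0.70. The Normal approximation is crude for samples as small as the first one.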

Common caveats

  • "case fatality rate": this is a proportion, not a rate

  • computation using the wrong denominator, i.e. including cases with unknown outcome (\(U\)):

\[ \frac{D}{D + R + U} \]

(leads to underestimating the CFR)

  • not accounting for uncertainty, e.g. comparing CFR across groups without statistical tests
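
The effect of the wrong denominator can be illustrated with hypothetical counts (6 deaths, 4 recoveries, 10 cases still without a known outcome). The function names and numbers are assumptions made for this sketch.

```python
def cfr_known_outcomes(deaths, recoveries):
    """CFR among cases whose outcome is known: D / (D + R)."""
    return deaths / (deaths + recoveries)

def cfr_all_cases(deaths, recoveries, unknown):
    """Naive CFR counting unresolved cases in the denominator: D / (D + R + U)."""
    return deaths / (deaths + recoveries + unknown)

# Hypothetical counts, for illustration only
deaths, recoveries, unknown = 6, 4, 10
correct = cfr_known_outcomes(deaths, recoveries)      # 0.6
naive = cfr_all_cases(deaths, recoveries, unknown)    # 0.3
```

Here the naive figure is half the one based on known outcomes, because every unresolved case is implicitly counted as a survivor.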

Analysing incidence data

What is incidence?

Definition: the incidence is the number of new cases in a given time period.

  • relies on dates, typically of onset of symptoms

  • only daily incidence is unambiguous

  • other definitions (e.g. weekly) rely on a starting date

  • prone to reporting delays
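
A small sketch of how incidence might be computed from dates of onset, and of why weekly incidence depends on the starting date. The onset dates and function names are hypothetical; `weekly_incidence` assumes that `start` is not later than the earliest onset.

```python
from collections import Counter
from datetime import date, timedelta

def daily_incidence(onset_dates):
    """Counts per day from the first to the last onset, zeros included."""
    counts = Counter(onset_dates)
    first, last = min(onset_dates), max(onset_dates)
    n_days = (last - first).days + 1
    return [counts.get(first + timedelta(days=i), 0) for i in range(n_days)]

def weekly_incidence(onset_dates, start):
    """Counts per 7-day bin, with the first bin starting on `start`."""
    counts = Counter((d - start).days // 7 for d in onset_dates)
    n_weeks = max(counts) + 1
    return [counts.get(week, 0) for week in range(n_weeks)]

# Hypothetical dates of symptom onset, for illustration only
onsets = [date(2019, 6, 3), date(2019, 6, 3), date(2019, 6, 5),
          date(2019, 6, 9), date(2019, 6, 10)]

daily = daily_incidence(onsets)
weekly_from_3rd = weekly_incidence(onsets, date(2019, 6, 3))
weekly_from_1st = weekly_incidence(onsets, date(2019, 6, 1))
```

The same five cases give weekly counts of 4 and 1 when weeks start on 3 June, but 3 and 2 when they start on 1 June, whereas the daily series is the same in both cases.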

Log-linear model of incidence

\(log(y) = r \times t + b + \epsilon\:\:\) so that \(\:\:\hat{y} = e^{r \times t + b}\)

with:

  • \(r\): growth rate
  • \(b\): intercept
  • \(\epsilon \sim \mathcal{N}(0, \hat{\sigma_{\epsilon}})\)
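
A minimal sketch of fitting this model by ordinary least squares on the log scale, using only the standard library. The function names are hypothetical, and dropping days with zero incidence (whose logarithm is undefined) is a simplifying assumption of the sketch.

```python
import math

def fit_log_linear(days, incidence):
    """Least-squares fit of log(y) = r * t + b; returns (r, b)."""
    # Days with zero cases are dropped: log(0) is undefined
    pairs = [(t, math.log(y)) for t, y in zip(days, incidence) if y > 0]
    n = len(pairs)
    mean_t = sum(t for t, _ in pairs) / n
    mean_log_y = sum(v for _, v in pairs) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in pairs)
    sxy = sum((t - mean_t) * (v - mean_log_y) for t, v in pairs)
    r = sxy / sxx
    b = mean_log_y - r * mean_t
    return r, b

def predict_incidence(r, b, day):
    """Expected incidence on a given day under the fitted model."""
    return math.exp(r * day + b)
```

For instance, `fit_log_linear(range(10), counts)` returns the estimated daily growth rate and intercept for ten days of counts, and `predict_incidence` extrapolates the fitted curve.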

Doubling time

Let \(T\) be the time taken by the incidence to double, given a daily growth rate \(r\).

\[ y_2 / y_1 = 2 \:\: \Leftrightarrow e^{rt_2 + b} / e^{rt_1 + b} = 2 \]

\[ \Leftrightarrow e^{r(t_2 - t_1)} = 2 \Leftrightarrow T = t_2 - t_1 = log(2) / r \]
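
The result above translates directly into code. The halving time for a negative growth rate follows from the same argument with a ratio of 1/2; the function names are hypothetical.

```python
import math

def doubling_time(r):
    """Time for incidence to double, for a growth rate r > 0."""
    if r <= 0:
        raise ValueError("doubling time requires a positive growth rate")
    return math.log(2) / r

def halving_time(r):
    """Time for incidence to halve, for a growth rate r < 0."""
    if r >= 0:
        raise ValueError("halving time requires a negative growth rate")
    return math.log(2) / -r
```

A daily growth rate of 0.1 corresponds to a doubling time of about 6.9 days.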

Log-linear model: pros and cons

Pros:

  • fast and simple
  • predictions possible
  • doubling / halving time readily available
  • possible extensions to estimate \(R_0\) from \(r\)

Cons:

  • zero incidence problematic
  • non-mechanistic
  • no inclusion of other information (e.g. serial interval)

Individual infectiousness over time

Serial interval: time interval between onset of symptoms of primary and secondary cases.

Global infectiousness over time

\[ \lambda_t = R_0 \times \sum_i w(t - t_i) \]

with: \(\lambda_t\): global force of infection on day \(t\); \(w()\): serial interval distribution; \(t_i\): date of symptom onset of case \(i\)
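
The sum above can be sketched as a short function. The serial interval is assumed here to be a discretised distribution stored as a list, whose first element is the probability of a 1-day delay and which is zero outside the list; the function name and the example values are hypothetical.

```python
def force_of_infection(t, onset_days, serial_interval_pmf, r0):
    """lambda_t = R0 * sum_i w(t - t_i), with w(d) = 0 outside 1..len(pmf)."""
    total = 0.0
    for t_i in onset_days:
        delay = t - t_i
        if 1 <= delay <= len(serial_interval_pmf):
            total += serial_interval_pmf[delay - 1]
    return r0 * total

# Hypothetical serial interval: P(1 day) = 0.2, P(2 days) = 0.5, P(3 days) = 0.3
w = [0.2, 0.5, 0.3]
onset_days = [0, 0, 1]   # hypothetical onset days of three cases
lam = force_of_infection(2, onset_days, w, r0=2.0)
```

On day 2 the two cases with onset on day 0 each contribute \(w(2) = 0.5\) and the case with onset on day 1 contributes \(w(1) = 0.2\), so \(\lambda_2 = 2 \times 1.2 = 2.4\).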

A Poisson model of incidence


Treat incidence \(y_t\) on day \(t\) as Poisson-distributed with rate \(\lambda_t\):

\[ p(y_t | R_0, y_1, ..., y_{t-1}) = e^{-\lambda_t} \frac{\lambda_t^{y_t}}{y_t!} \] with (slight rewriting): \(\lambda_t = R_0 \times \sum_{s = 1}^{t-1} y_s w(t - s)\)
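
A sketch of the corresponding log-likelihood and of the value of \(R_0\) that maximises it. Because \(\lambda_t\) is proportional to \(R_0\), setting the derivative of the log-likelihood to zero gives the closed form \(\hat{R_0} = \sum_t y_t / \sum_t \sum_s y_s w(t - s)\). Days are indexed from 0 in the code, the serial interval is a list starting at a 1-day delay, and days on which no past case is infectious are skipped; all names and data are hypothetical.

```python
import math

def infectiousness(incidence, w):
    """sum over s < t of y_s * w(t - s) for each day t, i.e. lambda_t / R0."""
    return [
        sum(incidence[s] * w[t - s - 1]
            for s in range(t) if t - s <= len(w))
        for t in range(len(incidence))
    ]

def log_likelihood(r0, incidence, w):
    """Poisson log-likelihood of the incidence series for a given R0."""
    total = 0.0
    for y, infect in zip(incidence, infectiousness(incidence, w)):
        if infect == 0:
            continue  # no infectious past case (e.g. index case): day skipped
        lam = r0 * infect
        total += y * math.log(lam) - lam - math.lgamma(y + 1)
    return total

def r0_mle(incidence, w):
    """Closed-form maximum-likelihood estimate of R0."""
    infect = infectiousness(incidence, w)
    cases = sum(y for y, i in zip(incidence, infect) if i > 0)
    return cases / sum(infect)

# Hypothetical daily incidence and serial interval, for illustration only
incidence = [1, 0, 2, 1, 3, 4]
w = [0.2, 0.5, 0.3]
r0_hat = r0_mle(incidence, w)
```

For these values the estimate is 10 / 4.3, about 2.33. In practice a full likelihood profile or posterior distribution for \(R_0\) would be reported rather than the point estimate alone.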

Short-term forecasting


  • draw \(R_0\) from estimated distribution
  • simulate \(y_{t+1} \sim \mathcal{P}(\lambda_{t+1} | R_0, y_1, ..., y_t)\) for increasing \(t\)
  • repeat many times with different values of \(R_0\)
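
The three steps above can be sketched as a forward simulation. The values of \(R_0\) are passed in as a list of samples, standing in for draws from whatever distribution was estimated; the Poisson sampler uses Knuth's multiplication method, which is adequate for the small rates typical of early outbreaks. All names and inputs are hypothetical.

```python
import math
import random

def draw_poisson(rate, rng):
    """Draw one Poisson variate (Knuth's method; for small rates)."""
    threshold = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_forecast(incidence, w, r0, n_days, rng):
    """Simulate n_days of future incidence for one value of R0."""
    history = list(incidence)
    for _ in range(n_days):
        t = len(history)
        infect = sum(history[s] * w[t - s - 1]
                     for s in range(t) if t - s <= len(w))
        history.append(draw_poisson(r0 * infect, rng))
    return history[len(incidence):]

def forecast(incidence, w, r0_samples, n_days, seed=1):
    """One simulated trajectory per sampled value of R0."""
    rng = random.Random(seed)
    return [simulate_forecast(incidence, w, r0, n_days, rng)
            for r0 in r0_samples]
```

Calling `forecast(incidence, w, r0_samples, n_days=7)` with many samples of \(R_0\) gives a set of trajectories whose daily quantiles can be summarised as prediction intervals.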

Poisson model: pros and cons

Pros:

  • still fast, reasonably simple (1 parameter)
  • accommodates zero incidence
  • predictions possible: forward simulation (Poisson)
  • integrates information on serial interval
  • extension to time-varying \(R\) (EpiEstim)

Cons:

  • needs information on serial interval
  • no spatial processes
  • assumes constant reporting
  • likelihood / Bayesian methods harder to communicate