Estimating Daily People Tested (part 4)
Estimation of the number of people undergoing virus tests in England prior to de-duplication of data records (rev 1.0)
In my previous post in this series we took a look at the weekly count of unique people tested over time using England Test & Trace data. I did a spot of lame numerical acrobatics to convert this into daily data, but this did the trick in that we saw a mighty mismatch between what England’s Test & Trace mob think was happening on the ground and what the UK GOV coronavirus dashboard mob think was happening on the ground. You’d think keeping tabs on who was tested and when would be easy but this doesn’t appear to be the case.
I decided that the Test & Trace mob were more likely to be on the ball and so opted to consider this series further. What I had in mind is determining the relationship between the number of different tests employed and the number of unique people tested. For those late to the party I’ll recap by shouting “De-duplication!”, this being a technique by which the data authorities limit test records to one test per person per accounting week regardless of how many tests they performed. If one test happens to be positive (whether a false or true positive) then that test gets counted even if they churned out a dozen negative tests (whether false or true negatives). Yes, it stinks: we have lived through a testdemic.
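For anybody who likes to see the rule in code, here is a minimal sketch of de-duplication as described above. It is my illustration rather than the data authorities' actual procedure: the record layout `(person_id, specimen_date, positive)` is hypothetical, and the accounting week is assumed to be the ISO week, which may not match the official reporting week.

```python
from datetime import date

def deduplicate(records):
    """Keep one record per (person, ISO week); a positive result takes precedence."""
    kept = {}
    for person_id, specimen_date, positive in records:
        iso = specimen_date.isocalendar()
        key = (person_id, iso[0], iso[1])  # person, ISO year, ISO week (assumed accounting week)
        # Keep the first record seen, but let a positive replace a stored negative.
        if key not in kept or (positive and not kept[key][2]):
            kept[key] = (person_id, specimen_date, positive)
    return list(kept.values())

# Hypothetical records: person A tests three times in one week, person B once in each of two weeks.
records = [
    ("A", date(2021, 1, 4), False),
    ("A", date(2021, 1, 5), False),
    ("A", date(2021, 1, 6), True),
    ("B", date(2021, 1, 4), False),
    ("B", date(2021, 1, 12), False),
]
unique = deduplicate(records)
```

Five test records collapse to three de-duplicated records here, and person A's single surviving record is the positive one.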
Rolling Out Regression
I’ve mentioned that the UK GOV crew have adopted three different methods of book-keeping for their variable uniquePeopleTestedBySpecimenDateRollingSum, these flipping personality on 28 Dec 2020 and 21 Feb 2022, so I shall investigate relationships within each of these periods separately. Let us first have a quick look at the distribution of daily counts for the dependent variable, this being the England Test & Trace data series for unique people tested.
Here’s what the series looks like in the flesh:
And here’s what this series looks like when sliced into three histograms, each representing the distribution of daily counts of people tested within each de-duplication regime:
Cogitation & Coffee: Going Geek
I’m going to throw caution to the wind and use linear regression for this analysis rather than Autoregressive Integrated Moving Average (ARIMA) time series modelling for the simple reason that I want to derive a predictive model from test data and case counts; that is, I want the number of tests conducted each day and the number of cases detected each day to predict the number of people tested each day, since there should be a robust relationship between these three variables. This is a bit naughty because I’ve gone and flouted the rule of independence[1], so we need to bear this in mind.
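To make the independence issue a little more concrete, here is a rough sketch of the lag-1 autocorrelation, which measures how strongly each day's value is tied to the previous day's. The function name and the toy trending series are mine, invented purely for illustration; they are not taken from the Test & Trace data.

```python
def lag1_autocorrelation(series):
    """Correlation between each value and the one before it (near 0 for independent data)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    num = sum(dev[i] * dev[i - 1] for i in range(1, n))
    return num / denom

# Hypothetical steadily rising daily series: consecutive values are clearly not independent.
trend = [100 + 5 * t for t in range(30)]
r1 = lag1_autocorrelation(trend)
```

A value close to 1, as with this trending toy series, is the sort of dependence that standard linear regression assumes away.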
The dependent variable is counts of people tested on a daily basis and this brings the Poisson distribution to mind, this being the most well-known of a family of discrete probability distributions. However, Siméon Poisson was fascinated by counts arising from relatively rare events, and counts of folk who stuck a swab up the nose each day aren’t exactly rare, having more the feel of a discrete form of the Normal distribution.
After cogitation and coffee I settled on the Tweedie distribution to handle the error structure of the first and third de-duplication regimes, with the ubiquitous Normal distribution being given pride of place to drive the model for the second regime.
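As a sketch of what this choice amounts to, the model can be written as below. The notation is mine and is an assumption: the predictor names tests and cases are inferred from the description above, and the Tweedie power p is left unspecified because I am not assuming any particular value.

```latex
\mathbb{E}[\mathrm{TTpeople}_t] = \mu_t = \beta_0 + \beta_1\,\mathrm{tests}_t + \beta_2\,\mathrm{cases}_t,
\qquad
\operatorname{Var}(\mathrm{TTpeople}_t) = \phi\,\mu_t^{\,p}
```

With an identity link the mean is modelled directly on the original scale. The Tweedie family covers the Normal distribution at p = 0 and the Poisson at p = 1, which is why it is a handy halfway house for counts that are too large to be "rare events" but are not quite Normal either.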
Ingredients For Linear Regression
Dependent variable:
TTpeople - England Test & Trace de-duplicated count of people tested (transformed from weekly to daily values by assuming a rectangular, i.e. uniform, distribution across each week).
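The rectangular assumption simply spreads each weekly total evenly over its seven days. A minimal sketch follows, with a hypothetical function name, week start date and weekly count chosen only for illustration:

```python
from datetime import date, timedelta

def weekly_to_daily(week_start, weekly_count, days=7):
    """Spread a weekly total evenly across its days (rectangular/uniform assumption)."""
    share = weekly_count / days
    return [(week_start + timedelta(days=i), share) for i in range(days)]

# Hypothetical week: 700,000 unique people tested, spread as 100,000 per day.
daily = weekly_to_daily(date(2021, 1, 7), 700_000)
```

Every day within a week receives the same value, so the resulting daily series moves in flat weekly steps rather than showing genuine day-to-day variation.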
Independent variables:
Model Error structure:
Tweedie with identity link, regime 1: 8 Feb 2020 - 27 Dec 2020 (n=333)
Normal (identity link), regime 2: 28 Dec 2020 - 20 Feb 2022 (n=420)
Tweedie with identity link, regime 3: 21 Feb 2022 - 7 Jun 2022 (n=107)
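Splitting the series by regime comes down to comparing each date with the two switch dates. Here is a small sketch; the function and constant names are hypothetical, while the dates are the ones given above.

```python
from datetime import date

REGIME_2_START = date(2020, 12, 28)  # first switch in de-duplication method
REGIME_3_START = date(2022, 2, 21)   # second switch in de-duplication method

def regime_for(day):
    """Return the de-duplication regime (1, 2 or 3) that a given date falls in."""
    if day < REGIME_2_START:
        return 1
    if day < REGIME_3_START:
        return 2
    return 3

# Number of days from 28 Dec 2020 up to and including 20 Feb 2022.
regime_2_days = (REGIME_3_START - REGIME_2_START).days
```

The span between the two switch dates works out at 420 days, which agrees with the n quoted for regime 2.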
Descriptive Statistics
Model Performance for Regime 1 (8 Feb 2020 - 27 Dec 2020)
I’m not going to present pages of statistical output and tests of model adequacy, and shall get straight to the point with a slide of observed Test & Trace daily counts of unique people tested over time and values predicted by generalised linear modelling (GLM), though I will mention that the adjusted R-square ended up at a rather healthy R²(adj) = 0.986. Here’s the proof of the pudding for the first time period:
How tasty is that? By using counts of tests conducted and cases detected we have been able to accurately model the number of people who were subject to testing over the period 8 Feb 2020 - 27 Dec 2020. The model usefully predicts daily variation across each week, turning the rather crude castellated Test & Trace series into something more like!
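For those wondering what an adjusted R-square involves, here is a sketch using the usual textbook formula. How my stats package arrives at its figure for a Tweedie GLM may differ, so treat this as an illustration of the idea only; the function name and the toy numbers are hypothetical.

```python
def adjusted_r2(observed, predicted, n_predictors):
    """Textbook adjusted R-square: penalises R-square for the number of predictors."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))  # unexplained variation
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)              # total variation
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

# Toy values only, not Test & Trace data; two predictors assumed (tests and cases).
observed = [10, 20, 30, 40, 50]
predicted = [12, 18, 31, 39, 50]
adj = adjusted_r2(observed, predicted, n_predictors=2)
```

With several hundred daily observations and only a couple of predictors the adjustment is tiny, so a high adjusted value is not being flattered by the number of variables thrown at the model.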
In part 5 I shall reveal the outcome for modelling the subsequent two periods. I don’t think you are going to be disappointed.
Kettle On!
[1] The consecutive values of a time series are dependent on each other, whereas the standard method of linear regression was designed for values of data arising independently, as in a random sample.