Estimating Daily People Tested (part 2)
Estimation of the number of people undergoing virus tests in England prior to de-duplication of data records (rev 1.0)
In part 1 of this series I introduced the concept of de-duplication and how this technique serves to warp our understanding by eradicating vast numbers of negative test results on a daily basis. I introduced subscribers to Adam who tested negative at work Monday through to Thursday only to obtain a positive result on Friday.
In my book Adam was a person who tested negative on Monday. Adam was also a person who tested negative on Tuesday. He was a person who tested negative on Wednesday, and a person who tested negative on Thursday. These four ‘Adams’, together with Friday’s Adam, yield a case detection rate of 1 in 5 for the week in question (20%) yet the data authorities use de-duplication to delete the negative results of Monday to Thursday and elect to declare Friday’s result only, thus making the case detection rate one positive test for one person for the week in question (100%). Multiply Adam up to represent the nation going test crazy and you can see how the authorities can generate misconception bigger even than Texas whilst pretending to be helpful.
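For anyone who likes to see the arithmetic laid bare, here is a minimal sketch of Adam's week. The list of results and the `detection_rate` helper are made up for illustration, and de-duplication is simplified to "keep only the positive", which is how it plays out for Adam in this example.

```python
# Adam's week: one test result per working day (illustrative data only).
adam_week = ["negative", "negative", "negative", "negative", "positive"]


def detection_rate(results):
    """Positives divided by the number of results counted."""
    return results.count("positive") / len(results)


# Counting every daily test: 1 positive out of 5 results.
rate_all_tests = detection_rate(adam_week)

# Simplified de-duplication: only Friday's positive result survives.
deduplicated = [r for r in adam_week if r == "positive"]
rate_deduplicated = detection_rate(deduplicated)

print(f"all tests: {rate_all_tests:.0%}, de-duplicated: {rate_deduplicated:.0%}")
```

Same person, same week, same swabs: the only thing that changes between the two rates is the denominator.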
Today I am going to look at some rather simple and somewhat crude ways of putting those missing Adams back to get a feel for the size of the problem. I shall assume subscribers have digested part 1, though you may also find this newsletter and this one useful.
Method #1: Moving Average Of Blanked Series
The variable uniquePeopleTestedBySpecimenDateRollingSum is where we start, being a de-duplicated count of Adams across the nation of England that we can easily download from the UK GOV coronavirus dashboard. When unpacked from the rolling 7-day sum into daily counts we find contiguous positive daily entries for the period Saturday 8 Feb 2020 - Sunday 27 Dec 2020, negative Monday entries for the period Monday 28 Dec 2020 - Sunday 20 Feb 2022 and negative Monday and Tuesday entries for the period Monday 21 Feb 2022 onward.
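One way to unpack a trailing 7-day rolling sum into daily counts is sketched below. It leans on an assumption of mine, namely that the series starts at the very beginning so that any day before the first record counts as zero; the function name `unpack_rolling_sum` and the sample numbers are invented for the demonstration and are not dashboard data.

```python
def unpack_rolling_sum(rolling, window=7):
    """Recover daily counts d from trailing rolling sums S, where
    S[t] = d[t] + d[t-1] + ... + d[t-window+1].
    Assumption: days before the start of the series are zero."""
    daily = []
    for t, s in enumerate(rolling):
        previous_sum = rolling[t - 1] if t > 0 else 0
        dropped_day = daily[t - window] if t >= window else 0
        # S[t] - S[t-1] = d[t] - d[t-window], so add the dropped day back.
        daily.append(s - previous_sum + dropped_day)
    return daily


# Made-up daily counts, converted to a rolling sum and back again.
true_daily = [5, 8, 3, 0, 2, 9, 4, 6, 1, 7]
rolling = [sum(true_daily[max(0, t - 6): t + 1]) for t in range(len(true_daily))]
recovered = unpack_rolling_sum(rolling)

print(recovered == true_daily)
```

Because each day is rebuilt from the day seven places earlier, any revision to the published rolling sum propagates down the line, which is one plausible route by which negative daily entries can appear.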
If we blank the negative entries to leave holes, we can then use a 7-day centred moving average function to plaster over the cracks to give us some idea of the underlying trend in the number of people being tested. Here is that moving average superimposed on the blanked data series:
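For readers who would rather script this step than wrestle a spreadsheet, the blank-and-smooth idea can be sketched as follows. The choices here are assumptions rather than a record of the original workings: blanks are simply skipped when averaging (much as a spreadsheet average ignores empty cells), the window shrinks at either end of the series, and the daily figures are invented.

```python
def blank_negatives(daily):
    """Replace negative daily entries with None to leave holes."""
    return [x if x >= 0 else None for x in daily]


def centred_moving_average(series, window=7):
    """Centred moving average that skips None values.
    Assumption: the window is truncated at the start and end of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        neighbours = [
            x for x in series[max(0, i - half): i + half + 1] if x is not None
        ]
        smoothed.append(sum(neighbours) / len(neighbours) if neighbours else None)
    return smoothed


# Invented daily counts with two negative entries.
daily = [100, -40, 120, 130, 110, 90, 80, 105, -35, 125]
blanked = blank_negatives(daily)
smoothed = centred_moving_average(blanked)

print(blanked)
print([round(x, 1) for x in smoothed])
```

Note that skipping the holes means each average is taken over the surviving days only, so the result leans towards whichever weekdays remain.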
Method #2: Historic Coefficient
A sneaky way of filling the holes generated by removing the negative entries is to compare the level of testing on Monday and Tuesday for periods when we have that data with the test activity for the rest of the week. Simple coefficients so derived enable us to reconstruct test activity on the basis of historic weekly trends. Here are two screenshots from my spreadsheet showing the blanked daily series (grey line) with an overlay of the series estimated using the historic coefficient method:
This technique is as crude as they come but the results are surprisingly tasty, being an improvement on the standard statistical techniques of using the series mean, mean of nearby points, median of nearby points, linear interpolation or local trend that we find at the heart of time series implementations such as IBM SPSS Temporal Causal Modelling.
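The exact spreadsheet formula is not spelled out above, so what follows is only one plausible reading of the historic coefficient idea, with hypothetical function names and invented numbers. The assumption is that each coefficient is the average ratio of a Monday (or Tuesday) count to the mean of Wednesday through Sunday in the same complete week, and that a hole is filled by multiplying that coefficient by the mean of the days that survive in the holed week.

```python
def weekday_coefficients(complete_weeks, fill_days=(0, 1)):
    """For each weekday in fill_days (0 = Monday, 1 = Tuesday), return the
    average ratio of that day's count to the mean of the remaining days."""
    coefficients = {}
    for day in fill_days:
        ratios = []
        for week in complete_weeks:
            rest = [x for i, x in enumerate(week) if i not in fill_days]
            ratios.append(week[day] / (sum(rest) / len(rest)))
        coefficients[day] = sum(ratios) / len(ratios)
    return coefficients


def fill_week(week, coefficients):
    """Fill None entries on coefficient days using the mean of the other days."""
    rest = [x for i, x in enumerate(week) if i not in coefficients and x is not None]
    rest_mean = sum(rest) / len(rest)
    return [
        coefficients[i] * rest_mean if x is None and i in coefficients else x
        for i, x in enumerate(week)
    ]


# Invented complete weeks (Monday first) from a period with no holes.
history = [
    [80, 90, 100, 100, 100, 100, 100],
    [160, 180, 200, 200, 200, 200, 200],
]
coefficients = weekday_coefficients(history)

# An invented later week with Monday and Tuesday blanked.
holed_week = [None, None, 300, 300, 300, 300, 300]
filled_week = fill_week(holed_week, coefficients)

print({day: round(c, 2) for day, c in coefficients.items()})
print([round(x) for x in filled_week])
```

The design choice worth flagging is that the coefficients are frozen from the earlier period, so the method quietly assumes that the weekly rhythm of testing did not change afterwards.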
When we compare this method with the previous method of essentially doing nothing, something surprising happens:
Our filled count 7-day moving average (black line) fetches up lower than doing nothing for the more recent period, and this is because test activity on Monday and Tuesday is historically lower than the rest of the week; thus by estimating missing Monday and Tuesday data we’ve lowered the mean of the series! This is most noticeable from Monday 21 Feb 2022 onward, when the system of de-duplication flipped into third gear.
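A toy week shows why the mean drops. Everything in this sketch is invented for illustration: the flat 300 per day, and the coefficients of 0.8 and 0.9 standing in for a quieter Monday and Tuesday.

```python
from statistics import mean

# Invented week, Monday first; None marks the blanked Monday and Tuesday.
blanked_week = [None, None, 300, 300, 300, 300, 300]

# Made-up coefficients: Monday and Tuesday at 80% and 90% of the other days.
coefficients = {0: 0.8, 1: 0.9}

# "Doing nothing": average only the days that survive.
observed = [x for x in blanked_week if x is not None]
mean_ignoring_gaps = mean(observed)

# Filling the holes with lower-than-average estimates pulls the mean down.
filled_week = [
    coefficients[i] * mean_ignoring_gaps if x is None else x
    for i, x in enumerate(blanked_week)
]
mean_after_filling = mean(filled_week)

print(mean_ignoring_gaps, round(mean_after_filling, 1))
```

Skipping the holes is not a neutral act: it implicitly treats the missing days as if they were as busy as the rest of the week, whereas filling them with quieter estimates brings the weekly mean down.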
Yes, But Is It Right?
This all feels a bit shaky and rather finger-in-the-air so I decided to seek another source of test activity, being the weekly Test & Trace activity reports which may be found here. The results of this work will be served hot with butter in part 3.
Kettle on!