Reliability Of Vaccination Status in the EPR (part 1)
I utilise data from an unknown NHS Trust to determine the reliability of vaccination status as encoded in the electronic patient record (EPR)
In The PCR Test As A Predictor Of Acute Respiratory Conditions (part 3) I took the liberty of building a predictive model for incidence of a positive test result as embodied in the indicator variable COVID-19 Dx. The idea behind all this fiddling is that we cannot trust the incidence of a positive test result in the EPR of deceased in-patients as being a truthful indicator of genuine disease status at death, and so we instead rely on a statistical model to point to likely genuine cases.
In building a predictive model with the aid of machine learning, which seems to be the fashion these days, I came across an interesting result for the vaccination status indicator that I shall paste again here:
The Illusion Uncovered
Being curious I did wonder where vaccination sat in all of this and found a single main effect cowering near the bottom (OR = 0.18, p<0.001). If we are to follow the shallow reasoning of the pro-vax cult then this indicates a near six-fold magical reduction in the likelihood of a positive test result for everything and everybody. I don’t think so!
The reason I don’t think so is that this main effect is not embedded in an interaction term with an odds ratio less than unity pointing to a reduction in acute respiratory conditions or chronic respiratory disease in association with vaccination. In stats-bod speak the indicator variable Vaccination is exhibiting independence; that is to say, it bears no associative relation to any other variable in a clinical sense. This situation will come about if the decision to run a PCR test is governed by vaccination status and not by medical need. This is the illusion uncovered!
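To see why a lone main effect is suspicious, consider how odds ratios behave with and without an interaction term. Here is a minimal sketch in Python, with made-up coefficients apart from the quoted OR of 0.18 – a zero interaction means the apparent benefit is identical in every clinical subgroup, which smells of testing policy rather than treatment effect:

```python
import math

# Hypothetical logistic-model coefficients (illustrative only; nothing
# here comes from the fitted model apart from the quoted OR of 0.18).
beta_vax = math.log(0.18)   # main effect for Vaccination
beta_interact = 0.0         # no Vaccination x respiratory-disease interaction

# Odds ratio for vaccination among patients WITH respiratory disease...
or_with_resp = math.exp(beta_vax + beta_interact)
# ...and among patients WITHOUT:
or_without_resp = math.exp(beta_vax)

# With a zero interaction term the two are identical - the "benefit"
# is flat across clinical subgroups.
print(round(or_with_resp, 2), round(or_without_resp, 2))  # 0.18 0.18
```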
This is a very long-winded way of saying I don’t believe what I am seeing. Benefit from mRNA vaccines aimed at combating severe symptoms arising from SARS-CoV-2 infection is one thing, but preventing people falling off ladders or dying from cancer is quite another! Something must be driving the seeming broad-brush magic of the elixir and that something must be to do with sample bias. Professors Fenton and Neil tear into this in splendid fashion.
To make matters well wobbly a couple of whistleblowers have mentioned just how shoddy things are at the coal face when it comes to assigning vaccination status, the crunch point being unique identification of individuals via their NHS number. Lack of that all-important ID means vaccination status cannot be updated, and so folk get categorised as unvaccinated. I have also heard on the grapevine that those with a ratified NHS ID fail to get registered as vaccinated if they die shortly after vaccination.
It’s one of those harsh facts of clinical life that not everybody gets assigned an NHS number in the first instance. I have sat with booking clerks growling at screens after numerous search attempts have failed to secure a unique and unequivocal NHS ID for individuals coming in for diagnosis and treatment. Then there are relatives (and even inpatients) who provide an incorrect DOB or a street address that doesn’t exist. Neither does it help if clerks enter an American-style date of MMDDYYYY on an English DDMMYYYY workstation! All this and more is why biscuits are necessary.
Trust In Me
In Walt Disney’s version of Kipling’s The Jungle Book there’s a point when the Python Kaa mesmerises Mowgli by singing the song Trust In Me. Well, that indicator variable I’ve called Vaccination has been mesmerising me from the beginning, and for some reason I popped out of the spell after typing out my last article. Distrust grew to the point where I decided to go back to the raw data and attempt to build a model that predicts vaccination status.
A key ingredient in this recipe is vaccination uptake and for this I went straight for the horse’s mouth that is UKHSA and their Weekly National Influenza and COVID-19 Surveillance Report that provides vaccine uptake by week and quinary age band for the nation of England. A bit of jiggery pokery in Excel and a few minutes later I was looking at an independent variable I decided to call Uptake owing to sheer lack of imagination. My list of raw ingredients now looked like this:
There’s my freshly churned case detection rate (CDR) for pillar 1, revealing a mean of 5.08% for disease prevalence within the health sector over the period 2020/w11 – 2021/w36 (a reasonable figure, I shall wager); and there’s my spanking brand-new vaccine uptake variable, ranging from 0% to 95.91% with an overall grand mean of 35.96%. Sitting at the bottom of the independents is the COVID diagnosis as declared in the EPR.
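For the curious, stitching the national uptake figures onto individual death records amounts to a join by week and quinary age band. A toy sketch in Python/pandas – all column names and figures below are mine, not the Trust’s or UKHSA’s:

```python
import pandas as pd

# Hypothetical national uptake by ISO week and quinary age band.
uptake = pd.DataFrame({
    "iso_week":   ["2021-w01", "2021-w01", "2021-w02"],
    "age_band":   ["80-84",    "85-89",    "80-84"],
    "uptake_pct": [41.2,        55.7,       63.8],
})

# Hypothetical in-hospital death records from the EPR extract.
deaths = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "iso_week":   ["2021-w01", "2021-w02", "2021-w01"],
    "age_band":   ["85-89",    "80-84",    "80-84"],
})

# Left-join national uptake onto each record by week and age band,
# giving every death its own Uptake value.
deaths = deaths.merge(uptake, on=["iso_week", "age_band"], how="left")
print(deaths["uptake_pct"].tolist())  # [55.7, 63.8, 41.2]
```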
Black Box Bash
Now to pass these 20 variables through the mysterious black box in an attempt to predict vaccination status as recorded in the EPR. Some may ask why another black box and not classic multivariate logistic regression; the answer is that I’ve been impressed with its performance in predicting COVID cases, and it saves an awful lot of time – building a logistic model that contains 211 terms requires a great deal of patience!
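For readers wanting to peek inside a black box of this kind (a multilayer perceptron), here is a minimal scikit-learn sketch on synthetic stand-in data – nothing below comes from the Trust extract, and the architecture is illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: the real exercise used ~20 EPR-derived variables
# (Uptake, pillar 1 CDR, age, diagnosis counts, ...) for in-hospital deaths.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Split into training and testing stages, as in the article.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A single-hidden-layer multilayer perceptron - the "black box".
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=0)
mlp.fit(X_train, y_train)

print(round(mlp.score(X_test, y_test), 2))  # test-stage accuracy
```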
Here’s how the black box fared in terms of classification performance between training and testing stages for prediction of vaccination status over the period 2020/w11 – 2021/w36:
Readers who are used to classification tables and ROC curves will realise this model performs exceptionally well. We’ve got true negatives (prediction of unvaccinated for unvaccinated folk) at 89.5% and 89.0%, and true positives (prediction of vaccinated for vaccinated folk) up at 93.8%. This is a well-decent model so let us take a look at that table for normalised variable importance:
I’m rather pleased with this since it makes total sense. Right at the top is what we’d expect to see and that is my newly-minted vaccine uptake variable. Below this we have the pillar 1 case detection rate, which is also a measure of the passage of time and a proxy for seasonal campaigns by the NHS as well as personal factors such as getting vaccinated in order to see relatives during holiday periods. Age and the total number of diagnoses (a proxy for case complexity) take up third and fourth positions respectively, with the controversial COVID diagnosis as presented in the EPR sliding into fifth place.
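I shan’t claim to know exactly how the black box cooks up its normalised importance figures, but permutation importance is one common recipe: shuffle each variable in turn and see how far accuracy drops. A toy sketch on synthetic data, normalised so the top variable scores 1:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for a handful of predictors.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000,
                    random_state=0).fit(X, y)

# Shuffle each column n_repeats times and record the mean accuracy drop.
result = permutation_importance(mlp, X, y, n_repeats=10, random_state=0)

# Normalise so the most important variable scores 1, as in an
# importance table.
normalised = result.importances_mean / result.importances_mean.max()
print(normalised.round(2))
```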
If I get the model to make a bold decision (probability >=50%) as to vaccination status and pool the training and testing phases we arrive at this cross-tabulation of observed vaccination status against predicted:
What I’m interested in are false negatives¹ (folk who were actually vaccinated but haven’t been flagged as such in the EPR). The model indicates a sizeable number of these – 1,536 cases in 14,418 (10.7%). That’s a substantial chunk of potentially missing data, and we may gingerly compare it to the false positive error rate of 311 cases in 5,039 (6.2%).
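For those who like to check the arithmetic, the pooled cross-tabulation implied by these figures can be reconstructed in a few lines:

```python
# Pooled cross-tabulation implied by the quoted figures:
# rows = EPR status, columns = model prediction (>= 50% probability).
epr_unvax = {"pred_unvax": 14_418 - 1_536, "pred_vax": 1_536}
epr_vax   = {"pred_unvax": 311,            "pred_vax": 5_039 - 311}

total = sum(epr_unvax.values()) + sum(epr_vax.values())

# "Vaccinated" predictions among EPR-unvaccinated deaths - the
# candidates for missed vaccination records.
missed_rate = epr_unvax["pred_vax"] / sum(epr_unvax.values())
# "Unvaccinated" predictions among EPR-vaccinated deaths.
other_rate = epr_vax["pred_unvax"] / sum(epr_vax.values())

print(total, round(missed_rate * 100, 1), round(other_rate * 100, 1))
# 19457 10.7 6.2
```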
Even with all the caveats that usually accompany modelling of this kind, this result suggests to me that my hunch is likely correct. It certainly looks like Trust EPR systems are not completely reliable sources for vaccination status, this result being supportive of comments made to me in confidence. But we don’t have to get all conspiratorial about this, for as a former NHS clinical database manager I am painfully aware of just how difficult it is to capture data to a high standard even at the best of times and with dedicated staff.
As regards those false positives (predicting vaccination when folk were not vaccinated according to the EPR) I was curious as to how the model would handle the pre-vaccination period of 2020/w11 – 2020/w49: would it erroneously predict vaccination for deaths in the pre-vaccination era? Take a look at this:
Well, that smashing result pleased me no end on this fine September morning! Not a single vaccinated in-hospital death was predicted prior to rollout in 2020/w52, though this should not come as a total surprise given the biggie variable in the model is Vaccine Uptake. Clean results like this give me greater faith in what machine learning has… er… learned.
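A sanity check of this sort is easy to script: filter the pooled predictions to the pre-rollout weeks and count the “vaccinated” calls. A sketch with a hypothetical predictions frame (the column names are mine):

```python
import pandas as pd

# Hypothetical frame of pooled model predictions with an ISO week column.
preds = pd.DataFrame({
    "iso_week": ["2020-w20", "2020-w45", "2021-w05", "2021-w30"],
    "pred_vax": [0, 0, 1, 1],
})

# Zero-padded ISO weeks sort correctly as strings, so a simple
# comparison isolates the pre-rollout era (2020/w11 - 2020/w49).
pre_rollout = preds[preds["iso_week"] < "2020-w50"]

# A clean model should predict no vaccinated deaths before rollout.
print(int(pre_rollout["pred_vax"].sum()))  # 0
```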
Kit-Kat
With that stew simmering it’s time to take a break and have a Kit-Kat, for there’s a lot to cogitate on. I need to run the model again without the declared (and likely erroneous) COVID diagnosis, and I ought to look to see how classification error changes over time, not just in the prediction of vaccination status but also in the prediction of COVID status.
It is rather ironic that the two variables that matter the most when it comes to assessing vaccine benefit are the two variables I distrust the most, but I guess trying to crack this problem is keeping me off the streets and away from the lawn mower.
Summary
Machine learning (Multilayer Perceptron) was employed in the prediction of vaccination status for 19,457 in-hospital deaths for the period 2020/w11 – 2021/w36.
Exceptional levels of classification performance were achieved, with an overall true positive classification rate of 93.8%, and overall true negative classification rate of 89.3%.
The overall false positive rate (predicting vaccination in the absence of an EPR flag) was estimated at 10.7% (1,536/14,418), which suggests that linking local Trust records to the national immunisation database – which is critically dependent on NHS number identification and verification – may have failed in up to eleven percent of cases.
The overall false negative rate (failing to predict vaccination in the presence of an EPR flag) was estimated at 6.2% (311/5,039).
Kettle On!
¹ In strict modelling terms these should be called false positives; that is, the model predicts vaccination when the EPR indicates not.