COVID vs. COVID (part 3)
Predictive models for the incidence of COVID-19 amongst adult in-hospital deaths are used to assess diagnostic bias in the EPR of an unknown NHS Trust
In part 2 of this series I stumbled upon a tasty finding, this being a substantial mismatch between the observed and predicted incidence of COVID-19 diagnoses for the first 10 weeks of 2021 as vaccination got into its stride. If we assume that the machine learning technique employed (multilayer perceptron) did a decent job of predicting the likelihood of a COVID diagnosis from a matrix of symptoms, then we need to ask why this 10-week period is characterised by a substantial surplus of diagnoses declared in the EPR for 19,457 deceased in-patients. Thorny questions we might ask include:
Was the PCR cycle threshold cranked up to yield a bevy of false positives?
Was the mRNA vaccine inducing false positive test results?
Was the mRNA vaccine resulting in COVID-like illness?
Was the mRNA vaccine wrecking natural immunity?
But we could equally ask if machine learning has produced numerical garbage! I decided to rule this possibility out by repeating the exercise using classic multivariate logistic regression and, at the same time, derive a more suitable proxy measure for disease prevalence.
A Proxy For Prevalence
Up to now I’ve been using a concoction I’ve been calling case detection rate (CDR) as a proxy for disease prevalence. This has been forged by taking the UK GOV coronavirus dashboard variable newCasesBySpecimenDate, turning it into a rolling 7-day sum to iron out weekly administrative fluctuations, and dividing it by newVirusTestsBySpecimenDate, which has also been turned into a rolling 7-day aggregate. When beefed up into a percentage this has offered a useful proxy for disease prevalence for the nation of England on a daily basis, enabling me to broadly account for varying levels of exposure over time.
The problem with this is that it provides a generalised proxy at the national level, whereas it is widely acknowledged that COVID-19 is primarily a nosocomial phenomenon, this being a polite way of saying you get infected whilst in hospital. Ideally we’d derive a similar measure for the Trust under study, but since I have no idea which Trust I’m dealing with that option is off the menu. I thus opted to do something sly, and that is to take the UK GOV newCasesPillarOneBySpecimenDate, turn this into a rolling 7-day sum and divide it by newPillarOneTestsByPublishDate, also as a rolling 7-day sum. Voilà!
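For anyone wishing to cook this up at home, here is a minimal pandas sketch of both proxies. The file name is a placeholder for a dashboard extract; the metric columns are those named above.

```python
import pandas as pd

# Minimal sketch of both CDR proxies. Assumes a dashboard extract saved as
# "dashboard_extract.csv" (a placeholder name) with one row per date and the
# four metric columns named exactly as on the UK GOV coronavirus dashboard.
df = pd.read_csv("dashboard_extract.csv", parse_dates=["date"])
df = df.sort_values("date").set_index("date")

def cdr(cases: pd.Series, tests: pd.Series) -> pd.Series:
    """Rolling 7-day case sum divided by rolling 7-day test sum, as a percentage."""
    return 100 * cases.rolling(7).sum() / tests.rolling(7).sum()

df["CDR"] = cdr(df["newCasesBySpecimenDate"], df["newVirusTestsBySpecimenDate"])
df["CDR_pillar1"] = cdr(df["newCasesPillarOneBySpecimenDate"],
                        df["newPillarOneTestsByPublishDate"])
```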
Just in case folk have forgotten what ‘pillar one’ means, here’s a link to a testing methodology note that the NHS have now withdrawn. For those not keen on digesting a 16-page document I shall spoil the ending by stating that pillar one testing concerned NHS hospitals, those with a clinical need, and health and care workers, i.e. testing of those at the coal face.
At this point it would be jolly nice to plot out my original CDR variable as a time series and compare it with the CDR for pillar one testing. Have a taste of this:
First impressions are that there is overall agreement, and jolly good agreement outside of the spring 2020 and 2020/21 seasonal peaks. You can see that the first wave back in 2020 was largely driven by infections within the healthcare sector, with the situation inverting during the 2020/21 winter season. This suggests to me that my original values for CDR are not likely to have seriously thrown the modelling, though I ought to re-run analyses with a more appropriate proxy just to be sure.
Multivariate Logistic Regression
We now come to the bit where I have to predict the incidence of COVID in a sample of 19,457 adult in-hospital deaths using classic logistic regression. The groundwork for this was laid down in this article and I shall be doing the very same again, but with two exceptions: CDR will be replaced by CDR (pillar 1), and Vaccinated will be dropped.
The reason I shall be dropping the vaccination indicator is that I have doubts as to its veracity, the simple reason being that vaccination status can only be ascertained through NHS ID allocation matching within and across two different data systems. Failure to obtain a match between Trust-level and national-level databases means the flag for vaccination status resorts to the default of unvaccinated. Oh dear, indeed! On top of that I have been told tales of shoddy data management within the UK’s vaccination centres, which makes me doubly doubt the validity of information held in the EPR. At some point I shall attempt to model vaccination status but until such time I shall seal the vaccination indicator flag in a Tupperware container to avoid stinking the fridge out like a piece of ripe Stilton or squashy banana. Herewith a list of remaining ingredients:
This listing of 18 main effects will yield 153 two-way and 816 three-way interactions. I still don’t have a tea pot big enough for all those three-way effects, so I stuck to two-way as my nod to model sophistication. Regular readers will have come across this listing umpteen times but it may be worth mentioning once again that Diagnoses represents the total number of medical diagnoses made on the medical record, which may range from 0 to 10, this being a rough proxy for case complexity. CDR (pillar 1) is my newly-minted proxy for exposure to SARS-CoV-2 in a hospital setting.
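For the curious, here is one way of building that design matrix in Python. This is purely illustrative (the stats package I actually used has its own machinery); `epr` and `predictors` are placeholders for the EPR extract and its 18 predictor column names.

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Expand the 18 main effects into main effects plus all two-way interactions:
# 18 + C(18,2) = 18 + 153 = 171 terms, which with the intercept gives a
# 172-parameter model. `epr` and `predictors` are placeholders.
X = epr[predictors]                                   # shape (19457, 18)
expander = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X2 = pd.DataFrame(expander.fit_transform(X),
                  columns=expander.get_feature_names_out(X.columns),
                  index=X.index)
```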
All other independent variables should be self-explanatory (N.B. AMI = Acute Myocardial Infarction; HF = Heart Failure; CNS = Central Nervous System; Dx = diagnosis), and you can determine their incidence within the population of 19,457 adult deaths over the period 2020/w11 – 2021/w36 by eyeballing their mean value. Thus we see that a cancer diagnosis leads the way with mention in 30% of all in-hospital deaths. Please do note that these are mentions and not causes, though some will undoubtedly be the primary cause of death.
Once again it took my quad-core workstation a fair while to run the 172-parameter logistic model, but it eventually did so in 52 steps. Herewith the resulting monster table of coefficients that made it through a forward selection procedure set to 95% confidence (p<=0.05), sorted in descending order of odds ratio (OR):
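Before digging into the table, a bare-bones sketch of what such a p-value-driven forward selection might look like, continuing the placeholder matrix `X2` from above with `y` as the 0/1 COVID-19 diagnosis flag. Whatever produced the table will have its own implementation; this is just the idea.

```python
import numpy as np
import statsmodels.api as sm

# Bare-bones forward selection: at each step add the candidate term whose
# Wald p-value in the refitted model is smallest, stopping once no candidate
# achieves p <= 0.05.
selected, remaining = [], list(X2.columns)
while remaining:
    pvals = {}
    for term in remaining:
        fit = sm.Logit(y, sm.add_constant(X2[selected + [term]])).fit(disp=0)
        pvals[term] = fit.pvalues[term]
    best = min(pvals, key=pvals.get)
    if pvals[best] > 0.05:
        break
    selected.append(best)
    remaining.remove(best)

final = sm.Logit(y, sm.add_constant(X2[selected])).fit(disp=0)
odds_ratios = np.exp(final.params).sort_values(ascending=False)
print(odds_ratios)  # coefficients on the odds-ratio scale, largest first
```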
Right at the top we have acute respiratory conditions as the biggie predictor for a COVID-19 diagnosis appearing in the EPR of deceased in-patients, which makes an awful lot of sense. The odds of having been given a COVID-19 diagnosis prior to death were six times higher for any patient suffering such.
Coming second in the OR stakes is a rather interesting interaction term: Inflammatory conditions by Other Cardiac (OR = 3.42, p<0.001). Inflammatory conditions is a very broad category covering everything from the major (e.g. hepatitis) to the minor (e.g. tonsillitis), and we may think of this as the ‘itis category’. Other cardiac includes pericarditis, myocarditis, valve disease, atherosclerotic conditions, and conduction issues such as atrioventricular block. Combine the two factors and the odds of having been given a COVID-19 diagnosis prior to death are just over three times higher.
Coming third in the OR stakes is another interesting interaction term: Cancer Dx by Injury & Trauma (OR = 2.52, p=0.045). Whilst this only just squeezes into the model, we have to ask why the odds of a COVID-19 diagnosis prior to death were two and a half times higher for cancer patients suffering from injury or trauma. This seems mighty strange and requires further investigation.
Heart attacks (AMI) feature in association with organ failure and general sepsis (OR = 1.96, p<0.001), which makes sense given the strain placed on the heart and immune system through development of severe COVID-19 symptoms, in conjunction with a viral load leaving folk prone to secondary infection. In fact, it all makes clinical sense until we get to Cancer Dx by Diagnoses (OR = 1.24, p<0.001), which points to an increased likelihood of a COVID-19 diagnosis for cancer patients with complex medical conditions. I find this most curious – just what is it about cancer patients that makes them feature? Could we be looking at immunocompromised individuals susceptible to viral infection or something else entirely?
We could spend a fair few paragraphs chewing over this model but the main thing is how well it performed. Herewith the nitty gritty:
Thus we observe a true negative rate of 95.9%, which is excellent, and a true positive rate of just 34.1%, which is rather poor but pretty much par for the course when it comes to predicting incidence of disease states for individuals. Overall the performance of the model is quite respectable, with a ROC area of 0.849 (p<0.001).
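Continuing the illustrative sketch from above, these figures can be reproduced along the following lines, assuming the package's usual classification cut-off of 0.5 (an assumption on my part).

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Performance figures at a 0.5 classification cut-off. `y` is the observed
# 0/1 flag; `final`, `X2` and `selected` carry over from the sketch above.
p_hat = final.predict(sm.add_constant(X2[selected]))
tn, fp, fn, tp = confusion_matrix(y, p_hat >= 0.5).ravel()
print(f"TNR = {tn / (tn + fp):.1%}")                 # reported: 95.9%
print(f"TPR = {tp / (tp + fn):.1%}")                 # reported: 34.1%
print(f"ROC area = {roc_auc_score(y, p_hat):.3f}")   # reported: 0.849
```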
Pudding
With a revised model in the bag that is free from biases surrounding vaccination, and which utilises a proxy for disease prevalence found within a hospital setting, let us now grab those crayons and plot out the time series for COVID-19 diagnosis as given in the EPR and COVID-19 diagnosis as predicted using multivariate logistic regression (LR):
Ouch! A wonderful correspondence is to be seen during the first wave back in spring of 2020, and a reasonable correspondence is to be seen over both summers and early autumn of 2021, but there is a shocker of a difference for the 2020/21 winter season. So shocking, in fact, that I made a fresh pot of tea and ran a parallel model using machine learning (MLP) and exactly the same set of ingredients. Herewith the triple choc pudding:
Welcome to the wonderful world of statistical modelling, where all models are wrong. If we run a quick Pearson bivariate correlation over these series we arrive at r = 0.770 (p<0.001) for the logistic model and EPR data, and r = 0.864 (p<0.001) for the neural network MLP model and EPR data. On this occasion I shall have to concede that the black box approach has outgunned classic logistic regression in terms of overall fit to the dependent variable, though the classic approach nailed the first wave rather well.
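For completeness, a sketch of the parallel MLP run and the correlation check. The single hidden layer of 10 units and the iteration budget are my assumptions for illustration, not the original architecture, and `week` and `covid_epr` are placeholder column names in the EPR extract.

```python
from scipy.stats import pearsonr
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Parallel MLP on the same ingredients (the 18 main effects; an MLP picks up
# interactions implicitly). Architecture here is an assumption.
mlp = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                                  random_state=0))
mlp.fit(X, y)

# Sum predicted probabilities by ISO week to get expected weekly counts,
# then correlate each predicted series with the observed EPR series.
epr["p_lr"] = p_hat
epr["p_mlp"] = mlp.predict_proba(X)[:, 1]
weekly = epr.groupby("week")[["covid_epr", "p_lr", "p_mlp"]].sum()
r_lr, _ = pearsonr(weekly["covid_epr"], weekly["p_lr"])    # reported: r = 0.770
r_mlp, _ = pearsonr(weekly["covid_epr"], weekly["p_mlp"])  # reported: r = 0.864
```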
Although there is less of a shocking discrepancy for the 2020/21 winter peak using MLP, it is still there; but let us have a look at the weekly difference between observed and predicted COVID cases:
When crayoned in this manner the two models don’t appear to our eyeballs as being so wildly different. This plot also enables us to see periods of complete agreement and near agreement. The first wave is mighty interesting with a delay of 3 weeks between peak differences. This suggests a rather odd underlying situation, and one that starts off with negative excess - just what was going on back in spring of 2020?
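For the record, the excess series crayoned above is nothing fancier than observed minus predicted per week, continuing the `weekly` frame from the sketch above:

```python
# Weekly excess: observed EPR counts minus model-predicted counts. Positive
# values mean more COVID-19 diagnoses in the EPR than the model expected.
weekly["excess_lr"] = weekly["covid_epr"] - weekly["p_lr"]
weekly["excess_mlp"] = weekly["covid_epr"] - weekly["p_mlp"]
```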
We then come to the pre-vaccine run into the 2020/21 winter season, with the level of excess COVID diagnosis building to around 50 cases per week in this one Trust alone. I am going to suggest that this is evidence of misdiagnosis, perhaps with old-fashioned flu being re-branded as the new-fangled COVID. It will be interesting to learn what PCR cycle thresholds were used during this period and just when the decision was made to drop from three primer sequences to two and then just one.
There is no denying the peak of peaks that just so happens to emerge 4 - 8 weeks after vaccine rollout. The logical next step is to examine the raw data but I’m not convinced vaccination status has been reliably linked to the EPR from NIMS. I will thus need to examine the reliability of this variable before we can proceed with deliberations.
NOTE: Before folk get too excited please do bear in mind both models were poor at predicting presence of COVID (35% - 40% true positive), so will underestimate weekly counts. This could be due to the limits of modelling and/or because COVID designation within the EPR is, itself, suspect. That being said we ought to ask why the models successfully mimic EPR counts for the majority of weeks only to stumble just prior and just after vaccination rollout. Are we looking at a contrived disease profile?
Summary
Case detection rate (CDR) is a useful proxy for disease prevalence that can be used to adjust for differing levels of exposure to the virus over time. Since COVID-19 is largely a nosocomial phenomenon CDR was modified using pillar 1 test and case data.
Vaccination status may not be a reliable indicator within the EPR owing to reliance on linking Trust-level and national-level data systems. Failure to specify a unique NHS number in both systems will result in a default coding of unvaccinated.
Two statistical methods (multivariate logistic regression and MLP machine learning) were used to predict COVID status prior to death in 19,457 in-patients over the period 2020/w11 - 2021/w36.
Excess counts for COVID status (observed minus predicted) were most notable over the 2020/21 winter season, and especially 4 - 8 weeks after vaccine rollout.
Kettle On!
Can a Simpson’s paradox manifest in the time domain, whereby some key factors to which the model(s) are sensitive are not time-invariant? Is there potential for additional insight if multivariate LR were applied separately to more than one time period where modalities are suspected to differ?
From what I can tell it's not the PCR threshold.
I don't know about your data, but generally speaking COVID diagnoses are not always PCR confirmed.
Most of these surplus cases fall into the non-PCR-confirmed category, judging by my familiarity with the US data.
Besides, how would that just happen? Nobody ordered labs to increase cycle thresholds and later reduce them again.
I'll write you an email with two exemplary scatter charts to illustrate the US situation.