I think this must be wrong, John, since very good analysis suggests that we had far, far higher rates of false positives at most times during the pandemic than the 15% or so your analysis suggests. False positives were more like 70-90% most of the time, particularly in areas (most places in the world) with strong screening programs in place. Or am I misunderstanding your analysis? Is it limited to particular kinds of tests or testing milieus? Here's our writeup on the false positive paradox resulting from screening for Covid: https://www.bmj.com/content/373/bmj.n1411/rr
These false positive rates vary wildly according to location and point in time.
Even the measured prevalence varied by a factor of 10 or more within weeks (the example here would be June vs. July/August 2021 in the USA, roughly a factor of 12).
Given a constant rate of false positives and a more or less constant number of truly negative patients being tested per day, the proportion of misdiagnosed covid cases will vary by a factor greater than the factor by which the total number of cases varies (a factor of 12 in this example).
This reasoning should extend to most observations.
A single figure like 'x% false positives' just can't do justice to the complexity of a highly dynamic situation like this. Or rather, 'situations like these'.
Hi Fabian, agreed that prevalence differed greatly over time, but much of that was probably due to false positive signals and the random drift that is part and parcel of the false positive paradox. We've also heard frequently that the false positive paradox doesn't matter much in times of "high prevalence," and people simply assume that Covid was in fact "high prevalence" much of the time. But in the essay I linked to (and an accompanying academic paper that was never accepted for publication), we attempt to show, based on the published studies from the large vaccine clinical trials, which screened all of their participants for Covid upon commencement of the trials, that even during "spikes" in Covid, background prevalence was probably never higher than 1% and was generally quite a bit lower. This is not "high prevalence," and at these levels, with screening programs and imperfect tests (no tests are perfect), false positive rates are generally well into the 90% range and higher.
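To make the arithmetic concrete, here is a minimal sketch of the false positive paradox. The sensitivity and specificity figures are illustrative assumptions, not values taken from the linked paper:

```python
# Share of positive results that are false, as a function of prevalence.
# The sensitivity/specificity defaults are assumed for illustration only.
def false_discovery_rate(prevalence, sensitivity=0.80, specificity=0.97):
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return false_pos / (true_pos + false_pos)

# A ~30x swing in prevalence moves the false share dramatically:
for prev in (0.001, 0.003, 0.01, 0.03):
    print(f"prevalence {prev:.1%}: "
          f"{false_discovery_rate(prev):.0%} of positives are false")
```

Under these assumed test characteristics, sub-1% prevalence puts the false share at roughly 80-97%, and it only falls toward 50% once prevalence climbs past a few percent.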
I agree on 1% not really being a high prevalence in the face of every person being tested, assuming specificity rates far below 99%.
But how do you explain full genome sequencing results matching SARS-CoV-2? Changes in the number of positive sequences are highly correlated with changes in the number of positive PCR tests.
I can supply time series for GISAID sequences if interested. Write me@pervaers.com
We review the Sanger sequencing verification studies in this preprint, and those studies generally support the very high false positive PCR rate narrative I've suggested here. Would love to see your data and analysis. https://www.authorea.com/users/61793/articles/527660-the-false-positive-paradox-and-the-risks-of-testing-asymptomatic-people-for-covid-19.
Here's a table we generated for a later version of the same paper:
https://www.dropbox.com/scl/fi/n0ve6brqic1bq059lvfs2/Table-1-False-positive-paradox-JID.png?rlkey=jceqwyyal6k2n77im981qu8fn&dl=0
Can you post some links here and I'll also email you?
I will give you anything I have, but it's a work in progress, so I can't really supply any links. What I am currently working on is the situation in Q3 2021, when the South caught up on vaccinations with the other divisions. Changes in mortality were highly correlated with changes in the number of administered vaccine doses. During that time, respiratory mortality was the biggest excess mortality factor. COVID deaths, COVID cases, positivity rates and CFR were all highly correlated with new first doses. I can show you what I've got once you email me.
Great, looking forward to it. You mentioned sequencing data. Is that published or in your private data? There have been quite a few sequencing verification studies, but not enough, and most sequencing was used to identify variants rather than as PCR verification.
A bit of misunderstanding but also likely a big dollop of error! The model tries to predict genuine COVID but does so using the erroneous COVID-19 Dx within the EPR. We thus arrive at a modified COVID status in which some, but not all, false positives have been shaved off. What would be good is to arrange a model that is independent of the COVID-19 Dx, relying on the symptom matrix alone. In this regard you may find the following report interesting...
https://jdee.substack.com/p/catastrophic-health-collapse-part-ea2
That makes sense, but even symptoms can be very difficult to use, particularly ex post, b/c the symptom list is so dang long. Seems that a method requiring at least 2 or 3 symptoms PLUS a positive test result could define a 'case' ex post, in an exercise designed to determine how many real cases were present vs. what was recorded at the time.
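As a toy sketch of that ex post definition (the symptom matrix and test column here are synthetic stand-ins, not real data):

```python
import numpy as np

rng = np.random.default_rng(3)
symptoms = rng.integers(0, 2, size=(1_000, 8))  # hypothetical binary symptom matrix
test_positive = rng.random(1_000) < 0.3         # stand-in for recorded test results

# Ex post 'case': a positive test result PLUS at least 2 recorded symptoms
ex_post_case = test_positive & (symptoms.sum(axis=1) >= 2)
print(f"recorded positives: {test_positive.sum()}, ex post cases: {ex_post_case.sum()}")
```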
Well, this is what the EPR purports to show but does not. I had an irate registrar fume on about them not relying on the PCR alone to come to a diagnostic conclusion but, of course, it isn't the medics who process casenotes but the clinical coding teams. Three have come forward to confide that management are ordering them to code cases as COVID even if 'NOT COVID' has been scrawled by the SHO. This means that we are swimming in clinical nonsense that will require a casenote audit to unpick, but even then I have my doubts as to whether anything truthful may emerge. I say this because I know of cases where CXR has swung the diagnosis made in hospital but the GP has disagreed, pointing to pre-existing shadow.
In terms of the stats approach to unravelling matters I have used MLP to define objectively what it thinks is a genuine COVID case. In essence this does what you suggest in that it models a positive test result using a symptom matrix. However, the default probability threshold used to define a 'probable case' is rather lax at a score of 50% or greater. In part 2 of this mini series I'll be exploring the raw probabilities and revealing what happens when we impose stricter criteria for defining a probable case.
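For flavour, a minimal sketch of that kind of model on synthetic data (the symptom columns and outcome are made up; this is not the actual EPR model):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
# Hypothetical symptom matrix: one row per admission, one binary column
# per recorded symptom (cough, fever, dyspnoea, ...).
X = rng.integers(0, 2, size=(5_000, 12))
# Stand-in for the recorded test result, weakly tied to the first few
# 'symptoms' so the model has something to learn.
y = rng.random(5_000) < (0.05 + 0.4 * X[:, :3].mean(axis=1))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=1)
mlp.fit(X_tr, y_tr)

# Keep the raw probabilities rather than only the default 0.5 cut-off
probs = mlp.predict_proba(X_te)[:, 1]
probable_case = probs >= 0.5  # the 'lax' default threshold
print(f"probable cases: {probable_case.sum()} of {len(probs)}")
```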
Wow, please do share more on the reversal of "not Covid" diagnoses by "management." I'm sure you've seen the US CDC and WHO coding criteria, which yes indeed placed a massive bias toward rolling everything in as "covid" for deaths certification. I wrote this way back in 2020 but I think it still holds up: https://tamhunt.medium.com/data-quality-issues-and-the-coronavirus-pandemic-db0356373fc2
I can't say any more since this was provided in strict confidence by those who still need to hang on to their jobs. It would be a simple audit to undertake for any Trust, but there's no way this will be permitted.
- Is the count of MLP covid cases the sum of fractional cases, or a count of cases with probability > 50%? (See the sketch below for the distinction.)
- As we previously brainstormed: "I have date of vaccination and date of death so it will be interesting to look at the distribution of time elapsed for COVID and non-COVID cases." This was in regard to questioning whether the EPR covid Dx remained in the system after the person had already recovered. Another incentive to look at that is these newfound outlier cases with a non-covid Dx: if something is really funny about their timing relative to vaccination or death, that may reveal an issue.
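To illustrate the distinction in the first question, a toy sketch:

```python
import numpy as np

probs = np.array([0.95, 0.70, 0.55, 0.40, 0.10])  # toy MLP probabilities

count_thresholded = int(np.sum(probs >= 0.5))  # cases with p >= 50%  -> 3
count_fractional = probs.sum()                 # sum of fractional cases -> 2.70
print(count_thresholded, count_fractional)
```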
You are on the money - MLP defines a COVID case as >= 50%. An examination of the mismatched cases for clues is down on my list of tasks. The delay from Vx to death is a big study area that I'll be rolling out using survival techniques. Also on my list is constructing a 'possible COVID' score that is independent of the test result and totally dependent on the diagnostic matrix, just to see how this squares up.
I'm suspicious of a single fixed 50% threshold. I would be curious to know how sensitive that AcuteResp OR is if covid probability were put into 10 bins or 10 thresholds, or left to run as a continuous variable like age - whatever the right words are; see the sweep sketched below. (I see you must have fixed an error and downgraded OR 151 to OR 11 =) ).
Also curious: if the MLP covid case count were a sum of fractional cases, what would the concordance of diagnosis be (instead of 85.6%)?
I can't remember if this has already been talked about, but I wonder if "in-hospital covid prevalence" could be an important variable, as opposed to just CDR, to capture admin/behavioral stuff.
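Something like this threshold sweep is what I have in mind; the data here are synthetic, so the ORs themselves mean nothing, only the sensitivity pattern does:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins: MLP covid probabilities and an acute respiratory
# outcome loosely tied to them (illustration only).
p_covid = rng.beta(2, 2, 10_000)
acute_resp = rng.random(10_000) < (0.05 + 0.4 * p_covid)

def odds_ratio(exposed, outcome):
    a = np.sum(exposed & outcome)    # exposed with outcome
    b = np.sum(exposed & ~outcome)   # exposed without outcome
    c = np.sum(~exposed & outcome)   # unexposed with outcome
    d = np.sum(~exposed & ~outcome)  # unexposed without outcome
    return (a * d) / (b * c)

# Sweep the 'probable case' threshold instead of fixing it at 0.5
for thr in np.arange(0.1, 1.0, 0.1):
    cases = p_covid >= thr
    print(f"threshold {thr:.1f}: OR = {odds_ratio(cases, acute_resp):.1f}")
```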
Yes, that's the system default, but I have the raw probabilities for non-COVID as well as COVID so can revise the thresholds, and will likely make this issue the subject of part 2. In fact, I started out with the raw scores rather than the binary indicator, and this is what gave that almighty OR = 151. The trouble is folk would have misinterpreted that, since they think in terms of 'got it or not' rather than a distribution of probabilities that forms part of a multi-term risk model. An option would be banging all the terms into a spreadsheet and allowing folk to play with the parameters, but I went for simplicity.
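A toy illustration of why a raw 0-1 score blows the OR up relative to a binary indicator (synthetic data, statsmodels for the logistic fits; not the actual model):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
p_covid = rng.beta(2, 2, 10_000)                    # raw MLP-style score
outcome = rng.random(10_000) < (0.05 + 0.4 * p_covid)

def fitted_or(term):
    # OR per one-unit change in the term, from a univariate logistic fit
    X = sm.add_constant(term.astype(float))
    fit = sm.Logit(outcome.astype(float), X).fit(disp=0)
    return np.exp(fit.params[1])

print("OR for raw score (0 -> 1):     ", fitted_or(p_covid))
print("OR for binary >= 0.5 indicator:", fitted_or(p_covid >= 0.5))
```

The raw-score OR spans the whole 0-to-1 change, while the binary OR effectively compares the averages of the two halves, so the former comes out much larger on the same data.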
Yes, that 85.6% is intriguing and I shall be digging down a bit more. Yep, I also reckon IHCP is going to be key as the primary measure of exposure, but I'm at a loss as to how to derive that with the limited data I have.
Two ways to calculate IHCP that probably won't work / might make things worse, but might get your gears turning (a sketch of option 2 follows the list):
1) Take the ratio of CDR to hospitalization count, country-wide, and use this as a proxy. That assumes it can be generalized to this specific Trust, though. If such a calculation could be done over many regions of the country, and they all happened to be close to each other, then the assumption might be valid. But then you'd need enough data to do those regional calculations. Also, this might be a chance to simultaneously see if CDR is fairly uniform by region. Though interpretation of all that might also depend on whether the Trust is regional or a bunch of disparate facilities, which I do not know.
2) You have date of death. On each date, you can assign a total covid case count based on (ProbableCovid) × (Binary: did they die today?). Then you can shift this entire histogram back in time based on a guess at the average length of time between in-hospital covid Dx and death. Or, if you really wanted to be bad, you could make that time-shift a function of the value of their covid probability.
3) You could further try and validate numbers 1 and 2 by seeing if they match.
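And a rough pandas sketch of option 2; the column names and the 10-day lag are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "date_of_death": pd.to_datetime(["2021-01-05", "2021-01-05", "2021-01-09"]),
    "probable_covid": [0.9, 0.2, 0.7],  # MLP probability for each death
})

LAG_DAYS = 10  # guessed mean time from in-hospital covid Dx to death

daily = (
    df.assign(dx_date=df["date_of_death"] - pd.Timedelta(days=LAG_DAYS))
      .groupby("dx_date")["probable_covid"]
      .sum()  # expected covid deaths, back-shifted to approximate Dx dates
)
print(daily)
# Dividing by a daily occupancy series (not available here) would turn
# this back-shifted count into an in-hospital prevalence estimate.
```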