Data Pandemic: How the COVID-19 Crisis Exposed a Critical Weakness in the Nation’s Data Handling
While the COVID-19 pandemic itself may be unprecedented, America’s poor national response to it shares root causes with earlier public health failures, including the disappointing initial rollout of the Affordable Care Act (ACA). Both efforts required massive amounts of heterogeneous data, collected from disparate sources and many IT systems, to be brought together for analysis. In the case of the novel coronavirus, reliable data was needed to track prevalence, among other factors; in the case of the ACA, large-scale data management was critical to determining subsidies. In both cases, the quality and handling of the initial data were largely ignored, leading to critical problems later on. This pre-integration part of the data pipeline tends to be neglected by organizations of all kinds, often with disastrous consequences. It was the Achilles heel of Healthcare.gov, and it has made a fiasco of our COVID-19 response. Without better data pipelines, our responses to future pandemics and other emergencies risk becoming disasters in their own right.
Competent data professionals were brought in to triage the problems with the Affordable Care Act, fixing what would otherwise have been an ongoing failure, and they can do the same for our COVID-19 response. People are currently dying unnecessarily, but the problems are solvable. Unlike in the late 20th century, when many hospital IT systems were created, we now have the technology to address the underlying problem of heterogeneous data interoperation, thanks to advances in AI and other areas.
What do we mean when we describe data “interoperating”? Consider the ability of Google Calendar to display appointments you made on your Apple Calendar: that’s data interoperability. Now compare it to “data integration”, which describes the ability to recognize that pete@gmail.com and peter@apple.com are the same person. That’s a lot tougher! Data integration is contextual and semantic, and it requires human decision-making, whereas data interoperability can be fully automated. Fortunately, better data interoperability can solve most of the problems at hand, as long as we apply the right tools to accomplish it.
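To make the distinction concrete, here is a minimal Python sketch. The data, names, and helper functions are ours, invented purely for illustration: exchanging an appointment over an agreed-upon format is mechanical, while deciding that two records describe the same person is a judgment call.

```python
from datetime import datetime

# Interoperability: both calendars agree on a shared representation (here,
# ISO 8601 timestamps in a simple dict), so importing an event is mechanical.
apple_event = {"title": "Dentist", "start": "2020-06-01T09:00:00"}

def import_event(event: dict) -> dict:
    """Fully automatable: just parse the agreed-upon representation."""
    return {"title": event["title"], "start": datetime.fromisoformat(event["start"])}

# Integration: deciding that two records refer to the same person is semantic.
# A similarity heuristic can suggest a match, but where to draw the line is a
# human decision, not a standard.
record_a = {"name": "Pete Smith", "email": "pete@gmail.com"}
record_b = {"name": "Peter Smith", "email": "peter@apple.com"}

def probably_same_person(a: dict, b: dict) -> bool:
    """A crude heuristic: same surname and same first initial."""
    return (a["name"].split()[-1] == b["name"].split()[-1]
            and a["name"][0] == b["name"][0])

print(import_event(apple_event))                 # deterministic, standard-driven
print(probably_same_person(record_a, record_b))  # judgment-driven guess
```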
Many problems in data infrastructure, such as those we are seeing with COVID-19, stem from using data integration systems, which require frequent human intervention, in situations that actually call for dependable, fully automated data interoperability, which AI can now provide. Doing data integration when you need interoperability is like manufacturing a custom cable from scratch in your basement every time you need to plug in your headphones, rather than relying on a standard like Bluetooth or an ordinary headphone jack.
In practical terms, this conflation of data interoperability and integration manifests itself not only in poor outcomes, but also in runaway budgets and hand-written code that can rarely be reused. Rather than pay an army of programmers to build a data pipeline from each hospital to the CDC and then run whatever tests we can think of in the hope it will be ready for a pandemic, we could pay one domain expert to formalize the CDC data warehouse schema mathematically, and then use new techniques from AI to prove (or disprove) that each hospital’s reporting process conforms to it. We can essentially replace an army of programmers with a math problem, one that can be solved with AI in a fully automated way and whose solution can be checked.
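As a deliberately toy illustration of that division of labor (the field names and rules below are our invention, not the actual CDC schema), a formalized target schema becomes something a machine can check every hospital’s mapping against before any real data flows:

```python
from dataclasses import dataclass

ALLOWED_RESULTS = {"positive", "negative", "inconclusive"}
ALLOWED_TEST_TYPES = {"pcr", "antigen"}   # antibody tests excluded on purpose

@dataclass
class CdcTestReport:
    patient_id: str
    test_type: str       # must be a diagnostic test, not an antibody test
    result: str
    collected_date: str  # ISO 8601 date

def check_report(r: CdcTestReport) -> list:
    """Machine-checkable constraints derived from the formalized schema."""
    errors = []
    if r.test_type not in ALLOWED_TEST_TYPES:
        errors.append(f"not a diagnostic test: {r.test_type}")
    if r.result not in ALLOWED_RESULTS:
        errors.append(f"unknown result: {r.result}")
    return errors

# A hospital's mapping from its local record format to the target schema.
def hospital_a_mapping(row: dict) -> CdcTestReport:
    return CdcTestReport(
        patient_id=row["mrn"],
        test_type=row["assay"].lower(),
        result=row["outcome"].lower(),
        collected_date=row["date"],
    )

# Check the mapping against the constraints instead of hand-testing the pipeline.
sample = {"mrn": "123", "assay": "Antibody", "outcome": "Positive", "date": "2020-05-01"}
print(check_report(hospital_a_mapping(sample)))  # flags the antibody test
```

In practice the formal specification and the checking machinery would be far more powerful than this sketch, but the division of labor is the point: one expert writes down the rules once, and software verifies every hospital’s mapping against them.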
In fields outside of data management, this transition to “provably correct code” is already taking place. Provably correct operating systems run pacemakers and fighter jets, and formal verification is applied to smart contracts and many other systems. But whereas in those domains AI is generally still too weak to fully automate the work, in data management AI is now strong enough to do so, thanks to recent advances in both the scale of available data and the underlying algorithms. It is therefore critical to establish these data pipelines now, so that we know they will function when we need them.
Naysayers sometimes complain that automation puts jobs at risk, but in this case, it will actually open the door to more and better jobs. Integration and migration of data within the world’s healthcare system currently happens almost entirely by hand. It’s vocational-level IT work. The people who do it are often overqualified for the job, yet lack any contextual knowledge of healthcare or the meaning of the data sets. Mistakes and dissatisfaction are inevitable. While the revenue this work generates for the world’s large consultancies reflects the deployment of this massive programmer army, the results do not.
We hear daily of mistakes. Data was not collected properly. The formats were mismatched. Someone used Microsoft Excel and corrupted a column of data without catching it. (By one estimate, roughly 20% of genomics papers with supplementary gene lists contain gene names silently corrupted by Excel’s automatic formatting.*) Some data sits in an old mainframe while other data is in the cloud. One system uses SAP while another uses Oracle. Organizations believe that if they can just get their data onto a single platform, their problems will be over. But this is a mirage. Data automation is impossible without first formalizing data relationships mathematically. And the most insidious part is that the people using these data sets are often unaware of the problems they contain.
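Here is a toy example of that Excel failure mode, in which gene symbols such as SEPT2 and MARCH1 are silently rewritten as dates, along with the kind of automated check that would catch it (the gene list below is made up):

```python
import re

# A column that should contain gene symbols, as it comes back from Excel:
# SEPT2 and MARCH1 have been auto-converted to dates.
exported_from_excel = ["TP53", "2-Sep", "BRCA1", "1-Mar", "EGFR"]

DATE_LIKE = re.compile(r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$")

def suspicious_gene_symbols(column):
    """Flag values that look like dates in a column that should hold gene symbols."""
    return [value for value in column if DATE_LIKE.match(value)]

print(suspicious_gene_symbols(exported_from_excel))  # ['2-Sep', '1-Mar']
```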
Unthinkable errors can result from this type of data mismanagement. Several U.S. states improperly merged COVID-19 diagnostic test results with post-infection antibody test results, creating a falsely optimistic picture of their testing situation. This happened not just in Georgia, as widely publicized**, but also in Virginia. Meanwhile, other states were unclear about whether test counts reflected the number of people tested or just the number of samples.*** The result was that “percent positive” statistics became meaningless, no matter how accurate the underlying tests were. What was the cause? Human error in data management.
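The arithmetic behind the distortion is simple. With purely hypothetical numbers:

```python
# Hypothetical counts, chosen only to show how merging test types distorts
# the percent-positive statistic that states were reporting.
viral_tests, viral_positives = 10_000, 1_500       # diagnostic (PCR) tests
antibody_tests, antibody_positives = 5_000, 250    # serology tests

correct_rate = viral_positives / viral_tests
merged_rate = (viral_positives + antibody_positives) / (viral_tests + antibody_tests)

print(f"diagnostic positivity: {correct_rate:.1%}")  # 15.0%
print(f"merged positivity:     {merged_rate:.1%}")   # 11.7% -- looks better, means nothing
```

Pouring a pool of mostly negative antibody results into the denominator pushes the reported positivity down without a single additional diagnostic test being run.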
Similar problems occur in everyday medical data. We’ve seen data sets that included a yes/no field asking whether the patient is a smoker. One well-meaning physician entered, “Not now, but consumed 1 pack/day in 2019.” Picture the data integration process in which that answer gets “normalized” to a simple “no”. Multiply that potential error by 331 million Americans, and no expensive multi-year cloud migration can make the data useful.
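Here is what that lossy “normalization” looks like in miniature, with a hypothetical record and a hypothetical helper function, alongside a schema that keeps the nuance instead of discarding it:

```python
raw_answer = "Not now, but consumed 1 pack/day in 2019."

def normalize_smoker_field(answer: str) -> str:
    """Force free text into the yes/no field the target schema demands."""
    return "yes" if answer.strip().lower().startswith("yes") else "no"

print(normalize_smoker_field(raw_answer))  # "no" -- the pack-a-day history is gone

# A schema with room for nuance avoids the forced choice entirely.
structured = {
    "current_smoker": False,
    "former_smoker": True,
    "history": "1 pack/day through 2019",
}
```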
So how can AI help solve these problems at scale? Rather than relying on AI tools only to analyze and integrate data, we should have AI examine the system that allowed a yes/no field where more detail was needed. We shouldn’t force doctors to record less nuance; we need to force the data system to capture more. And AI can do this at a scale that solves healthcare problems rather than causes them. The devil is in the details at the beginning of the data pipeline, and we must get those details right to achieve full automation, because just guessing at them can get people killed.
* https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7
** https://www.ajc.com/news/state--regional-govt--politics/latest-data-lapse-inflated-georgia-virus-test-count-000/2RG89mkuryApRMdQzblMgP
*** https://www.theatlantic.com/health/archive/2020/05/covid-19-tests-combine-virginia/611620/