A Vaccine for Our Hobbled Healthcare System: How data interoperability can help heal healthcare in the pandemic and beyond.
In an effort to help stem the pandemic, many enterprise AI-driven companies have announced the public release of huge collections of their data, known as “data lakes.” They are sharing these data lakes in the hope that researchers around the world will use the information to fight COVID-19. In theory, those researchers could extract all sorts of knowledge and apply it to an immensely complicated global healthcare problem.
However, what most of these well-meaning organizations don’t understand is that building a COVID-19 data lake isn’t very useful in itself. A massive data repository might seem like a helpful resource amid a crisis of this type. But unless the data is prepared properly before integration (which is rare), it can’t be used to accomplish anything at scale. It’s a bit like dumping everything ever written on public health and epidemiology into a huge empty stadium. Without a universal shelving and cataloging system, that giant pile of information isn’t going to magically become a library, nor can it yield practical knowledge without hundreds of years of manual reading. Since the pandemic began, more than 23,000 new papers about COVID-19 have been published. But without a way to organize and summarize their content, meta-analysis is impossible, and the big-picture knowledge is lost.
People unfamiliar with the problem typically think that we can simply deploy artificial intelligence (AI) to parse the available information. But what most don’t realize is that the ability of AI to draw conclusions is directly dependent on the quality and preparation of the information it is fed in the first place. And make no mistake: data lakes do NOT reliably provide the inputs needed to make effective decisions and take the necessary actions. It’s a data lake, not a wisdom lake. And it’s effectively a data swamp if it’s not prepared correctly. What the organizations that collect the data haven’t yet realized is that the value of data isn’t in its collection and storage, but in its organization and retrieval.
One of the many organizations that has had to wrestle with these data problems is The COVID Tracking Project, which provides some of the most critical data available on COVID-19 testing at a national level. Its work shows just how hard it is to make data useful, accessible, and understandable. The COVID Tracking Project started with two journalists at The Atlantic who couldn’t find the testing data they needed to report on the pandemic, because no one was collecting it in a coordinated fashion. Neither the CDC nor any other federal agency offered official national statistics. Some states were tracking their own testing data and reporting it; others weren’t. And each state that was tracking testing used different categories in its tallies. The two journalists attempted to fill this information void by combining the disparate data sets and manually integrating them into a single compiled collection, so that an accurate testing picture could finally emerge. That combined data set now drives the pandemic information that many organizations rely on. But it wasn’t easy.
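To make that reconciliation concrete, here is a minimal sketch (in Python, with invented state labels, category names, and numbers) of the kind of translation the project had to perform by hand: two states report testing under different categories, and their figures only become comparable once each report is mapped onto one shared set of fields.

    # Hypothetical example: two states report COVID-19 testing data
    # under different category names. To compare or combine them, each
    # report must first be mapped into one shared set of fields.

    # State A reports "positive" and "negative" results.
    state_a_report = {"state": "A", "positive": 1200, "negative": 18800}

    # State B reports "confirmed_cases" and "total_tests" instead.
    state_b_report = {"state": "B", "confirmed_cases": 950, "total_tests": 21000}

    def normalize_state_a(report):
        """Map State A's categories onto the shared schema."""
        return {
            "state": report["state"],
            "positive_tests": report["positive"],
            "total_tests": report["positive"] + report["negative"],
        }

    def normalize_state_b(report):
        """Map State B's categories onto the shared schema."""
        return {
            "state": report["state"],
            "positive_tests": report["confirmed_cases"],
            "total_tests": report["total_tests"],
        }

    # Only after normalization can the records be tallied together.
    combined = [normalize_state_a(state_a_report), normalize_state_b(state_b_report)]
    national_total = sum(row["total_tests"] for row in combined)
    print(combined)
    print("Total tests reported:", national_total)

Writing two such mappings by hand is easy; writing and maintaining one for every state and territory, against constantly changing reports, is the huge human effort described next.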
While The COVID Tracking Project used some fairly advanced technology to scrape the appropriate data from the available sources, it also required a huge human effort to pull it all together and make it interoperate. That kind of effort simply isn’t scalable or practical for most uses, and it can’t be repeated for larger data sets. This particular data set is just one small blip in a giant sea of available information about the crisis that we should be using, but can’t, because we aren’t managing the data properly.

The distribution of personal protective equipment (PPE) has also been hampered by poor data interoperation. Despite the fact that huge amounts of PPE are theoretically available, most of it hasn’t been distributed to the people who need it. One reason is that the movement and distribution of the equipment is handled by global transport and logistics companies that oversee hundreds of ships, most of which carry tens of thousands of containers for various clients. The data they have simply doesn’t tell them where all of that inventory is at any given moment. Locating their clients’ shipments should be simple, but with increased demand, tracking them in a timely fashion became almost impossible.
Why? In order to track and manage a shipment, a company must be able to type a search term (e.g. “medical”) into its system and pinpoint which container, on which ship, in which port, is carrying the specific items. This should be instant, but during the pandemic, finding the PPE often took days or weeks, because a human had to comb through countless spreadsheets (sometimes even paper ones) to find the word “medical,” then send another human to the shipyard to find the specific ship and container. As with the testing project, the data needed to avoid this problem existed among the companies involved, but their technology didn’t allow it to be easily searched, compared, mixed, or matched. The mere existence of that data is not enough. It has to be organized in a consistent and standardized way to be useful.
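For contrast, here is a minimal sketch (in Python, using an in-memory SQLite database and invented container numbers, ship names, ports, and cargo descriptions) of what that lookup becomes once the manifests live in one consistently structured table: the days or weeks of manual searching collapse into a single query.

    import sqlite3

    # Hypothetical, consistently structured manifest data. In reality this
    # information is scattered across spreadsheets and paper records.
    manifests = [
        ("MSCU1234567", "Atlantic Carrier", "Rotterdam", "medical masks and gowns"),
        ("TCLU7654321", "Atlantic Carrier", "Rotterdam", "consumer electronics"),
        ("APZU1111111", "Pacific Runner", "Los Angeles", "medical gloves"),
        ("CSQU2222222", "Pacific Runner", "Los Angeles", "furniture"),
    ]

    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE containers (container_id TEXT, ship TEXT, port TEXT, cargo TEXT)"
    )
    conn.executemany("INSERT INTO containers VALUES (?, ?, ?, ?)", manifests)

    # The search term "medical" pinpoints container, ship, and port instantly.
    for row in conn.execute(
        "SELECT container_id, ship, port FROM containers WHERE cargo LIKE ?",
        ("%medical%",),
    ):
        print(row)

The hard part, of course, is not the query; it is getting every company’s manifests into a shared structure like this in the first place, which is exactly the organization problem described above.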
This limitation is one of the biggest problems in AI right now: data usability hasn’t kept up with data growth. Until data can be organized and analyzed at the pace it accumulates, the human bottleneck in applying AI will persist. If our response to a global pandemic is two journalists sitting at their computers, calling state health departments on the telephone and reading press releases to get accurate testing numbers, we have little hope of ever applying the power of big data to solve bigger problems. Even though they worked as fast as humanly possible, they were still just humans managing data manually. Without a way to quickly make existing data useful at scale, the future of AI will be hamstrung.
But there is hope. There is an approach that would efficiently organize the available data and make it usable by AI. If all of these potentially life-saving data sets were formalized through a newly relevant branch of mathematics known as “category theory,” we could harness the power of all this information. By employing Category Query Language instead of the languages currently used to organize and merge giant data sets, we could make the information useful and connectable before putting it into the lake. Suddenly, all of those books and articles would have a retrofitted indexing system that is reliable and scalable to any amount of data. And instead of a stadium full of paper, we would have a searchable library where we could find whatever information we needed.
Unlike its predecessors, Category Query Language forms connections among existing data points and data sets by describing their relationships in a formal mathematical way. This allows even very different data sets to be merged into an interoperable whole without creating a digital disaster. And it brings all the available data into the decision-making process, enabling queries that have never been possible before. What Category Query Language offers, and what no current data management solution can, is shared context for the data sets it connects. Nothing could be more critical to the future of healthcare. The organizations that adopt this approach will be the ones that help us emerge from future healthcare crises with the minimum possible damage.
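To illustrate what shared context looks like in practice, here is a minimal sketch; it is not CQL itself, just ordinary Python with hypothetical hospital and supplier records, but it captures the basic move: declare an explicit mapping from each source schema into one shared schema, then let a generic routine migrate every record across that mapping before the data ever lands in the lake.

    # Minimal, hypothetical sketch of "shared context": two sources with
    # different schemas are migrated into one shared schema via explicit,
    # declared mappings, rather than by ad-hoc manual copying.

    # Source 1: a hospital inventory system.
    hospital_rows = [
        {"item_name": "N95 respirator", "on_hand": 140, "site": "General Hospital"},
        {"item_name": "surgical gown", "on_hand": 60, "site": "General Hospital"},
    ]

    # Source 2: a supplier catalog with a different shape and field names.
    supplier_rows = [
        {"sku_desc": "N95 respirator", "qty_available": 5000, "warehouse": "Port Depot 7"},
    ]

    # The shared schema everything is mapped into: item, quantity, location, source.
    # Each mapping says, field by field, how a source record becomes a shared record.
    hospital_mapping = {"item": "item_name", "quantity": "on_hand", "location": "site"}
    supplier_mapping = {"item": "sku_desc", "quantity": "qty_available", "location": "warehouse"}

    def migrate(rows, mapping, source_name):
        """Apply a declared schema mapping to every record from one source."""
        migrated = []
        for row in rows:
            record = {target: row[source] for target, source in mapping.items()}
            record["source"] = source_name
            migrated.append(record)
        return migrated

    # Once migrated, the data sets share context and can be queried together.
    merged = migrate(hospital_rows, hospital_mapping, "hospital") + migrate(
        supplier_rows, supplier_mapping, "supplier"
    )

    n95_supply = sum(r["quantity"] for r in merged if "N95" in r["item"])
    print(merged)
    print("Total N95 respirators visible across sources:", n95_supply)

The appeal of grounding such mappings in category theory, as described above, is that they become formal objects: they can be composed, checked for consistency, and scaled to any number of sources, rather than living in the heads of the people doing the merging.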