INSPIRE DATA MESH

Leveraging the OHDSI/OMOP CDM to develop a systematic approach to clinical research that answers real-world questions across a global network of standardized COVID-19 data in Kenya and Malawi.

Shared Definitions

In a data mesh, shared definitions across a federation ensure that local research can be combined across space and time as needed. In the context of OHDSI and OMOP, there are shared definitions on several levels.

  • At the bottommost level, source data from each “observation shop” in a network of observation shops is mapped consistently to the observations and measurements that make up the OMOP CDM. This consistency comes largely from shared implementation guides, which direct structured source data into OMOP clinical and/or population health events. The guides are part of a shared (meta)data pipeline that each observation shop operates to move its source data into an OMOP CDM instance.
  • On top of these observations and measurements, OHDSI lets us construct phenotypes. Each phenotype circumscribes a set of observable characteristics or traits of the subject of observation. OHDSI comes with an extensible library of phenotypes that researchers can use to construct comparable cohorts across space and time.
  • Finally, at the topmost level, cohorts with different exposures are analyzed in studies undertaken across a research network. Each study has a human- and machine-readable study definition. These study definitions can be used both retrospectively and prospectively. Prospectively, they can guide the execution of emulated clinical trials both within and between the observation shops that make up a federation. Retrospectively, they can be used in the course of a meta-analysis. In the prospective use case, research is frequently conducted by a software agent. In the retrospective use case, meta-analysis has so far mostly been conducted by a human researcher.
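
The three levels above can be illustrated with a minimal sketch. It is not OHDSI code: the `Event`, `Phenotype`, `build_cohort` and `run_study` names, the site names and the concept ids are all hypothetical placeholders, and a real OHDSI phenotype is considerably richer than a set of concept ids. The point is only that one shared definition, applied unchanged at every observation shop, yields cohort counts that can be compared or combined.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class Event:
    """Hypothetical, simplified stand-in for one OMOP-style clinical event."""
    person_id: int
    concept_id: int  # placeholder value, not a real OMOP concept id


@dataclass(frozen=True)
class Phenotype:
    """Hypothetical shared definition: a named set of concept ids."""
    name: str
    concept_ids: FrozenSet[int]


def build_cohort(events: List[Event], phenotype: Phenotype) -> Set[int]:
    """Return the persons with at least one event matching the phenotype."""
    return {e.person_id for e in events if e.concept_id in phenotype.concept_ids}


def run_study(sites: Dict[str, List[Event]], phenotype: Phenotype) -> Dict[str, int]:
    """Apply the same phenotype at every site; only aggregate counts are returned."""
    return {name: len(build_cohort(events, phenotype)) for name, events in sites.items()}


# Placeholder data for two hypothetical observation shops.
phenotype = Phenotype("example_condition", frozenset({1001}))
sites = {
    "site_a": [Event(1, 1001), Event(1, 1001), Event(2, 2002)],
    "site_b": [Event(7, 1001), Event(8, 1001)],
}
counts = run_study(sites, phenotype)
print(counts)
```

Because each site returns only a count, person-level data never leaves the observation shop in this sketch.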
Cross-Africa Virtual Machines

Cross-Africa virtual machines consume population health surveillance data in line with the shared definitions, first to characterize the data with the help of dashboards and then to execute the research.

OHDSI virtual machines (VMs) can be hosted by the same or different cloud providers. Each contains the OMOP CDM and a set of OHDSI services that run on top of OMOP. Currently INSPIRE has built its own VM, which runs at just one cloud provider, Microsoft Azure. Both INSPIRE and other OHDSI research entities are in the process of building VMs that, through containerization, can run under multiple cloud providers.

INSPIRE envisions that, over time, a brigade of VMs built by one or more research entities will be deployed in the African context. Membership in the brigade will be coordinated by the Africa Chapter of OHDSI. Each VM will have to pass the same test.

INSPIRE is building a catalog that chronicles OHDSI experiments undertaken across VMs. The catalog is a collection of observational studies specified with schema.org. Each study published at the INSPIRE website will appear in an internet-wide catalog of datasets called Google Dataset Search. Google Dataset Search is, in effect, a catalog of catalogs.
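
As a rough illustration of what a schema.org catalog entry might look like, the sketch below builds a JSON-LD record in Python. It assumes the study is described with the schema.org `Dataset` type, which is the type Google Dataset Search indexes. The helper name `study_to_jsonld`, the example URL and all field values are hypothetical placeholders, not INSPIRE's actual catalog schema.

```python
import json
from typing import Dict, List


def study_to_jsonld(name: str, description: str, url: str, keywords: List[str]) -> Dict[str, object]:
    """Build a schema.org Dataset record for one study (hypothetical helper)."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": keywords,
    }


# All values below are placeholders for illustration only.
record = study_to_jsonld(
    name="Example observational study (placeholder)",
    description="Placeholder description of a study run across OMOP CDM instances.",
    url="https://example.org/studies/example",
    keywords=["OMOP CDM", "OHDSI", "observational study"],
)

# JSON-LD is typically embedded in the study's web page inside a script tag.
jsonld = json.dumps(record, indent=2)
print(f'<script type="application/ld+json">\n{jsonld}\n</script>')
```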

Data Pipeline

Data pipelines move (meta)data from Population Health Surveillance (below) into an OMOP CDM instance (VM). There are specific pipelines for each type of surveillance: demographic, disease, clinical (including COVID-19) and, in the future, sensor data that captures environmental exposures.

Data pipelines are centrally managed shared resources across OMOP CDM instances.

While these pipelines differ in detail depending on the surveillance type, at a high level they all follow the same steps:

  • Each pipeline begins with structured surveillance data aka source data. The structure comes from standard exchange formats – one for each type of surveillance.
  • The structured source data comes with a codebook. Currently only DDI codebooks are supported.
  • For each surveillance type, an implementation guide is constructed from the exchange format and the codebook. At present this step is only partly automated. In the OMOP CDM “clinical” tables, the structured data shows up in source concepts and source values.
  • The implementation guides provide direction to an OHDSI product called Rabbit in a Hat. Rabbit in a Hat is used to design and document the mapping between the structured surveillance data and the OMOP CDM.
  • Rabbit in a Hat produces skeletal ETL.
  • The skeletal ETL in turn guides the construction of an ETL program built with Pentaho's Kettle ETL tooling. In Kettle there are high-level steps called jobs and specific transformations that run under each job. In our use case, jobs ensure that the ETL for every surveillance type follows the same strategy, even though the details differ from one surveillance type to another.
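
The job and transformation split described in the steps above can be sketched in plain Python. This is not Kettle or Rabbit in a Hat output: `run_job`, `keep_source_value`, `map_concept`, the `PIPELINES` and `CONCEPT_MAP` tables, the variable names and the concept ids are hypothetical stand-ins. The only OMOP conventions relied on are the `observation_source_value` and `observation_concept_id` column names and the use of concept id 0 for a value with no matching standard concept.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Transformation = Callable[[Row], Row]

NO_MATCHING_CONCEPT = 0  # OMOP convention for an unmapped source value

# Hypothetical codebook-derived lookup: (variable, value) -> placeholder concept id.
CONCEPT_MAP = {
    ("test_result", "POS"): 9001,
    ("residence_status", "RESIDENT"): 9101,
}


def keep_source_value(variable: str) -> Transformation:
    """Transformation: preserve the original coded value as the source value."""
    def step(row: Row) -> Row:
        out = dict(row)
        out["observation_source_value"] = f"{variable}={row[variable]}"
        return out
    return step


def map_concept(variable: str) -> Transformation:
    """Transformation: look up the concept id, falling back to 0 when unmapped."""
    def step(row: Row) -> Row:
        out = dict(row)
        key = (variable, row[variable])
        out["observation_concept_id"] = CONCEPT_MAP.get(key, NO_MATCHING_CONCEPT)
        return out
    return step


def run_job(rows: List[Row], transformations: List[Transformation]) -> List[Row]:
    """Job: the same strategy for every surveillance type; only the steps differ."""
    results = []
    for row in rows:
        for transform in transformations:
            row = transform(row)
        results.append(row)
    return results


# One transformation list per surveillance type, all executed by the same job.
PIPELINES: Dict[str, List[Transformation]] = {
    "clinical": [keep_source_value("test_result"), map_concept("test_result")],
    "demographic": [keep_source_value("residence_status"), map_concept("residence_status")],
}

source_rows = [
    {"person_id": 1, "test_result": "POS"},
    {"person_id": 2, "test_result": "UNK"},
]
omop_rows = run_job(source_rows, PIPELINES["clinical"])
print(omop_rows)
```

Keeping the job generic and pushing the differences into per-type transformation lists mirrors the idea that every surveillance type follows one strategy with different details.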
Population Health Data

The VMs consume population health surveillance marshaled into multiple exchange formats aka standards. These exchange standards and the data they format are run through shared data pipelines at each VM.

At INSPIRE, population health surveillance is growing incrementally. We have a roadmap, and we are adding new data pipelines for new data types over time. Currently, INSPIRE supports the following data types, at varying levels of maturity:

  • Demographic surveillance (in depth). Demographic surveillance is now being used to calculate excess deaths during the COVID-19 era.
  • Integrated Disease Surveillance and Response (IDSR) Africa Region person-level CRFs aka DHIS2. This is a work in progress. Currently some VMs are hosting synthetic IDSR data while negotiations with Ministries of Health are ongoing.
  • The WHO COVID-19 Core CRF. We have constructed an implementation guide that directs EMR data into OMOP by way of the WHO COVID-19 Core CRF. This pipeline is not yet complete, but INSPIRE has obtained funding to complete it and it is now on the INSPIRE roadmap. Also on our roadmap are plans to map OpenMRS EMR data, and a standard FHIR export that many EMR systems now support, into OMOP, again by way of the same WHO COVID-19 Core CRF.
  • Future data types for which INSPIRE is seeking funds include (1) mental health real-world data by way of pharmacy and mental health provider CRFs and (2) the exposome, which is a person’s history of environmental exposures. Currently, OHDSI is in the process of extending the OMOP CDM to host environmental exposures, and INSPIRE is a participant in an OHDSI working group directing this effort.