Use Case 2 – From Data to Knowledge

Clinico-Molecular Predictive Knowledge Tool

In the MIRACUM consortium’s use case “From Data to Knowledge – Clinico-Molecular Predictive Knowledge Tool”, the aim is to develop and establish methods for the cross-site analysis of patient data in the participating university hospitals. The methods will be used to generate knowledge that can be directly applied in clinical practice.

The gradual expansion of the data integration centers at the university medical sites of the Medical Informatics Initiative will create a basis for identifying patient cohorts from clinical parameters, biomarkers and molecular/genomic studies, and for dividing these cohorts into subgroups. On this basis, Use Case 2 of the MIRACUM consortium develops predictive models that can contribute to medical knowledge and potentially support physicians in their diagnostic and therapeutic decisions. Clinically, the use case focuses on patients with lung diseases (asthma and COPD) and brain tumors.



MIRACUM – Gemeinsam gegen Asthma und COPD (in German, source: BMBF)


A concrete example: alpha-1-antitrypsin deficiency (AATD; German abbreviation AATM) is a hereditary disease in which the protein alpha-1-antitrypsin is missing or deficient in the body. As a result, tissue damage to the lungs and liver can occur, leading to chronic obstructive pulmonary disease (COPD) at a young age. COPD patients with and without AATD therefore often differ fundamentally, both in age and in smoking history, the biggest risk factors for COPD. The problem is that COPD with AATD is rather rare, which is why prognostic factors for complications and emerging comorbidities have usually been established from records of COPD patients without AATD. The use case “From Data to Knowledge” investigates whether these factors can be applied to COPD patients with AATD despite the fundamental differences.

The corresponding data in MIRACUM are particularly sensitive from a data protection perspective, and collecting them centrally across all sites is potentially too great a risk. The goal is therefore not to bring the data to the analysis, but to bring the analysis to the data. More precisely, only aggregated and anonymous data should leave the sites. This principle is implemented by the software DataSHIELD, which was developed at Newcastle University. The software is published under an open source license and can be used freely. DataSHIELD offers a range of procedures from the standard statistical toolkit, from simple summary statistics such as means or frequencies to more complex regression models like those used in the clinical application described above. In addition to these already implemented analysis procedures, DataSHIELD offers a flexible and extensible infrastructure for developing new artificial intelligence methods and applying them to distributed data. To this end, the MIRACUM consortium is in close exchange with the development team and the DataSHIELD community.
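The principle of bringing the analysis to the data can be illustrated with a minimal Python sketch. It is not DataSHIELD code and does not use the DataSHIELD API. The site names, the lung function values and the disclosure threshold `MIN_COUNT` are made up for illustration. Each site reduces its patient-level values to a count and a sum, and only these aggregates are combined into a pooled mean.

```python
from dataclasses import dataclass

# Hypothetical disclosure threshold chosen for this sketch.
MIN_COUNT = 5


@dataclass(frozen=True)
class SiteSummary:
    """Aggregates that a site is allowed to release."""
    n: int
    total: float


def summarize_locally(values, min_count=MIN_COUNT):
    """Runs inside a site: reduce patient-level values to a count and a sum."""
    if len(values) < min_count:
        # Very small groups could reveal individual patients, so refuse.
        raise ValueError("too few patients to release an aggregate")
    return SiteSummary(n=len(values), total=float(sum(values)))


def pooled_mean(summaries):
    """Runs at the coordinator: combine site aggregates into one mean."""
    n = sum(s.n for s in summaries)
    return sum(s.total for s in summaries) / n


# Made-up lung function values (FEV1 in litres) at three hypothetical sites.
site_data = {
    "site_a": [2.1, 2.4, 1.9, 2.8, 3.0],
    "site_b": [1.7, 2.2, 2.6, 2.0, 2.5, 2.3],
    "site_c": [3.1, 2.9, 2.7, 2.4, 2.2],
}

# Only the SiteSummary objects cross site boundaries, never the raw lists.
summaries = [summarize_locally(values) for values in site_data.values()]
mean_fev1 = pooled_mean(summaries)
print(f"pooled mean FEV1 over {sum(s.n for s in summaries)} patients: {mean_fev1:.3f}")
```

The pooled mean is identical to the mean computed on the combined raw data, although no patient-level value leaves a site. Regression models need several rounds of such aggregate exchanges, but follow the same idea.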

In addition to anonymous aggregated data, the use case is investigating synthetic data as a way to meet data protection requirements. Synthetic data do not contain real observations or patient information, but replicate general characteristics and statistical relationships of the real data. For research, this means that virtual patient data are created for each site that are not tied to the data of any individual patient. Such data can then be shared and analyzed with different approaches, such as standard statistical analyses or artificial intelligence techniques. Generating synthetic data from real data requires machine learning approaches. Specifically, so-called generative models are used, which capture the systematic and random variability of the original data. This is made possible by artificial intelligence techniques, in particular from the field of deep learning. The generation of virtual patient data is distributed across the MIRACUM sites and also uses the DataSHIELD infrastructure. In this way, the analysis of the data with established procedures and the development of new methods for the data protection-compliant analysis of distributed patient data can be advanced jointly.
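The idea of a generative model can be sketched in a few lines of Python. This is a deliberately simple stand-in, a linear-Gaussian model for two variables, and not the deep generative models used in the use case. The variables (age and FEV1) and all numbers are invented for illustration. The model is fitted on "real" rows at one site, and new rows are then drawn from the fitted model instead of being copied from patients.

```python
import random
from statistics import mean, stdev


def fit_linear_gaussian(rows):
    """Fit x ~ Normal(mu, sd) and y = a + b*x + Normal(0, resid_sd)."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in rows]
    return {
        "x_mean": mx,
        "x_sd": stdev(xs),
        "intercept": intercept,
        "slope": slope,
        # Sample standard deviation of the residuals; adequate for a sketch.
        "resid_sd": stdev(residuals),
    }


def sample_synthetic(model, n, rng):
    """Draw n virtual patients from the fitted model."""
    rows = []
    for _ in range(n):
        x = rng.gauss(model["x_mean"], model["x_sd"])
        noise = rng.gauss(0.0, model["resid_sd"])
        rows.append((x, model["intercept"] + model["slope"] * x + noise))
    return rows


rng = random.Random(0)

# Made-up "real" site data: age in years and FEV1 in litres.
real = []
for _ in range(500):
    age = rng.gauss(60, 10)
    real.append((age, 4.0 - 0.03 * age + rng.gauss(0, 0.3)))

model = fit_linear_gaussian(real)
synthetic = sample_synthetic(model, 500, rng)

# The synthetic rows reproduce the age/FEV1 relationship of the real rows.
refit = fit_linear_gaussian(synthetic)
print(f"slope real: {model['slope']:.4f}, slope synthetic: {refit['slope']:.4f}")
```

The synthetic rows share the mean, spread and age dependence of the original data, but none of them corresponds to an actual patient. Deep generative models follow the same fit-then-sample pattern with far more flexible models.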


Zöller D, Haverkamp C, Makoudjou A, Sofack G, Kiefer S, Gebele D, Pfaffenlehner M, Boeker M, Binder H, Karki K, Seidemann C, Schmeck B, Greulich T, Renz H, Schild S, Seuchter SA, Tibyampansha D, Buhl R, Rohde G, Trudzinski FC, Bals R, Janciauskiene S, Stolz D, Fähndrich S. Alpha-1-antitrypsin-deficiency is associated with lower cardiovascular risk: an approach based on federated learning. Respir Res 2024; 25:38. DOI: 10.1186/s12931-023-02607-y.

Lenz S, Hess M, Binder H. Deep generative models in DataSHIELD. BMC Med Res Methodol 2021; 21:64. DOI: 10.1186/s12874-021-01237-6. PMCID: PMC8019187.

Gruendner J, Wolf N, Tögel L, Haller F, Prokosch HU, Christoph J. Integrating Genomics and Clinical Data for Statistical Analysis by Using GEnome MINIng (GEMINI) and Fast Healthcare Interoperability Resources (FHIR): System Design and Implementation. JMIR 2020; 22:e19879. DOI: 10.2196/19879.

Gruendner J, Prokosch HU, Schindler S, Lenz S, Binder H. A Queue-Poll Extension and DataSHIELD: Standardised, Monitored, Indirect and Secure Access to Sensitive Data. Stud Health Technol Inform 2019; 258:115-119. DOI: 10.3233/978-1-61499-959-1-115. PMID: 30942726.