Distributed Analytics Infrastructure – Database Group Mittweida

Introduction

Today, large amounts of data are created in many different places, i.e., by different organizations and institutions. The collected data volumes are in some cases very large. Scientific studies and other investigations often record images, videos, or voice recordings of a subject or object of interest, whereas other investigations produce large collections of proprietary data types that are highly specific to the use case or domain, such as genomic and sequencing analyses in bioinformatics and the medical sciences. Moreover, such data collections are often subject to privacy constraints, in particular when they are associated with individual persons and represent person-specific states.

Therefore, it is often infeasible to integrate and replicate this data into a new data source (data silo) for an overall data analysis. To address this challenge, distributed analytics approaches have been developed in recent years. Many of these approaches are strongly focused on a particular (scientific) domain or on specific types of data. Two examples: The XNAT system is primarily used in distributed image analytics and is therefore focused on handling image data. DataSHIELD is often used in the medical sciences but remains limited to structured data. We are part of an international team working on a distributed infrastructure called the Personal Health Train.

Personal Health Train Approach

The Personal Health Train (PHT) approach makes heavy use of the train metaphor. A train starts at a train station and moves on to further stations until it reaches the final station. The train runs on secured tracks, so an accident is only possible if somebody does not follow the rules. At each station, new passengers can board the train.

In this context, the PHT approach places a station at each data location, whereas a train contains analysis algorithms and intermediate results. Hence, the analysis algorithms are transported to the data, in contrast to the traditional scheme of bringing the data to the analysis. This saves network bandwidth as well as the resources and time needed to obtain access to the data and run the analysis on it. Moreover, no individual-level data leaves institutional borders. The PHT only transports intermediate results from which no association to individuals can be drawn. Such intermediate results can consist of statistical or machine learning models, sample sizes, and similar aggregates. In order to run the analysis in a homogenized environment, the PHT uses containerization technology: each train is deployed as a container that comprises all algorithms and intermediate results in a predefined environment. At each station, new intermediate results can be added or can replace older ones.
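
The flow described above can be illustrated with a minimal Python sketch. It is not the PHT implementation: the class names Train and Station, the run_route helper, and the sample values are all hypothetical, and containerization, security, and the station registry are left out entirely. The sketch only shows the core idea that the analysis travels to each station while only aggregate intermediate results (here a count and a sum) travel onward.

```python
from dataclasses import dataclass, field


@dataclass
class Train:
    """Hypothetical train: carries the analysis and aggregate results only."""
    results: dict = field(default_factory=lambda: {"n": 0, "total": 0.0})

    def run_at(self, local_values):
        # Runs inside the station; only aggregates are written to the train.
        self.results["n"] += len(local_values)
        self.results["total"] += sum(local_values)


@dataclass
class Station:
    """Hypothetical station: holds local data that never leaves it."""
    name: str
    _local_values: list

    def execute(self, train):
        # The station executes the train's analysis on its own data.
        train.run_at(self._local_values)
        return train


def run_route(train, stations):
    # Move the train from station to station and return the final results.
    for station in stations:
        train = station.execute(train)
    return train.results


# Made-up example data for three stations.
stations = [
    Station("station_a", [4.0, 6.0]),
    Station("station_b", [5.0]),
    Station("station_c", [7.0, 8.0, 6.0]),
]
results = run_route(Train(), stations)
mean = results["total"] / results["n"]
```

After the last station, the train holds enough information to compute the overall mean, although no station has disclosed its individual values.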

We are working with teams at RWTH Aachen and the University of Tübingen on a German PHT implementation. In this network, we work on specific PHT components, such as the station registry that lists all data locations, on the fundamental infrastructure, and on analysis algorithms for distributed analytics. We apply this infrastructure in different research projects, for example on polypharmacy and rare diseases.

While these use cases are very specific to the medical sciences (hence the name Personal Health Train), we are also very interested in applying this infrastructure to other domains, such as forensics.