Concept and Architecture of the RDMC and its Platform for NFDIxCS and the Research Community

Whitepaper

Jan Bernoth, Michael Goedicke, Michael Striewe

Motivation

Missing Aspects:

Research Data Management Container (RDMC)

The RDMC is at the heart of the efforts made by NFDIxCS. It is a portable object that manages itself with respect to access and its specific workflow. It bundles the data, the software, and the further information needed to describe the data, together with all the means required to access the data and bring it back to life. Initially, RDMCs will be hosted by a few service providers within the consortium and by trusted service providers.

Concepts

The design of the RDMC is guided by several principal concepts, which will be outlined in this section. Actual consequences for the RDMC architecture will be derived from these concepts in the subsequent sections.

Connecting Data, Software, and Context through a Container

Research is not a series of isolated steps. Therefore, it’s essential to permanently link contextual information with datasets [4]. The container facilitates this linkage by consolidating diverse artifacts of varying formats and referencing any additional required artifacts in one location. To illustrate the types of data, software, and context being referred to, several examples are provided in the NFDIxCS proposal:

Discipline Examples Data Software Context
Theoretical Computer Science Formal proof for interactive theorem proof data for testing, evaluation and training; for example libraries of proof problems theorem prover
Software Engineering and Programming Languages
Security and Dependability
Operating, Communication, Database and Distributed Systems
Visual Computing
Business Information Systems
Computer Architecture and Embedded Systems
Massively Parallel & Data Intensive Systems
Artificial Intelligence and Machine Learning
Interactive Systems

Several open questions emerge when considering the balance between the stability and flexibility of the container, especially in its role connecting data, software, and context. On one hand, the container represents a snapshot of current research, ensuring consistency. On the other, there’s a need for adaptability as datasets change or grow and software evolves. Stemming from this dynamic:

The container itself receives a stable ID, so that other containers and scientific works can reference this research artifact. This creates a network of rich artifacts in which all contextual elements are available for each individual piece.
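The reference network can be pictured as a directed graph keyed by container ID. The following is a minimal sketch under stated assumptions: the `rdmc:` identifier format, the `ContainerRef` class, and the `reachable` helper are hypothetical illustrations and not part of any NFDIxCS specification.

```python
from dataclasses import dataclass, field


@dataclass
class ContainerRef:
    """Hypothetical record: one container and the stable IDs it references."""
    container_id: str
    references: list[str] = field(default_factory=list)


def reachable(start: str, registry: dict[str, ContainerRef]) -> set[str]:
    """Return every container ID reachable from `start` by following references."""
    seen: set[str] = set()
    stack = [start]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against reference cycles
        seen.add(current)
        node = registry.get(current)
        if node is not None:  # unknown IDs are recorded but not expanded
            stack.extend(node.references)
    return seen


# Illustrative IDs only; the real identifier scheme is not defined here.
registry = {
    "rdmc:a": ContainerRef("rdmc:a", ["rdmc:b"]),
    "rdmc:b": ContainerRef("rdmc:b", ["rdmc:c", "rdmc:a"]),
    "rdmc:c": ContainerRef("rdmc:c"),
}
context = reachable("rdmc:a", registry)
```

Because the IDs are stable, such a traversal could collect the full context of a container even when containers reference each other in cycles.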

As Open as possible

As mentioned above, research can produce and use entirely different artifacts. Consequently, the container must be as open as possible: it must allow any type of artifact in any data format and any type of reference to any other location. Taking this idea further, there will not even be a single implementation of the code that manages RDMCs. Instead, multiple implementations can be created, all of which read or create RDMCs that conform to the standards defined by NFDIxCS.

Embracing multiple standards

While there is a need to establish a standard for RDMCs, it doesn’t imply neglecting existing standards. Embracing openness requires us to accommodate various standards across different areas of research data and software management. These areas, where multiple standards need to be addressed or prevalent standards should be adopted, are identified by the NFDIxCS Community and encompass the following topics:

The container also needs the flexibility to adapt to new standards if there are changes in the underlying technology or schemas.

Using archives vs creating an archive

The RDMC and its platform should not be built as another archive. Constructing an archive requires significantly more resources, and the landscape for archiving in individual disciplines would be expanded by a new archiving system. As a result, NFDIxCS doesn’t plan to create a new archive but to utilize existing solutions. The RDMC and its platform must architecturally act as mediators. They should provide users the opportunity to point to existing data and software archives and orchestrate them into a unified container. Defining this mediation space raises various questions:

Guarantees / Levels of consistency and conservation

The concept sketched so far is deliberately open to multiple implementations, sets few constraints, and gives great freedom in using the opportunities provided. Consequently, not all RDMCs will have exactly the same shape, and it is important to know what can be achieved by choosing one option or another. We first examine the different options and their consequences. Subsequently, we provide a sequence of levels, where each level provides more guarantees regarding the consistency of the container and the conservation of data than the previous one. Later on, these levels may be promoted in the platform by badges that declare conformance to one of the levels.

Levels:

Manifest File

The RDMC comes in different “flavours”, which will probably be implemented in ascending order from simple to complex during the development process of NFDIxCS. The minimal content of a container is a manifest file that explains the RDMC contents in a machine-readable format and also contains signature and versioning information. Any resources associated with the RDMC may be provided as references to external data sources. This way, it is easy to set up a minimal RDMC if the data is already published in an appropriate repository and no additional guarantees are required.
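To make the idea of a minimal manifest concrete, the following sketch builds one as a JSON document. It rests on several assumptions: the field names (`id`, `version`, `resources`, `digest`), the `build_manifest` helper, and the example identifier and URL are all hypothetical, and the SHA-256 digest only stands in for the signature information, which would additionally require a key pair and a signature scheme that this document does not specify.

```python
import hashlib
import json


def build_manifest(container_id: str, version: str, resources: list[dict]) -> dict:
    """Assemble a minimal manifest and attach a SHA-256 digest of its body."""
    body = {
        "id": container_id,
        "version": version,
        "resources": resources,  # references to external data sources
    }
    # Canonical serialization so that the digest is reproducible.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {**body, "digest": {"algorithm": "sha256", "value": digest}}


# Hypothetical container that only points at already published data.
manifest = build_manifest(
    "rdmc:example",
    "1.0.0",
    [{"role": "data", "href": "https://example.org/dataset"}],
)
print(json.dumps(manifest, indent=2))
```

In this sketch the container holds no payload at all: every resource is a reference, which corresponds to the minimal flavour described above.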

Active vs. passive containers

Besides the manifest file, an RDMC may contain a file structure that has three key components:

The simplest way of creating an RDMC is to assemble its contents manually in the way sketched above, describe them properly in a manifest file, and package the resulting file structure into an archive file. This is not the preferred way of handling RDMCs, as it provides few guarantees with respect to the consistency of the contents.
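The manual route can be sketched in a few lines. This is an illustration under assumptions, not a prescribed procedure: the ZIP format, the file name `manifest.json`, and the `package_container` helper are hypothetical choices made for the example.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def package_container(source_dir: Path, manifest: dict, archive_path: Path) -> list[str]:
    """Write the manifest plus every file under `source_dir` into a ZIP archive."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the container root.
                zf.write(path, arcname=path.relative_to(source_dir).as_posix())
        return zf.namelist()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "content"
    (root / "data").mkdir(parents=True)
    (root / "data" / "results.csv").write_text("x,y\n1,2\n")
    names = package_container(root, {"id": "rdmc:example"}, Path(tmp) / "example.zip")
```

Nothing in this sketch checks that the manifest actually describes the packaged files, which illustrates why manual assembly offers few consistency guarantees.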

The platform will provide a service through which an RDMC can be built and managed via a UI. (Potentially, a standalone software tool will be available to do the same.) Rights may restrict write access. The platform will implement mechanisms to ensure the integrity of the RDMC based on signatures. Nevertheless, the result will still be a “passive” container that can only be managed manually or via the platform.

In order to create an active RDMC, the file structure can be established within an executable container (like Docker). A piece of software will be developed by NFDIxCS that can be added to the container as an entry point. When started, it will provide a UI similar to the platform that grants access to the RDMC contents. Rights may restrict write access. The piece of software will implement mechanisms to ensure the integrity of the RDMC based on signatures.
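Both the platform and the entry-point software are described as ensuring integrity based on signatures. A minimal sketch of such a check follows. It assumes a shared-secret HMAC as a stand-in, because the actual signature scheme is not specified here; a real deployment would more likely use asymmetric signatures. The `sign` and `verify` helpers and the demo key are hypothetical.

```python
import hashlib
import hmac


def sign(content: bytes, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the content (stand-in for a signature)."""
    return hmac.new(key, content, hashlib.sha256).hexdigest()


def verify(content: bytes, key: bytes, signature: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = sign(content, key)
    return hmac.compare_digest(expected, signature)


key = b"demo-key-not-for-production"  # assumption: illustrative key only
original = b"x,y\n1,2\n"
signature = sign(original, key)

untouched_ok = verify(original, key, signature)        # content unchanged
tampered_ok = verify(b"x,y\n1,3\n", key, signature)    # one value altered
```

The same check could run inside the platform for passive containers and inside the entry-point software for active ones; only the place of execution differs.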

The piece of software may provide additional services that resolve dependencies on reusable execution environments and can thus set up an environment automatically.
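One way to picture such a resolution service is a lookup against a catalogue of reusable environments. The sketch below is purely illustrative: the `ENVIRONMENTS` catalogue, its entries, and the `resolve_environment` function are hypothetical and are not taken from any NFDIxCS specification.

```python
from typing import Optional

# Hypothetical catalogue: environment name -> capabilities it provides.
ENVIRONMENTS = {
    "python-3.11-scientific": {"python>=3.11", "numpy", "pandas"},
    "java-17-build": {"jdk17", "maven"},
}


def resolve_environment(required: set[str]) -> Optional[str]:
    """Return the first catalogued environment that provides all requirements."""
    for name, provided in sorted(ENVIRONMENTS.items()):
        if required <= provided:  # subset test: every requirement is covered
            return name
    return None  # no reusable environment matches


selected = resolve_environment({"numpy", "pandas"})
missing = resolve_environment({"rustc"})
```

When no environment matches, as for `missing`, the container would have to ship its own environment description or fall back to manual setup.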

Implementing the RDMC

NFDIxCS will provide a specification for the RDMC contents and the piece of software. NFDIxCS will provide a reference implementation based on one container technology and one programming language for the piece of software. Other implementations are welcome.

The Platform

Architecture

The architecture will be modeled following the C4 model. C4 guides architects to create visualizations at different levels of detail:

Context

Users actively using the NFDIxCS Platform:

Container

Glossary

Abbreviation Full Form
RDMC Research Data Management Container

References

1. Lamprecht, A.-L., Garcia, L., Kuzak, M., Martinez, C., Arcila, R., Del Martin Pico, E., Del Dominguez Angel, V., van de Sandt, S., Ison, J., Martinez, P.A., McQuilton, P., Valencia, A., Harrow, J., Psomopoulos, F., Gelpi, J.Ll., Chue Hong, N., Goble, C., Capella-Gutierrez, S.: Towards FAIR principles for research software. Data Science. 3, 37–59 (2020). https://doi.org/10.3233/DS-190026.
2. Gruenpeter, M., Katz, D.S., Lamprecht, A.-L., Honeyman, T., Garijo, D., Struck, A., Niehues, A., Martinez, P.A., Castro, L.J., Rabemanantsoa, T., Chue Hong, N.P., Martinez-Ortiz, C., Sesink, L., Liffers, M., Fouilloux, A.C., Erdmann, C., Peroni, S., Martinez Lavanchy, P., Todorov, I., Sinha, M.: Defining research software: A controversial discussion, (2021). https://doi.org/10.5281/zenodo.5504016.
3. Hinsen, K.: Verifiability in computer-aided research: The role of digital scientific notations at the human-computer interface. PeerJ. Computer science. 4, e158 (2018). https://doi.org/10.7717/peerj-cs.158.
4. Smedt, K. de, Koureas, D., Wittenburg, P.: FAIR digital objects for science: From data pieces to actionable knowledge units. Publications. 8, 21 (2020). https://doi.org/10.3390/publications8020021.
5. Hasselbring, W., Carr, L., Hettrick, S., Packer, H., Tiropanis, T.: From FAIR research data toward FAIR and open research software. it - Information Technology. 62, 39–47 (2020). https://doi.org/10.1515/itit-2019-0040.