IZKF Core Unit RDM

Metadata

Metadata is crucial in scientific research for organizing, understanding, and efficiently using data. In sequencing or bioimaging, for example, metadata can include gene and sample details, lab folder links, experimental procedures, experimenter names, dates, temperatures, sequencer and microscope types, organisms, and experiment types. This information can be stored in metadata sheets or in formats such as BIDS, OME-TIFF, OME-ZARR, or HDF5. Like other data, metadata can therefore be organized in multi-dimensional matrices and integrated into relational databases or other machine-readable data objects. From these databases, one can construct knowledge graphs that connect data points and support customized data searches, preparation, and analysis.

Standardized metadata, such as stable gene IDs, disease IDs, or experiment IDs, allows for effective organization and computational analysis. Automated workflows can provide overviews of the genes analyzed, the organisms investigated, and the available experiments and data, aiding research planning and management. Metadata also plays a pivotal role in AI experiments, because automated methods work best with standardized data formats.
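As a rough illustration of such an automated overview, the following minimal Python sketch counts how often each organism and gene occurs in a small set of metadata records. The record layout, the field names (experiment_id, gene_id, organism), and all IDs are placeholders assumed for this example, not a prescribed schema.

```python
from collections import Counter

# Hypothetical metadata records; field names and IDs are placeholders.
records = [
    {"experiment_id": "EXP-001", "gene_id": "GENE:0001", "organism": "Homo sapiens"},
    {"experiment_id": "EXP-002", "gene_id": "GENE:0001", "organism": "Mus musculus"},
    {"experiment_id": "EXP-003", "gene_id": "GENE:0002", "organism": "Homo sapiens"},
]


def summarize(records, key):
    """Count how many records share each value of the given metadata key."""
    return Counter(record[key] for record in records)


organism_overview = summarize(records, "organism")
gene_overview = summarize(records, "gene_id")

# Print a simple overview, most frequent values first.
for organism, count in organism_overview.most_common():
    print(f"{organism}: {count} experiment(s)")
```

Such a summary only works reliably when the values are standardized: if the same organism were entered as "human" in one record and "Homo sapiens" in another, the counts would be split.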

Ideally, metadata is stored in an electronic lab notebook such as Labfolder. There, it provides an overview of completed experiments, which further aids research coordination and planning. Project leaders can gain quick insights into ongoing research activities, identify bottlenecks, and plan resources efficiently (e.g., chemical usage, ordering, and needs estimated from past data).

Example: Metadata in sequencing

In sequencing, metadata can significantly enhance data management and integration. In addition to gene- and sample-specific information, experiment-related data can be recorded. For instance, metadata can include:

  • Study: the research question that is studied
  • Experiment type: nature of the experiment, e.g., infection model or KO study
  • Experimental procedure: detailed steps of the experiment
  • Organism: classification of the biological subject
  • Tissue and cell type: more detailed classification of the biological sample
  • Sequencer type: information about the equipment used
  • Data: link to raw and processed data
  • Computational analysis: link to analysis scripts
  • Cofactors: access date, temperature, and other cofactors during the experiment
  • Experimenter ID: identification of the researcher
  • eLabBook Link: direct reference to the corresponding lab notebook entry
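One way to represent such a record is as a set of key-value pairs that can be checked against an agreed minimum set. The sketch below is only an illustration: the key names, the choice of required keys in REQUIRED_KEYS, and all example values are assumptions made for this example and do not reflect any official standard.

```python
# Hypothetical minimum metadata set; a consortium would define its own.
REQUIRED_KEYS = {
    "study",
    "experiment_type",
    "organism",
    "tissue",
    "sequencer_type",
    "data_link",
    "experimenter_id",
}


def missing_keys(record, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty, in sorted order."""
    missing = []
    for key in sorted(required):
        value = record.get(key)
        if value is None or str(value).strip() == "":
            missing.append(key)
    return missing


# Example record with placeholder values; "tissue" is empty and
# "experimenter_id" is not filled in at all.
record = {
    "study": "Role of a candidate gene in infection",
    "experiment_type": "KO study",
    "organism": "Mus musculus",
    "tissue": "",
    "sequencer_type": "example sequencer",
    "data_link": "path/to/raw_data",
}

problems = missing_keys(record)
print("Missing metadata:", problems)
```

A check like this can run whenever a new dataset is registered, so that gaps are caught while the experimenter still remembers the details.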

Such metadata allows you to search easily for similar experiments in different tissues, integrate the data, and quickly gain new insights from existing datasets. However, the extent of metadata annotation should be balanced, because entering extensive detail by hand is time-consuming. Consortia should therefore establish metadata standards and minimum requirements. Organizations such as the NFDI provide guidelines and recommendations for a minimum metadata set in each field (e.g., bioimaging, sequencing). Even without advanced tools such as OMERO (RDM bioimaging), SODAR (RDM sequencing), VRE Charité, or Aruna (advanced RDM clouds), standard formats and key-value pairs can be maintained in simple formats like Excel, which are easily converted for SQL analysis or Python workflows. What matters most is that the consortium agrees on the necessary key-value pairs for its datasets, that is, on the minimum set of metadata within its research fields.
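The step from a simple metadata sheet to SQL analysis can be sketched with the Python standard library alone. The example assumes the sheet has been exported from Excel as CSV; the column names, the table name metadata, and all values are placeholders chosen for illustration.

```python
import csv
import io
import sqlite3

# Stand-in for a metadata sheet exported as CSV (placeholder values).
CSV_TEXT = """experiment_id,organism,tissue,experiment_type
EXP-001,Mus musculus,liver,KO study
EXP-002,Mus musculus,spleen,KO study
EXP-003,Homo sapiens,liver,infection model
"""


def load_metadata(csv_text):
    """Load CSV metadata rows into an in-memory SQLite table."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata ("
        "experiment_id TEXT PRIMARY KEY, organism TEXT, "
        "tissue TEXT, experiment_type TEXT)"
    )
    conn.executemany(
        "INSERT INTO metadata VALUES "
        "(:experiment_id, :organism, :tissue, :experiment_type)",
        rows,
    )
    return conn


conn = load_metadata(CSV_TEXT)

# Find all tissues in which the same experiment type was performed.
query = "SELECT tissue FROM metadata WHERE experiment_type = ? ORDER BY tissue"
ko_tissues = [row[0] for row in conn.execute(query, ("KO study",))]
print(ko_tissues)
```

Because the query relies on exact string matches, it only finds all relevant experiments if the agreed vocabulary (here, the experiment type) is entered consistently in every sheet.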

In essence, well-organized metadata streamlines data analysis, integration, and research planning and sets the stage for sophisticated AI-driven experiments, highlighting its critical role in modern scientific research.