Standard formats and pipelines
Standard formats and analysis pipelines are integral components of RDM for several reasons:
- Enhanced accessibility and sharing: standard formats ensure that data can be easily accessed and understood by others in the field, promoting collaboration and data sharing.
- Data quality: standards play a pivotal role in enhancing data quality by enforcing rules and guidelines governing the representation of data.
- Data preservation: standardized data formats help in long-term preservation, ensuring that data can be used and understood in the future.
- Compliance and ethical considerations: many funding agencies and journals require adherence to certain data management standards, ensuring that data is ethically sourced, stored, and shared.
- Reproducibility and validation: consistent analysis pipelines allow for the replication of research results, a cornerstone of scientific validation.
Standard Formats in RDM
Several standard formats are widely used across various fields for data storage and sharing; a short Python sketch of reading some of them follows the list:
- CSV/TSV: simple, text-based formats ideal for tabular data.
- JSON/XML/GraphML: useful for hierarchical or nested data; JSON and XML are widely used in web applications, while GraphML represents graph structures.
- TIFF/NIfTI/DICOM/OME-ZARR in imaging: TIFF is widely used for high-quality graphics, NIfTI for brain imaging data, DICOM for standardized medical imaging across various modalities, and OME-ZARR for storing and sharing large multidimensional datasets such as bioimaging data.
- FASTQ/BAM/GFF/HDF5 in sequencing: FASTQ stores biological sequences and quality scores, BAM compresses and indexes sequence alignments, GFF describes genomic features, and HDF5 organizes large amounts of complex data in sequencing and scientific research.
- STL/OBJ in tissue engineering and visualization: STL (stereolithography) describes 3D-printable models used in scaffold design, while OBJ represents complex tissue geometries in 3D modelling software.
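
Most of these formats can be opened with a few lines of Python. The sketch below is illustrative only: it assumes pandas and h5py are installed, and the file names and the HDF5 dataset path are hypothetical.

```python
# Minimal sketch of reading three common formats in Python.
# Assumes pandas and h5py are installed; file names and the
# HDF5 dataset path ("raw/signal") are hypothetical.
import json

import h5py
import pandas as pd

table = pd.read_csv("measurements.csv")      # tabular data (CSV)
with open("metadata.json") as fh:
    metadata = json.load(fh)                 # nested metadata (JSON)
with h5py.File("recording.h5", "r") as f:
    signal = f["raw/signal"][:]              # large numeric arrays (HDF5)
```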
Analysis Pipelines in RDM
Research analysis pipelines serve as foundational frameworks across scientific disciplines, ensuring the reproducibility and robustness of studies in fields such as genomics and imaging. Researchers develop tailored pipelines using programming languages such as Python, R, and MATLAB, together with specialized software (e.g., Fiji for imaging and Cell Ranger for sequencing), allowing for the precise processing and analysis of complex datasets.
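
As a toy illustration of such a tailored pipeline, the hedged sketch below chains a few scikit-image steps (denoise, threshold, segment); scikit-image is assumed to be installed and the input file name is hypothetical.

```python
# Minimal sketch of a tailored imaging pipeline in Python.
# Assumes scikit-image is installed; "cells.tif" is a hypothetical input.
from skimage import filters, io, measure

image = io.imread("cells.tif")                       # load a TIFF micrograph
smoothed = filters.gaussian(image, sigma=2)          # denoise
mask = smoothed > filters.threshold_otsu(smoothed)   # threshold
labels = measure.label(mask)                         # segment connected objects
print(f"Detected {labels.max()} objects")
```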
Databases, both proprietary and public (such as IDR, GEO, Ensembl, PRIDE), provide essential datasets that can be retrieved and processed through these pipelines. Modern RDM tools like SODAR and OMERO, as well as database systems such as MySQL, expose APIs for widely used programming languages, enabling automated, integrated analysis of these datasets.
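
As an illustration of such API access, the sketch below queries what we understand to be IDR's public JSON API; the endpoint URL and the response keys are assumptions based on IDR's documentation, not a verified recipe.

```python
# Minimal sketch of programmatic access to a public database.
# The endpoint and the "data"/"Name" keys are assumptions based on
# IDR's public JSON API documentation.
import requests

resp = requests.get(
    "https://idr.openmicroscopy.org/api/v0/m/projects/", timeout=30
)
resp.raise_for_status()
for project in resp.json().get("data", [])[:5]:
    print(project.get("Name"))  # list a few public imaging projects
```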
The adoption of standard data formats is critical for ensuring the reproducibility of scientific findings and the broader applicability of research methodologies. Pre-made pipelines offer streamlined solutions for researchers seeking efficiency, while platforms like Bioconductor, Galaxy, and FSL put advanced analytical tools within reach of a wider range of researchers.
Pipeline wrapper software such as Nextflow and Snakemake automates the integration of disparate software tools, simplifying the analysis process. This approach promotes open science by making methodologies as easy to share as a widely used protocol, fostering collaboration and innovation across the scientific community.
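
To make this concrete, here is a hedged sketch of a two-rule Snakemake workflow; the sample names, file layout, and the use of seqtk for read trimming are all illustrative assumptions, not a prescribed setup.

```snakemake
# Snakefile — minimal sketch of a two-step Snakemake workflow.
# Sample names, paths, and the seqtk trimming step are illustrative;
# seqtk is assumed to be installed.
SAMPLES = ["s1", "s2"]

rule all:
    input:
        "results/summary.txt"

rule trim:
    input:
        "data/{sample}.fastq"
    output:
        "trimmed/{sample}.fastq"
    shell:
        "seqtk trimfq {input} > {output}"

rule summarize:
    input:
        expand("trimmed/{sample}.fastq", sample=SAMPLES)
    output:
        "results/summary.txt"
    shell:
        "wc -l {input} > {output}"
```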
A case in point: the NWB data standard in neurophysiology
Neurodata Without Borders (NWB) is the best-known standardization framework in basic neurophysiology. The NWB project standardizes both neurophysiology data and metadata, with the overarching objective of promoting data sharing and collaboration within the neuroscience community. By defining a shared format for storing and exchanging data from neurophysiological experiments, NWB seeks to improve reproducibility, interoperability, and overall efficiency in the field. The first step toward FAIR compliance is to deploy common data standards through which data can flow among a variety of users and platforms; laboratories producing different kinds of data can then share them with the community in a structure based on those agreed standards. Because experiments and acquisition setups vary widely across laboratories (e.g., different file extensions and software), conversion pipelines that automatically translate lab-specific acquisitions into these interoperable, reusable formats are necessary. In collaboration with UKW's Defense Circuits Lab, the CRDM develops pipelines that convert specific data formats into the standardized NWB format.
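
As a hedged illustration of what such a conversion produces, the sketch below uses PyNWB (the reference Python API for NWB) to write a minimal NWB file containing one synthetic acquisition series; the session metadata and signal are placeholders, not the CRDM pipeline itself.

```python
# Minimal sketch of writing an NWB file with PyNWB.
# The session metadata and the synthetic signal are placeholders;
# a real conversion pipeline would map lab-specific files onto these objects.
from datetime import datetime
from uuid import uuid4

import numpy as np
from pynwb import NWBFile, NWBHDF5IO, TimeSeries

nwbfile = NWBFile(
    session_description="example recording session",
    identifier=str(uuid4()),
    session_start_time=datetime.now().astimezone(),
)
nwbfile.add_acquisition(
    TimeSeries(
        name="membrane_potential",
        data=np.random.randn(20000),   # synthetic 1-second trace
        unit="mV",
        starting_time=0.0,
        rate=20000.0,                  # sampling rate in Hz
    )
)
with NWBHDF5IO("session.nwb", "w") as io:
    io.write(nwbfile)
```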