The NOMAD schemas and processed data system is designed to describe and manage intricate hierarchies of connected data. This is ideal for metadata and for many small data quantities, but it does not work for large quantities. Quantities are atomic and are always managed as a whole; there is currently no functionality to stream or slice large quantities. Consequently, tools that produce or work with such data cannot scale.
Large data quantities should be managed within HDF5 raw files. These can be files that are produced during processing (e.g. created by a parser) or files that are uploaded and already contain the large quantities. To describe these data, schemas and processed data can include references into HDF5 raw files. This makes it possible to describe large data quantities with metadata that is defined in schemas and contained in the processed data.
The data type HDF5Reference is a regular type that can be used for quantities in schemas. Its values are similar to reference values and contain the HDF5 file and a path to a group or field in that file.
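A reference value of this kind can be thought of as two parts: a file and an in-file path. The following minimal sketch illustrates that idea only. The `<file>#<path>` string syntax, the `H5Ref` container, and the `split_hdf5_reference` helper are assumptions made for illustration and are not taken from the NOMAD API; the file name is a placeholder.

```python
from typing import NamedTuple


class H5Ref(NamedTuple):
    """Hypothetical container for the two parts of a reference value."""

    file: str  # name of the HDF5 raw file
    path: str  # absolute path to a group or field inside that file


def split_hdf5_reference(value: str) -> H5Ref:
    """Split a reference string into file name and in-file path.

    ASSUMPTION: the value uses a '<file>#<path>' syntax. This is an
    illustrative convention, not a confirmed NOMAD format.
    """
    file, sep, path = value.partition('#')
    if not sep or not file or not path:
        raise ValueError(f'expected "<file>#<path>", got {value!r}')
    # Normalize to an absolute HDF5 path without a trailing slash.
    return H5Ref(file, '/' + path.strip('/'))


ref = split_hdf5_reference('example.h5#/data/process/log')
print(ref.file)
print(ref.path)
```

Keeping the file and the path as separate parts means a client only has to open the file and resolve the path when the large quantity is actually needed.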
For HDF5 files that are generated during processing, it is good practice to structure the HDF5 file in line with the schema. For example, an HDF5Reference quantity in a sub-section data.process.log should be stored in a field named log, in a sub-group that is part of the root group, so that the group hierarchy mirrors the sub-section path.
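The mapping from a schema path to an HDF5 location can be sketched as follows. The helper name `hdf5_location` is hypothetical, and the assumption that each sub-section name becomes one HDF5 group (so that `data.process.log` maps to the group `/data/process` and the field `log`) is one reading of the convention described above, not a confirmed NOMAD rule.

```python
from typing import Tuple


def hdf5_location(section_path: str) -> Tuple[str, str]:
    """Map a dotted schema path to (HDF5 group path, field name).

    ASSUMPTION: every sub-section name in the dotted path becomes one
    HDF5 group, and the last name becomes the field (dataset) name.
    """
    parts = section_path.split('.')
    if any(not part for part in parts):
        raise ValueError(f'invalid schema path: {section_path!r}')
    *groups, field = parts
    # A top-level quantity lives directly in the HDF5 root group "/".
    return '/' + '/'.join(groups), field


group_path, field_name = hdf5_location('data.process.log')
print(group_path)  # /data/process
print(field_name)  # log
```

With h5py, for instance, such a location could then be created by calling `require_group` with the group path on the open file and `create_dataset` with the field name on the resulting group.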
In a future version of NOMAD, processed data might be maintained partially or in full in HDF5 files. Structuring HDF5 files and processed data alike might simplify a later migration.
NOMAD clients (e.g. the NOMAD UI) can pick up on these HDF5Reference quantities and provide corresponding functionality (e.g. showing an H5Web view).
Code example, coming soon ...
Metadata for large quantities
Coming soon ...