Extending the Archive and Metainfo¶
In Using the Archive and Metainfo, we learned what the Archive and Metainfo are, and saw the Python interface and how to use it on Archive data. The Metainfo is written in Python code as a set of classes that define sections and their properties. Here, we will look at how the Metainfo classes work and how the Metainfo can be extended with new definitions.
Starting example¶
import ase.data

from nomad.metainfo import MSection, Quantity, SubSection, MEnum, Units


class System(MSection):
    '''
    A system section includes all quantities that describe a single simulated
    system (a.k.a. geometry).
    '''

    n_atoms = Quantity(
        type=int, description='''
        Defines the number of atoms in the system.
        ''')

    atom_labels = Quantity(
        type=MEnum(ase.data.chemical_symbols), shape=['n_atoms'])
    atom_positions = Quantity(type=float, shape=['n_atoms', 3], unit=Units.m)
    simulation_cell = Quantity(type=float, shape=[3, 3], unit=Units.m)
    pbc = Quantity(type=bool, shape=[3])


class Run(MSection):
    section_system = SubSection(sub_section=System, repeats=True)
We define a simple metainfo schema with two sections called System and Run. Sections allow you to organize related data into, well, sections. Each section can have two types of properties: quantities and sub-sections. Sections and their properties are defined with Python classes and their attributes.
Each quantity defines a piece of data. Basic quantity attributes are type, shape, unit, and description.
Sub-sections allow you to place sections within each other, forming containment hierarchies of sections and the respective data within them. Basic sub-section attributes are sub_section (i.e. a reference to the section definition of the sub-section) and repeats (determines whether a sub-section can be included once or multiple times).
The above simply defines a schema. To use the schema and create actual data, we have to instantiate the above classes:
run = Run()
system = run.m_create(System)
system.n_atoms = 3
system.atom_labels = ['H', 'H', 'O']
print(system.atom_labels)
print(system.n_atoms)
Section instances can be used like regular Python objects: quantities and sub-sections can be set and accessed like any other Python attribute. Special meta-info methods, starting with m_, allow us to realize more complex semantics. For example, m_create will instantiate a sub-section and add it to the parent section in one step.
Another example of an m_-method is serialization, which converts the data into JSON.
Definitions and Instances¶
As you have already seen in the example, we must first define what data can look like (schema) before we can actually program with it. Since schema and data are often discussed in the same context, it is of utmost importance to clearly distinguish between the two. For example, if we just say "system", it is unclear what we refer to. We could mean the idea of a system, i.e. all possible systems, a data structure that comprises a lattice, or atoms with their elements and positions in the lattice. Or we could mean a specific system of a specific calculation, with a concrete set of atoms, and real numbers for lattice vectors and atom positions as concrete data.
The NOMAD Metainfo is just a collection of definitions that describe what materials science data might look like (a schema). The NOMAD Archive is all the data that we extract from all data provided to NOMAD. The data in the NOMAD Archive follows the definitions of the NOMAD Metainfo.
Similarly, we need to distinguish between the NOMAD Metainfo as a collection of definitions and the Metainfo system that defines how to define a section or a quantity. In this sense, we have a three-layer model where the Archive (data) is an instance of the Metainfo (schema) and the Metainfo is an instance of the Metainfo system (schema of the schema).
This documentation describes the Metainfo by explaining the means of how to write down definitions in Python. Conceptually we map the Metainfo system to Python language constructs, e.g. a section definition is a Python class, a quantity a Python property, etc. If you are familiar with databases, this is similar to what an object relational mapping (ORM) would do.
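The idea of mapping definitions to Python constructs can be illustrated with plain Python descriptors. The following is only a minimal sketch and not NOMAD's implementation: ToyQuantity and ToySystem are hypothetical stand-ins that show how one class attribute can act as a definition (when accessed on the class) and as a data accessor (when accessed on an instance).

```python
class ToyQuantity:
    """Hypothetical stand-in for a quantity definition (not the NOMAD class)."""

    def __init__(self, type):
        self.type = type
        self.name = None

    def __set_name__(self, owner, name):
        # The definition takes its name from the Python attribute.
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # accessed on the class: return the definition
        return instance.__dict__.get(self.name)  # accessed on an instance: return the data

    def __set__(self, instance, value):
        # Reject values that do not match the declared type.
        if not isinstance(value, self.type):
            raise TypeError(f'{self.name} expects {self.type.__name__}')
        instance.__dict__[self.name] = value


class ToySystem:
    n_atoms = ToyQuantity(type=int)


system = ToySystem()
system.n_atoms = 3              # data (instance level)
definition = ToySystem.n_atoms  # schema (definition level)
```

Here system.n_atoms holds a concrete value, while ToySystem.n_atoms is the object that describes what such a value may look like, which mirrors the distinction between Archive data and Metainfo definitions.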
Common attributes of Metainfo Definitions¶
In the example, you have already seen the basic Python interface to the Metainfo. Sections are represented in Python as objects. To define a section, you write a Python class that inherits from MSection. To define sub-sections and quantities, you use Python properties. The definitions themselves are also objects derived from classes. For sub-sections and quantities, you directly instantiate SubSection and Quantity. For sections, there is a generated object derived from Section that is available via m_def from each section class and section instance.
These Python classes, used to represent metainfo definitions, form an inheritance hierarchy to share common properties:
- name: each definition has a name. This is typically defined by the corresponding Python property. E.g. a section's class name becomes the section name; a quantity gets the name from its Python property, etc.
- description: each definition should have one. Either set it directly or use docstrings.
- links: a list of useful internet references.
- more: a dictionary of custom information. Any additional kwargs set when creating a definition are added to more.
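How additional keyword arguments can end up in more is sketched below. ToyDefinition is a hypothetical class written only for illustration; it mimics the four common attributes and is not the NOMAD base class.

```python
class ToyDefinition:
    """Hypothetical stand-in that mimics the common definition attributes."""

    def __init__(self, name=None, description=None, links=None, **kwargs):
        self.name = name
        self.description = description
        self.links = links or []
        # Keyword arguments that are not known attributes are collected in `more`.
        self.more = dict(kwargs)


definition = ToyDefinition(
    name='n_atoms', description='Number of atoms.', label='Atoms')
```

In this sketch, label is not a known attribute, so it is kept in definition.more instead of raising an error.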
Sections¶
Sections are defined with Python classes that extend MSection (or other section classes).
- base_sections are automatically taken from the base classes of the Python class.
- extends_base_section is a boolean that determines the inheritance. If this is False, normal Python inheritance applies and this section will inherit all properties (sub-sections, quantities) from all base classes. If this is True, all definitions in this section will be added to the properties of the base class section. This allows the extension of existing sections with additional properties.
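The difference between the two modes can be sketched in plain Python. This is a toy model under stated assumptions, not how NOMAD implements it: ToySection is hypothetical, and the flag is a simple class attribute here, whereas in the Metainfo it is set on the section definition.

```python
class ToySection:
    """Hypothetical base class that only mimics the two inheritance modes."""

    extends_base_section = False

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Only react if the flag is set directly on the new class.
        if cls.__dict__.get('extends_base_section', False):
            base = cls.__mro__[1]
            for name, value in cls.__dict__.items():
                if not name.startswith('__') and name != 'extends_base_section':
                    # Add the new property to the existing base section.
                    setattr(base, name, value)


class Method(ToySection):
    basis_set = 'plane waves'


class MyCodeMethod(Method):
    extends_base_section = True
    x_mycode_execution_mode = 'hpc'


class OtherMethod(Method):
    # Normal inheritance: Method itself is left unchanged.
    other_property = 1
```

After these definitions, Method has gained x_mycode_execution_mode, while other_property exists only on OtherMethod.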
Quantities¶
Quantity definitions are the main building block of meta-info schemas. Each quantity represents a single piece of data. Quantities can be defined by:
- A type, which can be a primitive Python type (str, int, bool), a numpy data type (np.dtype('float64')), a MEnum('item1', ..., 'itemN'), a predefined metainfo type (Datetime, JSON, File, ...), or another section or quantity to define a reference type.
- A shape that defines the dimensionality of the quantity. Examples are: [] (number), ['*'] (list), [3, 3] (3 by 3 matrix), ['n_elements'] (a vector of length defined by another quantity n_elements).
- A physical unit. We use Pint here. You can use unit strings that are parsed by Pint, e.g. meter, m, m/s^2. As a convention the metainfo uses only SI units.
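To make the shape notation concrete, here is a small sketch that checks nested Python lists against a shape. The function matches_shape and its context argument are hypothetical helpers written for this illustration; they are not part of the NOMAD API.

```python
def matches_shape(value, shape, context):
    """Check a nested-list value against a shape such as ['n_atoms', 3].

    `context` maps quantity names to their values, so that a named
    dimension like 'n_atoms' can be looked up.
    """
    if not shape:
        return not isinstance(value, list)  # [] means a single value
    if not isinstance(value, list):
        return False
    dim = shape[0]
    if dim != '*':  # '*' allows any length
        expected = context[dim] if isinstance(dim, str) else dim
        if len(value) != expected:
            return False
    # Every item has to match the remaining dimensions.
    return all(matches_shape(item, shape[1:], context) for item in value)


context = {'n_atoms': 3}
labels_ok = matches_shape(['H', 'H', 'O'], ['n_atoms'], context)
cell_ok = matches_shape([[1, 0, 0], [0, 1, 0], [0, 0, 1]], [3, 3], context)
positions_ok = matches_shape([[0, 0, 0], [1, 1, 1]], ['n_atoms', 3], context)
```

The last check fails in this sketch because only two positions are given for three atoms.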
Sub-Section¶
A sub-section defines a named property of a section that refers to another section. It defines that a section can contain another section.
- sub_section (aliases section_def, sub_section_def) defines the section that can be contained.
- repeats is a boolean that determines whether the sub-section relationship allows multiple sections or only one.
References and Proxies¶
Besides creating hierarchies (e.g. tree structures) with sub-sections, the metainfo also allows you to create cross references between sections and other sections or quantity values:
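class Calculation(MSection):
    # a reference to a System section
    system = Quantity(type=System.m_def)
    # a reference to the atom_labels value of a System section
    atom_labels = Quantity(type=System.atom_labels)


calc = Calculation()
# for both kinds of references, assign the section instance
calc.system = run.section_system[-1]
calc.atom_labels = run.section_system[-1]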
To define a reference, define a normal quantity and simply use the section or quantity you want to refer to as type. Then you can assign respective section instances as values.
In Python memory, quantity values that reference other sections simply contain a Python reference to the respective section instance. However, upon serializing/storing metainfo data, these references have to be represented differently.
Value references are a little different. When you read a value reference, it behaves like the referenced value. Internally, we do not store the value, but a reference to the section that holds the referenced quantity. Therefore, when you want to assign a value reference, use the section with the quantity and not the value itself.
References are serialized as URLs. There are different types of reference URLs:
- #/run/0/calculation/1, a reference in the same Archive
- /run/0/calculation/1, a reference in the same archive (legacy version)
- ../upload/archive/mainfile/{mainfile}#/run/0, a reference into an Archive of the same upload
- /entries/{entry_id}/archive#/run/0/calculation/1, a reference into the Archive of a different entry on the same NOMAD installation
- /uploads/{upload_id}/archive/{entry_id}#/run/0/calculation/1, similar to the previous one but based on uploads
- https://myoasis.de/api/v1/uploads/{upload_id}/archive/{entry_id}#/run/0/calculation/1, a global reference towards a different NOMAD installation (Oasis)
The host and path parts of URLs correspond with the NOMAD API. The anchors are paths from the root section of an Archive, over its sub-sections, to the referenced section or quantity value. Each path segment is the name of the subsection or an index in a repeatable subsection: /system/0 or /system/0/atom_labels.
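How such an anchor path is followed can be sketched on plain dictionaries and lists. This is an illustration only: resolve_path is a hypothetical helper, the archive is a hand-written dictionary, and some_entry_id is a made-up placeholder; NOMAD resolves references internally in its own way.

```python
from urllib.parse import urldefrag


def resolve_path(archive, path):
    """Follow an anchor path such as /run/0/system/0 through nested dicts and lists."""
    current = archive
    for segment in path.strip('/').split('/'):
        if isinstance(current, list):
            current = current[int(segment)]  # index in a repeatable sub-section
        else:
            current = current[segment]  # name of a sub-section or quantity
    return current


# A hand-written stand-in for serialized Archive data.
archive = {'run': [{'system': [{'atom_labels': ['H', 'H', 'O']}]}]}

# Split the reference URL into the API part and the anchor.
url, anchor = urldefrag('/entries/some_entry_id/archive#/run/0/system/0/atom_labels')
labels = resolve_path(archive, anchor)
```

The part before the # identifies the Archive, and the anchor after it is walked segment by segment from the root section.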
References are automatically serialized by MSection.m_to_dict. When de-serializing data with MSection.m_from_dict, these references are not resolved right away, because the referenced section might not yet be available. Instead, references are stored as MProxy instances. These objects are automatically replaced by the referenced object when the respective quantity is accessed.
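The idea of resolving a reference only on first access can be sketched as follows. ToyProxy and ToyCalculation are hypothetical classes used for illustration; they only mimic the behaviour described above and do not reflect how MProxy is implemented.

```python
class ToyProxy:
    """Hypothetical placeholder that remembers a reference path until it is needed."""

    def __init__(self, path):
        self.path = path


class ToyCalculation:
    def __init__(self, root, system_ref):
        self._root = root
        # After de-serialization, only the unresolved reference is stored.
        self._system = ToyProxy(system_ref)

    @property
    def system(self):
        if isinstance(self._system, ToyProxy):
            target = self._root
            for segment in self._system.path.strip('#/').split('/'):
                target = target[int(segment)] if isinstance(target, list) else target[segment]
            # Replace the proxy with the referenced object.
            self._system = target
        return self._system


root = {'run': [{'system': [{'n_atoms': 3}]}]}
calc = ToyCalculation(root, '#/run/0/system/0')
was_proxy = isinstance(calc._system, ToyProxy)  # nothing resolved yet
resolved = calc.system                          # first access resolves the reference
```

Because the lookup happens on access, the referenced data only needs to exist by the time the quantity is first read.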
If you want to define references, it might not be possible to define the referenced section or quantity beforehand, due to the way Python definitions and imports work. In these cases, you can use a proxy in place of the referenced type. There is a special proxy implementation for sections, SectionProxy. The strings given to SectionProxy are paths within the available definitions. For example, a proxy for System works if System is eventually defined in the same package.
Categories¶
In the old meta-info this was known as abstract types.
Categories are defined with Python classes that have MCategory as base class. Their name and description are taken from the name and docstring of the class. An example category looks like this:
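from nomad.metainfo import MCategory, Category


class CategoryName(MCategory):
    ''' Category description '''
    # ParentCategory stands for another, previously defined category
    m_def = Category(links=['http://further.explanation.eu'], categories=[ParentCategory])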
Packages¶
Metainfo packages correspond to Python packages. Typically your metainfo Python files should follow this pattern:
from nomad.metainfo import Package
m_package = Package()
# Your section classes and categories
m_package.__init_metainfo__()
Adding definitions to the existing metainfo schema¶
Now you know how to define new sections and quantities, but how should your additions be integrated in the existing schema and what conventions need to be followed?
Metainfo schema super structure¶
The EntryArchive section definition sets the root of the archive for each entry in NOMAD. It therefore defines the top level sections:
- metadata, all "administrative" metadata (ids, permissions, publish state, uploads, user metadata, etc.)
- results, a summary with copies and references to data from method specific sections. This also presents the searchable metadata.
- workflows, all workflow metadata
- Method specific sub-sections, e.g. run. This is where all parsers are supposed to add the parsed data.
The main NOMAD Python project includes Metainfo definitions in the following modules:
- nomad.metainfo defines the Metainfo itself. This includes a self-referencing schema. E.g. there is a section Section, etc.
- nomad.datamodel mostly defines the section metadata that contains all "administrative" metadata. It also contains the root section EntryArchive.
- nomad.datamodel.metainfo defines all the central, method specific (but not parser specific) definitions. For example the section run with all the simulation definitions (computational materials science definitions) that are shared among the respective parsers.
Extending existing sections¶
Parsers can provide their own definitions. By convention, these are placed into a metainfo sub-module of the parser Python module. The definitions here can add properties to existing sections (e.g. from nomad.datamodel.metainfo). This is done with the extends_base_section Section property. By convention, use an x_mycode_ prefix for such properties. Here is an example:
from nomad.metainfo import Section, Quantity, MEnum
from nomad.datamodel.metainfo.simulation import Method


class MyCodeRun(Method):
    m_def = Section(extends_base_section=True)
    x_mycode_execution_mode = Quantity(
        type=MEnum('hpc', 'parallel', 'single'), description='...')
Metainfo schema conventions¶
- Use lower snake case for section properties; use upper camel case for section definitions.
- Use a _ref suffix for references.
- Use subsections rather than inheritance to add specific quantities to a general section. E.g. the section workflow contains a section geometry_optimization for all geometry optimization specific workflow quantities.
- Prefix parser-specific and user-defined definitions with x_name_, where name is the short handle of a code name or other special method prefix.
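The naming conventions above lend themselves to a simple automated check. The following sketch uses regular expressions; check_property_name and check_section_name are hypothetical helpers written for this illustration and are not part of NOMAD.

```python
import re

SNAKE_CASE = re.compile(r'^[a-z][a-z0-9]*(_[a-z0-9]+)*$')
CAMEL_CASE = re.compile(r'^([A-Z][a-z0-9]*)+$')


def check_property_name(name, is_reference=False, code_prefix=None):
    """Return a list of convention violations for a section property name."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append('not lower snake case')
    if is_reference and not name.endswith('_ref'):
        problems.append('reference without _ref suffix')
    if code_prefix and not name.startswith(f'x_{code_prefix}_'):
        problems.append(f'missing x_{code_prefix}_ prefix')
    return problems


def check_section_name(name):
    """Return a list of convention violations for a section definition name."""
    return [] if CAMEL_CASE.match(name) else ['not upper camel case']


ok = check_property_name('x_mycode_execution_mode', code_prefix='mycode')
bad_ref = check_property_name('system', is_reference=True)
```

An empty list means that the name follows the conventions covered by this sketch; the subsection-over-inheritance convention is a design guideline and is not checked here.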