How to write a parser¶
NOMAD uses parsers to automatically extract information from raw files and output that information into structured archives. Parsers can decide which files to act upon based on the filename, MIME type, or file contents, and can also decide which schema the information should be populated into.
This documentation shows you how to write a plugin entry point for a parser. You should read the documentation on getting started with plugins to have a basic understanding of how plugins and plugin entry points work in the NOMAD ecosystem.
Getting started¶
You can use our template repository to create an initial structure for a plugin containing a parser. The relevant part of the repository layout will look something like this:
nomad-example
├── src
│ ├── nomad_example
│ │ ├── parsers
│ │ │ ├── __init__.py
│ │ │ ├── myparser.py
├── LICENSE.txt
├── README.md
└── pyproject.toml
See the documentation on plugin development guidelines for more details on the best development practices for plugins, including linting, testing, and documenting.
Parser entry point¶
The entry point defines basic information about your parser and is used to automatically load the parser code into a NOMAD distribution. It is an instance of a ParserEntryPoint or its subclass, and it contains a load method which returns a nomad.parsing.Parser instance that will perform the actual parsing. You will learn more about the Parser class in the next sections. The entry point should be defined in */parsers/__init__.py like this:
from pydantic import Field
from nomad.config.models.plugins import ParserEntryPoint
class MyParserEntryPoint(ParserEntryPoint):
    def load(self):
        from nomad_example.parsers.myparser import MyParser

        return MyParser(**self.dict())
myparser = MyParserEntryPoint(
    name='MyParser',
    description='My custom parser.',
    mainfile_name_re=r'.*\.myparser',
)
Here you can see that a new subclass of ParserEntryPoint was defined. In this new class you can override the load method to determine how the Parser class is instantiated, but you can also extend the ParserEntryPoint model to add new configurable parameters for this parser as explained here. We also instantiate an object myparser from the new subclass. This is the final entry point instance in which you specify the default parameterization and other details about the parser. In the reference you can see all of the available configuration options for a ParserEntryPoint.
The entry point instance should then be added to the [project.entry-points.'nomad.plugin'] table in pyproject.toml in order for the parser to be automatically detected:
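With the package layout shown above, the table could look like the following; the exact entry point value is an assumption based on that layout:

[project.entry-points.'nomad.plugin']
myparser = "nomad_example.parsers:myparser"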
Parser class¶
The resource returned by a parser entry point must be an instance of the nomad.parsing.Parser class. In many cases you will, however, want to use the already existing nomad.parsing.MatchingParser subclass that takes care of the file matching process for you. This parser definition should be contained in a separate file (e.g. */parsers/myparser.py) and could look like this:
from typing import Dict

from nomad.datamodel import EntryArchive
from nomad.parsing import MatchingParser


class MyParser(MatchingParser):
    def parse(
        self,
        mainfile: str,
        archive: EntryArchive,
        logger=None,
        child_archives: Dict[str, EntryArchive] = None,
    ) -> None:
        logger.info('MyParser called')
If you are using the MatchingParser interface, the minimal requirement is that your class has a parse function, which takes as input:

- mainfile: Filepath to a raw file that the parser should open and run on
- archive: The EntryArchive object in which the parsing results will be stored
- logger: Logger that you can use to log parsing events
Note here that if you are using MatchingParser, the process of identifying which files the parse method is run against is taken care of by passing the required parameters to the instance in the load method. In the previous section, the load method looked something like this:
There we are passing all of the entry point configuration options to the parser instance, including things like mainfile_name_re and mainfile_contents_re. The MatchingParser constructor uses these parameters to set up the file matching appropriately. If you wish to take full control of the file matching process, you can use the nomad.parsing.Parser class and override the is_mainfile function.
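As a rough illustration, such an override could look like the sketch below. The is_mainfile signature shown here is an assumption and should be checked against the nomad.parsing.Parser source, and the matching criterion is hypothetical:

from typing import Iterable, Union

from nomad.parsing import Parser


class MyCustomParser(Parser):
    def is_mainfile(
        self,
        filename: str,
        mime: str,
        buffer: bytes,
        decoded_buffer: str,
        compression: str = None,
    ) -> Union[bool, Iterable[str]]:
        # Hypothetical criterion: match any file whose decoded contents
        # mention our code name.
        return 'super_code' in (decoded_buffer or '')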
Match your raw file¶
If you are using the MatchingParser interface, you can configure which files are matched directly in the ParserEntryPoint. For example, to match only certain file extensions and file contents, you can use the mainfile_name_re and mainfile_contents_re fields:
myparser = MyParserEntryPoint(
    name='MyParser',
    description='My custom parser.',
    mainfile_name_re=r'.*\.myparser',
    mainfile_contents_re=r'\s*\n\s*HELLO WORLD',
)
You can find all of the available matching criteria in the ParserEntryPoint reference.
Running a parser¶
Parsers automatically run on the matched files within a NOMAD distribution, but it is also possible to run them manually on specific files. This can be useful for testing and for integrating them with external software.
Using the CLI¶
If you have installed a NOMAD plugin into a Python virtual environment, you can run a parser from that plugin with the nomad command-line interface. The following command uses the CLI to parse a given input file and store the resulting JSON output in an output file (verify the flag against the CLI reference for your installed version):
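nomad parse --show-archive tests/data/example.out > output.json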
The parse command automatically matches the right parser to your file and runs it. To skip the parser matching, i.e. the process that determines which parser fits the given file, you can use the --parser argument to provide a parser entry point id, for example (the id shown below is an assumption based on the plugin layout above):
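nomad parse tests/data/example.out --parser 'nomad_example.parsers:myparser'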
You can check the CLI reference for nomad parse for the full list of arguments, but the following should get you started (a combined usage example follows the list):

- --show-metadata: Return a JSON representation of the basic metadata
- --skip-normalizers: Skip any normalizers
- --preview-plots: Optionally preview the generated plots
- --save-plot-dir <directory>: Specify a directory in which to save the plot images
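For instance, one could combine two of these flags; treat this as a sketch and verify the flags against the CLI reference:

nomad parse tests/data/example.out --skip-normalizers --show-metadata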
Within Python code¶
You can also invoke the NOMAD parsers using Python. This will give you the parse results as metainfo objects to conveniently analyze the results in Python. You can either run the parser class directly:
import logging

from nomad.datamodel import EntryArchive
from nomad_example.parsers.myparser import MyParser

p = MyParser()
a = EntryArchive()
p.parse('tests/data/example.out', a, logger=logging.getLogger())
print(a.m_to_dict())
Or alternatively through the parse function that is also used internally by the CLI:
from nomad.client import parse, normalize_all

# Match and run the parser
archives = parse('path/to/your/file')
# Run all normalizers
for archive in archives:
    normalize_all(archive)

# Get the 'main section' section_run as a metainfo object
section_run = archives[0].run[0]

# Get the same data as a JSON-serializable Python dict
python_dict = section_run.m_to_dict()
Parsing text files¶
ASCII text files are amongst the most common raw files. Here, we show you how to parse such text by matching specific regular expressions. For the following example, check out the master branch of the exampleparser project and examine the file to be parsed, tests/data/example.out:
2020/05/15
*** super_code v2 ***
system 1
--------
sites: H(1.23, 0, 0), H(-1.23, 0, 0), O(0, 0.33, 0)
latice: (0, 0, 0), (1, 0, 0), (1, 1, 0)
energy: 1.29372
*** This was done with magic source ***
*** x°42 ***
system 2
--------
sites: H(1.23, 0, 0), H(-1.23, 0, 0), O(0, 0.33, 0)
cell: (0, 0, 0), (1, 0, 0), (1, 1, 0)
energy: 1.29372
At the top there is some general information such as the date, the name of the code (super_code), and its version (v2). Then there is information for two systems (system 1 and system 2), separated by a string containing a code-specific value, magic source. Both system sections contain the quantities sites and energy, but each also has a unique quantity, latice and cell, respectively.
In order to convert the information from this file into the NOMAD archive, we first have to parse the necessary quantities. The nomad-lab Python package provides a text_parser module for declarative (i.e., semi-automated) parsing of text files. You can define text file parsers as follows:
import numpy as np

from nomad.parsing.file_parser import Quantity, TextParser

# Model is the metainfo section defined later in this guide
# (exampleparser/metainfo/example.py)
from exampleparser.metainfo.example import Model


def str_to_sites(string):
    sym, pos = string.split('(')
    pos = np.array(pos.split(')')[0].split(',')[:3], dtype=float)
    return sym, pos


calculation_parser = TextParser(
    quantities=[
        Quantity(
            'sites',
            r'([A-Z]\([\d\.\, \-]+\))',
            str_operation=str_to_sites,
            repeats=True,
        ),
        Quantity(
            Model.lattice,
            r'(?:latice|cell): \((\d)\, (\d), (\d)\)\,?\s*\((\d)\, (\d), (\d)\)\,?\s*\((\d)\, (\d), (\d)\)\,?\s*',
            repeats=False,
        ),
        Quantity('energy', r'energy: (\d\.\d+)'),
        Quantity(
            'magic_source',
            r'done with magic source\s*\*{3}\s*\*{3}\s*[^\d]*(\d+)',
            repeats=False,
        ),
    ]
)

mainfile_parser = TextParser(
    quantities=[
        Quantity('date', r'(\d\d\d\d\/\d\d\/\d\d)', repeats=False),
        Quantity('program_version', r'super\_code\s*v(\d+)\s*', repeats=False),
        Quantity(
            'calculation',
            r'\s*system \d+([\s\S]+?energy: [\d\.]+)([\s\S]+\*\*\*)*',
            sub_parser=calculation_parser,
            repeats=True,
        ),
    ]
)
The quantities to be parsed can be specified as a list of Quantity objects in TextParser. Each quantity should have a name and a regular expression (re) pattern to match the value. The matched value should be enclosed in a group (or groups) denoted by (...). In addition, we can specify the following arguments (a short sketch illustrating some of them follows the list):
- findall (default=True): Switches simultaneous matching of all quantities using re.findall. In this case, overlap between matches is not tolerated, i.e. two quantities cannot share the same block in the file. If this cannot be avoided, set findall=False to switch to re.finditer. This performs matching one quantity at a time, which is slower but has the benefit that matching is done independently of the other quantities.
- repeats (default=False): Switches on finding multiple matches for a quantity. By default, only the first match is returned.
- str_operation (default=None): An external function applied to the matched value to perform more specific string operations. In the above example, we defined str_to_sites to convert the parsed value of the atomic sites.
- sub_parser (default=None): A nested parser applied to the matched block. This can also be a TextParser object with a list of quantities to be parsed, or other FileParser objects.
- dtype (default=None): The data type of the parsed value.
- shape (default=None): The shape of the parsed data.
- unit (default=None): The pint unit of the parsed data.
- flatten (default=True): Switches splitting the parsed string into a flat list.
- convert (default=True): Switches automatic conversion of the parsed value.
- comment (default=None): A string preceding a line to ignore.
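As a minimal sketch of some of these arguments, consider the following quantity definition; the name and regex pattern are hypothetical:

import numpy as np

from nomad.parsing.file_parser import Quantity

# 'total_energy' and the pattern below are made up for illustration.
# dtype sets the numpy data type of the parsed value, unit attaches a
# pint unit, and repeats=False keeps only the first match.
energy_quantity = Quantity(
    'total_energy',
    r'total energy:\s*([-\d\.]+)\s*eV',
    dtype=np.float64,
    unit='eV',
    repeats=False,
)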
A metainfo.Quantity object can also be passed as the first argument in place of a name in order to define the data type, shape, and unit for the quantity. TextParser returns a dictionary of key-value pairs, where the keys are defined by the names of the quantities and the values are based on the matched re patterns.
To parse a file, specify the path to the file and call the parse() function of TextParser. Assuming the mainfile attribute used by the exampleparser project, this looks like:
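mainfile_parser.mainfile = 'tests/data/example.out'
mainfile_parser.parse()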
This will populate the mainfile_parser object with the parsed data, which can be accessed like a Python dict with quantity names as keys, or directly as attributes:

>>> mainfile_parser.get('date')
'2020/05/15'
>>> mainfile_parser.calculation
[TextParser(example.out) --> 4 parsed quantities (sites, lattice, energy, magic_source), TextParser(example.out) --> 3 parsed quantities (sites, lattice, energy)]
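Results from a sub_parser are nested, so quantities of an individual calculation can be reached with the same dict- or attribute-style access (a sketch following the pattern above):

first_calculation = mainfile_parser.calculation[0]
print(first_calculation.energy)        # attribute-style access
print(first_calculation.get('sites'))  # dict-style access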
The next step is to write the parsed data into the NOMAD archive. We can use one of the predefined plugins containing schema packages in NOMAD. However, to better illustrate the connection between a parser and a schema, we will define our own schema in this example (see How to write a schema in Python for additional information on this topic). We define a root section called Simulation containing two subsections, Model and Output. The definitions are found in exampleparser/metainfo/example.py:
import numpy as np

from nomad.datamodel.data import ArchiveSection
from nomad.metainfo import Datetime, Quantity, Reference, Section, SubSection


class Model(ArchiveSection):
    m_def = Section()

    n_atoms = Quantity(
        type=np.int32, description="""Number of atoms in the model system."""
    )
    labels = Quantity(
        type=str, shape=['n_atoms'], description="""Labels of the atoms."""
    )
    positions = Quantity(
        type=np.float64,
        shape=['n_atoms', 3],
        description="""Positions of the atoms.""",
    )
    lattice = Quantity(
        type=np.float64,
        shape=[3, 3],
        description="""Lattice vectors of the model system.""",
    )


class Output(ArchiveSection):
    m_def = Section()

    model = Quantity(
        type=Reference(Model), description="""Reference to the model system."""
    )
    energy = Quantity(
        type=np.float64,
        unit='eV',
        description="""Value of the total energy of the system.""",
    )


class Simulation(ArchiveSection):
    m_def = Section()

    code_name = Quantity(
        type=str, description="""Name of the code used for the simulation."""
    )
    code_version = Quantity(type=str, description="""Version of the code.""")
    date = Quantity(type=Datetime, description="""Execution date of the simulation.""")
    model = SubSection(sub_section=Model, repeats=True)
    output = SubSection(sub_section=Output, repeats=True)
Each of the classes that we defined is a subclass of ArchiveSection, the abstract class used in NOMAD to define sections and subsections in a schema. Subclassing it is required in order to assign these sections to the data section of the NOMAD archive. The Model section is used to store the sites and lattice/cell information, while the Output section is used to store the energy quantity.
The following is the implementation of the parse function of ExampleParser, which writes the parsed quantities from our mainfile parser into the archive:
import datetime

from nomad.datamodel import EntryArchive
from nomad.datamodel.metainfo.workflow import Workflow
from nomad.units import ureg

from exampleparser.metainfo.example import Model, Output, Simulation


def parse(self, mainfile: str, archive: EntryArchive, logger):
    # Run the text parser defined above on the given mainfile
    mainfile_parser.mainfile = mainfile
    mainfile_parser.parse()

    simulation = Simulation(
        code_name='super_code', code_version=mainfile_parser.get('program_version')
    )
    date = datetime.datetime.strptime(mainfile_parser.date, '%Y/%m/%d')
    simulation.date = date

    for calculation in mainfile_parser.get('calculation', []):
        model = Model()
        model.lattice = calculation.get('lattice')
        sites = calculation.get('sites')
        model.labels = [site[0] for site in sites]
        model.positions = [site[1] for site in sites]
        simulation.model.append(model)

        output = Output()
        output.model = model
        output.energy = calculation.get('energy') * ureg.eV

        magic_source = calculation.get('magic_source')
        if magic_source is not None:
            archive.workflow2 = Workflow(x_example_magic_value=magic_source)

        simulation.output.append(output)

    # put the simulation section into archive data
    archive.data = simulation
Now, run the parser again and check that the new archive stores the intended quantities from tests/data/example.out. Additionally, the standard normalizers will be applied. These run automatically during parsing; one can skip them by passing the --skip-normalizers argument.
Extending the Metainfo¶
There are several built-in schemas in NOMAD (nomad.datamodel.metainfo). In the example above, we have made use of the base section for workflows and extended it to include a code-specific quantity, x_example_magic_value.
from nomad.datamodel.metainfo.workflow import Workflow
from nomad.metainfo import Quantity, Section


# We extend the existing common definition of section Workflow
class ExampleWorkflow(Workflow):
    # We alter the default base class behavior to add all definitions to the existing
    # base class instead of inheriting from the base class
    m_def = Section(extends_base_section=True)

    # We define an additional example quantity. Use the prefix x_<parsername>_ to
    # denote non-common quantities.
    x_example_magic_value = Quantity(
        type=int, description='The magic value from a magic source.'
    )
This is the recommended approach for domain-specific schemas, such as those for simulation workflows. Refer to how to extend schemas.
Other FileParser classes¶
Aside from TextParser, other FileParser classes are also defined. These include (a usage sketch follows the list):

- DataTextParser: in addition to matching strings as in TextParser, this parser uses the numpy.loadtxt function to load structured data files. The loaded numpy.array data can then be accessed from the property data.
- XMLParser: uses the ElementTree module to parse an XML file. The parse method of the parser takes in an XPath-style key to access individual quantities. By default, automatic data type conversion is performed, which can be switched off by setting convert=False.
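As a rough usage sketch (the file names are placeholders, and the exact constructor arguments and return-value behavior should be verified against the nomad.parsing.file_parser source):

from nomad.parsing.file_parser import DataTextParser, XMLParser

# Structured numeric data: the array loaded via numpy.loadtxt is
# exposed through the `data` property
data_parser = DataTextParser(mainfile='tests/data/table.dat')
array = data_parser.data

# XML data: individual quantities are addressed with an XPath-style key;
# the parsed values can then be read back from the parser object
xml_parser = XMLParser(mainfile='tests/data/example.xml')
xml_parser.parse('atoms/positions')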