Data Access (Archive)

Of course, you can access the NOMAD Archive directly via the NOMAD API (see the API tutorial and API reference). But, it is more effective and convenient to use NOMAD’s Python client library.

Install the NOMAD client library

The NOMAD client library is a Python module (part of the nomad Python package) that allows to access the NOMAD archive to retrieve and analyse (large amounts) of NOMAD’s archive data. It allows to use queries to filter for desired entries, bulk download the required parts of the respective archives, and navigate the results using NOMAD’s metainfo Python API.

To install the NOMAD Python package, you can use pip install to install our source distribution

pip install nomad-lab

First example

'''
A simple example that uses the NOMAD client library to access the archive.
'''

from nomad.client import ArchiveQuery
from nomad.metainfo import units


query = ArchiveQuery(
    # url='http://nomad-lab.eu/prod/rae/beta/api',
    query={
        '$and': {
            'dft.code_name': 'VASP',
            '$not': {
                'atoms': ["Ti", "O"]
            }
        }
    },
    required={
        'section_run': {
            'section_single_configuration_calculation': {
                'energy_total': '*'
            },
            'section_system': '*'
        }
    },
    per_page=10,
    max=None)


print(query)

# for i, result in enumerate(query):
#     if i < 10:
#         calc = result.section_run[0].section_single_configuration_calculation[-1]
#         energy = calc.energy_total
#         formula = calc.single_configuration_calculation_to_system_ref.chemical_composition_reduced
#         print('%s: energy %s' % (formula, energy.to(units.hartree)))

This script should yield a result like this:

Number queries entries: 7628
Number of entries loaded in the last api call: 10
Bytes loaded in the last api call: 118048
Bytes loaded from this query: 118048
Number of downloaded entries: 10
Number of made api calls: 1

Cd2O2: energy -11467.827149010665 hartree
Sr2O2: energy -6551.45699684026 hartree
Sr2O2: energy -6551.461104765451 hartree
Be2O2: energy -178.6990610734937 hartree
Ca2O2: energy -1510.3938165430286 hartree
Ca2O2: energy -1510.3937761449583 hartree
Ba2O2: energy -16684.667362890417 hartree
Mg2O2: energy -548.9736595672932 hartree
Mg2O2: energy -548.9724185656775 hartree
Ca2O2: energy -1510.3908614326358 hartree

Let’s discuss the different elements here. First, we have a set of imports. The NOMAD source codes comes with various sub-modules. The client module contains everything related to what is described here; the metainfo is the Python interface to NOMAD’s common archive data format and its data type definitions; the config module simply contains configuration values (like the URL to the NOMAD API).

Next, we create an ArchiveQuery instance. This object will be responsible for talking to NOMAD’s API for us in a transparent and lazy manner. This means, it will not download all data right away, but do so when we are actually iterating through the results.

The archive query takes several parameters:

  • The query is a dictionary of search criteria. The query is used to filter all of NOMAD’s entry down to a set of desired entries. You can use NOMAD’s GUI to create queries and copy their Python equivalent with the <>-code button on the result list.

  • The required part, allows to specify what parts of the archive should be downloaded. Leave it out to download the whole archives. Based on NOMAD’s Metainfo (the ‘schema’ of all archives), you can determine what sections to include and which to leave out. Here, we are interested in the first run (usually entries only have one run) and the first calculation result.

  • With the optional per_page you can determine, how many results are downloaded at a time. For bulk downloading many results, we recommend ~100. If you are just interested in the first results a lower number might increase performance.

  • With the optional max, we limit the maximum amount of entries that are downloaded, just to avoid accidentely iterating through a result set of unknown and potentially large size.

When you print the archive query object, you will get some basic statistics about the query and downloaded data.

The archive query object can be treated as a Python list-like. You use indices and ranges to select results. Here we iterate through a slice and print the calculated energies from the first calculation of the entries. Each result is a Python object with attributes governed by the NOMAD Metainfo. Quantities yield numbers, string, or numpy arrays, while sub-sections return lists of further objects. Here we navigate the sections section_run and sub-section section_system to access the quantity energy_total. This quantity is a number with an attached unit (Joule), which can be converted to something else (e.g. Hartree).

The create query object keeps all results in memory. Keep this in mind, when you are accessing a large amount of query results. You should use ArchiveQuery.clear() to remove unnecessary results.

The NOMAD Metainfo

You can imagine the NOMAD Metainfo as a complex schema for hiearchically organized scientific data. In this sense, the NOMAD Metainfo is a set of data type definitions. These definitions then govern how the archive for an data entry in NOMAD might look like. You can browse the hierarchy of definitions in our Metainfo browser.

Be aware, that the definitions entail everything that an entry could possibly contain, but not all entries contain all sections and all quantities. What an entry contains depends on the information that the respective uploaded data contained, what could be extracted, and of course what was calculated in the first place. To see what the archive of an concrete entry looks like, you can use the search interface, select an entry from the list fo search results, and click on the Archive tab.

To see inside an archive object in Python, you can use nomad.metainfo.MSection.m_to_dict() which is provided by all archive objects. This will convert a (part of an) archive into a regular, JSON-serializable Python dictionary.

For more details on the metainfo Python interface, consult the metainfo documentation.

The ArchiveQuery class

class nomad.client.ArchiveQuery(query: dict = None, required: dict = None, url: str = None, username: str = None, password: str = None, parallel: int = 1, per_page: int = 10, max: int = 10000, raise_errors: bool = False, authentication: Union[Dict[str, str], nomad.client.KeycloakAuthenticator] = None)

Object of this class represent a query on the NOMAD Archive. It is solely configured through its constructor. After creation, it implements the Python Sequence interface and therefore acts as a sequence of query results.

Not all results are downloaded at once, expect that this class will continuesly pull results from the API, while you access or iterate to the far side of the result list.

query

A dictionary of search parameters. Consult the search API to get a comprehensive list of parameters.

required

A potentially nested dictionary of sections to retrieve.

url

Optional, override the default NOMAD API url.

username

Optional, allows authenticated access.

password

Optional, allows authenticated access.

per_page

Determine how many results are downloaded per page (or scroll window). Default is 10.

max

Optionally determine the maximum amount of downloaded archives. The iteration will stop if max is surpassed even if more results are available. Default is 10.000. None value will set it to unlimited.

raise_errors

There situations where archives for certain entries are unavailable. If set to True, this cases will raise an Exception. Otherwise, the entries with missing archives are simply skipped (default).

authentication

Optionally provide detailed authentication information. Usually, providing username and password should suffice.

parallel

Number of processes to use to retrieve data in parallel. Only data from different uploads can be retrieved in parallel. Default is 1. The argument per_page will refer to archived retrieved in one process per call.

__init__(query: dict = None, required: dict = None, url: str = None, username: str = None, password: str = None, parallel: int = 1, per_page: int = 10, max: int = 10000, raise_errors: bool = False, authentication: Union[Dict[str, str], nomad.client.KeycloakAuthenticator] = None)

Initialize self. See help(type(self)) for accurate signature.

Working with private data

Public NOMAD data can be accessed without any authentication; everyone can use our API without the need for an account or login. However, if you want to work with your own data that is not yet published, or embargoed data was shared with you, you need to authenticate before accessing this data. Otherwise, you will simply not find it with your queries. To authenticate simply provide your NOMAD username and password to the ArchiveQuery constructor.