
Quick Start: Uploading computational data in NOMAD

Attention

This part of the documentation is still work in progress.

This page provides an overview of using NOMAD with computational data. If you are completely new to NOMAD, we recommend first reading through the Navigating to NOMAD, Uploading and publishing data, and Exploring data tutorials.

Uploading data in NOMAD can be done in several ways:

  • By dragging and dropping your files into the PUBLISH > Uploads page: suitable for users with a relatively small amount of data.
  • By using the Python-based NOMAD API: suitable for users with larger datasets who need to automate their uploads (see the sketch below).
  • By using the shell command curl to send files to the upload: likewise suitable for automating uploads of larger datasets.

You can upload files one by one, or bundle them into .zip or .tar.gz archives to upload a larger number of files at once.
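For the API route, a minimal upload script might look as follows. This is a sketch assuming the public NOMAD installation at nomad-lab.eu; the credentials, file name, and upload name are placeholders:

```python
import requests

base_url = 'https://nomad-lab.eu/prod/v1/api/v1'

# Obtain an access token (placeholder credentials).
response = requests.get(
    f'{base_url}/auth/token',
    params={'username': 'myname', 'password': 'mypassword'})
token = response.json()['access_token']

# Upload a .zip file as the body of a POST request to the uploads endpoint.
with open('example_data.zip', 'rb') as f:
    response = requests.post(
        f'{base_url}/uploads',
        params={'upload_name': 'My computational upload'},
        headers={'Authorization': f'Bearer {token}', 'Accept': 'application/json'},
        data=f)
upload_id = response.json()['upload_id']
print(f'Created upload {upload_id}')
```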

Drag-and-drop uploads

On the top-left menu, click on PUBLISH > Uploads.

Navigate to the uploads page

You can then click on CREATE A NEW UPLOAD, or try one of the example uploads by clicking on ADD EXAMPLE UPLOADS and selecting one of the available options, including data from an ELN, various instruments, or computational software. For a clear demonstration of the entire process, we will use the following example data:

Download Example Data

This particular example represents a computational workflow to investigate some properties of Si2; however, the details are not important for our demonstration here.

After downloading the example .zip file, you can drag-and-drop it or click on the CLICK OR DROP FILES button to browse through your local directories.

File upload

After the files are uploaded, processing is triggered. This generally includes automatic identification of the uploaded files that are supported in NOMAD, followed by the corresponding processing to harvest all the relevant (meta)data. The precise details of the processing depend on the use case. For example, you can find out more about the processing of computational data in Processing of computational data.

You will receive an email when the upload processing is finished.
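If you upload via the API, you can also poll the upload until processing completes. A sketch, reusing base_url, token, and upload_id from the upload example above and assuming the upload endpoint exposes process_running, process_status, and last_status_message fields:

```python
import time
import requests

headers = {'Authorization': f'Bearer {token}'}  # token from the upload example

# Poll the upload until background processing has finished.
while True:
    upload = requests.get(
        f'{base_url}/uploads/{upload_id}', headers=headers).json()['data']
    if not upload['process_running']:
        break
    time.sleep(5)

print(upload['process_status'], upload.get('last_status_message'))
```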

Sections of the Uploads page

At the top of the uploads page, you can modify certain general metadata fields.

Top fields in uploads page

The name of the upload can be modified by clicking on the pen icon. The other icons correspond to:

  • Manage members: allows users to invite collaborators by assigning co-author and reviewer roles.
  • Download files: downloads all files present in the upload.
  • Reload: reloads the uploads page.
  • Reprocess: triggers the processing of the uploaded data again.
  • API: generates a JSON response to be used with the NOMAD API.
  • Delete the upload: completely deletes the upload.

The remainder of the uploads page is divided into four sections.

Prepare and upload your files

This section shows the files and folder structure of the upload. You can add a README.md file in the root directory, and its content will be shown above this section.

Uploaded files

Process data

This section shows the processed data and the generated entries in NOMAD.

Processed entries

Edit author metadata

This section allows users to edit certain metadata fields for all entries recognized in the upload. This includes comments, where you can add as much extra information as you want; references, where you can add a URL relevant to your upload (e.g., an article DOI); and datasets, where you can create a new dataset or add the uploaded data to an existing one (see Organizing data in datasets).

Edit author metadata
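The same fields can also be set programmatically. A sketch, assuming the upload metadata edit endpoint accepts a metadata dictionary with comment and references (the values shown are placeholders):

```python
import requests

headers = {'Authorization': f'Bearer {token}'}  # token from the upload example

# Edit the author metadata of all entries in the upload.
edit = {
    'metadata': {
        'comment': 'Example G0W0 workflow for Si2',
        'references': ['https://doi.org/10.xxxx/placeholder'],  # placeholder DOI
    }
}
requests.post(f'{base_url}/uploads/{upload_id}/edit', headers=headers, json=edit)
```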

Publish

This section lets the user publish the data with or without an embargo.

Publish button

Publishing

Congratulations, your data has been uploaded and successfully parsed! Now you can publish it and let other users browse through it and re-use it for other purposes.

Publish button

You can define a specific Embargo period of up to 36 months, after which the data will be made publicly available under the CC BY 4.0 license.

After publishing by clicking on PUBLISH, the uploaded files cannot be altered. However, you can still edit the metadata fields.
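Publishing can likewise be triggered through the API. A sketch, assuming the publish action endpoint and its embargo_length parameter (in months):

```python
import requests

headers = {'Authorization': f'Bearer {token}'}  # token from the upload example

# Publish the upload with a 12-month embargo; use 0 to publish immediately.
requests.post(
    f'{base_url}/uploads/{upload_id}/action/publish',
    params={'embargo_length': 12},
    headers=headers)
```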

Organizing data in datasets

You can organize your uploads and individual entries by grouping them into common datasets.

In the uploads page, click on EDIT AUTHOR METADATA OF ALL ENTRIES.

Under Datasets you can either Create a new dataset or Search for an existing dataset. After selecting the dataset, click on SUBMIT.

The new dataset will now be listed under PUBLISH > Datasets.

Datasets page

The icon allows you to assign a DOI to a specific dataset. Once a DOI has been assigned to a dataset, no more data can be added to it. The DOI can then be cited in your publication as a reference, e.g., see the Data availability statement in M. Kuban et al., Similarity of materials and data-quality assessment by fingerprinting, MRS Bulletin 47, 991-999 (2022).
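Datasets can also be created and filled via the API. A sketch, assuming a datasets endpoint and reusing the metadata edit call from above; the dataset name is a placeholder and the exact response shape may differ between NOMAD versions:

```python
import requests

headers = {'Authorization': f'Bearer {token}'}  # token from the upload example

# Create a new dataset (the name is a placeholder).
response = requests.post(
    f'{base_url}/datasets/', headers=headers,
    json={'dataset_name': 'Si2 GW example dataset'})
dataset_id = response.json()['dataset_id']  # field name assumed

# Add all entries of an upload to the dataset via the metadata edit endpoint.
requests.post(
    f'{base_url}/uploads/{upload_id}/edit', headers=headers,
    json={'metadata': {'datasets': [dataset_id]}})
```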

Processing of computational data

See From files to data and Processing for full explanations about data processing in NOMAD.

When data is uploaded to NOMAD, the software interprets the files and determines which of them are mainfiles. All other files in the upload are viewed as auxiliary files. The same upload may contain multiple mainfiles and auxiliary files, organized in a folder tree structure.

A mainfile is the main output file of a calculation. The presence of a mainfile in the upload is key for NOMAD to recognize a calculation. NOMAD supports an array of computational codes for first-principles calculations, molecular dynamics simulations, and lattice modeling, as well as workflow and database managers. For each code, NOMAD recognizes a single file as the mainfile. For example, the VASP mainfile is by default the vasprun.xml; if the vasprun.xml is not present in the upload, NOMAD searches for the OUTCAR file and assigns it as the mainfile (see VASP POTCAR stripping).
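Conceptually, mainfile matching works by testing each file's name and the first bytes of its contents against code-specific patterns. The following is an illustrative sketch only, not NOMAD's actual implementation, and the regular expressions are simplified examples:

```python
import re

# Simplified, illustrative matching rules (not NOMAD's real patterns).
MAINFILE_RULES = {
    'VASP': (re.compile(r'.*vasprun\.xml.*'), re.compile(rb'<modeling>')),
    'FHI-aims': (re.compile(r'.*\.out$'), re.compile(rb'Invoking FHI-aims')),
}

def match_code(path: str) -> str | None:
    """Return the first code whose name and content patterns match the file."""
    with open(path, 'rb') as f:
        head = f.read(4096)  # only the beginning of the file is inspected
    for code, (name_re, contents_re) in MAINFILE_RULES.items():
        if name_re.match(path) and contents_re.search(head):
            return code
    return None
```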

The files that are not mainfiles are auxiliary files. These can serve several purposes, and many are supported and recognized by the corresponding parser. For example, the band*.out or GW_band* files in FHI-aims are auxiliary files that allow the NOMAD FHI-aims parser to recognize band structures in DFT and GW, respectively.

You can see the full list of supported codes, mainfiles, and auxiliary files in the general NOMAD documentation under Supported parsers.

We recommend keeping the folder structure and files generated by the simulation code, while staying within the upload limits. Please also check our recommendations on Best Practices: preparing the data and folder structure.

Structured data with the NOMAD metainfo

Once the mainfile has been recognized, a new entry in NOMAD is created and a specific parser is called. The auxiliary files are searched for and accessed within the parser.

For this new entry, NOMAD generates a NOMAD archive. It contains all the (meta)information extracted from the unstructured raw data files in a structured, well-defined, and machine-readable format. This metadata provides context to the raw data, i.e., what the input methodological parameters were, which material the calculation was performed on, etc. We define the NOMAD Metainfo as the set of sections, sub-sections, and quantities used to organize the raw data into a structured schema. Further information about the NOMAD Metainfo is available in the general NOMAD documentation under Learn > Structured data.

The NOMAD metainfo
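You can retrieve this structured archive programmatically. A sketch using the public archive query endpoint; the query values and the required sections are examples:

```python
import requests

base_url = 'https://nomad-lab.eu/prod/v1/api/v1'

# Query public entries and download selected archive sections only.
response = requests.post(f'{base_url}/entries/archive/query', json={
    'query': {'results.material.elements': {'all': ['Si']}},
    'pagination': {'page_size': 1},
    'required': {'run': {'system': '*'}, 'workflow2': '*'},
})
data = response.json()['data']
archive = data[0]['archive']  # the structured (meta)data of the first entry
print(list(archive.keys()))
```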

NOMAD sections for computational data

Under the Entry / archive section, there are several sections and quantities populated by the parsers. For computational data, only the following sections are populated:

  • metadata: contains general, non-code-specific metadata. This is mainly information about authors, the entry creation time, identifiers (id), etc.
  • run: contains the raw data parsed and normalized into the structured NOMAD schema, i.e., all the raw data that can be translated into a structured form.
  • workflow2: contains metadata about the specific workflow performed within the entry. This is mainly a set of well-defined workflows, e.g., GeometryOptimization, and their parameters.
  • results: contains the normalized and search-indexed metadata. This is mainly relevant for searching, filtering, and visualizing data in NOMAD.

workflow and workflow2 sections: development and refactoring

You have probably noticed the name workflow2, alongside a section called workflow under archive. workflow is the old version of the workflow section, while workflow2 is the new one. Sometimes sections undergo a renaming or refactoring, in most cases to add new features or to polish them after years of user feedback. In this case, the workflow section will remain until all older entries containing it are reprocessed to transfer this information into workflow2.

Parsing

A parser is a Python module which reads the code-specific mainfile and auxiliary files and populates the run and workflow2 sections of the archive, along with all relevant sub-sections and quantities.

Parsers are added to NOMAD as plugins and are divided into a set of GitHub sub-projects under the main NOMAD repository.

Normalizing

After the parsing populates the run and workflow2 sections, an extra layer of Python modules is executed on top of the processed NOMAD metadata. This has two main purposes: 1. normalize or homogenize certain metadata parsed from different codes, and 2. populate the results section. An example is normalizing the density of states (DOS) to its size-intensive value, independently of the code used to calculate the DOS. The set of normalizers relevant for computational data is listed in /nomad/config/models.py and executed in the specific order defined there. Their roles are explained in more detail in Processing.
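As an illustration of the DOS example (a sketch only, not NOMAD's actual normalizer code), making a DOS size-intensive amounts to dividing the parsed values by a system-size measure such as the unit cell volume:

```python
import numpy as np

def normalize_dos(dos: np.ndarray, cell_volume: float) -> np.ndarray:
    """Illustrative sketch: turn a total DOS (states/energy) into a
    size-intensive DOS (states/energy/volume) so that values from
    different codes and cell sizes become comparable."""
    return dos / cell_volume

# Example: a toy DOS for a cell of 40.9 angstrom^3 (e.g., Si2).
dos = np.array([0.0, 1.2, 3.4, 2.1])
print(normalize_dos(dos, 40.9))
```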

Search indexing (and storing)

The last step is to store the structured metadata and pass some of it to the search index. The metadata passed to the search index is defined in the results section. It can then be searched by filtering on the Entries page of NOMAD or with a Python script that queries the NOMAD API.
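For example, a minimal search through the API might look as follows; the filter quantity and value are examples of search-indexed results metadata:

```python
import requests

base_url = 'https://nomad-lab.eu/prod/v1/api/v1'

# Search public entries by filtering on search-indexed metadata.
response = requests.post(f'{base_url}/entries/query', json={
    'query': {'results.method.simulation.program_name': 'exciting'},
    'pagination': {'page_size': 10},
})
for entry in response.json()['data']:
    print(entry['entry_id'])
```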

Entries OVERVIEW page

Once the parsers and normalizers finish, the Uploads page will show whether the processing of the entry was a SUCCESS or a FAILURE. The entry information can be browsed by clicking on the icon.

You will land on the OVERVIEW page of the entry. On the top menu you can further select the FILES page, the DATA page, and the LOGS page.

Overview page

The overview page contains a summary of the parsed metadata, e.g., tabular information about the material and methodology of the calculation (in this example, a G0W0 calculation performed with the exciting code on bulk Si2), and visualizations of the system and some relevant properties. Note that all this metadata is read directly from results.

LOGS page

On the LOGS page, you can find information about the processing. You can read error, warning, and critical messages, which can provide insight into why the processing of an entry was a FAILURE.

Logs page

We recommend that you Get support or contact our team if you encounter a FAILURE. These might be due to bugs that we are rapidly fixing, with varied origins: from a new version of a code that is not yet supported, to wrong handling of potential errors in the parser script, to a problem with the organization of the data in the folders. To minimize these situations, we suggest that you read Best Practices: preparing the data and folder structure.

DATA page

The DATA page contains all the structured NOMAD metainfo populated by the parser and normalizers. This is the most important page of the entry, as it contains all the relevant metadata that allows users to find that specific simulation.

Data page

Furthermore, you can click on the icon to download the NOMAD archive in JSON format.

Best Practices: preparing the data and folder structure

Attention

Under construction.

VASP POTCAR stripping

For VASP data, NOMAD complies with the licensing of the POTCAR files. In agreement with Georg Kresse, NOMAD extracts the most important information from the POTCAR files and stores it in stripped versions called POTCAR.stripped. The original POTCAR files are then automatically removed from the upload, so that you can safely publish your data.