# Introduction

**NOvel MAterials Discovery (NOMAD)** comprises storage, processing, management, discovery, and analytics of computational material science data from over 40 community *codes*. The original NOMAD software, developed by the [NOMAD-coe](http://nomad-coe.eu) project, is used to host over 50 million total energy calculations in a single central infrastructure instance that offers a variety of services (*repository*, *archive*, *encyclopedia*, *analytics*, *visualization*).

.. figure:: /assets/nomad.png
   :alt: nomad's overall structure

This is the documentation of **nomad@FAIRDI**, the Open Source continuation of the original NOMAD-coe software. It reconciles the original code base, integrates its services, allows 3rd parties to run individual and federated instances of the nomad infrastructure, opens nomad to other material science domains, and applies the FAIRDI principles as promoted by the [FAIRDI Data Infrastructure e.V.](https://fairdi.eu). A central and publicly available instance of the nomad software is run at the [MPCDF](https://www.mpcdf.mpg.de/) in Garching, Germany. Software development and the operation of NOMAD are carried out by the [NOMAD Laboratory](https://nomad-lab.eu).

The nomad software runs as SaaS on a server and is used via a web-based GUI and a RESTful API. Originally developed and hosted as individual services, **nomad@FAIRDI** now provides all services behind one GUI and API in a single coherent, integrated, and modular software project.

This documentation covers only the nomad *software*: its architecture, how to contribute, the code reference, and the engineering and operation of nomad.

## Architecture

The following depicts the *nomad@FAIRDI* architecture with respect to software components: Python modules, GUI components, and 3rd-party services (e.g. databases, search engines, etc.). It comprises a revised version of the repository and archive.

.. figure:: /assets/components.png
   :alt: nomad components

Besides various scientific computing, machine learning, and computational material science libraries (e.g. numpy, scikit-learn, tensorflow, ase, spglib, matid, and many more), nomad uses a set of freely available or Open Source technologies that already solve most of its processing, storage, availability, and scaling goals. The following is a non-comprehensive overview of the languages, libraries, frameworks, and services used.

### Python 3

The *backend* of nomad is written in Python. This includes all parsers, normalizers, and other data processing. We use Python 3 only; there is no compatibility with Python 2. Code is formatted close to [pep8](https://www.python.org/dev/peps/pep-0008/), and critical parts use [pep484](https://www.python.org/dev/peps/pep-0484/) type hints. [Pycodestyle](https://pypi.org/project/pycodestyle/), [pylint](https://www.pylint.org/), and [mypy](http://mypy-lang.org/) (static type checker) are used to ensure quality. Tests are written with [pytest](https://docs.pytest.org/en/latest/contents.html). Logging is done with [structlog](https://www.structlog.org/en/stable/) and *logstash* (see Elasticstack below). Documentation is driven by [Sphinx](http://www.sphinx-doc.org/en/master/).

### celery

[Celery](http://celeryproject.org) (+ [rabbitmq](https://www.rabbitmq.com/)) is a popular combination for realizing long-running tasks in internet applications. We use it to drive the processing of uploaded files. It allows us to transparently distribute processing load while keeping processing state available to inform the user.
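As a minimal sketch of this pattern (the task name `process_upload` and its steps are hypothetical illustrations, not nomad's actual processing code):

```python
from celery import Celery

# A Celery app with RabbitMQ as the message broker (broker URL is illustrative).
app = Celery('processing', broker='pyamqp://guest@localhost//')

@app.task(bind=True)
def process_upload(self, upload_id: str) -> str:
    # A long-running task, executed on a worker node. `update_state`
    # publishes intermediate state, so the API/GUI can inform the user
    # about progress while the task is still running.
    for step in ('extracting', 'parsing', 'normalizing'):
        self.update_state(state='PROGRESS', meta={'step': step})
        # ... the actual work for each step would happen here ...
    return f'upload {upload_id} processed'

# A client (e.g. the API) enqueues the task without blocking:
# result = process_upload.delay('some-upload-id')
```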
### elasticsearch

[Elasticsearch](https://www.elastic.co/webinars/getting-started-elasticsearch) is used to store repository data (not the raw files). Elasticsearch allows for flexible, scalable search and analytics.

### mongodb

[Mongodb](https://docs.mongodb.com/) is used to store and track the state of processing uploaded files and the calculations they contain. We use [mongoengine](http://docs.mongoengine.org/) to program with mongodb.

### Keycloak

[Keycloak](https://www.keycloak.org/) is used for user management. It manages users and provides functions for registration, password recovery, editing user accounts, and single sign-on for fairdi@nomad and other related services.

### flask, et al.

The RESTful API is built with the [flask](http://flask.pocoo.org/docs/1.0/) framework and its [RESTPlus](https://flask-restplus.readthedocs.io/en/stable/) extension. This allows us to automatically derive a [swagger](https://swagger.io/) description of the nomad API, which in turn allows us to generate programming-language-specific client libraries, e.g. we use [bravado](https://github.com/Yelp/bravado) for Python and [swagger-js](https://github.com/swagger-api/swagger-js) for Javascript. Furthermore, you can browse and use the API via [swagger-ui](https://swagger.io/tools/swagger-ui/).
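To sketch what such a generated client looks like in practice (the host URL and the `uploads.get_upload` operation are assumptions for illustration, not necessarily nomad's actual endpoint names):

```python
from bravado.client import SwaggerClient

# Build a client directly from the API's swagger description (URL illustrative).
client = SwaggerClient.from_url('https://some-nomad-host/api/swagger.json')

# bravado generates resources and operations from the swagger spec at runtime;
# the resource/operation names below are hypothetical examples of such methods.
upload = client.uploads.get_upload(upload_id='some-upload-id').response().result
print(upload)
```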
### Elasticstack

The [elastic stack](https://www.elastic.co/guide/index.html) (previously *ELK* stack) is a central logging, metrics, and monitoring solution that collects data within the cluster and provides a flexible analytics frontend for that data.

### Javascript, React, Material-UI

The frontend (GUI) of **nomad@FAIRDI** is built on top of the [React](https://reactjs.org/docs/getting-started.html) component framework. This allows us to build the GUI as a set of re-usable components, achieving a coherent representation of all aspects of nomad while keeping development efforts manageable. React uses [JSX](https://reactjs.org/docs/introducing-jsx.html) (a syntax extension to ES6 Javascript) that allows HTML markup to be mixed with Javascript code. The component library [Material-UI](https://material-ui.com/) (based on Google's popular material design framework) provides a consistent look-and-feel.

### docker

To run a **nomad@FAIRDI** instance, many services have to be orchestrated: the nomad app, nomad worker, mongodb, Elasticsearch, Keycloak, RabbitMQ, Elasticstack (logging), the nomad GUI, and a reverse proxy to tie everything together. Further services (e.g. JupyterHub) might be needed as nomad grows. The container platform [Docker](https://docs.docker.com/) allows us to provide all services as pre-built images that can be run flexibly on all types of platforms, networks, and storage solutions. [Docker-compose](https://docs.docker.com/compose/) allows us to provide the configuration to run the whole nomad stack on a single server node.

### kubernetes + helm

To run and scale nomad on a cluster, you can use [kubernetes](https://kubernetes.io/docs/home/) to orchestrate the necessary containers. We provide a [helm](https://docs.helm.sh/) chart with all necessary service and deployment descriptors that allows you to set up and update nomad with a few commands.

### GitLab

Nomad as a software project is managed via [GitLab](https://docs.gitlab.com/). The **nomad@FAIRDI** project is hosted [here](https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR). GitLab is used to manage versions and different branches of development, to track tasks and issues, as a [registry for Docker images](https://docs.gitlab.com/ee/user/project/container_registry.html), and as a [CI/CD platform](https://docs.gitlab.com/ee/ci/).

## Data model

.. figure:: /assets/data.png
   :alt: nomad's data model

The entities that comprise the nomad data model are *users*, *datasets*, *uploads*, *calculations* (calc), and *materials*. *Users* upload multiple related *calculations* in one *upload*. *Users* can curate *calculations* into *datasets*. *Calculations* belong to one *material* based on the simulated system.

### Users

- The user `email` is used as a primary key to uniquely identify users (even among different nomad installations).

### Uploads

- An upload contains related calculations in the form of raw code input and output files.
- Uploaders are encouraged to upload all relevant files.
- The directory structure of an upload might be used to relate calculations to each other.
- Uploads have a unique, randomly chosen `upload_id` (UUID).
- The `uploader` is the user that provided the upload. There is always one immutable `uploader`.
- Currently, uploads can be provided as `.zip` or `.tar.gz` files.

### Entries (Calculations, Code runs)

- The naming varies: internally, the nomad source code uses the term `calc`. An entry represents a single set of input/output files used and produced by an individual run of a DFT code. If nomad is applied to other domains, e.g. experimental material science, entries might represent experiments or other entities.
- An entry (calculation) has a unique `calc_id` that is based on the upload's id and the `mainfile` (see the sketch after this list).
- The `mainfile` is an upload-relative path to the main output file.
- Each calculation, when published, gets a unique `pid`. Pids are ascending integers. For each `pid` a shorter `handle` is created. Handles can be registered with a handle system, e.g. the central nomad installation at MPCDF is registered with the MPCDF/GWDG handle system.
- The `calc_hash` is computed from the main and other parsed raw files.
- Entry data comprises *user metadata* (comments, references, datasets, coauthors), *calculation metadata* (code, version, system and symmetry, used DFT method, etc.), the *archive data* (a hierarchy of all parsed quantities), and the uploaded *raw files*.
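As a sketch of the principle that `calc_id` is a deterministic function of `upload_id` and `mainfile` (the concrete hash function, truncation, and encoding here are illustrative assumptions, not nomad's actual implementation):

```python
import base64
import hashlib

def derive_calc_id(upload_id: str, mainfile: str) -> str:
    # Illustrative only: hash the upload id together with the
    # upload-relative mainfile path into a compact, URL-safe id.
    digest = hashlib.sha512(f'{upload_id}:{mainfile}'.encode('utf-8')).digest()
    return base64.urlsafe_b64encode(digest)[:28].decode('ascii')

print(derive_calc_id('some-upload-id', 'run1/vasp_output.xml'))
```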
### Datasets

- Datasets are user-curated sets of calculations.
- Users can assign names, and nomad can register a DOI for a dataset.
- A calculation can be put into multiple datasets.

### Materials

- Materials aggregate calculations based on common system properties (e.g. system type, atoms, lattice, space group, etc.).

### Data

We distinguish various forms of calculation data:

- raw data: the raw files provided by nomad users
- (repository) metadata: all data necessary to search and inspect nomad entries
- archive data: the data extracted from raw files by nomad parsers and normalizers; this data is represented in the *meta-info* format
- materials data: aggregated information about calculations that simulated the *same* material

.. figure:: /assets/datamodel_dataflow.png
   :alt: nomad's data flow

### Metadata

Metadata refers to those pieces of data, i.e. those quantities/attributes, that we use to represent, identify, and index uploads and calculations in the API, search, GUI, etc. There are three categories of metadata:

- entry metadata: attributes that are necessary to uniquely identify entities (see also :ref:`id-reference-label`) and that describe the upload, processing, etc. This data is derived by the nomad infrastructure.
- user metadata: attributes provided by the user, e.g. comments, references, coauthors, datasets, etc.
- domain metadata: metadata parsed from raw files that describe calculations on a high level, e.g. code name, basis set, system type, etc. This data is derived from the uploaded data.

These sets of metadata, along with the actual raw and archive data, are transformed, passed, and stored by the various nomad modules.

### Implementation

The different entities often have multiple implementations for different storage systems. For example, aspects of calculations are stored in files (raw files, calc metadata, archive data), Elasticsearch (metadata), and mongodb (metadata, processing state). Transformations between these implementations exist. See :py:mod:`nomad.datamodel` for further information.

## Processing

.. figure:: /assets/proc.png
   :alt: nomad's processing workflow

See :py:mod:`nomad.processing` for further information.
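To make the processing-state tracking concrete, here is a hedged sketch of how such state could be modeled with mongoengine; the document class and its fields are illustrative assumptions, not nomad's actual schema (see :py:mod:`nomad.processing` for the real implementation):

```python
import datetime

from mongoengine import DateTimeField, Document, StringField, connect

connect('nomad')  # illustrative database name

class UploadProc(Document):
    # Hypothetical document tracking the processing state of one upload.
    upload_id = StringField(primary_key=True)
    status = StringField(
        default='pending',
        choices=('pending', 'running', 'success', 'failure'))
    current_task = StringField()  # e.g. 'extracting', 'parsing'
    create_time = DateTimeField(default=datetime.datetime.utcnow)

# A worker would update the state as processing tasks progress:
# proc = UploadProc(upload_id='some-upload-id').save()
# proc.update(status='running', current_task='parsing')
```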