Developing NOMAD

Clone the sources

If you have not done so already, clone nomad. If you have a gitlab@MPCDF account, you can clone using the git (SSH) URL:

git clone git@gitlab.mpcdf.mpg.de:nomad-lab/nomad-FAIR.git nomad

Otherwise, clone using HTTPS URL:

git clone https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR.git nomad

Then change directory to nomad:

cd nomad

There are several branches in the repository. The master branch contains the latest released version, but there are also development branches for each version, named vX.X.X. Check out the branch you want to work on:

git checkout vX.X.X

The development branches are protected, so you should create a new branch for your changes:

git checkout -b <my-branch-name>

This branch can be pushed to the repo and later merged into the relevant branch.

Installation

Setup a Python environment

You should work in a Python virtual environment.

pyenv

The nomad code currently targets Python 3.7. If your host machine has an older version installed, you can use pyenv to run Python 3.7 in parallel to your system's Python. Nevertheless, we have also had good experience with 3.8 and 3.9, and newer versions might work as well.

virtualenv

We strongly recommend using virtualenv to create a virtual environment. It allows you to keep nomad and its dependencies separate from your system's Python installation. Make sure that the virtual environment is based on Python 3.

To install virtualenv, create an environment, and activate the environment, use:

pip install virtualenv
virtualenv -p `which python3` .pyenv
source .pyenv/bin/activate

If you use pyenv (or similar solutions), make sure that the -p argument evaluates to the Python binary with the desired version.
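To double-check the environment from within Python, something like the following sketch can be used. It is only an illustration and not part of NOMAD: the helper names in_virtual_environment and version_is_supported are hypothetical, and the minimum version (3.7) is taken from the pyenv paragraph above. The prefix comparison detects virtualenv and venv environments, but not conda environments.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True if the interpreter runs inside a virtualenv/venv."""
    # Older virtualenv versions set sys.real_prefix; venv and newer
    # virtualenv versions make sys.prefix differ from sys.base_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def version_is_supported(version_info=sys.version_info, minimum=(3, 7)) -> bool:
    """Return True if the major/minor version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


print("virtual environment active:", in_virtual_environment())
print("python version ok:", version_is_supported())
```

Run this with the interpreter of the activated environment (e.g. after source .pyenv/bin/activate).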

conda

If you are a conda user, there is an equivalent, but you have to install pip and the right python version while creating the environment.

conda create --name nomad_env pip python=3.7
conda activate nomad_env

To install libmagic for conda, you can use (other channels might also work):

conda install -c conda-forge --name nomad_env libmagic

The following command can be used to install all dependencies and the submodules of the NOMAD-coe project.

bash setup.sh

The script includes the following steps:

Upgrade pip

Make sure you have the most recent version of pip:

pip install --upgrade pip

Install missing system libraries (e.g. on MacOS)

Even though the NOMAD infrastructure is written in Python, one of our Python dependencies requires a C library: libmagic, which is used to determine the MIME type of files. It should already be installed on most Unix/Linux systems, but it is missing on some others. On MacOS, it can be installed with homebrew:

brew install libmagic

Install nomad

Finally, you can add nomad to the environment itself (including all extras). The -e option will install NOMAD with symbolic links, allowing you to change the code without having to reinstall after each change.

pip install -e .[all]

If pip tries to compile packages from source and this causes errors, it can be told to prefer binary versions:

pip install -e .[all] --prefer-binary

Install sub-modules

Nomad is based on Python modules from the NOMAD-coe project. This includes parsers, python-common and the meta-info. These modules are maintained as their own GitLab/git repositories. To clone and initialize them, run:

git submodule update --init

All requirements for these submodules need to be installed and they themselves need to be installed as python modules. Run the dependencies.sh script that will install everything into your virtual environment:

./dependencies.sh -e

If one of the Python packages installed during this process fails because it cannot be compiled on your platform, you can try pip install --prefer-binary <packagename> to install said packages manually.

The -e option will install the NOMAD-coe dependencies with symbolic links allowing you to change the downloaded dependency code without having to reinstall after.

Generate GUI artifacts

The NOMAD GUI requires static artifacts that are generated from the NOMAD Python codes.

nomad dev metainfo > gui/src/metainfo.json
nomad dev search-quantities > gui/src/searchQuantities.json
nomad dev toolkit-metadata > gui/src/toolkitMetadata.json
nomad dev units > gui/src/unitsData.js
nomad dev parser-metadata > gui/src/parserMetadata.json

Or simply run

./generate_gui_artifacts.sh

The generated files are not stored in Git. If you pull a different commit, the GUI code might not match the data in outdated files. After changes to units, the metainfo, parsers, or toolkits, it might be necessary to regenerate these GUI artifacts.

In addition, you have to do some more steps to prepare your working copy to run all the tests. See below.

Run the infrastructure

Install docker

You need to install docker and docker-compose.

Run required 3rd party services

To run NOMAD, some 3rd party services are needed:

  • elastic search: nomad's search and analytics engine
  • mongodb: used to store processing state
  • rabbitmq: a task queue used to distribute work in a cluster

All 3rd party services should be run via docker-compose (see below). Keep in mind that docker-compose configures all services in a way that mirrors the configuration of the Python code in nomad/config.py and the gui config in gui/.env.development.

The default virtual memory for Elasticsearch is likely to be too low. On Linux, you can run the following command as root:

sysctl -w vm.max_map_count=262144

To set this value permanently, see here. Then, you can run all services with:

cd ops/docker-compose/infrastructure
docker-compose up -d mongo elastic rabbitmq
cd ../../..

If your system comes close to running out of disk space, Elasticsearch enforces a read-only index block (read more). After freeing up disk space, you need to reset the block manually using the following command:

curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": false}'
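The same request can also be sent from Python using only the standard library. The sketch below mirrors the curl command above (same URL, header, and JSON body); the function names are hypothetical, and nothing is sent until reset_read_only_block is called.

```python
import json
import urllib.request


def build_reset_request(base_url: str = "http://localhost:9200") -> urllib.request.Request:
    """Build the PUT request that clears the read-only index block."""
    body = json.dumps({"index.blocks.read_only_allow_delete": False}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_all/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def reset_read_only_block(base_url: str = "http://localhost:9200") -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(build_reset_request(base_url), timeout=10) as response:
        return response.status
```

Calling reset_read_only_block() requires the Elasticsearch container to be running and reachable on port 9200.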

Note that the ElasticSearch service has a known problem in quickly hitting the virtual memory limits of your OS. If you experience issues with the ElasticSearch container not running correctly or crashing, try increasing the virtual memory limits as shown here.
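On Linux, the current limit can be read from /proc/sys/vm/max_map_count. The following sketch compares it against the value used in the sysctl command above. The function names are hypothetical, and the check returns None on systems that do not have this file (e.g. MacOS).

```python
from pathlib import Path

# Value used in the sysctl command shown earlier in this section.
REQUIRED_MAX_MAP_COUNT = 262144


def max_map_count_is_sufficient(raw_value: str, required: int = REQUIRED_MAX_MAP_COUNT) -> bool:
    """Return True if the given value meets the required limit."""
    return int(raw_value.strip()) >= required


def check_host(proc_file: Path = Path("/proc/sys/vm/max_map_count")):
    """Return True/False on Linux, or None if the file does not exist."""
    if not proc_file.exists():
        return None
    return max_map_count_is_sufficient(proc_file.read_text())


print("vm.max_map_count sufficient:", check_host())
```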

To shut down everything, just press ctrl-c in the shell showing the output. If you started everything in daemon mode (-d), use:

docker-compose down

Usually these services are used only by NOMAD, but sometimes you also need to check something or do some manual steps. You can access mongodb and elastic search via your preferred tools. Just make sure to use the right ports.
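Before connecting with your own tools, it can help to check which service ports are actually reachable. The sketch below is an illustration only: the Elasticsearch port (9200) appears in the curl command above, while the MongoDB (27017) and RabbitMQ (5672) ports are the usual defaults of these services and are assumptions here. Compare them with your docker-compose configuration.

```python
import socket

# 9200 is taken from the curl command in this section; 27017 and 5672 are
# the services' usual default ports and are assumptions, not NOMAD settings.
DEFAULT_PORTS = {"elastic": 9200, "mongo": 27017, "rabbitmq": 5672}


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def service_status(host: str = "localhost", ports=None) -> dict:
    """Map each service name to whether its port is reachable."""
    ports = DEFAULT_PORTS if ports is None else ports
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

For example, print(service_status()) shows at a glance whether the three containers are reachable from the host.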

Run NOMAD

Before you run NOMAD for development purposes, you should configure it to use the test realm of our user management system. By default, NOMAD will use the fairdi_nomad_prod realm. Create a nomad.yaml file in the root folder:

keycloak:
  realm_name: fairdi_nomad_test

NOMAD consists of the NOMAD app/api, a worker, and the GUI. You can run the app and the worker with the NOMAD cli. These commands will run the services and display their log output. You should open them in separate shells as they run continuously. They do not watch for code changes, so you have to restart them manually.

nomad admin run app
nomad admin run worker

Or both together in one process:

nomad admin run appworker

On MacOS you might run into multiprocessing errors. That can be solved as described here.

The app will run at port 8000 by default.

To run the worker directly with celery, run (from the root):

celery -A nomad.processing worker -l info

Before you can run the gui, make sure that generated artifacts have been created:

nomad dev metainfo > gui/src/metainfo.json
nomad dev search-quantities > gui/src/searchQuantities.json
nomad dev toolkit-metadata > gui/src/toolkitMetadata.json
nomad dev units > gui/src/unitsData.js

If you run the gui on its own (e.g. with the react dev server below), you also have to run the app manually. The gui and its dependencies run on node and use the yarn dependency manager. Read their documentation on how to install them for your platform.

cd gui
yarn
yarn start

Run tests

To run the tests some additional settings and files are necessary that are not part of the code base.

You have to provide static files to serve the docs and NOMAD distribution:

mkdocs build && mv site docs/build
python setup.py compile
python setup.py sdist
cp dist/nomad-lab-*.tar.gz dist/nomad-lab.tar.gz

You need to have the infrastructure partially running: elastic, rabbitmq. The rest should be mocked or provided by the tests. Make sure that you do not run any worker, as they will fight for tasks in the queue.

cd ops/docker-compose/infrastructure
docker-compose up -d elastic rabbitmq
cd ../../..
pytest -svx tests

We use pylint, pycodestyle, and mypy to ensure code quality. To run those:

nomad dev qa --skip-test

To run all tests and code qa:

nomad dev qa

This mimics the tests and checks that the GitLab CI/CD will perform.

Setup your IDE

The documentation section for development guidelines (see below) details how the code is organized, tested, formatted, and documented. To help you meet these guidelines, we recommend using a proper IDE for development and ditching any VIM/Emacs (mal-)practices.

We strongly recommend that all developers use visual studio code, or vscode for short (this is a completely different product than visual studio). It is available for free for all major platforms here.

You should launch and run vscode directly from the project's root directory. The source code already contains settings for vscode in the .vscode directory. The settings contain the same setup for style checks, linter, etc. that is also used in our CI/CD pipelines. If you want to augment this with your own settings, you can have a .vscode/settings.local.json. This file is in .gitignore and only belongs to you.

The settings also include a few launch configurations for vscode's debugger. You can create your own launch configs in .vscode/launch.json (also in .gitignore).

The settings expect that you have installed a python environment at .pyenv as described in this tutorial (see above).

Code guidelines

Principles and rules

  • simple first, complicated only when necessary
  • adopting generic established 3rd party solutions before implementing specific solutions
  • only uni directional dependencies between components/modules, no circles
  • only one language: Python (except, GUI of course)

There are some rules, or better, strong guidelines for writing code. The following applies to all Python code (and, where applicable, also to JS and other code):

  • Use an IDE (e.g. vscode) or otherwise automatically enforce code formatting and linting. Use nomad dev qa before committing. This will run all tests, static type checks, linting, etc.

  • There is a style guide to python. Write pep-8 compliant python code. An exception is the line cap at 79, which can be broken but keep it 90-ish.

  • Test the public API of each sub-module (i.e. python file)

  • Be pythonic and watch this.

  • Document any public API of each sub-module (e.g. python file). Public meaning API that is exposed to other sub-modules (i.e. other python files).

  • Use google docstrings.

  • Add your doc-strings to the sphinx documentation in docs. Use .md, follow the example. Markdown in sphinx is supported via recommonmark and AutoStructify.

  • The project structure is according to this guide. Keep it!

  • Write tests for all contributions.
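As an illustration of the docstring rule above, here is a minimal sketch of a function documented with a Google-style docstring. The function filter_mainfiles and its parameters are hypothetical and do not exist in the nomad code base; only the docstring layout (Args, Returns, Raises) is the point.

```python
from typing import Iterable, List


def filter_mainfiles(paths: Iterable[str], suffix: str = ".json") -> List[str]:
    """Select the raw file paths that end with a given suffix.

    Args:
        paths: Raw file paths relative to the upload's root directory.
        suffix: File name suffix a path must have to be selected.

    Returns:
        The selected paths in their original order.

    Raises:
        ValueError: If ``suffix`` is empty.
    """
    if not suffix:
        raise ValueError("suffix must not be empty")
    return [path for path in paths if path.endswith(suffix)]
```

A matching test would exercise exactly this public API, e.g. the selection for a known list of paths and the ValueError for an empty suffix.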

Enforcing Rules with CI/CD

These guidelines are partially enforced by CI/CD. As part of CI, all tests are run on all branches; in addition, we run a linter, a pep8 checker, and mypy (static type checker). You can run nomad dev qa to run all these tests and checks before committing.

The CI/CD will run on all refs that do not start with dev-. The CI/CD will not release or deploy anything automatically, but it can be manually triggered after the build and test stage completed successfully.

Names and identifiers

There is a certain terminology consistently used in this documentation and the source code. Use this terminology for identifiers.

Do not use abbreviations. There are (few) exceptions: proc (processing); exc, e (exception); calc (calculation), repo (repository), utils (utilities), and aux (auxiliary). Other exceptions are f for file-like streams and i for index running variables. Btw., the latter is almost never necessary in python.

Terms:

  • upload: A logical unit that comprises a collection of files uploaded by a user, organized in a directory structure.
  • entry: An archive item, created by parsing a mainfile. Each entry belongs to an upload and is associated with various metadata (an upload may have many entries).
  • calculation: denotes the results of a theoretical computation, created by CMS code. Note that entries do not have to be based on calculations; they can also be based on experimental results.
  • raw file: A user uploaded file, located somewhere in the upload's directory structure.
  • mainfile: A raw file identified as parseable, defining an entry of the upload in question.
  • aux file: Additional files the user uploaded within an upload.
  • entry metadata: Some quantities of an entry that are searchable in NOMAD.
  • archive data: The normalized data of an entry in nomad's meta-info-based format.

Throughout nomad, we use different ids. If something is called id, it is usually a random uuid and has no semantic connection to the entity it identifies. If something is called a hash, then it is a hash generated based on the entity it identifies. The hash can cover either the whole entity or just some of its properties.

  • The most common hash is the entry_hash, based on mainfile and aux file contents.
  • The upload_id is a UUID assigned to the upload on creation. It never changes.
  • The mainfile is a path within an upload that points to a file identified as parseable. This also uniquely identifies an entry within the upload.
  • The entry_id (previously called calc_id) uniquely identifies an entry. It is a hash over the mainfile and respective upload_id. NOTE: For backward compatibility, calc_id is also still supported in the api, but using it is strongly discouraged.
  • We often use pairs of upload_id/entry_id, which in many contexts allow us to resolve an entry-related file on the filesystem without having to ask a database about it.
  • The pid (or coe_calc_id) is a legacy sequential integer id, previously used to identify entries. We still store the pid on these older entries for historical purposes.
  • Calculation handles or handle_id are created based on those pids. To create hashes we use :py:func:nomad.utils.hash.
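The difference between a random id and a hash can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: sketch_hash is a hypothetical stand-in for nomad.utils.hash (whose algorithm and output format may differ), and the mainfile path is made up.

```python
import base64
import hashlib
import uuid


def new_upload_id() -> str:
    """A random id: no semantic connection to the upload it identifies."""
    return str(uuid.uuid4())


def sketch_hash(*parts: str, length: int = 28) -> str:
    """Deterministic, URL-safe hash over the given parts (illustration only)."""
    digest = hashlib.sha512()
    for part in parts:
        digest.update(part.encode("utf-8"))
        digest.update(b"\0")  # separator, so ("ab", "c") differs from ("a", "bc")
    return base64.urlsafe_b64encode(digest.digest()).decode("ascii")[:length]


upload_id = new_upload_id()
# Hypothetical mainfile path; the entry id depends on upload_id and mainfile.
entry_id = sketch_hash(upload_id, "calculations/run_1/output.out")
print(upload_id, entry_id)
```

Because the entry id is derived from its inputs, the same upload_id and mainfile always give the same value, while every call to new_upload_id gives a new one.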

Logging

There are three important prerequisites to understand about nomad-FAIRDI's logging:

  • All log entries are recorded in a central elastic search database. To make this database useful, log entries must be sensible in size, frequency, meaning, level, and logger name. Therefore, we need to follow some rules when it comes to logging.
  • We use a structured logging approach. Instead of encoding all kinds of information in log messages, we use key-value pairs that provide context to a log event. In the end, all entries are stored as JSON dictionaries with @timestamp, level, logger_name, event plus custom context data. Keep events very short; most information goes into the context.
  • We use logging to inform about the state of nomad-FAIRDI, not about user behavior, input, or data. Do not confuse this when determining the log-level for an event. For example, a user providing an invalid upload file should never be an error.

Please follow these rules when logging:

  • If a logger is not already provided, only use :py:func:nomad.utils.get_logger to acquire a new logger. Never use the built-in logging directly. These loggers work like the system loggers, but allow you to pass keyword arguments with additional context data. See also the structlog docs.
  • In many contexts, a logger is already provided (e.g. api, processing, parser, normalizer). This provided logger already has context information bound to it, so it is important to use it instead of acquiring your own logger. Look for methods called get_logger or attributes called logger.
  • Keep events (what usually is called message) very short. Examples are: file uploaded, extraction failed, etc.
  • Structure the keys for context information. When you analyse logs in ELK, you will see that the set of all keys over all log entries can be quite large. Structure your keys to make navigation easier. Use keys like nomad.proc.parser_version instead of parser_version. Use module names as prefixes.
  • Don't log everything. Try to anticipate, how you would use the logs in case of bugs, error scenarios, etc.
  • Don't log sensitive data.
  • Think before logging data (especially dicts, list, numpy arrays, etc.).
  • Logs should not be abused as a printf-style debugging tool.
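The idea behind these rules (short events, context as key-value pairs, bound context) can be sketched with the standard library alone. This is not how NOMAD implements it: in nomad code you would use nomad.utils.get_logger or a provided logger. The ContextLogger class, the logger name, and the context values below are hypothetical.

```python
import io
import json
import logging


class ContextLogger:
    """Tiny stand-in for a structured logger that accepts keyword context."""

    def __init__(self, logger: logging.Logger, **bound):
        self._logger = logger
        self._bound = bound

    def bind(self, **context) -> "ContextLogger":
        """Return a new logger that carries additional context."""
        return ContextLogger(self._logger, **{**self._bound, **context})

    def info(self, event: str, **context) -> None:
        """Emit a short event plus all context as one JSON dictionary."""
        record = {"event": event, "level": "INFO", **self._bound, **context}
        self._logger.info(json.dumps(record, sort_keys=True))


# Write to an in-memory stream so the example is self-contained.
stream = io.StringIO()
base = logging.getLogger("nomad.sketch")
base.setLevel(logging.INFO)
base.addHandler(logging.StreamHandler(stream))
base.propagate = False

logger = ContextLogger(base).bind(**{"nomad.upload_id": "hypothetical-upload-id"})
logger.info("file uploaded", **{"nomad.proc.parser_version": "1.0"})
print(stream.getvalue())
```

The event stays short ("file uploaded"), while everything needed to filter or search later ends up in prefixed context keys.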

The following keys are used in the final logs that are piped to Logstash. Notice that the key name is automatically formed by a separate formatter and may differ from the one used in the actual log call.

Keys that are autogenerated for all logs:

  • @timestamp: Timestamp for the log
  • @version: Version of the logger
  • host: The host name from which the log originated
  • path: Path of the module from which the log was created
  • tags: Tags for this log
  • type: The message_type as set in the LogstashFormatter
  • level: The log level: DEBUG, INFO, WARNING, ERROR
  • logger_name: Name of the logger
  • nomad.service: The service name as configured in config.py
  • nomad.release: The release name as configured in config.py

Keys that are present for events related to processing an entry:

  • nomad.upload_id: The id of the currently processed upload
  • nomad.entry_id: The id of the currently processed entry
  • nomad.mainfile: The mainfile of the currently processed entry

Keys that are present for events related to exceptions:

  • exc_info: Stores the full python exception that was encountered. All uncaught exceptions will be stored automatically here.
  • digest: If an exception was raised, the last 256 characters of the message are stored automatically into this key. If you wish to search for exceptions in Kibana, you will want to use this value as it will be indexed unlike the full exception object.

Copyright notices

We follow this recommendation of the Linux Foundation for the copyright notice that is placed on top of each source code file.

It is intended to provide a broad generic statement that allows all authors/contributors of the NOMAD project to claim their copyright, independent of their organization or individual ownership.

You can simply copy the notice from another file. From time to time we can use a tool like licenseheaders to ensure correct notices. In addition, we keep a purely informative AUTHORS file.

Git/GitLab

Branches and clean version history

The master branch of our repository is protected. You must not (even if you have the rights) commit to it directly. The master branch references the latest official release (i.e. what the current NOMAD runs on). The current development is represented by version branches, named vx.x.x. Usually there are two or more of these branches, representing the development on minor/bugfix versions and the next major version(s). Ideally, these version branches are also not manually pushed to.

Instead, you develop on feature branches. These are short-lived branches dedicated to implementing a single feature.

The lifecycle of a feature branch should look like this:

  • create the feature branch from the last commit on the respective version branch that passes CI

  • do your work and push until you are satisfied and the CI passes

  • create a merge request on GitLab

  • discuss the merge request on GitLab

  • continue to work (with the open merge request) until all issues from the discussion are resolved

  • the maintainer performs the merge and the feature branch gets deleted

Submodules

We currently use git submodules to manage NOMAD internal dependencies (e.g. parsers). All dependencies are Python packages and are installed via pip into your Python environment.

This allows us to target (e.g. install) individual commits. More importantly, we can address commit hashes to identify exact parser/normalizer versions. On the downside, common functions for all dependencies (e.g. the python-common package, or nomad_meta_info) cannot be part of the nomad-FAIRDI project. In general, it is hard to simultaneously develop nomad-FAIRDI and NOMAD-coe dependencies.

Another approach is to integrate the NOMAD-coe sources with nomad-FAIRDI. The missing individual commit hashes could be replaced with hashes of source-code files.

We use the master branch on all dependencies. Of course feature branches can be used on dependencies to manage work in progress.

Keep a clean history

While working on a feature, there are certain practices that will help us to create a clean history with coherent commits, where each commit stands on its own.

  git commit --amend

If you committed something to your own feature branch and then learn from the CI that it contains some tiny error, try to amend the fix to the last commit. This will avoid unnecessary tiny commits and foster more coherent single commits. With amend you are basically adding changes to the last commit, i.e. editing the last commit. If you have already pushed, you need to force the next push: git push origin feature-branch --force-with-lease. So be careful, and only use this on your own branches.

  git rebase <version-branch>

Let's assume you work on a bigger feature that takes more time. You might want to merge the version branch into your feature branch from time to time to get the recent changes. In these cases, use rebase and not merge. Rebase replays your branch's commits on top of the latest commits of the version branch instead of creating a new commit with two ancestors. It basically moves the point where you initially branched away from the version branch to the current position in the version branch. This will avoid merges, merge commits, and generally leave us with a more consistent history. You can also rebase before creating a merge request, which allows fast-forward (no-op) merges. Ideally, the only real merges that we ever have are between version branches.

  git merge --squash <other-branch>

When you need multiple branches to implement a feature and merge between them, try to use squash. Squashing puts all commits of the merged branch into a single commit, so you can make many commits and later combine them into one. This is useful if these commits were made just to synchronize between workstations, due to unexpected errors in CI/CD, because you needed a save point, etc. Again, the goal is to have coherent commits, where each commit makes sense on its own.

Often a feature is also represented by an issue on GitLab. Please mention the respective issues in your commits by adding the issue id at the end of the commit message: My message. #123.
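A check for this convention could look like the following sketch, for example as part of a local commit-msg hook. The regular expression and the function name issue_id are assumptions made for illustration; nothing like this is shipped with NOMAD.

```python
import re

# Matches an issue reference such as "#123" at the very end of the commit
# message, optionally followed by a period and trailing whitespace.
ISSUE_AT_END = re.compile(r"#(\d+)\.?\s*$")


def issue_id(commit_message: str):
    """Return the referenced issue id as int, or None if there is none."""
    match = ISSUE_AT_END.search(commit_message.strip())
    return int(match.group(1)) if match else None


print(issue_id("My message. #123"))
```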

We tag releases with vX.X.X according to the regular semantic versioning practices. After releasing and tagging, the version branch is removed. Do not confuse tags with version branches. Remember that tags and branches are both Git references and you can accidentally pull/push/checkout a tag.

The main NOMAD GitLab-project (nomad-fair) uses Git-submodules to maintain its parsers and other dependencies. All these submodules are placed in the /dependencies directory. There are helper scripts to install (./dependencies.sh) and commit changes to all submodules (./dependencies-git.sh). After merging or checking out, you have to make sure that the modules are updated, so that you do not accidentally commit old submodule commits again. Usually you do the following to check that you really have a clean working directory:

  git checkout something-with-changes
  git submodule update
  git status