How to define workflows

The built-in abstract workflow schema

Workflows are an important aspect of data because they explain how the data came to be. To be clear, workflow here refers to a workflow that has already run and has produced input and output data, linked through the tasks that were performed. This is often also called data provenance or a provenance graph.

The following shows the overall abstract schema for workflows, which can be found in nomad.datamodel.metainfo.workflow (blue):

[Figure: workflow schema]

The idea is that workflows are stored in a top-level archive section alongside other sections that contain the inputs and outputs. This way, the workflow or provenance graph is just an additional piece of the archive that describes how the data in this archive (or in other archives) is connected.

Let's consider an example workflow. Imagine a geometry optimization and a ground state calculation performed by two individual DFT code runs. The code runs are stored in the NOMAD entries geom_opt.archive.yaml and ground_state.archive.yaml using the run top-level section.

Example workflow

Here is a logical depiction of the workflow and all its tasks, inputs, and outputs.

[Figure: example workflow]

Simple workflow entry

The following archive shows how to create such a workflow based on the given schema. Here we only model the GeometryOpt and GroundStateCalculation as two tasks with respective inputs and outputs that use references to entry archives of the respective code runs.

workflow2:
  inputs:
    - name: input system
      section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/0'
  outputs:
    - name: relaxed system
      section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/-1'
    - name: ground state calculation of relaxed system
      section: '../upload/raw/ground_state.archive.yaml#/run/0/calculation/0'
  tasks:
    - name: GeometryOpt
      inputs:
        - name: input system
          section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/0'
      outputs:
        - name: relaxed system
          section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/-1'

    - name: GroundStateCalculation
      inputs:
        - name: input system
          section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/-1'
      outputs:
        - name: ground state
          section: '../upload/raw/ground_state.archive.yaml#/run/0/calculation/0'
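
The two tasks above are connected only implicitly: the output reference of GeometryOpt is the same string as the input reference of GroundStateCalculation. The following is a minimal sketch of how such links could be turned into explicit graph edges. It works on a trimmed plain-dict copy of the tasks, does not use any NOMAD API, and the helper name task_edges is hypothetical. It compares references as plain strings, so two different spellings of the same section (for example system/-1 and system/3) would not be matched.

```python
# Trimmed plain-dict copy of the tasks listed above (not a NOMAD object).
GEOM = '../upload/raw/geom_opt.archive.yaml'

workflow = {
    'tasks': [
        {
            'name': 'GeometryOpt',
            'inputs': [{'name': 'input system', 'section': GEOM + '#/run/0/system/0'}],
            'outputs': [{'name': 'relaxed system', 'section': GEOM + '#/run/0/system/-1'}],
        },
        {
            'name': 'GroundStateCalculation',
            'inputs': [{'name': 'input system', 'section': GEOM + '#/run/0/system/-1'}],
            # outputs omitted for brevity; they are not needed to find the edge
        },
    ],
}


def task_edges(workflow):
    """Return (producer, consumer, section) triples for tasks sharing a reference."""
    edges = []
    for producer in workflow['tasks']:
        produced = {link['section'] for link in producer.get('outputs', [])}
        for consumer in workflow['tasks']:
            if consumer is producer:
                continue
            for link in consumer.get('inputs', []):
                if link['section'] in produced:
                    edges.append((producer['name'], consumer['name'], link['section']))
    return edges


edges = task_edges(workflow)
for producer, consumer, section in edges:
    print(f'{producer} -> {consumer} via {section}')
```
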

Nested workflows in one entry

Since a Workflow instance is also a Task instance due to inheritance, we can nest workflows. Here we detail the GeometryOpt as a nested workflow:

workflow2:
  inputs:
    - name: input system
      section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/0'
  outputs:
    - name: relaxed system
      section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/-1'
    - name: ground state calculation of relaxed system
      section: '../upload/raw/ground_state.archive.yaml#/run/0/calculation/0'
  tasks:
    - name: GeometryOpt
      m_def: nomad.datamodel.metainfo.workflow.Workflow
      inputs:
        - name: input system
          section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/0'
      outputs:
        - name: relaxed system
          section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/-1'
      tasks:
        - inputs:
            - section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/0'
          outputs:
            - section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/1'
            - section: '../upload/raw/geom_opt.archive.yaml#/run/0/calculation/0'
        - inputs:
            - section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/1'
          outputs:
            - section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/2'
            - section: '../upload/raw/geom_opt.archive.yaml#/run/0/calculation/1'
        - inputs:
            - section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/2'
          outputs:
            - section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/3'
            - section: '../upload/raw/geom_opt.archive.yaml#/run/0/calculation/2'
    - name: GroundStateCalculation
      inputs:
        - name: input system
          section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/-1'
      outputs:
        - name: ground state
          section: '../upload/raw/ground_state.archive.yaml#/run/0/calculation/0'
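
Because a nested workflow is just a task that has tasks of its own, a tool can traverse the whole structure with one recursive function. The sketch below illustrates this on a plain-dict outline of the archive above, with inputs and outputs left out. The function walk_tasks and the '<unnamed>' placeholder are illustrative choices, not part of NOMAD.

```python
def walk_tasks(task, depth=0):
    """Yield (depth, name) for a task and for everything nested below it."""
    yield depth, task.get('name', '<unnamed>')
    for child in task.get('tasks', []):
        yield from walk_tasks(child, depth + 1)


# Outline of the nested workflow above; inputs and outputs are omitted.
workflow = {
    'tasks': [
        {'name': 'GeometryOpt', 'tasks': [{}, {}, {}]},  # three optimization steps
        {'name': 'GroundStateCalculation'},
    ],
}

outline = list(walk_tasks(workflow))
for depth, name in outline:
    print('  ' * depth + name)
```
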

Nested workflows in multiple entries

Typically, we want to colocate our individual workflows with their inputs and outputs. For the geometry optimization, we might put the workflow into the archive of the geometry optimization code run. The geom_opt.archive.yaml could then contain its own workflow2 section that holds only the GeometryOpt workflow and uses local references to its inputs and outputs:

workflow2:
  name: GeometryOpt
  inputs:
    - name: input system
      section: '#/run/0/system/0'
  outputs:
    - name: relaxed system
      section: '#/run/0/system/-1'
  tasks:
    - inputs:
        - section: '#/run/0/system/0'
      outputs:
        - section: '#/run/0/system/1'
        - section: '#/run/0/calculation/0'
    - inputs:
        - section: '#/run/0/system/1'
      outputs:
        - section: '#/run/0/system/2'
        - section: '#/run/0/calculation/1'
    - inputs:
        - section: '#/run/0/system/2'
      outputs:
        - section: '#/run/0/system/3'
        - section: '#/run/0/calculation/2'
run:
  - program:
      name: 'VASP'
    system: [{}, {}, {}, {}]  # four systems, so that '#/run/0/system/3' resolves
    calculation: [{}, {}, {}]
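
Local references such as '#/run/0/system/-1' read like a path through the archive. As a rough illustration only, the sketch below follows such a path through nested dicts and lists. It assumes, as the examples here suggest, that a negative index counts from the end of a list. The function resolve_local, the stand-in archive, and its label fields are hypothetical, and NOMAD's own reference resolution is not shown here.

```python
def resolve_local(archive, reference):
    """Follow a '#/a/0/b/-1' style reference through nested dicts and lists."""
    if not reference.startswith('#/'):
        raise ValueError(f'not a local reference: {reference!r}')
    node = archive
    for segment in reference[2:].split('/'):
        if isinstance(node, list):
            node = node[int(segment)]  # negative values count from the end
        else:
            node = node[segment]
    return node


# Stand-in archive: four systems and three calculations, labelled for clarity.
archive = {
    'run': [
        {
            'system': [{'label': f's{i}'} for i in range(4)],
            'calculation': [{'label': f'c{i}'} for i in range(3)],
        }
    ]
}

relaxed = resolve_local(archive, '#/run/0/system/-1')
print(relaxed['label'])
```

With four systems in the list, '#/run/0/system/-1' and '#/run/0/system/3' address the same section.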

When we want to detail the overall workflow, we now need to refer to a nested workflow in a different entry. This cannot be done directly, because Workflow instances can only contain Task instances, not reference them. Therefore, we added a TaskReference section definition that can be used to create proxy instances for tasks and workflows:

workflow2:
  inputs:
    - name: input system
      section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/0'
  outputs:
    - name: relaxed system
      section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/-1'
    - name: ground state calculation of relaxed system
      section: '../upload/raw/ground_state.archive.yaml#/run/0/calculation/0'
  tasks:
    - m_def: nomad.datamodel.metainfo.workflow.TaskReference
      task: '../upload/raw/geom_opt.archive.yaml#/workflow2'
    - name: GroundStateCalculation
      inputs:
        - name: input system
          section: '../upload/raw/geom_opt.archive.yaml#/run/0/system/-1'
      outputs:
        - name: ground state
          section: '../upload/raw/ground_state.archive.yaml#/run/0/calculation/0'
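
To make the proxy idea concrete, here is a small sketch of how a consumer might swap a TaskReference for the workflow it points to. It is plain Python on dicts. The archives mapping (file path to parsed content) and the helpers split_reference and expand_task are assumptions made for this illustration and are not NOMAD functions.

```python
TASK_REFERENCE = 'nomad.datamodel.metainfo.workflow.TaskReference'


def split_reference(reference):
    """Split 'path#/fragment' into the entry path and the local fragment."""
    path, _, fragment = reference.partition('#')
    return path, '#' + fragment


def expand_task(task, archives):
    """Return the referenced task for a TaskReference proxy, else the task itself."""
    if task.get('m_def') != TASK_REFERENCE:
        return task
    path, fragment = split_reference(task['task'])
    node = archives[path]
    for segment in fragment[2:].split('/'):
        node = node[int(segment)] if isinstance(node, list) else node[segment]
    return node


# Hypothetical lookup table: entry path -> parsed archive content (trimmed).
archives = {
    '../upload/raw/geom_opt.archive.yaml': {
        'workflow2': {'name': 'GeometryOpt', 'tasks': [{}, {}, {}]},
    },
}

tasks = [
    {'m_def': TASK_REFERENCE, 'task': '../upload/raw/geom_opt.archive.yaml#/workflow2'},
    {'name': 'GroundStateCalculation'},
]

expanded = [expand_task(task, archives) for task in tasks]
names = [task['name'] for task in expanded]
print(names)
```
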

Extending the workflow schema

The abstract workflow schema above allows us to build generalized tools for workflows, such as workflow searches, navigation within workflows, and graphical representations of workflows. You can still augment the given section definitions with more information through inheritance. This information can include specialized references that denote inputs and outputs, additional workflow or task parameters, and much more.

In this example, we create a specialized workflow section definition, GeometryOptimizationWorkflow, that defines a threshold parameter and an additional reference to the final calculation of the optimization:

definitions:
  sections:
    GeometryOptimizationWorkflow:
      base_section: nomad.datamodel.metainfo.workflow.Workflow
      quantities:
        threshold:
          type: float
          unit: eV
        final_calculation:
          type: runschema.calculation.Calculation

workflow2:
  m_def: GeometryOptimizationWorkflow
  final_calculation: '#/run/0/calculation/-1'
  threshold: 0.029
  name: GeometryOpt
  inputs:
    ...