How to parse tabular data¶
Refer to the Reference guide for the full list of annotations connected to this parser and to the Tabular parser tutorial for a detailed description of each of them.
Preparing the tabular data file¶
NOMAD and Excel
support multiple-sheets data manipulations and imports. Each quantity in the schema will be annotated with a source path composed by sheet name and column header. The path to be used with the tabular data displayed below would be Sheet1/My header 1
and it would be placed it the tabular
annotation, see schema annotations.
In the case there is only one sheet in the Excel file, or when using a .csv
file that is a single-sheet format, the sheet name is not required in the path.
The data sheets can be stored in one or more files depending on the user needs. Each sheet can independently be organized in one of the following ways:
1) Columns:
each column contains an array of cells that we want to parse into one quantity. Example: time and temperature arrays to be plotted as x and y.
2) Rows:
each row contains a set of cells that we want to parse into a section, i. e. a set of quantities. Example: an inventory tabular data file (for substrates, precursors, or more) where each column represents a property and each row corresponds to one unit stored in the inventory.
3) Rows with repeated columns:
in addition to the mode 2), whenever the parser detects the presence of multiple columns (or multiple sets of columns) with same headers, these are taken as multiple instances of a subsection. More explanations will be delivered when showing the schema for such a structure. Example: a crystal growth process where each row is a step of the crystal growth and the repeated columns describe the "precursor materials", that can be more than one during such processes and they are described by the same "precursor material" section.
Furthermore, we can insert comments before our data, we can use a special character to mark one or more rows as comment rows. The special character is annotated within the schema in the tabular
annotation, see schema annotations:
Inheriting the TableData base section¶
TableData
can be inherited adding the following lines in the yaml schema file:
EntryData
is usually also necessary as we will create entries from the section we are defining.
TableData
provides a customizable checkbox quantity, called fill_archive_from_datafile
, to turn the tabular parser on
or off
.
To avoid the parser running everytime a change is made to the archive data, it is sufficient to uncheck the checkbox. It is customizable in the sense that if you do not wish to see this checkbox at all, you can configure the hide
parameter of the section's m_annotations
to hide the checkbox. This in turn sets the parser to run everytime you save your archive. To hide it, add the following lines:
MySection:
base_sections:
- nomad.datamodel.data.EntryData
- nomad.parsing.tabular.TableData
m_annotations:
eln:
hide: ['fill_archive_from_datafile']
Be cautious though! Turning on the tabular parser (or checking the box) on saving your data will cause losing/overwriting your manually-entered data by the parser!
Importing data in NOMAD¶
After writing a schema file and creating a new upload in NOMAD (or using an existing upload), it is possible to upload the schema file. After creating a new Entry out of one section of the schema, the tabular data file must be dropped in the quantity designated by the FileEditQuantity
annotation. After clicking save the parsing will start. In the Overview page of the NOMAD upload, new Entries are created and appended to the Processed data section. In the Entry page, clicking on DATA tab (on top of the screen) and in the Entry lane, the data is populated under the data
subsection.
Hands-on examples of all tabular parser modes¶
In this section eight examples will be presented, containing all the features available in tabular parser. Refer to the Tutorial for more comments on the implications of the structures generated by the following yaml files.
1. Column mode, current Entry, parse to root¶
The first case gives rise to the simplest data archive file. Here the tabular data file is parsed by columns, directly within the Entry where the TableData
is inherited and filling the quantities in the root level of the schema (see dedicated how-to to learn how to inherit tabular parser in your schema).
Important
data_file
quantity, i.e. the tabular data file name, is located in the same Entry of the parsed quantities.- double check that
mapping_options > sections
contains the right path. It should point to the (sub)section where the quantities are decorated withtabular
annotation, i. e., the one to be filled with tabular data (root
in this case). - quantities parsed in
column
mode must have theshape: ['*']
attribute, that means they are arrays and not scalars.
definitions:
name: 'My test ELN 1'
sections:
MySection1:
base_sections:
- nomad.datamodel.data.EntryData
- nomad.parsing.tabular.TableData
m_annotations:
eln:
quantities:
data_file:
type: str
default: test.xlsx
m_annotations:
tabular_parser:
parsing_options:
comment: '#'
mapping_options:
- mapping_mode: column
file_mode: current_entry
sections:
- '#root'
browser:
adaptor: RawFileAdaptor
eln:
component: FileEditQuantity
my_array_quantity_1:
type: str
shape: ['*']
m_annotations:
tabular:
name: "My header 1"
my_array_quantity_2:
type: str
shape: ['*']
m_annotations:
tabular:
name: "My header 2"
2. Column mode, current Entry, parse to my path¶
The parsing mode presented here only differs from the previous for the sections
annotations. In this case the section that we want to fill with tabular data can be nested arbitrarily deep in the schema and the sections
annotation must be filled with a forward slash path to the desired section, e. g. my_sub_section/my_sub_sub_section
.
Important
data_file
quantity, i.e. the tabular data file name, is located in the same Entry of the parsed quantities.- double check that
mapping_options > sections
contains the right path. It should point to the (sub)section where the quantities are decorated withtabular
annotation, i. e., the one to be filled with tabular data. - the section to be parsed can be arbitrarily nested, given that the path provided in
sections
reachs it (e. g.my_sub_sec/my_sub_sub_sec
). - quantities parsed in
column
mode must have theshape: ['*']
attribute, that means they are arrays and not scalars.
definitions:
name: 'My test ELN 2'
sections:
MySection2:
base_sections:
- nomad.datamodel.data.EntryData
- nomad.parsing.tabular.TableData
m_annotations:
eln:
quantities:
data_file:
type: str
default: test.xlsx
m_annotations:
tabular_parser:
parsing_options:
comment: '#'
mapping_options:
- mapping_mode: column
file_mode: current_entry
sections:
- my_sub_section_2
browser:
adaptor: RawFileAdaptor
eln:
component: FileEditQuantity
sub_sections:
my_sub_section_2:
section: '#/MySubSection2'
MySubSection2:
m_annotations:
eln:
quantities:
my_array_quantity_1:
type: str
shape: ['*']
m_annotations:
tabular:
name: "My header 1"
my_array_quantity_2:
type: str
shape: ['*']
m_annotations:
tabular:
name: "My header 2"
3. Row mode, current Entry, parse to my path¶
The current is the first example of parsing in row mode. This means that every row of the excel file while be placed in one instance of the section that is defined in sections
. This section must be decorated with repeats: true
annotation, it will allow to generate multiple instances that will be appended in a list with sequential numbers. Instead of sequential numbers, the list can show specific names if label_quantity
annotation is appended to the repeated section. This annotation is included in the how-to example. The section is written separately in the schema and it does not need the EntryData
inheritance because the instances will be grafted directly in the current Entry. As explained below, it is not possible for row
and current_entry
to parse directly in the root because we need to create multiple instances of the selected subsection and organize them in a list.
Important
data_file
quantity, i.e. the tabular data file name, is located in the same Entry of the parsed quantities.- double check that
mapping_options > sections
contains the right path. It should point to the (sub)section where the quantities are decorated withtabular
annotation, i. e., the one to be filled with tabular data. - the section to be parsed can be arbitrarily nested, given that the path provided in
sections
reaches it (e. g.my_sub_sec/my_sub_sub_sec
). - quantities parsed in
row
mode are scalars. - make use of
repeats: true
in the subsection within the parent sectionMySection
. label_quantity
annotation uses a quantity as name of the repeated section. If it is not provided, a sequential number will be used for each instance.
definitions:
name: 'My test ELN 3'
sections:
MySection3:
base_sections:
- nomad.datamodel.data.EntryData
- nomad.parsing.tabular.TableData
m_annotations:
eln:
quantities:
data_file:
type: str
default: test.xlsx
m_annotations:
tabular_parser:
parsing_options:
comment: '#'
mapping_options:
- mapping_mode: row
file_mode: current_entry
sections:
- my_repeated_sub_section_3
browser:
adaptor: RawFileAdaptor
eln:
component: FileEditQuantity
sub_sections:
my_repeated_sub_section_3:
repeats: true
section: '#/MySubSection3'
MySubSection3:
m_annotations:
eln:
more:
label_quantity: '#/data/my_quantity_1'
quantities:
my_quantity_1:
type: str
m_annotations:
tabular:
name: "My header 1"
my_quantity_2:
type: str
m_annotations:
tabular:
name: "My header 2"
4. Column mode, single new Entry, parse to my path¶
One more step of complexity is added here: the parsing is not performed in the current Entry, but a new Entry it automatically generated and filled.
This structure foresees a parent Entry where we collect one or more tabular data files and possibly other info while we want to separate a specific entity of our data structure in another searchable Entry in NOMAD, e. g. a substrate Entry or a measurement Entry that would be collected inside a parent experiment Entry. We need to inherit SubSect
class from EntryData
because these will be standalone archive files in NOMAD. Parent and children Entries are connected by means of the ReferenceEditQuantity
annotation in the parent Entry schema. This annotation is attached to a quantity that becomes a hook to the other ones, It is a powerful tool that allows to list in the overview of each Entry all the other referenced ones, allowing to build paths of referencing available at a glance.
Important
data_file
quantity, i.e. the tabular data file name, is located in the parent Entry, the data is parsed in the child Entry.- double check that
mapping_options > sections
contains the right path. It should point to the (sub)section where the quantities are decorated withtabular
annotation, i. e., the one to be filled with tabular data. - the section to be parsed can be arbitrarily nested, given that the path provided in
sections
reachs it (e. g.my_sub_sec/my_sub_sub_sec
) - quantities parsed in
column
mode must have theshape: ['*']
attribute, that means they are arrays and not scalars. - inherit also the subsection from
EntryData
as it must be a NOMAD Entry archive file.
definitions:
name: 'My test ELN 4'
sections:
MySection4:
base_sections:
- nomad.datamodel.data.EntryData
- nomad.parsing.tabular.TableData
m_annotations:
eln:
quantities:
data_file:
type: str
default: test.xlsx
m_annotations:
tabular_parser:
parsing_options:
comment: '#'
mapping_options:
- mapping_mode: column
file_mode: single_new_entry
sections:
- my_subsection_4
browser:
adaptor: RawFileAdaptor
eln:
component: FileEditQuantity
sub_sections:
my_subsection_4:
section:
m_annotations:
eln:
quantities:
my_ref_quantity:
type: '#/MySubSection4'
m_annotations:
eln:
component: ReferenceEditQuantity
MySubSection4:
base_sections:
- nomad.datamodel.data.EntryData
m_annotations:
eln:
quantities:
my_array_quantity_1:
type: str
shape: ['*']
m_annotations:
tabular:
name: "My header 1"
my_array_quantity_2:
type: str
shape: ['*']
m_annotations:
tabular:
name: "My header 2"
5. Row mode, single new Entry, parse to my path¶
Example analogous to the previous, where the new created Entry contains now a repeated subsection with a list of instances made from each line of the tabular data file, as show in the Row mode, current Entry, parse to my path case.
Important
data_file
quantity, i.e. the tabular data file name, is located in the parent Entry, the data is parsed in the child Entry.- double check that
mapping_options > sections
contains the right path. It should point to the (sub)section where the quantities are decorated withtabular
annotation, i. e., the one to be filled with tabular data. - the section to be parsed can be arbitrarily nested, given that the path provided in
sections
reachs it (e. g.my_sub_sec/my_sub_sub_sec
) - quantities parsed in
row
mode are scalars. - inherit also the subsection from
EntryData
as it must be a NOMAD Entry archive file. - make use of
repeats: true
in the subsection within the parent sectionMySection
. label_quantity
annotation uses a quantity as name of the repeated section. If it is not provided, a sequential number will be used for each instance.
definitions:
name: 'My test ELN 5'
sections:
MySection5:
base_sections:
- nomad.datamodel.data.EntryData
- nomad.parsing.tabular.TableData
m_annotations:
eln:
quantities:
data_file:
type: str
default: test.xlsx
m_annotations:
tabular_parser:
parsing_options:
comment: '#'
mapping_options:
- mapping_mode: row
file_mode: single_new_entry
sections:
- my_subsection_5/my_repeated_sub_section
browser:
adaptor: RawFileAdaptor
eln:
component: FileEditQuantity
sub_sections:
my_subsection_5:
section:
m_annotations:
eln:
quantities:
my_ref_quantity:
type: '#/MySubSection5'
m_annotations:
eln:
component: ReferenceEditQuantity
MySubSection5:
base_sections:
- nomad.datamodel.data.EntryData
m_annotations:
eln:
more:
label_quantity: '#/data/my_quantity_1'
sub_sections:
my_repeated_sub_section:
repeats: true
section:
quantities:
my_quantity_1:
type: str
m_annotations:
tabular:
name: "My header 1"
my_quantity_2:
type: str
m_annotations:
tabular:
name: "My header 2"
6. Row mode, multiple new entries, parse to root¶
The last feature available for tabular parser is now introduced: multiple_new_entries
. It is only meaningful for row
mode because each row of the tabular data file will be placed in a new Entry that is an instance of a class defined in the schema, this would not make sense for columns, though, as they usually need to be parsed all together in one class of the schema, for example the "timestamp" and "temperature" columns in a spreadsheet file would need to lie in the same class as they belong to the same part of experiment.
A further comment is needed to explain the combination of this feature with root
. As mentioned before, using root
foresees to graft data directly in the present Entry. In this case, this means that a manyfold of Entries will be generated based on the only class available in the schema. These Entries will not be bundled together by a parent Entry but just live in our NOMAD Upload as a spare list. They might be referenced manually by the user with ReferenceEditQuantity
in other archive files. Bundling them together in one overarching Entry already at the parsing stage would require the next and last example to be introduced.
Important
data_file
quantity, i.e. the tabular data file name, is located in the parent Entry, the data is parsed in the children Entries.- double check that
mapping_options > sections
contains the right path. It should point to the (sub)section where the quantities are decorated withtabular
annotation, i. e., the one to be filled with tabular data. - quantities parsed in
row
mode are scalars. - inherit also the subsection from
EntryData
as it must be a NOMAD Entry archive file. - make use of
repeats: true
in the subsection within the parent sectionMySection
. label_quantity
annotation uses a quantity as name of the repeated section. If it is not provided, a sequential number will be used for each instance.
definitions:
name: 'My test ELN 6'
sections:
MySection6:
base_sections:
- nomad.datamodel.data.EntryData
- nomad.parsing.tabular.TableData
m_annotations:
eln:
more:
label_quantity: '#/data/my_quantity_1'
quantities:
data_file:
type: str
default: test.xlsx
m_annotations:
tabular_parser:
parsing_options:
comment: '#'
mapping_options:
- mapping_mode: row
file_mode: multiple_new_entries
sections:
- '#root'
browser:
adaptor: RawFileAdaptor
eln:
component: FileEditQuantity
my_quantity_1:
type: str
m_annotations:
tabular:
name: "My header 1"
my_quantity_2:
type: str
m_annotations:
tabular:
name: "My header 2"
7. Row mode, multiple new entries, parse to my path¶
As anticipated in the previous example, row
mode in connection to multiple_new_entries
will produce a manyfold of instances of a specific class, each of them being a new Entry. In the present case, each instance will also automatically be placed in a ReferenceEditQuantity
quantity lying in a subsection defined within the parent Entry, coloured in plum in the following example image.
Important
data_file
quantity, i.e. the tabular data file name, is located in the same Entry, the data is parsed in the children Entries.- double check that
mapping_options > sections
contains the right path. It should point to the (sub)section where the quantities are decorated withtabular
annotation, i. e., the one to be filled with tabular data. - the section to be parsed can be arbitrarily nested, given that the path provided in
sections
reachs it (e. g.my_sub_sec/my_sub_sub_sec
) - quantities parsed in
row
mode are scalars. - inherit also the subsection from
EntryData
as it must be a standalone NOMAD archive file. - make use of
repeats: true
in the subsection within the parent sectionMySection
. label_quantity
annotation uses a quantity as name of the repeated section. If it is not provided, a sequential number will be used for each instance.
definitions:
name: 'My test ELN 7'
sections:
MySection7:
base_sections:
- nomad.datamodel.data.EntryData
- nomad.parsing.tabular.TableData
m_annotations:
eln:
quantities:
data_file:
type: str
default: test.xlsx
m_annotations:
tabular_parser:
parsing_options:
comment: '#'
mapping_options:
- mapping_mode: row
file_mode: multiple_new_entries
sections:
- my_repeated_sub_section_7
browser:
adaptor: RawFileAdaptor
eln:
component: FileEditQuantity
sub_sections:
my_repeated_sub_section_7:
repeats: true
section:
m_annotations:
eln:
quantities:
my_ref_quantity:
type: '#/MySubSection7'
m_annotations:
eln:
component: ReferenceEditQuantity
MySubSection7:
base_sections:
- nomad.datamodel.data.EntryData
m_annotations:
eln:
more:
label_quantity: 'my_quantity_1'
quantities:
my_quantity_1:
type: str
m_annotations:
tabular:
name: "My header 1"
my_quantity_2:
type: str
m_annotations:
tabular:
name: "My header 2"
8. The Sub-Subsection nesting schema¶
If the tabular data file contains multiple columns with exact same name, there is a way to parse them using row
mode. As explained in previous examples, this mode creates an instance of a subsection of the schema for each row of the file. Whenever column with same name are found they are interpreted as multiple instances of a sub-subsection nested inside the subsection. To build a schema with such a feature it is enough to have two nested classes, each of them bearing a repeats: true
annotation. This structure can be applied to each and every of the cases above with row
mode parsing.
Important
- make use of
repeats: true
in the subsection within the parent sectionMySection
and also in the sub-subsection withinMySubSect
. label_quantity
annotation uses a quantity as name of the repeated section. If it is not provided, a sequential number will be used for each instance.
definitions:
name: 'My test ELN 8'
sections:
MySection8:
base_sections:
- nomad.datamodel.data.EntryData
- nomad.parsing.tabular.TableData
m_annotations:
eln:
quantities:
data_file:
type: str
default: test.xlsx
m_annotations:
tabular_parser:
parsing_options:
comment: '#'
mapping_options:
- mapping_mode: row
file_mode: current_entry
sections:
- my_repeated_sub_section_8
browser:
adaptor: RawFileAdaptor
eln:
component: FileEditQuantity
sub_sections:
my_repeated_sub_section_8:
repeats: true
section: '#/MySubSection8'
MySubSection8:
m_annotations:
eln:
more:
label_quantity: '#/data/my_quantity_1'
quantities:
my_quantity_1:
type: str
m_annotations:
tabular:
name: "My header 1"
sub_sections:
my_repeated_sub_sub_section:
repeats: true
section:
more:
label_quantity: my_quantity_2
quantities:
my_quantity_2:
type: str
m_annotations:
tabular:
name: "My header 2"
9. Not possible implementations¶
Some combinations of mapping_options
, namely file_mode
, mapping_mode
, and sections
, can give rise to not interpretable instructions or not useful data structure. For the sake of completeness, a brief explanation of the five not possible cases will be provided.
9.1 Row mode, current Entry, parse to root¶
row
mode always requires a section instance to be populated with one row of cells from the tabular data file. Multiple instances are hence generated from the rows available in the file. The instances are organized in a list and the list must be necessarily hosted as a subsection in some parent section. That's why, within the parent section, a path in sections
must be provided different from root
.
9.2 Column mode, single new Entry, parse to root¶
This would create a redundant Entry with the very same structure of the one where the data_file
quantity is placed, the structure would furthermore miss a reference between the two Entries. A better result is achieved using a path in sections
that would create a new Entry and reference it in the parent one.
9.3 Row mode, single new Entry, parse to root¶
As explained in the first section of not possible cases, when parsing in row mode we create multiple instances that cannot remain as standalone floating objects. They must be organized as a list in a subsection of the parent Entry.
9.4 Column mode, multiple new entries, parse to root¶
This case would create a useless set of Entries containing one array quantity each. Usually, when parsing in column mode we want to parse together all the columns in the same section.
9.5 Column mode, multiple new entries, parse to my path¶
This case would create a useless set of Entries containing one array quantity each. Usually, when parsing in column mode we want to parse together all the columns in the same section.