Download data
A common use-case for the NOMAD API is to download large amounts of NOMAD data. In this how-to guide, we use curl and API endpoints that stream .zip files to download many resources with a single request directly from the command line.
Prerequisites¶
Here is some background information to understand the examples better.
curl¶
To download resources from a REST API using curl, you can utilize the powerful command-line tool to send HTTP requests and retrieve the desired data. Curl provides a simple and efficient way to interact with RESTful APIs, allowing you to specify the necessary headers, parameters, and authentication details. Whether you need to download files, retrieve JSON data, or access other resources, curl offers a flexible and widely supported solution for programmatically fetching data from REST APIs.
Raw files vs processed data¶
We are covering two types of resources: raw files and processed data. The former is organized into uploads and sub directory. The organization depends on how the author was providing the files. The later is organized by entries. Each NOMAD entry has corresponding structured data.
Endpoints that target raw files typically contain raw
, e.g. uploads/<id>/raw
or entries/raw/query
. Endpoints that target processed data contain archive
(because we call the entirety of all processed data the NOMAD Archive), e.g.
entries/<id>/archive
or entries/archive/query
.
Entry vs upload¶
API endpoints for data download either target entries or uploads. For both types
of entities, endpoints for raw files and processed data (as well as searchable metadata)
exist. API endpoint paths start with the entity, e.g. uploads/<id>/raw
or entries/<id>/raw
.
Download a whole upload¶
Let's assume you want to download an entire upload. In this example the upload id is
wW45wJKiREOYTY0ARuknkA
.
curl -X GET "http://localhost:8000/fairdi/nomad/latest/api/v1/uploads/wW45wJKiREOYTY0ARuknkA/raw" -o download.zip
This will create a download.zip
file in the current folder. The zip file will contain
the raw file directory of the upload.
The used uploads/<id>/raw
endpoint is only available for published uploads. For those,
all raw files have already been
packed into a zip file and this endpoint simply lets you download it. This is the simplest
and most reliable download implementation.
Alternatively, you can download specific files or sub-directories. This method is available for all uploads. Including un-published uploads.
curl -X GET "http://localhost:8000/fairdi/nomad/latest/api/v1/uploads/wW45wJKiREOYTY0ARuknkA/raw/?compress=true" -o download.zip
This endpoint looks very similar, but is implemented very differently. Note that we
put an empty path /
to the end of the URL, plus a query parameter compress=true
.
The path can be replaced with any directory or file path in the upload; /
would denote the
whole upload. The query parameter says that we want to download the whole directory
as a zip file, instead of an individual file. This traverses through all files and
creates a zip file on the fly.
Download a whole dataset¶
Now let's assume that you want to download all raw files that are associated with
all the entries of an entire dataset. In this example the dataset DOI is
10.17172/NOMAD/2023.11.17-2
.
curl -X POST "http://localhost:8000/fairdi/nomad/latest/api/v1/entries/raw/query" \
-H 'Content-Type: application/json' \
-d '{
"query": {
"datasets.doi": "10.17172/NOMAD/2023.11.17-2"
}
}' \
-o download.zip
This time, we use the entries/raw/query
endpoint that is based on entries and not on uploads. Here, we
select entries with a query. In the example, we query for the dataset DOI, but you
can replace this with any NOMAD search query (look out for the <>
symbol on the
search interface). The zip file will contain all raw files from all the
directories that have the mainfile of one of the entries that match the queries.
This might not necessarily download all uploaded files. Alternatively, you can use a query to get all upload ids and then use the method from the previous section:
curl -X POST "http://localhost:8000/fairdi/nomad/latest/api/v1/entries/query" \
-H 'Content-Type: application/json' \
-d '{
"query": {
"datasets.doi": "10.17172/NOMAD/2023.11.17-2"
},
"pagination": {
"page_size": 0
},
"aggregations": {
"upload_ids": {
"terms": {
"quantity": "upload_id"
}
}
}
}'
The last command will print JSON data that contains all the upload ids. It uses
the entries/query
endpoint that allows you to query NOMAD's search.
It does not return any results (page_size: 0
),
but performs an aggregation over all search results and collects the upload ids
from all entries.
Download some processed data for a whole dataset¶
Similar to raw files, you can also download processed data. This is also an
entry based operation based on a query. This time we also specify a required
to explain which parts of the processed data, we are interested in:
curl -X POST "http://localhost:8000/fairdi/nomad/latest/api/v1/entries/archive/download/query" \
-H 'Content-Type: application/json' \
-d '{
"query": {
"datasets.doi": "10.17172/NOMAD/2023.11.17-2"
},
"required": {
"metadata": {
"entry_id": "*",
"mainfile": "*",
"upload_id": "*"
},
"results": {
"material": "*"
},
"run": {
"system[-1]": {
"atoms": "*"
}
}
}
}' \
-o download.zip
Here we use the entries/archive/download/query
endpoint. The result is a zip file
with one json file per entry. There are no directories and the files are named
<entry-id>.json
. To associate the json files with entries, you should require
information that tells you more about the entries, e.g. required.metadata.mainfile
.
See also the How to access processed data how-to guide.