arXiv:2407.16867 (2024)
From Text to Insight: Large Language Models for Materials Science Data Extraction
Mara Schilling-Wilhelmi, Martiño Ríos-García, Sherjeel Shabih, María Victoria Gil, Santiago Miret, Christoph T. Koch, José A. Márquez and Kevin Maik Jablonka
The vast majority of materials science knowledge exists in unstructured natural language, yet structured data is crucial for innovative and systematic materials design. Traditionally, the field has relied on manual curation and partial automation for data extraction for specific use cases. The advent of large language models (LLMs) represents a significant shift, potentially enabling efficient extraction of structured, actionable data from unstructured text by non-experts. While applying LLMs to materials science data extraction presents unique challenges, domain knowledge offers opportunities to guide and validate LLM outputs. This review provides a comprehensive overview of LLM-based structured data extraction in materials science, synthesizing current knowledge and outlining future directions. We address the lack of standardized guidelines and present frameworks for leveraging the synergy between LLMs and materials science expertise. This work serves as a foundational resource for researchers aiming to harness LLMs for data-driven materials research. The insights presented here could significantly enhance how researchers across disciplines access and utilize scientific information, potentially accelerating the development of novel materials for critical societal needs.
npj Computational Materials volume 8 (2022)
The NOMAD Artificial-Intelligence Toolkit: turning materials-science data into knowledge and understanding
Luigi Sbailò, Ádám Fekete, Luca M. Ghiringhelli & Matthias Scheffler
We present the Novel-Materials-Discovery (NOMAD) Artificial-Intelligence (AI) Toolkit, a web-browser-based infrastructure for the interactive AI-based analysis of materials-science findable, accessible, interoperable, and reusable (FAIR) data. The AI Toolkit readily operates on the FAIR data stored in the central server of the NOMAD Archive, the largest database of materials-science data worldwide, as well as on locally stored, user-owned data. The NOMAD Oasis, a local, stand-alone server, can also be used to run the AI Toolkit. By using Jupyter notebooks that run in a web browser, the NOMAD data can be queried and accessed; data mining, machine learning, and other AI techniques can then be applied to analyze them. This infrastructure brings the concept of reproducibility in materials science to the next level, by allowing researchers to share not only the data contributing to their scientific publications, but also all the developed methods and analytics tools. Besides reproducing published results, users of the NOMAD AI Toolkit can adapt the Jupyter notebooks to their own research work.
Nature Reviews Physics volume 3, 724 (2021)
An AI-toolkit to develop and share research into new materials
Luca M. Ghiringhelli
Probably the biggest challenge in materials science is the discovery or design of new materials that exhibit exceptional performance for a desired function, or uncovering new properties of known materials. AI methods can be used to identify patterns and trends from big data to these ends. In materials science, these big data are a complex, hierarchical structure of experimental measures and theoretical estimates. Since 2014, the Novel Materials Discovery (NOMAD) Laboratory has established a materials data infrastructure, based on a large repository of materials data, and provides AI tools and training for researchers to freely access this resource, in compliance with the FAIR principles — that data should be findable, accessible, interoperable and reusable (or recyclable).