Training data underpins both the generation of new materials (for example, during the selection of prototypes and elements) and the development of the ML models that are used to predict the stability and key properties of the generated materials. The following points should be considered when judging the usefulness of a training dataset of computed features of 2D materials:
preferably computed using the same physics-based method;
having information about the uncertainty of the employed computational method to generate data;
having traceable material structure information (i.e., space group, lattice parameters, and atom coordinates);
having well-established stability information on compounds;
having a sufficiently large number of data instances;
having ample diversity in compound structures and chemical compositions.
In consideration of the above points, we collected the training data for the V2DB from the C2DB23, an openly accessible 2D materials database that provides DFT-calculated structures together with information on the stability, electronic, and magnetic properties of compounds, all obtained via a trustworthy workflow. The C2DB contains data on 3331 2D materials. We filtered the DFT-computed data using the following criteria:
include materials that belong to the selected prototypes (Fig. 2a);
include materials that contain only the selected chemical elements (Fig. 2b);
include materials that have dynamic stability data (Supplementary Information Sections 1.2 and 1.3);
include materials that have DFT(PBE) band gap data.
A total of 2226 2D materials from the C2DB have successfully met all of the above criteria.
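The four inclusion criteria above can be read as a single conjunctive filter over database entries. The following is a minimal sketch under stated assumptions: the `Entry` record, its field names, and the placeholder selections and values are hypothetical and do not reflect the actual C2DB schema or data.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Entry:
    formula: str
    prototype: str
    elements: FrozenSet[str]
    dynamic_stability: Optional[str]  # None when the database has no value
    gap_pbe: Optional[float]          # DFT(PBE) band gap in eV, None if missing


def passes_filters(entry, prototypes, elements):
    """Return True only if the entry meets all four inclusion criteria."""
    return (
        entry.prototype in prototypes
        and entry.elements <= elements  # contains only selected elements
        and entry.dynamic_stability is not None
        and entry.gap_pbe is not None
    )


# Placeholder selections and records, for illustration only.
SELECTED_PROTOTYPES = {"MoS2"}
SELECTED_ELEMENTS = frozenset({"Mo", "W", "S", "Se"})
entries = [
    Entry("MoS2", "MoS2", frozenset({"Mo", "S"}), "high", 1.0),
    Entry("WSe2", "MoS2", frozenset({"W", "Se"}), None, 1.0),      # no stability data
    Entry("MoTe2", "MoS2", frozenset({"Mo", "Te"}), "high", 1.0),  # Te not selected
    Entry("WS2", "CdI2", frozenset({"W", "S"}), "high", None),     # prototype not selected
]
kept = [e for e in entries if passes_filters(e, SELECTED_PROTOTYPES, SELECTED_ELEMENTS)]
print([e.formula for e in kept])
```

Only the first placeholder record satisfies every criterion, so it is the only one retained.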
Selection of prototypes
The crystal prototypes are used as categorical references of structures. Materials that have the same prototype share a similar structural architecture in terms of crystal symmetry (i.e., space group and positions of atoms in the unit-cell). The selection of prototypes is the first step of the new compound generation process. Together with the selection of elements, which is explained next, it determines the boundaries of the chemical search space of the compounds. We followed the steps below during the selection of prototypes:
identification of prototypes;
grouping prototypes according to unit-cell configurations (Fig. 2a);
selection of prototypes by unit-cell groups (Table 1).
The prototypes were carefully defined in the training data, and therefore we used them as they are. In other situations where the prototype information is not explicitly provided, for instance in an emerging new dataset of 2D materials, the compounds that satisfy the conditions below can be considered to share the same prototype structure:
In the next step, the identified prototypes are grouped according to the number of “A” (positively charged; cations) and “B” (negatively charged; anions) type atoms found in the unit cell. Finally, a cutoff for the maximum number of atoms per unit-cell is set. The number of atoms allowed in a unit cell exponentially affects the total number of new materials to be generated (see Table 1). For this reason, it is advised to set a small cutoff value, unless there is sufficiently large and homogeneously distributed data among different compositions of materials.
As shown in Fig. 2a, by setting a cutoff of six atoms per unit-cell, four different unit-cell groups are selected: AB, ABB, AABB, AABBBB. These unit-cell groups accommodate all of the 22 different crystal prototypes that are used as base structures during the process of candidate 2D material generation.
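The grouping and cutoff steps can be sketched as follows. This is an illustrative sketch only: the prototype names `P1` to `P5` and their atom counts are hypothetical, and the function names are not taken from the original work.

```python
from collections import defaultdict


def unit_cell_group(n_a, n_b):
    """Build a group label such as 'AABB' from the A and B atom counts."""
    return "A" * n_a + "B" * n_b


def select_groups(prototypes, max_atoms):
    """Group prototypes by unit-cell configuration, keeping cells within the cutoff."""
    groups = defaultdict(list)
    for name, (n_a, n_b) in prototypes.items():
        if n_a + n_b <= max_atoms:
            groups[unit_cell_group(n_a, n_b)].append(name)
    return dict(groups)


# Hypothetical prototypes mapped to (number of A atoms, number of B atoms).
prototypes = {"P1": (1, 1), "P2": (1, 2), "P3": (2, 2), "P4": (2, 4), "P5": (2, 6)}
groups = select_groups(prototypes, max_atoms=6)
print(sorted(groups))
```

With a cutoff of six atoms, the hypothetical eight-atom prototype `P5` is excluded and the remaining prototypes fall into the AB, ABB, AABB, and AABBBB groups.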
Selection of elements
Along with the crystal prototypes, the chemical elements are the most important descriptors of materials. They matter both in defining the size of the chemical search space and in developing the new ML models that predict material properties. To reliably predict the properties of newly generated 2D materials, the chemical elements of the generated compounds must be sufficiently represented in the training data. Therefore, the chemical elements are selected by setting a minimum number of distinct compounds in the training data that must contain each element. Accordingly, there is usually a trade-off between the total number of materials to be generated and the accuracy of the ML models to be developed. High threshold values for the minimum number of instances will naturally decrease the total number of newly generated compounds, but they will facilitate a better estimation of the properties of materials using AI methods. The threshold values can therefore be adjusted at the user's discretion.
After analyzing the number of occurrences of the chemical elements in the 2D materials of the selected prototypes, we selected elements that appeared in a minimum of ten different DFT-calculated compounds (see Supplementary Fig. 4). As shown in Fig. 2b, applying this threshold yielded 42 A-type and 10 B-type elements, for a total of 52 different chemical elements from the periodic table.
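The occurrence threshold can be illustrated with a short counting routine. This is a sketch under the assumption that each compound is counted once per element, regardless of how many atoms of that element it contains; the toy compound list and the function name are hypothetical.

```python
from collections import Counter


def select_elements(compounds, min_count):
    """Keep elements that occur in at least `min_count` distinct compounds."""
    counts = Counter()
    for elements in compounds:
        counts.update(set(elements))  # one count per compound, not per atom
    return {el for el, n in counts.items() if n >= min_count}


# Toy data: each tuple lists the atoms of one compound.
compounds = [("Mo", "S"), ("W", "S"), ("Mo", "Se"), ("Mo", "S", "S")]
selected = select_elements(compounds, min_count=2)
print(sorted(selected))
```

Raising `min_count` shrinks the selected set, which is the trade-off between search-space size and model reliability described above.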
Brute-force elemental substitution
We applied a brute-force elemental substitution method to generate a systematic and complete library of new 2D materials. Our brute-force elemental substitution algorithm systematically generates all possible candidate materials through the substitution of elements shown in Fig. 2b within the groups of prototypes shown in Fig. 2a. First, the atoms are classified into two categories of A and B depending on their anticipated electronic charges. Atoms from the “A” category always have positive charges, whereas atoms from the “B” category always have negative charges. Using these two categories, we defined simple substitution rules as follows:
Using a brute-force elemental substitution approach on the selected list of prototypes and chemical elements, we generated 72,522,240 new but as yet unfiltered 2D materials.
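The brute-force enumeration and its combinatorial size can be sketched as below. Each A site takes any of the 42 A-type elements and each B site any of the 10 B-type elements, so one prototype in a group with a A-sites and b B-sites contributes 42^a × 10^b candidates. The per-group prototype counts used in the example (2, 5, 11, and 4) are an assumption: they are not stated in this section (they would come from Table 1) and were chosen because they sum to 22 prototypes and reproduce the reported total.

```python
from itertools import product


def substitutions(group, a_elements, b_elements):
    """Iterate over every assignment of elements to the sites of a group such as 'AABB'."""
    pools = [a_elements if site == "A" else b_elements for site in group]
    return product(*pools)


def count_candidates(prototypes_per_group, n_a, n_b):
    """Count brute-force candidates without enumerating them."""
    total = 0
    for group, n_prototypes in prototypes_per_group.items():
        per_prototype = n_a ** group.count("A") * n_b ** group.count("B")
        total += n_prototypes * per_prototype
    return total


# Small enumeration: 2 A-type and 2 B-type elements on an ABB cell give 2 * 2 * 2 = 8.
small = list(substitutions("ABB", ["Mo", "W"], ["S", "Se"]))

# Assumed split of the 22 prototypes across the four unit-cell groups.
assumed_split = {"AB": 2, "ABB": 5, "AABB": 11, "AABBBB": 4}
total = count_candidates(assumed_split, n_a=42, n_b=10)
print(len(small), total)
```

Under the assumed split, the count equals 72,522,240, and the four AABBBB prototypes alone account for 70,560,000 of these, which illustrates how strongly the atoms-per-cell cutoff drives the size of the search space.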
The brute-force method generates all possible compositions of 2D compounds and disregards the symmetry of individual crystals. Therefore, the symmetrically identical duplicates should be identified and unified by removing the exact copies of compounds. For this purpose, we developed an algorithm that detects the duplicates in the generated 2D materials. The algorithm assigns symmetry labels to compounds based on the crystal prototypes and the atomic compositions. All compounds that share the same prototype structure have the same symmetry label. To define the symmetry labels for prototypes, we analyzed the symmetry in the unit-cell by considering pairs of atoms that belong to the same charge category, as shown in Fig. 2b. The 2D material is labeled with “Z” when no pairs of atoms are matched in its unit-cell (Table 1). The labels “X” and “Y” are used when pairs of A- and B-type atoms, respectively, are identified as symmetric according to a cut-plane that divides the unit cell into two symmetric pieces. For all the crystal prototypes and unit-cells considered in the current study, we used a total of five symmetry labels, as explained below:
Z: No symmetry in the prototype.
X: “A” atoms are symmetric.
Y: “B” atoms are symmetric.
XY: All “A” and “B” atoms are symmetric.
XYY: “A” atoms are symmetric, two “BB” atom couples are symmetric independently (specifically for AABBBB unit-cell group structures).
Using the symmetry filter, we detected and removed the duplicate copies from the whole set of generated compounds. A total of 10,321,920 structurally unique 2D materials passed the symmetry filter.
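One way to implement duplicate detection with such labels is to map every candidate to a canonical key in which the elements on each symmetric pair of sites are sorted. The sketch below is not the authors' algorithm. It assumes that each symmetric pair can be swapped independently of the others, that sites are ordered with all A sites before all B sites, and that each "Y" in a label refers to the next two B sites in order; all function names are hypothetical.

```python
from itertools import product


def symmetric_pairs(group, label):
    """Return index pairs of interchangeable sites for a unit-cell group and label."""
    a_sites = [i for i, s in enumerate(group) if s == "A"]
    b_sites = [i for i, s in enumerate(group) if s == "B"]
    pairs = []
    if "X" in label:
        pairs.append((a_sites[0], a_sites[1]))
    # Each "Y" marks one interchangeable B pair, taken in site order.
    for k in range(label.count("Y")):
        pairs.append((b_sites[2 * k], b_sites[2 * k + 1]))
    return pairs


def canonical_key(prototype, elements, pairs):
    """Sort the elements within each symmetric pair so duplicates share one key."""
    sites = list(elements)
    for i, j in pairs:
        if sites[i] > sites[j]:
            sites[i], sites[j] = sites[j], sites[i]
    return (prototype, tuple(sites))


def deduplicate(candidates, group, label):
    """Keep the first candidate seen for each canonical key."""
    pairs = symmetric_pairs(group, label)
    unique = {}
    for prototype, elements in candidates:
        unique.setdefault(canonical_key(prototype, elements, pairs), (prototype, elements))
    return list(unique.values())


# All AABB assignments of two A-type and two B-type elements on one hypothetical prototype.
candidates = [("P3", a + b) for a in product(["Mo", "W"], repeat=2)
              for b in product(["S", "Se"], repeat=2)]
print(len(candidates), len(deduplicate(candidates, "AABB", "XY")))
```

For this toy case, the 16 brute-force assignments reduce to 9 unique compounds under the "XY" label (3 unordered A pairs times 3 unordered B pairs), to 12 under "X", and remain 16 under "Z".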
The neutrality filter considers all of the possible charge combinations of the constituent elements of the compound under investigation, as calculated from the different charge contributions of its A- and B-type atoms. When no charge-neutral composition can be achieved among the constituent atoms of the compound, the 2D material is removed from the list of candidates. We used Greenwood’s tabulated data54 as a reference for the possible charge states of the atoms in our 2D material compositions. In addition to a reference table of elemental electrical charges, we applied the following criteria:
After applying the neutrality filter, the compounds that are not charge neutral are removed from the list of candidates, and a total of 9,732,136 2D materials remained.
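The core of the neutrality check, testing whether any combination of allowed charge states sums to zero, can be sketched as follows. The small `CHARGES` table is an illustrative assumption and not the Greenwood reference table, and the additional criteria mentioned above are not reproduced here.

```python
from itertools import product

# Illustrative subset of charge states (assumption, not the reference table).
CHARGES = {
    "Mo": (2, 3, 4, 5, 6),
    "Na": (1,),
    "S": (-2,),
    "Cl": (-1,),
}


def is_charge_neutral(atoms, charges=CHARGES):
    """Return True if some assignment of charge states to the atoms sums to zero."""
    options = [charges[el] for el in atoms]
    return any(sum(combo) == 0 for combo in product(*options))


print(is_charge_neutral(("Mo", "S", "S")))  # Mo(+4) balances two S(-2)
print(is_charge_neutral(("Na", "S")))       # +1 and -2 cannot cancel
```

Each atom in the unit cell is assigned its own charge state, so two atoms of the same element may take different states in a neutral combination.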
One of the important challenges of virtual material discovery is the uncertainty of the experimental synthesizability of the newly designed compounds. An approach to mitigate this uncertainty is to estimate the stability of the newly designed material. In simple terms, the stability of a material is defined as the ability to maintain the designed atomic configuration under specific physical and chemical conditions. A recent study55 demonstrated the applicability of an ML approach to predict the thermodynamic stability of 2D materials. However, a more comprehensive and accurate stability filter requires the inclusion of the three key factors of energy, phonon, and dynamic stability56. Considering these three factors, we developed ML models in order to identify the likely stable materials from the newly generated and filtered library of candidate 2D materials. To determine the stability of the compounds, we used the following three criteria on the ML-predicted properties, that is “is stable”, “heat of formation (ΔH)”, and “energy above convex hull (ΔHhull)”:
“Is stable” is a binary property that is derived from a combination of the thermodynamic and dynamic stability levels of materials as learned from the training data (see Supplementary Information Section 1.2). Fundamentally, both ΔH and ΔHhull must be negative for a material to be considered thermodynamically stable. However, noting that the accuracy of the DFT methodology using the PBE functional is ~0.2 eV/atom57, we used 0.2 eV/atom as a high cutoff in order to maximize the recall. We note that it is also possible to use the low cutoff of -0.2 eV/atom to maximize the precision of the ML model. After applying the stability filter, a total of 316,505 2D materials remained as likely stable candidates. The technical details of the ML models that were developed and used for the task of stability filtering are provided in the Supplementary Information Section 1.2.
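The effect of the cutoff choice can be illustrated with a small filter over ML-predicted values. This sketch assumes that the three criteria are combined with a logical AND and that the cutoff is applied as a strict upper bound on both energies; the `Prediction` record and the numerical values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    formula: str
    is_stable: bool      # ML-predicted binary stability label
    delta_h: float       # predicted heat of formation, eV/atom
    delta_h_hull: float  # predicted energy above convex hull, eV/atom


def likely_stable(p, cutoff=0.2):
    """Assumed rule: stable label and both energies below the cutoff."""
    return p.is_stable and p.delta_h < cutoff and p.delta_h_hull < cutoff


# Hypothetical predictions, for illustration only.
predictions = [
    Prediction("X1", True, 0.10, 0.15),
    Prediction("X2", True, -0.50, -0.30),
    Prediction("X3", False, -0.50, -0.30),
]
high_recall = [p.formula for p in predictions if likely_stable(p, cutoff=0.2)]
high_precision = [p.formula for p in predictions if likely_stable(p, cutoff=-0.2)]
print(high_recall, high_precision)
```

The high cutoff keeps borderline candidates such as the hypothetical `X1`, whereas the low cutoff keeps only candidates whose predicted energies are negative by more than the stated DFT accuracy.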
ML model development
All machine learning models were developed using the scikit-learn machine learning library with Python 3.6. A separate artificial neural network (ANN) was trained for each target material property. Importantly, as input features for our ML models, we used only basic element-level information, which is not DFT-calculated and can be extracted directly from the atomic composition of materials. The following features are used for our ML models (see Supplementary Information Section 2.1):
Atom per unit-cell: total number of atoms in the unit-cell of the 2D compound.
Prototype vector: one-hot vector of the 2D crystal prototypes.
Chemical composition vector: a vector holding the ratio of each chemical element within the unit-cell of the 2D compound, as calculated individually for A- and B-type atoms.
Electronegativity vector: the geometric mean of the Pauling scale electronegativity of the chemical elements within the unit-cell of the 2D compound, as calculated individually for A- and B-type atoms.
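The four feature groups above can be assembled into a single input vector as sketched below. This is an illustration under stated assumptions: element ratios are normalized within each charge category, the feature ordering is arbitrary, and the prototype names and element lists are hypothetical. The Pauling electronegativities are standard tabulated values for the four elements shown.

```python
from math import prod

# Pauling-scale electronegativities for a few elements.
PAULING = {"Mo": 2.16, "W": 2.36, "S": 2.58, "Se": 2.55}


def geometric_mean(values):
    """Geometric mean of a non-empty list of positive numbers."""
    return prod(values) ** (1.0 / len(values))


def featurize(prototype, a_atoms, b_atoms, prototypes, a_elements, b_elements):
    """Concatenate atom count, one-hot prototype, composition ratios, and electronegativity."""
    n_atoms = float(len(a_atoms) + len(b_atoms))
    one_hot = [1.0 if p == prototype else 0.0 for p in prototypes]
    # Ratios are computed separately within the A and B categories (assumption).
    a_ratio = [a_atoms.count(el) / len(a_atoms) for el in a_elements]
    b_ratio = [b_atoms.count(el) / len(b_atoms) for el in b_elements]
    electronegativity = [
        geometric_mean([PAULING[el] for el in a_atoms]),
        geometric_mean([PAULING[el] for el in b_atoms]),
    ]
    return [n_atoms] + one_hot + a_ratio + b_ratio + electronegativity


features = featurize(
    "P1", ["Mo"], ["S", "Se"],
    prototypes=["P1", "P2"], a_elements=["Mo", "W"], b_elements=["S", "Se"],
)
print(features)
```

Because every entry depends only on the prototype and the composition, two compounds with the same prototype and the same formula receive identical feature vectors.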
We tuned each ML model independently by optimizing the hyper-parameters using a grid search method (see Supplementary Information Section 2.2). We used a 20-fold cross-validation technique for evaluating our ML models. To reduce bias, we trained the final ML models, which are used for the identification of likely stable 2D material candidates and the prediction of their key properties, using the entire DFT dataset. It should be noted that the labeling procedure applied here is deterministic and that only basic features with element-level information are used. Therefore, materials that have the same prototype and the same chemical formula are expected to be labeled with exactly the same predicted values. Further information on the development of ML models is provided in the Supplementary Information Section 2.
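The 20-fold evaluation can be illustrated without any external dependency. The sketch below is not the scikit-learn pipeline used in the work: it only shows the fold logic, with a trivial mean predictor standing in for the ANN, and the function names, the `fit` and `predict` callables, and the toy data are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_val_mae(X, y, fit, predict, k=20):
    """Mean absolute error over all held-out samples of a k-fold split."""
    errors = []
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        errors.extend(abs(predict(model, X[i]) - y[i]) for i in held_out)
    return sum(errors) / len(errors)


def fit_mean(X_train, y_train):
    """Stand-in model: remember the mean of the training targets."""
    return sum(y_train) / len(y_train)


def predict_mean(model, x):
    """Stand-in model: always predict the stored mean."""
    return model


# Toy data: 40 samples with a constant target.
X = [[float(i)] for i in range(40)]
y = [1.0] * 40
folds = k_fold_indices(len(X), 20)
mae = cross_val_mae(X, y, fit_mean, predict_mean, k=20)
print(len(folds), mae)
```

Every sample is held out exactly once, so the reported error is computed only on data that the corresponding model did not see during training.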
ML model validation
Very recently, a new 2D materials database, 2Dmatpedia24, was announced. The 2Dmatpedia contains a total of 6351 2D compounds with properties that were calculated using DFT. Although there are some differences in the threshold parameters used for the DFT calculations, the 2Dmatpedia and the C2DB essentially share a similar computational methodology in terms of the PBE exchange-correlation functional, atomic pseudopotentials, and structural optimizations. Therefore, we used the 2Dmatpedia database to validate the accuracy of our ML model for the prediction of the band gap, which is the only comparable DFT-calculated property between 2Dmatpedia and C2DB. To screen for materials common to our V2DB and the 2Dmatpedia, we first developed an algorithm that compares the materials based on their chemical formula and space group. However, the formula and space group information alone are not sufficient to confirm that materials from the two sources are identical. Therefore, we also paired the materials by comparing views of their three-dimensional structures. As a result, we identified a total of 103 matched materials between V2DB and 2Dmatpedia. Twenty-seven of these compounds were not found in C2DB and were therefore not included in the training set. The predicted band gaps of the matched materials have an MAE of 0.438 eV. This error accumulates two contributions: the difference between the C2DB and 2Dmatpedia (MAE = 0.132 eV) and the cross-validation error of our ML model for band gap predictions (MAE = 0.135 eV). In that light, the result is promising for the applicability of our ML model. It is also important to note that the validation data comprises only 5 of the 22 crystal prototypes of the generated chemical space of 2D materials. Therefore, a larger and sufficiently diverse dataset will provide a more comprehensive validation.
The distributions of the band gap energy differences between the V2DB and 2Dmatpedia databases are provided in Supplementary Figs. 11 and 12. Additionally, we analyzed the most extreme errors (see Supplementary Tables 4 and 5) and discussed the possible weaknesses in the generalizability of the model above in the “Discussion” section.
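The first stage of the matching procedure and the MAE comparison can be sketched as follows. This covers only the formula and space group key; the additional structure comparison described above is not represented. The function name and the band gap values are hypothetical placeholders, not data from either database.

```python
def match_and_mae(predicted, reference):
    """Match entries by (formula, space group) and return the keys and the band gap MAE.

    Both arguments map (formula, space_group_number) to a band gap in eV.
    """
    shared = sorted(predicted.keys() & reference.keys())
    if not shared:
        raise ValueError("no matched materials")
    mae = sum(abs(predicted[key] - reference[key]) for key in shared) / len(shared)
    return shared, mae


# Placeholder band gaps in eV, for illustration only.
predicted = {("MoS2", 187): 1.5, ("WS2", 187): 2.0, ("AB", 1): 0.25}
reference = {("MoS2", 187): 1.75, ("WS2", 187): 1.5, ("CD", 2): 1.0}
shared, mae = match_and_mae(predicted, reference)
print(shared, mae)
```

Entries present in only one of the two sources are ignored, so the error is computed strictly over the matched set.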