InterMat: Accelerating Band Offset Prediction in Semiconductor Interfaces with DFT and Deep Learning

We introduce a computational framework (InterMat) to predict band offsets of semiconductor interfaces using density functional theory (DFT) and graph neural networks (GNN). As a first step, we benchmark OptB88vdW generalized gradient approximation (GGA) work functions and electron affinities for surfaces against experimental data with accuracies of 0.29 eV and 0.39 eV, respectively. Similarly, we evaluate band offset values using independent unit (IU) and alternate slab junction (ASJ) models leading to accuracies of 0.45 eV and 0.22 eV, respectively. We use bulk band structure calculations with the TBmBJ meta-GGA functional to correct for band gap underestimation when predicting conduction band properties. During ASJ structure generation, we use Zur algorithm along with a unified GNN force-field to tackle the conformation challenges of interface design. At present, we have 607 surface work functions calculated with DFT, from which we can compute 183921 IU band offsets as well as 593 directly calculated ASJ band offsets. Finally, as the space of all possible heterojunctions is too large to simulate with DFT, we develop generalized GNN models to quickly predict bulk band edges with an accuracy of 0.26 eV. We show how these models can be used to predict relevant quantities including ionization potentials, electron affinities, and IU-based band offsets. We establish simple rules using the above models to pre-screen potential semiconductor devices from a vast pool of nearly 1.4 trillion candidate interfaces. InterMat is available at website: https://github.com/usnistgov/intermat


Introduction
Interfaces are critical for a variety of technological applications including semiconductor transistors and diodes, solid-state lighting devices, solar-cells, data-storage and battery applications [1][2][3][4][5][6][7][8] .In particular, the continued scaling of semiconductor devices towards the atomic limit 9 makes interface properties even more important and a focus area of recent investments in research and development including the Creating Helpful Incentives to Produce Semiconductors (CHIPS) act 10 .While interfaces are ubiquitous, predicting even basic interface properties from bulk data or chemical models remains challenging.There have been numerous scientific efforts to model interfaces with a variety of techniques including density functional theory (DFT) [11][12][13][14][15][16][17][18][19][20] , force-field [21][22][23][24] , tight-binding [25][26][27][28] and machine learning techniques 19,[29][30][31] .However, to the best of our knowledge, there is no systematic investigation of interfaces for a large class of structural variety and chemical compositions.Most of the previous efforts focus on a limited number of interfaces, and hence there is a need for a dedicated infrastructure for data-driven interface materials design.Some of the key quantities for determining interface properties are: equilibrium geometries, energetics, work functions, ionization potentials, electron affinities, band offsets, carrier effective masses, mobilities, and thermal conductivities.Calculations of band offsets and band-alignment at semiconductor heterojunctions are of special interest for device design.Semiconductor device transport and performance depend critically on valence band offsets (∆E v ) and conduction band offsets (∆E c ), as well as interfacial roughness and defects [32][33][34] .Based on the band-alignment, heterostructures can be categorized into three classes: (i) type-I (straddling gap), (ii) type-II (staggered gap), and (iii) type-III (broken gap).The type-I heterostructures are used for transistors, lasers and light-emitting diode (LED) applications, type-II are used for photoabsorbers and photocatalysts, and type-III are used for tunneling field effect transistors.
Experimentally, band offsets can be measured using optical spectroscopy, X-ray photoelectron spectroscopy (XPS), ultraviolet photoelectron spectroscopy (UPS), and electrical measurements 5 .However, these experiments can be quite time and resource consuming.Additionally, the variability across multiple reported measurements can be reasonably high.For example, the reported AlN/GaN interface ∆E v varies from 0.57 eV to 1.36 eV with reported uncertainties of up to 0.24 eV 34 .In this respect, the computation of band offsets can serve as a complementary tool to experimental analysis.Nevertheless, the calculation of band offsets is rather challenging 35 and has been an area of research for about a century [36][37][38] .Density functional theory (DFT) calculations are one of the most widely used techniques for predicting band offsets, as they can describe the electronic and atomic structures at the interface in a self-consistent manner.There are two main approaches to predicting band offsets using DFT.The first is to directly simulate the interface using either an alternating slab-junction (ASJ)/superlattice or surface terminated junction (STJ)/slab vacuum geometry, either of which requires a computationally expensive calculation for each pair of materials.Alternatively, the independent unit (IU)/electron affinity/Anderson's model 16,39 requires only independent surface calculations of each material, greatly reducing computational cost but ignoring specific interface effects.ASJ models were shown to be most accurate in Ref. 16 , but IU models are surprisingly competitive.
Importantly, the generation of an atomistic interface geometry is a challenging task due to the high number of possible conformations and configurations.There are several important factors determining an interface such as: the selection of the lattice alignment, the relative orientation/displacement between surfaces, the separation distance, point/line defects, and the presence of interfacial charge transfer.Several previous tools have attempted to address this challenge, including MPInterfaces 18 , TribChem 40 and QuantumATK 41 packages.
Moreover, DFT calculations of interfaces require initial pre-relaxed bulk structures which in this work are obtained from the Joint Automated Repository for Various Integrated Simulations (JARVIS)-DFT 42,43 database containing nearly 80000 bulk 3D and 1100 2D materials.The JARVIS-DFT originated about 5 years ago and contains millions of properties materials and has carefully converged atomic structures with tight convergence parameters, various exchange-correlation functionals such as OptB88vdW 44 , TBmBJ 45 , R2SCAN 46 and HSE06 47 .JARVIS-DFT contains metallic, semiconducting, insulator, superconductor , high-strength, topological, solar, thermoelectric, piezoelectric , dielectric, two-dimensional, magnetic, porous, defect and various other classes of bulk materials 48,49 .We have also previously looked at the band alignment of layered two dimensional materials using JARVIS-DFT 19 .However, three dimensional systems with chemical bonding between the materials require much greater effort, as the interfacial bonding has a much greater effect on the interface properties, and the determination of even a single interface structure is a challenging task.Out of the above material class combinations, semiconductor-semiconductor are of special interest for this work.As DFT calculations can be time-consuming for surfaces and interfaces machine-learning (ML)/deep learning (DL) techniques based on DFT data can be used to accelerate atomistic predictions 50,51 .Such models have often been applied for bulk property predictions and their applicability for defects and interfaces remains an open question.Several machine learning tools available in JARVIS such as classical force-field inspired descriptors (CFID) 43 , atomistic line graph neural network (ALIGNN) 52,53 , computer vision for atomistic images (AtomVision) 54 and natural language processing for chemistry (ChemNLP) 55 can be used in this regards to accelerate the interface design tasks.In particular, ALIGNN has been used to develop several fast surrogate models for property predictions as well as a unified force-field for fast structure optimizations.
Most importantly, for all the above predictions, it is important to benchmark and quantify error with respect to experimental data to gain confidence in the prediction methodology.This work addresses the above challenges and provides a streamlined framework for semiconductor interface design (InterMat).Although focusing on semiconductors, this work has relevance to other applications such as battery, data-storage, and solar-cell devices.We believe that this work will be a precursor to more thorough theoretical and experimental investigations of semiconductor interfaces.

Results and discussion
A schematic overview of InterMat along with the combinatorial problem of interfaces is shown in Fig. 1.InterMat can be used to generate surface and interface structures, perform multi-fidelity calculations to predict properties, analyze and benchmark data against experiments, and train and utilize machine learning models based on the resulting data.Initial atomic structures can be obtained from the JARVIS-DFT repository.As an example of the combinatorial challenge of interfaces, we can consider starting with the 20901 semiconductors in the JARVIS-DFT databsae with OptB88vdW band gaps between 0.1 eV and 6 eV.Using a maximum Miller index (M) of 1, 2, and 3, the number of symmetrically distinct surface slabs are 186847, 591639 and 1642584, respectively.Using these surfaces, the number of binary interface systems that can be generated are 17.5 billion, 175 billion and 1.4 trillion.Including the possibility of different atomic surface terminations, reconstructions, and defects further complicates matters.
It is unrealistic to analyze such a large search space to find combinations for device applications by using conventional experimental or computational techniques.We will instead use ALIGNN models trained on bulk materials to guide and prioritize DFT calculations among the vast pool of candidate surfaces and interfaces.We develop machine learning models for fast predictions of valence band maxima (VBM) and conduction band minima (CBM) using ALIGNN on JARVIS-DFT bulk material dataset.These predictions can be further used for fast IU based band alignment using electron affinity/Anderson's rule 39 .To assess the strengths and limitations of such models, we develop a surface dataset for independent unit (IU) models and an interface dataset for alternate slab junction (ASJ)/superlattice models using DFT.We particularly focus on industrially relevant semiconductors including group IV (C, Si, Ge etc.), III-IV (AlN, GaN, GaAs, GaP, InSb etc.), II-VI (CdS, CdSe, ZnO, ZnS etc.).We also assess the strengths and limitations of IU and ASJ models against experimental measurements.This DFT dataset can then be fed back into the ALIGNN models to further improve accuracy.

DFT surface dataset: work function, electron affinity, ionization potential and surface energy
We develop a dataset of non-polar unreconstructed slab surfaces using the JARVIS-DFT workflow and bulk material dataset.Examples of silicon and gallium arsenide bulk atomic structures are shown in Fig. 2a and Fig. 2b respectively.Next, we generate surface slab structures with a thickness of 1.6 nm and vacuum padding of 1.2 nm, as shown in Fig. 2c.Recently, it was shown that vacuum and slab thicknesses of at least 10 Å are sufficient for surface models 76 .During the DFT calculations, the converged k-point 77 values from the relevant bulk calculations are used for surfaces.We optimize the internal coordinates of these surfaces keeping the cell volume constant.

InterMat
Fig. 1 Schematic overview of the workflow.InterMat can be used to generate surface and interface geometric structures (Geom.),perform multi-fidelity calculations (such as density functional theory, force-field, tight-binding and machine learning) to predict properties (such as surface energy, interface formation energy, band offset, work function, ionization potential, electron affinity), analyze and benchmark data against experiments, and utilize machine learning models for such data.The number of possible semiconductor-semiconductor interfaces is exceedingly large.The workflow aims to provide a toolkit to generate interface structures and use multi-fidelity methods to accelerate interface/heterostructure design.
experimental measurements from the literature.The surface energy (γ) can be calculated using the formula: where (E slab ) is the total energy of the relaxed slab model, (N bulk ) is the number of bulk-like atoms in the slab model, (E bulk ) is the energy per atom in the bulk material, and (A) is the surface area of the slab model.The factor of 2 accounts for the fact that there are two surfaces in the non-polar slab model (top and bottom).We obtain the valence band maximum (VBM) and vacuum level (E vac ) of surface slabs from DFT calculations using the OptB88vdW functional.The work function is obtained by subtracting the vacuum level from the Fermi level (φ = E vac − E F ). Similarly, the ionization potential is the difference between the VBM and E vac .Then, we add the electronic bandgap (E g ) of the bulk material to the ionization potential to get the electron affinity (EA, χ).Semi-local DFT has proven quite effective in describing the valence bands of materials but is known to underestimate band gaps.For accurate prediction of both valence and conduction bands, particularly in materials with complex electronic interactions, higher-level theories like many-body perturbation theory (e.g.GW calculations) might be necessary, but are computationally very expensive.In order to address this problem in a more computationally efficient manner, we make use of bulk band gaps from the JARVIS-DFT database computed using the TBmBJ metaGGA functional.TBmBJ predictions can provide band gap descriptions with accuracy close to more expensive methods but at an order of magnitude less computational cost 78 , which is important for high-throughput studies.We calculate surface conduction band quantities by first calculating the valence band at the GGA-level using OptB88vdW and then add to that the TBmBJ bulk gap to get the conduction band minimum (CBM).Performing full surface calculations using hybrid functionals or GW is beyond the scope of the present work because of excessive computational cost, but we plan to provide further tests of those approaches in the future 14,15 .
In order to benchmark our surface dataset, we compare work functions, electron affinities, and surface energies of several dozen surfaces with experimental data in Table .1.We find excellent agreement for the work functions, with a mean absolute error value of 0.29 eV, consistent with previous benchmarking efforts 79 .Similarly, we obtain a mean absolute error of 0.39 eV and 0.34 Jm −2 for the electron affinity and surface energy, respectively.Currently, we have performed calculations on 607 surfaces using the workflow, and the dataset is still growing.Using, these 607 surfaces, 183921 IU-band offsets can be predicted.Also, we plan to include reconstructed and polar surfaces in the future.

DFT Interface dataset: alternate slab junction (ASJ) band alignment
We next consider explicit DFT calculations of interfaces, which first require generating candidate interface structures.This can be done in either of the following two ways: 1) by attaching the two surface slabs together without vacuum padding, creating a superlattice or alternating slab junction (ASJ) structure, or 2) by attaching the two surface slabs with vacuum padding, creating a surface terminated junction (SJT) structure.We have focused on the ASJ approach 16,80 .After obtaining surface slab structures as discussed in the previous section, we generate the interfaces following the Zur et.al. algorithm 81 .The Zur algorithm generates a number of superlattice transformations within a specified maximum surface area and also evaluates the length and angle between film and substrate superlattice vectors to determine if they can match within a tolerance.This algorithm is applicable to different crystal structures and their surface orientations.We use a maximum lattice mismatch of 8 %, maximum area of 300 Å 2 , and maximum angle tolerance of 1 degree.
Note that in previous studies, lattice mismatch of 20 % has been reported 14,82 .An example of the application of the algorithm to the Si(110)/GaAs(110) interface is shown in Fig. 2d with several lattice length and angle mismatches (∆u and ∆θ ) as well as maximum area.After eliminating structures with area higher than max-area tolerance and structures with mismatch angle more than the specified angle threshold, we then choose the remaining structure (if any) with the minimum mismatch lattice vector lengths.The Zur algorithm determines a candidate unit cell, but the relative alignment of the structures in the in-plane, as well as the slab terminations still need to be decided.For the in-plane alignment, we perform a grid search of possible options with a spacing interval of 0.05 fractional coordinates to determine the initial structure for further relaxation.Doing such a large number of calculations with DFT would be prohibitive, so we use ALIGNN-FF 53 to identify the starting in-plane alignment.ALIGNN-FF is a universal force field ML model developed using JARVIS-DFT data with 307113 structures and can be used to model combinations of 89 elements from the periodic table.An example of ALIGNN-FF predictions of an in-plane grid search is shown in Fig. 2e.For the Si/GaAs(110) case, we also perform corresponding DFT calculations as shown in Fig. 2f.Here high-peaks (yellow color using magma colormap) usually represent too close atoms during the translation operations, which should be avoided.Clearly, the DFT contours are smoother than ALIGNN-FF because of its relatively rough potential energy surface (PES).Nevertheless, the minimums of the contours, which are of interest for in-plane alignments, closely resemble each other.As the ALIGNN-FF accuracy increases with more data, we expect to get much smoother PES in future.After the ALIGNN-FF calculations to select the initial alignment, a full DFT relaxation is performed.For computational purposes, it is important to have a unique identifier for an interface.While generating the interfaces, we use a naming convention to include a) material IDs (such as JVASP-1002 for Si and JVASP-1174 for GaAs), b) film and substrate Miller indices (such as 110 for each), c) film and substrate thickness values (such as 16 Å each), d) separation between these two surface slabs (such as 2.5 Å for an ASJ model, 18 Å for STJ interface models), e) relative displacement in xy-plane (such as a displacement vector of [0.5, 0.2]), f) calculator method (such as DFT (VASP), ALIGNN-FF etc.) giving rise to an interface with a name such as: Interface-JID1_JID2_film_miller_M1_sub_miller_M2_film_thickness_T1_subs_thickness_T2_separation_S_disp _X_Y_vasp (where JID1 is JVASP-1002, JID2 is JVASP-1174, M1 and M2 are both 1_1_0, T1 and T2 are 16, S is 2.5, X is 0.5, Y is 0.2).Such a scheme helps to reproduce the unique interfaces.Of course, realistically, more complex parameters for an interface can be important such as terminations, reconstructions, misfit-dislocation, vacancies on the interface etc., but they can be easily included in the naming scheme as well later.
After selecting a good guess of the interface using the above approach, DFT calculations are performed to calculate quantities such as the interface formation energy and the band offset value.During the DFT calculations, the more converged k-point grids and energy cutoffs of the two constituent bulk materials 77 is used.An example of the Si(110) and GaAs(110) interface is shown in Fig. 3a.Furthermore, we can project the electron density of states across the cell dimension in Fig. 4 to show how electronic states are distributed along the interface region.We observe the GaAs gap decrease near the silicon region.Such analysis can help to understand the local band alignment and atomic character, which are important for device modeling.
After determining the optimized geometric structure for the interface using DFT, we obtain the interface formation energy and valence band offset data using the formalism detailed in Ref. 83 and Ref. 11   Si(110)/GaAs(110) and AlN(001)/GaN(001) in Fig. 3.The interface formation energy (γ f ) is calculated using the formula: where γ interface formation energy, E tot is the total energy of the superlattice, µ i is the chemical potential of the specie i, n i is the number of atoms of the specie i, and A is the interface unit cell area.Using the bulk materials energy per atom in its most stable form in JARVIS-DFT and OptB88vdW functional, we obtain an interface formation energy of -0.056 Jm −2 for the Si(110)/GaAs(110) system.A negative formation energy suggests a feasible formation of the interface.Moreover, such interface formation energies with varying chemical potentials of the constituent elements can provide information about the thermodynamic stability of the interface in different growth conditions.Such detailed tasks for individual interfaces will be carried in future.In Fig. 3a, we show the atomic structure of the ASJ based heterostructure of Si(110)/GaAs(110).The left side (with blue atoms) represents the Si and the right side is the GaAs region.In Fig. 3c, we show the electrostatic potential profile, averaged in-plane, of the interface.The approximately sinusoidal profile on both regions represents the presence of atomic layers.The cyan lines show the region used to define the repeat distance, L, used for averaging in each material (see below).The red and green lines show the average potential profiles for the left and right parts using the repeat distance.The valence band offset (∆E v ) of an interface between semiconductor A and B, ∆E v is obtained using eq. 4. The difference in the averages for the left and right parts gives the ∆V term.Now the bulk VBMs of the left and right parts are also calculated to determine the ∆E.The sum of these two quantities gives the valence band offset that can be compared to experiments.
Here, E A v (E B v ) represents the position of the VBM with respect to the average electrostatic potential in the bulk material A (B), and ∆V represents dipole potential or the difference between the macroscopic-averaged electrostatic potential between A and B.Moreover, V is the average along the repeat unit L of V , which is the planar averaged electrostatic potential: V is given by: where, L is the distance between repeat units and S is the area which is parallel to the interface.The corresponding conduction-band offset can be determined by using TBmBJ band-gap values from the respective bulk calculations or experimental data.We will use the convention that a positive value of the valence-band offset at an interface A/B indicates that the VBM is higher in material B. For the GaAs/Si interface we obtain ∆E v of 0.31 eV and 0.39 eV using OptB88vdW and R2SCAN functionals, respectively, which is in close agreement with the experimental value of 0.   potential profile is shown in Fig. 3d.In contrast to flat average potential values in Fig. 3c, we observe inclined profiles for this system indicating the presence of a constant electric field.We fit lines for both sides and extrapolate to the interface.The difference of the lines at the interface gives ∆V .The calculation of ∆E remains the same as the non-polar case.These calculations are automated in the workflow, however, it is important to check that the slab is thick enough to define a bulk-like region where V (z) is linear.Now, in the for a smaller number of systems using the HSE06 functional.In the future, we plan to carry out HSE06 calculations for surfaces as well as interfaces to further improve the quality of predictions.These benchmarks will also be available in the JARVIS-Leaderboard platform 103 as well.Out of numerous possible combinations, only 593 DFT calculations of ASJ-based interfaces are available right now and the database is still growing.

DFT-based independent unit (IU) band alignment
IU band alignment, also known as Anderson's rule 39 , predicts semiconductor band offsets at interfaces using only the IP and EA data from independent surface calculations.For a semiconductor heterojunction between A and B, the conduction band offset is given by: Similarly, the valence band offset is given by: In Fig. 6a, we show the DFT-based IU band alignments for a set of well-known semiconductor surfaces.We also include dotted lines for the energy levels of H 2 and H 2 O, which are relevant for photo-catalyst applications.We compare the DFT based IU band offsets for 8 interfaces in Table 2 against experiments.We find a mean absolute error of 0.45 eV which is similar to a value of 0.32 eV as found in ref. 16 for different systems.

ALIGNN-based IU alignment from bulk data
We seek to accelerate the prediction of band edges using ML models, but the absolute prediction of band edges relative to vacuum requires DFT calculations with surfaces, which are too computationally expensive to create a robust dataset.JARVIS-DFT contains a much larger dataset of three dimensional materials with CBM and VBM values.Here, the CBMs and VBMs are simply the band edges written out by VASP for the bulk materials dataset using OptB88vdW.These band edges use the VASP convention that the average electrostatic potential of a unit cell is set to zero, and are not directly comparable to experimental values.A surface calculation with explicit vacuum is necessary to align the VBM/CBM to vacuum.We first train an ML model using ALIGNN based on these bulk quantities, but we will then show that this model is somewhat surprisingly also useful for predictions of band edges relative to vacuum.
To train the ALIGNN model, we split each bulk VBM/CBM dataset into 90:5:5 train:validation:test parts.We train on 90 % train data and evaluate the validation and test data using ALIGNN.We find a mean absolute error (MAE) of 0.28 ev for CBM and VBM.We note that the CBM/VBM data can vary from -10 to 10 eV (as shown in Fig. 5a,b) suggesting that the model should be reasonable for predictions.The mean absolute deviation (MAD) for the CBM and VBM are 2.08 eV and 2.67 eV, respectively, so the MAD:MAE is nearly 10 relative to a trivial baseline model.Out of several other material properties trained using ALIGNN 52 , CBM/VBM models has one of the highest MAD:MAE ratios, especially compared to other electronic properties like the band gap.We show the CBM and VBM prediction models in Fig. 5a and Fig. 5b respectively.We believe with more data using the active learning loop we can further increase the MAD:MAE in future 50 .
Next, we evaluate the bulk-trained ALIGNN models on the DFT surface dataset and show results in Fig. 5c and Fig. 5c for CBM and VBM respectively.We note that the training data does not include any surfaces.Nevertheless, most of the data points are on x = y line suggesting excellent agreement.We find MAE values of 0.55 eV and 0.96 eV for VBM and CBM respectively.This level of agreement is surprising because the value of the averaged electrostatic potential will change as the ratio of vacuum thickness to slab thickness changes.We also note that the error in CBM is higher than that of VBM, perhaps an explanation for why predicting band gaps of materials using machine learning is ever-standing difficult problem.
Up to this point, our model can only predict quantities relative to a cell-averaged electrostatic potential, which cannot be directly compared to experiment.However, we observe that our ALIGNN model predictions of the bulk VBM/CBM are in fact strongly correlated with the VBM/CBM values calculated with respect to vacuum using DFT calculations of surface slabs.We can get useful predictions of the vacuum-aligned VBM by subtracting a heuristic constant value of 10 eV from the ALIGNN predictions, and adding the bulk TBmBJ band gap to get the corresponding CBM.We visualize this IU based band alignment using ALIGNN model in Fig. 6b.We observe that the overall trends of DFT and ALIGNN closely resemble each other.For these surfaces, we calculate the classification accuracy of the heterostructures in type-I, type-II and type-III heterostructures.We find precision scores of 66.7 %, 66.1 % and 58.2 % respectively.Precision is defined as the fraction of relevant instances among all of the retrieved instances.The classification precision scores are based on DFT optimized surfaces, which are not available for all the materials in the database.So, we generate structures directly from the bulk counterparts, relax them using ALIGNN-FF and then predict the electron affinity and ionization potentials using the procedure mentioned above.In this way, we find precision scores of 55.0 %, 63.4 % and 60.0 % respectively suggesting that structure optimization of surfaces has an impact on the ALIGNN predictions.The random baseline is 1/3 = 33 %, which is more than 2 times lower than what we achieve.
As an example of the type of analysis that can be done with this data, we analyze all hetrostructures where the film is silicon.As shown in Fig. 7, we identify which elements in the second semiconductor make it most likely that the hetrostructrue will have a Type-1/straddling band alignment appropriate for diode applications.We find these elements to be Al, P, S, N, O, Li which is consistent with known silicon devices.Many other analyses are possible, we provide this data in hopes that it will be useful to the community.
Sufficiently high precision scores suggest that such models can be used for pre-screening applications followed by density functional theory calculations and experiments.Also, as the DFT bulk, surface and interface dataset is growing continuously, there is plenty of scope to improve the model performance in the future.For the 1.4 trillion semiconductor interfaces, we find 294 billion as type-I, 322 billion as type-II and rest as type-III heterostructures using the ALIGNN+IU model.The results suggest that finding type-I interfaces for transistor applications is more challenging than other heterostructure applications.Having such a large number of options and further screened for desirable properties such as effective masses, dielectric, piezoelectric, thermoelectric properties etc. can be helpful for technological applications.We emphasize the point that AI models should be considered as a pre-screening step only and would require thorough DFT and/or experimental validation.
Given the lack of surface-specific training data, the level of agreement with the unseen surface data is surprising, however, there are discrepancies in some cases.In other words, the model has never seen bonding environments that occur on slab surfaces, and extrapolating to such environments is challenging.The goal here is to obtain a fast model that can be used for quick screening of surfaces, with subsequent DFT calculations for confirmation.It may be possible to finetune this ALIGNN model with a surface dataset to further improve the accuracy of the model, but we leave that for future work.Nevertheless, the close resemblance in alignment predictions is promising and suggests that our models can be useful.Similar successful extrapolations for bulk-trained models were observed in Ref. 104 , which demonstrates that vacancy energies can be predicted from a ML model fit to bulk crystal data only.We clarify that we are not using a ALIGNN model to predict either the OptB88vdW or TBmBJ band gaps in this work.We are simply looking up the bulk band gaps in the JARVIS-DFT database, so this does not contribute to the error.However, we do have models for these quantities with MAEs of 0.14 eV and 0.31 eV respectively (see Ref 52,103 ).For materials not in the JARVIS-DFT database, we could use these models to predict the gaps.
In summary, we have provided a computational framework and dataset for investigating interface systems using multi-fidelity computational approaches.We have developed one of the largest datasets, containing 607 surfaces, 183921 IU-band offsets, and 593 ASJ interface band offset using DFT.Using universal graph neural network models, we have quickly screened potential semiconductor device candidates as transistors from a pool of 1.4 trillion possible interfaces, which would not have been possible using conventional computational or experimental techniques.Although we have applied this framework for semiconductors, it can be useful for other technological applications as well.After pre-screening, we have shown and benchmarked this streamlined workflow for band offset predictions using the independent unit and alternate slab junction models.This work paves the way for the application of materials design approach to interface systems.All of the tools and datasets developed in this work will be distributed publicly in the spirit of open-science.

Methods
Graph neural networks are trained using Atomistic Line Graph Neural Network (ALIGNN) framework 52 which uses PyTorch and deep graph library (DGL).Such GNN models can be used for graph level prediction (such as total energy of the system, bandgap etc.) or node level predictions (such forces, charges, atomic magnetic moments etc.)In ALIGNN, a crystal structure is represented as a graph using atomic elements as nodes and atomic bonds as edges.Each node in the atomistic graph is assigned 9 input node features based on its atomic species: electronegativity, group number, covalent radius, valence electrons, first ionization energy, electron affinity, block and atomic volume.The inter-atomic bond distances are used as edge features with radial basis function up to 8 Å cut-off and a 12nearest-neighbor (N).This atomistic graph is then used for constructing the corresponding line graph using interatomic bond-distances as nodes and bond-angles as edge features.ALIGNN uses edge-gated graph convolution for updating nodes as well as edge features using a propagation function ( f ) for layer (l), atom features (h), and node (i), details of which can be found in Ref. 52,53 : ALIGNN is trained for 500 epochs and with default parameters in the package.We use a 90:5:5 training:validation:testing randomly distributed data split for the CBM and VBM of the bulk materials dataset.The data splits and corresponding identifiers used during the training are made available in the figshare repository.While ALIGNN was used as surrogate/property prediction model at a graph level, in the later version, we also included atomwise/nodewise property predictions such as forces.These forces are directly derived from the energies hence leading to force-field development (ALIGNN-FF).ALIGNN-FF was trained on the JARVIS-DFT dataset.Note that, we did not need to modify the ALIGNN model for surfaces because these GNN are based on the local environment around each atom only.We have used the same ALIGNN model in molecules 52 and metal-organic frameworks 105 also without changing any architecture and still leading to accurate results.Next, JARVIS-DFT is a collection of 80000 diverse materials primarily using OptB88vdW in VASP software following strict protocols for convergence etc.In addition to the datasets, JARVIS-DFT is seamlessly integrated with the JARVIS-tools package for setting up calculations and performing analysis using a variety of multi-fidelity and multi-scale simulation approaches.ALIGNN-FF was trained on 307811 bulk structures with 1 million forces obtained from SCF relaxation step for materials in the JARVIS-DFT.ALIGNN-FF was shown to capture both structural and chemical diversity with reasonable accuracy especially for structure optimization.DFT calculations were carried out using the Vienna Ab-initio simulation package (VASP) software 106,107 with OptB88vdW 44 , TBmBJ 45 and R2SCAN 46 functionals using the workflow given on our 'jarvis-tools' GitHub page (https://github.com/usnistgov/jarvis).We use the OptB88vdW functional, which gives accurate lattice parameters for both vdW and non-vdW (3D-bulk) solids.The crystal structure was optimized until the forces on the ions were less than 0.01 eV/Å and energy less than 10 −6 eV.Also, we calculate the local potential containing ionic plus Hartree contributions to determine the vacuum potential (VAC) of surface slabs.The VAC is subtracted from the valence band maxima (VBM) and conduction band minima (CBM) to enable the comparison of band-diagrams of individual slabs in band-alignment diagrams.The converged k-points and cut-off for the bulk materials were also used for the corresponding surface slab models.The ASJ based interface structures were generated using Zur algorithm 81 available in the JARVIS-Tools.For a quick scan of xy-displacements for surfaces in the interfaces, ALIGNN-FF was used.

Fig. 2
Fig. 2 Structure generation criteria selection and initial xy-plane scan with ALIGNN-FF for Si(110)/GaAs(110) interface.a) atomic structure of silicon (Si), b) atomic structure of gallium arsenide (GaAs), c) surfaces (110) generated from the bulk structures of Si (left) and GaAs (right), d) candidate interface parameters from Zur algorithm: mismatch in x-direction (u), mismatch in y-direction (v) and maximum allowed area to generate suitable structures.e) ALIGNN-FF and f) DFT energy as a function of displacement in xy-plane of interface.

Fig. 3
Fig. 3 Atomic structures and band-alignment using the average electrostatic potential of semiconductor interfaces.a) atomic structure view of Si/GaAs(110), b) atomic structure view of polar interface AlN/GaN(001), c) electrostatic potential profile for non-polar interface Si/GaAs(110), d) average electrostatic potential profile for polar interface of AlN/GaN (001).The cyan lines are used to find the repeat unit layers for the left and right parts.The red and green lines show the potential averaged over the repeat distances of the left and right slabs, respectively.The dotted vertical blue line marks the interface.
respectively.As an example, we show a detailed analysis of J o u r n a l N a me , [ y e a r ] , [ v o l .] , 1-14 | 5 z-direction

Fig. 4
Fig.4Atom projected density of states for the system, separated along the z direction, normal to the interface.Red, green and blue colors represent gallium, arsenic and silicon atom contributions respectively.
J o u r n a l N a me , [ y e a r ] , [ v o l .] , 1-14 | 6

Fig. 5
Fig. 5 ALIGNN based regression models for a) VBM for JARVIS-DFT 3D/bulk materials test set data, b) CBM for bulk materials test set data, c) slab surface VBMs not part of the training set to evaluate extrapolation strength, d) slab surface CBMs not part of the training set to evaluate extrapolation strength.

Fig. 6 AFig. 7
Fig. 6 A few examples of band alignment based on the ionization potential (IP) and electron affinity (EA) for Anderson's/independent unit (IU) model.a) Density functional theory based alignments can be used to obtain band offsets.b) Fast ALIGNN based band alignment predictions.The numbers in the green bars represent electron affinity while that in the blue bars represent ionization potentials.The trained models based band alignment will be available at JARVIS-Heterostructure website soon.
J o u r n a l N a me , [ y e a r ] , [ v o l .] , 1-14 | 11 function (φ ) values against

Table . 2
we compare some of the ASJ based valence band offsets (∆E v ) with experimental measurements.We find a mean absolute error of 0.22 eV and 0.23 eV for OptB88vdW and R2SCAN respectively, which is comparable a value of 0.16 eV from to Liberto et.al.

Table 2
Valence band offsets (in eV) of a few independent unit (IU)/Anderson's model and alternating slab-junction (ASJ) based semiconductor/semiconductor interfaces with OptB88vdW (OPT) and R2SCAN functionals in comparison to previously reported experiments.Here ID, Miller and P represent a JARVIS-DFT identifier, Miller index and polar surface interfaces respectively.