Formalizing Chemical Physics using the Lean Theorem Prover

Chemical theory can be made more rigorous using the Lean theorem prover, an interactive theorem prover for complex mathematics. We formalize the Langmuir and BET theories of adsorption, making each scientific premise clear and every step of the derivations explicit. Lean's math library, mathlib, provides formally verified theorems for infinite geometric series, which are central to BET theory. While writing these proofs, Lean prompts us to include mathematical constraints that were not originally reported. We also illustrate how Lean flexibly enables the reuse of proofs that build on more complex theories through the use of functions, definitions, and structures. Finally, we construct scientific frameworks for interoperable proofs by creating structures for classical thermodynamics and kinematics, using them to formalize gas law relationships like Boyle's Law and the equations of motion underlying Newtonian mechanics, respectively. This approach can be extended to other fields, enabling the formalization of rich and complex theories in science and engineering.


Introduction
Theoretical derivations in the scientic literature are typically written in a semi-formal fashion, and rely on human peer reviewers to catch mistakes.When these theories are implemented in soware, the translation from mathematical model to executable code also requires humans to catch errors.This reects the gap between mathematical equations describing models in science and the soware written to encode these. 1 This occurs because the computer doesn't understand relationships among the scientic concepts and mathematical objects under study, it simply executes the code given it.Here, we recommend an alternative: interactive theorem provers that enable the mathematics and programming of science to be expressed in a rigorous way, with the logic checked by the computer.

Theorem provers for chemical theory
4][5][6][7][8] Formal proofs are used extensively in mathematics to prove various theories. Scientific derivations, on the other hand, tend to use informal proofs, since these are easier to write and understand (see Table 1).
Scientists are generally familiar with computer algebra systems (CAS) that can symbolically manipulate mathematical expressions (see Table 3). These systems include SymPy [9] and Mathematica [12]. They are used frequently for scientific applications but come at the cost of being unsound, meaning they can reach false conclusions.
Theorem provers are more rigorous than computer algebra systems because they require computer-checked proofs before permitting operations, thereby preventing false statements from being proven. For example, a × b = b × a is true when a and b are scalars, but A × B ≠ B × A in general when A and B are matrices. CAS impose special conditions to disallow A × B = B × A [9], whereas theorem provers only allow changes that are proven to be valid. Theorem provers construct all of their math from a small base kernel of mathematical axioms, requiring computer-checked proofs for objects constructed from the axioms. Even the most complicated math can be reduced back to that kernel. Since this kernel is small, verifying it by human experts or with other tools is manageable. All higher-level math built and proved from the kernel is then just as trustworthy. This contrasts with how CAS represent and introduce mathematics: because proofs are not required when high-level math is introduced, mistakes could enter at any level and would require humans to catch and debug them [10] (see Table 2).
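The scalar case above can be illustrated with a minimal Lean 3 sketch, assuming mathlib is available. The lemma `mul_comm` is an existing mathlib theorem; the example itself is ours and is not taken from the authors' repository.

```lean
import data.real.basic

-- Commutativity of real multiplication is a theorem already proved in mathlib,
-- so Lean accepts it as the justification for this statement.
example (a b : ℝ) : a * b = b * a := mul_comm a b
```

The same term would not type-check if `a` and `b` were general matrices, because `mul_comm` applies only to types that have been proved to have a commutative multiplication.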
5][16][17][18][19] Before computers, this "axiomatization" of mathematics was developed by hand, in works like Principia Mathematica by Alfred North Whitehead and Bertrand Russell [20]; the aim is to write down a minimal list of fundamental assumptions (axioms) and then systematically derive all of mathematics from those axioms. Computers play a key role in modern formalization efforts because they can store and verify massive libraries of interconnected theorems collaboratively written by hundreds of mathematicians [21]. An analogous program to "axiomatize" physics was famously articulated as Hilbert's sixth problem [22]. Recent reviews have discussed progress and unsolved questions on this "endless road" to describe how all of physics can be derived from a minimal set of axioms [23,24]. Our vision is somewhat distinct from this: we are inspired by Paleo's ideas for formalizing physics theories [25] as a collection of proofs, instead of aiming to represent science as a single edifice emerging from one set of axioms (though this structure may emerge in the future). In particular, we ask "How can we formally represent a collection of proofs/derivations using an interactive theorem prover?" Theorem provers have previously been used to formalize derivations in physics: theorems from Newton's Principia [26,27], versions of relativity theory [28,29], electromagnetic optics [30], and geometrical optics [31] have been described and proved using proof assistants. Artificial intelligence tools for scientific discovery have also used theorem provers in designing optical quantum experiments [32], as well as for rediscovering and deriving scientific equations from data and background theory [33].
Here we focus on formalizing fundamental theories in the chemical sciences. Progress toward axiomatizing thermodynamics began with Carathéodory in 1909 [34], with recent developments by Lieb and Yngvason [35]. But broadly, these questions have not been addressed using theorem provers to check the mathematics, and theorem provers have seen limited use in the chemical sciences. One notable application by Bohrer [36] uses a proof assistant that reasons about differential equations and control algorithms [37] to describe and prove properties of chemical reactors.

The Lean theorem prover
We have selected the Lean theorem prover [38] for its power as an interactive theorem prover, the coverage of its mathematics library, mathlib [39], and the supportive online community of Lean enthusiasts [40], which aims to formalize the entire undergraduate math curriculum [21,41]. Interesting projects in modern mathematics have emerged from its foundations; several, including Perfectoid Spaces [42], the Cap Set Problem [43], and Liquid Tensor [44], have garnered attention in the media [45]. A web-based game, the Natural Number Game [46], has been widely successful in introducing newcomers to Lean. As executable code, Lean proofs can be read by language modeling algorithms that find patterns in math proof databases, enabling automated proofs of formal proof statements, including International Math Olympiad problems [47,48].
We anticipate that Lean is expressive enough to formalize diverse and complex theories across quantum mechanics, fluid mechanics, reaction rate theory, statistical thermodynamics, and more. Lean gets its power from its ability to define mathematical objects and prove their properties, rather than just assuming premises for the sake of individual proofs. Lean is based on type theory [49,50], where both mathematical objects and the relations between them are modeled with types (see Fig. S1 in the ESI†). Everything in Lean is a term of a Type, and Lean checks to make sure that the Types match. Natural numbers, real numbers, functions, Booleans, and even proofs are types; examples of terms with these types include the number 1, Euler's number, f(x) = x², TRUE, and the proof of BET theory, respectively. Lean is also expressive enough to allow us to define new types, just like mathematicians do [38], which allows us to define specific scientific theories and prove statements about them.
In this paper, we show how formalizing chemical theories may look by demonstrating the tools of Lean through illustrative proofs in the chemical sciences. First, we introduce variables, types, premises, conjectures, and proof steps through a simple derivation of the Langmuir adsorption model. Next, we show how functions and definitions can be used to prove properties of mathematical objects by revising the Langmuir adsorption model through definitions and showing that it has zero loading at zero pressure. Finally, we turn to more advanced topics, such as using geometric series to formalize the derivation of the BET equation and using structures to define and prove relationships in thermodynamics and motion.

Methods
Lean has a small kernel, based on dependent type theory [49,50], with just over 6000 lines of code, which allows it to instantiate a version of the Calculus of Inductive Constructions (CoIC) [51,52]. The strong normalization property of the CoIC [53] yields a robust programming language that is consistent. The CoIC provides a constructive foundation for mathematics, allowing the entire field of mathematics to be built on those roughly 6000 lines of code.
In Section 3 we outline the proofs formalized using Lean version 3.51.1. We host the proofs on a website that provides a semi-interactive platform connected to the Lean code in our GitHub repository (https://atomslab.github.io/LeanChemicalTheories/). An extended methods section introducing Lean is in ESI† Section 5.1.

Langmuir adsorption: introducing Lean syntax and proofs
We begin with an easy proof to introduce Lean and the concept of formalization. The Langmuir adsorption model describes the loading of adsorbates onto a surface under isothermal conditions [54]. Several derivations have been developed [54][55][56][57]; here we consider the original kinetic derivation [54]. First, we present a derivation of the Langmuir model, given by eqn (6), as LaTeX equations, then transfer this into Lean and rigorously prove it. We also discuss how these proofs can be improved to be more robust.
The Langmuir model assumes that all sites are thermodynamically equivalent, that the system is at equilibrium, and that adsorption and desorption rates are first order. The adsorption and desorption rates are given by eqn (1) and eqn (2), respectively.
The symbols r_ad, k_ad, p_A, and [S] represent the rate of adsorption, the adsorption rate constant, the pressure of the adsorbate gas, and the concentration of available sites on the surface, respectively.
In the desorption equation, r_d stands for the rate of desorption, k_d signifies the desorption rate constant, and [A_ad] represents the concentration of adsorbed molecules. After assuming equilibrium, r_ad = r_d, and combining eqn (1) and eqn (2) with some rearrangement, we get eqn (3).
Using the site balance [S_0] = [S] + [A_ad], where [S_0] represents the total concentration of available sites, we arrive at eqn (4).
We can rearrange eqn (4) into eqn (5).§ Using the definition of the fraction of adsorption, θ = [A_ad]/[S_0], and the definition of the equilibrium constant, we arrive at the familiar Langmuir adsorption equation, eqn (6).
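The numbered equations cited in this derivation are not reproduced in this copy of the text. The block below is a reconstruction from the surrounding definitions, offered as a sketch rather than the authors' typeset equations; the symbol K_eq for the equilibrium constant and its definition as k_ad/k_d are assumptions consistent with the premises described for the Lean proof.

```latex
\begin{align}
r_{ad} &= k_{ad}\, p_A\, [S] \tag{1}\\
r_{d} &= k_{d}\, [A_{ad}] \tag{2}\\
k_{ad}\, p_A\, [S] &= k_{d}\, [A_{ad}] \tag{3}\\
k_{ad}\, p_A \left([S_0] - [A_{ad}]\right) &= k_{d}\, [A_{ad}] \tag{4}\\
\frac{[A_{ad}]}{[S_0]} &= \frac{k_{ad}\, p_A}{k_{d} + k_{ad}\, p_A} \tag{5}\\
\theta &= \frac{K_{eq}\, p_A}{1 + K_{eq}\, p_A},
\qquad K_{eq} = \frac{k_{ad}}{k_{d}} \tag{6}
\end{align}
```

Eqn (4) substitutes [S] = [S_0] − [A_ad] into eqn (3); eqn (5) collects the [A_ad] terms and divides by [S_0]; eqn (6) divides numerator and denominator of eqn (5) by k_d.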
This informal proof is done in natural language, and it doesn't make explicit which equations are premises of the proof and which are intermediate steps. While the key steps from the premises to the conclusion are shown, the fine details of the algebra are excluded. In contrast, Lean requires the premises and conjecture to be precisely defined and requires that each rearrangement and cancellation is shown or performed computationally using a tactic. The next part shows how this proof is translated into Lean.
As shown in Fig. 1, every premise must be explicitly stated in Lean, along with the final conjecture and the proof tactics used to show that the conjecture follows from the premises. Lean is an interactive theorem prover, meaning that the user is primarily responsible for setting up the theorem and writing the proof steps, while Lean continuously checks the work and provides feedback to the user. The central premises of the proof are expressions of the adsorption rate (hrad), the desorption rate (hrd), the equilibrium relation (hreaction), and the adsorption site balance (hS₀). Additional premises include the definitions of the adsorption constant (hK) and surface coverage (hθ) from the first four premises, as well as mathematical constraints (hc1, hc2, and hc3) that appear during the formalization. The model assumes the system is in equilibrium, so the adsorption rate, r_ad = k_ad × P × S, and the desorption rate, r_d = k_d × A, are equal to each other, where k_ad and k_d are the adsorption and desorption rate constants, respectively, S is the concentration of empty sites, and A is the concentration of sites occupied by A. After begin, a sequence of tactics rearranges the goal state until the conjecture is proved. Note that when performing division, Lean requires the denominator terms to be nonzero.
§ The manuscript we first submitted for peer review included a typo in eqn (5), with [S_0] appearing as [S]. Neither the authors nor the peer reviewers detected this; it was identified by a community member who accessed the paper on arXiv. Of course, Lean catches such typos immediately.
An interesting part of the proof is that only certain variables, or combinations of them, are required to be nonzero. When building this proof, Lean imports the real numbers and the formalized theorems and tactics for them in mathlib. Lean does not permit division by zero, and it will flag issues when a number is divided by another number that could be zero. Consequently, we must include the additional hypotheses hc1-hc3 in order to complete the proof. These provide the minimum mathematical requirements for the proof; stricter constraints requiring rate constants and concentrations to be positive would also suffice. These ambiguities are better addressed by using definitions and structures, which enable us to prove properties about the object. Nonetheless, this version of the Langmuir proof is still a machine-readable, executable, formalized proof.
Though this is a natural way to write the proof, we can condense the premises by using local definitions. For instance, the first two premises, hrad and hrd, can be written into hreaction to yield k_ad × P × S = k_d × A, and we can also write the expressions of hθ and hK in the goal statement. While hrad, hrd, hθ, and hK each have scientific significance, in this proof they are just combinations of real numbers. Alternative versions of this proof are described in ESI Section 5.2.1.†
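The premise-based proof described in this section can be sketched in Lean 3 with mathlib as follows. This is a hedged illustration, not the code shown in Fig. 1: the hypothesis names follow the text, but the theorem name, the exact form of the nonzero constraints hc1-hc3, and the tactic script (which uses `subst`, `div_eq_div_iff`, and `ring` instead of the `field_simp` steps described in the figure caption) are our assumptions.

```lean
import data.real.basic
import tactic

-- Hypothetical name; premises mirror those listed in the text.
theorem langmuir_single_site_sketch
  (P k_ad k_d A S₀ S θ K r_ad r_d : ℝ)
  (hrad : r_ad = k_ad * P * S)        -- adsorption rate
  (hrd : r_d = k_d * A)               -- desorption rate
  (hreaction : r_ad = r_d)            -- equilibrium
  (hS₀ : S₀ = S + A)                  -- site balance
  (hK : K = k_ad / k_d)               -- equilibrium constant
  (hθ : θ = A / S₀)                   -- fractional coverage
  (hc1 : S + A ≠ 0)                   -- assumed form of the constraints
  (hc2 : 1 + K * P ≠ 0)
  (hc3 : k_d ≠ 0) :
  θ = K * P / (1 + K * P) :=
begin
  -- write the two rate laws into the equilibrium premise
  rw [hrad, hrd] at hreaction,
  -- solve the equilibrium relation for A (this is where k_d ≠ 0 is needed)
  have hA : A = k_ad * P * S / k_d,
  { rw [eq_div_iff hc3, mul_comm A k_d],
    exact hreaction.symm },
  -- eliminate θ, S₀ and K using their defining equations
  subst hθ, subst hS₀, subst hK,
  -- cross-multiply (both denominators are nonzero), substitute A, and normalize
  rw [div_eq_div_iff hc1 hc2, hA],
  ring,
end
```

Each nonzero hypothesis is consumed by exactly one step: hc3 when dividing by k_d, and hc1 and hc2 when clearing the two denominators, which is the sense in which Lean forces these constraints to be stated.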

Langmuir revisited: introducing functions and definitions in Lean
Functions in Lean are similar to functions in imperative programming languages like Python and C, in that they take in arguments and map them to outputs. However, functions in Lean (like everything in Lean) are also objects with properties that can be formally proved.
Formally, a function is defined as a mapping from one set (the domain) to another set (the co-domain). The notation for a function is given by the arrow "→". For instance, the function conventionally written as Y = f(X) or Y(X), which maps from set X to set Y, is written as X → Y in arrow notation.¶ Importantly, the arrow "→" is also used to represent the conditional statement (if-then) in logic, but this is not a duplication of syntax. Because everything is a term of a Type in Lean, functions map type X to type Y; when each type is a proposition, the resulting function is an if-then statement.
Fig. 1 … Window," while the right side shows variables and goals at each step in the "Tactic State". When the user places the cursor at one of the numbered locations in the "Code Window," VSCode displays the "Tactic State" of the proof. Lean allows the use of Unicode symbols, so we use "S₀" to represent the total concentration of adsorption sites without needing underscores. The turnstile symbol (⊢) represents the state of the goal after each step. As each tactic is applied, hypotheses and/or goals are updated in the tactic state as the proof proceeds. For clarity, we only show the hypothesis that changes after a tactic is applied and how that changes the goal. As an example, the goal state is the same in steps 1 and 2, since the first tactic rewrites (rw) the equations of adsorption (hrad) and desorption (hrd) into the premise that equilibrium (hreaction) exists. Next, we rewrite (rw), simplify (field_simp), and otherwise rearrange the variables to exactly equal the goal state (steps 3-5). When the proof is finished, a celebratory message and party emoji appear (6).
¶ These types are easily extended to functionals, which are central to density functional theory. A function that takes a function as an input can be defined by (ℝ → ℝ) → ℝ.
As stated in the introduction, Lean's power comes from the ability to define objects globally, not just postulate them for the purpose of a local proof. When a mathematical object is formally defined in Lean, multiple theorems can be written about it with certainty that all proofs pertain to the same object. In Lean, we use def to define new objects and then prove statements about these objects. The def command has three parts: the arguments it takes in (the properties of the object), the type of the output, and the proof that the object has such a type. In Lean:
For instance, we can define a function that doubles a natural number:
The λ symbol comes from lambda calculus and is how an explicit function is defined. After the lambda symbol is the variable of the function, n, with type ℕ. After the comma is the actual function. By hand, we would write this as f(n) = n + n. This function doubles any natural number, as the name suggests. We could use it, for example, to show:
In the previous section, we showed an easy-to-read derivation of Langmuir adsorption, and in ESI Section 5.2.1† we improved the proof using local definitions. Here, we improve it further by defining the Langmuir model as an object in Lean and then showing the kinetic derivation of that object. This way, the object defining the single-site Langmuir model can be reused in subsequent proofs, and all are certain to refer to the same object.
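The def syntax and the doubling function described above can be sketched in Lean 3 as follows. The name `double` and the closing example are illustrative assumptions; the listings originally shown at this point are not reproduced in this copy.

```lean
-- General shape: def name (arguments) : type_of_output := value_of_that_type

-- A function that doubles a natural number, written with an explicit λ.
def double : ℕ → ℕ := λ n : ℕ, n + n

-- One statement we can now prove about this object: doubling 2 gives 4.
-- Both sides compute to the same numeral, so reflexivity closes the goal.
example : double 2 = 4 := rfl
```

Because `double` is a global definition, any later theorem that mentions it is guaranteed to refer to this same function.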
We dene the model as a function that takes in pressure as a variable.Given a pressure value, the function will compute the fractional occupancy of the adsorption sites.In Lean, this looks like (https://atomslab.github.io/LeanChemicalTheories/adsorption/langmuir_kinetics.html#langmuir_single_site_model): The l symbol comes from l-calculus 58 and is one way to construct functions.It declares that P is a real number that can be specied.When the real number is specied, it will take the place of P in the equation.The denition also requires the equilibrium constant to be specied.kWith this, the kinetic derivation of Langmuir can be set up in Lean like this (https://atomslab.github.io/LeanChemical Theories/adsorption/langmuir_kinetics.html#langmuir_single_ site_kinetic_derivation): This derivation is almost exactly like the proof in ESI Section 5.2.1; the only difference is the use of the Langmuir model as an object.Aer the langmuir_single_site_model simplies to the Langmuir equation, the proof steps are the same.
Using the denition makes it possible to write multiple theorems about the same Langmuir object.We can also prove that the Langmuir expression has zero loading at zero pressure, and in the future we can show that it has a nite loading in the limit of innite pressure, and converges to Henry's Law in the limit of zero pressure (https://atomslab.github.io/LeanChemicalTheories/adsorption/langmuir_kinetics.html# langmuir_zero_loading_at_zero_pressure).Denitions and structures, as we will see in later sections, are crucial to building a web of interconnected scientic objects and theorems.

BET adsorption: formalizing a complex proof
Brunauer, Emmett, and Teller introduced the BET theory of multilayer adsorption (see Fig. 2) in 1938 [59]. We formalize this derivation, beginning with eqn (26) from the paper, which is shown here in eqn (7):
Here A is the total area adsorbed by all (infinite) layers, expressed as the sum of an infinite series:
and V, the total volume adsorbed, is given by:
The variables y, x and C are expressed in the original paper as shown through eqn (10)-(12):
x = P C_L, where C_L = e^(E_L/RT)/g (11)
where a_1, b_1, and g are fitted constants, E_1 is the heat of adsorption of the first layer, E_L is the heat of adsorption of the second (and higher) layers (the same as the heat of liquefaction of the adsorbate at constant temperature), R is the universal gas constant, and T is temperature. In eqn (10) and (11), everything besides the pressure term is constant, since we are dealing with an isotherm, so we group the constants together into one term. These constants, along with the surface area of the zeroth layer, given by s_0, the saturation pressure, and the three constraints, are defined using the constant declaration in Lean. Mathematical objects can also be defined in other ways, such as def, class, or structure [38], but for this proof we will use constant, which is convenient for such simple objects. We will illustrate later in our thermodynamics proof how constants can be merged into a Lean structure for reusability.
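Only eqn (11) survives in this copy of the text. The block below is a hedged reconstruction of eqn (7)-(12) from the prose definitions and from the form of eqn (26) in the 1938 BET paper; the grouped constant C_1 and the summation limits are our assumptions and should be checked against the published equations.

```latex
\begin{align}
\frac{V}{A V_0} &= \frac{C s_0 \sum_{i=1}^{\infty} i\, x^{i}}
                        {s_0 \left(1 + C \sum_{i=1}^{\infty} x^{i}\right)} \tag{7}\\
A &= \sum_{i=0}^{\infty} s_i \tag{8}\\
V &= V_0 \sum_{i=0}^{\infty} i\, s_i \tag{9}\\
y &= P C_1, \quad \text{where } C_1 = \frac{a_1}{b_1}\, e^{E_1/RT} \tag{10}\\
x &= P C_L, \quad \text{where } C_L = \frac{e^{E_L/RT}}{g} \tag{11}\\
C &= \frac{y}{x} = \frac{C_1}{C_L} \tag{12}
\end{align}
```

With s_i = C x^i s_0 for i ≥ 1, substituting eqn (8) and (9) into V/(A V_0) gives the right-hand side of eqn (7).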
In Lean, this is (https://atomslab.github.io/LeanChemicalTheories/adsorption/BETInfinite.html#C_L):
With these constant declarations, we can now define y, x, and C in Lean as (https://atomslab.github.io/LeanChemicalTheories/adsorption/BETInfinite.html#BET_first_layer_adsoprtion_rate):
Since y and x are both functions of pressure, their definitions require pressure as an input. Alternatively, the input can be omitted if we want to deal with x as a function rather than as a number. Notice that the symbols we declared using constant do not need to be supplied in the inputs, as they already exist in the global workspace.
We formalize eqn (7) by recognizing that the main math behind the BET expression is an infinite sequence that describes the surface area of adsorbed particles for each layer. The sequence is defined as a function that maps the natural numbers to the real numbers; the natural numbers represent the indexing. It is defined in two cases: if the index is zero, it outputs the surface area of the zeroth layer, and if the index is n + 1, it outputs x^(n+1) s_0 C.
In Lean, we define this sequence as (https://atomslab.github.io/LeanChemicalTheories/adsorption/BETInfinite.html#seq):
Here s_i is the surface area of the i-th layer, C and x are given by eqn (12) and eqn (11), respectively, and s_0 is the surface area of the zeroth layer. The zeroth layer is the base surface and is constant.
We now have the area and volume equations both in terms of geometric series with well-defined solutions. The BET equation is defined as the ratio of the volume adsorbed to the volume of a complete unimolecular layer, given by eqn (14).
The main transformation in BET is simplifying this ratio of series into a simple fraction, which involves solving the geometric series. The main math goal is given by eqn (15).
Before doing the full derivation, we prove eqn (15), which we call sequence_math. In Lean, this is (https://atomslab.github.io/LeanChemicalTheories/adsorption/BETInfinite.html#sequence_math):
Here θ is the fraction of the surface adsorbed, V is the total volume adsorbed, V_0 is the volume of a complete unimolecular layer adsorbed per unit area, s_i is the surface area of the i-th layer, s_0 is the surface area of the zeroth layer, and x and C are constants that relate the heats of adsorption of the molecule in the layers.
In Lean, the apostrophe after the sum symbol denotes an infinite sum, which is defined to start at zero since it is indexed by the natural numbers, which start at zero. Since the infinite sum of eqn (15) starts at one, we add one to all the indexes, k, so that when k is zero, we get one, and so on. We also define two new theorems that derive the solutions to these geometric series with an index starting at one. After expanding seq, we use those two theorems and then rearrange the goal to get two sides that are equal. We also use the tag lemma instead of theorem, just to communicate that it is a lower-priority theorem, intended to prove other theorems. The tag lemma has no functional difference from theorem in Lean; its purpose is for mathematicians to label proofs.
With this we can formalize the derivation of eqn (7). First we define eqn (7) as a new object and then prove a theorem showing that we can derive this object from the sequence. In Lean, the definition looks like this (https://atomslab.github.io/LeanChemicalTheories/adsorption/BETInfinite.html#brunauer_26):
Here, we explicitly define this as a function, because we want to deal with eqn (7) as a function of pressure rather than just a number. Now we can prove a theorem that formalizes the derivation of this equation (https://atomslab.github.io/LeanChemicalTheories/adsorption/BETInfinite.html#brunauer_26_from_seq):
Unlike the Langmuir proof introduced earlier in Fig. 1, the BET proof uses definitions, which allows them to be reused across the proof structure. The proof starts by showing that seq is summable. This means the sequence has some infinite sum, and the ∑' symbol is used to get the value of that infinite series.
We show in the proof that both seq and k*seq are summable, where the first is needed for the area sum and the second is needed for the volume sum. After that, we simplify our definitions, move the index of the sum from zero to one so we can simplify the sequence, and apply the BET.sequence_math lemma we proved above. Finally, we use the field_simp tactic to rearrange and close the goal. With that, we have formalized the derivation of eqn (7), just as Brunauer et al. did in 1938.
In the ESI,† we continue formalizing BET theory by deriving eqn (28) from Brunauer et al.'s paper, given by eqn (16). This follows from recognizing that 1/C_L = P_0. While Brunauer et al. attempt to show this in the paper, we discuss the trouble with implementing the logic they present. Instead, we show a similar proof that eqn (7) approaches infinity as pressure approaches 1/C_L, and assume as a premise in the derivation of eqn (16) that 1/C_L ≡ P_0.

Classical thermodynamics and gas laws: introducing Lean structures
Lean is so expressive because it enables relationships between mathematical objects. We can use this functionality to precisely define and relate scientific concepts with mathematical certainty. We illustrate this by formalizing proofs of gas laws in classical thermodynamics.
We can prove that a gas obeying the ideal gas law, PV = nRT, follows Boyle's Law, P_1 V_1 = P_2 V_2, in the style of our derivation of Langmuir's theory: demonstrating that a conjecture follows from the premises (https://atomslab.github.io/LeanChemicalTheories/thermodynamics/boyles_law.html). However, this proof style doesn't facilitate interoperability among proofs and limits the mathematics that can be expressed. In contrast, we can prove the same result more systematically by first formalizing the concepts of thermodynamic systems and states, extending that system to a specific ideal gas system, defining Boyle's Law in light of these thermodynamic states, and then proving that the ideal gas obeys Boyle's Law (see Fig. 3).
Classical thermodynamics describes the macroscopic properties of thermodynamic states and the relationships between them [60,61]. We formalize the concept of a "thermodynamic system" by defining a Lean structure called thermo_system over the real numbers, with thermodynamic properties (e.g., pressure, volume, etc.) defined as functions from a type to the real numbers, α → ℝ. Here, α is meant to represent a general indexing type. It could be the natural numbers if we wanted to use those to represent states of the system, the real numbers to represent time, or anything else. The only requirement is that α is nontrivial, meaning it has at least two different elements. In Lean, this is (https://atomslab.github.io/LeanChemicalTheories/thermodynamics/basic.html#thermo_system):
We define six descriptions of the system: isobaric (constant pressure); isochoric (constant volume); isothermal (constant temperature); adiabatic (constant energy); closed (constant mass); and isolated (constant mass and energy). Each of these conditions has the type Prop, or proposition, considering them to be assertions about the system. We formally define these by stating that, for all (∀) pairs of states n and m, the property at those states is equal. We define these six descriptions to take in a thermo_system, since we need to specify which system we are ascribing the property to. In Lean, this is (https://atomslab.github.io/LeanChemicalTheories/thermodynamics/basic.html#isobaric):
We define an isolated system as just closed and (∧) adiabatic, rather than using the universal quantifier (∀), since that would be redundant. Now that the basics of a thermodynamic system have been defined, we can define models that attempt to describe the system mathematically. These models can be defined as another structure, which extends the thermo_system structure. When a structure extends another structure, it inherits the properties of the structure it extends. This allows us to create a hierarchy of structures so we don't have to redefine properties repeatedly. The most well-known model is the ideal gas model, which comes with the ideal gas law equation of state. We define the ideal gas model to have two properties: the universal gas constant, R, and the ideal gas law. In the future, we plan to add more properties to the definition, especially as we expand on the idea of energy. We define the ideal gas law as an equation relating the product of pressure and volume to the product of temperature, amount of substance, and the gas constant. In Lean, this is (https://atomslab.github.io/LeanChemicalTheories/thermodynamics/basic.html#ideal_gas):
To define a system modeled as an ideal gas, we write in Lean: (M: ideal_gas ℝ). Now we have a system, M, modeled as an ideal gas.
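The structure hierarchy described above can be sketched in Lean 3 with mathlib as follows. The names `thermo_system`, `isobaric`, and `ideal_gas` appear in the cited URLs; the field names, the names `closed_system` and `isolated_system`, and the exact statement of the ideal gas law are assumptions made for illustration and may differ from the repository.

```lean
import data.real.basic

universe u

-- A thermodynamic system: each property is a function from an
-- indexing type α (states, time, ...) to the real numbers.
structure thermo_system (α : Type u) [nontrivial α] :=
(pressure : α → ℝ)
(volume : α → ℝ)
(temperature : α → ℝ)
(substance_amount : α → ℝ)
(energy : α → ℝ)

-- Descriptions of a system are propositions: the property agrees
-- at every pair of states n and m.
def isobaric {α : Type u} [nontrivial α] (M : thermo_system α) : Prop :=
∀ n m : α, M.pressure n = M.pressure m

def isothermal {α : Type u} [nontrivial α] (M : thermo_system α) : Prop :=
∀ n m : α, M.temperature n = M.temperature m

def adiabatic {α : Type u} [nontrivial α] (M : thermo_system α) : Prop :=
∀ n m : α, M.energy n = M.energy m

def closed_system {α : Type u} [nontrivial α] (M : thermo_system α) : Prop :=
∀ n m : α, M.substance_amount n = M.substance_amount m

-- Isolated is just the conjunction of two earlier descriptions.
def isolated_system {α : Type u} [nontrivial α] (M : thermo_system α) : Prop :=
closed_system M ∧ adiabatic M

-- The ideal gas model inherits every field of thermo_system and adds
-- the gas constant and the equation of state.
structure ideal_gas (α : Type u) [nontrivial α] extends thermo_system α :=
(R : ℝ)
(ideal_gas_law : ∀ n : α,
  pressure n * volume n = substance_amount n * R * temperature n)

-- A system M modeled as an ideal gas, indexed by the real numbers.
variable (M : ideal_gas ℝ)
```

Because `ideal_gas` extends `thermo_system`, any definition or theorem stated for a general `thermo_system` can be applied to the underlying system of an ideal gas without being restated.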
Boyle's law states that the pressure of an ideal gas is inversely proportional to the system's volume in an isothermal and closed system. 62This is mathematically given by eqn (17), where P is pressure, V is volume, and k is a constant whose value is dependent on the system.Fig. 3 Thermodynamic system in Lean.Here the thermo_system and ideal_gas are Lean structures that describe different kinds of thermodynamic systems like isobaric, isochoric, isothermal etc. using Lean definitions to proof theorems relating to the gas laws.
In Lean, we dene Boyle's law as (https:// atomslab.github.io/LeanChemicalTheories/thermodynamics/basic.html#boyles_law): We use the existential operator (d) on k, which can be read as there exists a k, because each system has a specic constant.We also dene the existential before the universal, so it is logically correct.Right now, it reads, there exists a k, such that for all states, this relation holds.If we write it the other way, it would say for all states, there exists a k, such that this relation holds.The second way means that k is dependent on the state of the system, which isn't true.The constant is the same for any state of a system.Also, even though Boyle's law is a statement about an ideal gas, we dene it as a general system so, in the future, we can look at what assumptions are needed for other models to obey Boyle's law.
Next, we prove a couple of theorems relating to the relations that can be derived from Boyle's law.From eqn (17), we can derive a relation between any two states, given by eqn (18), where n and m are two states of the system.
The rst theorem we prove shows how eqn (18) follows from eqn (17).In Lean this looks like (https://atomslab.github.io/LeanChemicalTheories/thermodynamics/ basic.html#boyles_law_relation): The right arrow can be read as implies, so the statement says that Boyle's law implies Boyle's relation.This is achieved using modus ponens, introducing two new names for the universal quantier, then rewriting Boyle's law into the goal by specializing Boyle's law with n and m.We also want to show that the inverse relation holds, such that eqn (18) implies eqn (17).In Lean, this is (https://atomslab.github.io/LeanChemicalTheories/thermodynamics/basic.html#boyles_law_relation'): We begin in the same way by using modus ponens and simplifying Boyle's law in the form of eqn (17).Next, we satisfy the existential by providing an old name.In our proof, we use P 1 V 1 as an old name for k, then we specialize the relation with n and 1 and close the goal.
Finally, with these two theorems, we show that Boyle's law can be derived from the ideal gas law under the assumption of an isothermal and closed system. The Lean statement is at (https://atomslab.github.io/LeanChemicalTheories/thermodynamics/basic.html#boyles_from_ideal_gas). This proof is completed by using the second theorem for Boyle's relation and simplifying the ideal gas relation using the two constraints (isothermal and closed system).
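A hedged sketch of this derivation in Lean 3 follows. All names (`thermo_system`, `isothermal`, `closed_system`, `ideal_gas_law`, and the gas constant `R` passed as a parameter) are assumptions made for illustration, and the proof supplies the constant directly rather than going through the second Boyle's-relation theorem as the paper does.

```lean
import data.real.basic

-- Hypothetical simplified system (an assumption of this sketch).
structure thermo_system :=
(pressure volume temperature substance_amount : ℕ → ℝ)

def boyles_law (M : thermo_system) : Prop :=
∃ k : ℝ, ∀ n : ℕ, M.pressure n * M.volume n = k

-- Temperature is the same in every pair of states.
def isothermal (M : thermo_system) : Prop :=
∀ n m : ℕ, M.temperature n = M.temperature m

-- Amount of substance is the same in every pair of states.
def closed_system (M : thermo_system) : Prop :=
∀ n m : ℕ, M.substance_amount n = M.substance_amount m

-- PV = nRT in every state.
def ideal_gas_law (M : thermo_system) (R : ℝ) : Prop :=
∀ n : ℕ, M.pressure n * M.volume n
          = M.substance_amount n * R * M.temperature n

theorem boyles_from_ideal_gas (M : thermo_system) (R : ℝ)
  (h : ideal_gas_law M R) (hT : isothermal M) (hN : closed_system M) :
  boyles_law M :=
begin
  unfold boyles_law,
  -- the constant is nRT evaluated at a reference state (state 1)
  refine ⟨M.substance_amount 1 * R * M.temperature 1, λ n, _⟩,
  rw [h n, hT n 1, hN n 1],
end
```

Dropping either `hT` or `hN` leaves a goal that the remaining rewrites cannot close, which is how Lean makes the isothermal and closed-system assumptions explicit.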
We have used this framework to prove both Charles' and Avogadro's laws (https://atomslab.github.io/LeanChemicalTheories/thermodynamics/basic.html), illustrating the interoperability of these proofs. In the future, we plan to define energy and prove theorems relating to it, including the laws of thermodynamics. 63

Kinematic equations: calculus in Lean
Calculus and differential equations are ubiquitous in chemical theory, and much has been formalized in mathlib. To illustrate Lean's calculus capabilities and motivate future formalization efforts, we formally prove that the kinematic equations follow from calculus-based definitions of motion, assuming constant acceleration. The analysis of physical equations of motion, particularly those based on Newtonian mechanics, is strongly related to the formulation of many theories in chemical physics, including reaction kinetics, 64 diffusion and transport phenomena, 65 and molecular dynamics. 66 These concepts are essential for understanding chemical reactions and how molecules move and interact.
The equations of motion are a set of two coupled differential equations that relate the position, velocity, and acceleration of an object in an n-dimensional vector space. 67 The differential equations are given by eqn (19) and (20), where x, v, and a represent position, velocity, and acceleration, respectively (bold typeface signifies a vector quantity). All three variables are parametric equations, where each dimension of the vector is a function of time.** As in the thermodynamics section, we can define a structure, motion, to encompass these concepts (Fig. 4). This structure defines three new elements: position, velocity, and acceleration, which are functions, and two differential equations relating these three functions. This structure also requires the vector space to form an inner product space, which is a real or complex vector space with an operator (the inner product) over the field. The inner product is a generalization of the dot product for any vector space.
** These proofs could also be constructed using partial differential equations, but mathlib doesn't currently have enough theorems for partial derivatives.
By requiring inner_product_space, the motion structure inherits all of inner_product_space's properties and allows us to access the calculus theorems in mathlib. The Lean definition is at (https://atomslab.github.io/LeanChemicalTheories/physics/kinematic_equations.html#motion). There, 𝕜 represents a field that we require to be either the real (ℝ) or complex (ℂ) numbers, and E symbolizes a general vector space. In mathematics, a field is an algebraic structure with addition, subtraction, multiplication, and division operations. Our vector space could be an n-dimensional Euclidean vector space, but we instead use a general vector space to be as general as possible. This allows us to describe motion in a Euclidean vector space, as well as a hyperbolic vector space or a vector space with special properties.
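A structure of this kind might be sketched in Lean 3 as follows. This is an approximation written from the prose description, not the linked source. The import paths and the `normed_add_comm_group` binder assume a late Lean 3 release of mathlib (older releases bundle the norm into `inner_product_space` and would omit that binder), and the field names `hvel` and `hacc` are hypothetical.

```lean
import analysis.inner_product_space.basic
import analysis.calculus.mean_value  -- brings in `deriv`

-- Motion over a field 𝕜 (ℝ or ℂ) with values in an inner product space E.
structure motion (𝕜 : Type*) (E : Type*)
  [is_R_or_C 𝕜] [normed_add_comm_group E] [inner_product_space 𝕜 E] :=
(position velocity acceleration : 𝕜 → E)
(hvel : velocity = deriv position)       -- v = dx/dt
(hacc : acceleration = deriv velocity)   -- a = dv/dt

variables {𝕜 E : Type*}
  [is_R_or_C 𝕜] [normed_add_comm_group E] [inner_product_space 𝕜 E]

-- The two differential equations chain together: a = d²x/dt².
example (M : motion 𝕜 E) :
  M.acceleration = deriv (deriv M.position) :=
by rw [M.hacc, M.hvel]
```

Keeping the two differential equations as fields of the structure means every theorem that takes a `motion` as a premise can rewrite with them directly, as the short example does.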
In Lean, if a function is not differentiable at a point, the derivative at that point returns zero.†† During our first formalization attempt, we tried to define a function to be constant by setting its derivative to zero. However, df/dx = 0 may also arise if a function is not differentiable at that point. To avoid this edge case, we define another structure that requires the equations of motion to be n-times continuously differentiable everywhere. We only require the equations to be n-times differentiable, rather than infinitely differentiable, for generality; a theorem can still instantiate this structure and assume infinite differentiability. We also declare this as a separate structure, instead of in the motion structure, to allow future proofs that require the equations to be n-times continuously differentiable on a set or an interval rather than everywhere (e.g., a molecular mechanics force field with a non-smoothed cutoff is not differentiable at the cutoff). That way, depending on the theorem, the user can choose the appropriate extension. The Lean structure is at (https://atomslab.github.io/LeanChemicalTheories/physics/kinematic_equations.html#motion_cont_diff_everywhere). Its field contdiff states that for all n, defined as a natural number including positive infinity, and for all m, defined as a natural number, if m is less than n, then the m-th derivative of position is n-times continuously differentiable.
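The convention that motivates this extra structure can be stated as a one-line Lean 3 example. The lemma name `deriv_zero_of_not_differentiable_at` is from mathlib; the import path assumes a Lean 3 mathlib release in which `analysis.calculus.mean_value` is available.

```lean
import analysis.calculus.mean_value

-- mathlib's `deriv` is a total function: where `f` is not differentiable,
-- it returns 0 rather than being undefined.
example (f : ℝ → ℝ) (x : ℝ) (h : ¬ differentiable_at ℝ f x) :
  deriv f x = 0 :=
deriv_zero_of_not_differentiable_at h
```

Because of this, a hypothesis of the form `deriv f x = 0` alone does not tell us that `f` is locally constant; a differentiability hypothesis has to accompany it.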
When acceleration is constant, this set of differential equations has four useful analytical solutions, the kinematic equations, eqn (21)-(24), where the subscript naught denotes variables evaluated at t = 0.
Fig. 4 Kinematics in Lean. Here we define motion as a Lean structure that represents the relation between position, velocity, and acceleration through differential equations that are proved using definitions of derivative functions.

†† Likewise, division by zero is defined to return zero instead of something like "undefined" or "NaN". In Lean and other theorem provers, the symbol / is not used for mathematical division but instead points to a function called real.div. This function returns x/y if y is not zero, and 0 if y equals 0. Another case is the real square root function (real.sqrt), which outputs a real number for any input, even negatives, since it is defined as ℝ → ℝ. These conventions may be unfamiliar to scientists and engineers, but they are used for convenience and won't lead to contradictions in a proof. Any invalid step in a proof involving these conventions will be caught when the user invokes a theorem whose hypotheses do not hold in that case. We wrestled with this convention for some time before finding clarity in this blog post archived for ref. 68.
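These conventions can be checked directly in Lean 3. The sketch below uses mathlib lemmas (`div_zero`, `real.sqrt_eq_zero_of_nonpos`, `div_self`); the specific numbers are arbitrary choices for illustration.

```lean
import data.real.sqrt

-- Division by zero returns zero by convention.
example : (1 : ℝ) / 0 = 0 := div_zero 1

-- The real square root of a non-positive number is zero by convention.
example (x : ℝ) (hx : x ≤ 0) : real.sqrt x = 0 :=
real.sqrt_eq_zero_of_nonpos hx

-- The lemma that cancels y / y demands a proof that y ≠ 0,
-- so the convention cannot be used to conclude 0 / 0 = 1.
example (y : ℝ) (hy : y ≠ 0) : y / y = 1 := div_self hy
```

The third example shows where the safeguard sits: the hypothesis `hy` is part of the lemma's statement, so a proof that silently divides by zero gets stuck at exactly that step.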
Under the assumption of one-dimensional motion, these equations simplify to the familiar introductory kinematic equations. Eqn (24), also known as the Torricelli equation, uses the square as shorthand for the dot product, $v^2(t) \equiv \mathbf{v}(t) \cdot \mathbf{v}(t)$.
With this, we can now begin deriving the four kinematic equations. The first three derivations, for eqn (21)-(23), all use the same premises. The first line of the premises contains four premises that declare the field and vector space the motion space is defined on. The next line defines a motion space, M. The third line contains two premises: a variable, A, which represents the value of constant acceleration, and n, the number of times position can be differentiated. When applying these theorems, the top element (⊤), which means positive infinity in Lean, can be used to specify n. The final line is a premise that assumes acceleration is constant. The lambda function is constant because A is not a function of t, so for any value of t, the function outputs the same value, A. The three kinematic equations in Lean are at (https://atomslab.github.io/LeanChemicalTheories/physics/kinematic_equations.html#const_accel); the premises are omitted from each statement since they have already been given above.
The • symbol indicates scalar multiplication, such as when a vector is multiplied by a scalar. We normally use the · symbol for the dot product, but Lean uses the inner function for the dot product. Also, "velocity 0" means the velocity function evaluated at 0. Lean uses parentheses for orders of operations, not for function inputs, so f(x) in normal notation converts to f x in Lean. The proofs of the first two theorems use the two differential equations from the motion structure and the antiderivative, whose formalization we explain in the ESI† (these theorems weren't available in mathlib at the time of writing, so we proved them ourselves). The third theorem is proved by rearranging the previous two theorems.
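The "rearranging" step can be illustrated in the one-dimensional real case, where it reduces to algebra. In the Lean 3 sketch below, the hypotheses `hv` and `hx` take the textbook one-dimensional forms v(t) = At + v₀ and x(t) = At²/2 + v₀t + x₀ as given (an assumption of this example, which works with plain real-valued functions rather than the paper's motion structure), and the conclusions are the average-velocity form and the Torricelli equation.

```lean
import data.real.basic
import tactic

-- Average-velocity form, obtained by rearranging the two hypotheses.
example (x v : ℝ → ℝ) (A : ℝ)
  (hv : ∀ t, v t = A * t + v 0)
  (hx : ∀ t, x t = A * t ^ 2 / 2 + v 0 * t + x 0) :
  ∀ t, x t = (v t + v 0) / 2 * t + x 0 :=
begin
  intro t,
  rw [hx t, hv t],  -- substitute both solutions at time t
  ring,             -- the remaining identity is pure algebra
end

-- Torricelli equation in one real dimension.
example (x v : ℝ → ℝ) (A : ℝ)
  (hv : ∀ t, v t = A * t + v 0)
  (hx : ∀ t, x t = A * t ^ 2 / 2 + v 0 * t + x 0) :
  ∀ t, (v t) ^ 2 = (v 0) ^ 2 + 2 * A * (x t - x 0) :=
begin
  intro t,
  rw [hx t, hv t],
  ring,
end
```

Note the Lean notation discussed above: `v 0` is the velocity function applied to 0, and parentheses only group terms. Over ℝ no conjugates appear, so `ring` closes both goals.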
Because we declared the eld is_R_or_C, the above proofs hold for both real and complex time.However, we were unable to prove eqn (24), due to the complex conjugate that arises when simplifying the proof.Eqn (24) uses the inner product, a function that takes in two vectors from a vector space and outputs a scalar.If the vector space is a Euclidean vector space, this is just the dot product.The inner product is semi-linear, linear in its rst argument, eqn (25), but sesquilinear in its second argument, eqn (26).
hax + by,zi = ahx,zi + bhy,zi hx,ay The bar denotes the complex conjugate: for a complex number, g = a + bi, the complex conjugate is: For the proof of eqn (24), we get to a form where one of the inner products has an addition in the second term that we have to break up, and no matter which way we rewrite the proof line, one of the inner products ends up with addition in the second term.To proceed, we instead dened the nal kinematic equation to hold only for real time.In Lean, this looks like (https://atomslab.github.io/LeanChemicalTheories/physics/kinematic_equations.html# real_const_accel'''): While we haven't proved that eqn (24) doesn't hold for complex time, we encountered difficulties and contradictions when attempting to prove the complex case.Thus, eqn (24) currently only holds for real time.
An imaginary-time framework can be used to derive equations of motion from non-standard Lagrangians 69,70 to examine hidden properties in classical and quantum dynamical systems in the future.By exploring these proofs in both real and complex time, we illustrate how a proof in one case can be adapted for related cases.Here, four proofs for real numbers can be easily extended to complex numbers by changing the type declared up front, and the validity of the proofs in the more general context is immediately apparent.

Conclusions and outlook
In this paper, we demonstrate how interactive theorem proving can be used to formally verify the mathematics in science and engineering. We found that, although formalization is slower and more challenging than writing derivations by hand, our resulting proofs are more rigorous and complete. We observed that in some cases, translating scientific statements into formal language revealed hidden assumptions behind the mathematical derivations. For example, we make explicit common implicit assumptions, such as that the denominator must not be zero when we deal with division. Furthermore, in a more abstract way, we have attempted to reveal the formal definitions behind equations, such as exactly how pressure is defined as a function or the assumptions of differentiability needed for kinematics. All of these are a result of formalizing these theorems. We concur with others who have discussed the limitations of hand-written proofs and their reliability; 2,71,72 formalized proofs can provide greater assurance and robustness.
Importantly, we emphasize that while our proofs are verified to be mathematically correct, this verification does not extend to the external world. This distinction between syntax (logical relationships among words and arguments in a language) and semantics (whether words are meaningful or arguments are true, according to external reality) in scientific reasoning has been emphasized by logicians such as Alfred Tarski 73 and Rudolf Carnap. 74,75 For scientists and engineers, whether a theory is true or meaningful is first and foremost about whether observational data support it; logical correctness of the derivation is required, of course, but this is typically assumed. Indeed, when one of us described our BET proof to an experimentalist in adsorption, their reply was, "but BET isn't accurate." They knew that BET theory does not semantically match experiments in many contexts (in fact, much literature has discussed when BET analysis should not be applied, for instance ref. 76). BET theory has been a useful conceptual model for the field, but nonetheless relies on approximations that often drift far from reality. In this work, we only claim to rigorously establish the syntax of the theories we describe. Nonetheless, Lean operating with input/output functions can receive data from the external world, which may open possibilities for semantically grounding its logical conclusions in certain contexts, as well.
The Lean theorem prover is especially powerful, as it facilitates the re-use of theorems and the construction of higher-level mathematical objects from lower-level ones. We showed how this feature can be leveraged in science proofs; after a fundamental theory is formally verified, it can then be used in the development of other theories. This can be approached in two ways: definitions can be directly reused in subsequent proofs, and structures can enable hierarchies of related concepts, from general to more specific. Thus we have not just proved a few theorems about scientific objects but have begun to create an interconnected structure of formally verified proofs relating fields of science.
While learning Lean and writing the proofs appearing here, we routinely asked ourselves, "How do I close this goal? I wish there was a way to automate this." In fact, the first vision for computer-assisted proofs in the 1950s and 60s was to automate the process fully; 77 interactive theorem provers that "merely" check human-written proofs didn't appear until later. But historically, automated theorem provers (ATPs) made progress on narrow classes of problems (e.g., problems in first-order logic 78) but couldn't address proofs in advanced math (except when problems can be described in simple terms, like the Robbins Conjecture 79). In short, theorem proving is like searching for a path from premises to conjecture, but in a realm with an "infinite action space"; 48 traditional algorithms have been inadequate. For complex proofs, interactive theorem provers (ITPs) have been more successful, because they facilitate human creativity in writing proofs, while leveraging the rigor of the computer for checking them and providing feedback to the user. Modern ITPs also use the computer for small-scale automation via tactics; the human provides strategy while the computer executes tactics. Complex tactics sometimes blur the line between automated and interactive theorem proving. For example, Isabelle (an ITP) has the Sledgehammer tactic, 80 which takes the current proof state and attempts to transform it into an equivalent problem in first-order logic, which can then be efficiently solved using an ATP.
Recent approaches have leveraged machine learning to expand the capabilities of automated theorem proving. Theorem proving can be framed as a reinforcement learning problem, 81,82 in which an agent learns an effective theorem-proving policy via rewards from successfully proving theorems. "Autoformalization" refers to the translation of informal proofs into formal proofs, akin to translating text from one language to another (but with extremely strict requirements on the formal side). 83 Theorem proving can also be framed as a next-word-prediction problem ("auto-complete" for math proofs) in which a database of formal math proofs is used to train a language model to predict the next word in the proof. Large language models (LLMs) like ChatGPT 84,85 have some emergent reasoning abilities 86 but often make mistakes and cannot be trusted. By connecting language models with ITPs to provide feedback, training them on proof databases like mathlib, and deploying them as part of traditional search algorithms, progress has been made toward automating proofs in Lean, 47,87,88 even to the point of generating correct solutions to International Math Olympiad problems. 89
This interplay between creative but unreliable generative algorithms and the strict logic of a proof-checking system may be a model for future AI-driven discovery in science, especially for discovering new theories. An early example of this is AI-Descartes, in which a symbolic regression algorithm generates equations to match experimental data, which is then combined with an automated theorem prover to establish the equations' "derivability" with respect to a scientific theory. 33 However, in this work, each theory required human expertise to be expressed in formal language, and reliance on an automated theorem prover limited the scope of theories to those expressible in first-order logic. AI tools that can autoformalize the informal scientific literature, generate novel theories, and auto-complete complex proofs could open new avenues for automating theory discovery. LLMs have demonstrated capabilities in solving chemistry problems, 90,91 as well as answering scientific question-and-answer problems invoking quantitative reasoning. 92 However, LLMs are unreliable: they famously "hallucinate" (generate falsehoods) and are biased or unreliable evaluators of their own outputs. 93,94 Pairing them with external tools 95-97 improves their capabilities; theorem provers could play a role like that. How will these models be trained? We suggest two avenues: training on human-written databases of formal proofs in science and engineering (which are yet to be written) and leveraging interactive feedback from Lean through tools like LeanDojo. 88,100
Our next goals are to continue building out classical thermodynamics, formalize statistical mechanics, and eventually construct proofs relating the two fields. We are also interested in laying the foundations for classical mechanics in Lean and formalizing more difficult proofs like Noether's theorem 101 (a basis for deriving conservation laws) or establishing the 2nd law of thermodynamics axiomatically. 35
The proofs in this paper were written in Lean 3, 72 because the extensive mathlib library was only available in Lean 3 when we began. While Lean 3 was designed for theorem proving and management of large-scale proof libraries, the new version, Lean 4, 102 is a functional programming language for writing proofs and programs, as well as proofs about programs. 102,103 The Lean community has finished porting mathlib to Lean 4, and Lean 3 is now deprecated; we recommend that future proofs be written in Lean 4, which is more capable, versatile, and easy to use compared to Lean 3. With Lean 4, we are bridging formally-correct proofs with executable functions for bug-free scientific computing; we will be elaborating on that in future work.
We hope these expository proofs in adsorption, thermodynamics, and kinematics will inspire others to consider what proofs and derivations could be formalized in their fields of expertise. Virtually all mathematical concepts can be established using dependent type theory; the density functionals, partial derivatives, N-dimensional integrals, and random variables appearing in our favorite theories should be expressible in Lean. Just as an ever-growing online community of mathematicians and computer scientists is building mathlib, 40 we anticipate a similar group of scientists building a library of formally-verified scientific theories and engineering mathematics. To join, start learning Lean, join the online community, and see what we can prove!

Fig. 1 A formalization of Langmuir's adsorption model, shown as screenshots from Lean operating in VSCode. The left side of the figure shows the "Code Window," while the right side shows variables and goals at each step in the "Tactic State". When the user places the cursor at one of the numbered locations in the "Code Window," VSCode displays the "Tactic State" of the proof. Lean allows the use of Unicode symbols, so we use "S₀" to represent the total concentration of adsorption sites without needing underscores. The turnstile symbol represents the state of the goal after each step. As each tactic is applied, hypotheses and/or goals are updated in the tactic state as the proof proceeds. For clarity, we only show the hypothesis that changes after a tactic is applied and how that changes the goal. As an example, the goal state is the same in steps 1 and 2 since the first tactic rewrites (rw) the equation of adsorption (hrad) and desorption (hrd) into the premise that equilibrium (hreaction) exists. Next, we rewrite (rw), simplify (field_simp), and otherwise rearrange the variables to exactly equal the goal state (steps 3-5). When the proof is finished, a celebratory message and party emoji appear (6).

Fig. 2 Langmuir model vs. BET model. The BET model, unlike Langmuir, allows particles to create infinite layers on top of previously adsorbed particles. Here θ is the fraction of the surface adsorbed, V is the total volume adsorbed, V₀ is the volume of a complete unimolecular layer adsorbed in unit area, sᵢ is the surface area of the i-th layer, s₀ is the surface area of the zeroth layer, and x and C are constants that relate the heats of adsorption of the molecule in layers.

Table 1 Comparison of hand-written and formalized proofs

Table 3 Glossary of mathematical terms and symbols

| Term | Definition | Example |
| --- | --- | --- |
| Axiom | A self-evident truth which is assumed to be true and doesn't require proof | Two sets are equal if they have the same elements; this is not proved, it is assumed |
| Theorem | A proposition or statement in math that can be demonstrated to be true by accepted mathematical operations and arguments | The Pythagorean theorem, a² + b² = c² for all right triangles |
| Hypothesis | A statement assumed to be true that the proof follows from. It can also be thought of as the conditions or prerequisites for the theorem to hold. We emphasize that the math community uses "hypothesis" somewhat differently than the scientific community | If all four sides of a rectangle have the same length, it is a square. The hypotheses would be "the shape is a rectangle (has four sides and four equal, right angles)" and "all sides have the same length" |
| Conjecture | A statement which is proposed to be true, but no proof has been found yet | Goldbach's conjecture: every even number greater than 2 is the sum of two prime numbers. This hasn't been proven true or false yet |
| Proof | A sequence of logical steps which conclude that a statement is true from its hypotheses | The proof of the Pythagorean theorem using geometry |
| Function | An expression that defines a relation between a set of inputs and a set of outputs | f(x) = x² relates (or maps) the set of real numbers x to their square |
| Tactic | A command used to construct or manipulate proofs. Tactics in Lean provide a way to automate certain proof steps or apply predefined proof strategies to make the process of constructing formal proofs more efficient and convenient | rw is "rewrite", a simple tactic that performs substitution for equalities. ring is a more complex tactic for automatically closing goals requiring numerous algebraic operations, without the user specifying all the steps |
| Type | A type can be thought of as a set, or category, that contains terms. In other programming languages, types define the category of data certain objects have (e.g. floats, strings, integers). Types in Lean work this way, too, and have more features: they can depend on values, as well as be the subject of proofs | The natural numbers are a type. The booleans (true and false) are also a type. Functions from integers to reals are also a type |
| Term | Terms are members of a type | Considering the type of natural numbers, then numbers like 1, 2, 3, and 8 are terms of that type |
| ℕ | Symbol for the set of natural numbers | The numbers 0, 1, 2, 3, 4, … |
| ℤ | Symbol for the set of integers | The numbers …, −3, −2, −1, 0, 1, 2, … |
| ℚ | Symbol for the set of rational numbers | Numbers that include, … |
| ℝ | Symbol for the set of real numbers | −1, 3.6, Euler's number, π, √2, etc. |
| ℂ | Symbol for the set of complex numbers | −1, 5 + 2i, √2 + 5i, etc. |
| ∀ | Logical symbol for "for all" | |
| ∃ | Logical symbol for "there exists" | |