DFT Methods

Why DFT Outperforms ML Force Fields for Transition State Location

Preethi Sundaram May 15, 2026

Machine learning interatomic potentials (MLIPs) have made genuine progress on molecular dynamics timescales and conformational sampling. The literature on MACE-MP-0, ANI-2x, and similar general-purpose models now includes impressive coverage of equilibrium properties and thermodynamic stability predictions. But transition state location is a categorically different problem, and the performance gap versus DFT is wider than many practitioners expect when they first consider using MLIPs to accelerate their TS searches.

The underlying issue is not model quality; it is the training data distribution. MLIPs learn from datasets of molecular configurations that are predominantly near-equilibrium structures, plus some off-equilibrium configurations sampled by molecular dynamics or active learning. The saddle point region of a potential energy surface is, almost by definition, underrepresented in typical training sets. A model that reproduces the energies of equilibrium geometries to within 1 kcal/mol can have energy errors of 5–15 kcal/mol at the TS geometry, without any indication from the optimization that something is wrong.

Benchmark Setup: 60 Elementary Steps Across Three Reaction Classes

We selected 60 elementary reaction steps to test this directly:

Hydrogen transfer (20 reactions): intramolecular 1,5-H shifts, proton-coupled electron transfer models, [1,3] and [1,5] sigmatropic shifts in organic systems. All C/H/N/O. This is the most favorable terrain for MLIPs: organic and well covered by ANI training data.
Carbon–carbon bond forming (20 reactions): Diels-Alder cycloadditions (6), aldol condensation steps (4), radical C–C coupling (4), Mannich-type reactions (6). Organic systems with nitrogen heteroatoms present.
Oxidative addition to Pd(0) (20 reactions): ArX (X = Cl, Br, OTf) to Pd(0)L₂ with PPh₃ and NHC ligands. Transition metal-containing systems — the hardest case for general-purpose MLIPs.

Reference values: DLPNO-CCSD(T)/cc-pVTZ single points on DFT-optimized (ωB97X-D/def2-TZVP) TS geometries. MLIPs tested: MACE-MP-0 (general-purpose universal potential), CHGNet (materials-focused universal potential), and ANI-2x (organic molecule specialist, C/H/N/O/S/F/Cl coverage). NEB-DFT runs used B3LYP-D4/6-311++G(d,p) with CI-NEB (16 images, spring constant 0.1 Eh/Å²), followed by TS optimization and IRC confirmation.

Results: Where Each Approach Stands

NEB-DFT barrier heights versus DLPNO-CCSD(T) reference:

Overall MAE: 1.4 kcal/mol (B3LYP-D4 level); 0.9 kcal/mol (ωB97X-D level)
False-positive TS localizations (wrong saddle point confirmed by IRC): 0 of 60
Initial convergence failures: 3 of 60 (all in the Pd oxidative addition set; two required a higher image count, and one needed IDPP interpolation to avoid a steric clash in the initial path)

MACE-MP-0 on the same 60 reactions:

Overall MAE: 4.2 kcal/mol
Maximum single-reaction error: 13.7 kcal/mol (oxidative addition of 4-NO₂-PhCl to Pd(0)(IPr))
False saddle points: 11 of 60 — the optimizer converged to a saddle point on the MLIP surface that did not correspond to the correct TS on the DFT surface
Complete failures on Pd reactions: 7 of 20 (the model's training set contains no Pd configurations near transition states)
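For readers who want to reproduce this kind of summary on their own reactions, here is a minimal sketch of how an MAE and a maximum single-reaction error can be computed from predicted and reference barrier heights. The function name, the reaction labels, and the numbers are placeholders for illustration, not values from this benchmark.

```python
from statistics import mean

def barrier_error_stats(predicted, reference):
    """Return (MAE, max abs error, label of worst reaction), all in kcal/mol."""
    if predicted.keys() != reference.keys():
        raise ValueError("predicted and reference must cover the same reactions")
    abs_err = {name: abs(predicted[name] - reference[name]) for name in reference}
    worst = max(abs_err, key=abs_err.get)
    return mean(abs_err.values()), abs_err[worst], worst

# Placeholder barrier heights (kcal/mol); NOT data from the benchmark.
reference = {"rxn_a": 20.0, "rxn_b": 15.0, "rxn_c": 30.0}
predicted = {"rxn_a": 21.0, "rxn_b": 13.0, "rxn_c": 36.0}

mae, max_err, worst = barrier_error_stats(predicted, reference)
print(f"MAE = {mae:.1f} kcal/mol, max = {max_err:.1f} kcal/mol ({worst})")
```

Reporting the worst case next to the MAE matters here because a modest average can hide a single barrier that is off by more than 10 kcal/mol.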

ANI-2x results:

Hydrogen transfer subset: MAE 2.1 kcal/mol (best MLIP performance in the test)
C–C bond forming subset: MAE 3.4 kcal/mol
Pd oxidative addition: 0 of 20 converged to a valid TS — no Pd support in ANI-2x
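A cheap pre-flight check can catch the unsupported-element case before any MLIP job is submitted. The sketch below assumes the ANI-2x element set quoted in the benchmark setup; the helper name and the example system are hypothetical.

```python
# Element coverage for ANI-2x as listed in the benchmark setup above.
ANI2X_ELEMENTS = frozenset({"H", "C", "N", "O", "F", "S", "Cl"})

def unsupported_elements(symbols, supported=ANI2X_ELEMENTS):
    """Return the sorted list of elements in the system that the model does not cover."""
    return sorted(set(symbols) - supported)

# Hypothetical element list for a Pd(0)/phosphine/aryl bromide system.
system = ["Pd", "P", "C", "H", "Br"]
missing = unsupported_elements(system)
if missing:
    print("Do not run this MLIP; unsupported elements:", ", ".join(missing))
```

Passing this check is necessary but not sufficient. MACE-MP-0 nominally covers Pd and still failed on 7 of the 20 oxidative additions, because element coverage says nothing about coverage of transition state geometries.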

The False-Saddle-Point Problem

The 11 cases where MACE-MP-0 converged to a wrong saddle point are the most practically dangerous. In each case, the NEB path on the MLIP surface found an energy maximum that had the right topology (one negative curvature mode) but corresponded to a geometry distortion unrelated to the intended reaction — a ring flip, a ligand rotation, or a frustrated steric clash. Without an IRC calculation to confirm which reactant and product the TS connects, these false positives would look valid.

DFT NEB with IRC confirmation cannot produce this failure mode, because the IRC explicitly traces the steepest descent path from the TS to both endpoints and confirms connectivity. If the connectivity doesn't match your intended reaction, the calculation is flagged as failed rather than silently producing a wrong answer.
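The connectivity comparison itself is simple to automate. What follows is a rough sketch, not the check used in this benchmark: it builds a crude bond graph from interatomic distances (covalent radii from Cordero et al. for a few light elements, with an assumed 1.3 scale factor) and asks whether the two IRC endpoints reproduce the bond graphs of the intended reactant and product. All function names and the three-atom toy geometries are hypothetical.

```python
from itertools import combinations
from math import dist

# Covalent radii in angstrom for a few light elements (Cordero et al. 2008).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}

def bond_graph(symbols, coords, scale=1.3):
    """Atom-index pairs closer than scale * (r_i + r_j); a crude bond graph."""
    bonds = set()
    for i, j in combinations(range(len(symbols)), 2):
        cutoff = scale * (COVALENT_RADII[symbols[i]] + COVALENT_RADII[symbols[j]])
        if dist(coords[i], coords[j]) < cutoff:
            bonds.add((i, j))
    return frozenset(bonds)

def irc_matches_intended(symbols, irc_ends, reactant, product):
    """True if the two IRC endpoints have the bond graphs of reactant and product."""
    found = {bond_graph(symbols, geom) for geom in irc_ends}
    wanted = {bond_graph(symbols, reactant), bond_graph(symbols, product)}
    return found == wanted

# Toy proton transfer O-H...O -> O...H-O along x (coordinates in angstrom).
symbols = ["O", "H", "O"]
reactant = [(0.0, 0.0, 0.0), (0.97, 0.0, 0.0), (2.60, 0.0, 0.0)]
product = [(0.0, 0.0, 0.0), (1.63, 0.0, 0.0), (2.60, 0.0, 0.0)]

# IRC endpoints that reach both sides of the transfer.
good_irc = ([(0.0, 0.0, 0.0), (1.00, 0.0, 0.0), (2.60, 0.0, 0.0)],
            [(0.0, 0.0, 0.0), (1.60, 0.0, 0.0), (2.60, 0.0, 0.0)])
# IRC endpoints that both relax back to the reactant (a false saddle point).
bad_irc = ([(0.0, 0.0, 0.0), (1.00, 0.0, 0.0), (2.60, 0.0, 0.0)],
           [(0.0, 0.0, 0.0), (0.95, 0.10, 0.0), (2.60, 0.0, 0.0)])

print(irc_matches_intended(symbols, good_irc, reactant, product))
print(irc_matches_intended(symbols, bad_irc, reactant, product))
```

A bond-graph test only detects changes in connectivity. Reactions whose endpoints share the same bond graph, such as ring flips or ligand rotations, would need a geometric comparison (for example an aligned RMSD) instead.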

When MLIPs Are Appropriate for Reaction Chemistry

We're not saying MLIPs have no place in reaction pathway work. The distinction is task-specific:

Conformational sampling of flexible reactants/products: MLIPs are efficient for exploring the conformational space of large organic molecules to identify the lowest-energy reactant geometry before the DFT NEB calculation. This can reduce computational cost by avoiding DFT-level conformational searches.
Initial path interpolation: Using a fast MLIP to generate a plausible initial path guess for DFT NEB can reduce the number of DFT NEB iterations needed to reach convergence. MACE-MP-0 as a path-generator (not a TS-optimizer) is a reasonable use case.
Property screening for pure organic structures away from TS geometries: For large libraries of organic molecules where you want approximate HOMO-LUMO gaps, ionization potentials, or dipole moments for screening, ML models trained on those properties (a standard energy-and-force MLIP does not predict them directly) can give useful relative rankings at far lower cost than DFT.
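To make the path-generation step concrete, here is a minimal sketch of the crudest possible starting point: a linear interpolation between reactant and product, plus a check for images where two atoms come unphysically close. This is plain linear interpolation, not IDPP and not an MLIP-relaxed path; it is the kind of raw guess that those methods would then improve. The function names, the 0.7 Å clash threshold, the reading of "16 images" as interior images, and the two-atom toy system are all assumptions for illustration.

```python
from itertools import combinations
from math import dist

def linear_path(start, end, n_images=16):
    """Return n_images interior geometries linearly interpolated between start and end."""
    path = []
    for k in range(1, n_images + 1):
        t = k / (n_images + 1)  # fraction of the way from start to end
        image = [tuple(a + t * (b - a) for a, b in zip(p, q))
                 for p, q in zip(start, end)]
        path.append(image)
    return path

def min_distance(geometry):
    """Shortest interatomic distance in one geometry."""
    return min(dist(p, q) for p, q in combinations(geometry, 2))

def clashing_images(path, threshold=0.7):
    """Indices of images in which some atom pair is closer than threshold (angstrom)."""
    return [k for k, image in enumerate(path) if min_distance(image) < threshold]

# Toy system: atom 1 moves past a stationary atom 0, nearly through it.
start = [(0.0, 0.0, 0.0), (-1.5, 0.1, 0.0)]
end = [(0.0, 0.0, 0.0), (1.5, 0.1, 0.0)]

path = linear_path(start, end)
clashes = clashing_images(path)
print(f"{len(clashes)} of {len(path)} images have a steric clash: {clashes}")
```

When the clash list is non-empty, the straight-line guess is a poor seed for DFT NEB, and a better interpolation (IDPP or an MLIP-relaxed band) is worth the few extra minutes.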

The boundary: once you're within 5 Å of the TS geometry in a NEB path, continue with DFT. Don't attempt to locate or optimize the saddle point itself with a general-purpose MLIP unless you have a specialized fine-tuned model for your exact reaction class with validated TS configurations in the training set.

Cost Comparison at Practical Scale

For a 60-atom organometallic system (Pd(0)(PPh₃)₂ + ArBr) on a single 8-core node:

MACE-MP-0 NEB (16 images): ~4 minutes wall time
DFT NEB at B3LYP-D4/6-311++G(d,p): ~6–9 hours wall time
Cost ratio: ~90–135×
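The ratio follows directly from the two wall times above. As a quick check, with a hypothetical helper name:

```python
def cost_ratio(fast_minutes, slow_hours):
    """Wall-time ratio of the slow method to the fast one."""
    return slow_hours * 60 / fast_minutes

low = cost_ratio(fast_minutes=4, slow_hours=6)
high = cost_ratio(fast_minutes=4, slow_hours=9)
print(f"DFT NEB costs roughly {low:.0f} to {high:.0f} times the MLIP NEB wall time")
```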

That cost ratio is real and significant. But a false TS answer found in 4 minutes that sends a synthesis team down the wrong pathway costs far more than the 8 hours of DFT compute time. For catalyst design decisions that will be followed by experimental validation, the 8 hours is the right investment. For pure conformational screening or property ranking where the property of interest is well within the training distribution of the MLIP, the fast path is reasonable.

The decision criterion is not speed — it's whether the accuracy requirement for your specific output is met by the tool you're using.

MLIPs are well-suited for conformational search, force-field geometry pre-optimization, and molecular dynamics equilibration. They are not appropriate for single-structure activation energy calculations where errors directly affect how synthesis routes are ranked, or for reaction mechanisms involving transition metals outside the training distribution.