One of our hopes for strong AI is that it will accelerate scientific research for the benefit of everyone, helping researchers explore more ideas, test them faster, and turn discoveries into impact.
Over the past year, we’ve been working closely with scientists across math, physics, biology, and computer science to understand where AI can help—and where it still falls short. Last month, we published a paper compiling early case studies from math, physics, biology, computer science, astronomy, and materials science in which GPT‑5 assisted researchers, showing how the model has already begun contributing to real scientific work. With GPT‑5.2, we’re starting to see those gains become more consistent and more reliable.
GPT‑5.2 Pro and GPT‑5.2 Thinking are our strongest models yet for scientific and mathematical work.
Strong mathematical reasoning is a foundation for reliability in scientific and technical work. It enables models to follow multi-step logic, keep quantities consistent, and avoid subtle errors that can compound in real analyses—from simulations and statistics to forecasting and modeling. Improvements on benchmarks like FrontierMath reflect not a narrow skill, but stronger general reasoning and abstraction, capabilities that carry directly into scientific workflows such as coding, data analysis, and experimental design.
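To make the point about compounding errors concrete, here is a minimal, illustrative Python sketch (not drawn from any benchmark or real analysis) showing how a small per-step error grows over a long chain of calculations:

```python
# Illustrative only: how a small per-step relative error compounds across a
# multi-step calculation, such as an iterative simulation or forecast.

def compound(n_steps: int, per_step_error: float) -> float:
    """Return the total relative error after n_steps, each off by per_step_error."""
    total = 1.0
    for _ in range(n_steps):
        total *= (1.0 + per_step_error)
    return total - 1.0

# A 0.1% error per step grows to roughly 10.5% after 100 steps.
print(f"{compound(100, 0.001):.3%}")
```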
These capabilities are also closely tied to progress toward general intelligence. A system that can reliably reason through abstraction, maintain consistency across long chains of thought, and generalize across domains is exhibiting traits that are foundational to AGI—not task-specific tricks, but broad, transferable reasoning skills that matter across science, engineering, and real-world decision-making.
We believe GPT‑5.2 Pro and GPT‑5.2 Thinking are the world’s best models for assisting and accelerating scientists. On GPQA Diamond, a graduate-level Google-proof Q&A benchmark, GPT‑5.2 Pro achieves 93.2%, followed closely by GPT‑5.2 Thinking at 92.4%.
In GPQA Diamond, models answer multiple choice questions about physics, chemistry, and biology. No tools were enabled and reasoning effort was set to maximum.
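As an illustration of how accuracy on a multiple-choice benchmark like GPQA Diamond is typically computed, here is a minimal sketch; the data format and answer-extraction rule are assumptions for illustration, not the official evaluation harness.

```python
# Minimal sketch of multiple-choice grading in the style of GPQA Diamond.
# The item format and letter-extraction convention are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    choices: dict[str, str]   # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    answer: str               # correct choice letter

def extract_letter(model_output: str) -> str:
    """Take the first A-D letter in the model's reply (an assumed convention)."""
    for ch in model_output.upper():
        if ch in "ABCD":
            return ch
    return ""

def accuracy(items: list[Item], replies: list[str]) -> float:
    """Fraction of items where the extracted letter matches the reference answer."""
    correct = sum(extract_letter(r) == it.answer for it, r in zip(items, replies))
    return correct / len(items)
```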
On FrontierMath (Tier 1–3), an evaluation of expert-level mathematics, GPT‑5.2 Thinking set a new state of the art, solving 40.3% of problems.
In FrontierMath, models solve expert-level mathematics problems. A Python tool was enabled and reasoning effort was set to maximum.
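FrontierMath problems are designed to have machine-verifiable answers. The sketch below shows one common way such grading can work, comparing a model's final answer to the reference symbolically; the use of sympy and exact symbolic comparison is an assumption about the grading setup, not the benchmark's actual scripts.

```python
# Illustrative sketch of automated answer checking for FrontierMath-style
# problems, where each problem has a single machine-verifiable answer.
import sympy as sp

def answers_match(model_answer: str, reference: str) -> bool:
    """Return True if the two expressions are symbolically equal."""
    try:
        diff = sp.simplify(sp.sympify(model_answer) - sp.sympify(reference))
        return diff == 0
    except (sp.SympifyError, TypeError):
        return False

# Example: an unsimplified but equivalent form still counts as correct.
print(answers_match("2**10 - 24", "1000"))  # True
```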
Case study
This result suggests a useful direction for how AI systems can support scientific research, particularly in domains with axiomatic theoretical foundations such as mathematics and theoretical computer science. In settings like these, frontier models can help explore proofs, test hypotheses, and identify connections that might otherwise take substantial human effort to uncover.
At the same time, these systems are not independent researchers. Expert judgment, verification, and domain understanding remain essential. Even highly capable models can make mistakes or rely on unstated assumptions. But they can also produce detailed, structured arguments that merit careful human study and refinement. Making reliable progress with AI therefore depends on workflows that keep validation, transparency, and collaboration firmly in the loop.
Viewed as a case study, this result illustrates an emerging mode of research practice. Models like GPT‑5.2 can serve as tools for supporting mathematical reasoning and accelerating early-stage exploration, while responsibility for correctness, interpretation, and context remains with human researchers. Used carefully, such systems may help streamline significant aspects of theoretical work without displacing the central role of human judgment in scientific inquiry.


