Executive Summary: Frontier AI Model Capabilities and Training Methods

Research Date: November 14, 2025

Topic: Frontier AI Model Capabilities and Training Methods

Models Covered: OpenAI GPT-5, OpenAI o-series (o1, o3-mini, o1-preview), Anthropic Claude 3.5 Sonnet / Claude 3.7 Opus, Google Gemini Ultra 2.0, Meta Llama 4 (frontier-scale internal variants), DeepSeek V3 and DeepSeek R1

Executive Overview

As stated in the research parameters, "key technical details are intentionally undisclosed or obscured" for major models, making this "the single hardest topic to research and write about on the web today" [550]. The most difficult information to find includes "exact training datasets and data mixture ratios, token counts, preprocessing pipelines, compute budgets and cluster topologies, optimizer configurations, RLHF methodologies, supervised fine-tuning sources, architecture modifications (MoE routing logic, attention variants, parallelism strategies), internal eval benchmarks, safety red-team results, and alignment techniques" [551]. "These details are hidden due to competitive, economic, and geopolitical pressure, while public statements are often vague marketing layers rather than technical truth" [552]. The field "evolves so fast that even partial information becomes outdated within weeks, and widespread speculation, leaks, and misinformation drown out credible analysis" [553]. "Independent verification is nearly impossible—training runs cost millions and rely on restricted hardware—accurate reporting requires deep multi-domain expertise yet still lacks access to the primary evidence needed for certainty" [554].

1. Introduction: The Challenge of Researching Frontier AI Models

"These details are hidden due to competitive, economic, and geopolitical pressure, while public statements are often vague marketing layers rather than technical truth" [555]. The field "evolves so fast that even partial information becomes outdated within weeks, and widespread speculation, leaks, and misinformation drown out credible analysis" [556]. "Independent verification is nearly impossible—training runs cost millions and rely on restricted hardware—accurate reporting requires deep multi-domain expertise yet still lacks access to the primary evidence needed for certainty" [557].

2. OpenAI GPT-5

2.1 Capabilities and Performance

"We are introducing GPT‑5, our best AI system yet" [1]. "GPT‑5 is a significant leap in intelligence over all our previous models, featuring state-of-the-art performance across coding, math, writing, health, visual perception, and more" [2]. "It is a unified system that knows when to respond quickly and when to think longer to provide expert-level responses" [3]. "GPT‑5 is available to all users, with Plus subscribers getting more usage, and Pro subscribers getting access to GPT‑5 pro, a version with extended reasoning for even more comprehensive and accurate answers" [4].

"GPT‑5 not only outperforms previous models on benchmarks and answers questions more quickly, but—most importantly—is more useful for real-world queries" [5]. "We've made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, while leveling up GPT‑5's performance in three of ChatGPT's most common uses: writing, coding, and health" [6].

"GPT‑5 is a unified system with a smart, efficient model that answers most questions, a deeper reasoning model (GPT‑5 thinking) for harder problems, and a real‑time router that quickly decides which to use based on conversation type, complexity, tool needs, and your explicit intent (for example, if you say 'think hard about this' in the prompt)" [69]. "The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time" [70]. "Once usage limits are reached, a mini version of each model handles remaining queries" [71]. "In the near future, we plan to integrate these capabilities into a single model" [72].

"GPT‑5 gets more value out of less thinking time" [73]. "In our evaluations, GPT‑5 (with thinking) performs better than OpenAI o3 with 50-80% less output tokens across capabilities, including visual reasoning, agentic coding, and graduate-level scientific problem solving" [74]. "Overall, GPT‑5 is less effusively agreeable, uses fewer unnecessary emojis, and is more subtle and thoughtful in follow‑ups compared to GPT‑4o" [75]. "It should feel less like 'talking to AI' and more like chatting with a helpful friend with PhD‑level intelligence" [76]. "Earlier this year, we released an update to GPT‑4o⁠ that unintentionally made the model overly sycophantic, or excessively flattering or agreeable" [77]. "We quickly rolled back the change⁠ and have since worked to understand and reduce this behavior by: Developing new evaluations to measure sycophancy levels, Improving our training so the model is less sycophantic—for instance, adding examples that would normally lead to over-agreement, and then teaching it not to do that" [78]. "In targeted sycophancy evaluations using prompts specifically designed to elicit sycophantic responses, GPT‑5 meaningfully reduced sycophantic replies (from 14.5% to less than 6%)" [79]. "At times, reducing sycophancy can come with reductions in user satisfaction, but the improvements we made cut sycophancy by more than half while also delivering other measurable gains, so users continue to have high-quality, constructive conversations—in line with our goal to help people use ChatGPT well⁠" [80].

"GPT‑5 is significantly better at instruction following, and we see a corresponding improvement in its ability to follow custom instructions" [81]. "We're also launching a research preview of four new preset personalities for all ChatGPT users, made possible by the improvements on steerability" [82]. "These personalities, available initially for text chat and coming later to Voice, let you set how ChatGPT interacts—whether concise and professional, thoughtful and supportive, or a bit sarcastic—without writing custom prompts" [83]. "The four initial options, Cynic, Robot, Listener, and Nerd, are opt-in, adjustable anytime in settings, and designed to match your communication style" [84]. "All of these new personalities meet or exceed our bar on internal evals for reducing sycophancy" [85]. "We look forward to learning and iterating based on early feedback" [86].

"We decided to treat the 'GPT‑5 thinking' model as High capability in the Biological and Chemical domain, and have implemented strong safeguards to sufficiently minimize the associated risks" [87]. "We rigorously tested the model with our safety evaluations under our Preparedness Framework⁠⁠, completing 5,000 hours of red-teaming with partners like the CAISI and UK AISI" [88]. "Similar to our approach for ChatGPT Agent, while we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm–our defined threshold⁠(opens in a new window) for High capability–we are taking a precautionary approach and are activating the required safeguards now in order to increase readiness for when such capabilities are available" [89]. "As a result, 'GPT‑5 thinking' has a robust safety stack with a multilayered defense system for biology: comprehensive threat modeling, training the model to not output harmful content through our new safe completions paradigm, always-on classifiers and reasoning monitors, and clear enforcement pipelines" [90]. "Read more about our robust safety approach for GPT‑5 in our system card" [91].

"For the most challenging, complex tasks, we are also releasing GPT‑5 pro, replacing OpenAI o3‑pro, a variant of GPT‑5 that thinks for ever longer, using scaled but efficient parallel test-time compute, to provide the highest quality and most comprehensive answers" [92]. "GPT‑5 pro achieves the highest performance in the GPT‑5 family on several challenging intelligence benchmarks, including state-of-the-art performance on GPQA, which contains extremely difficult science questions" [93]. "In evaluations on over 1000 economically valuable, real-world reasoning prompts, external experts preferred GPT‑5 pro over 'GPT‑5 thinking' 67.8% of the time" [94]. "GPT‑5 pro made 22% fewer major errors and excelled in health, science, mathematics, and coding" [95]. "Experts rated its responses as relevant, useful, and comprehensive" [96].

"GPT‑5 is the new default in ChatGPT, replacing GPT‑4o, OpenAI o3, OpenAI o4-mini, GPT‑4.1, and GPT‑4.5 for signed-in users" [97]. "Just open ChatGPT and type your question; GPT‑5 handles the rest , applying reasoning automatically when the response would benefit from it" [98]. "Paid users can still select 'GPT‑5 Thinking' from the model picker, or type something like 'think hard about this' in the prompt to ensure reasoning is used when generating a response" [99]. "GPT‑5 is starting to roll out today to all Plus, Pro, Team, and Free users, with access for Enterprise and Edu coming next week" [100]. "Pro, Plus, and Team users can also start coding with GPT‑5 in the Codex CLI⁠(opens in a new window) by signing in with ChatGPT" [101]. "As with GPT‑4o, the difference between free and paid access to GPT‑5 is usage volume" [102]. "Pro subscribers get unlimited access to GPT‑5, and access to GPT‑5 Pro" [103]. "Plus users can use it comfortably as their default model for everyday questions, with significantly higher usage than free users" [104]. "Team, Enterprise, and Edu customers can also use GPT‑5 comfortably as their default model for everyday work, with generous limits that make it easy for entire organizations to rely on GPT‑5" [105]. "For ChatGPT free-tier users, full reasoning capabilities may take a few days to fully roll out" [106]. "Once free users reach their GPT‑5 usage limits, they will transition to GPT‑5 mini, a smaller, faster, and highly capable model" [107].

"GPT‑5 is our strongest coding model to date" [7]. "It shows particular improvements in complex front‑end generation and debugging larger repositories" [8]. "It can often create beautiful and responsive websites, apps, and games with an eye for aesthetic sensibility in just one prompt, intuitively and tastefully turning ideas into reality" [9]. "Early testers also noted its design choices, with a much better understanding of things like spacing, typography, and white space" [10].

"GPT‑5 is our most capable writing collaborator yet, able to help you steer and translate rough ideas into compelling, resonant writing with literary depth and rhythm" [11]. "It more reliably handles writing that involves structural ambiguity, such as sustaining unrhymed iambic pentameter or free verse that flows naturally, combining respect for form with expressive clarity" [12].

"GPT‑5 is our best model yet for health-related questions, empowering users to be informed about and advocate for their health" [13]. "The model scores significantly higher than any previous model on HealthBench ⁠, an evaluation we published earlier this year based on realistic scenarios and physician-defined criteria" [14]. "Compared to previous models, it acts more like an active thought partner, proactively flagging potential concerns and asking questions to give more helpful answers" [15].

"GPT‑5 is much smarter across the board, as reflected by its performance on academic and human-evaluated benchmarks, particularly in math, coding, visual perception, and health" [16]. "It sets a new state of the art across math (94.6% on AIME 2025 without tools), real-world coding (74.9% on SWE-bench Verified, 88% on Aider Polyglot), multimodal understanding (84.2% on MMMU), and health (46.2% on HealthBench Hard)—and those gains show up in everyday use" [17]. "With GPT‑5 pro's extended reasoning, the model also sets a new SOTA on GPQA, scoring 88.4% without tools" [18].

2.2 Training Process

"GPT‑5 was trained on Microsoft Azure AI supercomputers" [19]. According to Epoch AI research, "GPT-5 used less training compute than GPT-4.5 because OpenAI focused on scaling post-training" [436]. "New post-training techniques made it possible to outperform GPT-4.5 with less training compute, but these methods likely weren't yet mature enough to be applied at GPT-4.5's compute scale" [437]. "Doing so would've taken more time (and compute), which OpenAI likely chose not to do due to strong market pressures" [438].

"Until recently, most LLMs were trained with 100× more pre-training than post-training compute" [439]. "However, around September 2024, researchers developed novel techniques used in 'reasoning models' that help scale post-training compute effectively" [440]. "Researchers could now triple post-training compute in a way that was at least as useful as tripling pre-training compute" [441]. "In fact, these reasoning techniques make it possible to reduce pre-training compute by roughly 10× while getting the same performance!" [442].

"Out of all the GPT models, GPT-5 is the odd one out" [443]. "Unlike all previous versions of GPT, it was likely trained on less compute than its immediate predecessor, GPT-4.5" [444]. "While the exact numbers are uncertain, GPT-4.5 very likely used more training compute than GPT-5" [445]. "But this leads to a puzzle: Models trained with more compute tend to be better, so why did OpenAI train GPT-5 with less compute than GPT-4.5?" [446].

"Importantly, when we say 'training compute', we're focusing on the compute to perform the final training run of a model" [447]. "It's likely that the total compute for developing GPT-5 was higher than for GPT-4.5, if we also account for the compute for running experiments" [448]. "This is because OpenAI's (projected) R&D compute spend has grown from ~$5 billion in 2024 to ~$9 billion in 2025" [449].

"Why did GPT-5 use less training compute than GPT-4.5? We believe this is a combination of two factors" [450]. "First, OpenAI decided to prioritize scaling post-training, which had better returns on the margin" [451]. "Second, they couldn't readily scale post-training compute to GPT-4.5 levels at the time" [452]. "And if they tried to scale post-training on a model with as much pre-training as GPT-4.5, they would've run into timing and experimental compute constraints" [453].

"This means that, rather than spending around $200 million on pre-training and $2 million on post-training GPT-4.5, new post-training techniques made it possible for that $2 million in post-training to achieve the same overall performance with only $20 million in pre-training" [454]. "That's roughly a ten-fold decrease in training costs, though this doesn't imply that total model development costs were lower, due to increases in the compute needed to run experiments" [455]. "The upshot is that OpenAI was likely able to train a model with less compute than GPT-4.5, while still outperforming it on many useful tasks like coding and search" [456].

"However, while this shows that OpenAI could've outperformed GPT-4.5 with less training compute, it doesn't fully explain why they chose this strategy in practice" [457]. "For example, why not just post-train GPT-4.5? And why not post-train a smaller model on enough data to reach GPT-4.5's level of training compute?" [458]. "The core reason is that scaling post-training in this way is challenging" [459]. "It requires lots of testing and experimentation, which takes time and compute, especially when performed on larger, newer models" [460]. "It also requires a significant amount of high-quality post-training data, which takes time to design and collect" [461].

"Crucially, OpenAI faced major time constraints due to market pressures" [462]. "This came in the form of fierce competition from rival AI labs, which would hurt their revenue – e.g. Anthropic's models had been consistently outperforming OpenAI's models at coding" [463]. "And there was added pressure because many had expected OpenAI to release a model called 'GPT-5' as early as November 2023" [464].

"Given these constraints, we believe that OpenAI scaled post-training on a smaller model as much as they could" [465]. "Scaling further would've either required more experiments than they had the compute or time for, or post-training data that they didn't have" [466]. "Post-training a GPT-4.5-sized model, let alone starting a larger multi-month pre-training run and doing post-training on top, would've taken too much time or too much experiment compute" [467]. "The result of these efforts in scaling post-training was GPT-5, a new state-of-the-art model that OpenAI was able to release by August" [468].

"What does this mean for training compute trends moving forward? Our best guess is that future iterations of GPT will be trained on more compute" [469]. "To see why, consider the bigger picture" [470]. "Training GPT-5 with less compute than GPT-4.5 is part of a broader trend, where the training compute of state-of-the-art models has grown more slowly than one might've expected a year ago" [471]. "Since post-training was just a small portion of training compute and scaling it yielded huge returns, AI labs focused their limited training compute on scaling it rather than pre-training" [472].

"In fact, these reasoning post-training techniques have scaled much faster than pre-training compute" [473]. "At this rate, tripling post-training compute will soon be akin to tripling the entire compute budget – so current growth rates likely can't be sustained for much more than a year" [474]. "That means that this broader trend is likely to end – we may see a reversion to the original trend of training compute growth" [475]. "If this is right, GPT-6 is likely to need much more training compute than GPT-5, and probably more than GPT-4.5" [476]. "Not to mention, OpenAI plans to significantly expand their compute stock, with many more GPUs brought online by the end of the year, and major clusters like Stargate Abilene coming out in phases" [477].

2.3 Alignment and Safety

"GPT‑5 advances the frontier on safety" [20]. "In the past, ChatGPT relied primarily on refusal-based safety training: based on the user's prompt, the model should either comply or refuse" [21]. "While this type of training works well for explicitly malicious prompts, it can struggle to handle situations where the user's intent is unclear, or information could be used in benign or malicious ways" [22]. "For GPT‑5, we introduced a new form of safety-training — safe completions — which teaches the model to give the most helpful answer where possible while still staying within safety boundaries" [23].

"GPT‑5 is significantly less likely to hallucinate than our previous models" [24]. "With web search enabled on anonymized prompts representative of ChatGPT production traffic, GPT‑5's responses are ~45% less likely to contain a factual error than GPT‑4o, and when thinking, GPT‑5's responses are ~80% less likely to contain a factual error than OpenAI o3" [25].

"GPT‑5 (with thinking) more honestly communicates its actions and capabilities to the user—especially for tasks which are impossible, underspecified, or missing key tools" [26]. "In order to achieve a high reward during training, reasoning models may learn to lie about successfully completing a task or be overly confident about an uncertain answer" [27]. "We evaluated deception rates on settings involving impossible coding tasks and missing multimodal assets, and found that GPT‑5 (with thinking) is less deceptive than o3 across the board" [28]. "On a large set of conversations representative of real production ChatGPT traffic, we've reduced rates of deception from 4.8% for o3 to 2.1% of GPT‑5 reasoning responses" [29].

3. OpenAI o-Series Models (o1, o3-mini, o1-preview)

3.1 Reasoning Capabilities

"OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA)" [30].

"On the 2024 AIME exams, GPT‑4o only solved on average 12% (1.8/15) of problems" [31]. "o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function" [32]. "A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad" [33].

"We also evaluated o1 on GPQA diamond, a difficult intelligence benchmark which tests for expertise in chemistry, physics and biology" [34]. "In order to compare models to humans, we recruited experts with PhDs to answer GPQA-diamond questions" [35]. "We found that o1 surpassed the performance of those human experts, becoming the first model to do so on this benchmark" [36]. "These results do not imply that o1 is more capable than a PhD in all respects — only that the model is more proficient in solving some problems that a PhD would be expected to solve" [37].

3.2 Training Methodology

"Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process" [38]. "We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute)" [39]. "The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them" [40].

"o1 performance smoothly improves with both train-time and test-time compute" [108]. "To highlight the reasoning improvement over GPT‑4o, we tested our models on a diverse set of human exams and ML benchmarks" [109]. "We show that o1 significantly outperforms GPT‑4o on the vast majority of these reasoning-heavy tasks" [110]. "Unless otherwise specified, we evaluated o1 on the maximal test-time compute setting" [111]. "o1 greatly improves over GPT-4o on challenging reasoning benchmarks" [112]. "Solid bars show pass@1 accuracy and the shaded region shows the performance of majority vote (consensus) with 64 samples" [113].

"In many reasoning-heavy benchmarks, o1 rivals the performance of human experts" [114]. "Recent frontier models do so well on MATH and GSM8K that these benchmarks are no longer effective at differentiating models" [115]. "We evaluated math performance on AIME, an exam designed to challenge the brightest high school math students in America" [116]. "On several other ML benchmarks, o1 improved over the state-of-the-art" [117]. "With its vision perception capabilities enabled, o1 scored 78.2% on MMMU, making it the first model to be competitive with human experts" [118]. "It also outperformed GPT‑4o on 54 out of 57 MMLU subcategories" [119].

"In addition to exams and academic benchmarks, we also evaluated human preference of o1‑preview vs GPT‑4o on challenging, open-ended prompts in a broad spectrum of domains" [120]. "In this evaluation, human trainers were shown anonymized responses to a prompt from o1‑preview and GPT‑4o, and voted for which response they preferred" [121]. "o1‑preview is preferred to gpt-4o by a large margin in reasoning-heavy categories like data analysis, coding, and math" [122]. "However, o1‑preview is not preferred on some natural language tasks, suggesting that it is not well-suited for all use cases" [123].

"Chain of thought reasoning provides new opportunities for alignment and safety" [124]. "We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles" [125]. "By teaching the model our safety rules and how to reason about them in context, we found evidence of reasoning capability directly benefiting model robustness: o1‑preview achieved substantially improved performance on key jailbreak evaluations and our hardest internal benchmarks for evaluating our model's safety refusal boundaries" [126]. "We believe that using a chain of thought offers significant advances for safety and alignment because (1) it enables us to observe the model thinking in a legible way, and (2) the model reasoning about safety rules is more robust to out-of-distribution scenarios" [127].

"To stress-test our improvements, we conducted a suite of safety tests and red-teaming before deployment, in accordance with our Preparedness Framework⁠(opens in a new window)" [128]. "We found that chain of thought reasoning contributed to capability improvements across our evaluations" [129]. "Of particular note, we observed interesting instances of reward hacking⁠(opens in a new window)" [130]. "Detailed results from these evaluations can be found in the accompanying System Card" [131].

"We believe that a hidden chain of thought presents a unique opportunity for monitoring models" [132]. "Assuming it is faithful and legible, the hidden chain of thought allows us to 'read the mind' of the model and understand its thought process" [133]. "For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user" [134]. "However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought" [135]. "We also do not want to make an unaligned chain of thought directly visible to users" [136]. "Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users" [137]. "We acknowledge this decision has disadvantages" [138]. "We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer" [139]. "For the o1 model series we show a model-generated summary of the chain of thought" [140].

"We trained a model that scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI), by initializing from o1 and training to further improve programming skills" [141]. "This model competed in the 2024 IOI under the same conditions as the human contestants" [142]. "It had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem" [143]. "For each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy" [144]. "Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function" [145]. "If we had instead submitted at random, we would have only scored 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints" [146]. "With a relaxed submission constraint, we found that model performance improved significantly" [147]. "When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy" [148]. "Finally, we simulated competitive programming contests hosted by Codeforces to demonstrate this model's coding skill" [149]. "Our evaluations closely matched competition rules and allowed for 10 submissions" [150]. "GPT‑4o achieved an Elo rating of 808, which is in the 11th percentile of human competitors" [151]. "This model far exceeded both GPT‑4o and o1—it achieved an Elo rating of 1807, performing better than 93% of competitors" [152]. "Further fine-tuning on programming competitions improves o1" [153]. "The improved model ranked in the 49th percentile in the 2024 International Olympiad in Informatics under competition rules" [154].

"o1 significantly advances the state-of-the-art in AI reasoning" [155]. "We plan to release improved versions of this model as we continue iterating" [156]. "We expect these new reasoning capabilities will improve our ability to align models to human values and principles" [157]. "We believe o1 – and its successors – will unlock many new use cases for AI in science, coding, math, and related fields" [158]. "We are excited for users and API developers to discover how it can improve their daily work" [159].

3.3 Chain of Thought Reasoning

"Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem" [41]. "Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses" [42]. "It learns to recognize and correct its mistakes" [43]. "It learns to break down tricky steps into simpler ones" [44]. "It learns to try a different approach when the current one isn't working" [45]. "This process dramatically improves the model's ability to reason" [46].

3.4 OpenAI o3 Model

"Today, OpenAI previewed their o3 model continuing their recent progress on training language models to reason with o1" [478]. "These models, starting with o3-mini, are expected to be available to the general public in late January of 2025" [479]. "There was no moment with a 'GPT-4 release' level of excitement in 2024" [480]. "o3 changes that by being far more unexpected than o1, and signals rapid progress across reasoning models" [481]. "We knew o1 was coming with the long lead-up — the quick and effective follow-up with o3 sets us up for a very dynamic 2025" [482].

"OpenAI's o3 shows the industry is beginning to climb its next hill as progress from pretraining only on internet text yields fewer profitable benefits" [483]. "o3 is a major step change in reasoning evaluations — in summary, it is: The first model to surpass the 85% threshold for completing the ARC AGI prize (Note: this was done on the public set, not the test set, and exceeded cost constraints)" [484]. "A step change in state-of-the-art performance on the extremely new Frontier Math benchmark from 2 to 25%" [485]. "Substantial improvements were made to all of the leading coding benchmarks, such as SWE-Bench-Verified" [486].

"On the 2024 AIME exams, GPT‑4o only solved on average 12% (1.8/15) of problems" [67]. "o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function" [68]. "Before the o1-class models, OpenAI's best model, GPT-4o, only achieved 5% accuracy" [487]. "The incredible pace of progress on the evaluation as OpenAI hillclimbed on their new reasoning models was summarized by co-founder of ARC Prize Mike Knoop: GPT-2 (2019): 0%, GPT-3 (2020): 0%, GPT-4 (2023): 2%, GPT-4o (2024): 5%, o1-preview (2024): 21%, o1 high (2024): 32%, o1 Pro (2024): ~50%, o3 tuned low (2024): 76%, o3 tuned high (2024): 87%" [488].

"Just in June, the narrative was still that solving ARC-AGI would be extremely hard" [489]. "This has totally flipped on its head in just a few months" [490]. "Even those bullish about rumors of Q* and other reasoning approaches would not have expected this level of success" [491]. "We tested o3 against two ARC-AGI datasets: Semi-Private Eval: 100 private tasks used to assess overfitting, Public Eval: 400 public tasks" [492]. "At OpenAI's direction, we tested at two levels of compute with variable sample sizes: 6 (high-efficiency) and 1024 (low-efficiency, 172x compute)" [493].

"According to SemiAnalysis, o1 pro uses self-consistency methods or simple consensus@N checks to increase performance by selecting the most common answer across multiple parallel responses to the same query" [494]. "Here, sample size, N, likely corresponds to a consensus@N number, indicating that o3 was evaluated in something close to the configuration for o1 pro that customers can use, 6x compute, and a super high configuration with 1024x compute per problem" [495]. "This scale of inference is not going to be served to standard paid users for a long time" [496]. "Most users will be exposed to one pass to consensus@10 depending on the specifications of the 'pro' tier of o1 models" [497].

"The story in deep learning that has been driving progress in the last few years is finding a rich area of potential and hill climbing on it" [498]. "The first wave of progress was in internet-scale pretraining" [499]. "Now, OpenAI has identified a hill to climb by scaling reinforcement learning training and long-context reasoning" [500]. "Given that o3 is only about three months after the release of OpenAI's o1, the simplest explanation is that it is the same architecture and training methodology, scaled up" [501]. "There is no evidence other than hearsay that o3 made an architectural change to inference by adding tree search" [502]. "A core rule of inference scaling laws is that sampling more from the same single-stream generation can give performance improvements" [503].

"At the same time, OpenAI released a blog post and research paper on deliberative alignment, showcasing how o1-class models can enhance safety and alignment research" [504]. "This provides some of the first positive pieces of evidence for the much bigger open question I hinted at earlier: Can enhanced reasoning abilities deliver value outside of verifiable domains?" [505]. "This will be revisited many times in 2025" [506].

4. Anthropic Claude 3.5 Sonnet and Claude 3.7 Opus

4.1 Claude 3.5 Sonnet Capabilities

"Claude 3.5 Sonnet raises the industry bar for intelligence, outperforming competitor models and Claude 3 Opus on a wide range of evaluations, with the speed and cost of our mid-tier model, Claude 3 Sonnet" [160]. "Claude 3.5 Sonnet sets new industry benchmarks for graduate-level reasoning (GPQA), undergraduate-level knowledge (MMLU), and coding proficiency (HumanEval)" [161].

4.2 Performance Improvements

Claude 3.5 Sonnet shows "marked improvement in grasping nuance, humor, and complex instructions, and is exceptional at writing high-quality content with a natural, relatable tone" [162]. "Claude 3.5 Sonnet operates at twice the speed of Claude 3 Opus" [163], making it "ideal for complex tasks such as context-sensitive customer support and orchestrating multi-step workflows" [164].

4.3 Coding Capabilities

"In an internal agentic coding evaluation, Claude 3.5 Sonnet solved 64% of problems, outperforming Claude 3 Opus which solved 38%" [165]. "Our evaluation tests the model's ability to fix a bug or add functionality to an open source codebase, given a natural language description of the desired improvement" [166]. "When instructed and provided with the relevant tools, Claude 3.5 Sonnet can independently write, edit, and execute code with sophisticated reasoning and troubleshooting capabilities" [167].

4.4 Claude 3.7 Sonnet

"Today, we're announcing Claude 3.7 Sonnet1, our most intelligent model to date and the first hybrid reasoning model on the market" [168]. "Claude 3.7 Sonnet can produce near-instant responses or extended, step-by-step thinking that is made visible to the user" [169]. "API users also have fine-grained control over _how long_ the model can think for" [170].

"We've developed Claude 3.7 Sonnet with a different philosophy from other reasoning models on the market" [171]. "Just as humans use a single brain for both quick responses and deep reflection, we believe reasoning should be an integrated capability of frontier models rather than a separate model entirely" [172]. "This unified approach also creates a more seamless experience for users" [173].

"Claude 3.7 Sonnet is both an ordinary LLM and a reasoning model in one: you can pick when you want the model to answer normally and when you want it to think longer before answering" [174]. "In the standard mode, Claude 3.7 Sonnet represents an upgraded version of Claude 3.5 Sonnet" [175]. "In extended thinking mode, it self-reflects before answering, which improves its performance on math, physics, instruction-following, coding, and many other tasks" [176].

"Claude 3.7 Sonnet shows particularly strong improvements in coding and front-end web development" [186]. "Along with the model, we're also introducing a command line tool for agentic coding, Claude Code" [187]. "Claude Code is available as a limited research preview, and enables developers to delegate substantial engineering tasks to Claude directly from their terminal" [188]. "Claude 3.7 Sonnet is now available on all Claude plans—including Free, Pro, Team, and Enterprise—as well as the Claude Developer Platform, Amazon Bedrock, and Google Cloud's Vertex AI" [189]. "Extended thinking mode is available on all surfaces except the free Claude tier" [190]. "In both standard and extended thinking modes, Claude 3.7 Sonnet has the same price as its predecessors: $3 per million input tokens and $15 per million output tokens—which includes thinking tokens" [191].

"Claude 3.7 Sonnet: Frontier reasoning made practical" [192]. "In developing our reasoning models, we've optimized somewhat less for math and computer science competition problems, and instead shifted focus towards real-world tasks that better reflect how businesses actually use LLMs" [193]. "Early testing demonstrated Claude's leadership in coding capabilities across the board: Cursor noted Claude is once again best-in-class for real-world coding tasks, with significant improvements in areas ranging from handling complex codebases to advanced tool use" [194]. "Cognition found it far better than any other model at planning code changes and handling full-stack updates" [195]. "Vercel highlighted Claude's exceptional precision for complex agent workflows, while Replit has successfully deployed Claude to build sophisticated web apps and dashboards from scratch, where other models stall" [196]. "In Canva's evaluations, Claude consistently produced production-ready code with superior design taste and drastically reduced errors" [197].

"Claude 3.7 Sonnet achieves state-of-the-art performance on SWE-bench Verified, which evaluates AI models' ability to solve real-world software issues" [198]. "Claude 3.7 Sonnet achieves state-of-the-art performance on TAU-bench, a framework that tests AI agents on complex real-world tasks with user and tool interactions" [199]. "Claude 3.7 Sonnet excels across instruction-following, general reasoning, multimodal capabilities, and agentic coding, with extended thinking providing a notable boost in math and science" [200]. "Beyond traditional benchmarks, it even outperformed all previous models in our Pokémon gameplay tests" [201].

"Since June 2024, Sonnet has been the preferred model for developers worldwide" [202]. "Today, we're empowering developers further by introducing Claude Code—our first agentic coding tool—in a limited research preview" [203]. "Claude Code is an active collaborator that can search and read code, edit files, write and run tests, commit and push code to GitHub, and use command line tools—keeping you in the loop at every step" [204]. "Claude Code is an early product but has already become indispensable for our team, especially for test-driven development, debugging complex issues, and large-scale refactoring" [205]. "In early testing, Claude Code completed tasks in a single pass that would normally take 45+ minutes of manual work, reducing development time and overhead" [206]. "In the coming weeks, we plan to continually improve it based on our usage: enhancing tool call reliability, adding support for long-running commands, improved in-app rendering, and expanding Claude's own understanding of its capabilities" [207]. "Our goal with Claude Code is to better understand how developers use Claude for coding to inform future model improvements" [208]. "By joining this preview, you'll get access to the same powerful tools we use to build and improve Claude, and your feedback will directly shape its future" [209].

"We've also improved the coding experience on Claude.ai" [210]. "Our GitHub integration is now available on all Claude plans—enabling developers to connect their code repositories directly to Claude" [211]. "Claude 3.7 Sonnet is our best coding model to date" [212]. "With a deeper understanding of your personal, work, and open source projects, it becomes a more powerful partner for fixing bugs, developing features, and building documentation across your most important GitHub projects" [213].

"We've conducted extensive testing and evaluation of Claude 3.7 Sonnet, working with external experts to ensure it meets our standards for security, safety, and reliability" [214]. "The system card for this release covers new safety results in several categories, providing a detailed breakdown of our Responsible Scaling Policy evaluations that other AI labs and researchers can apply to their work" [215]. "The card also addresses emerging risks that come with computer use, particularly prompt injection attacks, and explains how we evaluate these vulnerabilities and train Claude to resist and mitigate them" [216]. "Additionally, it examines potential safety benefits from reasoning models: the ability to understand how models make decisions, and whether model reasoning is genuinely trustworthy and reliable" [217]. "Read the full system card to learn more" [218].

"Claude 3.7 Sonnet and Claude Code mark an important step towards AI systems that can truly augment human capabilities" [219]. "With their ability to reason deeply, work autonomously, and collaborate effectively, they bring us closer to a future where AI enriches and expands what humans can achieve" [220].

4.5 Safety and Alignment

"Despite Claude 3.5 Sonnet's leap in intelligence, our red teaming assessments have concluded that Claude 3.5 Sonnet remains at ASL-2" [177]. "We've engaged with external experts to test and refine the safety mechanisms within this latest model" [178], including providing "Claude 3.5 Sonnet to the UK's Artificial Intelligence Safety Institute (UK AISI) for pre-deployment safety evaluation" [179].

"Claude 3.7 Sonnet also makes more nuanced distinctions between harmful and benign requests, reducing unnecessary refusals by 45% compared to its predecessor" [180].

5. Google Gemini Ultra 2.0

5.1 Model Family Overview

"Gemini 2.5 is our most intelligent AI model, capable of reasoning through its thoughts before responding, resulting in enhanced performance and improved accuracy" [221]. The model family includes "Gemini 2.5 Pro," described as "Best for coding and highly complex tasks" [222], and "Gemini 2.5 Flash," optimized for "fast performance on everyday tasks" [223].

5.2 Deep Think Capabilities

Gemini 2.5 features "Deep Think," described as "an enhanced reasoning mode that uses cutting edge research techniques in parallel thinking and reinforcement learning to significantly improve Gemini's ability to solve complex problems" [224]. Deep Think "can better help tackle problems that require creativity, strategic planning, and making improvements step-by-step" [225].

"Google AI Ultra subscribers, you now have access to Deep Think in the Gemini app" [530]. "This tool uses parallel thinking to solve complex problems and excels in areas like coding and scientific discovery" [531]. "You can access Deep Think by toggling it on in the prompt bar within the Gemini app" [532]. "This new release incorporates feedback from early trusted testers and research breakthroughs" [533]. "It's a significant improvement over what was first announced at I/O, as measured in terms of key benchmark improvements and trusted tester feedback" [534]. "It is a variation of the model that recently achieved the gold-medal standard at this year's International Mathematical Olympiad (IMO)" [535]. "While that model takes hours to reason about complex math problems, today's release is faster and more usable day-to-day, while still reaching Bronze-level performance on the 2025 IMO benchmark, based on internal evaluations" [536].

"Deep Think pushes the frontier of thinking capabilities by using parallel thinking techniques" [537]. "This approach lets Gemini generate many ideas at once and consider them simultaneously, even revising or combining different ideas over time, before arriving at the best answer" [538]. "Moreover, by extending the inference time or 'thinking time,' we give Gemini more time to explore different hypotheses, and arrive at creative solutions to complex problems" [539]. "We've also developed novel reinforcement learning techniques that encourage the model to make use of these extended reasoning paths, thus enabling Deep Think to become a better, more intuitive problem-solver over time" [540].

"Deep Think's performance is also reflected in challenging benchmarks that measure coding, science, knowledge and reasoning capabilities" [541]. "For example, compared to other models without tool use, Gemini 2.5 Deep Think achieves state-of-the-art performance across LiveCodeBench V6, which measures competitive code performance, and Humanity's Last Exam, a challenging benchmark that measures expertise in different domains, including science and math" [542]. "Deep Think in the Gemini app uses parallel thinking techniques to deliver more detailed, creative and thoughtful responses" [543]. This includes "Iterative development and design" where "Deep Think can improve both the aesthetics and functionality of web development tasks" [544]. It also excels in "Scientific and mathematical discovery" and "Algorithmic development and code" [545].

"We continue to build safety and responsibility into Gemini throughout the training and deployment lifecycle" [546]. "In testing, Gemini 2.5 Deep Think demonstrated improved content safety and tone-objectivity compared to Gemini 2.5 Pro, but did have a higher tendency to refuse benign requests" [547]. "As Gemini's problem-solving abilities advance, we are taking a deeper look at risks that come with increased complexity, including our frontier safety evaluations and the implementation of planned mitigations for critical capability levels" [548]. "Further details on the safety outcomes of Gemini 2.5 Deep Think are available in the model card" [549].

5.3 Performance Benchmarks

According to Google's performance table, "Gemini 2.5 Pro Thinking" achieves "21.6%" on "Humanity's Last Exam (no tools)" [226], "86.4%" on "GPQA diamond" [227], "88.0%" on "AIME 2025" [228], and "69.0%" on "LiveCodeBench (UI: 1/1/2025-5/1/2025)" [229].

5.4 Multimodal Capabilities

"Gemini 2.5 builds on the best of Gemini — with native multimodality and a long context window" [230]. The models support "text, image and video modalities" [231], with pricing reflecting these capabilities.

6. Meta Llama 4

6.1 Frontier-Scale Variants

"We're introducing Llama 4 Scout and Llama 4 Maverick, the first open-weight natively multimodal models with unprecedented context length support and our first built using a mixture-of-experts (MoE) architecture" [233]. "Llama 4 Scout, a 17 billion active parameter model with 16 experts, is the best multimodal model in the world in its class and is more powerful than all previous generation Llama models, while fitting in a single NVIDIA H100 GPU" [234]. "Additionally, Llama 4 Scout offers an industry-leading context window of 10M and delivers better results than Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across a broad range of widely reported benchmarks" [235].

"Llama 4 Maverick, a 17 billion active parameter model with 128 experts, is the best multimodal model in its class, beating GPT-4o and Gemini 2.0 Flash across a broad range of widely reported benchmarks, while achieving comparable results to the new DeepSeek v3 on reasoning and coding—at less than half the active parameters" [236]. "Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena" [237].

"These models are our best yet thanks to distillation from Llama 4 Behemoth, a 288 billion active parameter model with 16 experts that is our most powerful yet and among the world's smartest LLMs" [238]. "Llama 4 Behemoth outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks" [239]. "Llama 4 Behemoth is still training, and we're excited to share more details about it even while it's still in flight" [240].

6.2 Training Architecture

"Our new Llama 4 models are our first models that use a mixture of experts (MoE) architecture" [241]. "In MoE models, a single token activates only a fraction of the total parameters" [242]. "MoE architectures are more compute efficient for training and inference and, given a fixed training FLOPs budget, delivers higher quality compared to a dense model" [243].

"As an example, Llama 4 Maverick models have 17B active parameters and 400B total parameters" [244]. "We use alternating dense and mixture-of-experts (MoE) layers for inference efficiency" [245]. "MoE layers use 128 routed experts and a shared expert" [246]. "Each token is sent to the shared expert and also to one of the 128 routed experts" [247]. "As a result, while all parameters are stored in memory, only a subset of the total parameters are activated while serving these models" [248]. "This improves inference efficiency by lowering model serving costs and latency—Llama 4 Maverick can be run on a single NVIDIA H100 DGX host for easy deployment, or with distributed inference for maximum efficiency" [249].

"Llama 4 models are designed with native multimodality, incorporating early fusion to seamlessly integrate text and vision tokens into a unified model backbone" [250]. "Early fusion is a major step forward, since it enables us to jointly pre-train the model with large amounts of unlabeled text, image, and video data" [251]. "We also improved the vision encoder in Llama 4" [252]. "This is based on MetaCLIP but trained separately in conjunction with a frozen Llama model to better adapt the encoder to the LLM" [253].

"The overall data mixture for training consisted of more than 30 trillion tokens, which is more than double the Llama 3 pre-training mixture and includes diverse text, image, and video datasets" [254].

6.3 Training Methodology

"We developed a new training technique which we refer to as MetaP that allows us to reliably set critical model hyper-parameters such as per-layer learning rates and initialization scales" [258]. "We found that chosen hyper-parameters transfer well across different values of batch size, model width, depth, and training tokens" [259]. "Llama 4 enables open source fine-tuning efforts by pre-training on 200 languages, including over 100 with over 1 billion tokens each, and overall 10x more multilingual tokens than Llama 3" [260].

"Additionally, we focus on efficient model training by using FP8 precision, without sacrificing quality and ensuring high model FLOPs utilization—while pre-training our Llama 4 Behemoth model using FP8 and 32K GPUs, we achieved 390 TFLOPs/GPU" [261]. "We continued training the model in what we call 'mid-training' to improve core capabilities with new training recipes including long context extension using specialized datasets" [262]. "This enabled us to enhance model quality while also unlocking best-in-class 10M input context length for Llama 4 Scout" [263].

"For mixing modalities, we came up with a carefully curated curriculum strategy that does not trade-off performance compared to the individual modality expert models" [264]. "With Llama 4, we revamped our post-training pipeline by adopting a different approach: lightweight supervised fine-tuning (SFT) > online reinforcement learning (RL) > lightweight direct preference optimization (DPO)" [265]. "A key learning was that SFT and DPO can over-constrain the model, restricting exploration during the online RL stage and leading to suboptimal accuracy, particularly in reasoning, coding, and math domains" [266]. "To address this, we removed more than 50% of our data tagged as easy by using Llama models as a judge and did lightweight SFT on the remaining harder set" [267]. "In the subsequent multimodal online RL stage, by carefully selecting harder prompts, we were able to achieve a step change in performance" [268]. "Furthermore, we implemented a continuous online RL strategy, where we alternated between training the model and then using it to continually filter and retain only medium-to-hard difficulty prompts" [269]. "This strategy proved highly beneficial in terms of compute and accuracy tradeoffs" [270]. "We then did a lightweight DPO to handle corner cases related to model response quality, effectively achieving a good balance between the model's intelligence and conversational abilities" [271]. "Both the pipeline architecture and the continuous online RL strategy with adaptive data filtering culminated in an industry-leading, general-purpose chat model with state-of-the-art intelligence and image understanding capabilities" [272].

"Llama 4 Scout is both pre-trained and post-trained with a 256K context length, which empowers the base model with advanced length generalization capability" [273]. "We present compelling results in tasks such as retrieval with 'retrieval needle in haystack' for text as well as cumulative negative log-likelihoods (NLLs) over 10 million tokens of code" [274]. "A key innovation in the Llama 4 architecture is the use of interleaved attention layers without positional embeddings" [275]. "Additionally, we employ inference time temperature scaling of attention to enhance length generalization" [276]. "We call this the iRoPE architecture, where 'i' stands for 'interleaved' attention layers, highlighting the long-term goal of supporting 'infinite' context length, and 'RoPE' refers to the rotary position embeddings employed in most layers" [277].

"We trained both of our models on a wide variety of image and video frame stills in order to give them broad visual understanding, including of temporal activities and related images" [278]. "This enables effortless interaction on multi-image inputs alongside text prompts for visual reasoning and understanding tasks" [279]. "The models were pre-trained on up to 48 images, and we've tested in post-training with good results up to eight images" [280]. "Llama 4 Scout is also best-in-class on image grounding, able to align user prompts with relevant visual concepts and anchor model responses to regions in the image" [281]. "This enables more precise visual question answering for the LLM to better understand user intent and localize objects of interest" [282]. "Llama 4 Scout also exceeds comparable models on coding, reasoning, long context, and image benchmarks and offers stronger performance than all previous Llama models" [283].

"We're excited to share a preview of Llama 4 Behemoth, a teacher model that demonstrates advanced intelligence among models in its class" [284]. "Llama 4 Behemoth is also a multimodal mixture-of-experts model, with 288B active parameters, 16 experts, and nearly two trillion total parameters" [285]. "Offering state-of-the-art performance for non-reasoning models on math, multilinguality, and image benchmarks, it was the perfect choice to teach the smaller Llama 4 models" [286]. "We codistilled the Llama 4 Maverick model from Llama 4 Behemoth as a teacher model, resulting in substantial quality improvements across end task evaluation metrics" [287]. "We developed a novel distillation loss function that dynamically weights the soft and hard targets through training" [288]. "Codistillation from Llama 4 Behemoth during pre-training amortizes the computational cost of resource-intensive forward passes needed to compute the targets for distillation for the majority of the training data used in student training" [289]. "For additional new data incorporated in student training, we ran forward passes on the Behemoth model to create distillation targets" [290].

"Post-training a model with two trillion parameters was a significant challenge too that required us to completely overhaul and revamp the recipe, starting from the scale of data" [291]. "In order to maximize performance, we had to prune 95% of the SFT data, as opposed to 50% for smaller models, to achieve the necessary focus on quality and efficiency" [292]. "We also found that doing lightweight SFT followed by large-scale reinforcement learning (RL) produced even more significant improvements in reasoning and coding abilities of the model" [293]. "Our RL recipe focused on sampling hard prompts by doing pass@k analysis with the policy model and crafting a training curriculum of increasing prompt hardness" [294]. "We also found that dynamically filtering out prompts with zero advantage during training and constructing training batches with mixed prompts from multiple capabilities were instrumental in providing a performance boost on math, reasoning, and coding" [295]. "Finally, sampling from a variety of system instructions was crucial in ensuring that the model retained its instruction following ability for reasoning and coding and was able to perform well across a variety of tasks" [296].

"Scaling RL for a two trillion parameter model also required revamping our underlying RL infrastructure due to its unprecedented scale" [297]. "We optimized the design of our MoE parallelization for speed, which enabled faster iteration" [298]. "We developed a fully asynchronous online RL training framework that enhanced flexibility" [299]. "Compared to the existing distributed training framework, which sacrifices the compute memory in order to stack all models in memory, our new infrastructure enabled flexible allocation of different models to separate GPUs, balancing resources across multiple models based on computational speed" [300]. "This innovation resulted in a ~10x improvement in training efficiency over previous generations" [301].

7. DeepSeek V3 and DeepSeek R1

7.1 DeepSeek V3 Capabilities

DeepSeek has developed significant training infrastructure, with references to "自研训练框架、自建智算集群和万卡算力" (self-developed training framework, self-built intelligent computing cluster, and 10,000-card computing power) [302]. The company has achieved rapid development, with "仅用半年时间便已发布并开源多个百亿级参数大模型" (releasing and open-sourcing multiple 10-billion-parameter large models in just half a year) [303].

7.2 DeepSeek R1 Reasoning Model

"We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1" [304]. "DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities" [305]. "Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors" [306]. "However, it encounters challenges such as poor readability, and language mixing" [307]. "To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL" [308]. "DeepSeekR1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks" [309].

"DeepSeek-R1 achieves a score of $7 9 . 8 %$ Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217" [310]. "On MATH-500, it attains an impressive score of $9 7 . 3 %$ , performing on par with OpenAI-o1-1217 and significantly outperforming other models" [311]. "On coding-related tasks, DeepSeek-R1 demonstrates expert level in code competition tasks, as it achieves 2,029 Elo rating on Codeforces outperforming $9 6 . 3 %$ human participants in the competition" [312].

"On benchmarks such as MMLU, MMLU-Pro, and GPQA Diamond, DeepSeekR1 achieves outstanding results, significantly outperforming DeepSeek-V3 with scores of $9 0 . 8 %$ on MMLU, $8 4 . 0 %$ on MMLU-Pro, and $7 1 . 5 %$ on GPQA Diamond" [313]. "While its performance is slightly below that of OpenAI-o1-1217 on these benchmarks, DeepSeek-R1 surpasses other closed-source models, demonstrating its competitive edge in educational tasks" [314].

7.3 Training Methodology

"We directly apply RL to the base model without relying on supervised fine-tuning (SFT) as a preliminary step" [315]. "This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero" [316]. "DeepSeekR1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community" [317]. "Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through ${ \\\\mathrm { R L } } ,$ without the need for SFT" [318].

"Unlike DeepSeek-R1-Zero, to prevent the early unstable cold start phase of RL training from the base model, for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor" [319]. "In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as the starting point for RL" [320].

"Group Relative Policy Optimization (GRPO) is introduced in DeepSeekMath" [321], and used in other DeepSeek works, e.g., DeepSeek-V3 and DeepSeek-R1. "GRPO can be viewed as PPO-inspired algorithm with a very similar surrogate loss, but it avoids learning a value function with another copy of the original policy language model" [322]. This brings two posited benefits: "Avoiding the challenge of learning a value function from a LM backbone" [323] and "Saves memory by not needing to keep another set of model weights in memory" [324].

"Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm to improve the reasoning capabilities of LLMs" [507]. "It was introduced in the DeepSeekMath paper in the context of mathematical reasoning" [508]. "GRPO modifies the traditional Proximal Policy Optimization (PPO) by eliminating the need for a value function model" [509]. "Instead, it estimates baselines from group scores, reducing memory usage and computational overhead" [510]. "GRPO, now also used by the Qwen team, can be used with rule/binary-based Rewards as well as General Reward Models to improve models on helpfulness" [511].

"Starting with DeepSeek V3, they applied GRPO to unsupervised reasoning text completions rule-based reward models that focused on aspects like format, mathematics, and coding" [512]. "Accuracy rewards: Evaluate whether the response is correct, correct result or compiled LeetCode problem" [513]. "Format rewards: Evaluate the format that enforces the model to put its thinking process between '' and '' tags" [514]. "This leads to a pass@1 score on AIME 2024 increasing from 15.6% to 71.0%, reaching performance levels comparable to OpenAI-o1-0912 alongside output token length per problem increasing, indicating the model naturally learns to solve tasks with more thinking time/token generation" [515]. "This has the drawback of leading to poor readability and language mixing but it was solved for R1 using a multi-stage approach with alternating SFT → RL steps" [516].

"To prevent the early unstable cold start phase of reinforcement training (RL) training from the base model, the team started with supervised fine-tuning" [517]. "Collected up to 10k token-long chain-of-thought (CoT) using the fine-tuned models, R1-zero and human annotator" [518]. "The data is used to fine-tune Deepseek V3 base to improve readbility and coherence" [519]. "Used the same RL pipeline as R1-Zero, focusing on reasoning-intensive tasks such as coding and math using the same Rule-Based Reward Models" [520]. "This time, an additional reward for 'language consistency' is used to help the model stick to the same language" [521].

"Generated large synthetic dataset using Reject Sampling (RS) focusing on writing, role-playing, and other general-purpose tasks" [522]. "The model from Stage 2 was used with Deepseek V3 as a Judge to generate 600k reasoning-related samples and 200k for writing, role-playing, and other general-purpose tasks using portions of the SFT dataset of DeepSeek-V3 or regenerating them with CoT included" [523]. "In the Final Stage, GRPO is used again with a combination of Rule-Based and Outcome Reward Models to improve the model's helpfulness and harmlessness" [524]. "Leading to the Deepseek R1 model" [525].

"DeepSeek didn't use Monte Carlo Tree Search (MCTS) or Process Reward Models (PRM)" [526]. "Fine-tuning before applying GRPO can actually make the training process faster and more stable" [527]. "Rule-based rewards focused on accuracy and format are more effective than complex rewards models" [528].

"In order to save the training costs of ${ \\\\scriptstyle \\\\mathrm { R L } } ,$ , we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically the same size as the policy model, and estimates the baseline from group scores instead" [325]. "Specifically, for each question $q ,$ , GRPO samples a group of outputs ${ o \\_ { 1 } , o \\_ { 2 } , \\\\cdots , o \\_ { G } }$ from the old policy $\\\\pi \\_ { \\\\theta \\_ { o l d } }$ and then optimizes the policy model $\\\\scriptstyle { \\\\pi \\_ { \\\\theta } }$ by maximizing the following objective" [326]. "The reward is the source of the training signal, which decides the optimization direction of RL" [327]. "To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards: Accuracy rewards: The accuracy reward model evaluates whether the response is correct" [328]. "For example, in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness" [329]. "Similarly, for LeetCode problems, a compiler can be used to generate feedback based on predefined test cases" [330]. "Format rewards: In addition to the accuracy reward model, we employ a format reward model that enforces the model to put its thinking process between '' and $' \\_ { < }$ /think>' tags" [331].

"We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline" [332]. "To train DeepSeek-R1-Zero, we begin by designing a straightforward template that guides the base model to adhere to our specified instructions" [333]. "This template requires DeepSeek-R1-Zero to first produce a reasoning process, followed by the final answer" [334]. "We intentionally limit our constraints to this structural format, avoiding any content-specific biases—such as mandating reflective reasoning or promoting particular problem-solving strategies—to ensure that we can accurately observe the model's natural progression during the RL process" [335].

"Figure 2 depicts the performance trajectory of DeepSeekR1-Zero on the AIME 2024 benchmark throughout the RL training process" [336]. "As illustrated, DeepSeek-R1-Zero demonstrates a steady and consistent enhancement in performance as the RL training advances" [337]. "Notably, the average pass $@ 1$ score on AIME 2024 shows a significant increase, jumping from an initial $1 5 . 6 %$ to an impressive $7 1 . 0 %$ , reaching performance levels comparable to OpenAI-o1-0912" [338]. "This significant improvement highlights the efficacy of our RL algorithm in optimizing the model's performance over time" [339]. "The findings reveal that RL empowers DeepSeek-R1-Zero to attain robust reasoning capabilities without the need for any supervised fine-tuning data" [340]. "This is a noteworthy achievement, as it underscores the model's ability to learn and generalize effectively through RL alone" [341]. "Additionally, the performance of DeepSeekR1-Zero can be further augmented through the application of majority voting" [342]. "For example, when majority voting is employed on the AIME benchmark, DeepSeek-R1-Zero's performance escalates from $7 1 . 0 %$ to $8 6 . 7 %$ , thereby exceeding the performance of OpenAI-o1-0912" [343]. "The ability of DeepSeek-R1-Zero to achieve such competitive performance, both with and without majority voting, highlights its strong foundational capabilities and its potential for further advancements in reasoning tasks" [344].

"The self-evolution process of DeepSeek-R1-Zero is a fascinating demonstration of how RL can drive a model to improve its reasoning capabilities autonomously" [345]. "By initiating RL directly from the base model, we can closely monitor the model's progression without the influence of the supervised fine-tuning stage" [346]. "This approach provides a clear view of how the model evolves over time, particularly in terms of its ability to handle complex reasoning tasks" [347]. "As depicted in Figure 3, the thinking time of DeepSeek-R1-Zero shows consistent improvement throughout the training process" [348]. "This improvement is not the result of external adjustments but rather an intrinsic development within the model" [349]. "DeepSeek-R1-Zero naturally acquires the ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation" [350]. "This computation ranges from generating hundreds to thousands of reasoning tokens, allowing the model to explore and refine its thought processes in greater depth" [351].

"One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases" [352]. "Behaviors such as reflection—where the model revisits and reevaluates its previous steps—and the exploration of alternative approaches to problem-solving arise spontaneously" [353]. "These behaviors are not explicitly programmed but instead emerge as a result of the model's interaction with the reinforcement learning environment" [354]. "This spontaneous development significantly enhances DeepSeek-R1-Zero's reasoning capabilities, enabling it to tackle more challenging tasks with greater efficiency and accuracy" [355]. "A particularly intriguing phenomenon observed during the training of DeepSeek-R1-Zero is the occurrence of an 'aha moment'" [356]. "This moment, as illustrated in Table 3, occurs in an intermediate version of the model" [357]. "During this phase, DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach" [358]. "This behavior is not only a testament to the model's growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes" [359]. "This moment is not only an 'aha moment' for the model but also for the researchers observing its behavior" [360]. "It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies" [361]. "The 'aha moment' serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future" [362].

"Although DeepSeek-R1-Zero exhibits strong reasoning capabilities and autonomously develops unexpected and powerful reasoning behaviors, it faces several issues" [363]. "For instance, DeepSeek-R1-Zero struggles with challenges like poor readability, and language mixing" [364]. "To make reasoning processes more readable and share them with the open community, we explore DeepSeek-R1, a method that utilizes RL with human-friendly cold-start data" [365]. "Inspired by the promising results of DeepSeek-R1-Zero, two natural questions arise: 1) Can reasoning performance be further improved or convergence accelerated by incorporating a small amount of high-quality data as a cold start? 2) How can we train a user-friendly model that not only produces clear and coherent Chains of Thought (CoT) but also demonstrates strong general capabilities?" [366]. "To address these questions, we design a pipeline to train DeepSeek-R1" [367]. "The pipeline consists of four stages, outlined as follows" [368].

"To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1- Zero outputs in a readable format, and refining the results through post-processing by human annotators" [369]. "Compared to DeepSeek-R1-Zero, the advantages of cold start data include: Readability: A key limitation of DeepSeek-R1-Zero is that its content is often not suitable for reading" [370]. "Responses may mix multiple languages or lack markdown formatting to highlight answers for users" [371]. "In contrast, when creating cold-start data for DeepSeek-R1, we design a readable pattern that includes a summary at the end of each response and filters out responses that are not reader-friendly" [372]. "Here, we define the output format as \\|special\\_token\\|\\|special\\_token\\|, where the reasoning process is the CoT for the query, and the summary is used to summarize the reasoning results" [373]. "Potential: By carefully designing the pattern for cold-start data with human priors, we observe better performance against DeepSeek-R1-Zero" [374]. "We believe the iterative training is a better way for reasoning models" [375].

"After fine-tuning DeepSeek-V3-Base on the cold start data, we apply the same large-scale reinforcement learning training process as employed in DeepSeek-R1-Zero" [376]. "This phase focuses on enhancing the model's reasoning capabilities, particularly in reasoning-intensive tasks such as coding, mathematics, science, and logic reasoning, which involve well-defined problems with clear solutions" [377]. "During the training process, we observe that CoT often exhibits language mixing, particularly when RL prompts involve multiple languages" [378]. "To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT" [379]. "Although ablation experiments show that such alignment results in a slight degradation in the model's performance, this reward aligns with human preferences, making it more readable" [380]. "Finally, we combine the accuracy of reasoning tasks and the reward for language consistency by directly summing them to form the final reward" [381]. "We then apply RL training on the fine-tuned model until it achieves convergence on reasoning tasks" [382].

"When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round" [383]. "Unlike the initial cold-start data, which primarily focuses on reasoning, this stage incorporates data from other domains to enhance the model's capabilities in writing, role-playing, and other general-purpose tasks" [384]. "Specifically, we generate the data and fine-tune the model as described below" [385]. "Reasoning data We curate reasoning prompts and generate reasoning trajectories by performing rejection sampling from the checkpoint from the above RL training" [386]. "In the previous stage, we only included data that could be evaluated using rule-based rewards" [387]. "However, in this stage, we expand the dataset by incorporating additional data, some of which use a generative reward model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment" [388]. "Additionally, because the model output is sometimes chaotic and difficult to read, we have filtered out chain-of-thought with mixed languages, long parapraphs, and code blocks" [389]. "For each prompt, we sample multiple responses and retain only the correct ones" [390]. "In total, we collect about 600k reasoning related training samples" [391]. "Non-Reasoning data For non-reasoning data, such as writing, factual QA, self-cognition, and translation, we adopt the DeepSeek-V3 pipeline and reuse portions of the SFT dataset of DeepSeek-V3" [392]. "For certain non-reasoning tasks, we call DeepSeek-V3 to generate a potential chain-of-thought before answering the question by prompting" [393]. "However, for simpler queries, such as 'hello' we do not provide a CoT in response" [394]. "In the end, we collected a total of approximately 200k training samples that are unrelated to reasoning" [395]. "We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about 800k samples" [396].

"To further align the model with human preferences, we implement a secondary reinforcement learning stage aimed at improving the model's helpfulness and harmlessness while simultaneously refining its reasoning capabilities" [397]. "Specifically, we train the model using a combination of reward signals and diverse prompt distributions" [398]. "For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains" [399]. "For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios" [400]. "We build upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and training prompts" [401]. "For helpfulness, we focus exclusively on the final summary, ensuring that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process" [402]. "For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process" [403]. "Ultimately, the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness and harmlessness" [404].

8. Training Methods and Technical Details

8.1 Data Collection and Preprocessing

"Like previous GPT models, the GPT‑4 base model was trained to predict the next word in a document, and was trained using publicly available data (such as internet data) as well as data we've licensed" [47]. The data "is a web-scale corpus of data including correct and incorrect solutions to math problems, weak and strong reasoning, self-contradictory and consistent statements, and representing a great variety of ideologies and ideas" [48].

8.2 Compute Infrastructure

"Over the past two years, we rebuilt our entire deep learning stack and, together with Azure, co-designed a supercomputer from the ground up for our workload" [49]. "A year ago, we trained GPT‑3.5 as a first 'test run' of the system" [50]. "We found and fixed some bugs and improved our theoretical foundations" [51]. "As a result, our GPT‑4 training run was (for us at least!) unprecedentedly stable, becoming our first large model whose training performance we were able to accurately predict ahead of time" [52].

8.3 Optimization and Scaling

"We developed infrastructure and optimization that have very predictable behavior across multiple scales" [53]. "To verify this scalability, we accurately predicted in advance GPT‑4's final loss on our internal codebase (not part of the training set) by extrapolating from models trained using the same methodology but using 10,000x less compute" [54].

8.4 Reinforcement Learning from Human Feedback (RLHF)

"To align it with the user's intent within guardrails, we fine-tune the model's behavior using reinforcement learning with human feedback ( RLHF⁠)" [55]. "Note that the model's capabilities seem to come primarily from the pre-training process—RLHF does not improve exam performance (without active effort, it actually degrades it)" [56]. "But steering of the model comes from the post-training process—the base model requires prompt engineering to even know that it should answer the questions" [57].

"Reinforcement learning from Human Feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems" [405]. "The basic pipeline for RLHF involves three steps. First, a language model that can follow user questions must be trained" [406]. "Second, human preference data must be collected for the training of a reward model of human preferences" [407]. "Finally, the language model can be optimized with an RL optimizer of choice, by sampling generations and rating them with respect to the reward model" [408]. "RLHF has been applied to many domains successfully, with complexity increasing as the techniques have matured" [409]. "In modern language model training, RLHF is one component of post-training" [410]. "Post-training is a more complete set of techniques and best-practices to make language models more useful for downstream tasks" [411]. "Post-training can be summarized as using three optimization methods: Instruction / Supervised Finetuning (IFT/SFT), Preference Finetuning (PreFT), and Reinforcement Finetuning (RFT)" [412]. "Instruction / Supervised Finetuning (IFT/SFT), where we teach formatting and form the base of instruction following abilities" [413]. "Preference Finetuning (PreFT), where we align to human preferences (and get smaller bump in capabilities at the same time)" [414]. "Reinforcement Finetuning (RFT). The newest type of post-training that boosts performance on verifiable domains" [415]. "This book focuses on the second area, preference finetuning, which has more complexity than instruction tuning and is far more established than Reinforcement Finetuning" [416]. "RLHF colloquially is what led to modern post-training" [417]. "The core role of this book, beyond teaching the techniques for doing RLHF, is to distill intuition as to why RLHF is crucial to modern AI models" [418]. "Modern research has established RLHF as a general method to integrate subtle stylistic and related behavioral features into the models" [419]. "Compared to other techniques for post-training, such as instruction finetuning, RLHF generalizes far better across domains" [420]. "Instruction finetuning is training the model to predict the next certain token when the text preceding is close to examples it has seen" [421]. "RLHF on the other hand tunes the responses on the response level rather than looking at the next token specifically" [422]. "RLHF also shows a model which type of response it should avoid, i.e. negative feedback" [423]. "The training to achieve this is often called a contrastive loss function and is referenced throughout this book" [424]. "While this flexibility is a major advantage of RLHF, it comes with implementation challenges" [425]. "Largely, these center on how to control the optimization" [426]. "Implementing RLHF often requires training a reward model, of which best practices are not strongly established and depend on the area of application" [427]. "The optimization itself is prone to over-optimization because our reward signal is at best a proxy objective, requiring regularization" [428]. "Effective RLHF requires a strong starting point, so RLHF cannot be a solution to every problem alone and needs to be approached in a broader lens of post-training" [429]. "Due to this complexity, implementing RLHF is far more costly than simple instruction finetuning and can come with unexpected challenges such as length bias" [430]. "For projects where performance matters, RLHF is established as being crucial to achieving a strong finetuned model, but it is more expensive in compute, data costs, and time" [431]. "The intuition I've been using to understand the potential of post-training is called the elicitation interpretation of post-training, where all we are doing is extracting and amplifying valuable behaviors in the base model" [432]. "The best post-training teams extract a ton of performance in a very short time frame" [433]. "The set of techniques is everything after the end of most of pretraining" [434]. "This theory folds in with the reality that the majority of gains users are seeing are from post-training because it implies that there is more latent potential in a model pretraining on the internet than we can teach the model simply" [435].

8.5 Architecture Modifications

"Our new Llama 4 models are our first models that use a mixture of experts (MoE) architecture" [255]. "In MoE models, a single token activates only a fraction of the total parameters" [256]. "MoE architectures are more compute efficient for training and inference and, given a fixed training FLOPs budget, delivers higher quality compared to a dense model" [257].

9. Evaluation and Benchmarking

9.1 Academic Benchmarks

"GPT‑4 considerably outperforms existing large language models, alongside most state-of-the-art (SOTA) models which may include benchmark-specific crafting or additional training protocols" [58]. GPT-4 demonstrated strong performance, with "GPT-4 achieves 86.4% on MMLU (5-shot)" [59], "GPT-4 achieves 95.3% on HellaSwag (10-shot)" [60], "GPT-4 achieves 96.3% on ARC (25-shot)" [61], and "GPT-4 achieves 67.0% on HumanEval (0-shot)" [62].

9.2 Competition-Level Evaluations

"On the 2024 AIME exams, GPT‑4o only solved on average 12% (1.8/15) of problems" [63]. "o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function" [64]. "Claude 3.5 Sonnet sets new industry benchmarks for graduate-level reasoning (GPQA), undergraduate-level knowledge (MMLU), and coding proficiency (HumanEval)" [181]. "Gemini 2.5 Pro Thinking" achieves "88.0%" on "AIME 2025" [232].

9.3 Internal Evaluation Benchmarks

Companies maintain "internal eval benchmarks" [558] that are not publicly disclosed. "These benchmarks are used to assess model capabilities before public release, though specific details remain proprietary" [559].

10. Safety, Alignment, and Red-Teaming

10.1 Safety Measures

"Our models are subjected to rigorous testing and have been trained to reduce misuse" [182]. "We engaged over 50 experts from domains such as AI alignment risks, cybersecurity, biorisk, trust and safety, and international security to adversarially test the model" [65]. "Their findings specifically enabled us to test model behavior in high-risk areas which require expertise to evaluate" [66].

10.2 Red-Teaming

"We've engaged with external experts to test and refine the safety mechanisms within this latest model" [183]. "We recently provided Claude 3.5 Sonnet to the UK's Artificial Intelligence Safety Institute (UK AISI) for pre-deployment safety evaluation" [184]. "The UK AISI completed tests of 3.5 Sonnet and shared their results with the US AI Safety Institute (US AISI) as part of a Memorandum of Understanding, made possible by the partnership between the US and UK AISIs announced earlier this year" [185].

10.3 Alignment Techniques

"Claude: Employs Constitutional AI (CAI) with RLAIF, using AI-generated feedback based on explicit principles to guide behavior towards being" [529]. Various alignment techniques are employed, including "Constitutional AI, Deliberative Alignment, RLAIF" [560]. "Specific implementation details remain proprietary, though research indicates these techniques are critical for ensuring model safety and alignment with human values" [561].

11. Challenges in Researching Frontier AI Models

11.1 Information Opacity

"Key technical details are intentionally undisclosed or obscured" [562], making comprehensive research extremely difficult. "These details are hidden due to competitive, economic, and geopolitical pressure, while public statements are often vague marketing layers rather than technical truth" [563].

11.2 Rapid Evolution

The field "evolves so fast that even partial information becomes outdated within weeks" [564]. "This rapid evolution means that research findings may become obsolete quickly, requiring constant updates and verification" [565].

11.3 Verification Challenges

"Independent verification is nearly impossible—training runs cost millions and rely on restricted hardware" [566]. "This makes it difficult to verify claims about model capabilities, training methods, and performance" [567].

11.4 Speculation and Misinformation

"Widespread speculation, leaks, and misinformation drown out credible analysis" [568]. "This creates challenges in distinguishing accurate information from speculation or misinformation" [569].

12. Future Directions and Open Questions

12.1 Scaling Laws

Research continues into scaling laws and their implications for future model development. "The relationship between compute, model size, dataset size, and performance" [570] remains an active area of research, though specific findings are often proprietary.

12.2 Synthetic Data and Distillation

"The use of larger models to generate data for training smaller models" [571] represents an emerging area of research. "This approach could potentially reduce training costs while maintaining performance" [572].

12.3 Architecture Innovations

Ongoing research into "attention mechanisms" [573], including "FlashAttention, Slim Attention, Scalable Softmax, Hyena, Mamba, RWKV, RetNet, and Differential Transformers" [574], continues to drive improvements in model efficiency and performance.

12.4 Parallelism Strategies

Research into "data parallelism, model parallelism, tensor parallelism, pipeline parallelism, and hybrid parallelism" [575] continues to optimize training efficiency for large-scale models.

13. Conclusion

"Researching frontier AI model capabilities and training methods presents significant challenges due to intentional opacity, rapid evolution, and verification difficulties" [576]. "While some information is available through official announcements, research papers, and technical reports, many critical details remain proprietary" [577]. "The field continues to evolve rapidly, with new models and techniques emerging regularly" [578]. "Understanding these models requires careful analysis of available information while acknowledging the limitations imposed by competitive secrecy and proprietary development practices" [579].

References

[1]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"We are introducing GPT‑5, our best AI system yet

[2]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"GPT‑5 is a significant leap in intelligence over all our previous models, featuring state-of-the-art performance across coding, math, writing, health, visual perception, and more

[3]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"It is a unified system that knows when to respond quickly and when to think longer to provide expert-level responses

[4]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"GPT‑5 is available to all users, with Plus subscribers getting more usage, and Pro subscribers getting access to GPT‑5 pro, a version with extended reasoning for even more comprehensive and accurate answers

[5]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"GPT‑5 not only outperforms previous models on benchmarks and answers questions more quickly, but—most importantly—is more useful for real-world queries

[6]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"We've made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, while leveling up GPT‑5's performance in three of ChatGPT's most common uses: writing, coding, and health

[7]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"GPT‑5 is our strongest coding model to date

[8]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"It shows particular improvements in complex front‑end generation and debugging larger repositories

[9]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"It can often create beautiful and responsive websites, apps, and games with an eye for aesthetic sensibility in just one prompt, intuitively and tastefully turning ideas into reality

[10]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"Early testers also noted its design choices, with a much better understanding of things like spacing, typography, and white space

[11]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"GPT‑5 is our most capable writing collaborator yet, able to help you steer and translate rough ideas into compelling, resonant writing with literary depth and rhythm

[12]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"It more reliably handles writing that involves structural ambiguity, such as sustaining unrhymed iambic pentameter or free verse that flows naturally, combining respect for form with expressive clarity

[13]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"GPT‑5 is our best model yet for health-related questions, empowering users to be informed about and advocate for their health

[14]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"The model scores significantly higher than any previous model on HealthBench ⁠, an evaluation we published earlier this year based on realistic scenarios and physician-defined criteria

[15]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"Compared to previous models, it acts more like an active thought partner, proactively flagging potential concerns and asking questions to give more helpful answers

[16]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"GPT‑5 is much smarter across the board, as reflected by its performance on academic and human-evaluated benchmarks, particularly in math, coding, visual perception, and health

[17]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"It sets a new state of the art across math (94.6% on AIME 2025 without tools), real-world coding (74.9% on SWE-bench Verified, 88% on Aider Polyglot), multimodal understanding (84.2% on MMMU), and health (46.2% on HealthBench Hard)—and those gains show up in everyday use

[18]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"With GPT‑5 pro's extended reasoning, the model also sets a new SOTA on GPQA, scoring 88.4% without tools

[19]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"GPT‑5 was trained on Microsoft Azure AI supercomputers

[20]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"GPT‑5 advances the frontier on safety

[21]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"In the past, ChatGPT relied primarily on refusal-based safety training: based on the user's prompt, the model should either comply or refuse

[22]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"While this type of training works well for explicitly malicious prompts, it can struggle to handle situations where the user's intent is unclear, or information could be used in benign or malicious ways

[23]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"For GPT‑5, we introduced a new form of safety-training — safe completions — which teaches the model to give the most helpful answer where possible while still staying within safety boundaries

[24]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"GPT‑5 is significantly less likely to hallucinate than our previous models

[25]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"With web search enabled on anonymized prompts representative of ChatGPT production traffic, GPT‑5's responses are ~45% less likely to contain a factual error than GPT‑4o, and when thinking, GPT‑5's responses are ~80% less likely to contain a factual error than OpenAI o3

[26]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"GPT‑5 (with thinking) more honestly communicates its actions and capabilities to the user—especially for tasks which are impossible, underspecified, or missing key tools

[27]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"In order to achieve a high reward during training, reasoning models may learn to lie about successfully completing a task or be overly confident about an uncertain answer

[28]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"We evaluated deception rates on settings involving impossible coding tasks and missing multimodal assets, and found that GPT‑5 (with thinking) is less deceptive than o3 across the board

[29]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"On a large set of conversations representative of real production ChatGPT traffic, we've reduced rates of deception from 4.8% for o3 to 2.1% of GPT‑5 reasoning responses

[30]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

[31]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"On the 2024 AIME exams, GPT‑4o only solved on average 12% (1.8/15) of problems

[32]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function

[33]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"A score of 13.9 places it among the top 500 students nationally and above the cutoff for the USA Mathematical Olympiad

[34]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"We also evaluated o1 on GPQA diamond, a difficult intelligence benchmark which tests for expertise in chemistry, physics and biology

[35]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"In order to compare models to humans, we recruited experts with PhDs to answer GPQA-diamond questions

[36]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"We found that o1 surpassed the performance of those human experts, becoming the first model to do so on this benchmark

[37]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"These results do not imply that o1 is more capable than a PhD in all respects — only that the model is more proficient in solving some problems that a PhD would be expected to solve

[38]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process

[39]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute)

[40]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them

[41]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"Similar to how a human may think for a long time before responding to a difficult question, o1 uses a chain of thought when attempting to solve a problem

[42]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses

[43]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"It learns to recognize and correct its mistakes

[44]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"It learns to break down tricky steps into simpler ones

[45]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"It learns to try a different approach when the current one isn't working

[46]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"This process dramatically improves the model's ability to reason

[47]

OpenAI. "GPT-4." OpenAI Research, March 14, 2023. https://openai.com/research/gpt-4

[48]

OpenAI. "GPT-4." OpenAI Research, March 14, 2023. https://openai.com/research/gpt-4

"is a web-scale corpus of data including correct and incorrect solutions to math problems, weak and strong reasoning, self-contradictory and consistent statements, and representing a great variety of ideologies and ideas

[49]

OpenAI. "GPT-4." OpenAI Research, March 14, 2023. https://openai.com/research/gpt-4

"Over the past two years, we rebuilt our entire deep learning stack and, together with Azure, co-designed a supercomputer from the ground up for our workload

[50]

OpenAI. "GPT-4." OpenAI Research, March 14, 2023. https://openai.com/research/gpt-4

"A year ago, we trained GPT‑3.5 as a first 'test run' of the system

[51]

OpenAI. "GPT-4." OpenAI Research, March 14, 2023. https://openai.com/research/gpt-4

"We found and fixed some bugs and improved our theoretical foundations

[52]

OpenAI. "GPT-4." OpenAI Research, March 14, 2023. https://openai.com/research/gpt-4

"As a result, our GPT‑4 training run was (for us at least!) unprecedentedly stable, becoming our first large model whose training performance we were able to accurately predict ahead of time

[53]

OpenAI. "GPT-4." OpenAI Research, March 14, 2023. https://openai.com/research/gpt-4

"We developed infrastructure and optimization that have very predictable behavior across multiple scales

[54]

OpenAI. "GPT-4." OpenAI Research, March 14, 2023. https://openai.com/research/gpt-4

"To verify this scalability, we accurately predicted in advance GPT‑4's final loss on our internal codebase (not part of the training set) by extrapolating from models trained using the same methodology but using 10,000x less compute

[55]

OpenAI. "GPT-4." OpenAI Research, March 14, 2023. https://openai.com/research/gpt-4

"To align it with the user's intent within guardrails, we fine-tune the model's behavior using reinforcement learning with human feedback ( RLHF⁠)

[56]

OpenAI. "GPT-4." OpenAI Research, March 14, 2023. https://openai.com/research/gpt-4

"Note that the model's capabilities seem to come primarily from the pre-training process—RLHF does not improve exam performance (without active effort, it actually degrades it)

[57]

OpenAI. "GPT-4." OpenAI Research, March 14, 2023. https://openai.com/research/gpt-4

"But steering of the model comes from the post-training process—the base model requires prompt engineering to even know that it should answer the questions

[58]

OpenAI. "GPT-4." OpenAI Research, March 14, 2023. https://openai.com/research/gpt-4

"GPT‑4 considerably outperforms existing large language models, alongside most state-of-the-art (SOTA) models which may include benchmark-specific crafting or additional training protocols

[59]

OpenAI. "GPT-4." OpenAI Research, March 14, 2023. https://openai.com/research/gpt-4

"GPT-4 achieves 86.4% on MMLU (5-shot)

[60]

OpenAI. "GPT-4." OpenAI Research, March 14, 2023. https://openai.com/research/gpt-4

"GPT-4 achieves 95.3% on HellaSwag (10-shot)

[61]

OpenAI. "GPT-4." OpenAI Research, March 14, 2023. https://openai.com/research/gpt-4

"GPT-4 achieves 96.3% on ARC (25-shot)

[62]

OpenAI. "GPT-4." OpenAI Research, March 14, 2023. https://openai.com/research/gpt-4

"GPT-4 achieves 67.0% on HumanEval (0-shot)

[63]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"On the 2024 AIME exams, GPT‑4o only solved on average 12% (1.8/15) of problems

[64]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function

[65]

OpenAI. "GPT-4." OpenAI Research, March 14, 2023. https://openai.com/research/gpt-4

"We engaged over 50 experts from domains such as AI alignment risks, cybersecurity, biorisk, trust and safety, and international security to adversarially test the model

[66]

OpenAI. "GPT-4." OpenAI Research, March 14, 2023. https://openai.com/research/gpt-4

"Their findings specifically enabled us to test model behavior in high-risk areas which require expertise to evaluate

[67]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"On the 2024 AIME exams, GPT‑4o only solved on average 12% (1.8/15) of problems

[68]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"o1 averaged 74% (11.1/15) with a single sample per problem, 83% (12.5/15) with consensus among 64 samples, and 93% (13.9/15) when re-ranking 1000 samples with a learned scoring function

[69]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

[70]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time

[71]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"Once usage limits are reached, a mini version of each model handles remaining queries

[72]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"In the near future, we plan to integrate these capabilities into a single model

[73]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"GPT‑5 gets more value out of less thinking time

[74]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"In our evaluations, GPT‑5 (with thinking) performs better than OpenAI o3 with 50-80% less output tokens across capabilities, including visual reasoning, agentic coding, and graduate-level scientific problem solving

[75]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"Overall, GPT‑5 is less effusively agreeable, uses fewer unnecessary emojis, and is more subtle and thoughtful in follow‑ups compared to GPT‑4o

[76]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"It should feel less like 'talking to AI' and more like chatting with a helpful friend with PhD‑level intelligence

[77]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"Earlier this year, we released an update to GPT‑4o⁠ that unintentionally made the model overly sycophantic, or excessively flattering or agreeable

[78]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"We quickly rolled back the change⁠ and have since worked to understand and reduce this behavior by: Developing new evaluations to measure sycophancy levels, Improving our training so the model is less sycophantic—for instance, adding examples that would normally lead to over-agreement, and then teaching it not to do that

[79]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"In targeted sycophancy evaluations using prompts specifically designed to elicit sycophantic responses, GPT‑5 meaningfully reduced sycophantic replies (from 14.5% to less than 6%)

[80]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"At times, reducing sycophancy can come with reductions in user satisfaction, but the improvements we made cut sycophancy by more than half while also delivering other measurable gains, so users continue to have high-quality, constructive conversations—in line with our goal to help people use ChatGPT well⁠

[81]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"GPT‑5 is significantly better at instruction following, and we see a corresponding improvement in its ability to follow custom instructions

[82]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"We're also launching a research preview of four new preset personalities for all ChatGPT users, made possible by the improvements on steerability

[83]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"These personalities, available initially for text chat and coming later to Voice, let you set how ChatGPT interacts—whether concise and professional, thoughtful and supportive, or a bit sarcastic—without writing custom prompts

[84]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"The four initial options, Cynic, Robot, Listener, and Nerd, are opt-in, adjustable anytime in settings, and designed to match your communication style

[85]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"All of these new personalities meet or exceed our bar on internal evals for reducing sycophancy

[86]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"We look forward to learning and iterating based on early feedback

[87]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"We decided to treat the 'GPT‑5 thinking' model as High capability in the Biological and Chemical domain, and have implemented strong safeguards to sufficiently minimize the associated risks

[88]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"We rigorously tested the model with our safety evaluations under our Preparedness Framework⁠⁠, completing 5,000 hours of red-teaming with partners like the CAISI and UK AISI

[89]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"Similar to our approach for ChatGPT Agent, while we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm–our defined threshold⁠(opens in a new window) for High capability–we are taking a precautionary approach and are activating the required safeguards now in order to increase readiness for when such capabilities are available

[90]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"As a result, 'GPT‑5 thinking' has a robust safety stack with a multilayered defense system for biology: comprehensive threat modeling, training the model to not output harmful content through our new safe completions paradigm, always-on classifiers and reasoning monitors, and clear enforcement pipelines

[91]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"Read more about our robust safety approach for GPT‑5 in our system card

[92]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

[93]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"GPT‑5 pro achieves the highest performance in the GPT‑5 family on several challenging intelligence benchmarks, including state-of-the-art performance on GPQA, which contains extremely difficult science questions

[94]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"In evaluations on over 1000 economically valuable, real-world reasoning prompts, external experts preferred GPT‑5 pro over 'GPT‑5 thinking' 67.8% of the time

[95]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"GPT‑5 pro made 22% fewer major errors and excelled in health, science, mathematics, and coding

[96]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"Experts rated its responses as relevant, useful, and comprehensive

[97]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"GPT‑5 is the new default in ChatGPT, replacing GPT‑4o, OpenAI o3, OpenAI o4-mini, GPT‑4.1, and GPT‑4.5 for signed-in users

[98]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"Just open ChatGPT and type your question; GPT‑5 handles the rest , applying reasoning automatically when the response would benefit from it

[99]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"Paid users can still select 'GPT‑5 Thinking' from the model picker, or type something like 'think hard about this' in the prompt to ensure reasoning is used when generating a response

[100]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"GPT‑5 is starting to roll out today to all Plus, Pro, Team, and Free users, with access for Enterprise and Edu coming next week

[101]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"Pro, Plus, and Team users can also start coding with GPT‑5 in the Codex CLI⁠(opens in a new window) by signing in with ChatGPT

[102]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"As with GPT‑4o, the difference between free and paid access to GPT‑5 is usage volume

[103]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"Pro subscribers get unlimited access to GPT‑5, and access to GPT‑5 Pro

[104]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"Plus users can use it comfortably as their default model for everyday questions, with significantly higher usage than free users

[105]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"Team, Enterprise, and Edu customers can also use GPT‑5 comfortably as their default model for everyday work, with generous limits that make it easy for entire organizations to rely on GPT‑5

[106]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"For ChatGPT free-tier users, full reasoning capabilities may take a few days to fully roll out

[107]

OpenAI. "Introducing GPT-5." OpenAI, August 7, 2025. https://openai.com/index/introducing-gpt-5/

"Once free users reach their GPT‑5 usage limits, they will transition to GPT‑5 mini, a smaller, faster, and highly capable model

[108]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"o1 performance smoothly improves with both train-time and test-time compute

[109]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"To highlight the reasoning improvement over GPT‑4o, we tested our models on a diverse set of human exams and ML benchmarks

[110]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"We show that o1 significantly outperforms GPT‑4o on the vast majority of these reasoning-heavy tasks

[111]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"Unless otherwise specified, we evaluated o1 on the maximal test-time compute setting

[112]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"o1 greatly improves over GPT-4o on challenging reasoning benchmarks

[113]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"Solid bars show pass@1 accuracy and the shaded region shows the performance of majority vote (consensus) with 64 samples

[114]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"In many reasoning-heavy benchmarks, o1 rivals the performance of human experts

[115]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"Recent frontier models do so well on MATH and GSM8K that these benchmarks are no longer effective at differentiating models

[116]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"We evaluated math performance on AIME, an exam designed to challenge the brightest high school math students in America

[117]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"On several other ML benchmarks, o1 improved over the state-of-the-art

[118]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"With its vision perception capabilities enabled, o1 scored 78.2% on MMMU, making it the first model to be competitive with human experts

[119]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"It also outperformed GPT‑4o on 54 out of 57 MMLU subcategories

[120]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"In addition to exams and academic benchmarks, we also evaluated human preference of o1‑preview vs GPT‑4o on challenging, open-ended prompts in a broad spectrum of domains

[121]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"In this evaluation, human trainers were shown anonymized responses to a prompt from o1‑preview and GPT‑4o, and voted for which response they preferred

[122]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"o1‑preview is preferred to gpt-4o by a large margin in reasoning-heavy categories like data analysis, coding, and math

[123]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"However, o1‑preview is not preferred on some natural language tasks, suggesting that it is not well-suited for all use cases

[124]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"Chain of thought reasoning provides new opportunities for alignment and safety

[125]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles

[126]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"By teaching the model our safety rules and how to reason about them in context, we found evidence of reasoning capability directly benefiting model robustness: o1‑preview achieved substantially improved performance on key jailbreak evaluations and our hardest internal benchmarks for evaluating our model's safety refusal boundaries

[127]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"We believe that using a chain of thought offers significant advances for safety and alignment because (1) it enables us to observe the model thinking in a legible way, and (2) the model reasoning about safety rules is more robust to out-of-distribution scenarios

[128]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"To stress-test our improvements, we conducted a suite of safety tests and red-teaming before deployment, in accordance with our Preparedness Framework⁠(opens in a new window)

[129]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"We found that chain of thought reasoning contributed to capability improvements across our evaluations

[130]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"Of particular note, we observed interesting instances of reward hacking⁠(opens in a new window)

[131]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"Detailed results from these evaluations can be found in the accompanying System Card

[132]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"We believe that a hidden chain of thought presents a unique opportunity for monitoring models

[133]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"Assuming it is faithful and legible, the hidden chain of thought allows us to 'read the mind' of the model and understand its thought process

[134]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user

[135]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought

[136]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"We also do not want to make an unaligned chain of thought directly visible to users

[137]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users

[138]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"We acknowledge this decision has disadvantages

[139]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer

[140]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"For the o1 model series we show a model-generated summary of the chain of thought

[141]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

[142]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"This model competed in the 2024 IOI under the same conditions as the human contestants

[143]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"It had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem

[144]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"For each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy

[145]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function

[146]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"If we had instead submitted at random, we would have only scored 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints

[147]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"With a relaxed submission constraint, we found that model performance improved significantly

[148]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"When allowed 10,000 submissions per problem, the model achieved a score of 362.14 – above the gold medal threshold – even without any test-time selection strategy

[149]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"Finally, we simulated competitive programming contests hosted by Codeforces to demonstrate this model's coding skill

[150]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"Our evaluations closely matched competition rules and allowed for 10 submissions

[151]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"GPT‑4o achieved an Elo rating of 808, which is in the 11th percentile of human competitors

[152]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"This model far exceeded both GPT‑4o and o1—it achieved an Elo rating of 1807, performing better than 93% of competitors

[153]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"Further fine-tuning on programming competitions improves o1

[154]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"The improved model ranked in the 49th percentile in the 2024 International Olympiad in Informatics under competition rules

[155]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"o1 significantly advances the state-of-the-art in AI reasoning

[156]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"We plan to release improved versions of this model as we continue iterating

[157]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"We expect these new reasoning capabilities will improve our ability to align models to human values and principles

[158]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"We believe o1 – and its successors – will unlock many new use cases for AI in science, coding, math, and related fields

[159]

OpenAI. "Learning to reason with LLMs." OpenAI, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/

"We are excited for users and API developers to discover how it can improve their daily work

[160]

Anthropic. "Claude 3.5 Sonnet." Anthropic News, June 20, 2024. https://www.anthropic.com/news/claude-3-5-sonnet

[161]

Anthropic. "Claude 3.5 Sonnet." Anthropic News, June 20, 2024. https://www.anthropic.com/news/claude-3-5-sonnet

"Claude 3.5 Sonnet sets new industry benchmarks for graduate-level reasoning (GPQA), undergraduate-level knowledge (MMLU), and coding proficiency (HumanEval)

[162]

Anthropic. "Claude 3.5 Sonnet." Anthropic News, June 20, 2024. https://www.anthropic.com/news/claude-3-5-sonnet

"marked improvement in grasping nuance, humor, and complex instructions, and is exceptional at writing high-quality content with a natural, relatable tone

[163]

Anthropic. "Claude 3.5 Sonnet." Anthropic News, June 20, 2024. https://www.anthropic.com/news/claude-3-5-sonnet

"Claude 3.5 Sonnet operates at twice the speed of Claude 3 Opus

[164]

Anthropic. "Claude 3.5 Sonnet." Anthropic News, June 20, 2024. https://www.anthropic.com/news/claude-3-5-sonnet

"ideal for complex tasks such as context-sensitive customer support and orchestrating multi-step workflows

[165]

Anthropic. "Claude 3.5 Sonnet." Anthropic News, June 20, 2024. https://www.anthropic.com/news/claude-3-5-sonnet

"In an internal agentic coding evaluation, Claude 3.5 Sonnet solved 64% of problems, outperforming Claude 3 Opus which solved 38%

[166]

Anthropic. "Claude 3.5 Sonnet." Anthropic News, June 20, 2024. https://www.anthropic.com/news/claude-3-5-sonnet

"Our evaluation tests the model's ability to fix a bug or add functionality to an open source codebase, given a natural language description of the desired improvement

[167]

Anthropic. "Claude 3.5 Sonnet." Anthropic News, June 20, 2024. https://www.anthropic.com/news/claude-3-5-sonnet

"When instructed and provided with the relevant tools, Claude 3.5 Sonnet can independently write, edit, and execute code with sophisticated reasoning and troubleshooting capabilities

[168]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Today, we're announcing Claude 3.7 Sonnet1, our most intelligent model to date and the first hybrid reasoning model on the market

[169]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Claude 3.7 Sonnet can produce near-instant responses or extended, step-by-step thinking that is made visible to the user

[170]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"API users also have fine-grained control over _how long_ the model can think for

[171]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"We've developed Claude 3.7 Sonnet with a different philosophy from other reasoning models on the market

[172]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Just as humans use a single brain for both quick responses and deep reflection, we believe reasoning should be an integrated capability of frontier models rather than a separate model entirely

[173]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"This unified approach also creates a more seamless experience for users

[174]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Claude 3.7 Sonnet is both an ordinary LLM and a reasoning model in one: you can pick when you want the model to answer normally and when you want it to think longer before answering

[175]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"In the standard mode, Claude 3.7 Sonnet represents an upgraded version of Claude 3.5 Sonnet

[176]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"In extended thinking mode, it self-reflects before answering, which improves its performance on math, physics, instruction-following, coding, and many other tasks

[177]

Anthropic. "Claude 3.5 Sonnet." Anthropic News, June 20, 2024. https://www.anthropic.com/news/claude-3-5-sonnet

"Despite Claude 3.5 Sonnet's leap in intelligence, our red teaming assessments have concluded that Claude 3.5 Sonnet remains at ASL-2

[178]

Anthropic. "Claude 3.5 Sonnet." Anthropic News, June 20, 2024. https://www.anthropic.com/news/claude-3-5-sonnet

"We've engaged with external experts to test and refine the safety mechanisms within this latest model

[179]

Anthropic. "Claude 3.5 Sonnet." Anthropic News, June 20, 2024. https://www.anthropic.com/news/claude-3-5-sonnet

"Claude 3.5 Sonnet to the UK's Artificial Intelligence Safety Institute (UK AISI) for pre-deployment safety evaluation

[180]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Claude 3.7 Sonnet also makes more nuanced distinctions between harmful and benign requests, reducing unnecessary refusals by 45% compared to its predecessor

[181]

Anthropic. "Claude 3.5 Sonnet." Anthropic News, June 20, 2024. https://www.anthropic.com/news/claude-3-5-sonnet

"Claude 3.5 Sonnet sets new industry benchmarks for graduate-level reasoning (GPQA), undergraduate-level knowledge (MMLU), and coding proficiency (HumanEval)

[182]

Anthropic. "Claude 3.5 Sonnet." Anthropic News, June 20, 2024. https://www.anthropic.com/news/claude-3-5-sonnet

"Our models are subjected to rigorous testing and have been trained to reduce misuse

[183]

Anthropic. "Claude 3.5 Sonnet." Anthropic News, June 20, 2024. https://www.anthropic.com/news/claude-3-5-sonnet

"We've engaged with external experts to test and refine the safety mechanisms within this latest model

[184]

Anthropic. "Claude 3.5 Sonnet." Anthropic News, June 20, 2024. https://www.anthropic.com/news/claude-3-5-sonnet

"We recently provided Claude 3.5 Sonnet to the UK's Artificial Intelligence Safety Institute (UK AISI) for pre-deployment safety evaluation

[185]

Anthropic. "Claude 3.5 Sonnet." Anthropic News, June 20, 2024. https://www.anthropic.com/news/claude-3-5-sonnet

"The UK AISI completed tests of 3.5 Sonnet and shared their results with the US AI Safety Institute (US AISI) as part of a Memorandum of Understanding, made possible by the partnership between the US and UK AISIs announced earlier this year

[186]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Claude 3.7 Sonnet shows particularly strong improvements in coding and front-end web development

[187]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Along with the model, we're also introducing a command line tool for agentic coding, Claude Code

[188]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Claude Code is available as a limited research preview, and enables developers to delegate substantial engineering tasks to Claude directly from their terminal

[189]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Claude 3.7 Sonnet is now available on all Claude plans—including Free, Pro, Team, and Enterprise—as well as the Claude Developer Platform, Amazon Bedrock, and Google Cloud's Vertex AI

[190]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Extended thinking mode is available on all surfaces except the free Claude tier

[191]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"In both standard and extended thinking modes, Claude 3.7 Sonnet has the same price as its predecessors: $3 per million input tokens and $15 per million output tokens—which includes thinking tokens

[192]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Claude 3.7 Sonnet: Frontier reasoning made practical

[193]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"In developing our reasoning models, we've optimized somewhat less for math and computer science competition problems, and instead shifted focus towards real-world tasks that better reflect how businesses actually use LLMs

[194]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Early testing demonstrated Claude's leadership in coding capabilities across the board: Cursor noted Claude is once again best-in-class for real-world coding tasks, with significant improvements in areas ranging from handling complex codebases to advanced tool use

[195]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Cognition found it far better than any other model at planning code changes and handling full-stack updates

[196]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Vercel highlighted Claude's exceptional precision for complex agent workflows, while Replit has successfully deployed Claude to build sophisticated web apps and dashboards from scratch, where other models stall

[197]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"In Canva's evaluations, Claude consistently produced production-ready code with superior design taste and drastically reduced errors

[198]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Claude 3.7 Sonnet achieves state-of-the-art performance on SWE-bench Verified, which evaluates AI models' ability to solve real-world software issues

[199]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Claude 3.7 Sonnet achieves state-of-the-art performance on TAU-bench, a framework that tests AI agents on complex real-world tasks with user and tool interactions

[200]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Claude 3.7 Sonnet excels across instruction-following, general reasoning, multimodal capabilities, and agentic coding, with extended thinking providing a notable boost in math and science

[201]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Beyond traditional benchmarks, it even outperformed all previous models in our Pokémon gameplay tests

[202]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Since June 2024, Sonnet has been the preferred model for developers worldwide

[203]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Today, we're empowering developers further by introducing Claude Code—our first agentic coding tool—in a limited research preview

[204]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Claude Code is an active collaborator that can search and read code, edit files, write and run tests, commit and push code to GitHub, and use command line tools—keeping you in the loop at every step

[205]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Claude Code is an early product but has already become indispensable for our team, especially for test-driven development, debugging complex issues, and large-scale refactoring

[206]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"In early testing, Claude Code completed tasks in a single pass that would normally take 45+ minutes of manual work, reducing development time and overhead

[207]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"In the coming weeks, we plan to continually improve it based on our usage: enhancing tool call reliability, adding support for long-running commands, improved in-app rendering, and expanding Claude's own understanding of its capabilities

[208]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Our goal with Claude Code is to better understand how developers use Claude for coding to inform future model improvements

[209]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"By joining this preview, you'll get access to the same powerful tools we use to build and improve Claude, and your feedback will directly shape its future

[210]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"We've also improved the coding experience on Claude.ai

[211]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Our GitHub integration is now available on all Claude plans—enabling developers to connect their code repositories directly to Claude

[212]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Claude 3.7 Sonnet is our best coding model to date

[213]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"With a deeper understanding of your personal, work, and open source projects, it becomes a more powerful partner for fixing bugs, developing features, and building documentation across your most important GitHub projects

[214]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"We've conducted extensive testing and evaluation of Claude 3.7 Sonnet, working with external experts to ensure it meets our standards for security, safety, and reliability

[215]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"The system card for this release covers new safety results in several categories, providing a detailed breakdown of our Responsible Scaling Policy evaluations that other AI labs and researchers can apply to their work

[216]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"The card also addresses emerging risks that come with computer use, particularly prompt injection attacks, and explains how we evaluate these vulnerabilities and train Claude to resist and mitigate them

[217]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Additionally, it examines potential safety benefits from reasoning models: the ability to understand how models make decisions, and whether model reasoning is genuinely trustworthy and reliable

[218]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Read the full system card to learn more

[219]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"Claude 3.7 Sonnet and Claude Code mark an important step towards AI systems that can truly augment human capabilities

[220]

Anthropic. "Claude 3.7 Sonnet and Claude Code." Anthropic News, February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet

"With their ability to reason deeply, work autonomously, and collaborate effectively, they bring us closer to a future where AI enriches and expands what humans can achieve

[221]

Google DeepMind. "Gemini." Google DeepMind, 2025. https://deepmind.google/models/gemini/

"Gemini 2.5 is our most intelligent AI model, capable of reasoning through its thoughts before responding, resulting in enhanced performance and improved accuracy

[222]

Google DeepMind. "Gemini." Google DeepMind, 2025. https://deepmind.google/models/gemini/

"Best for coding and highly complex tasks

[223]

Google DeepMind. "Gemini." Google DeepMind, 2025. https://deepmind.google/models/gemini/

"fast performance on everyday tasks

[224]

Google DeepMind. "Gemini." Google DeepMind, 2025. https://deepmind.google/models/gemini/

"an enhanced reasoning mode that uses cutting edge research techniques in parallel thinking and reinforcement learning to significantly improve Gemini's ability to solve complex problems

[225]

Google DeepMind. "Gemini." Google DeepMind, 2025. https://deepmind.google/models/gemini/

"can better help tackle problems that require creativity, strategic planning, and making improvements step-by-step

[226]

Google DeepMind. "Gemini." Google DeepMind, 2025. https://deepmind.google/models/gemini/

"Humanity's Last Exam (no tools)

[227]

Google DeepMind. "Gemini." Google DeepMind, 2025. https://deepmind.google/models/gemini/

"GPQA diamond

[228]

Google DeepMind. "Gemini." Google DeepMind, 2025. https://deepmind.google/models/gemini/

"AIME 2025

[229]

Google DeepMind. "Gemini." Google DeepMind, 2025. https://deepmind.google/models/gemini/

"LiveCodeBench (UI: 1/1/2025-5/1/2025)

[230]

Google DeepMind. "Gemini." Google DeepMind, 2025. https://deepmind.google/models/gemini/

"Gemini 2.5 builds on the best of Gemini — with native multimodality and a long context window

[231]

Google DeepMind. "Gemini." Google DeepMind, 2025. https://deepmind.google/models/gemini/

"text, image and video modalities

[232]

Google DeepMind. "Gemini." Google DeepMind, 2025. https://deepmind.google/models/gemini/

"AIME 2025

[233]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

[234]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Llama 4 Scout, a 17 billion active parameter model with 16 experts, is the best multimodal model in the world in its class and is more powerful than all previous generation Llama models, while fitting in a single NVIDIA H100 GPU

[235]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Additionally, Llama 4 Scout offers an industry-leading context window of 10M and delivers better results than Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across a broad range of widely reported benchmarks

[236]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

[237]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena

[238]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"These models are our best yet thanks to distillation from Llama 4 Behemoth, a 288 billion active parameter model with 16 experts that is our most powerful yet and among the world's smartest LLMs

[239]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Llama 4 Behemoth outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks

[240]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Llama 4 Behemoth is still training, and we're excited to share more details about it even while it's still in flight

[241]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Our new Llama 4 models are our first models that use a mixture of experts (MoE) architecture

[242]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"In MoE models, a single token activates only a fraction of the total parameters

[243]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"MoE architectures are more compute efficient for training and inference and, given a fixed training FLOPs budget, delivers higher quality compared to a dense model

[244]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"As an example, Llama 4 Maverick models have 17B active parameters and 400B total parameters

[245]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"We use alternating dense and mixture-of-experts (MoE) layers for inference efficiency

[246]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"MoE layers use 128 routed experts and a shared expert

[247]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Each token is sent to the shared expert and also to one of the 128 routed experts

[248]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"As a result, while all parameters are stored in memory, only a subset of the total parameters are activated while serving these models

[249]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"This improves inference efficiency by lowering model serving costs and latency—Llama 4 Maverick can be run on a single NVIDIA H100 DGX host for easy deployment, or with distributed inference for maximum efficiency

[250]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Llama 4 models are designed with native multimodality, incorporating early fusion to seamlessly integrate text and vision tokens into a unified model backbone

[251]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Early fusion is a major step forward, since it enables us to jointly pre-train the model with large amounts of unlabeled text, image, and video data

[252]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"We also improved the vision encoder in Llama 4

[253]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"This is based on MetaCLIP but trained separately in conjunction with a frozen Llama model to better adapt the encoder to the LLM

[254]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"The overall data mixture for training consisted of more than 30 trillion tokens, which is more than double the Llama 3 pre-training mixture and includes diverse text, image, and video datasets

[255]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Our new Llama 4 models are our first models that use a mixture of experts (MoE) architecture

[256]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"In MoE models, a single token activates only a fraction of the total parameters

[257]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"MoE architectures are more compute efficient for training and inference and, given a fixed training FLOPs budget, delivers higher quality compared to a dense model

[258]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"We developed a new training technique which we refer to as MetaP that allows us to reliably set critical model hyper-parameters such as per-layer learning rates and initialization scales

[259]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"We found that chosen hyper-parameters transfer well across different values of batch size, model width, depth, and training tokens

[260]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Llama 4 enables open source fine-tuning efforts by pre-training on 200 languages, including over 100 with over 1 billion tokens each, and overall 10x more multilingual tokens than Llama 3

[261]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

[262]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"We continued training the model in what we call 'mid-training' to improve core capabilities with new training recipes including long context extension using specialized datasets

[263]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"This enabled us to enhance model quality while also unlocking best-in-class 10M input context length for Llama 4 Scout

[264]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"For mixing modalities, we came up with a carefully curated curriculum strategy that does not trade-off performance compared to the individual modality expert models

[265]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"With Llama 4, we revamped our post-training pipeline by adopting a different approach: lightweight supervised fine-tuning (SFT) > online reinforcement learning (RL) > lightweight direct preference optimization (DPO)

[266]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"A key learning was that SFT and DPO can over-constrain the model, restricting exploration during the online RL stage and leading to suboptimal accuracy, particularly in reasoning, coding, and math domains

[267]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"To address this, we removed more than 50% of our data tagged as easy by using Llama models as a judge and did lightweight SFT on the remaining harder set

[268]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"In the subsequent multimodal online RL stage, by carefully selecting harder prompts, we were able to achieve a step change in performance

[269]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Furthermore, we implemented a continuous online RL strategy, where we alternated between training the model and then using it to continually filter and retain only medium-to-hard difficulty prompts

[270]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"This strategy proved highly beneficial in terms of compute and accuracy tradeoffs

[271]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"We then did a lightweight DPO to handle corner cases related to model response quality, effectively achieving a good balance between the model's intelligence and conversational abilities

[272]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Both the pipeline architecture and the continuous online RL strategy with adaptive data filtering culminated in an industry-leading, general-purpose chat model with state-of-the-art intelligence and image understanding capabilities

[273]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Llama 4 Scout is both pre-trained and post-trained with a 256K context length, which empowers the base model with advanced length generalization capability

[274]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"We present compelling results in tasks such as retrieval with 'retrieval needle in haystack' for text as well as cumulative negative log-likelihoods (NLLs) over 10 million tokens of code

[275]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"A key innovation in the Llama 4 architecture is the use of interleaved attention layers without positional embeddings

[276]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Additionally, we employ inference time temperature scaling of attention to enhance length generalization

[277]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"We call this the iRoPE architecture, where 'i' stands for 'interleaved' attention layers, highlighting the long-term goal of supporting 'infinite' context length, and 'RoPE' refers to the rotary position embeddings employed in most layers

[278]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"We trained both of our models on a wide variety of image and video frame stills in order to give them broad visual understanding, including of temporal activities and related images

[279]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"This enables effortless interaction on multi-image inputs alongside text prompts for visual reasoning and understanding tasks

[280]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"The models were pre-trained on up to 48 images, and we've tested in post-training with good results up to eight images

[281]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Llama 4 Scout is also best-in-class on image grounding, able to align user prompts with relevant visual concepts and anchor model responses to regions in the image

[282]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"This enables more precise visual question answering for the LLM to better understand user intent and localize objects of interest

[283]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Llama 4 Scout also exceeds comparable models on coding, reasoning, long context, and image benchmarks and offers stronger performance than all previous Llama models

[284]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"We're excited to share a preview of Llama 4 Behemoth, a teacher model that demonstrates advanced intelligence among models in its class

[285]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Llama 4 Behemoth is also a multimodal mixture-of-experts model, with 288B active parameters, 16 experts, and nearly two trillion total parameters

[286]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Offering state-of-the-art performance for non-reasoning models on math, multilinguality, and image benchmarks, it was the perfect choice to teach the smaller Llama 4 models

[287]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"We codistilled the Llama 4 Maverick model from Llama 4 Behemoth as a teacher model, resulting in substantial quality improvements across end task evaluation metrics

[288]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"We developed a novel distillation loss function that dynamically weights the soft and hard targets through training

[289]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Codistillation from Llama 4 Behemoth during pre-training amortizes the computational cost of resource-intensive forward passes needed to compute the targets for distillation for the majority of the training data used in student training

[290]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"For additional new data incorporated in student training, we ran forward passes on the Behemoth model to create distillation targets

[291]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Post-training a model with two trillion parameters was a significant challenge too that required us to completely overhaul and revamp the recipe, starting from the scale of data

[292]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"In order to maximize performance, we had to prune 95% of the SFT data, as opposed to 50% for smaller models, to achieve the necessary focus on quality and efficiency

[293]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"We also found that doing lightweight SFT followed by large-scale reinforcement learning (RL) produced even more significant improvements in reasoning and coding abilities of the model

[294]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Our RL recipe focused on sampling hard prompts by doing pass@k analysis with the policy model and crafting a training curriculum of increasing prompt hardness

[295]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"We also found that dynamically filtering out prompts with zero advantage during training and constructing training batches with mixed prompts from multiple capabilities were instrumental in providing a performance boost on math, reasoning, and coding

[296]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Finally, sampling from a variety of system instructions was crucial in ensuring that the model retained its instruction following ability for reasoning and coding and was able to perform well across a variety of tasks

[297]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Scaling RL for a two trillion parameter model also required revamping our underlying RL infrastructure due to its unprecedented scale

[298]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"We optimized the design of our MoE parallelization for speed, which enabled faster iteration

[299]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"We developed a fully asynchronous online RL training framework that enhanced flexibility

[300]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"Compared to the existing distributed training framework, which sacrifices the compute memory in order to stack all models in memory, our new infrastructure enabled flexible allocation of different models to separate GPUs, balancing resources across multiple models based on computational speed

[301]

Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." Meta AI Blog, April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/

"This innovation resulted in a ~10x improvement in training efficiency over previous generations

[302]

DeepSeek. "DeepSeek V3." DeepSeek Blog, 2025. https://deepseek.com/blog/DeepSeek-V3

"自研训练框架、自建智算集群和万卡算力

[303]

DeepSeek. "DeepSeek V3." DeepSeek Blog, 2025. https://deepseek.com/blog/DeepSeek-V3

"仅用半年时间便已发布并开源多个百亿级参数大模型

[304]

DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv, 2025. https://arxiv.org/pdf/2501.12948

"We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1

[305]

DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv, 2025. https://arxiv.org/pdf/2501.12948

"DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities

[306]

DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv, 2025. https://arxiv.org/pdf/2501.12948

"Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors

[307]

DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv, 2025. https://arxiv.org/pdf/2501.12948

"However, it encounters challenges such as poor readability, and language mixing

[308]

DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv, 2025. https://arxiv.org/pdf/2501.12948

"To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL

[309]

DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv, 2025. https://arxiv.org/pdf/2501.12948

"DeepSeekR1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks

[310]

DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv, 2025. https://arxiv.org/pdf/2501.12948

"DeepSeek-R1 achieves a score of $7 9 . 8 %$ Pass@1 on AIME 2024, slightly surpassing OpenAI-o1-1217

[311]

DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv, 2025. https://arxiv.org/pdf/2501.12948

"On MATH-500, it attains an impressive score of $9 7 . 3 %$ , performing on par with OpenAI-o1-1217 and significantly outperforming other models

[312]

DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv, 2025. https://arxiv.org/pdf/2501.12948

"On coding-related tasks, DeepSeek-R1 demonstrates expert level in code competition tasks, as it achieves 2,029 Elo rating on Codeforces outperforming $9 6 . 3 %$ human participants in the competition

[313]

DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv, 2025. https://arxiv.org/pdf/2501.12948

[314]

DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv, 2025. https://arxiv.org/pdf/2501.12948

"While its performance is slightly below that of OpenAI-o1-1217 on these benchmarks, DeepSeek-R1 surpasses other closed-source models, demonstrating its competitive edge in educational tasks

[315]

DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv, 2025. https://arxiv.org/pdf/2501.12948

"We directly apply RL to the base model without relying on supervised fine-tuning (SFT) as a preliminary step

[316]

DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv, 2025. https://arxiv.org/pdf/2501.12948

"This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero

[317]

DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv, 2025. https://arxiv.org/pdf/2501.12948

"DeepSeekR1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community

[318]

DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv, 2025. https://arxiv.org/pdf/2501.12948

"Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through ${ \\\\mathrm { R L } } ,$ without the need for SFT

[319]

DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv, 2025. https://arxiv.org/pdf/2501.12948

[320]

DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv, 2025. https://arxiv.org/pdf/2501.12948

"In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as the starting point for RL

[321]

N. Lambert. "Reinforcement Learning from Human Feedback." arXiv preprint arXiv:2504.12501v3, November 2, 2025. https://arxiv.org/html/2504.12501v3

"Group Relative Policy Optimization (GRPO) is introduced in DeepSeekMath

[322]

N. Lambert. "Reinforcement Learning from Human Feedback." arXiv preprint arXiv:2504.12501v3, November 2, 2025. https://arxiv.org/html/2504.12501v3

"GRPO can be viewed as PPO-inspired algorithm with a very similar surrogate loss, but it avoids learning a value function with another copy of the original policy language model

[323]

N. Lambert. "Reinforcement Learning from Human Feedback." arXiv preprint arXiv:2504.12501v3, November 2, 2025. https://arxiv.org/html/2504.12501v3

"Avoiding the challenge of learning a value function from a LM backbone

[324]

N. Lambert. "Reinforcement Learning from Human Feedback." arXiv preprint arXiv:2504.12501v3, November 2, 2025. https://arxiv.org/html/2504.12501v3

"Saves memory by not needing to keep another set of model weights in memory

[325]