For the first time, a publicly accessible model has crossed the symbolic 100-trillion-parameter barrier while still delivering usable latency. Grok 4 was trained on an unprecedented 128,000 H100 GPUs running non-stop for four months: roughly 370 million GPU-hours and an electricity bill that could power a small country.
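The GPU-hour figure follows directly from the numbers quoted; the exact run length has not been published, so the 120-day duration below is an assumption standing in for "four months":

```python
# Back-of-the-envelope GPU-hours from the figures quoted above.
# The 120-day run length is an assumption; the exact schedule isn't public.
gpus = 128_000
days = 120                          # ~4 months, assumed
gpu_hours = gpus * days * 24
print(f"{gpu_hours:,} GPU-hours")   # -> 368,640,000, i.e. ~370 million
```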
The results are brutal:
96.8 % on MATH-500 (previous public SOTA was 91 %)
89.4 % on HumanEval+ and 81.3 % on SWE-Bench Verified — it now beats every specialist coding agent, including Devin and Cursor, without any tool-use scaffolding
Perfect score on the 1-million-token Needle-in-a-Haystack test and 98 % on the demanding RULER-128k benchmark (a sketch of how the needle test works follows this list)
Multimodal: 87.6 % MMMU, 94.1 % MathVista, 79.8 % ChartQA — all new public records
Agent benchmarks: 72 % WebArena and 68 % AndroidControl in a zero-shot setting
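For readers who haven't seen it, the needle-in-a-haystack test mentioned above is easy to reproduce: bury one fact at a controlled depth in a long filler document, ask the model to retrieve it, and score exact matches across depths and context lengths. A minimal sketch, with a hypothetical ask_model(prompt) callable standing in for whatever chat-completion API you actually use:

```python
def build_haystack(needle, n_filler_sentences, depth):
    """Bury `needle` at fractional `depth` (0.0 = start, 1.0 = end) of filler text."""
    filler = ["The sky was a pale shade of grey that morning."] * n_filler_sentences
    filler.insert(int(depth * n_filler_sentences), needle)
    return " ".join(filler)

def needle_trial(ask_model, depth, n_filler_sentences=5000):
    needle = "The secret passphrase is 'indigo falcon 42'."
    prompt = (
        build_haystack(needle, n_filler_sentences, depth)
        + "\n\nWhat is the secret passphrase? Answer with the passphrase only."
    )
    return "indigo falcon 42" in ask_model(prompt).lower()

def run_sweep(ask_model, depths=(0.0, 0.25, 0.5, 0.75, 1.0), trials=10):
    """Fraction of successful retrievals at each depth; `ask_model` is a placeholder."""
    return {d: sum(needle_trial(ask_model, d) for _ in range(trials)) / trials
            for d in depths}
```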
But the real story is the scaling curves.
The report shows the cleanest confirmation yet that Chinchilla-style scaling laws still hold: loss keeps falling as a straight line on a log-log plot against compute, with essentially zero flattening, even at 104 trillion parameters (MoE, ~26 T active per token; see the router sketch after the architecture notes). If you believe the curves (and xAI included the raw training logs), another 5–10× of effective compute should still yield predictable, massive gains.
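To make that extrapolation concrete, here is a minimal sketch of how a power-law fit like this is checked and extended. The loss and compute values below are invented placeholders, not numbers from the report; only the fitting procedure is the point.

```python
import numpy as np

# Hypothetical (compute, loss) points standing in for the published curves;
# real values would come from the training logs.
compute = np.array([1e24, 3e24, 1e25, 3e25, 1e26])   # training FLOPs (made up)
loss    = np.array([2.10, 1.92, 1.75, 1.60, 1.46])   # eval loss (made up)

# A Chinchilla-style power law, loss = a * compute**(-b), is a straight line
# in log-log space, so an ordinary least-squares fit on the logs suffices.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)

def predicted_loss(c):
    return np.exp(intercept) * c ** slope

# "Another 5-10x effective compute" is then a straightforward extrapolation.
for factor in (5, 10):
    c = compute[-1] * factor
    print(f"{factor:2d}x more compute -> predicted loss {predicted_loss(c):.3f}")
```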
Architecture notes worth knowing:
Native 128 k context, extendable to 1 M+ with almost no accuracy drop thanks to a new “Grok Attention” routing trick
Trained on ~50 trillion tokens: heavily deduplicated Common Crawl, real-time X posts, the entire arXiv, and massive synthetic math/code textbooks
Post-training used a new reinforcement-learning-from-AI-feedback loop that is 3× more sample-efficient than previous methods
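None of these internals are documented in code, so treat the following as a generic illustration rather than xAI's design. The 104 T total / ~26 T active figure quoted earlier implies roughly a quarter of the expert parameters firing per token, which is exactly what a standard top-2-of-8 expert router gives you. A minimal sketch, assuming nothing beyond vanilla MoE routing:

```python
import numpy as np

# Generic top-k mixture-of-experts routing, NOT xAI's actual architecture.
# With 8 experts and top-2 routing, each token touches ~1/4 of the expert
# parameters, matching the ~26 T active out of ~104 T total quoted above.
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

router_w = rng.normal(size=(d_model, n_experts))                 # router projection
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x):
    """x: (tokens, d_model). Each token is processed by its top-k experts only."""
    logits = x @ router_w                                        # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]                # chosen expert ids
    sel = np.take_along_axis(logits, top, axis=-1)               # their logits
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))            # softmax over the
    w /= w.sum(axis=-1, keepdims=True)                           # selected experts
    out = np.zeros_like(x)
    for i, token in enumerate(x):
        for j, e in enumerate(top[i]):
            out[i] += w[i, j] * (token @ experts[e])
    return out

tokens = rng.normal(size=(4, d_model))
print(moe_layer(tokens).shape)   # (4, 64); only 2 of 8 experts ran per token
```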
Grok 4 is live today for SuperGrok and X Premium+ users. A denser, non-MoE "Grok 4 Heavy" (rumoured >400 T parameters) is slated for mid-2026 and will reportedly require the Memphis supercluster currently under construction.
Bottom line: we are not in the diminishing-returns regime. We are still riding the straight part of the scaling highway, and xAI just proved it with the biggest public run ever.