
AI: Current Dynamics and What to Expect in 2026

Notes and reflections on the state of AI in 2026.

10 February 2026
10 min
Louis Monier

Where does AI stand in early 2026? Here are my notes and reflections on current dynamics: what has actually changed, what's stagnating, and what to expect.

Inspired by episode #490 of the Lex Fridman Podcast, with Sebastian Raschka (author of Build a Large Language Model from Scratch) and Nathan Lambert (post-training lead at the Allen Institute for AI).


A surprisingly stable architecture

First observation, and perhaps the most counterintuitive: despite the release of new models (GPT-5.2, Claude Opus 4.6, Gemini 3, Llama 4...) and the products built around them by OpenAI, Anthropic, Google and Meta, the fundamental architecture of LLMs has barely evolved since GPT-2 (2019).

The base architecture is still a decoder-only transformer, from the 2017 "Attention Is All You Need" paper. It's theoretically possible to start from GPT-2 code and arrive at a 2026 model through successive additions: no rewriting, just incremental modifications. In other words, there has been no paradigm shift: the architecture has evolved, but the core idea is the same.
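
To make that concrete, here is a minimal decoder block in PyTorch, in the spirit of GPT-2. This is an illustrative sketch, not any lab's actual code; the comments mark where typical post-2019 modifications (RMSNorm, RoPE, grouped-query attention, SwiGLU, Mixture-of-Experts) slot in as local swaps rather than rewrites.

```python
# Minimal GPT-2-style decoder block (illustrative sketch, not any lab's code).
# Comments mark where typical post-2019 changes slot in: all local swaps,
# not a rewrite of the decoder-only structure itself.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        # 2019: LayerNorm; modern models often swap in RMSNorm.
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        # 2019: full multi-head attention with learned positions; modern models
        # typically use rotary embeddings (RoPE) and grouped-query attention
        # to shrink the KV cache.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # 2019: dense GELU MLP; modern models often use SwiGLU, or replace
        # this single MLP with a Mixture-of-Experts layer.
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: each token attends only to earlier positions.
        T = x.size(1)
        mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1
        )
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out               # residual connection, unchanged since 2017
        x = x + self.mlp(self.ln2(x))  # residual connection, unchanged since 2017
        return x
```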

The improvements focus on efficiency: how to activate fewer parameters per query, how to compress memory during inference, how to optimize computations. Optimizations, not a new way of thinking about the problem. Seven years of the industry iterating on the same basic idea, and there are still many optimizations left to explore.
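
One of those efficiency levers, Mixture-of-Experts, fits in a few lines: a router selects a small subset of expert MLPs per token, so only a fraction of the parameters are active for each query. A toy sketch, with all sizes invented for illustration and a loop-based dispatch chosen for readability rather than speed:

```python
# Toy Mixture-of-Experts layer: activate only k of E expert MLPs per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model: int = 64, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). The router scores every expert per token...
        weights = F.softmax(self.router(x), dim=-1)
        topk_w, topk_idx = weights.topk(self.k, dim=-1)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)  # renormalize
        out = torch.zeros_like(x)
        # ...but only the top-k experts actually run: with k=2 of 8 experts,
        # each token touches roughly a quarter of the MLP parameters.
        for i, expert in enumerate(self.experts):
            hit = (topk_idx == i).any(dim=-1)
            if hit.any():
                w = topk_w[hit][topk_idx[hit] == i].unsqueeze(-1)
                out[hit] += w * expert(x[hit])
        return out
```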

So where do the real advances come from? Three sources: training data quality, post-training techniques (particularly reinforcement learning), and compute allocated at inference time. The architecture itself is still waiting for its next breakthrough.


Letting the model think

The most significant technical advance of 2025 is inference-time scaling: rather than building a larger model, let it think longer before responding. The model generates internal reasoning, sometimes for several minutes, before producing its final answer.

This is what happens when you activate "thinking" mode on ChatGPT, Claude, or Gemini. The implications went far beyond expectations.

Thinking longer isn't just about better phrasing. It's about chaining steps: try an API call, observe the result, adjust, retry. This is what enabled LLMs to perform autonomous web research, execute code iteratively, explore entire projects. A year ago, an LLM couldn't chain API calls. Today, you can launch multiple queries in parallel, each searching for a research paper or verifying an equation.
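
In practice, "chaining steps" is just a loop around the model. Here is a schematic sketch; call_model and run_tool are hypothetical stand-ins for a real LLM API client and a tool executor, with no specific vendor API implied.

```python
# Schematic agent loop: generate, act, observe the result, adjust, retry.
import json

def call_model(messages: list[dict]) -> dict:
    """Stand-in: returns either {'final': str} or {'tool': str, 'args': dict}."""
    raise NotImplementedError

def run_tool(name: str, args: dict) -> str:
    """Stand-in: runs a tool (web search, code execution...) and returns text."""
    raise NotImplementedError

def agent(task: str, max_steps: int = 10) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_model(messages)
        if "final" in step:
            return step["final"]  # the model decided it is done
        # Otherwise: act, observe, feed the result back, let the model adjust.
        observation = run_tool(step["tool"], step["args"])
        messages.append({"role": "assistant", "content": json.dumps(step)})
        messages.append({"role": "tool", "content": observation})
    return "step budget exhausted"
```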

"It has totally transformed how we think of using AI. But it's not clear what the next avenue will be in terms of unlocking stuff like this." Nathan Lambert

Inference-time scaling has transformed use cases. But no one knows what the next advance of this magnitude will be.


The economics of compute

Every year, someone announces the death of scaling laws. Every year, they hold. The relevant question in 2026 is no longer "does it scale?" but "where to invest your compute?"

All three axes of improvement (pre-training, post-training, inference-time) still work, but the economic equation has changed. Pre-training scaling laws have held across 13 orders of magnitude of compute, and there is no reason to think they will suddenly stop. Serving a giant model to hundreds of millions of users, however, costs billions, far more than the initial training run.

The math is simple: pre-training is a fixed cost, inference is a variable cost. If the model is obsolete in six months, the equation tilts toward inference. That's why OpenAI implemented a routing system with GPT-5: most queries are directed to a lighter, cheaper model.
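
A back-of-the-envelope version of that equation, with every number invented for illustration:

```python
# Back-of-the-envelope cost model. All numbers are invented for illustration.
pretraining_cost = 500e6          # fixed: one training run, in dollars
cost_per_query_large = 0.02       # variable: serving the big model
cost_per_query_small = 0.002      # variable: serving the routed, lighter model
queries_per_day = 2e9
lifetime_days = 180               # model is obsolete in ~6 months

def total_cost(share_to_small: float) -> float:
    per_query = (share_to_small * cost_per_query_small
                 + (1 - share_to_small) * cost_per_query_large)
    return pretraining_cost + per_query * queries_per_day * lifetime_days

print(f"no routing: ${total_cost(0.0):,.0f}")   # inference dwarfs training
print(f"80% routed: ${total_cost(0.8):,.0f}")   # routing recovers billions
```

Even with made-up numbers, the shape is clear: over six months of serving, inference dwarfs the training run, and routing most queries to a lighter model recovers billions.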

One-gigawatt clusters are coming online. "Pro" subscriptions could jump from $200 to $2,000/month. The challenge is no longer technical, it's economic.


Data as competitive advantage

The most pragmatic advice for having impact at an AI lab: find better data.

"If you join a frontier lab and you want to have impact, the best way to do it is just find new data that's better." Nathan Lambert

OLMo 3 (Allen Institute) was trained on less data than several competitors. And it outperformed them. The secret: curation. Quality over quantity.

Labs invest in OCR for scientific PDFs, in filtering raw web data with specialized classifiers, and in optimizing data mixes: sample subsets, train small models on each mix, measure performance, adjust (sketched below). When benchmarks evolve, the mix changes. It's iterative work that's never finished.
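
The loop is simple to express, even if every iteration is expensive. A sketch, where train_small_model and evaluate are hypothetical stand-ins for a proxy training run and a benchmark suite:

```python
# Sketch of the data-mix iteration loop described above.
import random

def train_small_model(mix: dict[str, float]):
    """Stand-in: trains a small proxy model on the given source weighting."""
    raise NotImplementedError

def evaluate(model) -> float:
    """Stand-in: scores the model on the current benchmark suite."""
    raise NotImplementedError

def search_mix(sources: list[str], n_trials: int = 20) -> dict[str, float]:
    best_mix, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Sample a candidate weighting over data sources (web, code, papers...).
        weights = [random.random() for _ in sources]
        total = sum(weights)
        mix = {s: w / total for s, w in zip(sources, weights)}
        score = evaluate(train_small_model(mix))
        if score > best_score:
            best_mix, best_score = mix, score
    # When the benchmarks change, rerun: the "best" mix changes with them.
    return best_mix
```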

On synthetic data: it's not about letting an AI invent content. More often it means reformulating an article into Q&A format, or summarizing a technical document in accessible language. As with human learning, models learn better from well-structured text.
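
In code, this kind of synthetic data is often just a templated rewrite request. A sketch, where call_model is a hypothetical stand-in for any LLM API client:

```python
# Sketch of "synthetic data" as rewriting, not invention: the source article
# stays the ground truth, the model only changes its form.
PROMPT = """Rewrite the following article as question/answer pairs.
Use only facts stated in the article; do not add new information.

Article:
{article}
"""

def call_model(prompt: str) -> str:
    """Stand-in for a real LLM API call."""
    raise NotImplementedError

def rewrite_to_qa(article: str) -> str:
    return call_model(PROMPT.format(article=article))
```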


The RLHF paradox

RLHF (Reinforcement Learning from Human Feedback) is the standard method for making models useful and aligned. Lambert is one of its leading experts, which makes his observation all the more interesting: the method has a structural flaw that no one knows how to solve.

The problem is in the formulation itself. You collect preferences from thousands of people on what a "good" response is. You train the model to maximize that aggregated preference. The result: responses that satisfy the most people, meaning smooth, consensus-driven, edgeless responses.
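
That formulation can be written down in a few lines. A Bradley-Terry-style preference loss, the core of reward-model training, sketched in PyTorch; reward_model is assumed to be any network mapping an encoded response to a scalar score:

```python
# Bradley-Terry-style preference loss, the core of reward-model training.
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    r_chosen = reward_model(chosen)      # score for the preferred response
    r_rejected = reward_model(rejected)  # score for the rejected response
    # Maximize P(chosen > rejected) = sigmoid(r_chosen - r_rejected),
    # averaged over the whole batch of annotators' preferences. That averaging
    # is exactly where minority tastes, the "voice", get smoothed out.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```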

Lambert calls it losing your "voice." A researcher who writes tries to transform an intuition at the edge of their understanding into words. It's sometimes clumsy, but it's precise and carries a point of view. RLHF, by averaging feedback, prevents this form of expression.

An observation shared by many advanced users: even with elaborate prompts, LLM-generated summaries consistently miss the most important insights. The summary is correct; it covers the main points. But the insight, the sentence that makes you see things differently, isn't there.

"These language models don't have this prior in their deep expression that they're trying to get at. I don't think it's impossible to do. But it's such a wonderful fundamental problem." Nathan Lambert

It's not a bug. It's an open problem. And probably one of the most important of the decade.


The open source shift

In the podcast, the speakers try to list from memory all significant open-weight models. They find more than twenty. No one thinks to mention Llama.

The open source landscape shifted in 2025. Chinese labs now lead: DeepSeek opened the way with R1 in January 2025, quickly joined by Kimi, MiniMax and Qwen. Their licenses are more permissive than Llama or Gemma: no user thresholds, no reporting to Meta or Google.

Chinese models are also larger, often Mixture-of-Experts architectures with several hundred billion parameters, which gives them an advantage in raw performance. In the West, NVIDIA and Mistral have announced comparable models for early 2026, but they aren't available yet.

OpenAI released its first open-weight model since GPT-2. Designed for tool use, but far from competing with the best Chinese models. Sam Altman was transparent about the motivation: "We're releasing this because we can use your GPUs." When you're short on GPUs, you outsource inference to the community.


Long-term projection

At the end of the podcast, Lex Fridman asks his guests to project a hundred years into the future.

The answers converge toward something modest. No Singularity. No artificial consciousness. Rather, specialized robots integrated into daily life, brain-machine interfaces that will replace our smartphones, and AI that remains what it is: a tool. Powerful, ubiquitous, but a tool.

What will change profoundly is the value we place on what is authentically human. Physical experiences. Connection. Community. Lambert talks about "preserving agency": the ability to choose, to build oneself rather than passively consume what the machine generates.

That may be the most unexpected lesson of this conversation. What matters is not the machine. It's what we do with it. And above all, what we choose not to delegate to it.


Listen to the full episode →

Lex Fridman Podcast #490, with Sebastian Raschka and Nathan Lambert
