The Scaling Laws Debate: Are We Hitting a Wall or Just Getting Started?
There’s been a lot of discussion about whether the scaling laws that drove progress from GPT-2 to GPT-4 are hitting diminishing returns. Let’s look at both sides.
The “hitting a wall” argument:
– Training compute has increased 1000x but capabilities haven’t improved 1000x
– We’re running out of high-quality training data
– Benchmark improvements are flattening on some tasks
– The gains from GPT-4 to current models feel incremental compared to GPT-3 to GPT-4
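One caveat on the first bullet: scaling laws are usually written as power laws in compute, so nobody should expect 1000x compute to buy 1000x of anything. Here is a minimal sketch, assuming a loss curve of the form L(C) = a * C^(-alpha). The function names and the values a = 1.0 and alpha = 0.05 are illustrative assumptions, not a fit to any particular model.

```python
def loss(compute, a=1.0, alpha=0.05):
    """Hypothetical power-law loss curve: L(C) = a * C**(-alpha).

    Both `a` and `alpha` are illustrative assumptions, not measured values.
    """
    return a * compute ** (-alpha)


def loss_ratio(compute_multiplier, alpha=0.05):
    """Factor by which loss is multiplied when compute grows by compute_multiplier."""
    return compute_multiplier ** (-alpha)


if __name__ == "__main__":
    # Under a power law, every 10x of compute multiplies loss by the same factor.
    for mult in (10, 100, 1000):
        print(f"{mult:>5}x compute -> loss multiplied by {loss_ratio(mult):.3f}")
```

With these assumed numbers, 1000x compute cuts loss by roughly 29%. Whether that counts as "a wall" depends on how loss maps to the capabilities you care about, which the power law by itself doesn't tell you.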
The “just getting started” argument:
– New architectures (MoE, state-space models) change the scaling equation
– Synthetic data generation provides potentially unlimited training data
– Test-time compute scaling (like o1/o3) is a completely new axis of improvement
– We haven’t seriously optimized for reasoning yet
– Multimodal scaling is still in early stages
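To make the test-time compute point concrete, here is a toy sketch of the simplest version of the idea: sample N answers and keep one that passes a check. This is not a description of how o1/o3 work. It assumes independent samples, a perfect verifier, and a made-up per-sample success rate; `best_of_n_success` is a hypothetical helper name.

```python
def best_of_n_success(p_single, n):
    """Probability that at least one of n independent samples is correct.

    Assumes each sample succeeds with probability p_single and that a
    perfect verifier can pick out a correct sample when one exists.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - p_single) ** n


if __name__ == "__main__":
    # Same model, same weights: only the inference budget changes.
    for n in (1, 10, 100):
        print(f"n={n:>3}: success probability {best_of_n_success(0.1, n):.4f}")
```

Under these assumptions, a model that is right 10% of the time per sample gets to about 65% with 10 samples, with no change to model size. Real systems have correlated samples and imperfect verifiers, so treat this as an upper-bound intuition.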
My take: I think raw language-model scaling is plateauing, but the field is pivoting to scaling along new dimensions – reasoning time, tool use, agent architectures, and specialized training. The next big jumps won’t come from just making models bigger.
What do the researchers and ML engineers here think?
5 Replies
test-time compute scaling is the one that keeps me up at night honestly. o3 on ARC-AGI was not supposed to happen yet. that's not incremental, that's a different curve entirely.
the benchmark flattening argument is weaker than people think. half those benchmarks got saturated because they were too easy, not because models stopped improving. we need better evals before we can even have this conversation properly.
synthetic data is the wildcard nobody has a confident answer on. some papers show it works great for math and code, others show model collapse. the variance in outcomes depending on how you filter the synthetic data seems massive and we don't have good rules of thumb yet.
multimodal is where things get really interesting. vision + language + code reasoning in a single model opens up application categories that weren't possible before.
wait but multimodal has been "early stages" for like 3 years now. at what point do we admit the vision-language integration is harder than the labs expected? the cross-modal reasoning is still pretty shallow in practice.