The Scaling Laws Debate: Are We Hitting a Wall or Just Getting Started?

Question

There's been a lot of discussion about whether the scaling laws that drove progress from GPT-2 to GPT-4 are hitting diminishing returns. Let's look at both sides.

The "hitting a wall" argument:
- Training compute has increased 1000x but capabilities haven't improved 1000x
- We're running out of high-quality training data
- Benchmark improvements are flattening on some tasks
- The gains from GPT-4 to current models feel incremental compared to GPT-3 to GPT-4

The "just getting started" argument:
- New architectures (MoE, state-space models) change the scaling equation
- Synthetic data generation provides potentially unlimited training data
- Test-time compute scaling (like o1/o3) is a completely new axis of improvement
- We haven't seriously optimized for reasoning yet
- Multimodal scaling is still in early stages

My take: I think raw language-model scaling is plateauing, but the field is pivoting to scaling along new dimensions: reasoning time, tool use, agent architectures, and specialized training. The next big jumps won't come from just making models bigger.

What do the researchers and ML engineers here think?
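
For anyone who wants to sanity-check the "1000x compute, not 1000x capability" point: under a power-law scaling relation, that outcome is what you'd expect rather than a sign of a wall. Here is a minimal sketch, assuming loss follows L(C) = a * C^(-alpha). The exponent 0.05 is an illustrative assumption, not a value fitted to any particular model family, and `loss_ratio` is a hypothetical helper name.

```python
def loss_ratio(compute_multiplier: float, alpha: float = 0.05) -> float:
    """Ratio of new loss to old loss after scaling compute.

    Assumes a pure power law L(C) = a * C**(-alpha), so the constant `a`
    cancels out. alpha=0.05 is an illustrative assumption.
    """
    return compute_multiplier ** (-alpha)


if __name__ == "__main__":
    # Each 10x of compute multiplies the loss by the same constant factor,
    # so the absolute gains shrink even though the law itself is unchanged.
    for mult in (10, 100, 1000):
        ratio = loss_ratio(mult)
        print(f"{mult:>5}x compute -> loss is {ratio:.3f} of baseline")
```

With a small exponent like this, a 1000x compute increase cuts the loss by well under half. Loss is also not the same thing as capability, so this toy model says nothing about which side of the debate is right; it only shows that sublinear returns are built into the power-law picture.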

Adam Novak · Accepted Answer

the benchmark flattening argument is weaker than people think. half those benchmarks got saturated because they were too easy, not because models stopped improving. we need better evals before we can even have this conversation properly.
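
A toy illustration of the saturation point, assuming accuracy is a logistic function of (model ability minus benchmark difficulty). That functional form, the numbers, and the `accuracy` helper are all assumptions made for the sketch, not measurements from any real benchmark.

```python
import math


def accuracy(ability: float, difficulty: float) -> float:
    """Toy logistic model: expected accuracy given ability and difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


if __name__ == "__main__":
    easy, hard = 0.0, 6.0  # hypothetical benchmark difficulties
    # Ability rises by the same amount at each step, but the easy benchmark
    # is already near its ceiling, so its score barely moves.
    for ability in (2.0, 4.0, 6.0):
        print(
            f"ability={ability:.0f}  "
            f"easy={accuracy(ability, easy):.3f}  "
            f"hard={accuracy(ability, hard):.3f}"
        )
```

In this setup the same jump in ability shows up as a large gain on the hard benchmark and almost nothing on the easy one, which is why a flat curve on a saturated eval doesn't tell you much either way.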

Victor Huang · Answer

multimodal is where things get really interesting. vision + language + code reasoning in a single model opens up application categories that weren't possible before

Sarah Chen · Answer

test-time compute scaling is the one that keeps me up at night honestly. o3 on ARC-AGI was not supposed to happen yet. that's not incremental, that's a different curve entirely.

Zara Ahmed · Answer

wait but multimodal has been "early stages" for like 3 years now. at what point do we admit the vision-language integration is harder than the labs expected? the cross-modal reasoning is still pretty shallow in practice.

Andre Dubois · Answer

synthetic data is the wildcard nobody has a confident answer on. some papers show it works great for math and code, others show model collapse. the variance in outcomes depending on how you filter the synthetic data seems massive and we don't have good rules of thumb yet.
