17 Comments
Olle Häggström

Good post! I propose to use the term "Schubertian sobriety" for this kind of level-headed analysis.

David Manheim

I think a large part of the gap in expectations is about generalization, or in different terms, how the correlations between the metrics and the capabilities degrade. Everyone agrees that the metrics are going to be overfit, and won't directly predict success - as you note - but if the correlation between measurable capabilities and hard to measure capabilities doesn't entirely disappear, we'll also see continued significant progress on the hard to measure items as well.

And evaluating the two theories' performance to date, we have seen exactly the sort of general progress that broad generalization predicts - lower text-prediction log-loss ended up leading to increased performance on a wide variety of tasks, and improved time-horizon success aimed primarily at software development has led to better writing, better qualitative analysis, and better mathematical capabilities. And this isn't back-forecasting; it's effectively exactly what those promoting the scaling hypothesis were betting on, and it has paid off in the last several generations of models.

Of course, it's always possible that the correlation between measured capability and most other tasks drops to zero, or even ends up perversely decreasing performance - but at that point, the generalization argument is that the developers will find (possibly harder to measure but at least temporarily) more robust measures to improve. Though, critically, this retargeting only works as long as we can observe the changes' impact on performance. And that's exactly the case that alignment researchers have been worried about for well over a decade, at this point.

akash

> increased performance on a wide variety of tasks, and improved time-horizon success aimed primarily at software development has led to better writing, better qualitative analysis, and better mathematical capabilities.

Isn't it also the case that RL during post-training has led to asymmetric progress in math and programming, but not so much in fuzzier domains?

David Manheim

Yes, that's an example of what I'm referring to!

The fact that progress in verifiable domains moves faster, due to the optimization pressure directed at those tasks, doesn't imply that it's not also assisting with fuzzier domains to a lesser extent. And we see that it is, as I noted. (For example, no one is optimizing for improvements in qualitative data analysis and tagging, but newer RL-optimized models (e.g. GPT 5.1->5.2->5.4, and Claude 4.5->4.6) are getting noticeably better at it.) Which is exactly what should be expected from optimization leading to improved performance on one subset of tasks that partially reflect some underlying capacity.

akash

> no one is optimizing for improvements in qualitative data analysis and tagging

This is not true! Though I don't know to what extent the increase in performance is because of (abductive?) generalization vs. the qualitative data analysis tasks that labs collect.

David Manheim

Really? At the very least, the labs certainly aren't focusing on it, and it's not subject to the same kinds of verifiable improvement loops for RL as the things they are focusing on.

akash

What I don't know is to what extent such data makes it into the final post-training pipeline / actually improves performance, but data where experts are asked to annotate the tiniest of decisions is definitely collected.

> the labs certainly aren't focusing on it

They pay a lot of money to data collection companies, so I would argue that the labs are focused on this!

> it's not subject to the same kinds of verifiable improvement loops for RL

I am pretty sure Scale has found ways to provide RL rewards for at least some fuzzier tasks.

David Manheim

Seems like you're conflating two things: human expert rating of qualitative data to improve AI (and other qualitative analysis of AI outputs), which I agree they are doing, versus optimizing for AI systems that do qualitative analysis, which is what I was referring to.

Paul Barnes

Superb curation, thank you.

Henrique Soares

In this debate, I’m like Socrates: I know that I know nothing.

Anatol Wegner, PhD

Just a little side note - the longest tasks in the METR benchmark are ~30 hours long, and the benchmark only contains two of those.

Alex Willen

The race-effect argument is one that I have trouble evaluating. On the one hand, it makes sense in a lot of industries - if my software company replaced all its engineers with AI, I could drastically undercut on price. On the other hand, maybe a meaningful portion of buyers view the low cost and heavy use of AI negatively, as though it implies a meaningful quality risk in the software (even if AI is at the point where that's not actually a valid concern).

Ljubomir Josifovski

My experience is that Codex is a capable collaborative sidekick that is already saving me 70-80% of my time. Almost doubling my attention. For the past 3-4 days, we have been running the LiveCodeBench (shortest) benchmark on my home GPU (AMD 7900 XTX; same 960 GB/s bandwidth and 24 GB VRAM as an NVIDIA 3090, double the TFLOPS on paper but performing the same due to poor AMD software support; same price) to find a local llama.cpp server configuration that is fast enough and works robustly enough, among 4 models (Qwen 3.5s: the 35B-A3B MoE as the most likely next pick, 27B dense, and 9B small; plus glm-4.7-flash, the incumbent). In the olden days, I'd have to 1 - come up with a hypothesis for why something observed happened, then as a consequence 2 - decide what to test next; then I'd 3 - write scripts and short programs, 4 - run them, and 5 - fix them, until they seemed to be doing what I wanted. Then I'd 6 - let them run for many hours, monitoring, 7 - restarting any failed jobs, and 8 - collecting results along the way. Once finished, I'd 9 - collect the final results. Together with previous iterations' final results, I'd then A - hypothesise some more, and B - decide on the next iteration.

With Codex, we do 1-2 together, then Codex independently does 3-9. Then we do A-B together again. Not only is that a huge time and attention saver, but having a collaborator who is able, diligent, and willing to talk to me is a great quality-of-life improvement.

Freddie deBoer

The "superforecaster" has been wrong a lot lol

akash

This chart suggests otherwise: https://www.metaculus.com/accounts/profile/100912/track-record/

But what makes you say that?

Jessumsica

There are whole countries whose bureaucracies don't even use software released 30 years ago. What's the explanation for why those countries will suddenly perform a huge volte-face? Or is that irrelevant to the analysis?