Hacker News | aspenmartin's comments

I think this is a true yet short-sighted take. Keep in mind these features are immature, but they exist to build a flywheel and corner the market. I don't know why, but people seem to consistently miss two points and their implications:

- performance is continuing to increase incredibly quickly, even if you rightfully don't trust any particular evaluation. This is captured by scaling laws like Chinchilla and by RL scaling laws (both training-time and test-time)

- coding is a verifiable domain
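As a concrete illustration of the first point, a Chinchilla-style scaling law is just a simple parametric formula. The constants below are the fits reported by Hoffmann et al. (2022) and are illustrative only, not exact values for any current model:

```python
# Chinchilla parametric loss: L(N, D) = E + A / N^alpha + B / D^beta
# Constants are the published fits from Hoffmann et al. (2022); treat them
# as illustrative, not as exact values for any particular current model.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss for N parameters trained on D tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Loss falls smoothly and predictably as scale increases (D = 20N is the
# compute-optimal token budget suggested by the same paper).
for n in (1e9, 1e10, 1e11):
    print(f"N={n:.0e}  loss={chinchilla_loss(n, 20 * n):.3f}")
```

The point of the sketch is the shape of the curve: smooth, predictable decay toward an irreducible floor, which is the sense in which performance can be predicted in advance.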

The second point is the more important one. Agent quality is NOT limited by human code in the training set; that code is simply used for efficiency: it gets you to a good starting point for RL.
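The "human code as a starting point, then RL against a verifier" loop can be sketched with a toy example. Everything here is a deliberate simplification: a three-candidate bandit with a REINFORCE-style update stands in for policy-gradient RL over token sequences, and the candidate strings stand in for generated programs.

```python
import math
import random

random.seed(0)  # deterministic demo

# Toy "verifiable domain": candidates stand in for generated programs,
# and the reward is binary -- does the candidate pass the tests?
CANDIDATES = ["return a + b", "return a - b", "return a * b"]

def passes_tests(body: str) -> bool:
    """Verifier: run the candidate against held-out cases."""
    f = eval("lambda a, b: " + body.removeprefix("return "))
    return f(1, 1) == 2 and f(2, 3) == 5

# REINFORCE-style update on a softmax policy over the candidates.
logits = [0.0, 0.0, 0.0]
lr = 0.5
for _ in range(200):
    probs = [math.exp(l) for l in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    i = random.choices(range(len(CANDIDATES)), probs)[0]
    reward = 1.0 if passes_tests(CANDIDATES[i]) else 0.0
    for j in range(len(CANDIDATES)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * reward * grad  # reinforce rewarded samples only

best = max(range(len(CANDIDATES)), key=lambda j: logits[j])
print(CANDIDATES[best])  # converges to "return a + b"
```

No human-written "correct answer" is needed inside the loop; the test suite alone supplies the training signal, which is the sense in which verifiability decouples agent quality from the human code in the training set.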

The claim that carries the burden of proof is that these systems will not reach superhuman performance, INCLUDING on all end-to-end tasks: understanding a vague, poorly articulated business objective; architecting a system; building it out; testing it; maintaining it; fixing bugs; adding features; refactoring; and so on. It carries the burden because we literally can predict performance (albeit through a complicated relationship between benchmarks and real-world performance).

Yes, definitely, error rates are so far too high for this to be totally trusted end to end, but the error rates are improving consistently, and this is what the METR time-horizon benchmark captures.


Scaling laws vs. combinatorial explosion: who wins? In my personal experience, Claude does exceedingly well on mundane code (do a migration, add a field, wire up this UI) and quite poorly on code that has likely never been written before (even if it is logically simple for a human). The question is whether this is a quantitative or a qualitative barrier.

Of course it's still valuable. A real app has plenty of mundane code despite our field's best efforts.


Combinatorial explosion? What do you mean? Again, your experiences are true, but models are improving with each release. The error rate on tasks continues to go down, even on novel tasks (as far as we can measure them). This is where verifiable domains come in: whatever problems you can specify, the model will improve on them, and this improvement results in better generalization and improvements on unseen tasks. This is what I mean: you take your observations of today, ignore the rate of progress that got us here and the known scaling laws, and then just assert there will be some fundamental limitation. While that idea may be common, it is not at all supported by the literature or the mathematics.

The space of programs is incomprehensibly massive, and searching for a program that does what you need is a particularly difficult search problem. In the general case you can't solve search; there's no free lunch, and even scaling laws must bow to NFL. But depending on the type of search problem, some heuristics can do well. We know human brains have a heuristic that can program (maybe not particularly well, but passably). To evaluate these agents we can only look at them experimentally; there is no sense in which they are mathematically destined to eventually program well.

How good are these types of algorithms at generalization? Are they learning how to code, or are they learning how to code migrations, then learning how to code caches, then learning how to code a command-line arg parser, and so on?

Verifiable domains are interesting. It is unquestionably why agents have come first for coding. But if you've played with Claude, you may have experienced it short-circuiting failing tests, cheating tests with code that does not generalize, writing meaningless tests, and, when you turn it away from all of these, at long last saying something like "honest answer - this feature is really difficult and we should consider a compromise."


So what do you think the difference is between humans and an agent in this respect? What makes you think this has any relevance to the problem? Everything is combinatorially explosive: the combinations of words that we can string into sentences and essays are also combinatorially explosive, and yet LLMs and humans have no problem with them. It's just the wrong frame for what's going on. These systems are acquiring higher and higher levels of abstraction because that is the most efficient way for them to gain performance. That's what reasoning looks like: compositions of higher-level abstractions. What you say may be true, but I don't see how it is relevant.

"There is no sense in which they are mathematically destined to eventually program well"

- Yes, there is, and claiming otherwise betrays an ignorance of the literature and of how things work

- Again: RL has been around forever. Scaling laws have held empirically up to the largest scales we've tested, and there are known RL scaling laws for both training and test time. It's ludicrous to state there is "no sense" in this; on the contrary, the burden of proof is squarely on you, because this has already been studied and is indeed the primary reason we're able to secure the eye-popping funding: contrary to popular HN belief, a trillion dollars of CapEx spend is based on rational, evidence-based decision making.

> "How good are these types of algorithms at generalization?"

There is a tremendously large literature and history on this:

- ULMFiT, BERT ==> NLP task generalization
- https://arxiv.org/abs/2206.07682 ==> emergent capabilities
- https://transformer-circuits.pub/2022/in-context-learning-an... ==> demonstrated circuits for in-context learning as a mechanism for generalization
- https://arxiv.org/abs/2408.10914 + https://arxiv.org/html/2409.04556v1 ==> code training produces downstream performance improvements on other tasks

> Verifiable domains are interesting. It is unquestionably why agents have come first for coding. But if you've played with Claude, you may have experienced it short-circuiting failing tests, cheating tests with code that does not generalize, writing meaningless tests, and, when you turn it away from all of these, at long last saying something like "honest answer - this feature is really difficult and we should consider a compromise."

You say this and ignore my entire argument. You are right about all of your observations, and yet:

- Opus 4.6 compared to Sonnet 3.x is clearly more generalizable and less prone to these mistakes

- Verifiable domain performance SCALES, we have no reason to expect that this scaling will stop and our recursive improvement loop will die off. Verifiable domains mean that we are in AlphaGo land: we're learning by doing, not by mimicking human data or memorizing a training set.


Hey man, it sounds like you're getting frustrated. I'm not ignoring anything; let's have a reasonable discussion without calling each other ignorant. I don't dispute the value of these tools, nor that they're improving. But the no-free-lunch theorem is inexorable, so the question is where this improvement breaks down: before or beyond human performance on programming problems specifically.

What difference do I think there is between humans and an agent? They use different heuristics, clearly. Different heuristics are valuable on different search problems. It's really that simple.

To be clear, I'm not calling either superior. I use agents every day. But I have noticed that Claude, a SOTA model, makes basic logic errors. Isn't that interesting? It has access to the complete compendium of human knowledge and can code, in seconds, all sorts of things that would require my trawling through endless documentation. But sometimes it forgets that to do dirty tracking on a pure function's output, it needs to dirty-track the function's inputs.

It's interesting that you mention AlphaGo. I was also very fascinated with it. There was recent research that the same algorithm cannot learn Nim: https://arstechnica.com/ai/2026/03/figuring-out-why-ais-get-.... Isn't that food for thought?


What is unreasonable? I am saying the claims you are making are completely contradicted by the literature. I am calling you ignorant in the technical sense, not dumb or unintelligent, and I don't mean this as an insult. I am completely ignorant of many things, we all are.

I am saying you are absolutely right that Opus 4.6 is both SOTA and colossally terrible in even surprisingly mundane contexts. But that is just not relevant to the argument you are making, which is that there is some fundamental limitation. There is of course always a fundamental limitation to everything, but the question is where that limitation is, and we are not yet even beginning to see it. Combinatorics is the wrong lens here, because the model is not doing a search over the full combinatorial space, any more than we are. There are plenty of efficient search "heuristics," as you call them.

> They use different heuristics, clearly.

What is the evidence for this? I don't see that as true; take, for instance: https://www.nature.com/articles/s42256-025-01072-0

> It's interesting that you mention AlphaGo. I was also very fascinated with it. There was recent research that the same algorithm cannot learn Nim: https://arstechnica.com/ai/2026/03/figuring-out-why-ais-get-.... Isn't that food for thought?

It's a long-known problem with RL in a particular regime and isn't relevant to coding agents. Games like Nim form a small, adversarially structured task family that is not representative of language, coding, or real-world tasks. Nim is almost the worst possible case: the optimal policy is a brittle, discontinuous function of the position.
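That brittleness is concrete: Nim's optimal policy hinges on the XOR (the "nim-sum") of the pile sizes, so changing a single pile by one stone can flip the status of every available move. A minimal sketch:

```python
from functools import reduce
from operator import xor

def nim_sum(piles):
    """XOR of pile sizes; the position is lost for the player to move iff 0."""
    return reduce(xor, piles, 0)

def winning_move(piles):
    """Return (pile index, new size) that zeroes the nim-sum, or None."""
    s = nim_sum(piles)
    if s == 0:
        return None  # every move hands the opponent a winning position
    for i, p in enumerate(piles):
        if p ^ s < p:  # reducing this pile can zero the nim-sum
            return i, p ^ s
    return None

print(winning_move([3, 4, 5]))  # (0, 1): shrink pile 0 to 1, nim-sum becomes 0
print(winning_move([1, 2, 3]))  # None: nim-sum is already 0
```

A bitwise parity function like this is exactly the kind of discontinuous target that value-function approximation from game outcomes struggles to represent.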

AlphaZero-style training is pure RL from scratch, which is quite challenging, inefficient, and unstable; that is why we don't do it with LLMs, and pretrain them first. In coding agents, RL is not used to discover invariants (aspects of the problem that don't change when surface details change) from scratch, as it is in this example. Pretraining takes care of that, and RL is used for refinement, so it's a completely different scenario, one where RL is well suited.


I didn't make any claims contradicted by the literature. The only thing I cited as bedrock fact, NFL, is a mathematical theorem. And I'm not sure why Nim shouldn't be relevant; it's an exercise in logic.

> “AlphaZero excels at learning through association,” Zhou and Riis argue, “but fails when a problem requires a form of symbolic reasoning that cannot be implicitly learned from the correlation between game states and outcomes.”

Seems relevant.


> So what do you think the difference is between humans and an agent in this respect?

Humans learn.

Agents regurgitate training data (and quality training data is increasingly hard to come by).

Moreover, humans learn (somewhat) intangible aspects: human expectations, contracts, business requirements, laws, user case studies etc.

> Verifiable domain performance SCALES, we have no reason to expect that this scaling will stop.

Yes, yes we have reasons to expect that. And even if growth continues, a nearly flat logarithmic scale is just as useless as no growth at all.

For a year now all the amazing "breakthrough" models have been showing little progress (comparatively). To the point that all providers have been mercilessly cheating with their graphs and benchmarks.


> Where did I say that? I didn't even mention money, just the broader resource term. A lot of businesses are mostly running experiments to see if the current set of tooling can match the marketing (or the hype). They're not building datacenters or running AI labs. Such experiments can't run forever.

I'm just going to ask that you read any of my other comments, this is not at all how coding agents work and seems to be the most common misunderstanding of HN users generally. It's tiring to refute it. RL in verifiable domains does not work like this.

> Humans learn.

Sigh, so do LLMs, in context.

> Moreover, humans learn (somewhat) intangible aspects: human expectations, contracts, business requirements, laws, user case studies etc.

There are literally benchmarks on this all over the place; I'm sure you follow them.

> Yes, yes we have reasons to expect that. And even if growth continues, a nearly flat logarithmic scale is just as useless as no growth at all.

And yet it's not logarithmic. Consider the data flywheel, consistent algorithmic improvements, and synthetic data (basically: rejection sampling from a teacher model with a lot of test-time compute and high temperature).
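The bracketed synthetic-data recipe is essentially the loop below. `teacher_sample` and `verifier` are hypothetical stand-ins for a real teacher-model call and a real test harness; here a noisy oracle and a string check simulate them:

```python
import random

def teacher_sample(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in for sampling one candidate from a teacher model;
    simulated here by a noisy oracle that is right about a third of the time."""
    return prompt + random.choice([" correct", " wrong", " wrong"])

def verifier(candidate: str) -> bool:
    """Hypothetical stand-in for a verifiable check (e.g. running the tests)."""
    return candidate.endswith("correct")

def make_synthetic_data(prompts, k=16, temperature=1.0):
    """Rejection sampling: spend test-time compute drawing up to k
    high-temperature candidates per prompt, keep only verified ones."""
    dataset = []
    for prompt in prompts:
        for _ in range(k):
            candidate = teacher_sample(prompt, temperature)
            if verifier(candidate):
                dataset.append((prompt, candidate))
                break  # one verified sample per prompt for this sketch
    return dataset

data = make_synthetic_data(["task-1", "task-2", "task-3"])
print(len(data), "verified training pairs")
```

Every pair that survives is verified-correct by construction, which is how a student model can be trained on data cleaner than the teacher's average output.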

> For a year now all the amazing "breakthrough" models have been showing little progress (comparatively). To the point that all providers have been mercilessly cheating with their graphs and benchmarks.

Benchmaxxing is for sure a real thing, and even honest benchmarking is very difficult to do, but taking "all of the AI companies are just faking the performance data" to be the story is tremendously wrong. Consider AIME 2025 performance (uncontaminated data), and the fact that companies have a deep incentive to genuinely improve their models (and then, of course, to market them as hard as possible; that's a given). People will experiment with different models, and no benchmaxxing is going to fool them for very long.

If you think Opus 4.6 compared to Sonnet 3.x is "little progress" I think we're beyond the point of logical argument.


Are you aware that LLMs are still the same autocomplete, just with different token decisions, more data, better pre- and post-training, and different settings?

We have all the data now.

I don't see where the huge gap is supposed to come from; as someone said earlier, they still make basic errors.

Models got better through a bunch of soft tuning. Language and abstraction are not really the same thing; there are a lot of very good speakers who are terrible at logic and abstraction.

Abstract thinking sometimes makes it necessary to leave language and draw; some people even switch to another programming language to get it.

We've seen it with the compiler project: it looks nice, but if you wanted to make a competitive compiler, you'd be about as far along as if you had started fresh.


But the issue isn't coding, it's doing the right thing. I don't see anywhere in your plan any way of staying aligned with core business strategy, forethought, etc.

The number of devs will shrink, but there will still be large areas of activity that can't be farmed out without an overall strategy.


Why do you think this is a problem? Reasoning is constantly improving; the agent has ample access to humans to gather more business context, it has access to the same industry data and other signals that humans do, and it can get any data necessary. It has Zoom meeting notes. Why do people think there's somehow a fundamental limit just beyond coding?

The other thing you're missing here is generalizability. Better coding performance (which is verifiable and not limited by human data quality) generalizes to better performance on other benchmarks. This is a long-known phenomenon.


> Why do you think this is a problem?

Because it cannot do it?

Every investment has a date where there should be a return on that investment. If there’s no date, it’s a donation of resources (or a waste depending on perspective).

You may be OK with continuing to try to make things work. But others aren’t and have decided to invest their finite resources somewhere else.


> Because it cannot do it?

Ah, OK, so you didn't really read my comment. What is your counterargument -- that models are just fundamentally incapable of understanding business context? They are demonstrably already capable of this to a large extent.

> Every investment has a date where there should be a return on that investment. If there’s no date, it’s a donation of resources (or a waste depending on perspective).

What are you implying here? This convo now turns into the "AI is not profitable and this is a house of cards" theme? That's OK; we can ignore every other business model, like, say, Uber running at a loss to capture what is ultimately an absolutely insane TAM. Little ol' Uber accumulated ~$33B in losses over 14 years, and you're right, they tanked and collapsed like a dying star... oh wait... hmm, interesting: I just looked at their market cap and it's $141 billion.

> You may be OK with continuing to try to make things work. But others aren’t and have decided to invest their finite resources somewhere else.

I truly love that. If you want to code as a hobby that is fantastic, and we can go ahead and see in 2 years how your comment ages.


> They are demonstrably already capable of this to a large extent.

I'd very much like to see such a demonstration: one where someone hands over a department to an agent and lets it make decisions.

> This convo now turns into the "AI is not profitable and this is a house of cards" theme?

Where did I say that? I didn't even mention money, just the broader resource term. A lot of businesses are mostly running experiments to see if the current set of tooling can match the marketing (or the hype). They're not building datacenters or running AI labs. Such experiments can't run forever.


@skydhash I think aspenmartin is ragebaiting; he can't be for real.

> I'd very much like to see such a demonstration: one where someone hands over a department to an agent and lets it make decisions.

That's your bar for understanding business context? I thought we were talking about what you actually said, which is understanding business context. If I brainstorm about a feature, it will be able to pull up the compendium of knowledge for the business (reports, previous launches, infrastructure, an understanding of the problem space, the industry, company strategy). That's business context.

> Where did I say that? I didn't even mention money, just the broader resource term. A lot of businesses are mostly running experiments to see if the current set of tooling can match the marketing (or the hype). They're not building datacenters or running AI labs. Such experiments can't run forever.

I misunderstood you then, I wasn't sure what point you were trying to make. Is your point "companies are trying to cajole Claude to do X and it doesn't work and hasn't for the last year so they are giving up"? If so I think that is a wonderful opportunity for people that understand the nuance of these systems and the concept of timing.


You make the mistake of disregarding tacit knowledge: the stuff that isn't in reports, docs, etc., because it's just "how we do things," picked up on the job.

Unless the AI is inserted into every conversation it won't discover this, or how it changes.

Even if it had access to all this documented it wouldn't then be able to account for politics, where Barry who runs analytics is secretly trying to sabotage the project so it ends up run by his team, etc.


> - coding is a verifiable domain

You're missing the point, though. "1 + 1" vs "one.add(1)" might both be "passable" and correct, but that's missing the forest for the trees. How do you know which one is the right long-term choice, given what we know? That is the engineering part of building software, as opposed to the "coding," which tends to be the easy part.

How do you evaluate, score, and/or benchmark something like that? Currently, I don't think we have any methodologies for this, probably because it's pretty subjective in the end. That's where the "creative" parts of software engineering become more important, and they're also way harder to verify.


While I agree we don't have any methodologies for this, it's also true that we can just "fail" more often.

Code is effectively becoming cheap, which means even bad design decisions can be overturned without prohibitive costs.

I wouldn't be surprised if in a couple of years we see several projects that approach the problem of tech debt like this:

1. Instruct AI to write tens of thousands of tests by using available information, documentation, requirements, meeting transcripts, etc. These tests MUST include performance AND availability related tests (along with other "quality attribute" concerns)

2. Have humans verify (to the best of their ability) that the tests are correct -- step likely optional

3. Ask another AI to re-implement the project while matching the tests

It sounds insane, but...not so insane if you think we will soon have models better than Opus 4.6. And given the things I've personally done with it, I find it less insane as the days go by.

I do agree with the original poster who said that software is moving in this direction, where super fast iteration happens and non-developers can get features to at least be a demo in front of them fast. I think it clearly is and am working internally to make this a reality. You submit a feature request and eventually a live demo is ready for you, deployed in isolation at some internal server, proxied appropriately if you need a URL, and ready for you to give feedback and have the AI iterate on it. Works for the kind of projects we have, and, though I get it might be trickier for much larger systems, I'm sure everyone will find a way.

For now, we still need engineers to help drive many decisions, and I think that'll still be the case. These days all I do when "coding" is talk (via TTS) with Opus 4.6 and iterate on several plans until we get the right one, and I can't wait to see how much better this workflow will be with smarter and faster models.

I'm personally trying to adapt everything in our company to have agents work with our code in the most frictionless way we can think of.

Nonetheless, I do think engineers with a product inclination are better off than those who are mostly all about coding and building systems. To me, it has never felt so magical to build a product, and I'm loving it.


> Code is effectively becoming cheap, which means even bad design decisions can be overturned without prohibitive costs.

I'm sorry, but only someone who has never maintained software long-term would say something like this. The further along you are in development, the more the cost of changing it increases, maybe even exponentially.

Correcting the design before you even write code might be 100x (or even 1000x) cheaper than changing that design two years later, after you've stored TBs of data in some format because of that decision, and lots of other parts of the company/product/project depend on the choices you made earlier.

You can't just pile code on top of code, say "code is cheap," and hope for the best; it's just not feasible to run a project long-term that way, and I think if you had the experience of maintaining something long-term, you'd realize how this sounds.

The easiest part of "software engineering" is "writing code", and today "writing code" is even easier. But the hardest parts, actually designing, thinking and maintaining, remains the same as before, although some parts are easier, others are harder.

Don't get me wrong, I'm on the "agentic coding" train as much as everyone else -- I probably haven't written or edited code by hand for a year at this point -- but it's important to be realistic about what it actually takes to produce worthwhile software, not just to slop out patchy and hacky code.


I've never maintained software long-term, so I could be wrong, but I interpret "code is cheap" to mean that you can have coding agents refactor or rewrite the project from scratch around the design correction. I don't think "code is cheap" should ever be interpreted to mean ship hacky code.

I think using agents to prototype code and design will be a big thing. Have the agent write out what you want, come back with what works and what doesn't, write a new spec, toss out the old code, and have a fresh agent start again. Spec-driven development is the new hotness, but we know that the best spec is code: have the agent write the spec in code, rewrite the spec in natural language, then iterate.


because it has business context and better reasoning, and can ask humans for clarification and take direction.

You don't need to benchmark this, although it's important. We have clear scaling laws on true statistical performance that is monotonically related to any notion of what performance means.

I do benchmarks for a living and can attest: benchmarks are bad, but it doesn't matter for the point I'm trying to make.


I feel like you're missing the initial context of this conversation (no pun intended):

> Like for example a trusted user makes feedback -> feedback gets curated into a ticket by an AI agent, then turned into a PR by an Agent, then reviewed by an Agent, before being deployed by an Agent.

Once you add "humans for clarification and direction," then yeah, things can be useful, but that's far away from the non-human-involvement loop described earlier in this thread, which is what people are pushing back against.

Of course involving people makes things better; that's the entire point here, and by removing the human you won't get results as good. Going back to benchmarks: obviously involving humans isn't possible there, so again we're back to being unable to score these processes at all.


I'm confused about the scenario here. There is a human in the loop: it's the feedback part. There is business context: it is either seeded or maintained by the human and expanded by the agent. The agent can make inferences about the world, especially once embodiment and better multimodal interaction are rolled out (embodiment is taking longer).

Benchmarks ==> it's absolutely not a given that humans can't be involved in the loop of performance measurement. Why would that be the case?


> because it has business context

It doesn't because it doesn't learn. Every time you run it, it's a new dawn with no knowledge of your business or your business context

> better reasoning

It doesn't have better reasoning beyond very localized decisions.

> and can ask humans for clarification and take direction.

And yet it doesn't, no matter how many .md files you throw at it, at crucial places in the code.

> We have clear scaling laws on true statistical performance that is monotonically related to any notion of what performance means.

This is just a bunch of words stringed together, isn't it?


Almost every task that people are tackling agents on, it's either not worth doing, can be done better with scripts and software, or require human oversight (that negates all the advantages).

I assume this is a troll because it's just so far removed from reality that there's not much to say. "Almost every task" -- I'm sure you have great data to back this up. "It's not worth doing" -- well, sure, if you want to put your head in the sand and ignore even what systems today can do, let alone the improvement trajectory. "can be done better with scripts and software" -- not sure if you realize this, but agents write scripts and software. "or require human oversight (that negates all the advantages)" -- it certainly does not; human oversight vs. actual humans implementing the code is dramatically more efficient and productive.

> It doesn't because it doesn't learn. Every time you run it, it's a new dawn with no knowledge of your business or your business context

It does learn in context. And the lack of continuous learning is temporary; that is a quirk of the current stack, so expect this to change rather quickly. Also still not relevant, consider that agentic systems can be hierarchical and that they have no trouble being able to grok codebases or do internal searches effectively, and this will only improve.

> It doesn't have better reasoning beyond very localized decisions.

Do you have any basis for this claim? It contradicts a large amount of direct evidence and measurement and theory.

> This is just a bunch of words stringed together, isn't it?

Maybe to yourself? Chinchilla scaling laws and RL scaling laws are measured very accurately, based on next-token test loss (in Chinchilla's case). This scales very predictably. It is related to downstream performance; that relationship is noisy but clearly monotonic.


> It does learn in context

It quite literally doesn't.

It also doesn't help that every new context is a new dawn with no knowledge of things past.

> Also still not relevant, consider that agentic systems can be hierarchical and that they have no trouble being able

A bunch of Memento guys directing a bunch of other Memento guys don't make a robust system, or a system that learns, or a system that maintains and retains things like business context.

> and this will only improve.

We've heard this mantra for quite some time now.

> Do you have any basis for this claim?

Oh, just the fact that in every single coding session, even on a small 20kloc codebase, I need to spend time cleaning up large amounts of duplicated code, undoing quite a few wrong assumptions, and correcting the agent when it goes off on wild tangents and goose chases.

> Maybe to yourself? Chinchilla scaling laws a

yap yap yap. The result is anything but your rosy description of these amazing reasoning learning systems that handle business context.


> It quite literally doesn't.

Awesome you've backed this up with real literature. Let's just include this for now to easily refute your argument which I don't know where it comes from: https://transformer-circuits.pub/2022/in-context-learning-an...

> It also doesn't help that every new context is a new dawn with no knowledge of things past.

Absolutely true that it doesn't help, but: agents like Claude have access to older sessions, they can grok impressive amounts of data via tool use, and they can be composed into hierarchical systems that effectively have much larger context lengths (at the expense of cost and coordination, which needs improvement). Again, this is a temporary and already partially solved limitation.

> A bunch of Memento guys directing a bunch of other Memento guys don't make a robust system, or a system that learns, or a system that maintains and retains things like business context.

I think you are not understanding: hierarchical agents have long-term memory maintained by higher-level agents in the hierarchy; that's the whole point. It's annoying to reset model context, and yet you have a persisted knowledge base of the business context, and the agent can grok it.

> We've heard this mantra for quite some time now.

Yes, you have, and it has held true and will continue to hold true. Have you read the literature on scaling laws? Do you follow benchmark progression? Do you know how RL works? If you did, I don't think you would hold this opinion.

> yap yap yap. The result is anything but your rosy description of these amazing reasoning learning systems that handle business context.

Well, it's fine to call an entire body of literature "yap," but don't pretend you have some intelligible argument. I don't see you backing up anything you've said here with evidence, unlike the multitude of sources I have provided to you.

Do you argue things have not improved in the last year with reasoning systems? If so I would really love to hear the evidence for this.


> Let's just include this for now to easily refute your argument which I don't know where it comes from: https://transformer-circuits.pub/2022/in-context-learning-an...

I love it when people include links to papers that refute their words.

So, Anthropic (which is heavily reliant on hype and making models appear more than they are) authors a paper which clearly states: "tokens later in context are easier to predict and there's less loss of tokens. For no reason at all we decided to give this a new name, in-context learning".

> agents like Claude have access to older sessions, they can grok impressive amounts of data via tool use

That is, they rebuild the world from scratch for every new session and can't build on what was learned or built in the last one.

Hence continuous repeating failure modes.

10 years ago I worked in a team implementing royalties for a streaming service. I can still give you a bunch of details, including references to multiple national laws, about that. Agents would exhaust their context window just re-"learning" it from scratch, every time. And they would miss a huge amount of important context and business implications.

> Have you read the literature on scaling laws?

You keep referencing this literature as it was Holy Bible. Meanwhile the one you keep referring to, Chinchilla, clearly shows the very hard limits of those laws.

> Do you argue things have not improved in the last year with reasoning systems?

I don't.

Frankly, I find your aggressiveness quite tiring


> Frankly, I find your aggressiveness quite tiring

having to answer for opinions with no basis in the literature is I'm sure very tiring for you. Your aggression being met is I'm sure uncomfortable.

> I love it when people include links to papers that refute their words.

> So, Anthropic (which is heavily reliant on hype and making models appear more than they are) authors a paper which clearly states: "tokens later in context are easier to predict and there's less loss of tokens. For no reason at all we decided to give this a new name, in-context learning".

well I don't really love it when people just totally misread a paper because they have an agenda to push and can't seem to accept that their opinions are contradicted by real evidence.

in-context learning is not "later tokens are easier"; it's task adaptation from examples in the prompt, and I'm sure you realize this. Models can learn a mapping (e.g. word --> translation) from a few examples in the prompt and apply it to new inputs within the same forward pass. That is function learning at inference time, not just "predicting later tokens better".
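To make the distinction concrete, here is a minimal sketch of what "task adaptation from examples in the prompt" means operationally. The word pairs and the `->` format are purely illustrative; any few-shot format works. No weights change anywhere:

```python
# The task (English -> French translation) is specified only by examples
# placed inside the prompt. A model completing this prompt must infer the
# mapping from those examples alone, within a single forward pass.
examples = [("sea", "mer"), ("dog", "chien"), ("cat", "chat")]
query = "house"

prompt = "\n".join(f"{en} -> {fr}" for en, fr in examples) + f"\n{query} -> "

# The prompt now reads:
#   sea -> mer
#   dog -> chien
#   cat -> chat
#   house ->
# "Later tokens are easier" would predict lower loss everywhere; in-context
# learning is the stronger claim that the *mapping itself* is learned here.
```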

I'm sure also you're happy to chalk up any contradicting evidence to a grand conspiracy of all AI companies just gaming benchmarks and that this gaming somehow completely explains progress.

> That is they rebuild the world from scratch for every new session, and can't build on what was learned or built in the last one.

That they rebuild the world from scratch (wrong, they have priors from pretraining, but I accept your point here) does not mean they can't build on what was learned or built in the last one. They have access to the full transcript, the full codebase, the diff history, and whatever knowledge base is available. It's just disingenuous to say this, and it also assumes (1) that there is no mitigation for this, which I have presented twice before and you don't seem to understand, and (2) that this is permanent, when it's a temporary limitation: continual learning is one of the most important and best-funded problems right now.

> 10 years ago I worked in a team implementing royalties for a streaming service. I can still give you a bunch of details, including references to multiple national laws, about that. Agents would exhaust their context window just re-"learning" it from scratch, every time. And they would miss a huge amount of important context and business implications.

also not an accurate understanding of how agents and their context work; you can use multiple sessions to digest and distill information useful in other sessions, and in fact Claude does this automatically with subagents. It's a problem we have _already sort of solved today_ and that will continue to improve.
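The distillation pattern is simple enough to sketch. Everything below is illustrative: `distill` stands in for an LLM summarization call, and the `DECISION:` tag is a made-up convention, not anything Claude actually uses:

```python
import json, os, tempfile

# Cross-session distillation: session 1 condenses what it learned into a
# small persistent knowledge base; a later session loads those notes
# instead of re-deriving the world from the raw transcript.

def distill(transcript: list[str], max_notes: int = 3) -> list[str]:
    # Stand-in for an LLM summary call: keep only lines flagged as decisions.
    return [line for line in transcript if line.startswith("DECISION:")][:max_notes]

kb_path = os.path.join(tempfile.mkdtemp(), "notes.json")

# Session 1: work happens, then the distilled notes are persisted.
session1 = ["read module A", "DECISION: royalties accrue per-stream", "ran tests"]
with open(kb_path, "w") as f:
    json.dump(distill(session1), f)

# Session 2: starts from the compact notes, not the full transcript.
with open(kb_path) as f:
    notes = json.load(f)
```

The point is only that "new session" does not have to mean "empty context": the distilled notes are a fraction of the transcript's size, so they don't exhaust the window.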

> You keep referencing this literature as it was Holy Bible. Meanwhile the one you keep referring to, Chinchilla, clearly shows the very hard limits of those laws.

You keep dismissing this literature as if you had understood it and your opinion somehow holds more weight... Can you elaborate on why you think Chinchilla shows the hard limits of the scaling laws? Perhaps you're referring to the term capturing the irreducible loss? Is that what you're saying?

> Do you argue things have not improved in the last year with reasoning systems?

> I don't.

Then are you arguing this progress will stop? I'm just not sure I understand, you seem to contradict yourself


> having to answer for opinions with no basis in the literature

Having only literature on your side must feel nice.

> They have access to the full transcript, and they have access to the full codebase, the diff history, whatever knowledge base is available.

Yes. And it means that they don't learn, and they always miss important details when rebuilding the world.

That's why even the tiniest codebases are immediately filled with duplications, architecturally unsound decisions, invalid assumptions etc.

> also not an accurate understanding of how agents and their context work; you can use multiple session to digest and distill information useful in other sessions and in fact

I say: agents don't learn and have to rebuild the world from scratch

You: not an accurate understanding of how agents and their context work.... they rebuild the world from scratch every time they run.

> You keep dismissing this literature as if you have understood it

No. I'm dismissing your flawed interpretation of purely theoretical constructs.

Chinchilla doesn't project unlimited amazing scalability. If anything, it shows a very real end of scalability.

Anthropic's paper adopts a nice marketable term for a process that has little to do with learning.

Etc.

Meanwhile you do keep rejecting actual real-world behaviour of these systems.

> Then are you arguing this progress will stop? I'm just not sure I understand, you seem to contradict yourself

I didn't say that either. Your opponents don't contradict themselves if you stop pretending to know what they think or say.

Your unsubstantiated belief is that improvements are on a steep linear or even exponential progression. Because "literature" or something.

Looking past all the marketing bullshit, it could be argued that growth is at best logarithmic, and most improvements come from tooling around the models (harnesses, subagents etc.). While all the failure modes from a year ago are still there: misunderstanding context, inability to maintain cohesion between sessions, context pollution etc.

And providers are running into the issue of getting non-polluted training data.

---

At this point we're going around in circles, and I'm not interested in arguing with theorists.

Adieu


> 1. LLMs are trained on human-quality data, so they will naturally learn to mimic our limitations. Their capabilities should saturate at human or maybe above-average human performance.

Why oh why is this such a commonly held belief. RL in verifiable domains being the way around this is the entire point. It’s the same idea behind a system like AlphaGo — human data is used only to get to a starting point for RL. RL will then take you to superhuman performance. I’m so confused why people miss this. The burden of proof is on people who claim that we will hit some sort of performance wall because I know of absolutely zero mechanisms for this to happen in verifiable domains.
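What "verifiable domain" buys you can be sketched in a few lines. The reward below comes from mechanically executing a candidate solution against tests, with no human grading anywhere in the loop; `solve`, the test pairs, and the candidate strings are all illustrative, and this is a toy of the idea, not any lab's actual training setup:

```python
# Verifiable reward for code: execute a model-proposed solution against
# held-out tests and score it by pass rate. Because correctness is checked
# mechanically, reward quality is not capped by human training data --
# which is the mechanism that lets RL push past human-level performance.

def verify(candidate_src: str, tests: list[tuple[int, int]]) -> float:
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # build the candidate function
        f = namespace["solve"]
        passed = sum(1 for x, want in tests if f(x) == want)
        return passed / len(tests)       # fraction of tests passed = reward
    except Exception:
        return 0.0                       # code that crashes earns zero reward

tests = [(2, 4), (3, 9), (10, 100)]
good = "def solve(x):\n    return x * x\n"
bad = "def solve(x):\n    return x + x\n"
```

An RL loop then just samples candidates, scores them with `verify`, and reinforces the high-reward ones; the pretraining data only determines where that search starts.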


Hmm... the sun having come up today is a pretty good bet that the sun comes up tomorrow.

We have robust scaling laws that continue to hold at the largest scales. It is absolutely a very safe bet that more compute + more training + algorithmic improvements will improve performance; it's not like we're rolling a 1 trillion dollar die.


this is true because markets are generally efficient. It's very hard to find predictive signals. That is a completely different space from what we're talking about here. Performance is incredibly predictable through scaling laws that continue to hold even at the largest scales we've built.

I agree this is a new space and prediction volatility is much higher. We have evidence going back to at least 2019 that improvements have been exponential (https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...). The benchmarks are all over the place because improvements don't happen in a straight line. Even composites aren't that useful because the last 10% improvement can require more effort than the first 90%.
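For a sense of what "exponential" means here, a back-of-envelope using the roughly 7-month task-horizon doubling time the linked METR post reports. This is pure illustrative compounding under that assumption, not a forecast:

```python
# If the length of task an agent can complete doubles every ~7 months
# (METR's reported trend), a 1-hour horizon today compounds like this.

def horizon_after(months: float, start_hours: float = 1.0,
                  doubling_months: float = 7.0) -> float:
    return start_hours * 2 ** (months / doubling_months)

# 28 months = 4 doublings: a 1-hour horizon becomes a 16-hour horizon.
# Whether the trend actually holds that long is exactly what's disputed.
```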

To be frank, from what I can see, even if all progress stopped right now, it would take 1-2 decades to fully operationalise the existing potential of LLMs. There would be massive economic and social change. But progress is not stopping, and in some measurements, continues to improve exponentially. I really think this is incredibly transformative. More so than anything humanity has ever experienced. In the last year, OpenAI and potentially Anthropic have been working on recursive self-improvement. Meaning these models are designing better versions of themselves. This means we have effectively entered the singularity.


I agree with all of this -- the one nit I'll add is that scaling laws (e.g. Chinchilla -- the classic paper on this, which still holds) are fit to next-token log loss on an evaluation set during pretraining, and empirically follow very consistent power-law relationships with compute and data (there is an ideal mixture of compute + data, and the thing you scale is compute at that ideal mixture). So that's all I mean by performance. We do also have, as you observe, benchmark performance trends (measured on the final model, after post-training, RL stages etc). These follow less predictable relationships, but it's the pretraining loss that dominates anyway.
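Concretely, the Chinchilla parametric fit is a two-term power law plus an irreducible floor. The constants below are the fitted values reported in the paper (to my recollection; treat them as approximate), and the model/token counts are just example scales:

```python
# Chinchilla loss fit: L(N, D) = E + A / N^alpha + B / D^beta
# N = parameters, D = training tokens. E is the irreducible-loss term --
# the "hard limit" the other commenter is pointing at: more scale only
# ever approaches E, but it approaches it very predictably.

def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

small = chinchilla_loss(70e6, 1.4e9)    # a small run: 70M params, 1.4B tokens
large = chinchilla_loss(70e9, 1.4e12)   # Chinchilla scale: 70B params, 1.4T tokens
```

Both readings of the paper are defensible from this one equation: loss falls smoothly and predictably with compute (my point), and it is bounded below by E (theirs).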

I agree with all of this though


The “good deal of evidence” is everywhere. The proof is in the pudding. Of course you can find failure modes, the blog article (not an actual paper?) rightfully derides benchmarks and then…creates a benchmark? Designed to elicit failure modes, ok so what? As if this is surprising to anyone and somehow negates everything else?

Anyone who says that “statistical models for next token generation” are unlikely to provide emergent intelligence I think is really not understanding what a statistical model for next token generation really means. That is a proxy task DESIGNED to elicit intelligence because in order to excel at that task beyond a certain point you need to develop the right abstractions and decide how to manipulate them to predict the next token (which, by the way, is only one of many many stages of training). That’s indistinguishable from intelligence. It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.


> The “good deal of evidence” is everywhere. The proof is in the pudding.

I'm open! Please, by all means.

> the blog article (not an actual paper?) rightfully derides benchmarks and then…creates a benchmark?

The blog article is a review of benchmarking methodologies and the issues involved by a PhD neuroscientist who works directly on large language models and their applications to neuroscience and cognition, it's probably worth some consideration.

> Anyone who says that “statistical models for next token generation” are unlikely to provide emergent intelligence I think is really not understanding what a statistical model for next token generation really means.

Okay.

> That is a proxy task DESIGNED to elicit intelligence because in order to excel at that task beyond a certain point you need to develop the right abstractions and decide how to manipulate them to predict the next token (which, by the way, is only one of many many stages of training).

This isn't a great argument. It seems to say that in order for LLMs to do well they must have emergent intelligence. That is not evidence for LLMs having emergent intelligence, it's just stating that a goal would be to have it.

As I said, a theoretical framework with real tests would be great. That's how science is done, I don't really think I'm asking for a lot here?

> It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.

Well, it is a bit surprising. But we have an extremely robust model for exactly that - there are fields dedicated to it, we can create simulations and models, we can perform interventative analysis, we have a theory and falsifying test cases, etc. We don't just say "clearly brains are intelligent, therefore intelligence is an emergent property of cells zapping" lol that would be absurd.

So I'm just asking for you to provide a model and evidence. How else should I form my beliefs? As I've expressed, I have reasons to find the idea of emergent logic from statistical models surprising, and I have no compelling theory to account for that nor evidence to support that. If you have a theory and evidence, provide it! I'd be super interested, I'm in no way ideologically opposed to the idea. I'm a functionalist so I fundamentally believe that we can build intelligent systems, I'm just not convinced that LLMs are doing that - I'm not far though, so please, what's the theory?


> The “good deal of evidence” is everywhere. The proof is in the pudding.

> I'm open! Please, by all means.

Sure, here are but a few: [1] you get smooth gains in reasoning with more RL train-time compute and more test-time compute (o1)

[2] DeepSeek-R1 showed that RL on verifiable rewards produces behavior like backtracking, adaptation, reflection, etc.

[3] SWE-Bench is a relatively decent benchmark and perf here is continually improving — these are real GitHub issues in real repos

[4] MathArena — still good perf on uncontaminated 2025 AIME problems

[5] the entire field of reinforcement learning, plus successes in other fields with verifiable domains (e.g. AlphaGo); Bellman updates will give you optimal policies eventually
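On the "Bellman updates will give you optimal policies eventually" point, here is the claim in miniature: value iteration on a toy deterministic chain MDP of my own construction (states, rewards, and action encoding are all illustrative) converges to the optimal policy with no demonstration data at all:

```python
# Value iteration on a 4-state chain: states 0..3, action 0 = left,
# action 1 = right, and landing on state 3 pays reward 1.0. Repeated
# Bellman backups converge to the optimal values, and the greedy policy
# read off those values is optimal -- learned purely from the reward
# signal, with no examples of "good play" anywhere.

GAMMA = 0.9

def step(s: int, a: int) -> tuple[int, float]:
    ns = max(0, s - 1) if a == 0 else min(3, s + 1)
    return ns, (1.0 if ns == 3 else 0.0)

V = [0.0] * 4
for _ in range(100):                         # Bellman backups to a fixed point
    V = [max(step(s, a)[1] + GAMMA * V[step(s, a)[0]] for a in (0, 1))
         for s in range(4)]

policy = [max((0, 1), key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
          for s in range(4)]
```

Scaling this idea from a 4-state chain to code generation is exactly what the verifiable-reward setup in [1]-[3] is doing; the human data only picks the starting point of the search.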

[6] Anthropic's cool work looking effectively at the biology of a large language model: https://transformer-circuits.pub/2025/attribution-graphs/met... — if you trace internal circuits in Haiku 3.5 you see what you expect from a real reasoning system: planning ahead, using intermediate concepts, operating in a conceptual latent space (above tokens). And that's Haiku 3.5!!! We’re on Opus 4.6 now…

people like to move goalposts whenever a new result comes out, which is silly. Could AI systems do this 2 years ago? No. I don’t know how people don’t look at robust trends in performance improvement, combined with verifiable RL rewards, and can’t understand where things are going.

> The blog article is a review of benchmarking methodologies and the issues involved by a PhD neuroscientist who works directly on large language models and their applications to neuroscience and cognition, it's probably worth some consideration.

Appeals to authority are a fine prior, but lo and behold I also have a PhD and have worked on and led benchmark development professionally for several years at an AI lab. That’s ultimately no reason to really trust either of us. As I said, the blog post rightfully decries benchmarks but it then presents a new benchmark as though that isn’t subject to all of the same problems. It’s a good article! I think they do a good job here! I agree with all of their complaints about benchmarks! It rightfully identifies failure modes, and there are plenty of other papers pointing out similar failure modes. Reasoning is still brittle, lots of areas where LLMs/agentic systems fail in ways that are incredible given their talent in other areas. But you pretend as though this is definitive evidence that “LLMs are poor general reasoners”. This is just not true, but it is true that they are brittle and fallible in weird ways, today.

> This isn't a great argument. It seems to say that in order for LLMs to do well they must have emergent intelligence. That is not evidence for LLMs having emergent intelligence, it's just stating that a goal would be to have it.

"They do well, therefore intelligence" is not an argument, sure. But that’s also not what I’m saying. The Occam’s razor here is that reasoning-like computation is the best explanation for an increasing amount of the observed behavior, especially in fresh math and real software tasks where memorization is a much worse fit.

> As I said, a theoretical framework with real tests would be great. That's how science is done, I don't really think I'm asking for a lot here?

I would encourage you to read Kuhn’s structure of scientific revolutions. "That’s how science is done" is a bit of an oversimplification of how the sausage is made here. Real science moves forward in a messy mix of partial theory + better measurements + interventions long before anyone has some sort of grand unified framework. Neuroscience is no different here. And I would say at this point with LLMs we now do have pretty decent tests: fresh verifiable-task evals, mechanistic circuit tracing, causal activation patching, and scaling results for RL/test-time compute. The claim that there is no framework + no real tests is just not true anymore. It’s not like we have some finished theory of reasoning, but that's a bit of an unfair demand at this point, and it's asymmetrical as well.

> It’s like saying “I think it’s surprising that a jumble of trillions of little cells zapping each other would produce emergent intelligence” while ignoring the fact that brains are clearly intelligent.

>> Well, it is a bit surprising. But we have an extremely robust model for exactly that - there are fields dedicated to it, we can create simulations and models, we can perform interventative analysis, we have a theory and falsifying test cases, etc. We don't just say "clearly brains are intelligent, therefor intelligence is an emergent property of cells zapping" lol that would be absurd.

>> So I'm just asking for you to provide a model and evidence. How else should I form my beliefs? As I've expressed, I have reasons to find the idea of emergent logic from statistical models surprising, and I have no compelling theory to account for that nor evidence to support that. If you have a theory and evidence, provide it! I'd be super interested, I'm in no way ideologically opposed to the idea. I'm a functionalist so I fundamentally believe that we can build intelligent systems, I'm just not convinced that LLMs are doing that - I'm not far though, so please, what's the theory?

The model is: reasoning is not inherently human, it’s mathematical. It falls easily within the purview of RL, statistics, representation, optimization, etc, and to claim otherwise would require evidence.

What is the robust model for reasoning in humans again? Simulations and models — what are these? Interventative analysis — we can’t do this with LLMs? Falsifying test cases — what would satisfy you here beyond everything I’ve presented above? Also I’m confused by your last part. You say “brains are intelligent” ==> “intelligence is an emergent property of cells zapping” is absurd, but why? You start from the position that brains are intelligent, so why is this absurd within your argument? Brains _are_ made up of real, physical atoms organized into molecules organized into cells organized into a coordinated system, and…that’s it? What’s missing here?


Thanks, this is great and I'll have quite a bit to read here.

> people like to move goalposts whenever a new result comes out, which is silly. Could AI systems do this 2 years ago? No. I don’t know how people don’t look at robust trends in performance improvement, combined with verifiable RL rewards, and can’t understand where things are going.

I don't think it's goal post moving to acknowledge improvements but still reject the conclusion that AI has reached a specific milestone if those improvements don't justify the position. I doubt anyone sensible is rejecting improvements.

> But you pretend as though this is definitive evidence that “LLMs are poor general reasoners”.

I don't think I've ever made any definitive claims at all, quite the contrary - I've tried to express exactly how open I am to what you're saying. As I've said, I'm a functionalist, and I already am largely supportive of reductive intelligence, so I'm exactly the type of person who would be sympathetic to what you're saying.

> "That’s how science is done" is a bit of an oversimplification

Of course, but I don't think it's too much to ask for a theory and evidence. I don't need a lined-up series of papers that all start with perfect syllogisms and then map to well-controlled RCTs or whatever. Just an "I think this accounts for it, here's how I support that".

> The claim that there is no framework + no real tests is just not true anymore.

I didn't say it wasn't true, to be clear, I asked for it. Again, I'm sympathetic to the view at a glance so I simply need a way to reason about it.

No need for a complete view, I'd never expect such a thing.

> The model is: reasoning is not inherently human, it’s mathematical.

Well, hand-waving perhaps, but I'd say it's maybe mathematical, computational, structural, functional, whatever - I think we're on the same page here regardless.

> It falls easily within the purview of RL, statistics, representation, optimization, etc, and to claim otherwise would require evidence.

Sure, I grant that; in fact I believe it entirely. But that doesn't mean that every mathematical construct exhibits the function of intelligence.

> What is the robust model for reasoning in humans again? Simulations and models — what are these? Interventative analysis — we can’t do this with LLMs? Falsifying test cases — what would satisfy you here beyond everything I’ve presented above?

Sorry, I'm not fully understanding this framing. We can do those things with LLMs, and it's hard to say exactly what would satisfy me. In general, I'd be satisfied with a theory that (a) accounts for the data, (b) has supporting evidence, and (c) does not contradict any major prior commitments. I don't think (c) will be an issue here.

> You say “brains are intelligent” ==> “intelligence is an emergent property of cells zapping” is absurd,

Because intelligence could have been a property of our brains being wet, or roundish, or it could have been a property of our spines, or maybe some force we hadn't discovered, or a soul, etc. We formed a theory, it accounted for observations, we performed tests, we've modeled things, etc, and so the theories we've adopted have been extremely successful and I think hold up quite well. But certainly we didn't go "the brain has electricity, the brain is intelligent, therefore electricity in the brain is what drives intelligence".

> Brains _are_ made up of real, physical atoms organized into molecules organized into cells organized into a coordinated system, and…that’s it? What’s missing here?

Certainly nothing on my world view.


Do you think AlphaGo is regurgitating human gameplay? No it’s not: it’s learning an optimal policy based on self play. That is essentially what you’re seeing with agents. People have a very misguided understanding of the training process and the implications of RL in verifiable domains. That’s why coding agents will certainly reach superhuman performance. Straw/steel man depending on what you believe: “But they won’t be able to understand systems! But a good spec IS programming!” also a bad take: agents absolutely can interact with humans, interpret vague desiderata, fill in the gaps, ask for direction. You are not going to need to write a spec the same way you need to today. It will be exactly like interacting with a very good programmer in EVERY sense of the word

How does AlphaGo come into the picture? It works in a completely different way altogether.

I'm not saying that LLMs can't solve new-ish problems not part of the training data, but they sure as hell didn't get some Apple-specific library call from divine revelation.


AlphaGo comes into the picture to explain that in fact coding agents in verifiable domains are absolutely trained in very similar ways.

It’s not magic; they can’t access information that’s not available, but they are not regurgitating or interpolating training data. That’s not what I’m saying. I’m saying: there is a misconception, stemming from a limited understanding of how coding agents are trained, that they are somehow limited by the quality or content of the training data, or poorly interpolating that space. This may be true for some domains but not for coding or mathematics. AlphaGo is the right mental model here: RL in verifiable domains means your gradient steps take you in directions that are not limited by the training data, which is used only because starting RL from scratch is very inefficient. Human training data gives the models a more efficient starting point for RL.


Well said.

Why is it important? “Statistical fit” is what you want... not understanding this is indicative of a limited understanding of what statistics is. What do you think it means to truly understand something? I don’t get it: read Probability Theory by Jaynes. It doesn’t really matter whether the brain does Bayesian updates; that’s what’s optimal…

"Statistical fit" to environment is arguably what all life does.

How can people look at

- clear generalizability

- insane growth rates (go back and look at where we were maybe 2 years ago and then consider the already signed compute infrastructure deals coming online)

And still say with a straight face that this is some kind of parlor trick or monkeys with typewriters.

We don’t need to run LLMs for years. The point is: look at where we are today, and consider that performance gets 10x cheaper every year.

LLMs and agentic systems are clearly not monkeys with typewriters regurgitating training data. And they have and continue to grow in capabilities at extremely fast rates.


I was talking about the highest-difficulty problems only, in the scope of that comment. Sure, at mundane tasks they are useful, and we are optimizing that constantly.

But for super hard tasks, there is no situation where you just dump a few papers for context, add a prompt, and the LLM spits out the correct answer. It's likely that a lead on such a project would need to additionally train the LLM on their local dataset, then parse through a lot of experimental data, then likely run multiple LLMs for many iterations homing in on the solution, verifying intermediate results, then repeat the cycle again and again. And in parallel the same would go for other team members. All in all, for such a huge hard task, a year of cumulative machine-hours is not outlandish.


This is just not true. Maybe it will be true if you increase the problem difficulty in concert with model performance? You don't need fine tuning for this and you haven't for years now. Reasoning performance for now may be SOMEWHAT brittle but again look at where we have come from in like 2 years. Then also consider the logical next steps

- better context compression (already happening) + memory solutions that extend the effective context length [memory _is_ compression]

- continual learning systems (likely already prototyped)

- these domains are _verifiable_ which I think just seems to confuse people. RL in verifiable domains takes you farther and farther. Training data is a bootstrap to get to a starting point, because RL from scratch is too inefficient.

agents can already deal with large codebases and datasets, just like any SWE, DS or researcher.

and yes! If you throw more compute at a problem you will get better solutions! But you are missing the point: for the frontier solutions, which change with every model update, you of course need to eke out as much performance as you can, which requires a large amount of test-time compute. But what you can do _without_ this is continually improving. The pattern _already in place_ is that at first you need an extreme amount of compute, then the next model iterations need far less compute to reach that same solution, etc. The costs + compute requirements to perform a particular task decrease exponentially.


I don’t know that it’s fair to characterize the LLM community as ignorant and rediscovering PSP/TCP. I in fact see that as programmers rediscovering survival analysis, and most LLM folks I know have learned these perspectives from that lens. Could be wrong about PSP, maybe things are more nuanced? But what is there that isn’t already covered by foundational statistics?


How about people who understand things are changing whether anyone likes it or not and want to stay relevant? What about the people who care about the end product and not rabbitholing design decisions on a proof of concept? What about someone who understands there is more nuance than assuming people with a different perspective on AI are lesser than people who resist the technology? You may feel you know the “right way”, but to everyone who is interested in operating in a world changing beneath our feet, not whining about the fact that everything will be different, and not denigrating the people who want to succeed in it, this opinion is not exactly convincing. You want to kludge your way through a problem, you’re welcome to, but it’s not logical to suggest this is the only “right” way and to infer that people who build with AI don’t like “understanding systems”.

When I build with AI I build things I never would have built before, and in doing so I’m exposed to technologies, designs, tools I wasn’t aware of before. I ask questions about them. Sure I don’t understand the tools as deeply as the person who wasted like 10 hours going down rabbit holes to answer a simple question, but I don’t really see that as particularly valuable.

