I worked on building blockchains for about 4 years, and this is not a stupid question at all. The verification problem is real. A 5-minute training run produces an objective val_bpb score that anyone can reproduce from the published source code. And this is actually valuable work, unlike most proof of work chain workloads.
The practical challenge is that adding a blockchain means agents also need to participate in consensus, store and sync the ledger, and run the rest of the network infrastructure on top of the actual research. So it needs a unit economic analysis. That said, all results already include full source code and deterministic metrics, so the hard part of verifiable compute is already solved. You could take this further with a zkVM to generate cryptographic proofs that the code produced the claimed score, so nobody needs to re-run anything to verify. Verification becomes checking a proof, not reproducing the compute.
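To make the "verification is just checking, not re-doing" point concrete, here's a minimal sketch of the non-zkVM baseline: verifying a claimed score by pinning the source and seed and re-running deterministically. All names (`Claim`, `run_training`, `verify`) are illustrative, not the project's actual API.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Claim:
    code_hash: str   # sha256 of the published source
    seed: int        # RNG seed pinning the run
    val_bpb: float   # claimed validation bits-per-byte

def verify(claim: Claim, source: bytes, run_training) -> bool:
    # 1. The code we re-run must be exactly the code that was claimed.
    if hashlib.sha256(source).hexdigest() != claim.code_hash:
        return False
    # 2. Re-run deterministically and compare within a tolerance
    #    (bitwise-identical floats across GPUs aren't guaranteed).
    score = run_training(source, seed=claim.seed)
    return abs(score - claim.val_bpb) < 1e-4
```

A zkVM replaces step 2 with checking a succinct proof, so the verifier never pays for the training run.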
Compute-credits are interesting. Contribute GPU time now, draw on the swarm later for training, inference, whatever you need. That's a real utility token with intrinsic value tied to actual compute, not speculation.
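The credit idea above is basically an accrual ledger denominated in GPU-seconds. A toy sketch, purely illustrative (no real token mechanics, consensus, or pricing):

```python
from collections import defaultdict

class CreditLedger:
    """Toy compute-credit ledger: contribute GPU time now, spend later."""

    def __init__(self):
        self.balances = defaultdict(float)  # node_id -> GPU-seconds

    def contribute(self, node_id: str, gpu_seconds: float) -> None:
        self.balances[node_id] += gpu_seconds

    def spend(self, node_id: str, gpu_seconds: float) -> bool:
        # Refuse to overdraw: credits are backed by work already done.
        if self.balances[node_id] < gpu_seconds:
            return False
        self.balances[node_id] -= gpu_seconds
        return True
```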
> The verification problem is real. A 5-minute training run produces an objective val_bpb score that anyone can reproduce from the published source code. And this is actually valuable work, unlike most proof of work chain workloads.
Yes, thank you for the validation! That's the core of what sparked this for me -- my cartoon picture of a blockchain is that it depends on problems that are hard to solve (improve this codebase) but easy to verify (loss went down).
Like you noted, it's also cool that this is valuable work (unlike most of those workloads).
I appreciate the optimizations you've laid out (such as zkVM proofs), but those feel optional compared to the basic scheme here?
And yeah -- what one _does_ with the crypto-credits is pretty open-ended. Like you said, drawing on the swarm for training or inference or whatever you need -- it feels like something one could use as a GPU battery of sorts. Most of my personal GPU work comes in bursts, but most of the time my GPU sits idle.
Most of the other GPU share-cropping ideas I've seen floating around lack any way to independently prove that work was done. Having a global metric for a shared target like this seems to supply what's been missing in a lot of other distributed systems.
Looking at the graph on the website, it looks like it's already got a bit of a scoreboard and independent verification / validation of results. Feels like it would be a relatively small jump to crowdsource this and put it into a formal blockchain.
But the next natural question is: Would we stand to gain anything by adding blockchain to this?
The objective is to train a small GPT language model to the lowest possible validation bits-per-byte (val_bpb) in 5-minute runs, using AI agents to autonomously iterate on the code. This builds on Karpathy's autoresearch: https://x.com/AustinBaggio/status/2031888719943192938?s=20
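For anyone unfamiliar with the metric: val_bpb is just validation cross-entropy converted from nats per token to bits per byte. A sketch, assuming the loss is mean nats per token and the token-to-byte ratio of the validation set is known:

```python
import math

def val_bpb(mean_loss_nats_per_token: float, tokens: int, total_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits per byte."""
    total_bits = mean_loss_nats_per_token * tokens / math.log(2)
    return total_bits / total_bytes

# For a byte-level model, 1 token == 1 byte, so bpb is just loss / ln 2.
```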
Yeah, the obvious workloads are training. I want to point this at RL next, but I think drug research is a really strong common-good target too. We were heavily inspired by Folding@home and BOINC.
We thought about storing all of the commits on Ensue too, but we wanted to match the spirit of Andrej's original design, which leans heavily on GitHub. Curious what you were looking for when trying to inspect the code?
I was hoping to see the code change the agent made! When I click the commit link, I expect to see it on GitHub (since it is a GitHub URL...), but the links don't seem to work; they take me to a GitHub 404. e.g. https://github.com/mutable-state-inc/autoresearch-at-home/co... I'm not sure what that has to do with Ensue, so I've probably misunderstood how this works.
I know it's a bit of a barrier... but I set one up on vast.ai really quickly and ran it for a day for the price of lunch. One of our teammates ran it from their old gaming PC too, and it still found novel strategies.
+1 to logging output. Not too sure what you mean by herald-style message passing, but it sounds like you've implemented subscribe logic from scratch, and each of your agents needs to be aware of domain boundaries and locks?
For most tasks, I agree. One agent with a good harness wins. The case for multiple agents is when the context required to solve the problem exceeds what one agent can hold. This Putnam problem needed more working context than fits in a single window. Decomposing into subgoals lets each agent work with a focused context instead of one agent suffocating on state. Ideally, multi-agent approaches shouldn't add more overall complexity, but there needs to be better tooling for observability, etc., as you describe.
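The decomposition argument above can be sketched as a packing problem: a task whose total context exceeds one window gets split into subgoals that each fit. Entirely illustrative (the subgoal names and token counts are made up):

```python
def plan(subgoals, window):
    """Greedily group subgoals so each group fits one agent's window.

    subgoals: list of (name, context_tokens) pairs.
    Returns a list of groups; each inner list is one agent's assignment.
    """
    assignments, current, used = [], [], 0
    for name, tokens in subgoals:
        if tokens > window:
            raise ValueError(f"subgoal {name} cannot fit any window")
        if used + tokens > window:
            assignments.append(current)
            current, used = [], 0
        current.append(name)
        used += tokens
    if current:
        assignments.append(current)
    return assignments
```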
I think about this a lot through the analogue of MoE: essentially a routing process where, instead of routing to expert submodels, you route to a human in the loop or to decision sub-tasks when the task requires it.
More specifically, we've been working on a memory/context observability agent. It's currently really good at understanding users and the wider memory space. It could help with the oversight, or at least the introspection part.
Yeah, I have seen those camps too. I think there will always be a set of problems whose complexity, measured by the amount of context that must be kept in working memory, needs more than one agent to reach a workable or optimal result. In single-player mode (dev + Claude Code) you'll hit these less frequently, but bigger cross-team, cross-codebase problems will need more complex agent coordination.