Terminal Bench 2.0 just dropped, and a big success factor they stress is the hand-crafted, PhD-level rollout tests. They picked approx. 80 out of 120, with the incentive that anyone who contributed 3 would get listed as a paper author. This resulted in high-quality participation, equivalent to foundation labs' proprietary agentic RL data, but it's FOSS.