Hacker News

Terminal Bench 2.0 just dropped, and a big success factor they stress is the hand-crafted, PhD-level rollout tests. They picked approximately 80 out of 120, with the incentive that anyone who contributed three would be listed as a paper author. This resulted in high-quality participation, comparable to foundation labs' proprietary agentic RL data, but it's FOSS.

