You're right to point that out, and thank you. The questions around the ARC-AGI validation set are fair, and reproducibility matters.
My core argument in the post, however, is deliberately independent of that specific result. The architectural innovation is the breakthrough, and its success is already clearly demonstrated by the SOTA performance on the Sudoku and Maze benchmarks.
Those results show that the model's mechanism for deep, iterative reasoning works, and that's the story I wanted to tell. The inner workings are what matter.
Agreed on your last point. Adding language to this core reasoning engine is the next big step.
Just as with Alpaca, the answer lies in an instruction-following dataset: fine-tuning LLaMA on instruction data showed that a base model (like GPT-J) already knows the answers; it just had to be taught to understand the questions thoroughly before writing them.
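To make the analogy concrete, here's a minimal sketch of what Alpaca-style instruction-tuning data looks like. The field names and prompt template follow the published Alpaca format; the example record itself is made up for illustration:

```python
# One record from an Alpaca-style instruction-tuning dataset
# (the example content here is invented for illustration).
example = {
    "instruction": "Summarize the text in one sentence.",
    "input": "The model solves Sudoku and mazes by iterative refinement.",
    "output": "The model reasons iteratively to solve hard puzzles.",
}

# During fine-tuning, each record is rendered into a prompt; the model
# is trained to produce the "output" after the "### Response:" marker.
prompt = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    f"### Instruction:\n{example['instruction']}\n\n"
    f"### Input:\n{example['input']}\n\n"
    "### Response:\n"
)
```

The point of the analogy: none of this teaches the model new facts; it teaches it to map a question onto knowledge it already has before answering.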