The abstraction-level argument is spot on. I've been working on browser automation for AI agents, and the biggest lesson has been that exposing Playwright-level primitives to a foundation model is fundamentally the wrong interface: the model burns most of its context reasoning about DOM traversal and coordinate-based clicking instead of the actual task. The natural-language intent layer is the right call; it's basically treating the browser interaction as a tool-use problem where the tool itself is agentic.
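For concreteness, here's a minimal sketch of the contrast I mean; the type names and fields are hypothetical, not anything from the post or from Playwright itself:

```python
from dataclasses import dataclass

# Low-level interface: the planner model has to reason about selectors,
# coordinates, and waits itself -- most of its context goes to mechanics.
@dataclass
class ClickAction:
    selector: str            # e.g. "form#checkout button[type=submit]"
    x: float | None = None   # fallback coordinate click
    y: float | None = None

# Intent-level interface: the outer agent states *what* it wants; a
# specialised browsing model decides *how* to do it on the live page.
@dataclass
class BrowseIntent:
    intent: str              # e.g. "submit the checkout form for order #123"
    success_criteria: str    # what the page should look like afterwards
    constraints: str = ""    # e.g. "don't modify the shipping address"

# The outer agent's tool call shrinks to a single natural-language intent:
task = BrowseIntent(
    intent="find the cheapest refurbished ThinkPad under $400 and add it to the cart",
    success_criteria="cart page shows exactly one ThinkPad priced under $400",
)
```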
Curious about failure recovery, though: when the specialised browsing model misinterprets an intent (e.g. clicks the wrong "Submit" on a page with multiple forms), does the outer agent get enough signal to retry or reframe the instruction? That's been the hardest part in my experience; the error surface between "the browser did the wrong thing" and "I specified the wrong thing" is really blurry.
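What's helped on my side (purely my own pattern, not something from the post) is having the browsing tool return a structured outcome that separates "I couldn't do it" from "the instruction was ambiguous", so the outer agent knows whether to retry or reframe:

```python
from dataclasses import dataclass, field
from enum import Enum

class Outcome(Enum):
    SUCCESS = "success"               # success criteria verified on the page
    EXECUTION_FAILED = "exec_failed"  # browser tried and failed (timeout, missing element)
    AMBIGUOUS_INTENT = "ambiguous"    # page had multiple plausible targets

@dataclass
class BrowseResult:
    outcome: Outcome
    page_summary: str                 # short description of the final page state
    candidates: list[str] = field(default_factory=list)  # e.g. the competing "Submit" buttons seen

def next_step(result: BrowseResult) -> str:
    """Outer agent's policy: retry on execution failures, reframe on ambiguity."""
    if result.outcome is Outcome.EXECUTION_FAILED:
        return "retry with the same intent"
    if result.outcome is Outcome.AMBIGUOUS_INTENT:
        # Reframe using the candidates the browsing model surfaced, e.g.
        # "click the Submit button inside the billing form, not the newsletter form".
        return f"reframe the intent to disambiguate between: {', '.join(result.candidates)}"
    return "continue with the plan"
```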
I agree that there is communication overhead between the agents, although so far it looks like they can communicate very effectively and efficiently. We're also working on efficient ways to transfer more contextual information between them.