One point of confusion might be that this is a tough but relatively *low-value* ...

One point of confusion might be that this is a tough but relatively low-value task (on a per-unit basis). The budget per item moderated is measured in small double-digit cents, but there's hundreds of thousands of items regularly being ingested.

FWIW – across all of these, we already do automated prompt rewriting, self-reflection, verification, and a suite of other things that help maximize reliability / quality, but those tokens add up quickly and being able to dynamically switch over to a smaller model without degrading performance improves margin substantially.

Fine-tuning is a non-starter for a number of reasons, but that's a much longer post.