What about the preserve_most attribute? Is there any chance something like that will get into GCC? Without it, any non-tail call in an op function (a call into a slow path, say) forces register spills on the hot path, which ruins the interpreter.
In the meantime some inline assembly macro trickery might help.
edit: code duplication can also be avoided by templating your op function on a fast/slow parameter: the fast variant tail-calls the slow variant when it cannot take the fast path, and the slow-only code is guarded by the compile-time parameter. The downside, of course, is yet more code obfuscation.
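A rough sketch of what I mean, in C++17. Everything here is made up for illustration (the VM struct, the op_add name, the saturating slow path, the MUSTTAIL macro); it assumes Clang's [[clang::musttail]] for the guaranteed tail call and falls back to a plain return elsewhere, where the tail call is only an optimization.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Guaranteed tail call on Clang; on other compilers this expands to nothing
// and the tail call is left to the optimizer (an assumption of this sketch).
#if defined(__clang__) && defined(__has_cpp_attribute)
#  if __has_cpp_attribute(clang::musttail)
#    define MUSTTAIL [[clang::musttail]]
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL
#endif

// Hypothetical interpreter state.
struct VM {
    std::int64_t acc;
    bool overflowed;
};

// One source body, two instantiations. Slow == false is the hot variant and
// contains no slow code at all; Slow == true carries the slow path.
template <bool Slow>
std::int64_t op_add(VM* vm, std::int64_t operand) {
    const std::int64_t max = std::numeric_limits<std::int64_t>::max();
    const std::int64_t min = std::numeric_limits<std::int64_t>::min();

    // Shared fast path: the sum is representable, so just add.
    const bool fits = operand >= 0 ? vm->acc <= max - operand
                                   : vm->acc >= min - operand;
    if (fits) {
        vm->acc += operand;
        return vm->acc;
    }

    if constexpr (Slow) {
        // Slow-only code, compiled into the slow instantiation only.
        // Saturating here is an arbitrary stand-in for "the slow thing".
        vm->overflowed = true;
        vm->acc = operand >= 0 ? max : min;
        return vm->acc;
    } else {
        // The fast variant never calls the slow code in non-tail position;
        // it hands over to the slow instantiation with the same signature.
        MUSTTAIL return op_add<true>(vm, operand);
    }
}

// Usage: the dispatcher would only ever reference the fast instantiation.
inline void demo() {
    VM vm{40, false};
    std::int64_t r = op_add<false>(&vm, 2);
    assert(r == 42 && !vm.overflowed);
    (void)r;
}
```

The point is that op_add<false> has no ordinary calls in it, so nothing needs to be spilled around a call, while the slow logic still lives in the same function body in the source.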