Llguidance implements constrained decoding: for any output token sequence produced so far, you know the fixed set of tokens allowed as the next token. You prepare token masks so that, at each decoding step, sampling is limited to those tokens.
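A minimal sketch of that core step, using a toy vocabulary and made-up logits (real engines do this over the full tokenizer vocabulary on every step): disallowed tokens get -inf before softmax, so they can never be sampled.

```python
import math

# Toy vocabulary and model scores (values are illustrative).
vocab = ["{", "}", '"', "true", "false", " ", "hello"]
logits = [0.5, 1.2, 0.3, 2.0, 1.9, 0.1, 3.0]

def apply_mask(logits, allowed_ids):
    # Disallowed tokens get -inf so softmax assigns them probability 0.
    return [l if i in allowed_ids else -math.inf for i, l in enumerate(logits)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# At the start of a JSON object, only whitespace or '{' may be sampled.
allowed = {vocab.index("{"), vocab.index(" ")}
probs = softmax(apply_mask(logits, allowed))
```

After masking, all probability mass is redistributed over the allowed tokens; everything else, including the high-scoring 'hello', gets exactly zero.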
So if you expect a JSON object, the first token can only be whitespace or the token '{'. It gets more complex because tokenizers usually use byte pair encoding, which means they can represent any UTF-8 byte sequence. So if your current tokens are '{"enabled": ' and your JSON schema requires the 'enabled' field to be a boolean, the allowed-token mask can only contain whitespace tokens, the tokens 'true' and 'false', or the BPE tokens 't' and 'f' ('true' and 'false' are usually single tokens because they are so common).
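The boolean case can be sketched as prefix matching: a token is allowed if it is a prefix of some string the grammar still accepts at this position. This is a simplification (llguidance works at the byte level and handles multi-token continuations), but it shows why 't' and 'tr' make the mask while 'null' does not.

```python
# Toy vocabulary; valid continuations come from the grammar at this position.
vocab = ["true", "false", "t", "f", "tr", "x", " ", "null"]
valid_continuations = ["true", "false"]

def allowed_tokens(vocab, continuations):
    allowed = set()
    for tok in vocab:
        if tok.isspace():
            # Whitespace is still permitted before the value in JSON.
            allowed.add(tok)
        elif any(c.startswith(tok) for c in continuations):
            # Token is a prefix of at least one valid continuation.
            allowed.add(tok)
    return allowed

mask = allowed_tokens(vocab, valid_continuations)
```

After sampling 't', the grammar state advances and the next mask would only allow prefixes of the remaining suffix 'rue'.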
A JSON schema must first be converted into a grammar and then into token masks. This takes time to compute and quite a lot of space (the token masks need to be precomputed), so the result is usually cached for performance.
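One way to picture the caching, as a hypothetical sketch (the key and the table below are illustrative, not llguidance's actual API): the expensive mask computation is memoized per grammar state, so repeated visits to the same state reuse the precomputed mask.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def compute_mask(schema_id: str, grammar_state: str) -> frozenset:
    # In a real engine this would walk the whole tokenizer vocabulary
    # against the grammar state; here a tiny lookup table stands in for it.
    table = {
        ("bool_schema", "expect_value"): frozenset({"true", "false", " "}),
    }
    return table.get((schema_id, grammar_state), frozenset())

m1 = compute_mask("bool_schema", "expect_value")
m2 = compute_mask("bool_schema", "expect_value")  # served from the cache
```

Keying on the grammar state rather than the raw token prefix is what makes the cache effective: many different prefixes land in the same state.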
Each token affects the probabilities of subsequent tokens. Say you want the model to produce Python code, and you are using a grammar to force JSON output. The model wasn't trained on JSON-serialized Python code; it was trained on normal Python code with real newlines. Wouldn't forcing JSON impair output quality in this case?
https://github.com/guidance-ai/llguidance