
I haven't thought about it deeply, but I guess it's about letting the model easily distinguish the prompt from the conversation. Models seem to get confused by escaping, which is fair enough; escaping is genuinely confusing. It's true that in the transformer architecture the prompt and the conversation sit in the same token stream. However, you could do something like activate a special input neuron only for prompt tokens, or give the prompt a fixed size (e.g. a fixed-length prefix), and then do a bunch of adversarial training to punish the model whenever it confuses the prompt with the conversation :)
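For what it's worth, here's a rough sketch of the "special input neuron" idea as a learned segment embedding in PyTorch. All names (PromptTaggedEmbedding, the shapes, etc.) are made up for illustration; the point is just that prompt tokens get an extra learned vector added to their embeddings, so the prompt/conversation distinction is baked into the input rather than relying on escaping:

```python
import torch
import torch.nn as nn

class PromptTaggedEmbedding(nn.Module):
    """Token embedding plus a learned "is this a prompt token?" signal."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        # Two segment types: 0 = conversation tokens, 1 = prompt tokens.
        self.seg = nn.Embedding(2, d_model)

    def forward(self, token_ids: torch.Tensor, is_prompt: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) ints; is_prompt: (batch, seq_len) 0/1 mask.
        return self.tok(token_ids) + self.seg(is_prompt)

# Toy usage: the first 4 positions are the prompt, the rest is conversation.
emb = PromptTaggedEmbedding(vocab_size=32000, d_model=512)
ids = torch.randint(0, 32000, (1, 10))
mask = torch.tensor([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0]])
x = emb(ids, mask)  # feed x into the transformer layers as usual
```

A fixed-size prefix would be even simpler (the model could learn the boundary from position alone), but the tagged-embedding version works for prompts of any length.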