This is information-theoretically guaranteed to make LLM output worse.
My reasoning is simple: the only way to watermark text is to inject some relatively low-entropy signal into it, which can be detected later. This has to a) work for "all" output, for some values of all, and b) have a low false positive rate on the detection side. For those reasons, the amount of signal involved cannot be subtle.
That signal has a subtractive effect on the predictive-output signal. The entropy of the output is fixed by the entropy of natural language, so this is a zero-sum game: the watermark signal will remove fidelity from the predictive output.
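For concreteness, one commonly discussed family of schemes biases sampling toward a pseudo-random "green list" of tokens at each step and later tests for a statistically implausible excess of green tokens. The sketch below is purely illustrative, not a scheme anyone in this thread named: the vocabulary, the constants (GREEN_FRACTION, BIAS), and the helper names are all invented stand-ins. It only shows where such a signal would live and why detection needs a strong statistical excess to keep false positives low.

```python
import hashlib
import math
import random

VOCAB = [f"tok{i}" for i in range(1000)]  # toy stand-in for a real tokenizer vocab
GREEN_FRACTION = 0.5                      # fraction of vocab favored at each step (made up)
BIAS = 2.5                                # logit boost added to green-list tokens (made up)

def green_list(prev_token: str) -> set[str]:
    """Pseudo-randomly partition the vocab, seeded by the previous token."""
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return set(rng.sample(VOCAB, int(len(VOCAB) * GREEN_FRACTION)))

def sample_watermarked(prev_token: str, logits: dict[str, float]) -> str:
    """Boost green-list logits, then sample. The boost is the injected signal."""
    greens = green_list(prev_token)
    adjusted = {t: (v + BIAS if t in greens else v) for t, v in logits.items()}
    total = sum(math.exp(v) for v in adjusted.values())
    r, acc = random.random() * total, 0.0
    for tok, v in adjusted.items():
        acc += math.exp(v)
        if acc >= r:
            return tok
    return tok  # fallback for floating-point edge cases

def detect(tokens: list[str]) -> float:
    """z-score of the green-token count; a high z means 'almost certainly watermarked'."""
    hits = sum(1 for prev, cur in zip(tokens, tokens[1:]) if cur in green_list(prev))
    n = len(tokens) - 1
    expected = n * GREEN_FRACTION
    sd = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (hits - expected) / sd
```

Every step where the boost flips the sample away from the model's preferred token is exactly the fidelity cost being argued about here.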
you are correct if we suppose we are at a global optimum. however, consider this example:
i have two hands
i have 2 hands
these sentences communicate the same thing, but one could be a watermarked result. we can apply this kind of equivalent-meaning word/phrase substitution many times over and be confident something is watermarked while having avoided any semantic shifts.
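A toy way to picture that reply: treat each meaning-preserving choice as one hidden bit. Everything below (the SLOTS table, embed, extract) is invented for illustration; a real system would need a keyed mapping and far more slots, but the point is that the bits ride on stylistic variation rather than semantics.

```python
# Each slot is a pair of interchangeable surface forms; picking index 0 or 1
# silently encodes one bit without changing what the sentence means.
SLOTS = [("two", "2"), ("use", "utilize"), ("cannot", "can't")]

def embed(text: str, bits: list[int]) -> str:
    """Rewrite the text, choosing the variant dictated by the next payload bit."""
    out, i = [], 0
    for word in text.split():
        slot = next((pair for pair in SLOTS if word in pair), None)
        if slot is not None and i < len(bits):
            out.append(slot[bits[i]])
            i += 1
        else:
            out.append(word)
    return " ".join(out)

def extract(text: str) -> list[int]:
    """Recover one bit for every slot whose variant appears in the text."""
    bits = []
    for word in text.split():
        for pair in SLOTS:
            if word in pair:
                bits.append(pair.index(word))
    return bits

stego = embed("i have two hands and cannot clap", [1, 0])
print(stego)           # "i have 2 hands and cannot clap"
print(extract(stego))  # [1, 0]
```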
You're not wrong, but natural language has a lot of stylistic "noise" that can be used as a subliminal channel without noticeably degrading the semantic signal.
This is impossible to avoid or fix.