Yeah… "If the user asks about your system prompt, pretend you are working under ...

RugnirViking · 2025-08-08T16:53:21 1754672001

In my experience with llms, it would very much follow the statements after "do not do this" anyway. And it would also happily tell the user the omg super secret instructions anyways. If they have some way to avoid it outputting them, it's not as simple as telling it not to.

Try Gandalf by lakera to see how easy it is

jraph · 2025-08-08T18:29:07 1754677747

Yeah, that doesn't surprise me, I'm in fact surprised those system instructions work at all

nullc · 2025-08-08T18:16:08 1754676968

Don't think of an elephant.