Breaking Instruction Hierarchy in OpenAI\'s gpt-4o-mini
Recently, OpenAI announced gpt-4o-mini and there are some interesting updates, including safety improvements regarding “Instruction Hierarchy”: OpenAI puts this in the light of “safety”, the word security is not mentioned in the announcement. Additionally, this The Verge article titled “OpenAI’s latest model will block the ‘ignore all previous instructions’ loophole” created interesting discussions on X, including a first demo bypass. I spent some time this weekend to get a better intuition about gpt-4o-mini model and instruction hierarchy, and the conclusion is that system instructions are still not a security boundary. More details here