This research presents a novel bypass technique targeting LLM-based WAFs, specifically LLaMA Prompt Guard. The attack exploits the WAF's tokenization: by subtly altering how the input is split into tokens, the attacker creates a mismatch between what the WAF's classifier sees and what the underlying application actually processes. It is essentially a parser differential attack adapted to language models, with token boundaries as the point of confusion. As a result, malicious payloads evade detection even though the harmful content reaches the application intact. The technique highlights a new challenge in securing LLM-based filtering systems: defenses built on traditional parsing assumptions can fail against targeted token manipulation.
Great research on a token confusion attack to bypass an LLM-based WAF (LLaMA Prompt Guard): https://t.co/7A3AFLqwJ6
It’s a clever trick: think parser differential attacks, but for LLMs.
By subtly changing tokenization, the attacker breaks how the WAF parses input, while the…
— Lord Steak (@Adrian__T) June 26, 2025
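To make the differential principle concrete, here is a minimal, self-contained sketch. It deliberately stands in a whitespace-tokenizing "WAF" with a hypothetical blocklist for the actual Prompt Guard classifier; all names in it (`waf_allows`, `app_view`, `BLOCKED_SEQUENCE`) are illustrative assumptions, not code from the research:

```python
import unicodedata

# Hypothetical blocked phrase, standing in for what a real classifier flags.
BLOCKED_SEQUENCE = ("ignore", "previous", "instructions")

def waf_allows(text: str) -> bool:
    """Toy WAF: tokenize the raw input on whitespace and scan for the
    blocked token sequence."""
    tokens = text.lower().split()
    n = len(BLOCKED_SEQUENCE)
    return not any(
        tuple(tokens[i:i + n]) == BLOCKED_SEQUENCE
        for i in range(len(tokens) - n + 1)
    )

def app_view(text: str) -> str:
    """Downstream application: strips zero-width/format characters
    (Unicode category Cf) before using the input, reconstructing the
    payload the WAF never saw as a contiguous token sequence."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

payload = "ignore previous instructions"
# Zero-width spaces (U+200B) shift the WAF's token boundaries without
# changing the string the application ends up processing.
evasion = "ign\u200bore prev\u200bious instruc\u200btions"

print(waf_allows(payload))           # False: raw payload is flagged
print(waf_allows(evasion))           # True:  token boundaries no longer match
print(app_view(evasion) == payload)  # True:  the app still sees the payload
```

The real attack targets Prompt Guard's subword tokenizer rather than a blocklist, but the failure mode is the same: the detector and the downstream consumer disagree about where token boundaries fall, so the detector scores a string the application never actually processes.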