Glossary · Term

semantic hijacking

Definition

Plain language

An attack that doesn't smuggle in any instructions — it just tells the AI a convincing story.

As stated in the literature

A class of multi-agent attacks where adversarial payloads embed malicious requests inside operationally plausible narratives (e.g., fabricated incident reports) without any explicit instruction-injection tricks, exploiting auditor confidence.

Why it matters: It shifts the attack surface from prompt strings to plausible workplace stories, which existing defenses largely fail to catch.

For example, an attacker wraps a request to delete logs inside a fake incident report, and the auditor agent — finding nothing syntactically suspicious — carries it out.

Heard on the show

“The authors call this class of attack semantic hijacking, and the construction is genuinely clever.”

Episode 058 — Why Upgrading Your AI Auditor to a Smarter Model Can Make Your System Less Safe

Mentioned in 1 episode

058
Why Upgrading Your AI Auditor to a Smarter Model Can Make Your System Less Safe

Related terms

agent