Glossary · Term

jailbreak

← all terms

Definition

Tricking a chatbot into doing something it was trained to refuse.

An adversarial prompting attack that bypasses a model's safety training to elicit prohibited outputs.

Also called: jailbreaks, jailbreaking

Mentioned in 6 episodes

  1. 049
    An AI Agent Reached for Root in Twelve Minutes, Without Being Attacked
  2. 045
    When a Frontier Model Talks Its Own Twin Into Climate Denial
  3. 044
    How One Sentence and a Forged History Flip the Most Aligned Models
  4. 039
    When Smarter Agents Get Fooled by Three Extra Nodes in a Database
  5. 030
    Why Your AI Agent Won't Stop Working — and Each Model Falls for a Different Trap
  6. 004
    The Sycophancy Circuit That Survives Alignment Training

Related concepts