Definition
A training signal that pays an agent only when its extra work actually moved the answer toward correct.
A reward formulation that compares an agent's full trajectory against a counterfactual shadow pass without certain interventions, granting credit only where the intervention is causally responsible for improved outcomes.