Topic · 6 episodes across 1 review

Can We Still Watch the Model Think? Oversight and Monitoring

← all reviews

A sobering cluster: the transcript often can't reveal misbehavior, chain-of-thought monitoring fails across languages, models can resist their own training, and more capability sometimes means less trustworthy outputs.

Covered in these reviews