Definition
A benchmark that tests whether AI assistants ask for help at the right moments.
A human-in-the-loop evaluation suite penalizing both over-clarification and missed escalation in agentic interactions.
A benchmark that tests whether AI assistants ask for help at the right moments.
A human-in-the-loop evaluation suite penalizing both over-clarification and missed escalation in agentic interactions.