Definition
A benchmark that tests AI agents on real Android phone apps.
A mobile-device agent benchmark covering ~20 apps and ~116 tasks with state-based verification of cross-app workflows.
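
State-based verification can be illustrated with a minimal sketch. This is not the benchmark's actual code or API: `DeviceState`, `verify_cross_app_task`, and the contact-plus-message task are hypothetical names invented for illustration. The idea shown is that a checker inspects the final state of each app involved rather than the sequence of actions the agent took.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceState:
    """Hypothetical snapshot of app data read from the device after an episode."""
    contacts: set = field(default_factory=set)
    sent_messages: list = field(default_factory=list)  # (recipient, body) pairs


def verify_cross_app_task(state: DeviceState, name: str, body: str) -> bool:
    """Pass only if both apps ended in the expected state.

    The agent's action trace is never inspected, so any sequence of taps
    that produces this final state counts as success.
    """
    contact_saved = name in state.contacts
    message_sent = (name, body) in state.sent_messages
    return contact_saved and message_sent


# Illustrative final state for a "save a contact, then message them" workflow.
final_state = DeviceState(
    contacts={"Alice"},
    sent_messages=[("Alice", "Running late")],
)
print(verify_cross_app_task(final_state, "Alice", "Running late"))
```

Checking end state instead of matching a reference action sequence means an agent that reaches the goal by an unexpected route still passes, while one that completes only half of a cross-app workflow fails.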