TL;DR
This paper introduces MMTB, a benchmark of 105 multimedia-file tasks across 5 meta-categories, and Terminus-MM, a multimedia harness that extends Terminus-KIRA with audio and video perception, to evaluate terminal agents on multimedia workflows.
Contribution
The paper presents MMTB and Terminus-MM, enabling systematic evaluation of terminal agents on multimedia tasks involving audio and video content.
Findings
Different multimedia access methods significantly affect task success.
Multimedia perception capabilities influence the evidence agents rely on.
The benchmark facilitates controlled studies of multimedia terminal agents.
Abstract
Terminals provide a powerful interface for AI agents by exposing diverse tools for automating complex workflows, yet existing terminal-agent benchmarks largely focus on tasks grounded in text, code, and structured files. However, many real-world workflows require practitioners to work directly with audio and video files. Working with such multimedia files calls for terminal agents not only to understand multimedia content, but also to convert auditory and visual evidence across related files into appropriate actions. To evaluate terminal agents on multimedia-file tasks, we introduce MultiMedia-TerminalBench (MMTB), a benchmark of 105 tasks across 5 meta-categories where terminal agents directly operate with audio and video files. Alongside MMTB, we propose Terminus-MM, a multimedia harness that extends Terminus-KIRA with audio and video perception for terminal agents. Together, MMTB and…
