MMTB: Evaluating Terminal Agents on Multimedia-File Tasks

Chiyeong Heo; Jaechang Kim; Junhyuk Kwon; Hoyoung Kim; Dongmin Park; Jonghyun Lee; Jungseul Ok

arXiv:2605.10966·cs.MM·May 13, 2026

TL;DR

This paper introduces MMTB, a benchmark of 105 multimedia-file tasks across 5 meta-categories, and Terminus-MM, a harness that extends Terminus-KIRA with audio and video perception. Together they are used to evaluate terminal agents on multimedia workflows.

Contribution

The paper presents MMTB and Terminus-MM, enabling systematic evaluation of terminal agents on multimedia tasks involving audio and video content.

Findings

01. Different multimedia access methods significantly affect task success.

02. Multimedia perception capabilities influence the evidence agents rely on.

03. The benchmark facilitates controlled studies of multimedia terminal agents.

Abstract

Terminals provide a powerful interface for AI agents by exposing diverse tools for automating complex workflows, yet existing terminal-agent benchmarks largely focus on tasks grounded in text, code, and structured files. However, many real-world workflows require practitioners to work directly with audio and video files. Working with such multimedia files calls for terminal agents not only to understand multimedia content, but also to convert auditory and visual evidence across related files into appropriate actions. To evaluate terminal agents on multimedia-file tasks, we introduce MultiMedia-TerminalBench (MMTB), a benchmark of 105 tasks across 5 meta-categories where terminal agents directly operate with audio and video files. Alongside MMTB, we propose Terminus-MM, a multimedia harness that extends Terminus-KIRA with audio and video perception for terminal agents. Together, MMTB and…
