FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Hung-Ting Su, Winston H. Hsu

TL;DR
FineBench is a comprehensive benchmark for evaluating vision-language models on fine-grained human activity understanding in videos, revealing current limitations and proposing a modular enhancement framework called FineAgent.
Contribution
The paper introduces FineBench, a large-scale human-centric video VQA benchmark, and proposes FineAgent, a modular framework to improve VLMs' fine-grained reasoning capabilities.
Findings
Open-source VLMs underperform on FineBench, especially on spatial reasoning and on distinguishing subtle movements.
The proposed FineAgent framework consistently improves VLM performance on FineBench.
GPT-5 achieves respectable performance, but open-source models lag behind.
Abstract
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object…
