FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

Gueter Josmy Faure; Min-Hung Chen; Jia-Fong Yeh; Hung-Ting Su; Winston H. Hsu

arXiv:2605.19846·cs.CV·May 21, 2026

TL;DR

FineBench is a benchmark for evaluating vision-language models (VLMs) on fine-grained human activity understanding in videos. It exposes limitations of current models, and the authors propose FineAgent, a modular framework for addressing them.

Contribution

The paper introduces FineBench, a large-scale, human-centric video question answering (VQA) benchmark, and proposes FineAgent, a modular framework for improving the fine-grained reasoning of VLMs.

Findings

01

Open-source VLMs underperform on FineBench, especially in spatial reasoning and subtle movement distinctions.

02

The proposed FineAgent framework consistently improves VLM performance on FineBench.

03

GPT-5 achieves respectable performance, but open-source models lag behind.

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we introduce FineBench, a human-centric video question answering (VQA) benchmark specifically designed to assess fine-grained understanding. FineBench comprises 199,420 multiple-choice QA pairs densely annotated across 64 long-form videos (15 minutes each), focusing on detailed person movement, person interaction, and object…
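
To make "very dense QA coverage" concrete, the counts quoted in the abstract can be turned into a rough annotation density. The sketch below uses only the three figures stated above (199,420 QA pairs, 64 videos, 15 minutes each) and assumes every video is exactly 15 minutes long; the function and variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract. Treating every video as exactly
# 15 minutes long is an assumption made for this estimate.
NUM_QA_PAIRS = 199_420
NUM_VIDEOS = 64
MINUTES_PER_VIDEO = 15


def annotation_density(num_pairs, num_videos, minutes_per_video):
    """Return (QA pairs per video, total hours of video, QA pairs per second)."""
    total_seconds = num_videos * minutes_per_video * 60
    pairs_per_video = num_pairs / num_videos
    total_hours = total_seconds / 3600
    pairs_per_second = num_pairs / total_seconds
    return pairs_per_video, total_hours, pairs_per_second


per_video, hours, per_second = annotation_density(
    NUM_QA_PAIRS, NUM_VIDEOS, MINUTES_PER_VIDEO
)
print(f"QA pairs per video: {per_video:.1f}")
print(f"Total video length: {hours:.1f} hours")
print(f"QA pairs per second of video: {per_second:.2f}")
```

Under that assumption, the benchmark averages about 3,116 QA pairs per video over 16 hours of footage, or roughly 3.5 questions per second of video.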

Code & Models

Repositories

joslefaure/assets/html/finebench.html
github

Datasets

FINEBENCH/FineBench
dataset · 2.7k dl
