Task Me Anything
Jieyu Zhang, Weikai Huang, Zixian Ma, Oscar Michel, Dong He, Tanmay Gupta, Wei-Chiu Ma, Ali Farhadi, Aniruddha Kembhavi, Ranjay Krishna

TL;DR
Task-Me-Anything is a flexible benchmark generation engine that creates tailored, large-scale multimodal evaluation datasets to assess the specific strengths and weaknesses of large multimodal language models across various tasks.
Contribution
The paper introduces a novel, extendable system for generating customized multimodal benchmarks, addressing the challenge of selecting appropriate evaluations for specific applications.
Findings
Open-source MLMs excel in object and attribute recognition.
Models show weaknesses in spatial and temporal understanding.
Larger models generally perform better, with some exceptions.
Abstract
Benchmarks for large multimodal language models (MLMs) now serve to simultaneously assess the general capabilities of models instead of evaluating for a specific capability. As a result, when a developer wants to identify which models to use for their application, they are overwhelmed by the number of benchmarks and remain uncertain about which benchmark's results are most reflective of their specific use case. This paper introduces Task-Me-Anything, a benchmark generation engine which produces a benchmark tailored to a user's needs. Task-Me-Anything maintains an extendable taxonomy of visual assets and can programmatically generate a vast number of task instances. Additionally, it algorithmically addresses user queries regarding MLM performance efficiently within a computational budget. It contains 113K images, 10K videos, 2K 3D object assets, over 365 object categories, 655…
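To make the idea of programmatically generating task instances from a taxonomy concrete, here is a minimal sketch in Python. It is not the authors' implementation: the toy taxonomy, the question templates, and the `generate_tasks` helper are all hypothetical and only illustrate how combining taxonomy entries can yield many tasks from a few assets.

```python
from itertools import product

# Hypothetical toy taxonomy; entries are illustrative, not taken from the paper.
TAXONOMY = {
    "object": ["mug", "lamp", "chair"],
    "attribute": ["color", "material"],
    "relation": ["left of", "behind"],
}


def generate_tasks(taxonomy):
    """Yield one question per valid combination of taxonomy entries."""
    # Attribute questions: every object paired with every attribute.
    for obj, attr in product(taxonomy["object"], taxonomy["attribute"]):
        yield {"type": "attribute", "question": f"What is the {attr} of the {obj}?"}
    # Spatial questions: ordered pairs of distinct objects for each relation.
    for a, rel, b in product(
        taxonomy["object"], taxonomy["relation"], taxonomy["object"]
    ):
        if a != b:
            yield {"type": "spatial", "question": f"Is the {a} {rel} the {b}?"}


tasks = list(generate_tasks(TAXONOMY))
print(len(tasks), "tasks, e.g.:", tasks[0]["question"])
```

Even this small taxonomy produces 18 questions, which suggests why a budget-aware way of answering queries matters once the taxonomy grows to hundreds of categories.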
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: Focus
