LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content
Nimrod Shabtay, Felipe Maia Polo, Sivan Doveh, Wei Lin, M. Jehanzeb Mirza, Leshem Choshen, Mikhail Yurochkin, Yuekai Sun, Assaf Arbelle, Leonid Karlinsky, Raja Giryes

TL;DR
LiveXiv is a scalable, evolving benchmark based on arXiv papers that automatically generates visual question-answer pairs from scientific manuscripts to evaluate multi-modal models without data contamination.
Contribution
The paper introduces LiveXiv, a novel live benchmark for multi-modal models based on arXiv papers, with automatic VQA generation and an efficient evaluation method to assess model performance over time.
Findings
The benchmark is challenging for current multi-modal models.
Automatic annotations closely match manual ones (<2.5% variance).
Evaluation cost is significantly reduced by subset evaluation.
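To illustrate why evaluating on a subset can reduce cost, here is a minimal sketch that estimates a model's accuracy from a uniformly sampled subset of questions and reports an approximate standard error. This summary does not describe the paper's actual efficient evaluation method, so plain random subsampling is used as a stand-in; the function name `estimate_accuracy` and its parameters are hypothetical.

```python
import math
import random


def estimate_accuracy(correct_flags, subset_size, seed=0):
    """Estimate accuracy from a random subset of per-question results.

    correct_flags: list of booleans, one per benchmark question
                   (True if the model answered correctly).
    subset_size:   number of questions actually evaluated.
    Returns (estimate, approximate_standard_error).

    Assumption: simple random sampling without replacement. This is an
    illustrative stand-in, not the method used in the paper.
    """
    total = len(correct_flags)
    if not 0 < subset_size <= total:
        raise ValueError("subset_size must be between 1 and len(correct_flags)")

    rng = random.Random(seed)
    sample = rng.sample(correct_flags, subset_size)
    p = sum(sample) / subset_size

    # Binomial standard error with a finite-population correction,
    # which shrinks to zero when the whole benchmark is evaluated.
    fpc = (total - subset_size) / (total - 1) if total > 1 else 0.0
    se = math.sqrt(p * (1 - p) / subset_size * fpc)
    return p, se
```

Under this sketch, the standard error shrinks roughly with the square root of the subset size, so a modest subset can already give a usable estimate; the trade-off between subset size and precision is what any subset-based evaluation has to manage.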
Abstract
The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the required world knowledge to perform effectively on multiple downstream tasks. However, one downside of scraping data from the web can be the potential sacrifice of the benchmarks on which the abilities of these models are often evaluated. To safeguard against test data contamination and to truly test the abilities of these foundation models we propose LiveXiv: A scalable evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs (VQA). This is done without any human-in-the-loop, using the multi-modal content in the manuscripts, like graphs, charts, and tables. Moreover, we introduce an efficient evaluation approach…
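The idea of drawing manuscripts "at any given timestamp" to avoid contamination can be sketched as a simple date filter: keep only papers published after a model's training cutoff. The `Paper` record, its fields, and `papers_after_cutoff` are hypothetical names introduced for illustration; the summary does not specify how the benchmark selects papers.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    # Hypothetical record; field names are illustrative only.
    paper_id: str
    category: str
    published: date


def papers_after_cutoff(papers, cutoff, category=None):
    """Return papers published strictly after `cutoff`.

    Optionally restrict to a single category (domain). Papers published
    on or before the cutoff are excluded, since a model trained up to
    that date could have seen them.
    """
    return [
        p
        for p in papers
        if p.published > cutoff and (category is None or p.category == category)
    ]
```

For example, given a model whose training data ends on some date, passing that date as `cutoff` would leave only manuscripts the model could not have been trained on, under the assumption that publication dates are reliable.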
Taxonomy
Topics: Video Analysis and Summarization · Multimedia Communication and Technology · Scientific Computing and Data Management
