Toloka Visual Question Answering Benchmark
Dmitry Ustalov, Nikita Pavlichenko, Sergey Koshelev, Daniil Likhobaba, Alisa Smirnova

TL;DR
This paper introduces Toloka Visual Question Answering, a large crowdsourced dataset for grounding visual question answering, enabling comparison of machine learning models against human performance, and reports that current models lag behind non-expert crowdsourcing.
Contribution
The paper presents a new dataset for grounding visual question answering and evaluates baseline models, highlighting the gap between machine and human performance.
Findings
No machine learning model outperformed non-expert crowdsourcing.
The dataset contains 45,199 image-question pairs with ground truth bounding boxes.
A multi-phase competition attracted 48 participants worldwide.
Abstract
In this paper, we present Toloka Visual Question Answering, a new crowdsourced dataset allowing comparing performance of machine learning systems against human level of expertise in the grounding visual question answering task. In this task, given an image and a textual question, one has to draw the bounding box around the object correctly responding to that question. Every image-question pair contains the response, with only one correct response per image. Our dataset contains 45,199 pairs of images and questions in English, provided with ground truth bounding boxes, split into train and two test subsets. Besides describing the dataset and releasing it under a CC BY license, we conducted a series of experiments on open source zero-shot baseline models and organized a multi-phase competition at WSDM Cup that attracted 48 participants worldwide. However, by the time of paper submission,…
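To make the task concrete: a prediction is a bounding box, and it has to be compared against the ground truth box for the same image-question pair. The metric is not stated in the text above, so the sketch below assumes intersection over union (IoU), a common choice for scoring box overlap. The names `Box` and `iou`, and the (left, top, right, bottom) pixel layout, are hypothetical and are not taken from the dataset's files.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates (assumed layout, not the dataset schema)."""
    left: float
    top: float
    right: float
    bottom: float


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box.right - box.left) * max(0.0, box.bottom - box.top)


def iou(predicted: Box, truth: Box) -> float:
    """Intersection over union of two boxes, in the range [0, 1]."""
    # Width and height of the overlapping rectangle, clamped at zero
    # so that disjoint boxes yield an empty intersection.
    inter_w = max(0.0, min(predicted.right, truth.right) - max(predicted.left, truth.left))
    inter_h = max(0.0, min(predicted.bottom, truth.bottom) - max(predicted.top, truth.top))
    intersection = inter_w * inter_h

    union = area(predicted) + area(truth) - intersection
    # Two empty boxes have no union; treat that as no overlap.
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    truth = Box(0, 0, 2, 2)
    print(iou(Box(0, 0, 2, 2), truth))  # identical boxes
    print(iou(Box(1, 1, 3, 3), truth))  # partial overlap
    print(iou(Box(5, 5, 6, 6), truth))  # disjoint boxes
```

Because each image-question pair has exactly one correct response, a per-pair score like this can simply be averaged over a test subset to compare a model with crowd annotators.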
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
