Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge
Xiangyu Wu, Zhouyang Chi, Yang Yang, Jianfeng Lu

TL;DR
This paper describes a three-stage approach to visual question answering built on a multimodal pre-trained model, combining synthetic-data pre-training with fine-tuning; it placed second in the WSDM2023 challenge.
Contribution
The authors propose a novel three-stage training pipeline with synthetic data generation and post-processing for visual question answering tasks.
Findings
Achieved a score of 76.342, ranking second in the competition.
Designed a synthetic dataset for pre-training the model.
Developed a bounding box matching and replacing strategy.
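The summary does not say how the bounding box matching and replacing strategy works. The sketch below is only a hypothetical illustration of one common pattern for this kind of post-processing: measure the overlap between a predicted box and a set of candidate boxes with intersection-over-union (IoU), and swap in the best-matching candidate when the overlap is high enough. The function names, the `(x1, y1, x2, y2)` box format, the source of the candidate boxes, and the `threshold` value are all assumptions, not details taken from the paper.

```python
from typing import Optional, Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def replace_with_best_match(
    pred: Box, candidates: Sequence[Box], threshold: float = 0.5
) -> Box:
    """Hypothetical post-processing step (not the authors' method):
    return the candidate that overlaps `pred` the most if its IoU
    reaches `threshold`, otherwise keep `pred` unchanged."""
    best: Optional[Box] = max(
        candidates, key=lambda c: box_iou(pred, c), default=None
    )
    if best is not None and box_iou(pred, best) >= threshold:
        return best
    return pred
```

For example, `replace_with_best_match((0, 0, 10, 10), [(0, 0, 10, 9), (50, 50, 60, 60)])` would swap the prediction for the first candidate, since the two overlap heavily, while an empty candidate list leaves the prediction as it is.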
Abstract
In this paper, we present our solution for the WSDM2023 Toloka Visual Question Answering Challenge. Inspired by the application of multimodal pre-trained models to various downstream tasks (e.g., visual question answering, visual grounding, and cross-modal retrieval), we approached this competition as a visual grounding task: the input is an image and a question, and the model answers the question by marking the answer as a bounding box on the image. We designed a three-stage solution for this task. Specifically, we used the visual-language pre-trained model OFA as the foundation. In the first stage, we constructed a large-scale synthetic dataset similar to the competition dataset and coarse-tuned the model to learn generalized semantic information. In the second stage, we treated the competition task as a visual grounding task, loaded the weights from the previous stage,…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques
Methods: OFA
