Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge

Xiangyu Wu; Zhouyang Chi; Yang Yang; Jianfeng Lu

arXiv:2407.04255·cs.CV·July 8, 2024

TL;DR

This paper describes a three-stage approach to visual question answering built on a multimodal pre-trained model, combining synthetic-data pre-training with fine-tuning; it placed second in the WSDM2023 challenge.

Contribution

The authors propose a novel three-stage training pipeline with synthetic data generation and post-processing for visual question answering tasks.

Findings

01 Achieved a score of 76.342, ranking second in the competition.

02 Designed a synthetic dataset for pre-training the model.

03 Developed a bounding box matching and replacing strategy.

Abstract

In this paper, we present our solution for the WSDM2023 Toloka Visual Question Answering Challenge. Inspired by the application of multimodal pre-trained models to various downstream tasks (e.g., visual question answering, visual grounding, and cross-modal retrieval), we approached this competition as a visual grounding task: given an image and a question, the model answers the question by marking the answer as a bounding box on the image. We designed a three-stage solution for this task. Specifically, we used the visual-language pre-trained model OFA as the foundation. In the first stage, we constructed a large-scale synthetic dataset similar to the competition dataset and coarse-tuned the model to learn generalized semantic information. In the second stage, we treated the competition task as a visual grounding task, loaded the weights from the previous stage,…
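Because the answer is expressed as a bounding box, predictions in visual grounding are commonly compared with intersection over union (IoU). The sketch below is a minimal illustration of that comparison, plus one possible reading of a "match and replace" post-processing step. It is not the authors' implementation: this page does not describe how their matching works, so the function names `iou` and `match_and_replace`, the candidate-box input, and the 0.5 threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

# A box is (left, top, right, bottom) in pixel coordinates (assumed layout).
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a box; degenerate boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def match_and_replace(pred: Box, candidates: List[Box],
                      threshold: float = 0.5) -> Box:
    """Hypothetical post-processing: if some candidate box overlaps the
    prediction strongly enough, return that candidate; otherwise keep
    the original prediction. The threshold value is an assumption."""
    best = max(candidates, key=lambda c: iou(pred, c), default=None)
    if best is not None and iou(pred, best) >= threshold:
        return best
    return pred
```

For example, `match_and_replace((0, 0, 10, 10), [(1, 1, 10, 10), (50, 50, 60, 60)])` would swap in the first candidate, since it overlaps the prediction far more than the assumed threshold, while an empty candidate list leaves the prediction unchanged.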

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques

Methods: OFA