Dynamic Double Space Tower

Weikai Sun; Shijie Song; Han Wang

arXiv:2506.11394·cs.CV·June 16, 2025

TL;DR

This paper introduces a dynamic bidirectional spatial tower to improve reasoning and spatial relationship understanding in Visual Question Answering, achieving state-of-the-art results with fewer parameters.

Contribution

It proposes a novel spatial tower architecture replacing attention mechanisms to enhance spatial reasoning in VQA models.

Findings

01

The module improves spatial relationship processing across models.

02

The July VQA model built with this method achieves state-of-the-art results.

03

Fewer parameters are needed for high performance.

Abstract

The Visual Question Answering (VQA) task requires the simultaneous understanding of image content and question semantics. However, existing methods often struggle with complex reasoning scenarios because cross-modal interaction is insufficient and the spatial relationships between entities in the image are poorly captured \cite{huang2023adaptive}\cite{liu2021comparing}\cite{guibas2021adaptive}\cite{zhang2022vsa}. We study a new approach that replaces the attention mechanism in order to enhance the model's reasoning ability and its understanding of spatial relationships. Specifically, we propose a dynamic bidirectional spatial tower, which is divided into four layers that observe the image according to the principles of human Gestalt vision. This naturally provides a powerful structural prior for the spatial organization between entities, enabling the model to no longer blindly search for…

Taxonomy

Topics: Structural Analysis and Optimization