# A visual question answering method based on task decomposition

**Authors:** Yao Cong, Hongwei Mo

PMC · DOI: 10.1371/journal.pone.0336623 · 2025-11-13

## TL;DR

This paper introduces a new visual question answering method that improves accuracy and reduces bias by decomposing tasks using natural language structure.

## Contribution

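Graph2Seq-TDN, a graph-to-sequence task decomposition network, uses the semantic structure of natural language to guide task decomposition in VQA. The paper also proposes a reasoning executor that supplements the original symbolic reasoning execution.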

## Key findings

- The proposed Graph2Seq-TDN outperforms existing methods in answering accuracy and program accuracy.
- At the same level of accuracy, the model incurs lower training costs than the comparison models.
- Validation on four datasets (CLEVR, CLEVR-Human, CLEVR-CoGenT, and GQA) shows improved performance over the comparison models.

## Abstract

Visual question answering (VQA) is an interdisciplinary task spanning computer vision and natural language processing. It evaluates a model's visual reasoning ability and requires integrating image information extraction with natural language understanding. Tests on professional benchmarks that control for potential bias indicate that VQA methods based on task decomposition are a promising approach: compared with traditional VQA methods that rely only on multimodal fusion, they offer interpretability at the program execution stage and reduce dependence on data bias. Such methods decompose the task by parsing natural language, usually with sequence-to-sequence networks. These parsers have limitations when faced with flexible and varied natural language, which makes it difficult to decompose the task accurately. To address this issue, we propose a Graph-to-Sequence Task Decomposition Network (Graph2Seq-TDN), which uses semantic structural information from natural language to guide the task decomposition process and improve parsing accuracy. For reasoning execution, in addition to the original symbolic reasoning execution, we propose a reasoning executor to enhance execution performance. We validated the model on four datasets: CLEVR, CLEVR-Human, CLEVR-CoGenT, and GQA. The experimental results show that our model outperforms the comparison models in answering accuracy, program accuracy, and training cost at the same accuracy.
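
To make the idea of task decomposition concrete, the sketch below shows how a question might be answered once it has been parsed into a program of symbolic steps. It is an illustration only, not the Graph2Seq-TDN parser or the paper's reasoning executor: the scene format, the step names (`scene`, `filter_<attribute>`, `query_<attribute>`, `count`, `exist`), and the `execute` function are all hypothetical, and the program is written by hand where the paper would have a learned parser produce it.

```python
from typing import Any, Dict, List

# Hypothetical CLEVR-style scene: each object is a dict of attributes.
SCENE: List[Dict[str, str]] = [
    {"shape": "cube", "color": "red", "size": "large"},
    {"shape": "sphere", "color": "blue", "size": "small"},
    {"shape": "cube", "color": "blue", "size": "small"},
]


def execute(program: List[str], scene: List[Dict[str, str]]) -> Any:
    """Run a linear program of symbolic steps over a scene.

    Each step is "op" or "op:argument". The step vocabulary here is an
    assumption made for illustration, not the paper's function set.
    """
    state: Any = None
    for step in program:
        op, _, arg = step.partition(":")
        if op == "scene":
            # Start from the full set of objects.
            state = list(scene)
        elif op.startswith("filter_"):
            # Keep only objects whose attribute matches the argument.
            attr = op[len("filter_"):]
            state = [obj for obj in state if obj[attr] == arg]
        elif op.startswith("query_"):
            # Read an attribute from a uniquely identified object.
            attr = op[len("query_"):]
            if len(state) != 1:
                raise ValueError(f"{op} expects exactly one object, got {len(state)}")
            state = state[0][attr]
        elif op == "count":
            state = len(state)
        elif op == "exist":
            state = len(state) > 0
        else:
            raise ValueError(f"unknown step: {step}")
    return state


# "How many blue cubes are there?" decomposed by hand into steps.
how_many_blue_cubes = ["scene", "filter_color:blue", "filter_shape:cube", "count"]
print(execute(how_many_blue_cubes, SCENE))  # 1

# "What shape is the red object?"
red_object_shape = ["scene", "filter_color:red", "query_shape"]
print(execute(red_object_shape, SCENE))  # cube
```

Because every intermediate `state` can be inspected, each step of the answer is traceable, which is the kind of interpretability at the program execution stage that the abstract attributes to decomposition-based methods.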

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Figures

42 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12614567/full.md

---
Source: https://tomesphere.com/paper/PMC12614567