Auto-Parsing Network for Image Captioning and Visual Question Answering
Xu Yang, Chongyang Gao, Hanwang Zhang, Jianfei Cai

TL;DR
The paper introduces an Auto-Parsing Network that leverages probabilistic graphical models within Transformer architectures to discover hierarchical structures in input data, enhancing performance in image captioning and visual question answering tasks.
Contribution
It presents a novel method to implicitly parse hierarchical structures in vision-language tasks using PGM-constrained self-attention layers within Transformers.
Findings
The paper reports improved performance on image captioning and VQA tasks.
The network discovers hidden hierarchical structures in the input during inference.
The PGM-constrained layers give Transformer models a hierarchical parsing capability.
Abstract
We propose an Auto-Parsing Network (APN) to discover and exploit the input data's hidden tree structures to improve the effectiveness of Transformer-based vision-language systems. Specifically, we impose a Probabilistic Graphical Model (PGM), parameterized by the attention operations, on each self-attention layer to incorporate a sparsity assumption. We use this PGM to softly segment an input sequence into a few clusters, where each cluster can be treated as the parent of the entities inside it. By stacking these PGM-constrained self-attention layers, the clusters in a lower layer compose into a new sequence, and the PGM in a higher layer further segments this sequence. Iteratively, a sparse tree can be implicitly parsed, and this tree's hierarchical knowledge is incorporated into the transformed embeddings, which can be used for solving the target vision-language tasks. Specifically,…
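The abstract does not give the exact form of the PGM, so the following is only a rough sketch of the general idea of soft segmentation inside self-attention, not the authors' formulation. It assumes a hypothetical per-neighbor "link probability" (the chance that two adjacent positions share a cluster), turns those into a soft mask, and rescales ordinary attention weights by that mask. The names soft_segment_mask, merge_links, and constrained_self_attention, and the rule used to grow clusters in a higher layer, are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_segment_mask(link_prob):
    # link_prob[i] is the (assumed) probability that positions i and i+1
    # belong to the same cluster. mask[i, j] is the product of the link
    # probabilities between i and j, so it is 1 on the diagonal and decays
    # whenever a likely cluster boundary lies between the two positions.
    p = np.clip(link_prob, 1e-9, 1.0)
    cum = np.concatenate([[0.0], np.cumsum(np.log(p))])
    return np.exp(-np.abs(cum[:, None] - cum[None, :]))

def merge_links(prev_link_prob, new_link_prob):
    # Assumed stacking rule: a higher layer can only strengthen links,
    # so clusters from the lower layer merge rather than split.
    return prev_link_prob + (1.0 - prev_link_prob) * new_link_prob

def constrained_self_attention(x, w_q, w_k, w_v, link_prob):
    # Standard scaled dot-product attention, rescaled by the soft mask
    # and renormalized so each row still sums to 1.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores) * soft_segment_mask(link_prob)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ v, attn

# Toy run: 4 tokens, 8-dim embeddings. Link probabilities are set by hand
# here; in a trained model they would be predicted from the inputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (0.1 * rng.normal(size=(8, 8)) for _ in range(3))

# Lower layer: two clusters, {0, 1} and {2, 3}.
links_low = np.array([1.0, 0.0, 1.0])
out_low, attn_low = constrained_self_attention(x, w_q, w_k, w_v, links_low)

# Higher layer: the middle link is strengthened, so the two clusters
# start to merge. Reusing the same weights is a simplification.
links_high = merge_links(links_low, np.array([0.0, 0.9, 0.0]))
out_high, attn_high = constrained_self_attention(out_low, w_q, w_k, w_v, links_high)
```

In this toy setup the lower layer lets each token attend almost only within its own cluster, while the higher layer lets attention cross the former boundary, which mirrors the bottom-up composition the abstract describes.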
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Softmax · Byte Pair Encoding · Multi-Head Attention · Dropout · Dense Connections
