Towards Unifying Reference Expression Generation and Comprehension
Duo Zheng, Tao Kong, Ya Jing, Jiaan Wang, Xiaojie Wang

TL;DR
This paper introduces UniRef, a unified model for reference expression generation and comprehension that leverages a novel fusion layer and joint pre-training to improve performance on both tasks.
Contribution
The paper presents UniRef, a novel unified model with a specialized fusion layer and joint pre-training strategies for REG and REC tasks, addressing their interrelated challenges.
Findings
UniRef outperforms previous state-of-the-art methods on both REG and REC tasks
Effective fusion of image, region, and text improves task performance
Joint pre-training enhances the shared representation quality
Abstract
Reference Expression Generation (REG) and Comprehension (REC) are two highly correlated tasks. Modeling REG and REC simultaneously to exploit the relation between them is a promising way to improve both. However, their distinct inputs, together with the need to build connections between them in a single model, make the joint model challenging to design and train. To address these problems, we propose a unified model for REG and REC, named UniRef. It unifies the two tasks with a carefully designed Image-Region-Text Fusion layer (IRTF), which fuses the image, region and text via image cross-attention and region cross-attention. Additionally, IRTF can generate pseudo input regions for the REC task, giving REC and REG a uniform way to share an identical representation space. We further propose Vision-conditioned Masked Language Modeling (VMLM) and…
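The fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head attention with no learned projections, assumes the text stream attends to the image first and the region second, and adds plain residual connections. The names `cross_attention`, `fusion_layer`, and `random_features` are hypothetical, and the random feature lists stand in for real encoder outputs. For REC, the `region` argument would be a pseudo region produced by the model rather than a ground-truth one.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention (assumption: no learned projections).

    Each query row attends over keys_values, which serve as both keys and values.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(d)])
    return out


def add(x, y):
    """Element-wise sum of two equally shaped lists of rows (residual connection)."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, y)]


def fusion_layer(text, image, region):
    """Hypothetical fusion: text attends to image features, then to region features.

    The ordering and the residual connections are assumptions made for illustration.
    """
    h = add(text, cross_attention(text, image))   # image cross-attention
    h = add(h, cross_attention(h, region))        # region cross-attention
    return h


def random_features(n, d, rng):
    """Placeholder features standing in for encoder outputs."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]


rng = random.Random(0)
d = 8
text = random_features(5, d, rng)     # 5 text tokens
image = random_features(16, d, rng)   # 16 image patches
region = random_features(4, d, rng)   # 4 region tokens (a pseudo region for REC)
fused = fusion_layer(text, image, region)
```

The output keeps one fused vector per text token, which is what lets a single text stream serve both generation (REG) and comprehension (REC) in a design like this.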
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
