DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution

Yuzhong Zhao; Feng Liu; Yue Liu; Mingxiang Liao; Chen Gong; Qixiang Ye; Fang Wan

arXiv:2405.16071·cs.CV·March 4, 2025

TL;DR

DynRefer is a resolution-adaptive approach for region-level multimodal tasks. It mimics the resolution adaptability of human visual cognition to improve accuracy on tasks such as region-level captioning and recognition, and reports state-of-the-art results.

Contribution

The paper presents DynRefer, a novel method that dynamically adapts resolution during training and inference for improved region-level multimodal task performance.

Findings

01. Improves accuracy in region-level captioning and recognition.

02. Achieves state-of-the-art results on multiple tasks.

03. Mimics the resolution adaptability of human visual cognition.

Abstract

One fundamental task of multimodal models is to translate referred image regions into human-preferred language descriptions. Existing methods, however, ignore the resolution adaptability needs of different tasks, which hinders them from finding precise language descriptions. In this study, we propose DynRefer, an approach that pursues high-accuracy region-level referring by mimicking the resolution adaptability of human visual cognition. During training, DynRefer stochastically aligns language descriptions of multimodal tasks with images of multiple resolutions, which are constructed by nesting a set of random views around the referred region. During inference, DynRefer selectively performs multimodal referring by sampling proper region representations for tasks from the nested views based on image and task priors. This allows the visual information for referring to better match human…
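The abstract describes the training images as a set of random views nested around the referred region. The following is a minimal sketch of one way such views could be generated, not the authors' implementation: the function name `nested_views`, the linear interpolation between the region box and the full image, and the choice to always include the tight region crop are all assumptions made for illustration.

```python
import random


def nested_views(box, image_size, num_views=3, rng=None):
    """Return crop boxes nested around `box`, ordered from tightest to widest.

    box: (x0, y0, x1, y1) of the referred region, in pixels.
    image_size: (width, height) of the full image.

    Assumption (not from the paper): each view linearly interpolates
    between the region box (t = 0) and the full image (t = 1), with the
    interpolation factors t drawn uniformly at random.
    """
    rng = rng or random.Random()
    x0, y0, x1, y1 = box
    width, height = image_size

    # The first view is the tight region crop; the others use random,
    # sorted factors so that each view contains the previous one.
    ts = [0.0] + sorted(rng.random() for _ in range(num_views - 1))

    views = []
    for t in ts:
        views.append((
            x0 * (1 - t),            # left edge moves toward 0
            y0 * (1 - t),            # top edge moves toward 0
            x1 + (width - x1) * t,   # right edge moves toward the image width
            y1 + (height - y1) * t,  # bottom edge moves toward the image height
        ))
    return views


if __name__ == "__main__":
    # Hypothetical region inside a 200x100 image.
    for view in nested_views((40, 30, 80, 90), (200, 100), num_views=4):
        print(view)
```

In a pipeline of this kind, each crop would typically be resized to a common input size before encoding, so tighter views carry more pixels per unit of region area and wider views carry more context. How DynRefer selects among views at inference time depends on the image and task priors mentioned in the abstract and is not sketched here.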

Code & Models

Repositories

callsys/dynrefer (PyTorch, official)

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training