FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation

Fan Yang; Yousong Zhu; Xin Li; Yufei Zhan; Hongyin Zhao; Shurong Zheng; Yaowei Wang; Ming Tang; Jinqiao Wang

arXiv:2506.16806 · cs.CV · September 23, 2025

Open Access

TL;DR

FOCUS introduces a unified vision-language model that integrates segmentation-aware perception with controllable object-centric image editing, enabling accurate understanding and high-quality, guided visual synthesis within an end-to-end framework.

Contribution

The paper presents FOCUS, an end-to-end LVLM that combines segmentation and generation modules, bridging perception and editing to improve multimodal understanding and controllable image synthesis.

Findings

01

Achieves state-of-the-art referring segmentation accuracy.

02

Demonstrates superior controllable image editing quality.

03

Effectively unifies perception and generation tasks.

Abstract

Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling, enabling both accurate content understanding and flexible editing. However, current approaches treat "what to see" and "how to edit" separately: they either perform isolated object segmentation or utilize segmentation masks merely as conditional prompts for local edit generation tasks, often relying on multiple disjointed models. To bridge these gaps, we introduce FOCUS, a unified LVLM that integrates segmentation-aware perception and controllable object-centric generation within an end-to-end framework. FOCUS employs a dual-branch visual encoder to simultaneously capture global semantic context and fine-grained spatial details. In addition, we leverage a MoVQGAN-based visual tokenizer to produce discrete visual tokens that enhance generation quality.…
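
As a rough illustration of what "discrete visual tokens" means in general, the sketch below shows the nearest-codebook lookup used by VQ-style tokenizers: each continuous feature vector is replaced by the index of its closest codebook entry. This is a generic vector-quantization example, not the FOCUS or MoVQGAN implementation; the function names (`quantize`, `dequantize`), the toy codebook, and the feature values are all assumptions made for illustration.

```python
import numpy as np


def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each feature vector, the index of its nearest codebook entry.

    features: (N, D) continuous vectors (hypothetical encoder outputs).
    codebook: (K, D) code vectors (a toy stand-in for a learned codebook).
    """
    # Pairwise differences via broadcasting: (N, 1, D) - (1, K, D) -> (N, K, D).
    diffs = features[:, None, :] - codebook[None, :, :]
    # Squared Euclidean distance from every feature to every code: (N, K).
    dists = np.sum(diffs ** 2, axis=-1)
    # The discrete token is the index of the closest code.
    return np.argmin(dists, axis=1)


def dequantize(token_ids: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Look the code vectors back up from their token indices."""
    return codebook[token_ids]


# Toy data, chosen only to make the lookup easy to follow.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
features = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7]])

token_ids = quantize(features, codebook)
reconstructed = dequantize(token_ids, codebook)
```

Here `token_ids` holds one integer per feature vector, which is the form a language model can consume or predict, and `reconstructed` shows the quantized approximation a decoder would start from.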

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Digital Humanities and Scholarship