Harmonizing Visual Text Comprehension and Generation

Zhen Zhao; Jingqun Tang; Binghong Wu; Chunhui Lin; Shu Wei; Hao Liu; Xin Tan; Zhizhong Zhang; Can Huang; Yuan Xie

arXiv:2407.16364·cs.CV·October 24, 2024

TL;DR

This paper introduces TextHarmony, a unified multimodal model that both comprehends and generates visual text. It uses Slide-LoRA to harmonize vision and language generation within a single model instance, and it improves visual text generation with a new image caption dataset, DetailedTextCaps-100K.

Contribution

The paper proposes Slide-LoRA, which dynamically aggregates modality-specific and modality-agnostic LoRA experts so that a single model instance can handle both visual text comprehension and generation. It also introduces the DetailedTextCaps-100K dataset for improved visual text generation.
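
To make the idea of aggregating LoRA experts concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the class names (`LoRAExpert`, `SlideLoRALayer`), the sigmoid gate, and the choice of one text expert, one image expert, and one always-on shared expert are assumptions made for illustration, and the random weights merely stand in for trained parameters.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for trained weights in this demo."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LoRAExpert:
    """Low-rank update: delta(x) = B @ (A @ x), with A (r x d_in), B (d_out x r)."""

    def __init__(self, d_in, d_out, rank, rng):
        self.A = rand_matrix(rank, d_in, rng)
        # Standard LoRA starts B at zero; random here so the demo is non-trivial.
        self.B = rand_matrix(d_out, rank, rng)

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))


class SlideLoRALayer:
    """Hypothetical layer: frozen base weight plus three gated LoRA experts."""

    def __init__(self, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        self.W = rand_matrix(d_out, d_in, rng)  # frozen base projection
        self.text_expert = LoRAExpert(d_in, d_out, rank, rng)    # modality-specific
        self.image_expert = LoRAExpert(d_in, d_out, rank, rng)   # modality-specific
        self.shared_expert = LoRAExpert(d_in, d_out, rank, rng)  # modality-agnostic
        self.gate_w = [rng.uniform(-0.1, 0.1) for _ in range(d_in)]

    def gate(self, x):
        """Sigmoid gate in (0, 1); values near 1 favour the text expert (assumed form)."""
        z = sum(w * v for w, v in zip(self.gate_w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def forward(self, x):
        g = self.gate(x)
        base = matvec(self.W, x)
        text = self.text_expert(x)
        image = self.image_expert(x)
        shared = self.shared_expert(x)
        # The gate slides between the two modality-specific experts;
        # the shared expert always contributes.
        return [b + g * t + (1.0 - g) * i + s
                for b, t, i, s in zip(base, text, image, shared)]


layer = SlideLoRALayer(d_in=8, d_out=4, rank=2)
x = [0.5] * 8
y = layer.forward(x)
print(len(y), round(layer.gate(x), 3))
```

In this sketch the base weight stays fixed and only the small expert matrices and the gate would be trained, which is why adapter-style designs add few parameters relative to the base model. The gating used in the paper may differ in form and granularity.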

Findings

01. Achieves performance comparable to modality-specific fine-tuning with only a 2% increase in parameters.

02. Improves visual text comprehension tasks by an average of 2.5%.

03. Improves visual text generation tasks by an average of 4.0%.

Abstract

In this work, we present TextHarmony, a unified and versatile multimodal generative model proficient in comprehending and generating visual text. Simultaneously generating images and texts typically results in performance degradation due to the inherent inconsistency between vision and language modalities. To overcome this challenge, existing approaches resort to modality-specific data for supervised fine-tuning, necessitating distinct model instances. We propose Slide-LoRA, which dynamically aggregates modality-specific and modality-agnostic LoRA experts, partially decoupling the multimodal generation space. Slide-LoRA harmonizes the generation of vision and language within a singular model instance, thereby facilitating a more unified generative process. Additionally, we develop a high-quality image caption dataset, DetailedTextCaps-100K, synthesized with a sophisticated closed-source…

Code & Models

Repositories

bytedance/textharmony (PyTorch, official)
