An End-to-End Model for Photo-Sharing Multi-modal Dialogue Generation

Peiming Guo; Sinuo Liu; Yanzhao Zhang; Dingkun Long; Pengjun Xie; Meishan Zhang; Min Zhang

arXiv:2408.08650 · cs.CL · April 1, 2025

TL;DR

This paper introduces the first end-to-end model for photo-sharing multi-modal dialogue generation, integrating visual perception and image generation within a large language model to improve performance over pipeline approaches.

Contribution

The paper presents an end-to-end architecture that combines an image perceptron and an image generator with a large language model, enabling end-to-end gradient flow and improving multi-modal dialogue generation.

Findings

01

Achieves state-of-the-art results on PhotoChat and DialogCC datasets.

02

Outperforms pipeline models in text and image generation metrics.

03

Validates the effectiveness of end-to-end training for multi-modal dialogue tasks.

Abstract

Photo-sharing multi-modal dialogue generation requires a dialogue agent not only to generate text responses but also to share photos at the proper moment. Using image captions as a bridge, a pipeline model combines an image captioning model, a text generation model, and an image generation model to handle this complex multi-modal task. However, representing images with text captions may lose important visual details and cause error propagation in the dialogue system. Moreover, the pipeline model keeps the three models isolated from one another because discrete image captions hinder end-to-end gradient propagation. We propose the first end-to-end model for photo-sharing multi-modal dialogue generation, which integrates an image perceptron and an image generator with a large language model. The large language model employs the Q-Former to perceive visual images in…
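The abstract's claim that discrete captions hinder end-to-end gradient propagation can be illustrated with a toy sketch. This is not the paper's model: the scalar "captioner", the three-token vocabulary, the token values, and the soft (softmax-weighted) relaxation are all hypothetical stand-ins chosen only to show the effect. Gradients are estimated by finite differences so the example needs nothing beyond the standard library.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy "captioner": scores 3 caption tokens from a scalar image feature.
def caption_logits(w, image_feature):
    return [w * image_feature, 0.5, -w * image_feature]

# Hypothetical downstream "generator": one scalar value per caption token.
TOKEN_VALUES = [1.0, 2.0, 4.0]
TARGET = 3.0

def pipeline_loss(w, image_feature=1.0):
    # Pipeline: commit to a single discrete token (argmax) before the next stage.
    logits = caption_logits(w, image_feature)
    token = max(range(len(logits)), key=logits.__getitem__)
    return (TOKEN_VALUES[token] - TARGET) ** 2

def end_to_end_loss(w, image_feature=1.0):
    # Continuous stand-in: pass a probability-weighted mixture to the next stage.
    probs = softmax(caption_logits(w, image_feature))
    soft_value = sum(p * v for p, v in zip(probs, TOKEN_VALUES))
    return (soft_value - TARGET) ** 2

def finite_difference(f, w, eps=1e-4):
    # Central-difference estimate of df/dw.
    return (f(w + eps) - f(w - eps)) / (2 * eps)

grad_pipeline = finite_difference(pipeline_loss, 1.0)
grad_end_to_end = finite_difference(end_to_end_loss, 1.0)

print("pipeline gradient:", grad_pipeline)
print("end-to-end gradient:", grad_end_to_end)
```

In this toy setting the argmax step makes the loss piecewise constant in the captioner weight `w`, so the upstream stage receives no learning signal, while the continuous variant yields a non-zero gradient.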

Code & Models

Repositories

guopeiming/E2E_PSDG
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling

Methods: Diffusion · ALIGN