An End-to-End Model for Photo-Sharing Multi-modal Dialogue Generation
Peiming Guo, Sinuo Liu, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Min Zhang

TL;DR
This paper introduces the first end-to-end model for photo-sharing multi-modal dialogue generation, integrating visual perception and image generation within a large language model to improve performance over pipeline approaches.
Contribution
The paper presents a novel end-to-end architecture that combines an image perceptron and image generator with a large language model, enabling gradient flow and improved multi-modal dialogue generation.
Findings
Achieves state-of-the-art results on the PhotoChat and DialogCC datasets.
Outperforms pipeline models in text and image generation metrics.
Validates the effectiveness of end-to-end training for multi-modal dialogue tasks.
Abstract
Photo-sharing multi-modal dialogue generation requires a dialogue agent not only to generate text responses but also to share photos at the proper moment. Using image text captions as the bridge, a pipeline model integrates an image caption model, a text generation model, and an image generation model to handle this complex multi-modal task. However, representing images with text captions may lose important visual details and cause error propagation in the complex dialogue system. Besides, the pipeline model keeps the three models isolated from one another because discrete image text captions hinder end-to-end gradient propagation. We propose the first end-to-end model for photo-sharing multi-modal dialogue generation, which integrates an image perceptron and an image generator with a large language model. The large language model employs the Q-Former to perceive visual images in…
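The point about discrete captions hindering gradient propagation can be illustrated with a toy numerical sketch. This is not the paper's model: the two functions, their constants, and all names below are hypothetical, chosen only to contrast a discrete bridge (an argmax over token scores, standing in for a text caption) with a continuous bridge (a feature passed on directly). The gradient is estimated by central finite differences so the example needs only the standard library.

```python
# Toy illustration only; every name and constant here is hypothetical.

def discrete_bridge(x, w):
    """Pipeline-style path: score two 'caption tokens', keep only the argmax id."""
    scores = [w * x, 1.0 - w * x]
    token = 0 if scores[0] >= scores[1] else 1  # discrete choice, like emitting a caption token
    decoded = [2.0, -1.0]                       # downstream stage sees only the token id
    return decoded[token]

def continuous_bridge(x, w):
    """End-to-end-style path: pass the continuous feature straight to the next stage."""
    feature = w * x
    return 3.0 * feature

def finite_diff_grad(f, x, w, eps=1e-4):
    """Central finite-difference estimate of d f / d w."""
    return (f(x, w + eps) - f(x, w - eps)) / (2 * eps)

x, w = 1.0, 0.9
grad_discrete = finite_diff_grad(discrete_bridge, x, w)
grad_continuous = finite_diff_grad(continuous_bridge, x, w)

print("gradient through discrete bridge:  ", grad_discrete)
print("gradient through continuous bridge:", grad_continuous)
```

In this sketch a small change in `w` does not change the selected token, so the output of the discrete path is locally constant and carries no learning signal back to `w`, whereas the continuous path yields a non-zero gradient. That is the intuition behind replacing the caption bridge with components that can be trained jointly.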
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling
Methods: Diffusion · ALIGN
