A Survey of Multimodal Composite Editing and Retrieval
Suyan Li, Fuxiang Huang, and Lei Zhang

TL;DR
This survey comprehensively reviews multimodal composite editing and retrieval, covering methods, applications, benchmarks, and future directions in integrating diverse data types like text, images, and audio for improved retrieval systems.
Contribution
It is the first comprehensive review of multimodal composite retrieval, filling a gap in existing literature on multimodal fusion and retrieval techniques.
Findings
Systematic organization of application scenarios and methods
Analysis of benchmarks and experimental results
Identification of future research directions
Abstract
In the real world, where information is abundant and diverse across different modalities, understanding and utilizing various data types to improve retrieval systems is a key focus of research. Multimodal composite retrieval integrates diverse modalities such as text, images, and audio to provide more accurate, personalized, and contextually relevant results. To facilitate a deeper understanding of this promising direction, this survey explores multimodal composite editing and retrieval in depth, covering image-text composite editing, image-text composite retrieval, and other multimodal composite retrieval. In this survey, we systematically organize the application scenarios, methods, benchmarks, experiments, and future directions. Multimodal learning is a hot topic in the large-model era, and there have also been some surveys on multimodal learning and vision-language models with…
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
