Web-Scale Multimodal Summarization using CLIP-Based Semantic Alignment
Mounvik K, N Harshit

TL;DR
This paper presents a scalable multimodal summarization framework that combines web, news, and image data using CLIP-based semantic alignment, enabling coherent summaries with adjustable features and strong alignment metrics.
Contribution
It introduces a lightweight, configurable pipeline integrating retrieval, CLIP, and BLIP models for web-scale multimodal summarization, with a user-friendly API.
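The retrieval stage described here can be sketched as three searches run concurrently. This is a minimal illustration, not the authors' implementation: the functions `search_web`, `search_news`, and `search_images`, the `retrieve` helper, and the `limits` dictionary are all hypothetical stand-ins for the real search backends and fetch-limit settings.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real web, news, and image search backends.
def search_web(topic, limit):
    return [f"web:{topic}:{i}" for i in range(limit)]

def search_news(topic, limit):
    return [f"news:{topic}:{i}" for i in range(limit)]

def search_images(topic, limit):
    return [f"image:{topic}:{i}" for i in range(limit)]

def retrieve(topic, limits):
    """Run the three searches concurrently and collect results by source."""
    sources = {"web": search_web, "news": search_news, "images": search_images}
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        # Submit every search first so they overlap, then wait for each result.
        futures = {
            name: pool.submit(fn, topic, limits.get(name, 5))
            for name, fn in sources.items()
        }
        return {name: fut.result() for name, fut in futures.items()}

results = retrieve("solar power", {"web": 3, "news": 2, "images": 4})
```

Threads suit this stage because real searches would spend most of their time waiting on the network; the per-source limits play the role of the adjustable fetch limits.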
Findings
ROC-AUC of 0.9270 on 500 image-caption pairs with 20:1 contrastive negatives
F1-score of 0.6504, indicating effective semantic filtering
Accuracy of 96.99% in multimodal alignment
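To put the accuracy and F1 figures in context, the sketch below works out what a trivial classifier would score under a 20:1 negative ratio. It assumes the 500 evaluation pairs are the positives and that each has 20 mismatched negatives; the summary does not state this split explicitly, and the helper functions `accuracy` and `f1` are illustrative only.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all pairs classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Assumption: 500 matched pairs, each with 20 mismatched negatives.
positives = 500
negatives = positives * 20

# Trivial classifier that rejects every pair.
baseline_accuracy = accuracy(tp=0, fp=0, tn=negatives, fn=positives)
baseline_f1 = f1(tp=0, fp=0, fn=positives)
print(f"{baseline_accuracy:.4f}", baseline_f1)
```

Under this assumed split, rejecting everything already gives roughly 95.2% accuracy with an F1 of 0, which suggests the ROC-AUC and F1 values are the more informative of the three reported metrics.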
Abstract
We introduce Web-Scale Multimodal Summarization, a lightweight framework for generating summaries by combining retrieved text and image data from web sources. Given a user-defined topic, the system performs parallel web, news, and image searches. Retrieved images are ranked using a fine-tuned CLIP model to measure semantic alignment with the topic and text. Optional BLIP captioning enables image-only summaries for stronger multimodal coherence. The pipeline supports features such as adjustable fetch limits, semantic filtering, summary styling, and downloading structured outputs. We expose the system via a Gradio-based API with controllable parameters and preconfigured presets. Evaluation on 500 image-caption pairs with 20:1 contrastive negatives yields a ROC-AUC of 0.9270, an F1-score of 0.6504, and an accuracy of 96.99%, demonstrating strong multimodal alignment. This work provides a…
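The ranking and semantic filtering steps can be illustrated with cosine similarity over embedding vectors. This is a sketch under stated assumptions: the three-dimensional vectors below are toy values standing in for CLIP text and image embeddings, and `rank_images` and its `threshold` parameter are hypothetical names, not the system's actual interface.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_images(topic_vec, image_vecs, threshold=0.0):
    """Score each image against the topic, drop low scores, sort best first."""
    scored = [(name, cosine(topic_vec, vec)) for name, vec in image_vecs.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Toy embeddings; a real pipeline would obtain these from CLIP encoders.
topic = [1.0, 0.0, 0.0]
images = {
    "a.jpg": [0.9, 0.1, 0.0],
    "b.jpg": [0.0, 1.0, 0.0],
    "c.jpg": [0.5, 0.5, 0.0],
}
ranked = rank_images(topic, images, threshold=0.3)
```

With these toy vectors, `b.jpg` is orthogonal to the topic and is filtered out, while `a.jpg` ranks ahead of `c.jpg`. Raising the threshold trades recall for precision in the retained image set.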
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Biomedical Text Mining and Ontologies
