Web-Scale Multimodal Summarization using CLIP-Based Semantic Alignment

Mounvik K; N Harshit

arXiv:2602.14889·cs.LG·February 17, 2026

Open Access

TL;DR

This paper presents a scalable multimodal summarization framework that combines web, news, and image search results, uses CLIP-based semantic alignment to rank retrieved images, and produces coherent summaries through a configurable pipeline.

Contribution

The paper introduces a lightweight, configurable pipeline that integrates retrieval, CLIP, and BLIP models for web-scale multimodal summarization, exposed through a Gradio-based API.
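
As a rough illustration of the retrieval stage, the sketch below runs three searches concurrently and collects the results by source. It is not the authors' implementation: the `PipelineConfig` fields and the `search_web`, `search_news`, and `search_images` functions are hypothetical stand-ins that return placeholder strings, where a real system would call actual search backends.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class PipelineConfig:
    # Hypothetical knobs illustrating per-source fetch limits.
    max_web: int = 5
    max_news: int = 5
    max_images: int = 8


# Placeholder retrievers (assumptions): real code would query search services.
def search_web(topic, limit):
    return [f"web:{topic}:{i}" for i in range(limit)]


def search_news(topic, limit):
    return [f"news:{topic}:{i}" for i in range(limit)]


def search_images(topic, limit):
    return [f"image:{topic}:{i}" for i in range(limit)]


def retrieve_all(topic, config):
    """Run the three searches concurrently and group results by source."""
    jobs = {
        "web": (search_web, config.max_web),
        "news": (search_news, config.max_news),
        "images": (search_images, config.max_images),
    }
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = {
            name: pool.submit(fn, topic, limit)
            for name, (fn, limit) in jobs.items()
        }
        # result() blocks until each search has finished.
        return {name: fut.result() for name, fut in futures.items()}


results = retrieve_all("solar energy", PipelineConfig(max_images=3))
print({name: len(items) for name, items in results.items()})
```

Threads suit this stage because search calls are typically I/O-bound; the image results would then be passed on to the ranking step.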

Findings

01

Achieved ROC-AUC of 0.9270 on image-caption pairs

02

F1-score of 0.6504 indicating effective semantic filtering

03

System demonstrates high accuracy of 96.99% in multimodal alignment
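
To put the accuracy figure in context, the sketch below computes accuracy and F1 from confusion-matrix counts and applies them to a trivial classifier that rejects every pair. It assumes the "20:1 contrastive negatives" mentioned in the abstract means 20 negative pairs per positive pair; the absolute counts are illustrative only, since the baseline accuracy depends on the ratio alone. None of the numbers below are results from the paper.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Assumed reading of "20:1": 20 negatives for every positive pair.
positives = 500
negatives = 20 * positives

# Trivial baseline: predict "not aligned" for every pair.
baseline_accuracy, _, _, baseline_f1 = metrics(
    tp=0, fp=0, fn=positives, tn=negatives
)
print(f"baseline accuracy = {baseline_accuracy:.4f}, F1 = {baseline_f1:.4f}")
```

Under this assumption the all-negative baseline already reaches about 95.24% accuracy (20/21) with an F1 of zero, which suggests that the ROC-AUC and F1 values are the more informative of the three reported metrics.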

Abstract

We introduce Web-Scale Multimodal Summarization, a lightweight framework for generating summaries by combining retrieved text and image data from web sources. Given a user-defined topic, the system performs parallel web, news, and image searches. Retrieved images are ranked using a fine-tuned CLIP model to measure semantic alignment with topic and text. Optional BLIP captioning enables image-only summaries for stronger multimodal coherence. The pipeline supports features such as adjustable fetch limits, semantic filtering, summary styling, and downloading structured outputs. We expose the system via a Gradio-based API with controllable parameters and preconfigured presets. Evaluation on 500 image-caption pairs with 20:1 contrastive negatives yields a ROC-AUC of 0.9270, an F1-score of 0.6504, and an accuracy of 96.99%, demonstrating strong multimodal alignment. This work provides a…
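
The ranking and semantic-filtering step can be sketched as a cosine-similarity comparison between a text embedding and each image embedding. This is a minimal illustration under stated assumptions, not the paper's code: the three-dimensional vectors are made-up stand-ins for CLIP embeddings, and `rank_images`, its `threshold`, and its `top_k` parameter are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_images(text_embedding, image_embeddings, threshold=0.0, top_k=None):
    """Score images against the text, drop weak matches, sort best first."""
    scored = [
        (name, cosine(text_embedding, emb))
        for name, emb in image_embeddings.items()
    ]
    kept = [(name, score) for name, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if top_k is None else kept[:top_k]


# Toy vectors standing in for CLIP text and image embeddings (assumption).
topic = [1.0, 0.0, 0.0]
images = {
    "panel.jpg": [0.9, 0.1, 0.0],
    "cat.jpg": [0.0, 1.0, 0.0],
    "farm.jpg": [0.6, 0.0, 0.8],
}

ranked = rank_images(topic, images, threshold=0.5)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Here the threshold plays the role of the semantic filter: `cat.jpg` scores 0 against the topic vector and is dropped, while the two remaining images are returned in descending order of similarity.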

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Biomedical Text Mining and Ontologies