EffiMiniVLM: A Compact Dual-Encoder Regression Framework
Yin-Loon Khor, Yi-Jie Wong, Yan Chai Hum

TL;DR
EffiMiniVLM is a compact, resource-efficient vision-language regression model that achieves competitive product quality prediction using only 20% of the Amazon Reviews 2023 dataset and no external data.
Contribution
The paper introduces EffiMiniVLM, a lightweight dual-encoder framework trained with a rating-count-weighted Huber loss, improving efficiency and scalability in multimodal product quality prediction.
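To make the dual-encoder idea concrete, here is a minimal sketch of late fusion for regression: an image embedding and a text embedding are concatenated and passed through a single linear layer. The function name `regression_head`, the concatenation-based fusion, and the one-layer head are illustrative assumptions, not details confirmed by the paper.

```python
def regression_head(image_emb, text_emb, weights, bias):
    """Predict a scalar score from two modality embeddings.

    ASSUMPTION: fusion is plain concatenation followed by one linear layer.
    The paper's actual head may differ; this only illustrates the pattern.
    """
    # Late fusion: join the two embedding vectors end to end.
    fused = list(image_emb) + list(text_emb)
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused embedding length")
    # Linear regression over the fused vector.
    return sum(w * x for w, x in zip(weights, fused)) + bias


# Toy embeddings standing in for real encoder outputs (hypothetical values).
image_emb = [1.0, 2.0]
text_emb = [3.0]
score = regression_head(image_emb, text_emb, weights=[0.5, 0.5, 1.0], bias=0.1)
```

In a real system the two embeddings would come from the image and text encoders, and the head's weights would be learned jointly with them.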
Findings
Achieves a CES score of 0.40 with 27.7M parameters and 6.8 GFLOPs.
Outperforms larger models in resource efficiency, using 4-8x fewer resources.
Scaling the training data to 40% of the dataset surpasses larger models trained on more data.
Abstract
Predicting product quality from multimodal item information is critical in cold-start scenarios, where user interaction history is unavailable and predictions must rely on images and textual metadata. However, existing vision-language models typically depend on large architectures and/or extensive external datasets, resulting in high computational cost. To address this, we propose EffiMiniVLM, a compact dual-encoder vision-language regression framework that integrates an EfficientNet-B0 image encoder and a MiniLM-based text encoder with a lightweight regression head. To improve training sample efficiency, we introduce a weighted Huber loss that leverages rating counts to emphasize more reliable samples, yielding consistent performance gains. Trained using only 20% of the Amazon Reviews 2023 dataset, the proposed model contains 27.7M parameters and requires 6.8 GFLOPs, yet achieves a CES…
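The weighted Huber loss mentioned in the abstract can be sketched as follows. The abstract says only that rating counts are used to emphasize more reliable samples, so the specific weighting here, `log(1 + rating_count)` normalized by the sum of weights, is an assumption chosen for illustration, as are the function names and the default `delta`.

```python
import math


def huber(residual, delta=1.0):
    """Standard Huber penalty: quadratic near zero, linear in the tails."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def weighted_huber_loss(preds, targets, rating_counts, delta=1.0):
    """Weighted mean of per-sample Huber penalties.

    ASSUMPTION: weights grow as log(1 + rating_count). The paper's exact
    weighting function is not given in the abstract.
    """
    if not (len(preds) == len(targets) == len(rating_counts)):
        raise ValueError("inputs must have the same length")
    weights = [math.log1p(c) for c in rating_counts]
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one sample needs a positive rating count")
    penalties = [huber(p - t, delta) for p, t in zip(preds, targets)]
    return sum(w * l for w, l in zip(weights, penalties)) / total


# A product with 1000 ratings and a small error, and one with a single
# rating and a large error (hypothetical values).
preds = [4.0, 2.0]
targets = [4.5, 5.0]
loss = weighted_huber_loss(preds, targets, rating_counts=[1000, 1])
```

With these toy values, the sample backed by a single rating contributes much less to the loss than it would under a plain average, which is the intended effect of count-based weighting.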
