EffiMiniVLM: A Compact Dual-Encoder Regression Framework

Yin-Loon Khor; Yi-Jie Wong; Yan Chai Hum

arXiv:2604.03172·cs.CV·April 6, 2026

TL;DR

EffiMiniVLM is a compact, resource-efficient vision-language regression model that achieves competitive product quality prediction while training on only 20% of the Amazon Reviews 2023 dataset and using no external data.

Contribution

The paper introduces EffiMiniVLM, a lightweight dual-encoder framework trained with a weighted Huber loss that uses rating counts to emphasize more reliable samples, improving efficiency and scalability in multimodal product quality prediction.
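
As a rough illustration of the dual-encoder idea, the sketch below fuses an image embedding and a text embedding and passes them through a small regression head. It is a minimal sketch under stated assumptions, not the paper's implementation: the encoders are replaced by precomputed embedding vectors, fusion by concatenation and the two-layer head are assumptions, and the names `linear`, `relu`, and `predict_rating` are hypothetical.

```python
def linear(x, weight, bias):
    """Apply a dense layer: one output per (row of weight, bias entry)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def predict_rating(image_emb, text_emb, w1, b1, w2, b2):
    """Hypothetical regression head over two precomputed embeddings.

    Assumption: the two embeddings are fused by simple concatenation,
    then mapped to a single scalar by a two-layer head.
    """
    fused = list(image_emb) + list(text_emb)   # late fusion by concatenation
    hidden = relu(linear(fused, w1, b1))       # small hidden layer
    return linear(hidden, w2, b2)[0]           # single scalar prediction


# Toy usage with tiny, made-up embeddings and weights.
image_emb = [1.0, 2.0]      # stand-in for an image encoder output
text_emb = [3.0]            # stand-in for a text encoder output
w1 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, -1.0]]
b1 = [0.0, 0.0]
w2 = [[2.0, 5.0]]
b2 = [0.5]
score = predict_rating(image_emb, text_emb, w1, b1, w2, b2)
```

In a real model the embeddings would come from trained image and text encoders, and the head weights would be learned jointly with them.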

Findings

01

Achieves a CES score of 0.40 with 27.7M parameters and 6.8 GFLOPs.

02

Is more resource-efficient than larger models, requiring 4-8x fewer resources.

03

Scaling the training data to 40% lets the model surpass larger models trained on more data.

Abstract

Predicting product quality from multimodal item information is critical in cold-start scenarios, where user interaction history is unavailable and predictions must rely on images and textual metadata. However, existing vision-language models typically depend on large architectures and/or extensive external datasets, resulting in high computational cost. To address this, we propose EffiMiniVLM, a compact dual-encoder vision-language regression framework that integrates an EfficientNet-B0 image encoder and a MiniLM-based text encoder with a lightweight regression head. To improve training sample efficiency, we introduce a weighted Huber loss that leverages rating counts to emphasize more reliable samples, yielding consistent performance gains. Trained using only 20% of the Amazon Reviews 2023 dataset, the proposed model contains 27.7M parameters and requires 6.8 GFLOPs, yet achieves a CES…
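
The weighted Huber loss described in the abstract can be sketched as follows. This is only an illustrative sketch: the abstract says rating counts are used to emphasize more reliable samples but does not give the weighting function, so the `log(1 + count)` weights, the weight normalization, the default `delta=1.0`, and the function names are all assumptions.

```python
import math


def huber(residual, delta=1.0):
    """Standard Huber loss: quadratic near zero, linear in the tails."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual * residual
    return delta * (a - 0.5 * delta)


def weighted_huber_loss(preds, targets, rating_counts, delta=1.0):
    """Weighted mean of per-sample Huber losses.

    Assumption: each sample's weight is log(1 + rating_count), so items
    with many ratings (more reliable average scores) count for more.
    """
    weights = [math.log1p(c) for c in rating_counts]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one sample needs a positive rating count")
    weighted_sum = sum(
        w * huber(p - t, delta)
        for w, p, t in zip(weights, preds, targets)
    )
    return weighted_sum / total_weight


# Toy usage: the second prediction is off by 2.0, but that item has only
# one rating, so its error is down-weighted relative to a plain mean.
loss = weighted_huber_loss(
    preds=[4.0, 4.0],
    targets=[4.0, 2.0],
    rating_counts=[1000, 1],
)
```

With these weights, an error on an item with a single rating contributes far less than the same error on an item with a thousand ratings, which is the intuition behind emphasizing more reliable samples.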
