Efficient Learning for Product Attributes with Compact Multimodal Models
Mandar Kulkarni

TL;DR
This paper presents a label-efficient semi-supervised fine-tuning approach for compact vision-language models in e-commerce product attribute prediction, leveraging unlabeled data with minimal compute overhead.
Contribution
It introduces a novel semi-supervised fine-tuning method using PEFT and DPO that effectively utilizes unlabeled data for compact VLMs in e-commerce applications.
Findings
DPO-based fine-tuning outperforms supervised models on multiple verticals.
Accuracy improves as more unlabeled data is incorporated.
The method converges efficiently with minimal compute.
Abstract
Image-based product attribute prediction in e-commerce is a crucial task with numerous applications. Supervised fine-tuning of Vision Language Models (VLMs) is hard to scale because of the cost of manual or API-based annotation. In this paper, we investigate label-efficient semi-supervised fine-tuning strategies for compact VLMs (2B-3B parameters) that leverage unlabeled product listings through Direct Preference Optimization (DPO). Starting from a small labeled set annotated via an API, we first use parameter-efficient fine-tuning (PEFT) to train low-rank adapter modules. To update the adapter weights with unlabeled data, we generate multiple reasoning-and-answer chains per unlabeled sample and split these chains into preferred and dispreferred sets based on self-consistency. We then fine-tune the model with the DPO loss and use the updated model for the next iteration. By using PEFT fine-tuning…
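The self-consistency split and the DPO objective mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the majority-vote rule, the `min_agreement` threshold, the handling of unanimous samples, and the function names `split_by_self_consistency` and `dpo_loss` are all assumptions made for the example. The loss is the standard DPO form, computed here from precomputed sequence log-probabilities rather than from a real model.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple

# A chain is (reasoning text, final answer). Both fields are plain strings here.
Chain = Tuple[str, str]


def split_by_self_consistency(
    chains: List[Chain], min_agreement: float = 0.5
) -> Optional[Tuple[List[Chain], List[Chain]]]:
    """Split sampled chains into (preferred, dispreferred) by majority answer.

    Assumption: a chain is "preferred" when its answer matches the most common
    answer among all chains sampled for the same unlabeled item. Returns None
    when no answer exceeds `min_agreement` or when every chain agrees (there is
    then no dispreferred chain to contrast against).
    """
    if not chains:
        return None
    counts = Counter(answer for _, answer in chains)
    majority_answer, votes = counts.most_common(1)[0]
    if votes / len(chains) <= min_agreement:
        return None  # no sufficiently consistent answer
    preferred = [c for c in chains if c[1] == majority_answer]
    dispreferred = [c for c in chains if c[1] != majority_answer]
    if not dispreferred:
        return None  # unanimous: no preference signal
    return preferred, dispreferred


def dpo_loss(
    policy_logp_preferred: float,
    policy_logp_dispreferred: float,
    ref_logp_preferred: float,
    ref_logp_dispreferred: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(margin)).

    The margin is beta times the difference between the policy-vs-reference
    log-ratio of the preferred chain and that of the dispreferred chain.
    """
    margin = beta * (
        (policy_logp_preferred - ref_logp_preferred)
        - (policy_logp_dispreferred - ref_logp_dispreferred)
    )
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In a training loop, each preferred chain would typically be paired with a dispreferred chain from the same sample, and the log-probabilities would come from the adapter-equipped model and a frozen reference copy; how the paper forms those pairs is not stated in the text above.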
Taxonomy
Topics: Multimodal Machine Learning Applications · Text and Document Classification Technologies · Sentiment Analysis and Opinion Mining
