AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce

Biao Zhang; Lixin Chen; Bin Zhang; Zongwei Wang; Tong Liu; Bo Zheng

arXiv:2604.20135·cs.CL·April 23, 2026


TL;DR

AFMRL is a two-stage framework that uses generative Multimodal Large Language Models to improve fine-grained product retrieval in E-commerce: generated product attributes guide contrastive learning, and retrieval-aware reinforcement learning in turn improves attribute generation.

Contribution

The paper proposes an attribute-enhanced learning framework that combines MLLM-based attribute generation with attribute-guided contrastive learning and retrieval-aware reinforcement learning to improve fine-grained multimodal representations.

Findings

01. Achieves state-of-the-art performance on large-scale E-commerce datasets.

02. Effectively filters false negatives using attribute-guided contrastive learning.

03. Enhances attribute generation with retrieval-aware reinforcement learning.

Abstract

Multimodal representation is crucial for E-commerce tasks such as identical product retrieval. Large representation models (e.g., VLM2Vec) demonstrate strong multimodal understanding capabilities, yet they struggle with fine-grained semantic comprehension, which is essential for distinguishing highly similar items. To address this, we propose Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning (AFMRL), which defines product fine-grained understanding as an attribute generation task. It leverages the generative power of Multimodal Large Language Models (MLLMs) to extract key attributes from product images and text, and enhances representation learning through a two-stage training framework: 1) Attribute-Guided Contrastive Learning (AGCL), where the key attributes generated by the MLLM are used in the image-text contrastive learning training process to identify hard…
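To make the false-negative filtering idea concrete, here is a minimal sketch of an image-to-text InfoNCE loss that drops in-batch negatives sharing the anchor's attributes. It is not the paper's implementation: the abstract above is truncated before it specifies how AGCL uses the attributes, so the matching rule (exact equality of attribute sets), the function name `masked_info_nce`, and the temperature value are all assumptions made for illustration.

```python
import numpy as np

def masked_info_nce(img_emb, txt_emb, attrs, temperature=0.07):
    """Image-to-text InfoNCE with attribute-based false-negative masking.

    Assumption (not from the paper): two products with identical attribute
    sets are treated as likely false negatives and excluded from the
    softmax denominator of each other's rows.
    """
    # Cosine similarity via L2-normalised embeddings.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    n = len(attrs)
    same = np.array([[attrs[i] == attrs[j] for j in range(n)] for i in range(n)])
    # Off-diagonal pairs with matching attributes are the suspected false negatives.
    false_neg = same & ~np.eye(n, dtype=bool)
    logits = np.where(false_neg, -np.inf, logits)

    # Numerically stable row-wise log-softmax; the diagonal is never masked,
    # so every row keeps at least one finite entry.
    row_max = logits.max(axis=1, keepdims=True)
    log_z = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_z
    return float(-np.mean(np.diag(log_prob))), false_neg

# Toy batch: products 0 and 1 share attributes, product 2 does not.
rng = np.random.default_rng(0)
img = rng.normal(size=(3, 4))
txt = rng.normal(size=(3, 4))
attrs = [frozenset({"red", "cotton"}), frozenset({"red", "cotton"}), frozenset({"blue"})]
loss, mask = masked_info_nce(img, txt, attrs)
```

In this toy batch, `mask[0, 1]` and `mask[1, 0]` are set, so products 0 and 1 no longer push each other apart. A softer rule, such as thresholding attribute overlap, would fit the same structure.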
