TLDR: Token-Level Detective Reward Model for Large Vision Language Models

Deqing Fu; Tong Xiao; Rui Wang; Wang Zhu; Pengchuan Zhang; Guan Pang; Robin Jia; Lawrence Chen

arXiv:2410.04734 · cs.LG · February 26, 2025

Open Access

TL;DR

This paper introduces TLDR, a token-level reward model for large vision-language models that provides fine-grained feedback, improves model performance, and accelerates human annotation processes.

Contribution

The paper proposes a novel token-level reward model with a perturbation-based training method, enhancing interpretability and utility in vision-language tasks.
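The perturbation-based idea can be illustrated with a minimal sketch. It is not the authors' implementation: the swap table, the `perturb` and `token_labels` helpers, and the whitespace tokenization are all hypothetical choices made for illustration. The sketch builds a synthetic negative by editing a correct caption, then derives one label per token by aligning the perturbed text against the original.

```python
from difflib import SequenceMatcher


def perturb(tokens, swaps):
    """Build a synthetic negative by replacing tokens via a hypothetical swap table."""
    return [swaps.get(tok, tok) for tok in tokens]


def token_labels(original, perturbed):
    """Label each perturbed token: 1 if it aligns with the original, 0 if it was altered."""
    labels = [0] * len(perturbed)
    matcher = SequenceMatcher(a=original, b=perturbed, autojunk=False)
    for block in matcher.get_matching_blocks():
        # Every token inside a matching block is unchanged from the original.
        for j in range(block.b, block.b + block.size):
            labels[j] = 1
    return labels


# Assumed example caption; whitespace splitting stands in for a real tokenizer.
original = "a brown dog sits on the grass".split()
negative = perturb(original, {"brown": "black", "grass": "sofa"})
labels = token_labels(original, negative)

for tok, lab in zip(negative, labels):
    print(f"{tok}\t{lab}")
```

In contrast to a single binary score for the whole caption, the per-token labels mark exactly which words were altered, which is the kind of fine-grained signal a token-level reward model is trained to predict.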

Findings

01. TLDR improves model performance significantly.

02. It assists in self-correction of generation outputs.

03. It speeds up human annotation by 3 times.

Abstract

Although reward models have been successful in improving multimodal large language models, the reward models themselves remain brutal and contain minimal information. Notably, existing reward models only mimic human annotations by assigning only one binary feedback to any text, no matter how long the text is. In the realm of multimodal language models, where models are required to process both images and texts, a naive reward model may learn implicit biases toward texts and become less grounded in images. In this paper, we propose a Token-Level Detective Reward Model (TLDR) to provide fine-grained annotations to each text token. We first introduce a perturbation-based method to generate synthetic hard negatives and their token-level labels to train TLDR models. Then we show the rich usefulness of TLDR models both in assisting…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings