On Discriminative vs. Generative classifiers: Rethinking MLLMs for Action Understanding

Zhanzhong Pang; Dibyadip Chatterjee; Fadime Sener; Angela Yao

arXiv:2603.02546·cs.CV·March 4, 2026

Open Access · 3 Reviews

TL;DR

This paper compares generative and discriminative classifiers in multimodal large language models for action understanding, proposing a hybrid GAD approach that improves accuracy and efficiency over existing methods.

Contribution

It introduces the GAD classifier that combines generative and discriminative approaches, enhancing performance while maintaining efficiency in action understanding tasks.

Findings

1. Discriminative classifiers outperform generative ones in both accuracy and efficiency for closed-set action understanding with MLLMs.
2. GAD improves accuracy by 2.5% and triples inference speed on the COIN benchmark.
3. GAD achieves state-of-the-art results across multiple datasets.

Abstract

Multimodal Large Language Models (MLLMs) have advanced open-world action understanding and can be adapted as generative classifiers for closed-set settings by autoregressively generating action labels as text. However, this approach is inefficient, and shared subwords across action labels introduce semantic overlap, leading to ambiguity in generation. In contrast, discriminative classifiers learn task-specific representations with clear decision boundaries, enabling efficient one-step classification without autoregressive decoding. We first compare generative and discriminative classifiers with MLLMs for closed-set action understanding, revealing the superior accuracy and efficiency of the latter. To bridge the performance gap, we design strategies that elevate generative classifiers toward performance comparable with discriminative ones. Furthermore, we show that generative modeling…
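As a rough illustration of the two issues the abstract raises for generative classifiers (multi-step decoding and tokens shared across labels), here is a minimal Python sketch. The label set is hypothetical, whitespace splitting stands in for a real subword tokenizer, and the extra end-of-sequence step is an assumption; none of these come from the paper.

```python
from collections import defaultdict

# Hypothetical closed-set labels; not taken from any dataset in the paper.
LABELS = ["open door", "open fridge", "close door", "close fridge"]


def tokenize(label):
    # Assumption: whitespace splitting as a stand-in for a subword tokenizer.
    return label.split()


def shared_tokens(labels):
    """Return tokens that occur in two or more labels, with the labels using them."""
    index = defaultdict(list)
    for label in labels:
        for tok in set(tokenize(label)):
            index[tok].append(label)
    return {tok: sorted(ls) for tok, ls in index.items() if len(ls) > 1}


def decoding_steps(label):
    # Assumption: one autoregressive step per token, plus one end-of-sequence step.
    return len(tokenize(label)) + 1


overlap = shared_tokens(LABELS)
steps = {label: decoding_steps(label) for label in LABELS}
```

In this toy set every token is shared by two labels, so no single generated token identifies the class on its own, and each label takes three decoding steps. A discriminative head would instead score all four labels in one forward pass.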

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The paper's greatest strength is its thorough, apples-to-apples comparison between generative and discriminative approaches across a wide range of tasks and datasets.
- The implementation details in the main paper and appendix are extensive.

Weaknesses

- Using a discriminative head on top of a generative model and training it with an auxiliary task is a standard practice in machine learning. The conceptual contribution is limited.
- The paper lacks a deep analysis of how the auxiliary generative task helps.
- The GAD framework is only evaluated in closed-set settings. Its applicability to open-world or few-shot scenarios is not discussed, limiting the scope of its claimed generality.
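The pattern Reviewer 01 refers to, a discriminative head trained alongside an auxiliary generative objective, can be sketched generically as a weighted sum of two cross-entropy losses. This is an assumed, textbook-style formulation and not the paper's GAD objective, which this page does not specify; `joint_loss` and `aux_weight` are hypothetical names.

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    # Negative log-probability of the target index.
    return -math.log(softmax(logits)[target])


def joint_loss(class_logits, class_target, token_logits, token_targets, aux_weight=0.5):
    """Hypothetical joint objective: classification loss plus a weighted auxiliary
    next-token loss averaged over the label's tokens."""
    disc = cross_entropy(class_logits, class_target)
    gen = sum(
        cross_entropy(logits, tgt) for logits, tgt in zip(token_logits, token_targets)
    ) / len(token_targets)
    return disc + aux_weight * gen


# Toy inputs: 4 action classes, a 2-token label over a 2-word vocabulary.
loss = joint_loss([0.0, 0.0, 0.0, 0.0], 0, [[0.0, 0.0], [0.0, 0.0]], [0, 1])
```

With uniform logits the classification term is log 4 and the auxiliary term is log 2, so the toy loss equals 2.5 · log 2.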

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The paper is very well written and presented. The motivation is clear and the paper is properly threaded. The authors identify specific challenges of generative models for action classification, and enumerate and study the possible reasons behind this subpar performance w.r.t. discriminative methods. Then, the authors devise a combined approach that is shown to produce better results. The paper is accompanied by proper ablation studies and an efficiency analysis, illustrating how discriminative

Weaknesses

While the paper is well threaded and motivated, part of the story is driven towards obvious conclusions that leave aside some potential alternatives (please see the questions below). In particular:

1. The fact that discriminative approaches, or closed-set approaches, outperform generative ones, is not a finding or contribution of this paper, it is general knowledge. It is not expected that open-vocabulary models will outperform discriminative approaches in closed-set environments. The narrativ

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Strong results across multiple datasets with multiple baselines with focus on both accuracy and speed.
2. Interesting dual training strategy.
3. Clear presentation of key idea (e.g. great use of diagrams in Figure 2).

Weaknesses

1. "approximately 1.8× faster training": Discuss this more - how is the training faster than next-token prediction? This is unclear.
2. Lost generality: does this classification training remove the general QnA abilities of these VLMs? This is unclear.
3. How do these VLMs perform zero-shot (generative classifier) on these tasks? This will provide an interesting point of comparison.
4. Inference compute scaling: if you generate one or two tokens (with the generative part of GAD), and then ru

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Explainable Artificial Intelligence (XAI)