TL;DR
This paper introduces DeepIntuit, a framework that moves open-instance video classification from imitation to intrinsic reasoning, using reinforcement learning and a calibration stage to improve generalization.
Contribution
DeepIntuit combines three components to improve open-instance video classification: a cold-start supervised alignment that initializes reasoning, refinement with Group Relative Policy Optimization (GRPO), and calibration.
Findings
DeepIntuit outperforms traditional models on open-instance tasks.
Intrinsic reasoning improves generalization over imitation-based methods.
The approach effectively transfers knowledge without distribution mismatch.
Abstract
Conventional video classification models, acting as effective imitators, excel in scenarios with homogeneous data distributions. However, real-world applications often present an open-instance challenge, where intra-class variations are vast and complex, beyond existing benchmarks. While traditional video encoder models struggle to fit these diverse distributions, vision-language models (VLMs) offer superior generalization but have not fully leveraged their reasoning capabilities (intuition) for such tasks. In this paper, we bridge this gap with an intrinsic reasoning framework that evolves open-instance video classification from imitation to intuition. Our approach, namely DeepIntuit, begins with a cold-start supervised alignment to initialize reasoning capability, followed by refinement using Group Relative Policy Optimization (GRPO) to enhance reasoning coherence through…
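As a rough illustration of the GRPO refinement step named in the abstract, the sketch below shows the group-relative advantage computation that GRPO is generally known for: several responses are sampled for the same input, and each response's reward is normalized against the mean and standard deviation of its own group. This is not taken from the DeepIntuit code. The function name, the binary reward scheme, the use of the population standard deviation, and the eps value are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for responses sampled from the SAME input.
    eps: small constant (assumed value) that avoids division by zero
         when every response in the group receives the same reward.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled reasoning traces on one video:
# 1.0 if the trace ended in the correct label, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every trace gets the same reward carries no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

Traces that beat their group average receive a positive advantage and the rest receive a negative one, so no separate value network is needed to provide a baseline.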
