HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task
Yu Tian, Tianqi Shao, Tsukasa Demizu, Xuyang Wu, Hsin-Tai Wu

TL;DR
This paper introduces HPE-CogVLM, a novel vision language model framework that significantly improves head pose estimation accuracy by integrating object detection and HPE tasks through a specialized model merging technique.
Contribution
The paper presents a new LoRA layer-based merging method for VLMs that effectively combines object detection and head pose estimation capabilities, overcoming previous limitations.
Findings
Achieves a 31.5% reduction in mean absolute error (MAE) over the state-of-the-art CNN model
Outperforms direct LoRA fine-tuning and task arithmetic merging methods
Successfully integrates HPE and object detection in a unified VLM framework
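To make the MAE figure concrete, here is a minimal sketch of how a head pose MAE and a relative reduction between two models could be computed. The function names, the angle wrapping, and the toy values are illustrative assumptions; they are not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
from statistics import mean


def angular_error(pred_deg, true_deg):
    """Absolute difference between two angles in degrees, wrapped to [0, 180].

    Wrapping is an assumption of this sketch, not something the summary states.
    """
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def head_pose_mae(preds, targets):
    """MAE averaged over all samples and over the three angles (yaw, pitch, roll)."""
    errors = [
        angular_error(p, t)
        for pred, target in zip(preds, targets)
        for p, t in zip(pred, target)
    ]
    return mean(errors)


def relative_reduction(baseline_mae, new_mae):
    """Percentage by which new_mae is lower than baseline_mae."""
    return 100.0 * (baseline_mae - new_mae) / baseline_mae


# Hypothetical toy data: (yaw, pitch, roll) in degrees. Not results from the paper.
preds = [(10.0, -5.0, 2.0), (-30.0, 12.0, 0.0)]
targets = [(12.0, -4.0, 2.0), (-27.0, 10.0, 1.0)]

mae = head_pose_mae(preds, targets)
print(f"MAE: {mae:.2f} degrees")
print(f"Reduction from 4.0 to 3.0: {relative_reduction(4.0, 3.0):.1f}%")
```

A reported reduction such as the one above is relative: it compares the new model's MAE to the baseline's MAE, so it says nothing on its own about the absolute error in degrees.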
Abstract
Head pose estimation (HPE) requires a sophisticated understanding of 3D spatial relationships to generate precise yaw, pitch, and roll angles. Previous HPE models, primarily CNN-based, rely on cropped close-up human head images as inputs and often lack robustness in real-world scenarios. Vision Language Models (VLMs) can analyze entire images while focusing on specific objects through their attention mechanisms. In this paper, we propose a novel framework to improve HPE accuracy by leveraging the object detection grounding capability of a VLM, referred to as CogVLM. We empirically find that direct LoRA fine-tuning of this VLM for the HPE task fails to achieve desirable HPE accuracy, while some model merging methods can improve accuracy but frequently produce blended, invalid response formats, struggling to handle both object detection and HPE tasks simultaneously. To integrate HPE…
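For readers unfamiliar with model merging, the following is a minimal sketch of generic task arithmetic, the baseline merging approach named in the findings. It is not the paper's LoRA layer-based method. The parameter names, the plain-list weights, and the `scale` value are hypothetical and chosen only for illustration.

```python
def task_arithmetic_merge(base, finetuned_models, scale=1.0):
    """Merge fine-tuned models by adding their scaled task vectors to the base.

    Each model is a dict mapping a parameter name to a flat list of floats.
    A task vector is (fine-tuned weights - base weights) for one task.
    """
    merged = {}
    for name, base_w in base.items():
        # Sum the task vectors of all fine-tuned models for this parameter.
        delta_sum = [0.0] * len(base_w)
        for model in finetuned_models:
            for i, (w, b) in enumerate(zip(model[name], base_w)):
                delta_sum[i] += w - b
        merged[name] = [b + scale * d for b, d in zip(base_w, delta_sum)]
    return merged


# Hypothetical toy weights standing in for a detection model and an HPE model.
base = {"layer.weight": [1.0, 2.0]}
detection_model = {"layer.weight": [1.5, 2.0]}
hpe_model = {"layer.weight": [1.0, 3.0]}

merged = task_arithmetic_merge(base, [detection_model, hpe_model], scale=1.0)
print(merged)
```

Because every parameter is blended the same way, a merge like this has no mechanism for keeping the two tasks' output formats separate, which is one plausible reading of the blended, invalid responses the abstract describes.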
