HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task

Yu Tian; Tianqi Shao; Tsukasa Demizu; Xuyang Wu; Hsin-Tai Wu

arXiv:2406.01914·cs.CV·March 24, 2026


TL;DR

This paper introduces HPE-CogVLM, a novel vision language model framework that significantly improves head pose estimation accuracy by integrating object detection and HPE tasks through a specialized model merging technique.

Contribution

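The paper presents a new LoRA layer-based merging method for VLMs that combines object detection and head pose estimation capabilities in one model. It targets two problems reported in the abstract: the poor HPE accuracy of direct LoRA fine-tuning, and the blended, invalid response formats produced by some existing merging methods.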

Findings

01

Achieves a 31.5% reduction in mean absolute error (MAE) relative to the state-of-the-art CNN model

02

Outperforms direct LoRA fine-tuning and task arithmetic merging methods

03

Successfully integrates HPE and object detection in a unified VLM framework
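
The MAE figure in the first finding can be made concrete with a small sketch. It assumes predictions and labels are (yaw, pitch, roll) triples in degrees and that errors are averaged over all samples and all three angles. The wraparound handling and the function names `angular_error`, `head_pose_mae`, and `relative_reduction` are illustrative assumptions, not the paper's evaluation code.

```python
def angular_error(pred_deg, true_deg):
    """Smallest absolute difference between two angles, in degrees."""
    diff = (pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def head_pose_mae(preds, targets):
    """Mean absolute error over samples and over yaw, pitch, and roll.

    preds and targets are equal-length sequences of (yaw, pitch, roll)
    triples in degrees. This layout is an assumption for illustration.
    """
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and equal length")
    total = 0.0
    for pred, target in zip(preds, targets):
        total += sum(angular_error(p, t) for p, t in zip(pred, target))
    return total / (3 * len(preds))


def relative_reduction(baseline_mae, new_mae):
    """Fractional MAE reduction of new_mae relative to baseline_mae."""
    return (baseline_mae - new_mae) / baseline_mae


# Made-up numbers for illustration only; they are not results from the paper.
example_preds = [(10.0, 0.0, 0.0), (179.0, 5.0, -2.0)]
example_targets = [(0.0, 0.0, 0.0), (-179.0, 5.0, -2.0)]
example_mae = head_pose_mae(example_preds, example_targets)
```

Under this definition, a 31.5% reduction means `relative_reduction(baseline_mae, new_mae)` equals 0.315 for the two models being compared.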

Abstract

Head pose estimation (HPE) requires a sophisticated understanding of 3D spatial relationships to generate precise yaw, pitch, and roll angles. Previous HPE models, primarily CNN-based, rely on cropped close-up human head images as inputs and often lack robustness in real-world scenarios. Vision Language Models (VLMs) can analyze entire images while focusing on specific objects through their attention mechanisms. In this paper, we propose a novel framework to improve HPE accuracy by leveraging the object detection grounding capability of a VLM, referred to as CogVLM. We empirically find that direct LoRA fine-tuning of this VLM for the HPE task fails to achieve desirable HPE accuracy, while some model merging methods can improve accuracy but frequently produce blended, invalid response formats and struggle to handle the object detection and HPE tasks simultaneously. To integrate HPE…
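
Task arithmetic, one of the merging baselines named in the findings, is commonly described as adding scaled differences between fine-tuned and base weights back onto the base model. The sketch below shows that general idea only, with flat lists of floats standing in for weight tensors. The names `task_vector`, `task_arithmetic_merge`, and `scale` are hypothetical, and this is not the paper's LoRA layer-based merging method.

```python
def task_vector(base, finetuned):
    """Per-parameter difference between a fine-tuned model and its base.

    Both arguments map parameter names to flat lists of floats, a toy
    stand-in for real weight tensors.
    """
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def task_arithmetic_merge(base, finetuned_models, scale=1.0):
    """Add the scaled task vector of each fine-tuned model to the base weights."""
    merged = {name: list(values) for name, values in base.items()}
    for finetuned in finetuned_models:
        for name, delta in task_vector(base, finetuned).items():
            merged[name] = [m + scale * d for m, d in zip(merged[name], delta)]
    return merged


# Toy weights: one "detection" and one "HPE" model fine-tuned from the same base.
base_weights = {"w": [1.0, 2.0]}
detection_weights = {"w": [2.0, 2.0]}
hpe_weights = {"w": [1.0, 4.0]}
merged_weights = task_arithmetic_merge(
    base_weights, [detection_weights, hpe_weights], scale=0.5
)
```

Because every parameter is a weighted mix of both tasks, a merge of this kind can yield a model that mixes the two tasks' output formats, which is the failure mode the abstract describes for some merging methods.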
