HeadPosr: End-to-end Trainable Head Pose Estimation using Transformer Encoders
Naina Dhingra

TL;DR
HeadPosr is an end-to-end trainable head pose estimation method built around a transformer encoder; it is reported to outperform the compared methods on the AFLW2000 and BIWI datasets.
Contribution
The paper presents a novel transformer-based architecture for head pose estimation, supports it with extensive ablation studies, and reports that it outperforms existing state-of-the-art methods.
Findings
Outperforms all compared methods on AFLW2000 and BIWI datasets.
Demonstrates the effectiveness of transformer encoders for head pose estimation (HPE).
Sets new benchmarks for head pose estimation accuracy.
Abstract
In this paper, HeadPosr is proposed to predict head pose from a single RGB image. HeadPosr uses a novel architecture that includes a transformer encoder. Concretely, it consists of: (1) a backbone; (2) a connector; (3) a transformer encoder; (4) a prediction head. The significance of using a transformer encoder for HPE is studied. An extensive ablation study is performed on the transformer used in HeadPosr, varying (1) the number of encoders; (2) the number of heads; (3) the position embedding; (4) the activation; (5) the input channel size. Further studies on (1) different backbones and (2) different learning rates are also shown. The experiments and ablation studies are conducted on three widely used open-source datasets for HPE: 300W-LP, AFLW2000, and BIWI. Experiments illustrate that HeadPosr outperforms all…
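The four-stage pipeline and the ablation axes listed in the abstract can be sketched as a small configuration object plus a shape trace. This is only an illustration under stated assumptions, not the paper's implementation: the names `HeadPosrConfig` and `trace_shapes`, every default value, the backbone stride and channel count, and the three-angle output (yaw, pitch, roll) are hypothetical and do not come from the abstract. The code tracks tensor shapes only and does not build or run a network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeadPosrConfig:
    """Hypothetical config mirroring the ablation axes named in the abstract.

    All field names and default values are assumptions for illustration.
    """
    num_encoders: int = 1                 # number of stacked encoder layers
    num_heads: int = 4                    # attention heads per encoder layer
    position_embedding: str = "learned"   # assumed options: "learned", "sine", "none"
    activation: str = "relu"              # activation inside the encoder
    d_model: int = 16                     # channel size fed to the transformer
    backbone_channels: int = 512          # assumed backbone output channels
    backbone_stride: int = 32             # assumed total backbone downsampling


def trace_shapes(cfg, image_hw=(224, 224), batch=1):
    """Return the (assumed) tensor shape after each of the four stages."""
    # Multi-head attention splits d_model evenly across the heads.
    if cfg.d_model % cfg.num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    h, w = image_hw
    if h % cfg.backbone_stride or w % cfg.backbone_stride:
        raise ValueError("image size must be a multiple of backbone_stride")

    fh, fw = h // cfg.backbone_stride, w // cfg.backbone_stride
    shapes = {"input": (batch, 3, h, w)}
    # (1) Backbone: RGB image -> coarse feature map.
    shapes["backbone"] = (batch, cfg.backbone_channels, fh, fw)
    # (2) Connector: project channels to d_model and flatten space into tokens.
    shapes["connector"] = (batch, fh * fw, cfg.d_model)
    # (3) Transformer encoder: every layer preserves the token sequence shape.
    shapes["encoder"] = shapes["connector"]
    # (4) Prediction head: three pose angles per image (assumed output format).
    shapes["head"] = (batch, 3)
    return shapes


if __name__ == "__main__":
    for stage, shape in trace_shapes(HeadPosrConfig()).items():
        print(f"{stage:>9}: {shape}")
```

Under these assumptions, varying `num_encoders`, `position_embedding`, or `activation` leaves every shape unchanged, whereas changing `d_model` alters the token width and is constrained by `num_heads`.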
