SkillSight: Efficient First-Person Skill Assessment with Gaze
Chi Hsuan Wu, Kumar Ashutosh, Kristen Grauman

TL;DR
SkillSight is a power-efficient method for skill assessment from first-person data: it learns from gaze and egocentric video together, then distills a gaze-only model that avoids continuous video processing at inference.
Contribution
The paper presents a two-stage framework that first jointly models gaze and egocentric video to predict skill level, then distills that model into a gaze-only student, so inference requires only gaze input.
Findings
Gaze data significantly improves skill assessment accuracy.
The gaze-only model reduces power consumption by 73x compared to video-based methods.
SkillSight achieves state-of-the-art performance on three datasets spanning cooking, music, and sports.
Abstract
Egocentric perception on smart glasses could transform how we learn new skills in the physical world, but automatic skill assessment remains a fundamental technical challenge. We introduce SkillSight for power-efficient skill assessment from first-person data. Central to our approach is the hypothesis that skill level is evident not only in how a person performs an activity (video), but also in how they direct their attention when doing so (gaze). Our two-stage framework first learns to jointly model gaze and egocentric video when predicting skill level, then distills a gaze-only student model. At inference, the student model requires only gaze input, drastically reducing power consumption by eliminating continuous video processing. Experiments on three datasets spanning cooking, music, and sports establish, for the first time, the valuable role of gaze in skill understanding across…
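The two-stage idea in the abstract (a gaze-plus-video teacher, then a gaze-only student) can be illustrated with a generic knowledge-distillation loss. This is a minimal sketch, not the authors' objective: the summary here does not state the paper's actual loss, so the standard temperature-scaled logit distillation below is an assumption, and the names `distillation_loss`, `alpha`, and `temperature`, as well as the three-class skill labels in the usage lines, are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """Blend a hard-label loss with a soft teacher-matching loss.

    student_logits: skill-level scores from a gaze-only model (assumed).
    teacher_logits: skill-level scores from a gaze+video model (assumed).
    label: index of the ground-truth skill level.
    """
    # Cross-entropy of the student against the ground-truth skill level.
    p_student = softmax(student_logits)
    cross_entropy = -math.log(p_student[label])

    # KL(teacher || student) on temperature-softened distributions.
    p_teacher_soft = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(p_teacher_soft, p_student_soft) if t > 0)

    # The T^2 factor keeps the soft-term gradients on a comparable scale.
    return alpha * cross_entropy + (1 - alpha) * temperature ** 2 * kl


# Hypothetical logits over three skill levels (e.g. novice, intermediate, expert).
teacher = [0.2, 1.0, 3.0]
student = [0.5, 0.8, 1.5]
loss = distillation_loss(student, teacher, label=2)
```

At inference only the student would run, which is what lets the system skip continuous video processing; the teacher is needed only during training.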
