Multi-modal Egocentric Activity Recognition using Audio-Visual Features
Mehmet Ali Arabacı, Fatih Özkan, Elif Surer, Peter Jančovič, Alptekin Temizel

TL;DR
This paper introduces a novel multi-modal framework combining audio-visual features with multi-kernel learning and boosting for egocentric activity recognition, demonstrating superior performance over existing methods.
Contribution
The work presents a new adaptive fusion framework using MKL and MKBoost for egocentric activity recognition with multi-modal features, improving accuracy.
Findings
MKL outperforms traditional fusion methods
Multi-modal features improve recognition accuracy
Framework tested on multiple egocentric datasets
Abstract
Egocentric activity recognition in first-person videos is increasingly important, with a variety of applications such as lifelogging, summarization, assisted living and activity tracking. Existing methods for this task interpret various sensor information using pre-determined weights for each feature. In this work, we propose a new framework for the egocentric activity recognition problem based on combining audio-visual features with multi-kernel learning (MKL) and multi-kernel boosting (MKBoost). For that purpose, grid optical-flow, virtual-inertia, log-covariance and cuboid features are first extracted from the video. The audio signal is characterized using a "supervector", obtained by Gaussian mixture modelling of frame-level features followed by maximum a posteriori adaptation. Then, the extracted multi-modal features are adaptively fused by MKL classifiers…
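The contrast between pre-determined feature weights and adaptive fusion can be illustrated with a minimal sketch. This is not the authors' MKL or MKBoost implementation: the weights below come from kernel-target alignment, a simple heuristic swapped in for illustration, and the synthetic "visual" and "audio" features, the `gamma` value, and all function names are hypothetical. It shows only the core idea of building one combined kernel as a data-driven weighted sum of per-modality kernels.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """RBF (Gaussian) kernel matrix for the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def alignment(K, y):
    """Uncentred kernel-target alignment between K and labels y in {-1, +1}."""
    Y = np.outer(y, y)
    return np.sum(K * Y) / (np.linalg.norm(K) * np.linalg.norm(Y))

def combine_kernels(kernels, y):
    """Weight each kernel by its (non-negative) alignment, then sum them.

    Assumption: this heuristic stands in for a real MKL solver, which
    would learn the weights jointly with the classifier.
    """
    scores = np.array([max(alignment(K, y), 0.0) for K in kernels])
    if scores.sum() == 0.0:
        weights = np.full(len(kernels), 1.0 / len(kernels))
    else:
        weights = scores / scores.sum()
    K_combined = sum(w * K for w, K in zip(weights, kernels))
    return weights, K_combined

# Synthetic two-class data: the "visual" features carry the label,
# the "audio" features are pure noise (both are made up for this demo).
rng = np.random.default_rng(0)
y = np.repeat([-1.0, 1.0], 20)
X_visual = 2.0 * y[:, None] + rng.normal(size=(40, 5))
X_audio = rng.normal(size=(40, 5))

kernels = [rbf_kernel(X_visual, 0.1), rbf_kernel(X_audio, 0.1)]
weights, K = combine_kernels(kernels, y)
print("kernel weights (visual, audio):", weights)
```

The combined matrix `K` could then be passed to any classifier that accepts a precomputed kernel. With fixed weights, the uninformative modality would contribute as much as the informative one; here its weight shrinks because its kernel aligns poorly with the labels.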
