Transformer Redesign for Late Fusion of Audio-Text Features on Ultra-Low-Power Edge Hardware
Stavros Mitsis, Ermos Hadjikyriakos, Humaid Ibrahim, Savvas Neofytou, Shashwat Raman, James Myles, Eiman Kanjo

TL;DR
This paper presents a hardware-efficient, multimodal emotion recognition system optimized for ultra-low-power edge devices, achieving real-time performance and improved accuracy through a novel late-fusion transformer architecture.
Contribution
It introduces a hardware-aware, late-fusion transformer model combining acoustic and linguistic features, optimized for Edge TPU deployment on microcontroller-class devices.
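As a rough illustration of what late fusion of two modalities can look like, the sketch below concatenates one acoustic embedding with one keyword embedding and passes the result through a single linear layer with softmax. This is only one common form of late fusion; the summary here does not describe the paper's actual fusion head, and every dimension, weight, and name (`late_fusion`, `ACOUSTIC_DIM`, `KEYWORD_DIM`, `NUM_CLASSES`) is a hypothetical placeholder.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(acoustic_emb, keyword_emb, weights, bias):
    """Concatenate per-modality embeddings, then apply one linear layer.

    Assumption: fusion by concatenation followed by a linear classifier.
    The paper's real fusion head may differ.
    """
    fused = list(acoustic_emb) + list(keyword_emb)
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Hypothetical sizes, chosen only to keep the example small.
ACOUSTIC_DIM, KEYWORD_DIM, NUM_CLASSES = 8, 4, 3

random.seed(0)
# Stand-ins for the outputs of the acoustic model and the keyword network.
acoustic_emb = [random.uniform(-1, 1) for _ in range(ACOUSTIC_DIM)]
keyword_emb = [random.uniform(-1, 1) for _ in range(KEYWORD_DIM)]

# Random placeholder weights; a trained system would learn these.
weights = [
    [random.uniform(-0.1, 0.1) for _ in range(ACOUSTIC_DIM + KEYWORD_DIM)]
    for _ in range(NUM_CLASSES)
]
bias = [0.0] * NUM_CLASSES

probs = late_fusion(acoustic_emb, keyword_emb, weights, bias)
print(probs)
```

Because each modality is reduced to an embedding before the two are combined, the keyword branch can stay frozen while only the fusion layer is trained, which keeps the trainable footprint small.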
Findings
Achieves 6.3% macro F1 improvement over unimodal baselines.
Real-time inference within 21-23 ms latency on Edge TPU.
Operates within a 1.8 MB memory budget.
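To give a feel for what a 1.8 MB budget implies, the back-of-the-envelope sketch below computes an upper bound on parameter count for a given weight width. It assumes 1 MB = 1,000,000 bytes and that the entire budget is available for weights; neither assumption is confirmed by this summary, and the helper `max_params` is hypothetical.

```python
def max_params(budget_bytes, bytes_per_param, reserved_bytes=0):
    """Upper bound on parameter count that fits in the remaining budget."""
    return (budget_bytes - reserved_bytes) // bytes_per_param


# Assumption: 1.8 MB interpreted as 1,800,000 bytes.
BUDGET_BYTES = 1_800_000

int8_params = max_params(BUDGET_BYTES, bytes_per_param=1)     # 8-bit weights
float32_params = max_params(BUDGET_BYTES, bytes_per_param=4)  # 32-bit weights

print(int8_params)     # 1800000
print(float32_params)  # 450000
```

Under these assumptions, 8-bit quantisation allows roughly four times as many parameters as 32-bit floats in the same budget. In practice, activations and runtime buffers also consume memory, so the real ceiling would be lower.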
Abstract
Deploying emotion recognition systems in real-world environments where devices must be small, low-power, and private remains a significant challenge. This is especially relevant for applications such as tension monitoring, conflict de-escalation, and responsive wearables, where cloud-based solutions are impractical. Multimodal emotion recognition has advanced through deep learning, but most systems remain unsuitable for deployment on ultra-constrained edge devices. Prior work typically relies on powerful hardware, lacks real-time performance, or uses unimodal input. This paper addresses that gap by presenting a hardware-aware emotion recognition system that combines acoustic and linguistic features using a late-fusion architecture optimised for Edge TPU. The design integrates a quantised transformer-based acoustic model with frozen keyword embeddings from a DSResNet-SE network, enabling…
Taxonomy
Topics: Emotion and Mood Recognition · Music and Audio Processing · Speech Recognition and Synthesis
