Transformer Redesign for Late Fusion of Audio-Text Features on Ultra-Low-Power Edge Hardware

Stavros Mitsis; Ermos Hadjikyriakos; Humaid Ibrahim; Savvas Neofytou; Shashwat Raman; James Myles; Eiman Kanjo

arXiv:2510.18036 · cs.SD · October 22, 2025

Open Access

TL;DR

This paper presents a hardware-efficient, multimodal emotion recognition system optimized for ultra-low-power edge devices, achieving real-time performance and improved accuracy through a novel late-fusion transformer architecture.

Contribution

It introduces a hardware-aware, late-fusion transformer model combining acoustic and linguistic features, optimized for Edge TPU deployment on microcontroller-class devices.
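
As a rough illustration of what late fusion means here, the sketch below concatenates one embedding per modality and classifies the result with a single linear layer. It is a minimal sketch, not the paper's architecture: the embedding sizes, the number of emotion classes, the random weights, and the function name `late_fusion_predict` are all hypothetical, and quantisation and Edge TPU compilation are left out.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability before exponentiating.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion_predict(acoustic_emb, text_emb, weights, bias):
    """Fuse per-modality embeddings by concatenation, then classify.

    Each modality is encoded separately upstream; only the final
    embeddings meet here, which is what makes the fusion "late".
    """
    fused = np.concatenate([acoustic_emb, text_emb], axis=-1)
    logits = fused @ weights + bias
    return softmax(logits)

# Hypothetical sizes, chosen only for illustration (not from the paper).
ACOUSTIC_DIM, TEXT_DIM, NUM_CLASSES = 64, 32, 4

rng = np.random.default_rng(0)
acoustic_emb = rng.standard_normal((2, ACOUSTIC_DIM))  # stand-in for acoustic model output
text_emb = rng.standard_normal((2, TEXT_DIM))          # stand-in for frozen keyword embeddings
weights = rng.standard_normal((ACOUSTIC_DIM + TEXT_DIM, NUM_CLASSES)) * 0.1
bias = np.zeros(NUM_CLASSES)

probs = late_fusion_predict(acoustic_emb, text_emb, weights, bias)
print(probs.shape)  # one probability row per input example
```

Because the two encoders only meet at the fusion layer, either one can be frozen or swapped without retraining the other.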

Findings

01. Achieves a 6.3% macro F1 improvement over unimodal baselines.

02. Real-time inference with 21-23 ms latency on Edge TPU.

03. Operates within a 1.8 MB memory budget.

Abstract

Deploying emotion recognition systems in real-world environments where devices must be small, low-power, and private remains a significant challenge. This is especially relevant for applications such as tension monitoring, conflict de-escalation, and responsive wearables, where cloud-based solutions are impractical. Multimodal emotion recognition has advanced through deep learning, but most systems remain unsuitable for deployment on ultra-constrained edge devices. Prior work typically relies on powerful hardware, lacks real-time performance, or uses unimodal input. This paper addresses that gap by presenting a hardware-aware emotion recognition system that combines acoustic and linguistic features using a late-fusion architecture optimised for Edge TPU. The design integrates a quantised transformer-based acoustic model with frozen keyword embeddings from a DSResNet-SE network, enabling…

Taxonomy

Topics: Emotion and Mood Recognition · Music and Audio Processing · Speech Recognition and Synthesis