Input Conditioned Layer Dropping in Speech Foundation Models

Abdul Hannan; Daniele Falavigna; Alessio Brutti

arXiv:2507.07954 · cs.SD · July 11, 2025

TL;DR

This paper introduces an input-driven layer dropping method for speech models that dynamically adapts the network's depth based on input features, improving efficiency without sacrificing performance.

Contribution

It proposes a novel input-conditioned layer dropping technique that uses a lightweight network to select layers, enhancing dynamic adaptability in speech foundation models.

Findings

01. Outperforms random layer dropping in experiments

02. Achieves comparable or better results than early exit strategies

03. Effective across multiple speech and audio benchmarks

Abstract

Curating foundation speech models for edge and IoT settings, where computational resources vary over time, requires dynamic architectures featuring adaptable reduction strategies. One emerging approach is layer dropping (LD), which skips a fraction of the layers of a backbone network during inference to reduce the computational load. This allows transforming static models into dynamic ones. However, existing approaches exhibit limitations either in the mode of selecting layers or by significantly modifying the neural architecture. To this end, we propose input-driven LD, which employs the network's input features and a lightweight layer-selecting network to determine the optimum combination of processing layers. Extensive experimentation on four speech and audio public benchmarks, using two different pre-trained foundation models, demonstrates the effectiveness of our…
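
The idea in the abstract, a lightweight selector that reads the input features and decides which backbone layers to run, can be illustrated with a toy sketch. This is not the authors' implementation: the linear selector, the top-k keep rule, the residual tanh layers, and the names `make_layer`, `select_layers`, and `forward` are all assumptions made for illustration, and the weights are random rather than trained.

```python
import math
import random


def make_layer(dim, rng):
    """Hypothetical stand-in for one backbone layer: a residual tanh projection."""
    w = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]

    def layer(x):
        # Residual update: x_i + tanh(w_i . x)
        return [
            xi + math.tanh(sum(wij * xj for wij, xj in zip(row, x)))
            for xi, row in zip(x, w)
        ]

    return layer


def select_layers(features, selector_weights, keep):
    """Assumed selector: one linear score per layer, keep the top-`keep` layers."""
    scores = [sum(w * f for w, f in zip(row, features)) for row in selector_weights]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return the kept layer indices in their original network order.
    return sorted(ranked[:keep])


def forward(x, layers, selector_weights, keep):
    """Run only the layers chosen for this particular input; skip the rest."""
    active = select_layers(x, selector_weights, keep)
    for i in active:
        x = layers[i](x)
    return x, active


if __name__ == "__main__":
    rng = random.Random(0)
    dim, depth = 4, 6
    layers = [make_layer(dim, rng) for _ in range(depth)]
    selector = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(depth)]
    features = [0.1, -0.2, 0.3, 0.4]  # placeholder for pooled input features
    output, active = forward(features, layers, selector, keep=3)
    print("layers executed:", active)
```

Here `keep` plays the role of the compute budget: lowering it skips more layers, and because the scores depend on the input features, different inputs may execute different subsets of layers. How the selector is trained and how the budget is set in the paper are not covered by this sketch.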

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis