TL;DR
Splitformer introduces parallel processing layers with downsampled inputs to enhance early-exit speech recognition models, achieving better accuracy on benchmarks with minimal parameter increase and no impact on inference time.
Contribution
The paper proposes a novel architecture that combines early-exit strategies with parallel downsampling layers to improve speech recognition performance.
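As a rough illustration of what a parallel layer on a downsampled input can look like, the sketch below runs a full-rate layer next to a side layer that sees a time-averaged copy of the same frames, then upsamples the side output and sums the two. This is a minimal sketch, not the paper's implementation: the helper names (`downsample`, `upsample`, `layer`, `parallel_block`), the average pooling, the frame-repetition upsampling, the sum used to merge branches, and the toy residual layer are all assumptions made for illustration.

```python
import numpy as np

def downsample(x, factor=2):
    """Average-pool frames along time by `factor` (pads by repeating the last frame)."""
    t, d = x.shape
    pad = (-t) % factor
    if pad:
        x = np.concatenate([x, np.repeat(x[-1:], pad, axis=0)], axis=0)
    return x.reshape(-1, factor, d).mean(axis=1)

def upsample(x, factor, length):
    """Repeat each frame `factor` times and trim back to `length` frames."""
    return np.repeat(x, factor, axis=0)[:length]

def layer(x, w):
    """Hypothetical stand-in for an encoder layer: residual linear map with tanh."""
    return x + np.tanh(x @ w)

def parallel_block(x, w_main, w_side, factor=2):
    """Full-rate layer plus a side layer on the downsampled input (merged by sum, an assumption)."""
    main = layer(x, w_main)
    side = layer(downsample(x, factor), w_side)
    return main + upsample(side, factor, x.shape[0])

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 4))            # 7 frames, 4 features
w_main = rng.standard_normal((4, 4)) * 0.1
w_side = rng.standard_normal((4, 4)) * 0.1
y = parallel_block(x, w_main, w_side)      # same shape as x
```

Because the side layer operates on roughly half as many frames, it adds parameters but comparatively little sequential work, which is the intuition behind pairing it with the full-rate path.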
Findings
Significant performance improvement on standard benchmarks.
Minimal increase in model parameters.
No change in inference time.
Abstract
The ability to dynamically adjust the computational load of neural models during inference in a resource-aware manner is crucial for on-device processing scenarios, characterised by limited and time-varying computational resources. Early-exit architectures represent an elegant and effective solution, since they can process the input with a subset of their layers, exiting at intermediate branches (the uppermost layers are hence removed from the model). From a different perspective, for automatic speech recognition applications there are memory-efficient neural architectures that apply variable frame rate analysis, through downsampling/upsampling operations in the middle layers, reducing the overall number of operations and significantly improving performance on well-established benchmarks. One example is the Zipformer. However, these architectures lack the modularity necessary to…
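The early-exit idea described in the abstract, processing the input with only a subset of layers and reading the result from an intermediate branch, can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's model: the function `encode_until`, the exit positions (2, 4, 6), the residual tanh layers, and the linear exit heads are all hypothetical.

```python
import numpy as np

def encode_until(x, layer_weights, exit_heads, exit_layer):
    """Run only the first `exit_layer` layers, then apply that exit's output head."""
    if exit_layer not in exit_heads:
        raise ValueError(f"no exit branch after layer {exit_layer}")
    for w in layer_weights[:exit_layer]:   # layers above the exit are never executed
        x = x + np.tanh(x @ w)             # toy residual layer (assumption)
    return x @ exit_heads[exit_layer]      # per-frame scores from the chosen branch

rng = np.random.default_rng(0)
layer_weights = [rng.standard_normal((4, 4)) * 0.1 for _ in range(6)]
exit_heads = {n: rng.standard_normal((4, 5)) for n in (2, 4, 6)}  # hypothetical exits
x = rng.standard_normal((10, 4))           # 10 frames, 4 features

cheap = encode_until(x, layer_weights, exit_heads, 2)  # low-resource setting
full = encode_until(x, layer_weights, exit_heads, 6)   # all layers
```

A device under load would pick a lower `exit_layer`, trading accuracy for compute, while the same set of weights serves every exit.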
Taxonomy
Methods: Attentive Walk-Aggregating Graph Neural Network · Parallel Layers
