Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

Shruti Palaskar; Oggi Rudovic; Sameer Dharur; Florian Pesce; Gautam Krishna; Aswin Sivaraman; Jack Berkowitz; Ahmed Hussen Abdelaziz; Saurabh Adya; Ahmed Tewfik

arXiv:2406.09617 · cs.CL · June 17, 2024 · 1 citation

Open Access

TL;DR

This paper introduces FLoRA, a low-rank adaptation method that lets a pre-trained text-only large language model consume additional modalities. On device-directed speech detection it yields a 22% relative reduction in equal error rate (EER) over the text-only approach while tuning only a fraction of the model's parameters.

Contribution

The paper presents FLoRA, a low-rank adaptation technique that lets pre-trained unimodal LLMs integrate new modalities while tuning fewer parameters than full fine-tuning, and that, with adapter dropout, is more robust to missing data.

Findings

01. 22% relative reduction in equal error rate (EER) over the text-only approach

02. FLoRA achieves performance parity with full fine-tuning (FFT) while tuning only a fraction of the parameters

03. With adapter dropout, FLoRA is robust to missing data, with 20% lower EER and 56% lower false accept rate than FFT
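For readers unfamiliar with the metric behind these findings, the following is a minimal sketch of how an equal error rate and a relative reduction can be computed. It assumes higher scores mean "device-directed" and approximates the EER by sweeping thresholds over the observed scores. The function names, toy scores, and the EER values passed to `relative_reduction` are hypothetical illustrations, not data or code from the paper.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds over the observed scores.

    labels: 1 = device-directed (positive), 0 = not directed (negative).
    A sample is accepted when score >= threshold.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, eer = None, None
    for t in sorted(set(scores)):
        # False accept rate: negatives wrongly accepted at this threshold.
        far = sum(s >= t for s in negatives) / len(negatives)
        # False reject rate: positives wrongly rejected at this threshold.
        frr = sum(s < t for s in positives) / len(positives)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline, improved):
    """Reduction of an error rate expressed as a fraction of the baseline."""
    return (baseline - improved) / baseline


# Toy, made-up scores (not from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(equal_error_rate(toy_scores, toy_labels))

# Hypothetical EERs chosen only to illustrate the arithmetic of a 22% relative reduction.
print(round(relative_reduction(0.10, 0.078), 2))
```

Note that a relative reduction is measured against the baseline error rate, so a 22% relative reduction does not mean the EER dropped by 22 percentage points.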

Abstract

Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low rank adaptation. For device-directed speech detection, using FLoRA, the multimodal LLM achieves a 22% relative reduction in equal error rate (EER) over the text-only approach and attains performance parity with its full fine-tuning (FFT) counterpart while needing to tune only a fraction of its parameters. Furthermore, with the newly introduced adapter dropout, FLoRA is robust to missing data, improving over FFT by 20% lower EER and 56% lower false…
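The general idea of low rank adaptation mentioned in the abstract can be sketched as follows: a frozen weight matrix W is augmented with a trainable low-rank update, giving W + (alpha / r) · B · A. This is a generic LoRA-style layer in plain Python, not the authors' FLoRA implementation. The class name `LowRankAdaptedLinear` and its parameters are hypothetical, and the abstract does not say how adapter dropout works, so the sketch assumes it means randomly disabling the low-rank update during training.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LowRankAdaptedLinear:
    """Frozen linear map plus a trainable low-rank update (generic LoRA-style sketch)."""

    def __init__(self, weight, rank, alpha=1.0, adapter_dropout=0.0, seed=0):
        self.weight = weight  # frozen pre-trained weights, shape d_out x d_in
        d_out, d_in = len(weight), len(weight[0])
        self.rng = random.Random(seed)
        # A gets a small random init; B starts at zero so the adapted layer
        # initially behaves exactly like the frozen layer.
        self.A = [[self.rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank
        # ASSUMPTION: probability of skipping the adapter for a training step.
        self.adapter_dropout = adapter_dropout

    def trainable_parameters(self):
        """Count only the adapter parameters (A and B); W stays frozen."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def forward(self, x, training=False):
        column = [[v] for v in x]
        base = matmul(self.weight, column)
        if training and self.rng.random() < self.adapter_dropout:
            # Adapter dropped: fall back to the frozen layer's output.
            return [row[0] for row in base]
        delta = matmul(self.B, matmul(self.A, column))
        return [b[0] + self.scale * d[0] for b, d in zip(base, delta)]


# Toy usage: an 8x8 identity stands in for a pre-trained weight matrix.
frozen = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
layer = LowRankAdaptedLinear(frozen, rank=2, adapter_dropout=0.5)
print(layer.trainable_parameters(), "adapter parameters vs", 8 * 8, "frozen weights")
print(layer.forward([float(i) for i in range(8)]))
```

In this toy setting the adapter holds 2 · 8 · 2 = 32 parameters against 64 frozen weights; the saving grows with layer size because the adapter scales with r · (d_in + d_out) rather than d_in · d_out.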

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Adapter