Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation
Jialong Mai, Xiaofen Xing, Yawei Li, Weidong Chen, Zhipeng Li, Jingyuan Xing, Xiangmin Xu

TL;DR
This paper introduces a Dynamic Parameter Memory mechanism that enhances speech large language models with temporary LoRA modules, enabling effective processing of long audio sequences for emotion recognition in conversation, surpassing existing methods.
Contribution
The paper proposes a novel DPM mechanism that encodes sentence-level emotions into temporary LoRA modules, allowing SLLMs to handle unlimited-length audio sequences for ERC.
Findings
DPM significantly improves emotion recognition accuracy on IEMOCAP.
The method achieves state-of-the-art performance in long-sequence emotion recognition.
DPM effectively memorizes contextual information across conversation turns.
Abstract
Recent research has focused on applying speech large language models (SLLMs) to improve speech emotion recognition (SER). However, the inherently high frame rate of the speech modality severely limits the signal processing and understanding capabilities of SLLMs. For example, an SLLM with a 4K context window can only process 80 seconds of audio at a 50 Hz feature sampling rate before reaching its capacity limit. Input token compression methods used in SLLMs overlook the continuity and inertia of emotions across multiple conversation turns. This paper proposes a Dynamic Parameter Memory (DPM) mechanism with contextual semantics and sentence-level emotion encoding, enabling processing of unlimited-length audio with limited context windows in SLLMs. Specifically, DPM progressively encodes sentence-level information and emotions into a temporary LoRA module during inference to effectively "memorize" the…
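The context-window figure in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes one token per feature frame and no tokens reserved for a text prompt unless stated, and the function name `max_audio_seconds` and the `reserved_tokens` parameter are hypothetical, not taken from the paper.

```python
def max_audio_seconds(context_tokens: int, frame_rate_hz: float,
                      reserved_tokens: int = 0) -> float:
    """Seconds of audio that fit in a context window.

    Assumption (not from the paper): each feature frame costs exactly
    one token, and `reserved_tokens` are set aside for prompt text.
    """
    if frame_rate_hz <= 0:
        raise ValueError("frame_rate_hz must be positive")
    usable_tokens = max(context_tokens - reserved_tokens, 0)
    return usable_tokens / frame_rate_hz


# Reading "4K" as 4000 tokens reproduces the abstract's 80-second figure.
print(max_audio_seconds(4000, 50))   # 80.0
# Reading "4K" as 4096 tokens gives a slightly larger budget.
print(max_audio_seconds(4096, 50))   # 81.92
```

Under either reading of "4K", a multi-minute conversation exceeds the window, which is the limitation DPM is meant to address.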
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The manuscript presents a novel inference method (DPM) that addresses long-sequence limits in LLMs. Such a method is usually very complex, and it can be very helpful for emotion recognition in long conversations. 2. The method maintains contextual emotion continuity across dialogue turns, and this treatment is in-depth. 3. It demonstrates SOTA performance (e.g., 79.34% WF1), which is a strong point overall. 4. Elegant use of temporary LoRA for
1. The evaluation looks fairly comprehensive but is limited to two datasets; it lacks real-world or multilingual validation. 2. The metrics are good, but there are no explicit latency or computational cost benchmarks. 3. The method depends heavily on sentence segmentation quality. 4. Analysis of failure or misclassification cases is limited.
1. The primary novelty of this work lies in its innovative application of LoRA not as a static fine-tuning method, but as a dynamic, temporal memory for extending the effective context of an SLLM. Instead of conventional approaches like input compression or sliding windows, which risk losing historical information, the authors propose to progressively encode the evolving conversational context directly into the LoRA parameters during inference. This reconceptualization of LoRA is really interest
The paper has some weaknesses, which I list in roughly decreasing order of significance in the hope that this helps the authors fix them and improve the paper. 1. A primary concern is the marginal performance improvement when contextualized against the immense increase in model complexity. The reported 10-15% gain in weighted and unweighted accuracy over a four-emotion task on IEMOCAP is unimpressive when compared to results from over seven years
The problem statement and motivation behind the work are well introduced. The overall presentation of the paper and the visuals are clear.
For a more comprehensive evaluation, I would suggest additional experiments on more conversational-style datasets beyond IEMOCAP and MELD. These could cover speakers of other languages and more varied conversational styles. In addition, a synthetic dataset could be used to show generalizability. I would also like to know the model's performance on individual discrete emotions, rather than only total accuracy or macro F1. I know the main scope of the work is for ERC, but for S
Taxonomy
Topics: Emotion and Mood Recognition
