LLM2Fx-Tools: Tool Calling For Music Post-Production

Seungheon Doh; Junghyun Koo; Marco A. Martínez-Ramírez; Woosung Choi; Wei-Hsiang Liao; Qiyu Wu; Juhan Nam; Yuki Mitsufuji

arXiv:2512.01559 · cs.SD · January 30, 2026

Open Access · 3 Reviews

TL;DR

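This paper presents LLM2Fx-Tools, a framework that uses a large language model to generate executable audio effect chains (Fx-chains) for music post-production. The chains are produced as explicit tool calls guided by chain-of-thought planning, which keeps the result interpretable, and the same mechanism supports transferring effects from a reference recording to new content.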

Contribution

The paper introduces a multimodal tool-calling framework for audio effects and LP-Fx, an instruction-following dataset with structured CoT annotations and tool calls for audio effects modules, extending LLM applications to music production.

Findings

1. Infers an Fx-chain and its parameters from pairs of unprocessed and processed audio.
2. Enables style transfer of audio effects from a reference source to new content.
3. Demonstrates effective reasoning and control in music post-production tasks.

Abstract

This paper introduces LLM2Fx-Tools, a multimodal tool-calling framework that generates executable sequences of audio effects (Fx-chain) for music post-production. LLM2Fx-Tools uses a large language model (LLM) to understand audio inputs, select audio effects types, determine their order, and estimate parameters, guided by chain-of-thought (CoT) planning. We also present LP-Fx, a new instruction-following dataset with structured CoT annotations and tool calls for audio effects modules. Experiments show that LLM2Fx-Tools can infer an Fx-chain and its parameters from pairs of unprocessed and processed audio, enabled by autoregressive sequence modeling, tool calling, and CoT reasoning. We further validate the system in a style transfer setting, where audio effects information is transferred from a reference source and applied to new content. Finally, LLM-as-a-judge evaluation demonstrates…
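To make the idea of an "executable sequence of audio effects" concrete, here is a minimal sketch of how an Fx-chain expressed as ordered tool calls could be parsed and applied. Everything in it is an assumption for illustration: the tool names (`gain`, `hard_clip`), the JSON layout, and the toy effects are hypothetical and are not the paper's actual schema or effect modules.

```python
import json
from typing import Callable, Dict, List

Signal = List[float]  # toy mono signal; real systems would use audio arrays


def gain(x: Signal, gain_db: float) -> Signal:
    """Scale the signal by a gain given in decibels."""
    scale = 10.0 ** (gain_db / 20.0)
    return [s * scale for s in x]


def hard_clip(x: Signal, threshold: float) -> Signal:
    """Clamp every sample to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, s)) for s in x]


# Hypothetical tool registry: maps a tool name to its implementation.
TOOLS: Dict[str, Callable[..., Signal]] = {"gain": gain, "hard_clip": hard_clip}


def apply_fx_chain(x: Signal, chain_json: str) -> Signal:
    """Run each tool call in order; the output of one effect feeds the next."""
    for call in json.loads(chain_json):
        fx = TOOLS[call["name"]]  # raises KeyError for an unknown tool
        x = fx(x, **call["arguments"])
    return x


dry = [0.1, 0.5, -0.8]

# Same two effects, opposite order (assumed JSON layout, not the paper's).
chain_a = json.dumps([
    {"name": "gain", "arguments": {"gain_db": 6.0}},
    {"name": "hard_clip", "arguments": {"threshold": 0.5}},
])
chain_b = json.dumps([
    {"name": "hard_clip", "arguments": {"threshold": 0.5}},
    {"name": "gain", "arguments": {"gain_db": 6.0}},
])

wet_a = apply_fx_chain(dry, chain_a)  # boost, then clip: peaks stay within 0.5
wet_b = apply_fx_chain(dry, chain_b)  # clip, then boost: peaks exceed 0.5
```

The two chains use identical effects and parameters but produce different outputs, which is why a model of this kind has to predict the order of the calls as well as their types and parameters.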

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. First work applying structured tool calling to audio effects chains
2. Comprehensive evaluation across multiple metrics

Weaknesses

1. The paper misuses terminology. "Audio style transfer" has an established meaning in the audio processing literature (timbre/texture transformation). This work only does audio effects parameter transfer, which is much narrower. This creates confusion with existing work and is misleading.
2. Limited technical novelty. The method is standard multimodal LLM fine-tuning: audio encoder -> adapter -> LLM with LoRA. This is a direct application of existing techniques without a methodological contribution.
3. No

Reviewer 02 · Rating 8 · Confidence 3

Strengths

Originality: The paper's key novelty lies in formulating Fx-chain estimation as an LLM-based tool-calling problem. Autoregressive modeling lets the LLM learn the sequential order of audio effect calls, unlike systems based only on audio features.

Quality: The paper has detailed experiments around the three evaluation tasks: reverse engineering, to show the model can predict the tool chain for paired audio; blind style transfer, to show the generalization capability to unseen audio; an

Weaknesses

For the reverse engineering task, the strongest baseline is multi-task regression, which comes close even though it does not model the ordering of the Fx-chain, while the LLM does learn that information. The authors could consider adding a pairwise-ordering loss over the 9 audio effects for the multi-task baselines.

For the style transfer task, the style of the output appears to be a mix of the input and reference audio when listening subjectively to the demo examples. A comparison with differential

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The proposed approach to Fx-chain estimation is novel. The integration of Chain-of-Thought (CoT) reasoning into the training framework is also interesting.
- The problem is clearly defined and well motivated.
- The methodology for dataset creation is clearly described and systematically organized.

Weaknesses

- In Figure 1, the meaning of FxNorm is unclear.
- In Figure 2, why does e_{SEP} consist of two tokens?
- Below Equation (1), what is N? What is param_n?
- In Section 2.1, second paragraph, the authors mention “handle both tasks.” What exactly are the two tasks?
- In Section 2.1, the term “secondary task” is introduced but not clearly defined.
- In Section 2.2 (Audio Encoder), why was Fx-Encoder++ chosen over other possible encoders? How might different audio encoders influence system perfo

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis