From Data to Behavior: Predicting Unintended Model Behaviors Before Training
Mengru Wang, Zhenqian Xu, Junfeng Fang, Yunzhi Yao, Shumin Deng, Huajun Chen, Ningyu Zhang

TL;DR
This paper introduces Data2Behavior, a task for predicting unintended model behaviors before training, and Manipulating Data Features (MDF), a lightweight method that predicts unintended biases in large language models without costly fine-tuning.
Contribution
The paper proposes a new task (Data2Behavior) and a lightweight method (MDF) for predicting unintended model behaviors before fine-tuning, which reduces resource consumption and allows safety assessment ahead of training.
Findings
MDF reliably predicts unintended behaviors in LLMs.
MDF consumes only about 20% of the GPU resources required for fine-tuning.
Experiments confirm MDF's effectiveness across multiple models, including Qwen3-14B and Qwen2.5-32B-Instruct.
Abstract
Large Language Models (LLMs) can acquire unintended biases from seemingly benign training data even without explicit cues or malicious content. Existing methods struggle to detect such risks before fine-tuning, making post hoc evaluation costly and inefficient. To address this challenge, we introduce Data2Behavior, a new task for predicting unintended model behaviors prior to training. We also propose Manipulating Data Features (MDF), a lightweight approach that summarizes candidate data through their mean representations and injects them into the forward pass of a base model, allowing latent statistical signals in the data to shape model activations and reveal potential biases and safety risks without updating any parameters. MDF achieves reliable prediction while consuming only about 20% of the GPU resources required for fine-tuning. Experiments on Qwen3-14B, Qwen2.5-32B-Instruct, and…
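The mechanism described in the abstract (summarize candidate data by a mean representation, inject it into the forward pass, and update no parameters) can be illustrated with a toy sketch. This is not the authors' implementation: the two-matrix "base model", the injection point, the scaling factor `alpha`, and every name below are hypothetical choices made only for illustration, and the paper's actual layer selection and scaling may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB = 8, 5

# Hypothetical frozen "base model": an embedding-like matrix and an output head.
W_in = rng.normal(size=(VOCAB, HIDDEN))
W_out = rng.normal(size=(HIDDEN, VOCAB))


def hidden_states(token_ids):
    """Per-token hidden activations of the toy model."""
    return np.tanh(W_in[np.asarray(token_ids)])


def mean_representation(dataset):
    """Average each example's mean hidden state over the candidate data."""
    per_example = [hidden_states(ex).mean(axis=0) for ex in dataset]
    return np.mean(per_example, axis=0)


def forward(token_ids, injected=None, alpha=1.0):
    """Forward pass; optionally add the data summary to the hidden state."""
    h = hidden_states(token_ids).mean(axis=0)
    if injected is not None:
        h = h + alpha * injected  # assumed injection rule, no weights change
    logits = h @ W_out
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()


# Hypothetical candidate training data and a probe prompt, as token ids.
candidate_data = [[0, 1, 1], [1, 1, 2], [1, 3]]
probe = [4, 0]

weights_before = (W_in.copy(), W_out.copy())
data_summary = mean_representation(candidate_data)

baseline = forward(probe)
steered = forward(probe, injected=data_summary, alpha=2.0)

# The shift in output probabilities is the signal one would inspect.
print("baseline:", np.round(baseline, 3))
print("steered: ", np.round(steered, 3))
print("shift:   ", np.round(steered - baseline, 3))
```

Comparing `steered` with `baseline` on probe prompts is the kind of signal such a method would examine for emerging biases, while the weights stay untouched. With a real transformer, the same idea would presumably be applied to intermediate layer activations rather than to a single toy hidden vector.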
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
