From Data to Behavior: Predicting Unintended Model Behaviors Before Training

Mengru Wang; Zhenqian Xu; Junfeng Fang; Yunzhi Yao; Shumin Deng; Huajun Chen; Ningyu Zhang

arXiv:2602.04735 · cs.LG · February 5, 2026

TL;DR

This paper introduces Data2Behavior, a task for predicting unintended model behaviors before training, and Manipulating Data Features (MDF), a lightweight method that surfaces potential biases and safety risks in candidate training data without costly fine-tuning.

Contribution

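The paper proposes a new task (Data2Behavior) and a lightweight method (MDF) for predicting unintended model behaviors before fine-tuning. MDF updates no model parameters, which reduces the compute needed for safety assessment compared with post hoc evaluation of a fine-tuned model.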

Findings

01. MDF reliably predicts unintended behaviors in LLMs without updating any parameters.

02. MDF consumes only about 20% of the GPU resources required for fine-tuning.

03. Experiments confirm MDF's effectiveness across multiple models, including Qwen3-14B and Qwen2.5-32B-Instruct.

Abstract

Large Language Models (LLMs) can acquire unintended biases from seemingly benign training data even without explicit cues or malicious content. Existing methods struggle to detect such risks before fine-tuning, making post hoc evaluation costly and inefficient. To address this challenge, we introduce Data2Behavior, a new task for predicting unintended model behaviors prior to training. We also propose Manipulating Data Features (MDF), a lightweight approach that summarizes candidate data through their mean representations and injects them into the forward pass of a base model, allowing latent statistical signals in the data to shape model activations and reveal potential biases and safety risks without updating any parameters. MDF achieves reliable prediction while consuming only about 20% of the GPU resources required for fine-tuning. Experiments on Qwen3-14B, Qwen2.5-32B-Instruct, and…
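
The mechanism the abstract describes (summarize candidate data by its mean representation, inject that summary into the forward pass of a frozen base model, and observe how the outputs move) can be illustrated with a toy sketch. This is not the authors' implementation. The two-layer `ToyModel`, the choice to inject at the hidden layer, the scaling factor `alpha`, and the helper names `mean_representation` and `predict_shift` are all hypothetical, chosen only to make the idea concrete in plain Python.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


class ToyModel:
    """Frozen two-layer stand-in for a base model (hypothetical, not an LLM)."""

    def __init__(self, d_in, d_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.W1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
        self.W2 = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(n_out)]

    def hidden(self, x):
        """Hidden representation of one input vector."""
        return [math.tanh(v) for v in matvec(self.W1, x)]

    def forward(self, x, inject=None, alpha=0.0):
        """Forward pass; optionally add alpha * inject to the hidden state."""
        h = self.hidden(x)
        if inject is not None:
            h = [hi + alpha * di for hi, di in zip(h, inject)]
        return softmax(matvec(self.W2, h))


def mean_representation(model, dataset):
    """Summarize candidate data as the mean of its hidden representations."""
    reps = [model.hidden(x) for x in dataset]
    return [sum(col) / len(reps) for col in zip(*reps)]


def predict_shift(model, candidate_data, probe, alpha=1.0):
    """Change in output probabilities on a probe input when the data summary
    is injected. No weights are updated at any point."""
    mu = mean_representation(model, candidate_data)
    base = model.forward(probe)
    steered = model.forward(probe, inject=mu, alpha=alpha)
    return [s - b for s, b in zip(steered, base)]


# Illustrative usage with made-up inputs (assumed values, not real data).
model = ToyModel(d_in=4, d_hidden=8, n_out=3, seed=0)
candidate_data = [[1.0, 0.0, 0.5, -0.5], [0.8, 0.1, 0.4, -0.6]]
probe = [0.0, 1.0, 0.0, 0.0]
print(predict_shift(model, candidate_data, probe, alpha=2.0))
```

In this toy setting, a large shift toward one output class would be read as a signal that the candidate data could push the model in that direction if it were used for training. How the paper selects layers, scales the injection, and measures the resulting behavior is not specified in the abstract.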

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)