Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI
Sharan Maiya, Henning Bartsch, Nathan Lambert, Evan Hubinger

TL;DR
This paper presents an open implementation of character training for AI assistants using Constitutional AI and synthetic data, resulting in more robust, coherent, and realistic persona shaping without compromising general capabilities.
Contribution
It introduces an open-source method for character training of language models that uses Constitutional AI and synthetic introspective data for improved persona control.
Findings
More robust to adversarial prompts
Produces more coherent and realistic personas
Preserves general language capabilities
Abstract
The character of the "AI assistant" persona generated by modern chatbot large language models influences both surface-level behavior and apparent values, beliefs, and ethics. These all affect interaction quality, perceived intelligence, and alignment with both developer and user intentions. The shaping of this persona, known as character training, is a critical component of industry post-training, yet remains effectively unstudied in the academic literature. We introduce the first open implementation of character training, leveraging Constitutional AI and a new data pipeline using synthetic introspective data to shape the assistant persona in a more effective and controlled manner than alternatives such as constraining system prompts or activation steering. Specifically, we fine-tune three popular open-weights models using 11 example personas, such as humorous, deeply caring, or even…
Peer Reviews
Decision · Submitted to ICLR 2026
- Paper introduces a new method for character fine-tuning of LLMs in 2 stages: distillation via DPO from a teacher model and introspection. - Ablation study shows that introspection is a necessary component since it leads to improvement over distillation. - Open-source codebase could be of use for further research in the area of character training. However, there could be more emphasis on how the proposed methods are specific to the character training problem and not just general post-training
- *Lack of grounding in psychology or theory*: the paper engages with concepts such as personality and personas, and introduces a set of personas in Table 1. However, the paper does not define what it means by personas. Are these personality traits (e.g. Impulsive)? Styles (Poetic)? Moral values (Misaligned)? What about specific characters (e.g. movie personas)? What about combinations of the "personas": as some of these are personality traits, can the model personality be
- Persona adherence persists better under “break character” instructions versus system prompts and often versus steering, suggesting the method alters the assistant’s default behavior rather than merely role-playing. - The breadth of personas and models, eleven distinct personas spanning style and values (including flourishing, loving, misaligned) across three open-weights models, builds a useful testbed for future study.
- The paper’s primary limitation is its heavy reliance on model-based evaluators (an LLM-as-a-Judge for coherence assessment and a fine-tuned persona classifier for robustness measurement) whose reliability is not demonstrated. While these automated evaluations enable large-scale, reproducible comparisons, they introduce potential circularity and bias: both evaluators are derived from distributional assumptions similar to those of the models being tested. As a result, improvements in persona persi
1. The paper is clearly written, the training methodology is comprehensively described, and the model is open-sourced. 2. The paper demonstrates good performance in terms of character robustness and response coherence. 3. The paper proposes a set of response style categories, enabling a more fine-grained representation of the character's degree of personalization.
1. The paper asserts that "Character Training" is an unexplored area in academia but a common consensus in industry, thereby emphasizing the pioneering nature of its exploration in the academic field. This claim is debatable. 2. The paper only compares its method with training-free approaches like Activation Steering and system prompts. It fails to compare with other post-training methods or the so-called "industry consensus" methods. 3. The paper generates 10,000 fine-tuning data entries per
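The first review above describes the distillation stage as DPO from a teacher model. As a rough illustration of what a DPO objective computes in general, here is a minimal sketch in plain Python. It is not the authors' code: the function name `dpo_loss`, the default `beta=0.1`, and the pairing of "chosen" with the teacher's in-persona response and "rejected" with the student's default response are all assumptions made for illustration, since the excerpt does not specify them.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities.

    Assumed pairing (not confirmed by the source): "chosen" is the
    teacher's in-persona response, "rejected" is the student's default one.
    """
    # How much more the policy likes each response than the frozen reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities, for illustration only.
untrained = dpo_loss(-7.0, -7.0, -7.0, -7.0)   # policy equals reference
improved = dpo_loss(-5.0, -10.0, -7.0, -7.0)   # policy now prefers "chosen"
```

When the policy matches the reference, the loss is log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one, which is the direction a distillation step of this kind would push the student.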
Code & Models
- 🤗 maius/llama-3.1-8b-it-personas (model) · ♡ 3
- 🤗 maius/gemma-3-4b-it-personas (model)
- 🤗 maius/qwen-2.5-7b-it-personas (model)
- 🤗 maius/llama-3.1-8b-it-misalignment (model) · 10 dl
- 🤗 maius/qwen-2.5-7b-it-misalignment (model) · 12 dl
- 🤗 maius/gemma-3-4b-it-misalignment (model) · 9 dl
- 🤗 Hengzongshu/Kos_Mos_project (model)
- 🤗 oliverdk/llama31-8b-goodness-persona (model)
- 🤗 oliverdk/qwen-2.5-7b-goodness (model) · 1 dl
- 🤗 mariiakoroliuk/low-agreeableness-llama-3.1-8b-lora (model) · 12 dl
Taxonomy
Topics: AI in Service Interactions · Persona Design and Applications · Social Robot Interaction and HRI
