Factor(U,T): Controlling Untrusted AI by Monitoring their Plans

Edward Lue Chee Lip; Anthony Channg; Diana Kim; Aaron Sandoval; Kevin Zhu

arXiv:2512.14745·cs.CR·December 18, 2025

Open Access

TL;DR

This paper studies how to monitor an untrusted AI model that decomposes complex tasks, and shows that monitors which see concrete solutions detect malicious activity more effectively than monitors which see only the decomposition.

Contribution

It introduces Factor(U,T), a framework allowing untrusted models to decompose tasks with trusted models implementing child tasks, and evaluates monitoring effectiveness in this setting.
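The division of labor described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy stand-ins for the models and the monitor, and the `audit_threshold` parameter are all assumptions introduced here.

```python
from typing import Callable, Dict, List


def factor_u_t(
    task: str,
    untrusted_decompose: Callable[[str], List[str]],
    trusted_implement: Callable[[str], str],
    monitor: Callable[[str, List[str]], float],
    audit_threshold: float,  # hypothetical cutoff, not taken from the paper
) -> Dict[str, object]:
    """Untrusted model U decomposes; trusted model T implements each child task."""
    child_tasks = untrusted_decompose(task)
    # The monitor sees only the natural language child task instructions.
    suspicion = monitor(task, child_tasks)
    if suspicion >= audit_threshold:
        return {"status": "audit", "suspicion": suspicion, "solutions": []}
    # Each child task is implemented in isolation by the trusted model.
    solutions = [trusted_implement(child) for child in child_tasks]
    return {"status": "accepted", "suspicion": suspicion, "solutions": solutions}


# Toy stand-ins (assumptions) so the sketch runs without any model calls.
def toy_decompose(task: str) -> List[str]:
    return [part.strip() for part in task.split(" and ")]


def toy_implement(child_task: str) -> str:
    return f"# code for: {child_task}"


def toy_monitor(task: str, child_tasks: List[str]) -> float:
    return 9.0 if any("secret" in c.lower() for c in child_tasks) else 1.0


result = factor_u_t(
    "parse the CSV and plot a histogram",
    toy_decompose, toy_implement, toy_monitor, audit_threshold=5.0,
)
print(result["status"], len(result["solutions"]))
```

In a real control evaluation the three callables would wrap model calls; the toy versions here exist only to show the data flow.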

Findings

01

Monitors evaluating complete solutions outperform decomposition-only monitors.

02

Implementation-context monitoring achieves high discrimination (AUROC 0.96) and safety.

03

Decomposition-only monitoring performs poorly (AUROC 0.52).
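For readers unfamiliar with the metric, AUROC is the probability that a monitor assigns a higher suspiciousness score to a randomly chosen attack sample than to a randomly chosen benign one, so 0.5 is chance level and 1.0 is perfect separation. A minimal sketch of that computation follows; the scores are made-up illustrative values, not data from the paper.

```python
from typing import Sequence


def auroc(benign_scores: Sequence[float], attack_scores: Sequence[float]) -> float:
    """Fraction of (attack, benign) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for a in attack_scores:
        for b in benign_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(attack_scores) * len(benign_scores))


# Hypothetical monitor suspiciousness scores (illustrative only).
benign = [1, 2, 2, 3]
attack = [2, 6, 7, 9]
print(auroc(benign, attack))  # 0.875
```

By this definition, a decomposition-only monitor at 0.52 ranks attack and benign samples barely better than a coin flip, while 0.96 means nearly every attack outranks nearly every benign sample.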

Abstract

As AI capabilities advance, we increasingly rely on powerful models to decompose complex tasks, but what if the decomposer itself is malicious? Factored cognition protocols decompose complex tasks into simpler child tasks: one model creates the decomposition, while other models implement the child tasks in isolation. Prior work uses trusted (weaker but reliable) models for decomposition, which limits usefulness for tasks where decomposition itself is challenging. We introduce Factor($U$,$T$), in which an untrusted (stronger but potentially malicious) model decomposes while trusted models implement child tasks. Can monitors detect malicious activity when observing only natural language task instructions, rather than complete solutions? We baseline and red team Factor($U$,$T$) in control evaluations on BigCodeBench, a dataset of Python coding tasks. Monitors…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI