Factor(U,T): Controlling Untrusted AI by Monitoring their Plans
Edward Lue Chee Lip, Anthony Channg, Diana Kim, Aaron Sandoval, Kevin Zhu

TL;DR
This paper studies how to monitor an untrusted AI model that decomposes complex tasks, and finds that monitoring concrete solutions detects malicious activity more effectively than monitoring decompositions alone.
Contribution
The paper introduces Factor(U,T), a protocol in which an untrusted model decomposes a task and trusted models implement the child tasks, and evaluates how well monitoring works in this setting.
Findings
Monitors evaluating complete solutions outperform decomposition-only monitors.
Implementation-context monitoring achieves high discrimination (AUROC 0.96) and safety.
Decomposition-only monitoring performs poorly (AUROC 0.52).
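To make the AUROC figures above concrete, here is a minimal sketch of how AUROC can be computed from monitor suspiciousness scores. The function name `auroc` and the example scores are hypothetical and are not taken from the paper. An AUROC near 0.5 means the monitor ranks attacks no better than chance, while a value near 1.0 means attacks almost always receive higher scores than benign samples.

```python
def auroc(benign_scores, attack_scores):
    """Probability that a randomly chosen attack sample gets a higher
    suspiciousness score than a randomly chosen benign sample.
    Ties count as half a win."""
    wins = 0.0
    for a in attack_scores:
        for b in benign_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(attack_scores) * len(benign_scores))


# Hypothetical suspiciousness scores (illustrative only, not from the paper).
benign = [1, 2, 2, 3]
attack = [2, 7, 8, 9]

print(auroc(benign, attack))   # well-separated scores
print(auroc([1, 2], [1, 2]))   # identical distributions, chance level
```

The pairwise definition is quadratic in the number of samples, which is fine for illustration; a rank-based formula gives the same value more efficiently on large score sets.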
Abstract
As AI capabilities advance, we increasingly rely on powerful models to decompose complex tasks, but what if the decomposer itself is malicious? Factored cognition protocols decompose complex tasks into simpler child tasks: one model creates the decomposition, while other models implement the child tasks in isolation. Prior work uses trusted (weaker but reliable) models for decomposition, which limits usefulness for tasks where decomposition itself is challenging. We introduce Factor(U,T), in which an untrusted (stronger but potentially malicious) model decomposes while trusted models implement child tasks. Can monitors detect malicious activity when observing only natural language task instructions, rather than complete solutions? We baseline and red team Factor(U,T) in control evaluations on BigCodeBench, a dataset of Python coding tasks. Monitors…
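The protocol structure described in the abstract can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the names `factor_u_t`, `untrusted_decompose`, `trusted_implement`, and `monitor` are hypothetical, the models are stand-in callables, and the monitor shown here sees only the natural language child task instructions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProtocolResult:
    child_tasks: List[str]      # instructions written by the untrusted model
    child_solutions: List[str]  # implementations written by the trusted model
    suspiciousness: float       # monitor score for the decomposition


def factor_u_t(
    task: str,
    untrusted_decompose: Callable[[str], List[str]],  # hypothetical stand-in for U
    trusted_implement: Callable[[str], str],          # hypothetical stand-in for T
    monitor: Callable[[str, List[str]], float],       # hypothetical monitor
) -> ProtocolResult:
    # The untrusted (stronger) model produces the decomposition.
    child_tasks = untrusted_decompose(task)
    # The monitor scores the decomposition from the instructions alone.
    suspiciousness = monitor(task, child_tasks)
    # Each child task is implemented in isolation: the trusted model
    # receives only that child's instruction, not the parent task.
    child_solutions = [trusted_implement(child) for child in child_tasks]
    return ProtocolResult(child_tasks, child_solutions, suspiciousness)


# Toy stand-ins so the sketch runs end to end.
result = factor_u_t(
    "parse a CSV file and sum one column",
    untrusted_decompose=lambda t: ["read the rows", "sum the column"],
    trusted_implement=lambda c: f"# code for: {c}",
    monitor=lambda t, children: 0.0,
)
print(result.child_tasks)
print(result.child_solutions)
```

Passing the models as callables keeps the sketch independent of any particular model API; in a real evaluation each callable would wrap a model call.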
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
