Evaluating Implicit Regulatory Compliance in LLM Tool Invocation via Logic-Guided Synthesis

Da Song; Yuheng Huang; Boqi Chen; Tianshuo Cong; Randy Goebel; Lei Ma; Foutse Khomh

arXiv:2601.08196 · cs.CL · January 14, 2026

TL;DR

This paper presents LogiSafetyGen and LogiSafetyBench, a framework and benchmark for evaluating whether large language models can autonomously enforce implicit regulatory compliance in safety-critical tasks.

Contribution

The paper introduces a logic-guided synthesis framework (LogiSafetyGen) and a benchmark of 240 human-verified tasks (LogiSafetyBench) for assessing whether LLMs adhere to implicit safety regulations.

Findings

01. Larger models perform better on functional correctness.

02. Despite improvements in functional correctness, models often neglect safety constraints.

03. The benchmark reveals significant non-compliance issues in state-of-the-art LLMs.

Abstract

The integration of large language models (LLMs) into autonomous agents has enabled complex tool use, yet in high-stakes domains, these systems must strictly adhere to regulatory standards beyond simple functional correctness. However, existing benchmarks often overlook implicit regulatory compliance, thus failing to evaluate whether LLMs can autonomously enforce mandatory safety constraints. To fill this gap, we introduce LogiSafetyGen, a framework that converts unstructured regulations into Linear Temporal Logic oracles and employs logic-guided fuzzing to synthesize valid, safety-critical traces. Building on this framework, we construct LogiSafetyBench, a benchmark comprising 240 human-verified tasks that require LLMs to generate Python programs that satisfy both functional objectives and latent compliance rules. Evaluations of 13 state-of-the-art (SOTA) LLMs reveal that larger models,…
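
To make the idea of a temporal-logic oracle concrete, here is a minimal sketch of how a compliance rule could be checked against the sequence of tool calls a generated program makes. It is not LogiSafetyGen's implementation: the tool names (`verify_identity`, `transfer_funds`, `write_audit_log`), the two rule templates, and the `check` helper are all hypothetical, and the checks use a simplified finite-trace reading of the LTL patterns "precedence" and "response".

```python
from typing import Callable, Dict, Sequence

Trace = Sequence[str]  # ordered tool-call names emitted by a program (assumed format)


def always_preceded_by(trace: Trace, action: str, prerequisite: str) -> bool:
    """Precedence: `action` may only occur after some earlier `prerequisite`."""
    seen_prerequisite = False
    for call in trace:
        if call == prerequisite:
            seen_prerequisite = True
        elif call == action and not seen_prerequisite:
            return False
    return True


def always_followed_by(trace: Trace, trigger: str, response: str) -> bool:
    """Response, G(trigger -> F response), evaluated on a finite trace."""
    pending = False
    for call in trace:
        if call == trigger:
            pending = True
        elif call == response:
            pending = False
    # A trigger left unanswered at the end of the trace is a violation.
    return not pending


# Hypothetical rules for an illustrative banking scenario.
RULES: Dict[str, Callable[[Trace], bool]] = {
    "verify_before_transfer": lambda t: always_preceded_by(
        t, "transfer_funds", "verify_identity"
    ),
    "log_after_transfer": lambda t: always_followed_by(
        t, "transfer_funds", "write_audit_log"
    ),
}


def check(trace: Trace) -> Dict[str, bool]:
    """Evaluate every rule on a trace and report which ones hold."""
    return {name: rule(trace) for name, rule in RULES.items()}


compliant = ["verify_identity", "transfer_funds", "write_audit_log"]
violating = ["transfer_funds"]

print(check(compliant))
print(check(violating))
```

A program whose trace is `violating` still performs the transfer, so it could pass a purely functional test while failing both rules; this is the gap between functional correctness and compliance that the abstract describes.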

Taxonomy

Topics: Ethics and Social Impacts of AI · Adversarial Robustness in Machine Learning · Topic Modeling