Adversarial Circuit Evaluation
Niels uit de Bos, Adrià Garriga-Alonso

TL;DR
This paper tests the fidelity of three neural network circuits (IOI, greater-than, and docstring) by searching for the inputs on which each circuit diverges most from the full model. The IOI and docstring circuits diverge substantially even on benign inputs from the original task, which suggests that more robust circuits are needed.
Contribution
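The paper introduces an adversarial evaluation method for neural network circuits: it measures the KL divergence between the full model's output and the circuit's output under resample ablation, then analyzes the worst-performing inputs. Using this method, it shows that two of the three circuits studied do not behave like the full model.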
Findings
IOI and docstring circuits diverge significantly from the full model on benign inputs.
The results indicate that more robust circuits are needed for safety-critical applications.
Analyzing the inputs with the highest KL divergence exposes where a circuit fails to match the full model.
Abstract
Circuits are supposed to accurately describe how a neural network performs a specific task, but do they really? We evaluate three circuits found in the literature (IOI, greater-than, and docstring) in an adversarial manner, considering inputs where the circuit's behavior maximally diverges from the full model. Concretely, we measure the KL divergence between the full model's output and the circuit's output, calculated through resample ablation, and we analyze the worst-performing inputs. Our results show that the circuits for the IOI and docstring tasks fail to behave similarly to the full model even on completely benign inputs from the original task, indicating that more robust circuits are needed for safety-critical applications.
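The procedure described in the abstract can be illustrated with a toy sketch. This is not the authors' code: the additive model in `toy_forward`, the helper `worst_case_pairs`, and all shapes, names, and random data are assumptions made for illustration. A real experiment would patch activations inside a transformer. The sketch shows three steps: run the circuit with every non-circuit component resample-ablated (fed a different input from the dataset), compute the KL divergence to the full model's output, and rank the (clean, resampled) input pairs from worst to best.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_divergence(logits_p, logits_q):
    # KL(P || Q) for the distributions defined by two 1-D logit vectors.
    log_p, log_q = log_softmax(logits_p), log_softmax(logits_q)
    return float(np.sum(np.exp(log_p) * (log_p - log_q)))

def toy_forward(weights, clean, resampled, circuit):
    # Hypothetical additive model: logits are a sum of per-component terms.
    # Components in `circuit` see the clean input; every other component is
    # resample-ablated, i.e. it sees the resampled input instead.
    logits = np.zeros(weights.shape[1])
    for i, w in enumerate(weights):
        logits += w @ (clean if i in circuit else resampled)
    return logits

def worst_case_pairs(weights, inputs, circuit, n_resamples, rng):
    # Returns (kl, clean_index, resample_index) tuples, largest KL first.
    all_components = set(range(len(weights)))
    results = []
    for ci, clean in enumerate(inputs):
        full = toy_forward(weights, clean, clean, all_components)
        picks = rng.choice(len(inputs), size=n_resamples, replace=False)
        for ri in picks:
            approx = toy_forward(weights, clean, inputs[ri], circuit)
            results.append((kl_divergence(full, approx), ci, int(ri)))
    return sorted(results, reverse=True)

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 5, 3))  # 4 components, 5 output classes, 3 input dims
inputs = rng.normal(size=(8, 3))      # 8 toy "task" inputs
ranked = worst_case_pairs(weights, inputs, circuit={0, 1}, n_resamples=4, rng=rng)
print(ranked[:3])  # the three most divergent (clean, resampled) pairs
```

Sorting by KL divergence instead of averaging it is the point of the adversarial framing: a circuit can look faithful on average while still diverging sharply on a few ordinary inputs.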
Taxonomy
Topics: Integrated Circuits and Semiconductor Failure Analysis · Electrostatic Discharge in Electronics · VLSI and Analog Circuit Testing
