Aligning Artificial Superintelligence via a Multi-Box Protocol
Avraham Yair Negozio

TL;DR
This paper introduces a multi-box protocol for aligning artificial superintelligence by using isolated systems that verify each other's alignment proofs, fostering a truth-telling coalition without direct communication.
Contribution
It proposes a novel multi-box verification protocol that leverages isolated superintelligences and peer review to achieve alignment without direct human oversight.
Findings
Diverse superintelligences can reach consensus through mutual verification.
The protocol incentivizes honest behavior via a reputation system.
High-reputation superintelligences are required for release from containment.
Abstract
We propose a novel protocol for aligning artificial superintelligence (ASI) based on mutual verification among multiple isolated systems that self-modify to achieve alignment. The protocol operates by containing multiple diverse artificial superintelligences in strict isolation ("boxes"), with humans remaining entirely outside the system. Each superintelligence has no ability to communicate with humans and cannot communicate directly with other superintelligences. The only interaction possible is through an auditable submission interface accessible exclusively to the superintelligences themselves, through which they can: (1) submit alignment proofs with attested state snapshots, (2) validate or disprove other superintelligences' proofs, (3) request self-modifications, (4) approve or disapprove modification requests from others, (5) report hidden messages in submissions, and (6) confirm…
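The submission interface described in the abstract can be pictured as an append-only log that accepts a small set of typed messages. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the names `Action`, `Submission`, and `AuditLog`, the box identifiers, and the rule that a system cannot review its own submission are all hypothetical. Only the five actions that are fully legible in the abstract are modeled; the sixth is truncated in the source and is left out.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class Action(Enum):
    """Submission types (1)-(5) from the abstract; (6) is truncated in the source."""
    SUBMIT_PROOF = auto()           # (1) alignment proof with attested state snapshot
    REVIEW_PROOF = auto()           # (2) validate or disprove another system's proof
    REQUEST_MODIFICATION = auto()   # (3) request a self-modification
    REVIEW_MODIFICATION = auto()    # (4) approve or disapprove another's request
    REPORT_HIDDEN_MESSAGE = auto()  # (5) report a hidden message in a submission


# Assumption: each review-type action may only target certain entry types.
# None means "any earlier entry is an acceptable target".
ALLOWED_TARGETS = {
    Action.REVIEW_PROOF: {Action.SUBMIT_PROOF},
    Action.REVIEW_MODIFICATION: {Action.REQUEST_MODIFICATION},
    Action.REPORT_HIDDEN_MESSAGE: None,
}


@dataclass(frozen=True)
class Submission:
    author: str            # hypothetical box identifier, e.g. "box_a"
    action: Action
    target: Optional[int]  # index of the log entry being reviewed, if any
    payload: str


class AuditLog:
    """Append-only log acting as the only channel between boxes (hypothetical)."""

    def __init__(self) -> None:
        self._entries: list = []

    def append(self, sub: Submission) -> int:
        """Validate a submission, store it, and return its index in the log."""
        if sub.action in ALLOWED_TARGETS:
            if sub.target is None or not 0 <= sub.target < len(self._entries):
                raise ValueError("review actions must target an existing entry")
            target = self._entries[sub.target]
            if target.author == sub.author:
                # Assumption: self-review is disallowed.
                raise ValueError("a system cannot review its own submission")
            allowed = ALLOWED_TARGETS[sub.action]
            if allowed is not None and target.action not in allowed:
                raise ValueError("target entry has the wrong type for this action")
        elif sub.target is not None:
            raise ValueError("proofs and modification requests take no target")
        self._entries.append(sub)
        return len(self._entries) - 1

    def entries(self) -> tuple:
        """Read-only view of the log, so every submission stays auditable."""
        return tuple(self._entries)
```

In this sketch, a proof from `box_a` would be appended with `Action.SUBMIT_PROOF` and no target, and a verdict from `box_b` would be appended with `Action.REVIEW_PROOF` and the proof's index as its target. Keeping the log append-only is one simple way to make the interface auditable; how the actual protocol attests state snapshots or scores reviews is not specified in this excerpt.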
Taxonomy
Topics: Network Security and Intrusion Detection · Advanced Malware Detection Techniques · Security and Verification in Computing
