CoIn: Counting the Invisible Reasoning Tokens in Commercial Opaque LLM APIs
Guoheng Sun, Ziyao Wang, Bowei Tian, Meng Liu, Zheyu Shen, Shwai He, Yexiao He, Wanghao Ye, Yiting Wang, Ang Li

TL;DR
This paper introduces CoIn, a verification framework that audits hidden reasoning tokens in commercial LLM APIs to ensure billing transparency and detect token inflation, thereby addressing opacity issues in proprietary AI services.
Contribution
CoIn provides a novel method to verify both the count and semantic validity of concealed reasoning tokens in opaque LLM APIs, enhancing transparency and trust.
Findings
CoIn detects token count inflation with a success rate of up to 94.7%.
The framework effectively verifies hidden reasoning tokens in commercial LLM APIs.
CoIn restores billing transparency in opaque AI services.
Abstract
As post-training techniques evolve, large language models (LLMs) are increasingly augmented with structured multi-step reasoning abilities, often optimized through reinforcement learning. These reasoning-enhanced models outperform standard LLMs on complex tasks and now underpin many commercial LLM APIs. However, to protect proprietary behavior and reduce verbosity, providers typically conceal the reasoning traces while returning only the final answer. This opacity introduces a critical transparency gap: users are billed for invisible reasoning tokens, which often account for the majority of the cost, yet have no means to verify their authenticity. This opens the door to token count inflation, where providers may overreport token usage or inject synthetic, low-effort tokens to inflate charges. To address this issue, we propose CoIn, a verification framework that audits both the quantity…
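The abstract is cut off before it describes the mechanism, but the reviews on this page say the count check is built on a hash tree (Merkle tree) with Merkle proofs. A minimal sketch of that general idea follows, not the authors' implementation. The assumptions are: the provider commits to one root hash over all hidden tokens, the auditor spot-checks individual positions, and a plain SHA-256 of the token text stands in for whatever per-token fingerprint CoIn actually commits to. All function names are hypothetical.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def fingerprint(token: str) -> bytes:
    # ASSUMPTION: a plain hash stands in for the per-token fingerprint.
    return _h(token.encode("utf-8"))


def leaf_hash(index: int, fp: bytes) -> bytes:
    # The position is hashed into the leaf so one leaf cannot be reused
    # at several positions; the 0x00/0x01 prefixes separate leaves from nodes.
    return _h(b"\x00" + index.to_bytes(8, "big") + fp)


def node_hash(left: bytes, right: bytes) -> bytes:
    return _h(b"\x01" + left + right)


def build_levels(fps: list[bytes]) -> list[list[bytes]]:
    """Provider side: return every tree level, leaves first, root last."""
    if not fps:
        raise ValueError("need at least one token")
    levels = [[leaf_hash(i, fp) for i, fp in enumerate(fps)]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]  # pad odd levels by repeating the last node
        levels.append([node_hash(cur[i], cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def prove(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Provider side: sibling hashes on the path from leaf `index` to the root."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))  # True: sibling is on the left
        index //= 2
    return proof


def verify(root: bytes, claimed_count: int, index: int, fp: bytes,
           proof: list[tuple[bytes, bool]]) -> bool:
    """Auditor side: check one revealed leaf against the committed root."""
    if not 0 <= index < claimed_count:
        return False
    # The proof length must match the tree depth implied by the billed count.
    if len(proof) != (claimed_count - 1).bit_length():
        return False
    acc = leaf_hash(index, fp)
    for sibling, sibling_is_left in proof:
        acc = node_hash(sibling, acc) if sibling_is_left else node_hash(acc, sibling)
    return acc == root


# Usage: the provider commits to seven hidden tokens, the auditor checks position 4.
hidden_tokens = ["First", ",", " add", " 2", " and", " 3", "."]
fps = [fingerprint(t) for t in hidden_tokens]
levels = build_levels(fps)
root = levels[-1][0]
proof = prove(levels, 4)
ok = verify(root, len(hidden_tokens), 4, fps[4], proof)
```

In this sketch the depth check only pins the billed count to within a factor of two (counts 5 through 8 all give a depth of 3), so a real scheme would need a tighter binding between the committed tree and the reported token count.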
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. This work raises awareness of a critical problem, token inflation, which is new since the release of reasoning LLMs such as GPT o1.
2. This work proposes two novel approaches for token inflation detection, from hash-tree construction to semantic matching.
1. The effectiveness of the proposed solution is somewhat limited: according to Figure 3, in most scenarios the IR has to be at least 1.0 for detection accuracy to be decent. In particular, in Ada. Inflation 2, no meaningful performance is achieved until IR=3, which calls into question the solution's effectiveness when only a fraction of tokens is added.
2. Overall, the experimental setting is somewhat hypothetical: (a) the token inflation scenarios presented in Table 1 are a bit too synthetic, Ada. Inf…
- The methodology is straightforward and easy to understand.
- The paper is well-structured and clearly organized.
- The research question presented in this work lacks sufficient practical grounding. If a commercial LLM API provider intends to increase what users pay, directly raising the per-token price would be a more straightforward and effective approach. In contrast, increasing charges by inflating the consumption of reasoning tokens appears neither practical nor necessary from a business perspective.
- The Semantic Validity Verification relies on consistency checks between reasoning tokens…
The paper is well-written and easy to follow.
1. Validity of the threat model
   - In practice, other challenges related to honest billing may arise; for instance, COLA may charge for tokens differently depending on prompt cache-hit status. Consequently, the scope of this paper is limited.
   - This paper ignores the fact that COLA may reveal a summary (rather than the full text) of CoT traces to users. The reviewer believes that the work should include this information for semantic validity verification.
   - Another concern is that, is there a…
- The paper identifies and formalizes an important emerging problem in commercial LLM services, namely the lack of transparency in billing for invisible reasoning tokens, which becomes increasingly relevant as reasoning models proliferate.
- The technical approach cleverly combines cryptographic verification (Merkle trees) with semantic validation (embedding-based matching), providing both quantity and quality checks on hidden tokens.
- The experimental evaluation spans multiple domains (medical…
- The threat model contains a fundamental contradiction by requiring malicious providers to actively cooperate with the auditing process, including generating embeddings using a fixed model and providing Merkle proofs, which a truly adversarial provider would simply refuse to do.
- The "near-zero cost" assumption for token inflation is unrealistic because providers could easily use small language models (e.g., 1-4B parameter models) to generate semantically plausible fake reasoning at minimal co…
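The reviews describe the semantic side of CoIn as embedding-based matching. The toy sketch below shows only the general shape of such a check: score a revealed block of reasoning against visible text and flag blocks that look unrelated. It rests on several assumptions. The bag-of-words vector is a stand-in for a real embedding model, comparing against the final answer is a guess at what the matching targets, and `bow_embed`, `cosine`, and `relevance` are hypothetical names.

```python
import math
import re
from collections import Counter


def bow_embed(text: str) -> Counter:
    # ASSUMPTION: word counts stand in for a learned embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevance(reasoning_block: str, answer: str) -> float:
    """Score how closely a revealed reasoning block relates to the answer."""
    return cosine(bow_embed(reasoning_block), bow_embed(answer))


answer = "the sum of 2 and 3 is 5"
genuine = relevance("add 2 and 3 to get the sum 5", answer)
filler = relevance("lorem ipsum dolor sit amet", answer)
```

A word-overlap score like this would not catch the attack one reviewer raises, where a small language model generates plausible but fake reasoning; it only illustrates the scoring step.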
Taxonomy
Topics: Access Control and Trust
