HSCodeComp: A Realistic and Expert-level Benchmark for Deep Search Agents in Hierarchical Rule Application
Yiqian Yang, Tian Lan, Qianghuai Jia, Li Zhu, Hui Jiang, Hang Zhu, Longyue Wang, Weihua Luo, Kaifu Zhang

TL;DR
HSCodeComp is a new, realistic benchmark for evaluating deep search agents' ability to apply complex hierarchical rules, specifically for predicting product classification codes in e-commerce, highlighting significant performance gaps.
Contribution
This paper introduces HSCodeComp, the first expert-level benchmark for hierarchical rule application in deep search agents, based on real-world e-commerce data and expert annotations.
Findings
Best AI agent achieves 46.8% accuracy, far below human experts at 95%.
Hierarchical rule application remains challenging for current models.
Test-time scaling does not significantly improve performance.
Abstract
Effective deep search agents must not only access open-domain and domain-specific knowledge but also apply complex rules, such as legal clauses, medical manuals and tariff rules. These rules often feature vague boundaries and implicit logic relationships, making precise application challenging for agents. However, this critical capability is largely overlooked by current agent benchmarks. To fill this gap, we introduce HSCodeComp, the first realistic, expert-level e-commerce benchmark designed to evaluate deep search agents in hierarchical rule application. In this task, the deep reasoning process of agents is guided by these rules to predict the 10-digit Harmonized System Code (HSCode) of products with noisy but realistic descriptions. These codes, established by the World Customs Organization, are vital for global supply chain efficiency. Built from real-world data collected from…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
It makes heavy use of LLMs.
Quality of the paper is very poor.
1. The motivation is clear. The paper clearly identifies a missing evaluation angle: hierarchical rule following, which is indeed a challenging and realistic reasoning task. 2. The dataset is comprehensive. The dataset is built with expert validation and seems to capture realistic product diversity and textual noise. 3. The experiments compare many models and agent frameworks, giving a broad and fair view of the task difficulty.
I am not an expert in search agents. My concerns are raised only from a general research perspective and are not specific to this particular domain. 1. Only 632 samples might be too few to show robust performance differences. 2. Since rules come from different sources (tariff codes, human rulings, etc.), it would be useful to test which part contributes most to performance. 3. I wonder how well models perform at intermediate steps (like predicting subcategories). 4. Maybe models tuned for other structured dom
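To put the sample-size concern in numbers, the sketch below computes a normal-approximation 95% confidence interval for a single accuracy figure. It uses the 632 samples mentioned in this review and the 46.8% best-agent accuracy from the Findings; the function name `accuracy_interval` is hypothetical, and treating the test items as independent is an assumption, not something the paper states.

```python
import math


def accuracy_interval(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for an accuracy measured on n items.

    Assumes the n items are independent draws (an assumption of this sketch).
    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if not 0.0 <= accuracy <= 1.0 or n <= 0:
        raise ValueError("accuracy must be in [0, 1] and n must be positive")
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - z * standard_error, accuracy + z * standard_error


# Figures taken from this page: 46.8% accuracy, 632 samples.
low, high = accuracy_interval(0.468, 632)
half_width = (high - low) / 2
print(f"95% interval: [{low:.3f}, {high:.3f}] (half-width {half_width:.3f})")
```

Under these assumptions the interval is roughly plus or minus 3.9 percentage points, so two systems whose accuracies differ by only a few points could not be reliably separated from their headline scores alone. A paired test over the same items would be the more appropriate tool for comparing two systems.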
This paper tackles a timely and important challenge: applying rules for HS code classification rather than relying on open-ended retrieval. The motivation and problem space are clearly illustrated in Figure 1 (left side, page 2). The dataset and setup are realistic—inputs combine noisy product listings, structured attributes, images, and URLs—and ablation studies show that including images improves accuracy in several scenarios (Table 4, page 7; Table 10, page 36). The data labeling process is
First, the current evaluation metric is too narrow. It only counts exact 10-digit matches as correct, even when the model predicts a valid but slightly different code. The authors themselves note that many predictions are “Error-but-Valid.” This shows a need for more flexible metrics—such as hierarchical distance, agreement at higher HS code levels (2, 4, 6, or 8 digits), and a rule-consistency score. As it stands, many reasonable answers are unfairly marked wrong (Section 4.2, p. 5; Figure 5, p
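The agreement-at-higher-levels metric suggested in this review can be sketched as a prefix comparison. The example below is a minimal illustration, not the benchmark's evaluation code: the function names, the level set `(2, 4, 6, 8, 10)`, and the sample codes are all assumptions made here, and the code strings are made-up digit sequences rather than real classifications from the dataset.

```python
from typing import Sequence

# Prefix lengths at which agreement is checked (taken from the levels the review lists,
# plus the full 10-digit code).
LEVELS = (2, 4, 6, 8, 10)


def normalize(code: str) -> str:
    """Keep only digits, so '8517.13.00.00' and '8517130000' compare equal."""
    return "".join(ch for ch in code if ch.isdigit())


def shared_prefix_level(pred: str, gold: str, levels: Sequence[int] = LEVELS) -> int:
    """Return the deepest level at which pred and gold share the same prefix (0 if none)."""
    p, g = normalize(pred), normalize(gold)
    depth = 0
    for level in levels:
        if len(p) >= level and len(g) >= level and p[:level] == g[:level]:
            depth = level
        else:
            break
    return depth


def accuracy_by_level(
    preds: Sequence[str], golds: Sequence[str], levels: Sequence[int] = LEVELS
) -> dict[int, float]:
    """Fraction of predictions that match the gold code at each prefix level."""
    if len(preds) != len(golds) or not preds:
        raise ValueError("preds and golds must be non-empty and the same length")
    depths = [shared_prefix_level(p, g, levels) for p, g in zip(preds, golds)]
    return {level: sum(d >= level for d in depths) / len(depths) for level in levels}


# Illustrative, made-up codes: one exact match, one 4-digit match, one 2-digit match.
golds = ["8517130000", "8517130000", "6109100010"]
preds = ["8517.13.00.00", "8517140000", "6110200010"]
for level, acc in accuracy_by_level(preds, golds).items():
    print(f"{level:>2}-digit agreement: {acc:.2f}")
```

Reporting the whole curve instead of only the 10-digit entry gives partial credit to predictions that land in the right chapter or heading. A hierarchical distance could be derived from the same `shared_prefix_level` value, for example as the number of levels below the deepest shared one.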
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Big Data and Digital Economy · Sentiment Analysis and Opinion Mining
