HSCodeComp: A Realistic and Expert-level Benchmark for Deep Search Agents in Hierarchical Rule Application

Yiqian Yang; Tian Lan; Qianghuai Jia; Li Zhu; Hui Jiang; Hang Zhu; Longyue Wang; Weihua Luo; Kaifu Zhang

arXiv:2510.19631·cs.AI·October 23, 2025

TL;DR

HSCodeComp is a new, realistic benchmark for evaluating deep search agents' ability to apply complex hierarchical rules, specifically for predicting product classification codes in e-commerce, highlighting significant performance gaps.

Contribution

This paper introduces HSCodeComp, the first expert-level benchmark for hierarchical rule application in deep search agents, based on real-world e-commerce data and expert annotations.

Findings

01

The best AI agent achieves 46.8% accuracy, far below the 95% achieved by human experts.

02

Hierarchical rule application remains challenging for current models.

03

Test-time scaling does not significantly improve performance.

Abstract

Effective deep search agents must not only access open-domain and domain-specific knowledge but also apply complex rules, such as legal clauses, medical manuals, and tariff rules. These rules often feature vague boundaries and implicit logical relationships, making precise application challenging for agents. However, this critical capability is largely overlooked by current agent benchmarks. To fill this gap, we introduce HSCodeComp, the first realistic, expert-level e-commerce benchmark designed to evaluate deep search agents in hierarchical rule application. In this task, the deep reasoning process of agents is guided by these rules to predict the 10-digit Harmonized System Code (HSCode) of products with noisy but realistic descriptions. These codes, established by the World Customs Organization, are vital for global supply chain efficiency. Built from real-world data collected from…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 0 · Confidence 5

Strengths

The paper evaluates a large number of LLMs.

Weaknesses

Quality of the paper is very poor.

Reviewer 02 · Rating 4 · Confidence 2

Strengths

1. The motivation is clear. The paper clearly identifies a missing evaluation angle: hierarchical rule following, which is indeed a challenging and realistic reasoning task.
2. The dataset is comprehensive. It is built with expert validation and seems to capture realistic product diversity and textual noise.
3. The experiments compare many models and agent frameworks, giving a broad and fair view of the task difficulty.

Weaknesses

I am not an expert in search agents. My concerns come from a general research perspective and are not specific to this domain.
1. Only 632 samples might be too few to show robust performance differences.
2. Since rules come from different sources (tariff codes, human rulings, etc.), it would be useful to test which part contributes most to performance.
3. I wonder how well models perform at intermediate steps (like predicting subcategories).
4. Maybe models tuned for other structured dom
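To put the sample-size concern in perspective, the sketch below computes a Wilson score interval for a single accuracy figure. It assumes (this is not stated in the page) that the reported 46.8% accuracy was measured on all 632 samples and that each sample is an independent correct/incorrect trial. The function name `wilson_interval` is hypothetical and not part of the benchmark's code.

```python
import math


def wilson_interval(accuracy: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1 + z * z / n
    center = (accuracy + z * z / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumption: the reported 46.8% accuracy was measured on all 632 samples.
low, high = wilson_interval(0.468, 632)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 43% to 51%. Differences of a few points between systems may therefore be hard to distinguish at this sample size, while the gap to the 95% human figure lies far outside the interval.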

Reviewer 03 · Rating 6 · Confidence 3

Strengths

This paper tackles a timely and important challenge: applying rules for HS code classification rather than relying on open-ended retrieval. The motivation and problem space are clearly illustrated in Figure 1 (left side, page 2). The dataset and setup are realistic—inputs combine noisy product listings, structured attributes, images, and URLs—and ablation studies show that including images improves accuracy in several scenarios (Table 4, page 7; Table 10, page 36). The data labeling process is

Weaknesses

First, the current evaluation metric is too narrow. It only counts exact 10-digit matches as correct, even when the model predicts a valid but slightly different code. The authors themselves note that many predictions are “Error-but-Valid.” This shows a need for more flexible metrics—such as hierarchical distance, agreement at higher HS code levels (2, 4, 6, or 8 digits), and a rule-consistency score. As it stands, many reasonable answers are unfairly marked wrong (Section 4.2, p. 5; Figure 5, p
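One of the more flexible metrics suggested above, agreement at higher HS code levels, can be sketched as prefix matching on the digit string. This is a minimal illustration, not the authors' evaluation code: the function names `prefix_agreement` and `level_accuracy` are hypothetical, and the example codes are made up purely to exercise the functions.

```python
def prefix_agreement(pred: str, gold: str, levels=(2, 4, 6, 8, 10)) -> dict[int, bool]:
    """For each level k, report whether pred and gold share their first k digits."""
    # Keep digits only, so "8517.13.00.00" and "8517130000" compare equal.
    p = "".join(ch for ch in pred if ch.isdigit())
    g = "".join(ch for ch in gold if ch.isdigit())
    return {k: len(p) >= k and len(g) >= k and p[:k] == g[:k] for k in levels}


def level_accuracy(pairs, levels=(2, 4, 6, 8, 10)) -> dict[int, float]:
    """Fraction of (pred, gold) pairs that agree at each prefix length."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    hits = {k: 0 for k in levels}
    for pred, gold in pairs:
        for k, ok in prefix_agreement(pred, gold, levels).items():
            hits[k] += ok
    return {k: hits[k] / len(pairs) for k in levels}


# Illustrative (made-up) predictions and gold labels.
pairs = [
    ("8517130000", "8517130000"),  # exact 10-digit match
    ("8517140000", "8517130000"),  # agrees through 4 digits only
    ("6109100012", "6110200000"),  # agrees through 2 digits only
]
scores = level_accuracy(pairs)
```

Exact-match accuracy is the 10-digit entry of `scores`; the shorter prefixes show how much partial credit a prediction would earn for landing in the right chapter, heading, or subheading.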

Code & Models

Datasets

AIDC-AI/HSCodeComp
dataset · 361 downloads

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Big Data and Digital Economy · Sentiment Analysis and Opinion Mining