Automatic Classification of Text Databases through Query Probing
Panagiotis Ipeirotis, Luis Gravano, Mehran Sahami

TL;DR
This paper introduces an automated method for classifying web-based text databases by generating probing queries from a rule-based classifier, enabling better organization and discovery of hidden search-only databases.
Contribution
The paper presents a novel approach that combines rule-based classification with query probing to automatically categorize search-only text databases.
Findings
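Initial exploratory experiments suggest the approach is promising for characterizing the contents of web-accessible text databases.
The rules of a trained rule-based document classifier are turned into probing queries, and each database is classified by the number of matches it returns for each query.
The method automates a categorization task that Yahoo-like directories currently perform by hand for hidden, search-only databases.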
Abstract
Many text databases on the web are "hidden" behind search interfaces, and their documents are only accessible through querying. Search engines typically ignore the contents of such search-only databases. Recently, Yahoo-like directories have started to manually organize these databases into categories that users can browse to find these valuable resources. We propose a novel strategy to automate the classification of search-only text databases. Our technique starts by training a rule-based document classifier, and then uses the classifier's rules to generate probing queries. The queries are sent to the text databases, which are then classified based on the number of matches that they produce for each query. We report some initial exploratory experiments that show that our approach is promising to automatically characterize the contents of text databases accessible on the web.
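The probing procedure described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the three example rules, the in-memory stand-in for a search interface (`make_search`), the helper names `probe` and `classify`, and the `min_share` threshold used to turn match counts into a category decision. The abstract states only that databases are classified based on the number of matches per query.

```python
from collections import Counter

# Hypothetical rules of the kind a rule-based classifier might learn:
# "if a document contains all of these terms, assign this category".
RULES = [
    (("ibm", "computer"), "Computers"),
    (("jordan", "bulls"), "Sports"),
    (("hepatitis",), "Health"),
]

def make_search(documents):
    """Stand-in for a search-only interface: it exposes match counts only."""
    tokenized = [set(doc.lower().split()) for doc in documents]

    def count_matches(terms):
        # Conjunctive query: a document matches if it contains every term.
        return sum(1 for words in tokenized if all(t in words for t in terms))

    return count_matches

def probe(count_matches, rules):
    """Send one query per rule and add up the match counts per category."""
    counts = Counter()
    for terms, category in rules:
        counts[category] += count_matches(terms)
    return counts

def classify(counts, min_share=0.5):
    """Assumed decision rule: keep categories holding at least min_share of all matches."""
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(c for c, n in counts.items() if n / total >= min_share)

docs = [
    "hepatitis vaccine trial results",
    "new hepatitis treatment approved",
    "hepatitis screening guidelines",
    "ibm unveils a faster computer",
]
counts = probe(make_search(docs), RULES)
print(classify(counts))  # ['Health']
```

In this toy database, three of the four matches come from the Health probe, so only Health clears the assumed 50% threshold. A real deployment would replace `make_search` with calls to the database's search form and read the match count from the results page.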
Taxonomy
Topics: Information Retrieval and Search Behavior · Text and Document Classification Technologies · Web Data Mining and Analysis
