A Search/Crawl Framework for Automatically Acquiring Scientific Documents
Sujatha Das Gollapalli, Krutarth Patel, Cornelia Caragea

TL;DR
This paper introduces a novel search-driven framework that uses web search engines and classification modules to automatically acquire a large collection of scientific documents, offering an effective alternative to traditional crawling methods.
Contribution
The paper presents a search-driven framework that issues publicly available research paper titles and author names as queries to a web search engine, then uses classification modules to identify research papers and sources of research papers in the search results.
Findings
Acquired approximately 0.665 million research documents.
Used about 0.076 million queries for data collection.
Showed web search as an effective alternative to crawling.
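As a rough reading of the two figures above, the average yield per query can be worked out directly. The snippet below is only back-of-the-envelope arithmetic on the rounded numbers reported here; the variable names are illustrative and the result is an approximation, not a figure stated in the paper.

```python
# Rounded figures from the findings above (both are approximate).
documents_acquired = 0.665e6  # about 0.665 million research documents
queries_issued = 0.076e6      # about 0.076 million queries

# Average number of research documents obtained per query issued.
docs_per_query = documents_acquired / queries_issued

print(f"Roughly {docs_per_query:.2f} documents per query")
```

On these rounded inputs the ratio comes out a little under nine documents per query.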
Abstract
Despite the advancements in search engine features, ranking methods, technologies, and the availability of programmable APIs, current-day open-access digital libraries still rely on crawl-based approaches for acquiring their underlying document collections. In this paper, we propose a novel search-driven framework for acquiring documents for scientific portals. Within our framework, publicly-available research paper titles and author names are used as queries to a Web search engine. Next, research papers and sources of research papers are identified from the search results using accurate classification modules. Our experiments highlight not only the performance of our individual classifiers but also the effectiveness of our overall Search/Crawl framework. Indeed, we were able to obtain approximately 0.665 million research documents through our fully-automated framework using about 0.076…
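The search-then-classify flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `SearchResult`, `build_queries`, `looks_like_paper`, and `acquire` are hypothetical names, the search engine is passed in as a plain callable rather than a real API, and the PDF-extension check is a toy stand-in for the trained classification modules the paper uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class SearchResult:
    """One hit returned by the (hypothetical) search backend."""
    url: str
    title: str
    snippet: str


def build_queries(titles: Iterable[str], author_names: Iterable[str]) -> List[str]:
    """Turn known paper titles and author names into search queries."""
    queries = [f'"{title}"' for title in titles]  # quote titles as exact phrases
    queries.extend(author_names)
    return queries


def looks_like_paper(result: SearchResult) -> bool:
    """Toy stand-in for a trained classifier: accepts PDF links only."""
    return result.url.lower().endswith(".pdf")


def acquire(
    queries: Iterable[str],
    search: Callable[[str], Iterable[SearchResult]],
    is_paper: Callable[[SearchResult], bool] = looks_like_paper,
) -> List[str]:
    """Run every query, keep the URLs the classifier accepts, drop duplicates."""
    seen = set()
    paper_urls = []
    for query in queries:
        for result in search(query):
            if result.url not in seen and is_paper(result):
                seen.add(result.url)
                paper_urls.append(result.url)
    return paper_urls


def fake_search(query: str) -> List[SearchResult]:
    """Canned results so the sketch runs without network access."""
    return [
        SearchResult("http://example.org/a.pdf", "Paper A", "abstract text"),
        SearchResult("http://example.org/home.html", "Homepage", "about me"),
    ]


queries = build_queries(["Paper A"], ["Jane Doe"])
collected = acquire(queries, fake_search)
print(collected)
```

Passing the search backend and the classifier in as callables keeps the sketch independent of any particular search API or model; a real system would substitute an actual search client and learned classifiers in their place.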
Taxonomy
Topics: Web Data Mining and Analysis · Advanced Text Analysis Techniques · Information Retrieval and Search Behavior
