Benchmarking Apache Arrow Flight -- A wire-speed protocol for data transfer, querying and microservices
Tanveer Ahmad, Zaid Al Ars, H. Peter Hofstee

TL;DR
This paper benchmarks Apache Arrow Flight, a high-performance, platform-independent protocol for fast data transfer and querying in big data systems, demonstrating its scalability, throughput, and efficiency improvements over traditional protocols.
Contribution
It provides comprehensive benchmarking results of Arrow Flight's data transfer and querying capabilities, highlighting its high throughput and scalability advantages.
Findings
Achieves up to 6000 MB/s throughput for DoGet() operations.
Utilizes up to 95% of available network bandwidth on Mellanox interconnects.
Outperforms turbodbc and ODBC connections by 20x and 30x, respectively, in query systems.
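The throughput and bandwidth-utilization figures above are simple ratios. The sketch below shows one way such figures could be computed. It is a minimal illustration, not the paper's measurement harness: the function names, the 1 MB = 10^6 bytes convention, and the example transfer size, duration, and link speed are all assumptions chosen for round numbers.

```python
def throughput_mb_per_s(num_bytes: int, seconds: float) -> float:
    """Payload throughput in MB/s, assuming 1 MB = 10**6 bytes."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_bytes / seconds / 1e6


def link_utilization(throughput_mb_s: float, link_gbit_per_s: float) -> float:
    """Fraction of a link's nominal bandwidth consumed by the payload."""
    if link_gbit_per_s <= 0:
        raise ValueError("link_gbit_per_s must be positive")
    # Convert the nominal link speed from Gbit/s to MB/s (8 bits per byte).
    link_mb_per_s = link_gbit_per_s * 1e9 / 8 / 1e6
    return throughput_mb_s / link_mb_per_s


# Hypothetical run: 3 GB transferred in 1.5 s over a 40 Gbit/s link.
example_mb_s = throughput_mb_per_s(3_000_000_000, 1.5)
example_util = link_utilization(example_mb_s, 40.0)
print(f"{example_mb_s:.0f} MB/s, {example_util:.0%} of nominal link bandwidth")
```

Note that utilization computed this way is relative to the nominal link rate; a benchmark may instead compare against the bandwidth measured by a raw network test, which would give a higher percentage for the same throughput.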
Abstract
Moving structured data between different big data frameworks and/or data warehouses/storage systems often causes significant overhead. Most of the time, more than 80% of the total time spent accessing data goes to the serialization/de-serialization step. Columnar data formats are gaining popularity in both analytics and transactional databases. Apache Arrow, a unified columnar in-memory data format, promises to provide efficient data storage, access, manipulation and transport. In addition, with the introduction of the Arrow Flight communication capabilities, which are built on top of gRPC, Arrow enables high-performance data transfer over TCP networks. Arrow Flight allows parallel Arrow RecordBatch transfer over networks in a platform- and language-independent way, and offers high performance, parallelism and security based on open-source standards. In this paper, we bring together…
Taxonomy
Topics: Distributed systems and fault tolerance · Cloud Computing and Resource Management · Advanced Database Systems and Queries
