Aggregation Consistency Errors in Semantic Layers and How to Avoid Them
Zezhou Huang, Pavan Kalyan Damalapati, Eugene Wu

TL;DR
This paper addresses aggregation consistency errors caused by joins in semantic layers and proposes weighing, combined with a human-in-the-loop framework, to improve the accuracy and interpretability of metrics.
Contribution
The paper introduces weighing as a primitive that keeps aggregates consistent across joins in semantic layers, and presents a human-in-the-loop framework for exploring weighing strategies.
Findings
Weighing effectively prevents double counting in join fanouts.
The human-in-the-loop approach allows iterative refinement of weighing strategies.
The method improves accuracy and interpretability of aggregated metrics.
Abstract
Analysts often struggle with analyzing data from multiple tables in a database due to their lack of knowledge on how to join and aggregate the data. To address this, data engineers pre-specify "semantic layers" which include the join conditions and "metrics" of interest with aggregation functions and expressions. However, joins can cause "aggregation consistency issues". For example, analysts may observe inflated total revenue caused by double counting from join fanouts. Existing BI tools rely on heuristics for deduplication, resulting in imprecise and challenging-to-understand outcomes. To overcome these challenges, we propose "weighing" as a core primitive to counteract join fanouts. "Weighing" has been used in various areas, such as market attribution and order management, ensuring metrics consistency (e.g., total revenue remains the same) even for many-to-many joins. The idea is to…
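To make the double-counting problem concrete, here is a minimal sketch of a join fanout and of one possible weighing strategy. It assumes each order's revenue is split equally across the rows it joins to. The tables, the function names, and the equal-split rule are hypothetical illustrations chosen for this example, not the paper's implementation.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: one revenue value per order.
orders = {"o1": 100.0, "o2": 50.0}

# Hypothetical many-side table: marketing touches per order.
# Order o1 matches two rows, so joining on order id fans it out.
touches = [("o1", "email"), ("o1", "search"), ("o2", "email")]


def join(orders, touches):
    """Inner join on order id; returns (order_id, channel, revenue) rows."""
    return [(oid, ch, orders[oid]) for oid, ch in touches if oid in orders]


def naive_total(rows):
    """Sum revenue over joined rows; fanned-out orders are counted repeatedly."""
    return sum(rev for _, _, rev in rows)


def weighted_total(rows):
    """Sum revenue with each row weighted by 1 / (rows sharing its order id).

    Assumption: an equal split per order. Other splits are possible.
    """
    fanout = Counter(oid for oid, _, _ in rows)
    return sum(rev / fanout[oid] for oid, _, rev in rows)


def weighted_by_channel(rows):
    """Group the weighted revenue by channel; the groups add up to the true total."""
    fanout = Counter(oid for oid, _, _ in rows)
    totals = defaultdict(float)
    for oid, ch, rev in rows:
        totals[ch] += rev / fanout[oid]
    return dict(totals)


rows = join(orders, touches)
print(naive_total(rows))          # inflated by the fanout of o1
print(weighted_total(rows))       # matches sum(orders.values())
print(weighted_by_channel(rows))  # per-channel breakdown of the same total
```

Because the weights within each order sum to 1, the grand total is preserved however the joined rows are later grouped. Which split is appropriate (equal, proportional, first-touch, and so on) is a modeling choice, which is presumably where a human-in-the-loop exploration of strategies becomes useful.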
