Semantic Risk Scoring of Aggregated Metrics: An AI-Driven Approach for Healthcare Data Governance
Mohammed Omer Shakeel Ahmed

TL;DR
This paper introduces an AI-driven framework that scores the privacy risk of aggregated healthcare metrics by analyzing the SQL queries that define them, with the aim of preventing overexposure, supporting compliance, and enabling secure data sharing.
Contribution
It presents a novel static, explainable risk scoring system for SQL-based healthcare metrics using semantic and syntactic analysis with pretrained embeddings and machine learning.
Findings
High accuracy in risk detection (>85%)
Effective flagging of sensitive query patterns
Supports privacy-preserving healthcare data governance
Abstract
Large healthcare institutions typically operate multiple business intelligence (BI) teams segmented by domain, including clinical performance, fundraising, operations, and compliance. Due to HIPAA, FERPA, and IRB restrictions, these teams face challenges in sharing the patient-level data needed for analytics. To mitigate this, a metric aggregation table is proposed: a precomputed, privacy-compliant summary. Such tables enable decision-making without direct access to sensitive data. However, even aggregated metrics can inadvertently create privacy risks if constructed without rigorous safeguards. A modular AI framework is proposed that evaluates SQL-based metric definitions for potential overexposure using both semantic and syntactic features. Specifically, the system parses SQL queries into abstract syntax trees (ASTs), extracts sensitive patterns (e.g., fine-grained GROUP…
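To make the idea of syntactic feature extraction concrete, the following is a minimal sketch, not the paper's implementation. It uses regular expressions from the Python standard library as a rough stand-in for AST parsing, and a hand-weighted score in place of the embeddings and machine learning model the paper describes. The names `SENSITIVE_HINTS`, `extract_features`, and `toy_risk_score`, the example table and column names, and all weights are assumptions introduced for illustration only.

```python
import re

# Hypothetical name fragments treated as sensitive; an assumption for this sketch.
SENSITIVE_HINTS = ("zip", "birth", "dob", "ssn", "mrn", "diagnosis")

# Capture the GROUP BY column list up to HAVING / ORDER BY / LIMIT / ';' / end of text.
_GROUP_BY = re.compile(
    r"\bgroup\s+by\s+(.*?)(?=\bhaving\b|\border\s+by\b|\blimit\b|;|$)",
    re.IGNORECASE,
)
# Detect a minimum cell-size guard such as HAVING COUNT(*) >= 20.
_MIN_COUNT = re.compile(r"\bhaving\s+count\s*\(.*?\)\s*>=?\s*\d+", re.IGNORECASE)


def extract_features(sql: str) -> dict:
    """Extract a few syntactic features from a metric-defining SQL query."""
    text = " ".join(sql.split())  # collapse newlines and repeated whitespace
    match = _GROUP_BY.search(text)
    columns = []
    if match:
        # Naive comma split: function calls containing commas would be mis-split.
        columns = [c.strip().lower() for c in match.group(1).split(",") if c.strip()]
    sensitive = [c for c in columns if any(h in c for h in SENSITIVE_HINTS)]
    return {
        "group_by_columns": columns,
        "sensitive_group_by": sensitive,
        "has_min_count_guard": bool(_MIN_COUNT.search(text)),
    }


def toy_risk_score(sql: str) -> float:
    """Combine the features with illustrative, hand-picked weights; capped at 1.0."""
    f = extract_features(sql)
    score = 0.15 * len(f["group_by_columns"]) + 0.3 * len(f["sensitive_group_by"])
    if not f["has_min_count_guard"]:
        score += 0.2  # no small-cell suppression found
    return min(score, 1.0)


risky = "SELECT zip_code, birth_date, COUNT(*) FROM visits GROUP BY zip_code, birth_date"
safe = "SELECT department, COUNT(*) FROM visits GROUP BY department HAVING COUNT(*) >= 20"
print(extract_features(risky))
print(toy_risk_score(risky), toy_risk_score(safe))
```

In this sketch, the query grouped by two quasi-identifiers with no minimum-count guard scores higher than the coarse, guarded query. A production system would more plausibly walk a real SQL AST and learn its weights from labeled queries, as the abstract indicates.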
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Data Quality and Management · Access Control and Trust
