Forge-UGC: FX optimization and register-graph engine for universal graph compiler
Satyam Kumar, Saurabh Jha

TL;DR
Forge-UGC is a hardware-agnostic compiler for transformer models that improves compilation speed, reduces runtime latency and energy, and supports modern transformer components on heterogeneous accelerators.
Contribution
It introduces a transparent four-phase compilation pipeline with novel optimization passes, reducing compilation overhead and improving performance for transformer deployment on NPUs.
Findings
Achieves 6.9x to 9.2x faster compilation than existing frameworks.
Reduces inference latency by 18.2% to 35.7%.
Lowers energy consumption per inference by 30.2% to 40.9%.
Abstract
We present Forge-UGC (FX Optimization and Register-Graph Engine for Universal Graph Compilation), a four-phase compiler for transformer deployment on heterogeneous accelerator hardware, validated on Intel AI Boost NPU. Existing frameworks such as OpenVINO and ONNX Runtime often rely on opaque compilation pipelines, offer limited pass-level visibility, and manage buffers weakly, which can lead to higher compilation cost and runtime overhead. Forge-UGC addresses these problems with a hardware-agnostic design that separates graph capture, optimization, intermediate representation lowering, and backend scheduling. Phase 1 captures graphs with torch.export at the ATen operator level, supporting modern transformer components such as rotary position embeddings, grouped-query attention, and SwiGLU without manual decomposition. Phase 2 applies six optimization passes: dead code elimination, common subexpression…
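The first two Phase 2 passes named in the abstract, dead code elimination and common subexpression elimination, can be illustrated on a toy graph. The sketch below is not Forge-UGC's implementation: the `Node` type, the operator labels, and both function names are hypothetical, and the graph is a plain Python list rather than a captured ATen graph. It assumes every operator is pure (no side effects, no randomness) and that nodes are listed in topological order.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Node:
    """One operation in a toy graph (hypothetical; not an ATen or FX node)."""
    name: str               # unique name of the value this node produces
    op: str                 # operator label such as "add" (illustrative only)
    args: Tuple[str, ...]   # names of graph inputs or earlier nodes


def dead_code_elimination(nodes: List[Node], outputs: List[str]) -> List[Node]:
    """Keep only nodes that the graph outputs depend on, directly or transitively."""
    live = set(outputs)
    kept: List[Node] = []
    # Walk backwards so every consumer is visited before its producers.
    for node in reversed(nodes):
        if node.name in live:
            live.update(node.args)
            kept.append(node)
    kept.reverse()
    return kept


def common_subexpression_elimination(
    nodes: List[Node],
) -> Tuple[List[Node], Dict[str, str]]:
    """Merge nodes that recompute the same (op, args) pair.

    Returns the reduced node list and a map from removed names to the
    surviving name, so callers can also rewrite the graph outputs.
    Assumes all operators are pure.
    """
    seen: Dict[Tuple[str, Tuple[str, ...]], str] = {}
    rename: Dict[str, str] = {}
    kept: List[Node] = []
    for node in nodes:
        # Rewrite arguments first so chains of duplicates collapse too.
        args = tuple(rename.get(a, a) for a in node.args)
        key = (node.op, args)
        if key in seen:
            rename[node.name] = seen[key]
        else:
            seen[key] = node.name
            kept.append(Node(node.name, node.op, args))
    return kept, rename


# Toy graph with inputs "x" and "y": "b" duplicates "a", and "d" is never used.
graph = [
    Node("a", "add", ("x", "y")),
    Node("b", "add", ("x", "y")),
    Node("c", "mul", ("a", "b")),
    Node("d", "relu", ("x",)),
]
outputs = ["c"]

after_dce = dead_code_elimination(graph, outputs)
optimized, rename = common_subexpression_elimination(after_dce)
outputs = [rename.get(o, o) for o in outputs]
```

In this toy graph, dead code elimination drops `d` because no output depends on it, and common subexpression elimination then folds `b` into `a`, leaving `a` and `c` with `c` reading `a` twice. The passes run here in the order the abstract lists them; a real pipeline may order or iterate them differently.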
