psc2code: Denoising Code Extraction from Programming Screencasts
Lingfeng Bao, Zhenchang Xing, Xin Xia, David Lo, Minghui Wu, Xiaohu Yang

TL;DR
This paper introduces psc2code, a system that denoises the extraction of source code from programming screencasts by combining CNN-based frame classification, image segmentation, OCR, and language-model-based error correction. The extracted code enables effective search and interaction tools.
Contribution
The paper presents a novel pipeline that combines CNN-based frame classification, image segmentation, OCR, and language models to extract clean source code from screencasts, a problem that had not previously been addressed.
Findings
CNN achieves 0.95 F1-score in frame classification
Search engine precision@5 is 0.93
OCR error correction improves code accuracy
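For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how an F1-score and precision@k are conventionally computed. The function names and the example counts are illustrative assumptions and are not taken from the paper's evaluation data.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_at_k(ranked_relevance: list, k: int = 5) -> float:
    """Fraction of the top-k ranked results that are relevant.

    ranked_relevance holds one boolean per search result, in rank order.
    """
    top_k = ranked_relevance[:k]
    return sum(bool(r) for r in top_k) / k


# Hypothetical counts, chosen only to illustrate the arithmetic.
example_f1 = f1_score(tp=95, fp=5, fn=5)
example_p5 = precision_at_k([True, True, False, True, True], k=5)
```

Under this convention, a precision@5 of 0.93 averaged over many queries means that, on average, between four and five of the top five results for a query were judged relevant.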
Abstract
In this paper, we propose an approach named psc2code to denoise the process of extracting source code from programming screencasts. First, psc2code leverages Convolutional Neural Network (CNN) based image classification to remove non-code and noisy-code frames. Then, psc2code performs edge detection and clustering-based image segmentation to detect sub-windows in a code frame, and based on the detected sub-windows, it identifies and crops the screen region that is most likely to be a code editor. Finally, psc2code calls the API of a professional OCR tool to extract source code from the cropped code regions, and it leverages the OCRed cross-frame information in the programming screencast and a statistical language model trained on a large corpus of source code to correct errors in the OCRed source code. We conduct an experiment on 1,142 programming screencasts from YouTube. We find that our…
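The cross-frame correction idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it assumes that each frame has already been OCRed into a list of tokens, it uses a simple string-similarity threshold (`min_similarity`, a hypothetical parameter) to find near-duplicates, and it omits the statistical language model entirely. A token is replaced by the most frequent similar token seen anywhere in the screencast.

```python
from collections import Counter
from difflib import SequenceMatcher


def correct_tokens(frames, min_similarity=0.8):
    """Replace each OCRed token with its most frequent near-duplicate.

    frames: list of frames, each a list of token strings from OCR.
    min_similarity: assumed threshold on difflib's similarity ratio.
    """
    # Count how often every token appears across all frames.
    counts = Counter(tok for frame in frames for tok in frame)

    def best(tok):
        # Candidates always include tok itself (its ratio is 1.0).
        candidates = [
            c for c in counts
            if SequenceMatcher(None, tok, c).ratio() >= min_similarity
        ]
        # Prefer the most frequent candidate; on a tie, keep the original.
        return max(candidates, key=lambda c: (counts[c], c == tok))

    return [[best(tok) for tok in frame] for frame in frames]


# Hypothetical OCR output: the third frame contains two OCR errors.
frames = [
    ["public", "System.out.println"],
    ["public", "System.out.println"],
    ["pubIic", "Systern.out.println"],
]
corrected = correct_tokens(frames)
```

The design choice here is majority voting: the same line of code usually stays on screen for many frames, so a token that OCR misreads in one frame is likely to be read correctly in others.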
