Deep-Bench: Deep Learning Benchmark Dataset for Code Generation

Alireza Daghighfarsoodeh; Chung-Yu Wang; Hamed Taherkhani; Melika Sepidband; Mohammad Abdollahi; Hadi Hemmati; Hung Viet Pham

arXiv:2502.18726 · cs.SE · February 27, 2025

Open Access

TL;DR

DeepBench is a comprehensive benchmark dataset for function-level deep learning code generation, revealing significant challenges and performance disparities among large language models across different DL tasks and phases.

Contribution

The paper introduces DeepBench, a new benchmark dataset covering full DL pipelines, and provides an analysis of LLM performance and issues in DL code generation.

Findings

01. GPT-4 achieved 31% accuracy on DeepBench

02. LLMs perform significantly worse on DeepBench compared to existing benchmarks

03. Performance varies substantially across DL phases and tasks

Abstract

Deep learning (DL) has revolutionized areas such as computer vision, natural language processing, and more. However, developing DL systems is challenging due to the complexity of DL workflows. Large Language Models (LLMs) such as GPT, Claude, Llama, and Mistral have emerged as promising tools to assist in DL code generation, offering potential solutions to these challenges. Despite this, existing benchmarks such as DS-1000 are limited, as they primarily focus on small DL code snippets related to pre/post-processing tasks and lack comprehensive coverage of the full DL pipeline, including different DL phases and input data types. To address this, we introduce DeepBench, a novel benchmark dataset designed for function-level DL code generation. DeepBench categorizes DL problems based on three key aspects: phases such as pre-processing, model construction, and training; tasks,…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Materials Science · Natural Language Processing Techniques

Methods: Linear Layer · Multi-Head Attention · Adam · Softmax · Dropout · Weight Decay · Cosine Annealing · Linear Warmup With Cosine Annealing · Dense Connections · Attention Dropout