On the Consistency and Variety of Automated C Source Code Generation for Training Software Defect Detectors

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In recent years, machine learning (ML) techniques have become popular for detecting software bugs. A common challenge in ML-based bug detection, however, is the imbalance between correct and incorrect training data: incorrect samples (those containing bugs) are scarce compared with the abundance of correct samples, and this imbalance degrades ML model performance. To address the issue, researchers have suggested artificially injecting bugs into correct samples. Beyond class balance, the diversity of the training data also significantly affects ML performance. Our work focuses on generating a variety of incorrect samples that stem from a single root cause (bug). Specifically, we plan to inject bugs into LLVM IR code and translate it into source code written in a high-level programming language. To obtain diversity, we use probabilistic language models in the translator. In this paper, we present an IR-to-C translator based on seq2seq, trained on real-world software code, and explore the consistency and diversity of the samples it generates.

Original language: English
Title of host publication: Conference Proceedings - 30th ISSAT International Conference on Reliability and Quality in Design, RQD 2025
Editors: Hoang Pham
Publisher: International Society of Science and Applied Technologies
Pages: 241-245
Number of pages: 5
ISBN (Electronic): 9798986576152
Publication status: Published - 2025
Event: 30th ISSAT International Conference on Reliability and Quality in Design, RQD 2025 - Honolulu, United States
Duration: 6 Aug 2025 → 8 Aug 2025

Publication series

Name: Conference Proceedings - 30th ISSAT International Conference on Reliability and Quality in Design, RQD 2025

Conference

Conference: 30th ISSAT International Conference on Reliability and Quality in Design, RQD 2025
Country/Territory: United States
City: Honolulu
Period: 6/08/25 → 8/08/25

Keywords

  • C language
  • LLVM IR
  • Machine learning
  • Sequence to sequence
  • Software defect detection
  • Software defect injection
