TY - GEN
T1 - On the Consistency and Variety of Automated C Source Code Generation for Training Software Defect Detectors
AU - Ohara, Mamoru
AU - Sugawa, Yuto
AU - Murakami, Chisato
N1 - Publisher Copyright: © 2025 RQD. All Rights Reserved.
PY - 2025
Y1 - 2025
AB - In recent years, machine learning (ML) techniques have become popular for detecting software bugs. However, a common challenge in ML-based bug detection is the imbalance between correct and incorrect training data. Specifically, incorrect samples (those containing bugs) are scarce compared with the abundance of correct samples, which degrades ML model performance. To address this issue, researchers have proposed artificially injecting bugs into correct samples. Beyond class balance, the diversity of the training data also significantly affects ML performance. Our work focuses on generating varied incorrect samples that stem from a single root cause (bug). Specifically, we plan to inject bugs into LLVM IR code and translate it into source code written in high-level programming languages. To obtain diversity, we use probabilistic language models in the translator. In this paper, we present an IR-to-C translator based on seq2seq, trained on real-world software code, and explore the consistency and diversity of the samples it generates.
KW - C language
KW - LLVM IR
KW - Machine learning
KW - Sequence to sequence
KW - Software defect detection
KW - Software defect injection
UR - https://www.scopus.com/pages/publications/105021079976
M3 - Conference contribution
AN - SCOPUS:105021079976
T3 - Conference Proceedings - 30th ISSAT International Conference on Reliability and Quality in Design, RQD 2025
SP - 241
EP - 245
BT - Conference Proceedings - 30th ISSAT International Conference on Reliability and Quality in Design, RQD 2025
A2 - Pham, Hoang
PB - International Society of Science and Applied Technologies
T2 - 30th ISSAT International Conference on Reliability and Quality in Design, RQD 2025
Y2 - 6 August 2025 through 8 August 2025
ER -