TY - GEN
T1 - C Source Code Generation from IR towards Making Bug Samples to Fluctuate for Machine Learning
AU - Sugawa, Yuto
AU - Murakami, Chisato
AU - Ohara, Mamoru
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - In recent years, machine learning (ML) techniques have become popular for detecting software bugs. However, a common challenge in ML-based bug detection is the imbalance between correct and incorrect training data: incorrect samples (those containing bugs) are scarce compared with the abundance of correct ones, which degrades ML model performance. To address this issue, researchers have suggested artificially injecting bugs into correct samples. Besides class balance, the diversity of training data significantly affects ML performance. Our work focuses on generating varied incorrect samples that stem from a single root cause (bug). Specifically, we plan to inject bugs into LLVM IR code and translate it into source code written in high-level programming languages. For diversity, we use probabilistic language models in the translator. In this paper, we present an IR-to-C translator using seq2seq and explore the diversity of the generated samples.
KW - bug injection
KW - diversity of training samples
KW - LLVM IR-to-C translation
KW - seq2seq
UR - http://www.scopus.com/inward/record.url?scp=85215305818&partnerID=8YFLogxK
U2 - 10.1109/ISSREW63542.2024.00098
DO - 10.1109/ISSREW63542.2024.00098
M3 - Conference contribution
AN - SCOPUS:85215305818
T3 - Proceedings - 2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops, ISSREW 2024
SP - 321
EP - 328
BT - Proceedings - 2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops, ISSREW 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 35th IEEE International Symposium on Software Reliability Engineering Workshops, ISSREW 2024
Y2 - 28 October 2024 through 31 October 2024
ER -