C Source Code Generation from IR towards Making Bug Samples to Fluctuate for Machine Learning

Yuto Sugawa, Chisato Murakami, Mamoru Ohara

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In recent years, machine learning (ML) techniques have become popular for detecting software bugs. However, a common challenge in ML-based bug detection is the imbalance between correct and incorrect training data. Specifically, incorrect samples (those containing bugs) are scarce compared with the abundance of correct ones, which degrades ML model performance. To address this issue, researchers have proposed artificially injecting bugs into correct samples. Beyond class balance, the diversity of training data also significantly affects ML performance. Our work focuses on generating varied incorrect samples that stem from a single root cause (bug). Specifically, we plan to inject bugs into LLVM IR code and translate it into source code written in high-level programming languages. For diversity, we use probabilistic language models in the translator. In this paper, we present an IR-to-C translator using seq2seq and explore the diversity of the generated samples.

Original language: English
Title of host publication: Proceedings - 2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops, ISSREW 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 321-328
Number of pages: 8
ISBN (Electronic): 9798350367041
Publication status: Published - 2024
Event: 35th IEEE International Symposium on Software Reliability Engineering Workshops, ISSREW 2024 - Tsukuba, Japan
Duration: 28 Oct 2024 to 31 Oct 2024

Publication series

Name: Proceedings - 2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops, ISSREW 2024

Conference

Conference: 35th IEEE International Symposium on Software Reliability Engineering Workshops, ISSREW 2024
Country/Territory: Japan
City: Tsukuba
Period: 28/10/24 to 31/10/24

Keywords

  • bug injection
  • diversity of training samples
  • LLVM IR-to-C translation
  • seq2seq
