Detecting secrets in source code: using machine learning to reduce false positives

Title	Detecting secrets in source code: using machine learning to reduce false positives
Publication Type	thesis
School or College	College of Engineering
Department	Computing
Author	Saha, Aakanksha
Date	2019
Description	Private and public Git repositories often contain sensitive information in their source code, including Application Programming Interface (API) keys, asymmetric private keys, passwords, and connection strings. This information can be found by scanning source code and configuration files with pattern matching. However, scanning repositories often produces a large number of false positives, that is, entries wrongly identified as secrets. Our research aims to reduce the number of false positives by combining a regular-expression-based approach with machine learning. The regular expressions identify different types of potential secrets, such as RSA private keys, API keys, client secrets, and generic passwords hard-coded in the source code, whereas machine learning distinguishes real secrets from false positives. Combining the two techniques allows secrets to be identified and false positives to be subsequently reduced. We apply a trained voting classifier, an ensemble of major supervised machine learning algorithms, to a dataset of approximately two thousand test examples. With this combination, we achieve a precision of 84% and a recall of 89%. We also evaluate our machine learning model using a precision-recall curve, which an operator can use to find the optimal trade-off between precision and recall. Our results provide strong motivation for using machine learning to increase the efficiency and efficacy of a secret detection tool.
Type	Text
Publisher	University of Utah
Dissertation Name	Master of Science
Language	eng
Rights Management	(c) Aakanksha Saha
Format	application/pdf
Format Medium	application/pdf
ARK	ark:/87278/s6gb859f
Setname	ir_etd
ID	1714238
Reference URL	https://collections.lib.utah.edu/ark:/87278/s6gb859f
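
The two-stage idea in the description (a regular-expression scan for candidates, then a vote that filters out false positives) can be illustrated with a minimal sketch. Everything in it is an assumption for illustration only: the pattern, the placeholder word list, the thresholds, and the three hand-written voters are hypothetical stand-ins for the trained supervised classifiers in the thesis, whose actual features and models are not given in this record. The `precision_recall` helper shows how the two reported metrics are computed from predictions and labels.

```python
import math
import re
from collections import Counter

# Hypothetical pattern: a quoted string assigned to a credential-like name.
# The thesis's actual regular expressions are not listed in this record.
CANDIDATE_RE = re.compile(
    r"""(?i)\b(\w*(?:password|passwd|secret|api_?key|token)\w*)\s*[:=]\s*["']([^"']+)["']"""
)

# Hypothetical markers of dummy values (an assumption, not from the thesis).
PLACEHOLDER_WORDS = ("example", "changeme", "dummy", "test", "xxx", "your_")


def shannon_entropy(value):
    """Bits per character of the string's empirical character distribution."""
    counts = Counter(value)
    total = len(value)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def find_candidates(source):
    """Stage 1: regex scan returning (name, value) pairs of potential secrets."""
    return [(m.group(1), m.group(2)) for m in CANDIDATE_RE.finditer(source)]


# Stage 2 stand-ins: each voter returns True if the value looks like a real secret.
def vote_entropy(value):
    return shannon_entropy(value) >= 3.0  # assumed threshold


def vote_length(value):
    return len(value) >= 12  # assumed threshold


def vote_not_placeholder(value):
    lowered = value.lower()
    return not any(word in lowered for word in PLACEHOLDER_WORDS)


VOTERS = (vote_entropy, vote_length, vote_not_placeholder)


def is_secret(value):
    """Hard (majority) voting over the stand-in voters."""
    votes = sum(voter(value) for voter in VOTERS)
    return votes * 2 > len(VOTERS)


def precision_recall(predicted, actual):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


SAMPLE = '''
db_password = "changeme"
API_KEY = "9f8a7Bq2LxT0vZk4Rw1sYp6d"
token = "test"
'''

if __name__ == "__main__":
    for name, value in find_candidates(SAMPLE):
        print(name, is_secret(value))
```

In this sketch the regex stage deliberately over-matches, flagging all three assignments in `SAMPLE`, and the vote then keeps only the high-entropy key. A real system along the lines the description outlines would replace the hand-written voters with classifiers trained on labeled examples.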