Description |
Private and public Git repositories often contain sensitive information in their source code, including Application Programming Interface (API) keys, asymmetric private keys, passwords, and connection strings. This information can be found by scanning source code and configuration files with pattern matching. However, such scans often produce a large number of false positives, that is, entries wrongly identified as secrets. Our research aims to reduce the number of false positives by combining a regular-expression-based approach with machine learning. The regular expressions identify different types of potential secrets hard-coded in the source code, such as RSA private keys (private keys based on the RSA algorithm), API keys, client secrets, and generic passwords, whereas machine learning distinguishes real secrets from false positives. Together, the two techniques identify candidate secrets and then filter out false positives. We apply a trained voting classifier, an ensemble of major supervised machine learning algorithms, to a dataset of approximately two thousand test examples. With this combination, we achieve a precision of 84% and a recall of 89%. We also evaluate our machine learning model using a precision-recall curve, which an operator can use to find the optimal trade-off between precision and recall. Our results provide strong motivation for using machine learning to increase the efficiency and efficacy of secret detection tools. |
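
The two-stage idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the patterns in `CANDIDATE_PATTERNS`, the function names, and the use of Shannon entropy as a feature are all assumptions made for illustration, and the trained voting classifier itself is omitted. The sketch shows only regex-based candidate extraction, one plausible feature a classifier might consume, and how precision and recall are computed from labelled predictions.

```python
import math
import re
from collections import Counter

# Hypothetical patterns for illustration; the source does not list its regexes.
CANDIDATE_PATTERNS = {
    "rsa_private_key": re.compile(r"-----BEGIN RSA PRIVATE KEY-----"),
    "generic_password": re.compile(
        r"(?i)\b(?:password|passwd|pwd|secret|api_key)\b\s*[:=]\s*[\"']([^\"']+)[\"']"
    ),
}


def find_candidates(text):
    """Stage 1: return (kind, matched_value) pairs found by pattern matching."""
    found = []
    for kind, pattern in CANDIDATE_PATTERNS.items():
        for match in pattern.finditer(text):
            # Use the captured value when the pattern has a group,
            # otherwise the whole match.
            value = match.group(1) if pattern.groups else match.group(0)
            found.append((kind, value))
    return found


def shannon_entropy(value):
    """Bits per character; an assumed feature, not one named in the source."""
    if not value:
        return 0.0
    counts = Counter(value)
    total = len(value)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels, where 1 means a real secret."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

In a fuller pipeline, features such as the entropy value would be passed to the classifier for each candidate, and sweeping the classifier's decision threshold while recomputing `precision_recall` would trace out the precision-recall curve an operator uses to pick an operating point.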