Introducing PinataHub: Explore the world of leaked secrets in GitHub.

IncognitaTech
Jan 17, 2022

In our first article, we will walk you through the just-released PinataHub platform, the largest archive of leaked credentials and secrets in public GitHub repositories.

Some background:

Leaving secrets in source code is a very common problem in the DevOps world, and one that can have disastrous consequences for any organization. Many have attempted to tackle this problem, but there is no clear answer on how effective the existing solutions are when confronted with the real-world source code landscape. This can be attributed to several factors, such as:

  • Not every hardcoded password is in a configuration file or part of a URL.
  • Not every API key matches a specific regex pattern.
  • Not every credential exceeds an entropy threshold.
  • An AI model cannot learn every function taking a password as a parameter.
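The first three limits above are easy to reproduce. The following is a minimal sketch and not any particular scanner's implementation: the threshold value, the function names, and the sample strings are made up for illustration, and the pattern is the key format quoted later in this article.

```python
import math
import re
from collections import Counter

# Key format quoted later in this article (hyphen escaped inside the class).
KEY_PATTERN = re.compile(r"AIza[0-9A-Za-z\-_]{35}")

# Hypothetical cutoff in bits per character, chosen only for this illustration.
ENTROPY_THRESHOLD = 4.0


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of text in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def naive_is_secret(candidate: str) -> bool:
    """Flag a string only if it matches the pattern or exceeds the threshold."""
    if KEY_PATTERN.fullmatch(candidate):
        return True
    return shannon_entropy(candidate) > ENTROPY_THRESHOLD


# Made-up hardcoded password: no recognizable format and low entropy.
weak_password = "Summer2021!"
# Made-up string that merely has the shape of the key format above.
formatted_key = "AIza" + "A" * 35

print(naive_is_secret(weak_password))  # False: the password slips through
print(naive_is_secret(formatted_key))  # True: caught by the pattern alone
```

A human-chosen password like the one above is exactly the kind of credential that neither check catches, even though it is no less dangerous when committed.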

Most secret scanning solutions are rather constrained. They are built on the assumption that source code complies with certain conventions and norms, and merely leverage popular services, libraries and frameworks. The reality is far wilder than that.

This is why we are developing GoldDigger, the most advanced credential and secret discovery solution with proven accuracy at scale (more on this later), which you can verify for yourself using PinataHub.

What is PinataHub?

PinataHub is a platform that allows you to explore a small fraction of the 4M+ passwords and secrets committed in public GitHub repositories, as detected by GoldDigger. The exposed repositories have an open-source license that allows us to publicly replicate their content, so PinataHub fully adheres to the rules set by GitHub and the developers themselves. You can explore PinataHub at: https://pinatahub.incognita.tech

After signing up and logging in, you can browse the leaks and apply several filters such as type and programming language. More precisely, the disclosed leaks are organized in two groups, namely passwords and secrets.

Depending on the leak's context, GoldDigger classifies the identified passwords into one of the following five categories:

  • Mail (i.e. likely for establishing connection to a mail server)
  • Database (this one is pretty self-explanatory)
  • Generic (could not be classified in a specific category)
  • Automation (credentials used in the context of browser automation)
  • Web Service (passwords used for APIs, proxies, panels, etc.)

Additionally, when possible, GoldDigger also retrieves the usernames associated with a hardcoded credential.

Exploring exposed passwords in PinataHub.

Secrets comprise API/access keys/secrets, tokens, application identifiers, etc., and strings that could be characterized as “sensitive”. Their type is determined via a set of regular expressions when possible, or if that is not possible, they are tagged as “Generic”.
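The "typed when a pattern matches, Generic otherwise" behaviour can be sketched as follows. This is only an illustration under stated assumptions, not GoldDigger's actual rule set: the type names, the `classify_secret` helper, and the choice of two publicly known key formats are ours.

```python
import re

# Two publicly known key formats, used here purely as examples.
# The real rule set behind PinataHub is not described in this article.
TYPE_PATTERNS = {
    "Google API Key": re.compile(r"AIza[0-9A-Za-z\-_]{35}"),
    "AWS Access Key ID": re.compile(r"AKIA[0-9A-Z]{16}"),
}


def classify_secret(value: str) -> str:
    """Return the first matching type name, or "Generic" if nothing matches."""
    for type_name, pattern in TYPE_PATTERNS.items():
        if pattern.fullmatch(value):
            return type_name
    return "Generic"


# Made-up values that only have the shape of each format.
print(classify_secret("AKIA" + "A" * 16))   # AWS Access Key ID
print(classify_secret("AIza" + "b" * 35))   # Google API Key
print(classify_secret("s3cr3t-value"))      # Generic
```

Note that the regular expressions here only label a secret that has already been found; they play no part in deciding whether a string is a secret in the first place.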

The prevalence of “Generic” secrets is a clear indicator that relying on pattern matching for secret detection (as GitHub Advanced Security and other major SecDevOps vendors do) is simply not good enough for capturing the majority of leaks.

Additionally, GoldDigger is capable of intelligent analysis of each leak and identification of various attributes, severity, and possible endpoints associated with detected secrets.

Moreover, hard-coded cryptographic objects (private/public keys, certificates, etc.) are also detected. Please note that currently, PinataHub only includes secret types that appear more than 100 times in the data sample.

API keys, JWT secrets, you name it…

PinataHub also allows you to filter the leaks dataset according to a specific programming language, demonstrating the language-agnostic secret detection capability of GoldDigger, which is not bound by syntax rules.

It is even possible to detect passwords and secrets in text files (no matter how convenient it may appear, storing login information in text files within a project directory is an extremely bad practice, yet it is rather common).

Making a note of your cPanel password in a .txt file named after your hosting provider seems like a pretty great idea. Right?

Moreover, PinataHub allows you to check whether you have leaked any credentials captured in our dataset. No registration is required for this feature: just select “Was I Careless ?” in the navigation bar and enter the username you would like to check.

Check if you (or someone you know) has been careless enough to leak valuable secrets in a public GitHub repo.

If you have made it this far into the article you probably have many questions, some of which we will attempt to answer in the rest of this post.

Q: How bad are things, really?

Short answer: Pretty bad. The vast amount of hardcoded credentials left in public repositories indicates that the secret scanning feature of GitHub Advanced Security (enabled by default for all public repositories) is not effective in capturing credentials beyond those issued by the token-issuing parties enrolled in GitHub’s Secret scanning partner program.

This critical limitation leaves massive troves of secrets that evaded detection exposed to the prying eyes of malicious actors.

Q: How were you able to collect these source files?

The files containing these findings are collected from public GitHub repositories using the focused crawling capabilities of Magellan, our upcoming cyber intelligence platform. Some details regarding our data collection pipeline will be covered in a future post. The resulting dataset of leaks will be regularly updated. Nonetheless, the PinataHub platform is solely for demonstration purposes, and only approximately 2.5% of the identified leaks are currently indexed and publicly available. If you have a legitimate reason for requesting access to the complete dataset, feel free to contact us!

Q: How is GoldDigger different from the existing secret scanning/detection tools?

If you have used any open-source tools for secret detection (many exist), or even a pricey SaaS solution, you will probably have figured out that they work well only when the secrets are leaked in obvious and expected ways (e.g. have a clear declaration, high complexity, etc.), or when they are issued by a service that provides credentials in an identifiable and distinguishable format (e.g. AIza[0-9A-Za-z-_]{35}). But in reality, source code can be arbitrarily complex.

GoldDigger outperforms every existing secret detection solution, leveraging an intelligent context analysis scheme which uses an extensive set of novel lexical and statistical features to tackle the challenge of capturing credential leaks in the wild. These features were inspired by our research in the field of Algorithmically Generated Domain (AGD) detection. Although detecting secrets in source code and detecting malicious AGDs employed by threat actors might appear to be totally different problems, they have much common ground: in both cases, the effectiveness of a detector is determined by its capability to capture patternless outliers in relatively short strings.

In terms of detecting leaked secrets in source code, we based our solution on the following two assumptions:

  1. The majority of public GitHub repositories do not include any leaked secrets (hopefully).
  2. Meaningful credentials are relatively unique compared to other tokens.

To this end, we were able to create a baseline model of tokens that commonly occur close to credential-related keywords, and use it to identify sensitive strings that are highly likely to be secrets. This way, the secret detection scheme implemented in GoldDigger is capable of pinpointing actual secrets with great accuracy, vastly reducing the number of false positives commonly returned by regex- or entropy-based approaches.
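To make the idea of a baseline more concrete, here is a deliberately simplified sketch. It is not how GoldDigger is implemented (those details are left for future posts); the keyword list, the assignment pattern, the function names, and the tiny corpus are all assumptions made for illustration. It counts the values that appear next to credential-related keywords in a reference corpus, then flags values in new code that the baseline has never seen.

```python
import re
from collections import Counter

# Hypothetical keyword list and assignment shape; real code is far more varied.
ASSIGNMENT = re.compile(
    r"(?:password|passwd|pwd|secret|token|api_key)\s*[:=]\s*[\"']([^\"']+)[\"']",
    re.IGNORECASE,
)


def values_near_keywords(lines):
    """Yield quoted values assigned right after a credential-related keyword."""
    for line in lines:
        for match in ASSIGNMENT.finditer(line):
            yield match.group(1)


def build_baseline(corpus_lines):
    """Count how often each value occurs next to a keyword in the corpus."""
    return Counter(values_near_keywords(corpus_lines))


def likely_secrets(lines, baseline, max_count=0):
    """Return values seen at most max_count times in the baseline."""
    return [v for v in values_near_keywords(lines) if baseline[v] <= max_count]


# Toy reference corpus full of placeholders (assumption 1: mostly no real leaks).
corpus = [
    'password = "changeme"',
    'db_password = "changeme"',
    "PASSWORD: 'password'",
    'token = "YOUR_TOKEN_HERE"',
]
baseline = build_baseline(corpus)

# New code to scan: one placeholder and one made-up, unique-looking value.
sample = [
    'password = "changeme"',
    'smtp_password = "Tr0ub4dor&3x"',
]
print(likely_secrets(sample, baseline))  # ['Tr0ub4dor&3x']
```

The placeholder is ignored because it is common in the baseline, while the unique value stands out, which mirrors the second assumption above. A production system would need far richer context and features than an exact-match counter.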

More technical details regarding how GoldDigger works and comparisons with existing tools will be presented in future posts, so stay tuned!

Follow us on Twitter for more updates and news.
