https://github.com/betterleaks/betterleaks
# Betterleaks

Betterleaks is a secrets-scanning tool built on the legacy of Gitleaks, maintained by its original authors and sponsored by Aikido Security. It detects hardcoded credentials, API keys, and other sensitive values in git repositories, local filesystems, and piped input. The tool is designed as a drop-in evolution of Gitleaks, retaining full backwards compatibility with `.gitleaks.toml` config and `.gitleaksignore` files while introducing a significantly more powerful detection and filtering system.

The core of Betterleaks is its CEL (Common Expression Language) based configuration. Instead of static allowlists, every rule can carry `prefilter` and `filter` CEL expressions that evaluate metadata (file path, git author, commit message) and finding data (secret, match, entropy) to eliminate false positives dynamically. Secrets can also be actively validated against live APIs via `validate` CEL expressions that fire asynchronous HTTP requests, enabling real-time confirmation of whether a detected credential is still active. The scanner achieves high throughput through an Aho-Corasick keyword pre-filter trie, RE2 regex matching, BPE token-efficiency filtering, and parallelized git history scanning.

---

## CLI Commands

### `betterleaks git` — Scan a Git repository's full commit history

Traverses all commits via `git log` and scans each patch for secrets. Supports parallel workers, pre-commit mode, and staged-only scanning.

```bash
# Scan the current repo's full history with verbose output
betterleaks git . -v

# Scan a local clone of a repo with high parallelism and emit a JSON report
betterleaks git /path/to/repo --git-workers=16 --report-path=findings.json --report-format=json

# Pre-commit hook: scan only staged changes, redact secrets in output
betterleaks git --pre-commit --staged --redact -v

# Scope to specific commits and generate SARIF output for CI
betterleaks git . --log-opts="--since=2024-01-01" -f sarif -r results.sarif

# Use a custom config, suppress the banner, and exit with code 0 even when leaks are found
betterleaks git . -c /etc/betterleaks.toml --no-banner --exit-code=0

# Validate detected secrets live (only report valid ones)
betterleaks git . --validation --validation-status=valid --validation-workers=20 -v
```

---

### `betterleaks dir` — Scan files and directories (no git)

Scans plain files and directories without any git involvement. Accepts multiple paths; nested paths are deduplicated automatically.

```bash
# Scan a single directory
betterleaks dir /path/to/project -v

# Scan multiple independent paths
betterleaks dir /app/config /app/secrets -v

# Scan a specific file, output CSV
betterleaks dir /deploy/.env -f csv -r secrets.csv

# Skip large files, follow symlinks
betterleaks dir /srv --max-target-megabytes=5 --follow-symlinks -v

# Scan inside nested archives (e.g., zip inside tar)
betterleaks dir /backups --max-archive-depth=3 -v

# Redact 50% of each secret in log output
betterleaks dir . --redact=50 -v
```
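
The partial redaction performed by `--redact=50` can be pictured as masking the tail of each secret. The following is a minimal sketch under stated assumptions, not the tool's actual masking code: `redact` is a hypothetical helper, and keeping the leading `100 - percent` percent of the characters (rounded down) is an assumption. Only the `0`, `1-99`, and `100` behaviors come from the flag reference.

```go
package main

import "fmt"

// redact is a hypothetical helper illustrating percentage-based masking.
// Assumption: the leading (100 - percent)% of the secret stays visible
// and the remainder is replaced by "...".
func redact(secret string, percent uint) string {
    if percent == 0 {
        return secret // 0 means no redaction
    }
    if percent >= 100 {
        return "REDACTED" // 100 replaces the whole secret
    }
    runes := []rune(secret)
    keep := len(runes) * int(100-percent) / 100 // integer division rounds down
    return string(runes[:keep]) + "..."
}

func main() {
    fmt.Println(redact("ghp_abcdefghijkl", 50))  // keeps the first 8 of 16 characters
    fmt.Println(redact("ghp_abcdefghijkl", 100)) // fully redacted
}
```

Masking from the end keeps the recognizable prefix (such as `ghp_`) visible, which helps triage without exposing the full credential.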

---

### `betterleaks stdin` — Scan piped input

Reads content from standard input and scans it as a single file fragment. Useful for scanning command output, build artifacts, or log streams.

```bash
# Scan a file piped through stdin
cat .env.production | betterleaks stdin -v

# Scan the output of a command
env | betterleaks stdin -v

# Scan a file and write JSON results to stdout
cat app.log | betterleaks stdin -f json -r -

# Use a specific config and disable color
cat config.yaml | betterleaks stdin -c rules.toml --no-color -v
```

---

## Configuration File (`betterleaks.toml`)

### Config resolution order

Betterleaks resolves configuration from the following sources, in descending order of precedence (the first source found is used):

```
1. --config / -c flag
2. BETTERLEAKS_CONFIG or GITLEAKS_CONFIG environment variable (file path)
3. BETTERLEAKS_CONFIG_TOML or GITLEAKS_CONFIG_TOML environment variable (inline TOML content)
4. .betterleaks.toml or .gitleaks.toml in the target directory
5. Built-in default config (embedded in the binary)
```
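
The precedence list above amounts to a first-match lookup. The sketch below shows that logic in isolation. It is not the tool's loader: `resolveConfig`, `configSource`, and the `fileExists` callback are hypothetical names, and the tie-breaks within a single step (checking `BETTERLEAKS_*` before `GITLEAKS_*`, and `.betterleaks.toml` before `.gitleaks.toml`) are assumptions that the list does not specify.

```go
package main

import "fmt"

// configSource describes where the configuration was found.
type configSource struct {
    Kind  string // "flag", "env-path", "env-toml", "target-file", or "default"
    Value string // file path or inline TOML; empty for the built-in default
}

// resolveConfig walks the documented precedence list and returns the first hit.
// Assumption: within each step the Betterleaks name is tried before the Gitleaks name.
func resolveConfig(flagPath string, env map[string]string, fileExists func(name string) bool) configSource {
    if flagPath != "" {
        return configSource{Kind: "flag", Value: flagPath}
    }
    for _, key := range []string{"BETTERLEAKS_CONFIG", "GITLEAKS_CONFIG"} {
        if v := env[key]; v != "" {
            return configSource{Kind: "env-path", Value: v}
        }
    }
    for _, key := range []string{"BETTERLEAKS_CONFIG_TOML", "GITLEAKS_CONFIG_TOML"} {
        if v := env[key]; v != "" {
            return configSource{Kind: "env-toml", Value: v}
        }
    }
    for _, name := range []string{".betterleaks.toml", ".gitleaks.toml"} {
        if fileExists(name) {
            return configSource{Kind: "target-file", Value: name}
        }
    }
    return configSource{Kind: "default"}
}

func main() {
    env := map[string]string{"GITLEAKS_CONFIG": "/etc/gitleaks.toml"}
    noFiles := func(string) bool { return false }
    fmt.Printf("%+v\n", resolveConfig("", env, noFiles))
}
```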

---

### Top-level config fields

The full set of top-level fields in a `betterleaks.toml`:

```toml
# Minimum binary version required to use this config
betterleaksMinVersion = "1.0.0"

# Minimum Gitleaks-format version (backwards compatibility)
minVersion = "8.0.0"

# Global prefilter: evaluated BEFORE any regex, has access to `attributes` only.
# Return true to SKIP the entire file/commit. Good for binary files or bot commits.
prefilter = '''
matchesAny(attributes[?"path"].orValue(""), [
  r"""(?i)\.(?:png|jpg|gif|svg|pdf|exe|bin)$""",
  r"""(?:^|/)node_modules(?:/.*)?$"""
])
|| attributes[?"git.author_name"].orValue("") == "renovate[bot]"
'''

# Global filter: evaluated AFTER regex match, has access to `attributes` + `finding`.
# Return true to DISCARD the finding.
filter = '''
containsAny(finding["secret"], [
  "EXAMPLE", "CHANGEME", "YOUR_API_KEY_HERE", "REDACTED"
])
|| (entropy(finding["secret"]) <= 2.5 && failsTokenEfficiency(finding["secret"]))
'''

# Inherit all default built-in rules and also load a remote base config
[extend]
useDefault = true
path = "https://raw.githubusercontent.com/example/configs/main/extra.toml"

# Detection rules (see below)
[[rules]]
# ...
```

---

### `[[rules]]` — Defining a detection rule

Each rule identifies a specific secret type. Set `keywords` on every rule for performance: the Aho-Corasick pre-filter runs a rule's `regex` only on content that contains at least one of its keywords.

```toml
[[rules]]
id          = "github-fine-grained-pat"
description = "GitHub Fine-Grained Personal Access Token"
keywords    = ["github_pat_"]
regex       = '''github_pat_\w{82}'''

# Rule-level filter: discards false positives for this rule only
filter = '''
(
    attributes[?"git.author_name"].orValue("").endsWith("[bot]") &&
    attributes[?"path"].orValue("").startsWith("tests/fixtures/") &&
    containsAny(finding["secret"], ["_MOCK_", "_TEST_"])
)
|| entropy(finding["secret"]) <= 3.0
'''

# Live validation against the GitHub API
validate = '''
cel.bind(r,
  http.get("https://api.github.com/user", {
    "Accept": "application/vnd.github+json",
    "Authorization": "token " + secret
  }),
  r.status == 200 && r.json.?login.orValue("") != "" ? {
    "result": "valid",
    "username": r.json.?login.orValue(""),
    "name":     r.json.?name.orValue(""),
    "scopes":   r.headers[?"x-oauth-scopes"].orValue("")
  } : r.status in [401, 403] ? {
    "result": "invalid",
    "reason": "Unauthorized"
  } : unknown(r)
)
'''
```
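
The keyword gate described for `[[rules]]` can be sketched as follows. This is an illustration only: `strings.Contains` stands in for the Aho-Corasick trie, case-insensitive keyword matching is an assumption, and `rule`, `scan`, `demoRule`, and `sampleToken` are hypothetical names. The regex and keyword are the ones from the rule above.

```go
package main

import (
    "fmt"
    "regexp"
    "strings"
)

// rule holds the two fields relevant to the keyword gate.
type rule struct {
    Keywords []string
    Regex    *regexp.Regexp
}

// scan runs the (expensive) regex only when a (cheap) keyword check passes.
// The second return value reports whether the regex was executed at all.
func scan(r rule, content string) ([]string, bool) {
    lower := strings.ToLower(content)
    for _, kw := range r.Keywords {
        if strings.Contains(lower, strings.ToLower(kw)) {
            return r.Regex.FindAllString(content, -1), true
        }
    }
    return nil, false
}

// demoRule mirrors the github-fine-grained-pat rule shown above.
func demoRule() rule {
    return rule{
        Keywords: []string{"github_pat_"},
        Regex:    regexp.MustCompile(`github_pat_\w{82}`),
    }
}

// sampleToken builds a syntactically matching, obviously fake token.
func sampleToken() string {
    return "github_pat_" + strings.Repeat("a", 82)
}

func main() {
    matches, ran := scan(demoRule(), "token = "+sampleToken())
    fmt.Println(len(matches), ran)

    matches, ran = scan(demoRule(), "nothing to see here")
    fmt.Println(len(matches), ran)
}
```

A single trie built from all rules' keywords lets the scanner decide in one pass over the content which regexes are worth running, which is why keywords that are too generic reduce the benefit.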

---

### `[[rules.required]]` — Composite (multi-part) rules

Require auxiliary findings to be present near a primary match before a finding is emitted. Both `withinLines` and `withinColumn` are optional proximity constraints.

```toml
# Primary rule: AWS Access Key ID
[[rules]]
id       = "aws-credentials"
keywords = ["AKIA"]
regex    = '''(?:A3T[A-Z0-9]|AKIA|AGPA|AIDA|AROA|AIPA|ANPA|ANVA|ASIA)[A-Z0-9]{16}'''

  # The primary match is only valid if a Secret Access Key is found within 5 lines
  [[rules.required]]
  id          = "aws-secret-key"
  withinLines = 5

# Auxiliary rule (skipReport = true means it is used only as a component)
[[rules]]
id         = "aws-secret-key"
keywords   = ["secret", "key"]
regex      = '''[A-Za-z0-9/+=]{40}'''
skipReport = true
```
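
The `withinLines` constraint can be read as a line-distance check between the primary match and any auxiliary match. A minimal sketch, assuming the distance is measured symmetrically (before or after the primary match) between start lines; `finding` and `hasRequiredNearby` are hypothetical names, not part of the Betterleaks API:

```go
package main

import "fmt"

// finding carries only the fields needed for the proximity check.
type finding struct {
    RuleID    string
    StartLine int
}

// hasRequiredNearby reports whether an auxiliary finding with the required
// rule ID starts within withinLines lines of the primary finding.
// Assumption: lines before and after the primary match both count.
func hasRequiredNearby(primary finding, aux []finding, requiredID string, withinLines int) bool {
    for _, a := range aux {
        if a.RuleID != requiredID {
            continue
        }
        dist := a.StartLine - primary.StartLine
        if dist < 0 {
            dist = -dist
        }
        if dist <= withinLines {
            return true
        }
    }
    return false
}

func main() {
    primary := finding{RuleID: "aws-credentials", StartLine: 10}
    aux := []finding{{RuleID: "aws-secret-key", StartLine: 13}}
    fmt.Println(hasRequiredNearby(primary, aux, "aws-secret-key", 5)) // 3 lines apart
}
```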

---

## CEL Filter Bindings

### `prefilter` — File/commit-level skip expressions

A `prefilter` expression has access to `attributes` only. Return `true` to skip the entire resource (file or commit) before any regex runs.

```toml
prefilter = '''
(
  // Skip binary/media files
  matchesAny(attributes[?"path"].orValue(""), [
    r"""(?i)\.(png|jpg|gif|mp4|zip|tar\.gz|exe)$"""
  ])
)
|| (
  // Skip the entire commit if authored by any known bot
  matchesAny(attributes[?"git.author_name"].orValue(""), [
    r"""(?i)\[bot\]$""",
    r"""^renovate$"""
  ])
)
|| (
  // Skip vendor and generated directories
  matchesAny(attributes[?"path"].orValue(""), [
    r"""(?:^|/)vendor/""",
    r"""(?:^|/)\.gen/"""
  ])
)
'''
```

---

### `filter` — Post-match finding discard expressions

Available at the global level and per rule. Return `true` to discard the candidate finding. The expression has access to both `attributes` and `finding`.

```toml
filter = '''
(
  // Discard if authored by a CI bot AND the file is a test fixture AND secret is a placeholder
  attributes[?"git.author_name"].orValue("").endsWith("[bot]") &&
  attributes[?"path"].orValue("").startsWith("tests/fixtures/") &&
  containsAny(finding["secret"], ["_MOCK_", "_TEST_", "placeholder"])
)
||
(
  // Discard if it's a markdown or text file with instructional language on the same line
  matchesAny(attributes[?"path"].orValue(""), [r"""(?i)\.(md|txt|rst)$"""]) &&
  containsAny(finding["line"], ["Example:", "Replace this:", "YOUR_KEY_HERE"])
)
||
(
  // Discard low-entropy natural-language false positives
  entropy(finding["secret"]) <= 2.5 &&
  failsTokenEfficiency(finding["secret"])
)
'''
```

**Available bindings:**

| Binding | Signature | Description |
|---|---|---|
| `attributes` | `map[string]string` | Metadata: `path`, `git.sha`, `git.author_name`, `git.author_email`, `git.date`, `git.message`, `git.remote_url`, `git.platform`, `fs.symlink` |
| `finding` | `map[string]string` | Keys: `secret`, `match`, `line`, `rule_id`, `description` |
| `matchesAny` | `(string, list<string>) → bool` | True if string matches any regex pattern in list (Aho-Corasick + RE2) |
| `containsAny` | `(string, list<string>) → bool` | True if string contains any substring in list (Aho-Corasick) |
| `entropy` | `(string) → double` | Shannon entropy in bits |
| `failsTokenEfficiency` | `(string) → bool` | True if string tokenizes like natural language (BPE cl100k_base) |

---

## CEL Validation Bindings

### `validate` — Live secret verification via HTTP

The `validate` expression must return a `map` with a `"result"` key. Valid statuses: `"valid"`, `"invalid"`, `"revoked"`, `"unknown"`, `"error"`. All additional keys are attached to the finding as metadata.

```toml
# Validate a Stripe secret key
validate = '''
cel.bind(r,
  http.get("https://api.stripe.com/v1/balance", {
    "Authorization": "Bearer " + secret
  }),
  r.status == 200 ? {
    "result": "valid"
  } : r.status in [401, 403] ? {
    "result": "invalid",
    "reason": "Unauthorized"
  } : unknown(r)
)
'''
```
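
The ternary chain in the Stripe expression is a status-to-result mapping. The Go sketch below restates that decision table for readers less familiar with CEL. `classify` is a hypothetical function; only the result strings and the `"HTTP <N>"` reason format are taken from this document.

```go
package main

import (
    "fmt"
    "strconv"
)

// classify mirrors the CEL expression above:
// 200 -> valid, 401/403 -> invalid, anything else -> unknown.
func classify(status int) map[string]string {
    switch {
    case status == 200:
        return map[string]string{"result": "valid"}
    case status == 401 || status == 403:
        return map[string]string{"result": "invalid", "reason": "Unauthorized"}
    default:
        // Same shape as the unknown(r) helper: {"result": "unknown", "reason": "HTTP <N>"}
        return map[string]string{"result": "unknown", "reason": "HTTP " + strconv.Itoa(status)}
    }
}

func main() {
    for _, status := range []int{200, 401, 500} {
        fmt.Println(status, classify(status))
    }
}
```

Treating unexpected statuses (rate limits, server errors) as `unknown` rather than `invalid` avoids dismissing a live credential because of a transient API failure.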

**Available bindings:**

| Binding | Signature | Description |
|---|---|---|
| `secret` | `string` | The extracted secret value |
| `captures` | `map[string]string` | Named regex capture groups from the rule's regex |
| `http.get` | `(url: string, headers: map) → map` | GET request; returns `{status, json, body, headers}` |
| `http.post` | `(url, headers, body: string) → map` | POST request; same response map |
| `cel.bind` | `(name, value, expr)` | Binds a variable to avoid repeating sub-expressions |
| `unknown` | `(response: map) → map` | Returns `{"result": "unknown", "reason": "HTTP <N>"}` |
| `crypto.md5` | `(bytes) → bytes` | MD5 hash |
| `crypto.sha1` | `(bytes) → bytes` | SHA-1 hash |
| `crypto.hmac_sha256` | `(key: bytes, msg: bytes) → bytes` | HMAC-SHA256 |
| `hex.encode` | `(bytes) → string` | Lowercase hex encoding |
| `time.now_unix` | `() → string` | Current Unix timestamp as string |
| `aws.validate` | `(access_key_id, secret_access_key: string) → map` | SigV4-signed STS GetCallerIdentity call; returns `{status, arn, account, userid}` |

```toml
# AWS credential validation using the built-in SigV4 helper
[[rules]]
id       = "aws-access-key"
keywords = ["AKIA"]
regex    = '''(?P<key>(?:A3T[A-Z0-9]|AKIA|AGPA)[A-Z0-9]{16})'''

validate = '''
cel.bind(r,
  aws.validate(captures["key"], secret),
  r.status == 200 ? {
    "result":  "valid",
    "arn":     r[?"arn"].orValue(""),
    "account": r[?"account"].orValue(""),
    "userid":  r[?"userid"].orValue("")
  } : r.status in [400, 403] ? {
    "result": "invalid",
    "reason": r[?"error_code"].orValue("InvalidClientTokenId")
  } : unknown(r)
)
'''

# Validate using a custom HMAC-signed POST request (generic API example).
# This expression belongs to a separate rule: a TOML table cannot define `validate` twice.
validate = '''
cel.bind(ts, time.now_unix(),
  cel.bind(sig,
    hex.encode(crypto.hmac_sha256(bytes(secret), bytes("ts=" + ts))),
    cel.bind(r,
      http.post("https://api.example.com/v1/verify", {
        "Content-Type": "application/json",
        "X-Timestamp":  ts,
        "X-Signature":  sig
      }, "{\"token\": \"" + secret + "\"}"),
      r.status == 200 ? {"result": "valid"} : unknown(r)
    )
  )
)
'''
```
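
For debugging an HMAC-signed validator, it helps to reproduce the signature outside CEL. The sketch below computes the equivalent of `hex.encode(crypto.hmac_sha256(bytes(secret), bytes("ts=" + ts)))` with the Go standard library. `signHex` is a hypothetical helper, and the secret and timestamp values are made up for illustration.

```go
package main

import (
    "crypto/hmac"
    "crypto/sha256"
    "encoding/hex"
    "fmt"
)

// signHex returns the lowercase hex encoding of HMAC-SHA256(key, msg).
func signHex(key, msg string) string {
    mac := hmac.New(sha256.New, []byte(key))
    mac.Write([]byte(msg)) // hash.Hash writes never return an error
    return hex.EncodeToString(mac.Sum(nil))
}

func main() {
    secret := "example-secret" // illustrative value, not a real credential
    ts := "1700000000"         // fixed timestamp so the output is reproducible
    fmt.Println(signHex(secret, "ts="+ts))
}
```

The result is always 64 hex characters, which makes it easy to compare against the `X-Signature` header captured with `--validation-debug`.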

---

## `Detector` — Programmatic Go API

### `detect.NewDetectorContext` — Create a detector with full options

Compiles all CEL filter and validation programs from the config, sets up the validation worker pool, and builds the Aho-Corasick keyword trie. This is the primary constructor for programmatic use.

```go
package main

import (
    "context"
    "fmt"
    "strings"

    "github.com/betterleaks/betterleaks/config"
    "github.com/betterleaks/betterleaks/detect"
    "github.com/betterleaks/betterleaks/sources"
    "github.com/spf13/viper"
)

func main() {
    // Load config from TOML
    viper.SetConfigType("toml")
    if err := viper.ReadConfig(strings.NewReader(config.DefaultConfig)); err != nil {
        panic(err)
    }
    var vc config.ViperConfig
    if err := viper.Unmarshal(&vc); err != nil {
        panic(err)
    }
    cfg, err := vc.Translate()
    if err != nil {
        panic(err)
    }

    // Create detector with validation enabled
    valOpts := detect.ValidationOptions{
        Enabled:      true,
        Workers:      10,
        StatusFilter: "valid,revoked", // only emit valid or revoked findings
    }
    d := detect.NewDetectorContext(context.Background(), cfg, valOpts)

    // Scan the files under a path on disk
    ctx := context.Background()
    src := &sources.Files{Path: "/path/to/scan", MaxFileSize: 5_000_000}

    for result := range d.Run(ctx, src) {
        if result.Err != nil {
            fmt.Println("error:", result.Err)
            continue
        }
        f := result.Finding
        fmt.Printf("[%s] %s at %s:%d (status=%s)\n",
            f.RuleID, f.Secret[:min(len(f.Secret), 12)]+"...",
            f.Attr("path"), f.StartLine, f.ValidationStatus)
    }
}

func min(a, b int) int {
    if a < b { return a }
    return b
}
```

---

### `detect.Detector.Run` — Streaming results iterator

`Run` returns a Go 1.23 `iter.Seq[Result]` that yields `Result{Finding, Err}` as the scan proceeds. Context cancellation stops the scan gracefully.

```go
// Fragment: assumes `d` (the detector) and `gitCmd` were created earlier.
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
defer cancel()

src := &sources.Git{
    Cmd:        gitCmd,
    ShouldSkip: d.SkipFunc(), // applies the CEL prefilter
    Platform:   scm.NoPlatform,
    Sema:       d.Sema,
}

var findings []report.Finding
for result := range d.Run(ctx, src) {
    if result.Err != nil {
        log.Printf("scan error: %v", result.Err)
        continue
    }
    findings = append(findings, result.Finding)
    fmt.Printf("Found secret: rule=%s file=%s line=%d\n",
        result.Finding.RuleID,
        result.Finding.Attr(sources.AttrPath),
        result.Finding.StartLine,
    )
}
fmt.Printf("Total findings: %d\n", len(findings))
```
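
The push-style iterator contract behind `Run` can be illustrated without the Betterleaks packages. In the sketch below, `seq` has the same shape as `iter.Seq[result]` and is invoked directly with a callback, which is what `for r := range seq` does on Go 1.23 and later. All names are hypothetical, and surfacing the context error as a final result is an assumption about cancellation behavior, not something this document specifies.

```go
package main

import (
    "context"
    "fmt"
)

type result struct {
    Finding string
    Err     error
}

// seq has the same shape as iter.Seq[result]: the producer pushes values
// into yield and must stop as soon as yield returns false.
type seq func(yield func(result) bool)

// run yields one result per item and stops early when ctx is cancelled.
func run(ctx context.Context, items []string) seq {
    return func(yield func(result) bool) {
        for _, item := range items {
            if err := ctx.Err(); err != nil {
                yield(result{Err: err}) // assumption: cancellation is reported as a result
                return
            }
            if !yield(result{Finding: item}) {
                return // consumer asked to stop
            }
        }
    }
}

// demo consumes the sequence, optionally cancelling the context up front.
func demo(cancelFirst bool) ([]string, error) {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()
    if cancelFirst {
        cancel()
    }
    var found []string
    var scanErr error
    run(ctx, []string{"a", "b", "c"})(func(r result) bool {
        if r.Err != nil {
            scanErr = r.Err
            return false
        }
        found = append(found, r.Finding)
        return true
    })
    return found, scanErr
}

func main() {
    fmt.Println(demo(false))
    fmt.Println(demo(true))
}
```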

---

### `detect.Detector.DetectString` — Scan an in-memory string

Scans a raw string with no source metadata (no file path, no git attributes). Useful for unit tests or scanning generated content.

```go
d, err := detect.NewDetectorDefaultConfig()
if err != nil {
    log.Fatal(err)
}

content := `
DB_PASSWORD=super_secret_password_abc123
AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
`

findings := d.DetectString(content)
for _, f := range findings {
    fmt.Printf("Rule: %-30s Secret: %s\n", f.RuleID, f.Secret)
}
// Illustrative output (actual rule IDs and matches depend on the active config):
// Rule: aws-access-key-id              Secret: AKIAIOSFODNN7EXAMPLE
// Rule: aws-secret-access-key          Secret: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```

---

### `detect.Detector.AddBaseline` — Suppress previously known findings

Loads a prior JSON report as a baseline; any finding that already appears in the baseline is silently suppressed. Use to focus CI scans on net-new secrets.

```go
// Step 1 (run in a shell, not Go): generate a baseline on the main branch
//   betterleaks git . -f json -r baseline.json

// Step 2: in a PR scan, suppress all baseline findings
d.AddBaseline("baseline.json", "/path/to/repo")

// Step 3: only new findings are emitted
for result := range d.Run(ctx, src) { ... }
```

```go
// Programmatic baseline usage
if err := d.AddBaseline("./baseline.json", "/repo"); err != nil {
    log.Fatalf("baseline load failed: %v", err)
}
```
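
Conceptually, a baseline is a set of already-known findings that is subtracted from the current scan. The sketch below shows that idea with the standard library only. It assumes findings are matched by their `Fingerprint` field, which is a simplification: the real comparison may consider other fields. `finding`, `loadBaseline`, `newFindings`, and `demo` are hypothetical names.

```go
package main

import (
    "encoding/json"
    "fmt"
)

// finding models only the report fields this sketch needs.
type finding struct {
    RuleID      string `json:"RuleID"`
    Fingerprint string `json:"Fingerprint"`
}

// loadBaseline parses a JSON report (an array of findings) into a lookup set.
func loadBaseline(data []byte) (map[string]struct{}, error) {
    var known []finding
    if err := json.Unmarshal(data, &known); err != nil {
        return nil, err
    }
    set := make(map[string]struct{}, len(known))
    for _, f := range known {
        set[f.Fingerprint] = struct{}{}
    }
    return set, nil
}

// newFindings keeps only findings whose fingerprint is absent from the baseline.
func newFindings(current []finding, baseline map[string]struct{}) []finding {
    var out []finding
    for _, f := range current {
        if _, seen := baseline[f.Fingerprint]; !seen {
            out = append(out, f)
        }
    }
    return out
}

// demo runs the subtraction on a small in-memory example.
func demo() ([]finding, error) {
    baseline, err := loadBaseline([]byte(`[{"RuleID":"github-pat","Fingerprint":"a1b2c3:config/.env:github-pat:12"}]`))
    if err != nil {
        return nil, err
    }
    current := []finding{
        {RuleID: "github-pat", Fingerprint: "a1b2c3:config/.env:github-pat:12"},
        {RuleID: "generic-api-key", Fingerprint: "d4e5f6:app/main.go:generic-api-key:7"},
    }
    return newFindings(current, baseline), nil
}

func main() {
    fmt.Println(demo())
}
```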

---

### `detect.Detector.AddGitleaksIgnore` — Load a `.betterleaksignore` file

Loads fingerprint entries from a `.betterleaksignore` (or `.gitleaksignore`) file. Each line is a fingerprint in one of two formats:

```
# Global fingerprint: file:rule-id:start-line
config/database.yml:generic-api-key:42

# Commit fingerprint: commit-sha:file:rule-id:start-line
a1b2c3d4:config/database.yml:generic-api-key:42
```

```go
if err := d.AddGitleaksIgnore("./.betterleaksignore"); err != nil {
    log.Fatal(err)
}
```
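
The two fingerprint formats differ only in the optional leading commit SHA, so they can be told apart by field count. A minimal parsing sketch, assuming no field (in particular the file path) contains a colon; `fingerprint` and `parseFingerprint` are hypothetical names, not part of the Betterleaks API:

```go
package main

import (
    "fmt"
    "strconv"
    "strings"
)

// fingerprint is the parsed form of one ignore-file entry.
type fingerprint struct {
    Commit    string // empty for the global (commit-less) format
    File      string
    RuleID    string
    StartLine int
}

// parseFingerprint accepts "file:rule-id:line" or "commit:file:rule-id:line".
// Assumption: no field contains ':'.
func parseFingerprint(s string) (fingerprint, error) {
    var fp fingerprint
    parts := strings.Split(s, ":")
    switch len(parts) {
    case 3:
        fp.File, fp.RuleID = parts[0], parts[1]
    case 4:
        fp.Commit, fp.File, fp.RuleID = parts[0], parts[1], parts[2]
    default:
        return fp, fmt.Errorf("fingerprint %q: want 3 or 4 colon-separated fields, got %d", s, len(parts))
    }
    line, err := strconv.Atoi(parts[len(parts)-1])
    if err != nil {
        return fp, fmt.Errorf("fingerprint %q: invalid line number: %w", s, err)
    }
    fp.StartLine = line
    return fp, nil
}

func main() {
    for _, entry := range []string{
        "config/database.yml:generic-api-key:42",
        "a1b2c3d4:config/database.yml:generic-api-key:42",
    } {
        fp, err := parseFingerprint(entry)
        fmt.Printf("%+v %v\n", fp, err)
    }
}
```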

---

## Report Formats

### JSON report (`-f json`)

The default structured output format. All `Finding` fields are serialized; validation metadata is included when `--validation` is active.

```bash
betterleaks git . -f json -r findings.json -v
# findings.json will contain a JSON array of Finding objects, e.g.:
# [
#   {
#     "RuleID": "github-pat",
#     "Description": "GitHub Personal Access Token",
#     "StartLine": 12,
#     "EndLine": 12,
#     "Match": "ghp_abc123...",
#     "Secret": "ghp_abc123...",
#     "Fingerprint": "a1b2c3:config/.env:github-pat:12",
#     "Attributes": {"path": "config/.env", "git.sha": "a1b2c3", ...},
#     "ValidationStatus": "valid",
#     "ValidationMeta": {"username": "octocat"}
#   }
# ]
```

---

### SARIF report (`-f sarif`)

Standard SARIF 2.1.0 format for integration with GitHub Advanced Security, VS Code, and other tools.

```bash
betterleaks git . -f sarif -r results.sarif
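# Upload to GitHub code scanning (the API expects gzip-compressed, Base64-encoded SARIF):
# gzip -c results.sarif | base64 | tr -d '\n' > results.sarif.b64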
# gh api repos/OWNER/REPO/code-scanning/sarifs \
#   -F commit_sha=$(git rev-parse HEAD) \
#   -F ref=refs/heads/main \
#   -F sarif=@results.sarif.b64
```
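
The `results.sarif.b64` file referenced in the upload command is the SARIF report after gzip compression and Base64 encoding, which is the payload format the GitHub code scanning upload endpoint expects. When the upload is driven from a Go service rather than a shell, the encoding can be done with the standard library. `encodeSARIF` and `decodeSARIF` are hypothetical helper names; the decoder is included only to verify the round trip.

```go
package main

import (
    "bytes"
    "compress/gzip"
    "encoding/base64"
    "fmt"
    "io"
)

// encodeSARIF gzip-compresses the report and Base64-encodes the result.
func encodeSARIF(raw []byte) (string, error) {
    var buf bytes.Buffer
    zw := gzip.NewWriter(&buf)
    if _, err := zw.Write(raw); err != nil {
        return "", err
    }
    if err := zw.Close(); err != nil { // Close flushes the gzip trailer
        return "", err
    }
    return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

// decodeSARIF reverses encodeSARIF.
func decodeSARIF(encoded string) ([]byte, error) {
    compressed, err := base64.StdEncoding.DecodeString(encoded)
    if err != nil {
        return nil, err
    }
    zr, err := gzip.NewReader(bytes.NewReader(compressed))
    if err != nil {
        return nil, err
    }
    defer zr.Close()
    return io.ReadAll(zr)
}

func main() {
    encoded, err := encodeSARIF([]byte(`{"version":"2.1.0","runs":[]}`))
    if err != nil {
        panic(err)
    }
    decoded, err := decodeSARIF(encoded)
    fmt.Println(string(decoded), err)
}
```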

---

### CSV report (`-f csv`)

```bash
betterleaks dir /app -f csv -r secrets.csv
# RuleID,Description,StartLine,EndLine,Secret,File,Commit,Author,Email,Date,Fingerprint
# github-pat,GitHub PAT,5,5,ghp_xxx,src/config.go,,,,,src/config.go:github-pat:5
```

---

### Template report (`-f template`)

Use a Go `text/template` file to generate any custom output format.

```bash
betterleaks git . -f template --report-template=./my-template.tmpl -r report.txt
```

```go-template
{{/* my-template.tmpl */}}
{{range .}}FINDING: {{.RuleID}} in {{.File}} at line {{.StartLine}}
  Secret:      {{.Secret}}
  Fingerprint: {{.Fingerprint}}
{{end}}
```
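
A template can be tested locally before wiring it into a scan by rendering it with Go's `text/template` package. In this sketch the `finding` struct is hypothetical and contains only the fields the template above references; the data Betterleaks actually passes to templates may carry more fields or a different shape.

```go
package main

import (
    "fmt"
    "strings"
    "text/template"
)

// finding is a stand-in with just the fields used by the template.
type finding struct {
    RuleID      string
    File        string
    StartLine   int
    Secret      string
    Fingerprint string
}

// tmplText is the body of my-template.tmpl without the leading comment.
const tmplText = `{{range .}}FINDING: {{.RuleID}} in {{.File}} at line {{.StartLine}}
  Secret:      {{.Secret}}
  Fingerprint: {{.Fingerprint}}
{{end}}`

// render executes the template over a slice of findings.
func render(findings []finding) (string, error) {
    t, err := template.New("report").Parse(tmplText)
    if err != nil {
        return "", err
    }
    var sb strings.Builder
    if err := t.Execute(&sb, findings); err != nil {
        return "", err
    }
    return sb.String(), nil
}

func main() {
    out, err := render([]finding{{
        RuleID:      "github-pat",
        File:        "config/.env",
        StartLine:   12,
        Secret:      "ghp_abc123...",
        Fingerprint: "a1b2c3:config/.env:github-pat:12",
    }})
    if err != nil {
        panic(err)
    }
    fmt.Print(out)
}
```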

---

## Pre-commit Integration

### Native pre-commit hook (Go binary)

Add to `.pre-commit-config.yaml` to run Betterleaks on every `git commit`:

```yaml
repos:
  - repo: https://github.com/betterleaks/betterleaks
    rev: v1.0.0  # use the latest tag
    hooks:
      - id: betterleaks          # uses installed Go binary
      # - id: betterleaks-docker  # uses Docker image (no local install needed)
      # - id: betterleaks-system  # uses system-installed binary
```

Each hook runs:
```
betterleaks git --pre-commit --redact --staged --verbose
```

---

## Diagnostics / Profiling

### `--diagnostics` — CPU, memory, trace, and HTTP pprof

```bash
# CPU + memory profiles saved to /tmp/prof/
betterleaks git . --diagnostics=cpu,mem --diagnostics-dir=/tmp/prof

# Execution trace
betterleaks git . --diagnostics=trace --diagnostics-dir=/tmp/prof

# Live HTTP pprof server at http://localhost:6060/debug/pprof/
betterleaks git . --diagnostics=http

# Analyze CPU profile
go tool pprof /tmp/prof/cpu.pprof

# Analyze trace
go tool trace /tmp/prof/trace.out
```

---

## Global Flags Reference

The following flags are available on all subcommands:

```
-c, --config string              Config file path (default: auto-discovered .betterleaks.toml)
    --exit-code int              Exit code when leaks are found (default 1)
-r, --report-path string         Report output file path (use "-" for stdout)
-f, --report-format string       Output format: json, csv, junit, sarif, template
    --report-template string     Template file for --report-format=template
-b, --baseline-path string       Baseline JSON report; matching findings are suppressed
-l, --log-level string           Log level: trace, debug, info, warn, error, fatal (default "info")
-v, --verbose                    Print each finding as it is found
    --no-color                   Disable ANSI color in output
    --max-target-megabytes int   Skip files larger than this many MB
    --redact uint                Redact secrets (0=none, 1–99=partial %, 100=REDACTED)
    --enable-rule stringSlice    Only run specific rule IDs
-i, --gitleaks-ignore-path       Path to .betterleaksignore file or directory
    --match-context string       Context around matches, e.g. "10L", "100C", "-2C,+4C"
    --max-decode-depth int       Recursive decode passes for base64/URL-encoded data (default 5)
    --max-archive-depth int      Scan inside nested archives up to N levels deep (default 0)
    --timeout int                Global timeout in seconds (default 0, no timeout)
    --regex-engine string        Regex engine: re2 (default) or stdlib
    --validation                 Enable live API validation of findings
    --validation-status string   Filter by status: valid, invalid, revoked, error, unknown, none
    --validation-timeout dur     Per-request HTTP timeout (default 10s)
    --validation-workers int     Concurrent validation workers (default 10)
    --validation-debug           Include raw HTTP request/response in finding metadata
    --ignore-gitleaks-allow      Ignore // betterleaks:allow and // gitleaks:allow comments
    --diagnostics string         Profiling: cpu, mem, trace, http (comma-separated)
    --diagnostics-dir string     Output directory for profiling files
```

---

Betterleaks covers the main secrets-scanning use cases: CI/CD pipeline integration via `betterleaks git` with SARIF or JSON output; developer-side protection through pre-commit hooks that scan staged changes before they are committed; filesystem auditing with `betterleaks dir` for deployment artifacts and configuration directories; and pipe-based scanning of any command output through `betterleaks stdin`. In all modes, a shared configuration lets teams encode organization-specific exclusion rules (test fixtures, bot authors, known-safe placeholder strings) once in a `.betterleaks.toml` committed at the repository root.

The most powerful integration pattern is combining CEL-based filtering with live validation: rules that detect a specific credential type define both a `filter` to cut noise and a `validate` expression to confirm the secret is active before alerting. Findings from validated scans can be consumed programmatically via the `detector.Run(ctx, src)` iterator in Go services, or reported in SARIF format to GitHub Advanced Security, enabling automatic PR annotations for newly introduced secrets. Baseline files (`--baseline-path`) allow teams to suppress pre-existing historical findings and focus developer attention only on secrets introduced since the baseline was captured.