### DataFog Development Setup

Source: https://github.com/datafog/datafog-python/blob/dev/README.md

Clone the repository, set up a virtual environment, install dependencies, and run tests for development.

```bash
git clone https://github.com/datafog/datafog-python
cd datafog-python
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e ".[all,dev]"
pip install -r requirements-dev.txt
pytest tests/
```

--------------------------------

### Install DataFog Core and Optional Extras

Source: https://github.com/datafog/datafog-python/blob/dev/docs/roadmap.md

Install the lightweight core package or include optional extras for advanced NLP or OCR capabilities. The `[all]` option installs the full functionality.

```bash
pip install datafog
```

```bash
pip install datafog[nlp]
```

```bash
pip install datafog[ocr]
```

```bash
pip install datafog[all]
```

--------------------------------

### Install DataFog with Dev Dependencies

Source: https://github.com/datafog/datafog-python/blob/dev/Claude.md

Clone the repository, set up a virtual environment, and install the package with development dependencies. Also includes commands for installing optional ML extras.

```bash
git clone https://github.com/datafog/datafog-python.git
cd datafog-python
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]" && pip install -r requirements-dev.txt
pre-commit install
pip install -e ".[nlp]"
pip install -e ".[nlp-advanced]"
pip install -e ".[all]"
```

--------------------------------

### Install DataFog Python Packages

Source: https://github.com/datafog/datafog-python/blob/dev/README.md

Install the core DataFog library, optionally with spaCy (`nlp`) or GLiNER plus spaCy (`nlp-advanced`) support. Use `all` for a complete installation.

```bash
# Core install (regex engine)
pip install datafog
```

```bash
# Add spaCy support
pip install datafog[nlp]
```

```bash
# Add GLiNER + spaCy support
pip install datafog[nlp-advanced]
```

```bash
# Everything
pip install datafog[all]
```

--------------------------------

### Install DataFog Python Library

Source: https://context7.com/datafog/datafog-python/llms.txt

Install the DataFog library with different levels of support. Core includes the regex engine. '[nlp]' adds spaCy NER. '[nlp-advanced]' adds GLiNER ML-based NER. '[ocr]' adds OCR image processing. '[all]' installs everything.

```bash
pip install datafog
```

```bash
pip install datafog[nlp]
```

```bash
pip install datafog[nlp-advanced]
```

```bash
pip install datafog[ocr]
```

```bash
pip install datafog[all]
```
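After installing, it can help to confirm which optional dependencies are actually importable. The sketch below uses only the standard library. The mapping from extras to module names (`spacy`, `gliner`, `pytesseract`) is an assumption made for illustration and is not taken from DataFog's packaging metadata.

```python
from importlib.util import find_spec

# Hypothetical mapping from an extra to a module that extra is assumed to provide.
ASSUMED_EXTRA_MODULES = {
    "nlp": "spacy",
    "nlp-advanced": "gliner",
    "ocr": "pytesseract",
}


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


def available_extras(mapping: dict[str, str] = ASSUMED_EXTRA_MODULES) -> dict[str, bool]:
    """Report, per extra, whether its assumed module is installed."""
    return {extra: is_importable(module) for extra, module in mapping.items()}


for extra, present in available_extras().items():
    print(f"{extra}: {'installed' if present else 'missing'}")
```

Because `find_spec` only locates the module, the check stays fast even when the underlying packages are heavy to import.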

--------------------------------

### Clone and Set Up Local Development Environment

Source: https://github.com/datafog/datafog-python/blob/dev/CONTRIBUTING.md

Clone the repository, create a virtual environment, and install the project with development and CLI dependencies.

```bash
git clone https://github.com/datafog/datafog-python
cd datafog-python
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
python -m pip install --upgrade pip
pip install -e ".[dev,cli]"
```

--------------------------------

### DataFog Performance Showcase Setup

Source: https://github.com/datafog/datafog-python/blob/dev/examples/quick_start.ipynb

Define a realistic, multi-section business document to use as benchmark input, then print its character and line counts. The document mixes emails, phone numbers, SSNs, addresses, and a credit card number so that a benchmark exercises several entity types.

```python
# Realistic business document (similar to what you'd process in production)
large_document = """
CONFIDENTIAL EMPLOYEE REPORT - Q1 2024

=== EXECUTIVE SUMMARY ===
Report generated by: Sarah Johnson (sarah.johnson@company.com)
Date: March 15, 2024
Department: Human Resources
Contact: (555) 100-HR00 ext. 1234

=== EMPLOYEE RECORDS ===

1. John Smith (ID: EMP-001)
   Email: john.smith@company.com
   Phone: (555) 123-4567
   SSN: 123-45-6789
   Address: 123 Oak Street, San Francisco, CA 94102
   Manager: David Chen (david.chen@company.com)
   Salary: $85,000 annually
   Start Date: January 15, 2020

2. Maria Rodriguez (ID: EMP-002)
   Email: maria.rodriguez@company.com
   Phone: (555) 987-6543
   SSN: 987-65-4321
   Address: 456 Pine Ave, Los Angeles, CA 90210
   Manager: Lisa Wang (lisa.wang@company.com)
   Emergency Contact: Carlos Rodriguez (555) 111-2233

3. Michael Johnson (ID: EMP-003)
   Email: michael.j@company.com
   Personal Email: mike.personal@gmail.com
   Phone: (555) 456-7890
   SSN: 456-78-9012
   Credit Card on file: 4532-1234-5678-9012 (expires 12/26) 
   
=== PAYROLL INFORMATION ===
Bank routing: 123456789
Direct deposit accounts verified on 2024-03-01
Tax ID: 12-3456789

=== CONTACT INFORMATION ===
HR Helpline: (555) 888-4HR7
Benefits questions: benefits@company.com
IT Support: support@company.com
Office address: 789 Corporate Blvd, Suite 100, Business City, NY 10001

This document contains sensitive employee information and should be handled according to 
company privacy policies and applicable laws including GDPR, CCPA, and HIPAA where applicable.

Report ID: RPT-2024-Q1-001
Classification: CONFIDENTIAL
Retention: 7 years from creation date
"""

print("🚀 Performance Benchmark\n")
print(f"📄 Document size: {len(large_document):,} characters")
print(f"📝 Lines of text: {len(large_document.splitlines())}")
print("=" * 60)
```
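A document like the one above is typically fed to a small timing harness. The following is a minimal sketch of such a harness using `time.perf_counter`. The `find_emails` function is a stand-in detector so the example runs without DataFog installed; in practice you would pass a callable that wraps whichever DataFog call you want to measure.

```python
import re
import time
from typing import Callable

# Stand-in detector (an assumption for this sketch, not DataFog's engine).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def find_emails(text: str) -> list[str]:
    return EMAIL_RE.findall(text)


def benchmark(func: Callable[[str], object], text: str, runs: int = 5) -> float:
    """Return the best wall-clock time in seconds over `runs` calls."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        func(text)
        best = min(best, time.perf_counter() - start)
    return best


sample = "Contact sarah.johnson@company.com or benefits@company.com\n" * 200
elapsed = benchmark(find_emails, sample)
print(f"Best of 5 runs: {elapsed * 1000:.3f} ms for {len(sample):,} characters")
```

Taking the best of several runs reduces noise from one-off effects such as cold caches.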

--------------------------------

### Install Optional NLP and OCR Dependencies

Source: https://github.com/datafog/datafog-python/blob/dev/CONTRIBUTING.md

Install additional dependency profiles on top of the development setup: spaCy-based NLP (`nlp`), advanced NLP with GLiNER (`nlp-advanced`), or everything via `all`.

```bash
pip install -e ".[dev,cli,nlp]"
```

```bash
pip install -e ".[dev,cli,nlp,nlp-advanced]"
```

```bash
pip install -e ".[all,dev]"
```

--------------------------------

### Install DataFog with Optional ML Engines

Source: https://github.com/datafog/datafog-python/blob/dev/Claude.md

Demonstrates how to install the core DataFog library and optional packages for specific ML engines like spaCy and GLiNER.

```bash
# Lightweight core (<2MB)
pip install datafog

# Optional ML engines
pip install datafog[nlp]           # spaCy (traditional NLP)
pip install datafog[nlp-advanced]  # GLiNER (modern NER)
pip install datafog[ocr]           # Image processing
pip install datafog[all]           # Everything
```

--------------------------------

### Install Pinned Local Tooling

Source: https://github.com/datafog/datafog-python/blob/dev/CONTRIBUTING.md

After installing the editable package, install the pinned development requirements for consistent local tooling.

```bash
pip install -r requirements-dev.txt
```

--------------------------------

### Install Tesseract OCR and Dependencies

Source: https://github.com/datafog/datafog-python/blob/dev/examples/image_processing.ipynb

Installs Tesseract OCR, its development libraries, and the `nest_asyncio` package. This is often a prerequisite for OCR tasks. The leading `!` runs each command from a notebook cell; omit it in a regular shell.

```bash
! apt install tesseract-ocr
! apt install libtesseract-dev
! pip install nest_asyncio
```

--------------------------------

### GitHub Release Description Template

Source: https://github.com/datafog/datafog-python/blob/dev/templates/social_media_templates.md

A comprehensive template for GitHub release descriptions. It includes sections for 'What's New', 'Quick Start' with options for lightweight or full installations, performance metrics, resources, and community engagement prompts.

````text
## What's New in {{version}}

{{changelog_content}}

## 🚀 Quick Start

### Lightweight Core (Recommended)

```bash
pip install datafog=={{version}}
```

### Full Features

```bash
pip install datafog[all]=={{version}}
```

## 📊 Performance Metrics

- **Processing Speed:** 190x faster than spaCy
- **Package Size:** ~2MB (core), ~8MB (full)
- **Install Time:** <15 seconds
- **Python Support:** 3.10, 3.11, 3.12

## 🔗 Resources

- [Documentation](https://docs.datafog.ai)
- [Quick Start Guide](https://github.com/datafog/datafog-python#quick-start)
- [Discord Community](https://discord.gg/bzDth394R4)

## 🙏 Community

Thanks to all contributors and users providing feedback!

⭐ Star us on GitHub if DataFog is helpful for your projects.
🐛 Report issues or request features in our [issue tracker](https://github.com/datafog/datafog-python/issues).

---

**Weekly releases every Friday** • Next release: {{next_friday}}
````

--------------------------------

### Install DataFog Library

Source: https://github.com/datafog/datafog-python/blob/dev/examples/image_processing.ipynb

Installs or upgrades the datafog library quietly using pip.

```bash
!pip install --upgrade datafog --quiet
```

--------------------------------

### Install DataFog via Pip

Source: https://github.com/datafog/datafog-python/blob/dev/docs/index.md

Installs the latest stable version of DataFog with CLI support. Use this command to set up the tool on your system.

```bash
pip install datafog
```

--------------------------------

### DataFog CLI: Get Help

Source: https://github.com/datafog/datafog-python/blob/dev/docs/index.md

Displays a list of all available operations and commands for the DataFog CLI. Useful for understanding the tool's capabilities.

```bash
datafog --help
```

--------------------------------

### Install DataFog Python Package

Source: https://github.com/datafog/datafog-python/blob/dev/templates/release_announcement.md

Use pip to install the DataFog Python package. Choose the lightweight core or the full features version.

```bash
# Lightweight core
pip install datafog=={{version}}
```

```bash
# Full features
pip install datafog[all]=={{version}}
```

--------------------------------

### Simple PII Detection with DataFog

Source: https://github.com/datafog/datafog-python/blob/dev/examples/quick_start.ipynb

Quickly detect Personally Identifiable Information (PII) in a given text using DataFog's default regex engine. This is the fastest way to get started with PII scanning.

```python
from datafog import DataFog

# Create a DataFog instance
detector = DataFog()

# Sample text with various PII types
sample_text = """
Hi there! I'm Dr. Sarah Johnson, and you can reach me at sarah.johnson@hospital.com 
or call my office at (555) 123-4567. My SSN is 123-45-6789 for verification.
I work at General Hospital located at 123 Main St, New York, NY 10001.
My credit card ending in 4111-1111-1111-1111 expires on 12/25.
"""

# Detect PII - this uses the fast regex engine by default
results = detector.scan_text(sample_text)

print("🔍 PII Detection Results:")
print(f"Found {len(results)} pieces of PII:")
for entity_type, entities in results.items():
    if entities:  # Only show types that were found
        print(f"  {entity_type}: {entities}")
```
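To make the shape of the result concrete, here is a minimal sketch of how a regex-based scanner can group matches by entity type. The patterns and the `scan` helper are illustrative assumptions; DataFog's built-in patterns and return format may differ.

```python
import re

# Illustrative patterns only; DataFog's own patterns may differ.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan(text: str) -> dict[str, list[str]]:
    """Return every match for each pattern, keyed by entity type."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}


results = scan("Reach me at sarah.johnson@hospital.com or (555) 123-4567. SSN: 123-45-6789.")
for entity_type, entities in results.items():
    if entities:  # Only show types that were found
        print(f"{entity_type}: {entities}")
```

Names and street addresses have no fixed shape, which is why they usually need an ML-based engine rather than patterns like these.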

--------------------------------

### Spark Service Initialization in DataFog

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/01-coverage-baseline-term-missing.txt

This line initializes the Spark service within the DataFog project by creating a Spark session and storing it on the instance. It is part of the internal setup rather than the public API.

```python
self.spark = self.create_spark_session()
```

--------------------------------

### Install DataFog with Advanced NLP Features

Source: https://github.com/datafog/datafog-python/blob/dev/examples/quick_start.ipynb

Install DataFog along with advanced Natural Language Processing features, including GLiNER and spaCy support. This command is useful for enabling all ML-based detection capabilities.

```python
# Install DataFog with advanced ML features
!pip install datafog[nlp-advanced] --quiet

print("✅ DataFog installed successfully!")
```

--------------------------------

### Initialize Spark Context

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/01-coverage-baseline-term-missing.txt

Ensures the Spark context is initialized, optionally with a gateway. This method handles the setup of the Spark context, including determining the gateway configuration.

```python
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
```

--------------------------------

### DataFog Python SDK: Scan Text for PII

Source: https://github.com/datafog/datafog-python/blob/dev/docs/index.md

Uses the DataFog Python SDK to scan text for PII. This example demonstrates initializing the client, fetching text data, and running the pipeline synchronously.

```python
import requests
from datafog import DataFog

# For text annotation
client = DataFog(operations="scan")

# Fetch sample medical record
doc_url = "https://gist.githubusercontent.com/sidmohan0/b43b72693226422bac5f083c941ecfdb/raw/b819affb51796204d59987893f89dee18428ed5d/note1.txt"
response = requests.get(doc_url)
text_lines = [line for line in response.text.splitlines() if line.strip()]

# Run annotation
annotations = client.run_text_pipeline_sync(str_list=text_lines)
print(annotations)
```

--------------------------------

### DataFog Python SDK: Scan Image for PII with OCR

Source: https://github.com/datafog/datafog-python/blob/dev/docs/index.md

Uses the DataFog Python SDK to perform OCR and scan images for PII. This example shows how to initialize the client for extraction and scanning, and run the OCR pipeline asynchronously.

```python
import asyncio
from datafog import DataFog

# For OCR and PII annotation
ocr_client = DataFog(operations="extract,scan")

async def run_ocr_pipeline_demo():
    image_url = "https://s3.amazonaws.com/thumbnails.venngage.com/template/dc377004-1c2d-49f2-8ddf-d63f11c8d9c2.png"
    results = await ocr_client.run_ocr_pipeline(image_urls=[image_url])
    print("OCR Pipeline Results:", results)

# Run the async function
asyncio.run(run_ocr_pipeline_demo())
```

--------------------------------

### Build Documentation

Source: https://github.com/datafog/datafog-python/blob/dev/CONTRIBUTING.md

Build the HTML documentation using Sphinx for documentation-only changes.

```bash
sphinx-build -b html docs docs/_build/html
```

--------------------------------

### DataFog CLI: Show Configuration

Source: https://context7.com/datafog/datafog-python/llms.txt

Display the current DataFog configuration settings using the `show-config` command.

```bash
# Show current config
datafog show-config
```

--------------------------------

### Setting up temporary directory for connection info

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/01-coverage-baseline.md

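This code creates a temporary directory and reserves a unique file path inside it where the gateway server will write its connection information. The placeholder file is deleted immediately so that the gateway can create it, and the path is passed to the child process through the environment. The directory is removed once the block finishes.

```python
import os
import shutil
import tempfile

# Create a temporary directory where the gateway server should write the connection
# information.
conn_info_dir = tempfile.mkdtemp()
try:
    # Reserve a unique path, then delete the placeholder so the gateway can create it.
    fd, conn_info_file = tempfile.mkstemp(dir=conn_info_dir)
    os.close(fd)
    os.unlink(conn_info_file)

    env = dict(os.environ)
    env["SPARK_CONNECT_MODE"] = "0"
    env["_PYSPARK_DRIVER_CONN_INFO_PATH"] = conn_info_file
    # ... launch the gateway with `env` and read the connection info here ...
finally:
    shutil.rmtree(conn_info_dir)
```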

--------------------------------

### Advanced Entity Detection with GLiNER

Source: https://github.com/datafog/datafog-python/blob/dev/examples/quick_start.ipynb

Use the GLiNER engine for advanced entity detection in complex text. Install with `pip install datafog[nlp-advanced]` if unavailable.

```python
import time
from datafog.text import TextService

complex_text = """
Medical Report - Patient: Emily Rodriguez, DOB: 03/15/1985
Dr. Michael Chen from Stanford Medical Center treated the patient.
Insurance ID: INS-789-456-123, Policy expires December 2024.
Emergency contact: Maria Rodriguez at (408) 555-9876.
Address: 1234 Oak Street, San Francisco, CA 94102
Lab results show glucose level of 120 mg/dL on 2024-01-15.
"""

try:
    # Use GLiNER for advanced entity detection
    gliner_service = TextService(engine="gliner")
    
    print("🧠 GLiNER Advanced Detection Results:")
    print("=" * 50)
    
    results = gliner_service.annotate_text_sync(complex_text)
    
    for entity_type, entities in results.items():
        if entities:  # Only show found entities
            print(f"\n{entity_type}:")
            for entity in entities:
                print(f"  • {entity}")
    
    print(f"\n✅ Total entity types detected: {len([k for k, v in results.items() if v])}")
    
except ImportError:
    print("❌ GLiNER not available. Install with: pip install datafog[nlp-advanced]")
except Exception as e:
    print(f"⚠️  GLiNER error: {e}")
    print("Falling back to regex engine...")
    
    # Fallback to regex
    regex_service = TextService(engine="regex")
    results = regex_service.annotate_text_sync(complex_text)
    print("\n🚀 Regex Detection Results:")
    for entity_type, entities in results.items():
        if entities:
            print(f"  {entity_type}: {entities}")
```

--------------------------------

### Launching the Java gateway process

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/01-coverage-baseline.md

This snippet demonstrates how to launch the Java gateway process using the `Popen` class, setting necessary environment variables and handling process termination. It includes logic for Windows and non-Windows systems.

```python
# Launch the Java gateway.
open_kwargs = {} if open_kwargs is None else open_kwargs

# We open a pipe to stdin so that the Java gateway can die when the pipe is broken
open_kwargs["stdin"] = PIPE

# We always set the necessary environment variables.
open_kwargs["env"] = env

if not on_windows:
    # Don't send ctrl-c / SIGINT to the Java gateway:
    def preexec_func():
        signal.signal(signal.SIGINT, signal.SIG_IGN)

    open_kwargs["preexec_fn"] = preexec_func
    proc = Popen(command, **open_kwargs)
else:
    # preexec_fn not supported on Windows
    proc = Popen(command, **open_kwargs)
```

--------------------------------

### Initialize TextService Engines

Source: https://github.com/datafog/datafog-python/blob/dev/Claude.md

Instantiate the TextService with different engine options: regex for speed, GLiNER and spaCy for ML capabilities, and smart/auto for cascading.

```python
from datafog.services.text_service import TextService

regex_service = TextService(engine="regex")      # 190x faster, structured PII

# ML engines (require extras)
gliner_service = TextService(engine="gliner")    # 32x faster, modern NER
spacy_service = TextService(engine="spacy")      # Comprehensive NLP

# Smart combinations
smart_service = TextService(engine="smart")      # Cascading: regex→GLiNER→spaCy
auto_service = TextService(engine="auto")        # Legacy: regex→spaCy
```

--------------------------------

### DataFog CLI: Health Check

Source: https://context7.com/datafog/datafog-python/llms.txt

Check if the DataFog service is running using the CLI. Requires `pip install datafog[cli]`.

```bash
# Check service is running
datafog health
# DataFog is running.
```

--------------------------------

### Initialize DataFog Client

Source: https://github.com/datafog/datafog-python/blob/dev/examples/text_annotation_example.ipynb

Initialize the DataFog client with the 'scan' operation. This client is used for data scanning tasks.

```python
from datafog import DataFog

client = DataFog(operations="scan")
```

--------------------------------

### DataFog CLI: Scan Text for PII

Source: https://github.com/datafog/datafog-python/blob/dev/docs/index.md

Scans the provided text for Personally Identifiable Information (PII). This is a basic usage example for text analysis.

```bash
datafog scan-text "Your text here"
```

--------------------------------

### Configure and Launch PySpark Java Gateway

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/01-coverage-baseline-term-missing.txt

This snippet shows how to set up environment variables and launch a Java gateway process for PySpark. It includes logic for handling different operating systems and ensuring the gateway process is properly managed.

```python
# Excerpt: `command`, `conf`, `open_kwargs`, and `on_windows` are defined earlier
# in the enclosing function; `PySparkRuntimeError` comes from PySpark.
import os
import shlex
import shutil
import signal
import tempfile
import time
from subprocess import PIPE, Popen

if conf:
    for k, v in conf.getAll():
        command += ["--conf", f"{k}={v}"]

submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
if os.environ.get("SPARK_TESTING"):
    submit_args = " ".join(["--conf spark.ui.enabled=false", submit_args])
command = command + shlex.split(submit_args)

# Create a temporary directory where the gateway server should write the connection
# information.
conn_info_dir = tempfile.mkdtemp()
try:
    fd, conn_info_file = tempfile.mkstemp(dir=conn_info_dir)
    os.close(fd)
    os.unlink(conn_info_file)

    env = dict(os.environ)
    env["SPARK_CONNECT_MODE"] = "0"
    env["_PYSPARK_DRIVER_CONN_INFO_PATH"] = conn_info_file

    # Launch the Java gateway.
    open_kwargs = {} if open_kwargs is None else open_kwargs

    # We open a pipe to stdin so that the Java gateway can die when the pipe is broken
    open_kwargs["stdin"] = PIPE
    # We always set the necessary environment variables.
    open_kwargs["env"] = env

    if not on_windows:
        # Don't send ctrl-c / SIGINT to the Java gateway:
        def preexec_func():
            signal.signal(signal.SIGINT, signal.SIG_IGN)

        open_kwargs["preexec_fn"] = preexec_func
        proc = Popen(command, **open_kwargs)
    else:
        # preexec_fn not supported on Windows
        proc = Popen(command, **open_kwargs)

    # Wait for the file to appear, or for the process to exit, whichever happens first.
    while not proc.poll() and not os.path.isfile(conn_info_file):
        time.sleep(0.1)

    if not os.path.isfile(conn_info_file):
        raise PySparkRuntimeError(
            errorClass="JAVA_GATEWAY_EXITED",
            messageParameters={},
        )
finally:
    # Remove the temporary directory once the connection info is no longer needed.
    shutil.rmtree(conn_info_dir)
```
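The handshake in this snippet (the parent passes a file path through the environment, the child writes connection details to it, and the parent polls until the file appears or the child exits) can be tried in isolation. The sketch below is a self-contained stand-in, not PySpark code: the child is a short Python program, and the `DEMO_CONN_INFO_PATH` variable and the port value are made up for illustration.

```python
import os
import shutil
import subprocess
import sys
import tempfile
import time

# Child program: write a made-up port number to the path given in the environment.
# Writing to a temp name and renaming avoids the parent reading a partial file.
CHILD_CODE = (
    "import os\n"
    "path = os.environ['DEMO_CONN_INFO_PATH']\n"
    "with open(path + '.tmp', 'w') as f:\n"
    "    f.write('25333')\n"
    "os.replace(path + '.tmp', path)\n"
)


def launch_and_read_port(timeout: float = 10.0) -> int:
    conn_info_dir = tempfile.mkdtemp()
    try:
        conn_info_file = os.path.join(conn_info_dir, "conn_info")
        env = dict(os.environ)
        env["DEMO_CONN_INFO_PATH"] = conn_info_file  # hypothetical variable name

        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE], env=env, stdin=subprocess.PIPE
        )
        deadline = time.monotonic() + timeout
        # Wait for the file to appear, or for the child to exit, whichever comes first.
        while proc.poll() is None and not os.path.isfile(conn_info_file):
            if time.monotonic() > deadline:
                proc.kill()
                raise TimeoutError("child did not report connection info")
            time.sleep(0.05)

        proc.stdin.close()
        proc.wait()
        if not os.path.isfile(conn_info_file):
            raise RuntimeError("child exited before writing connection info")
        with open(conn_info_file) as info:
            return int(info.read())
    finally:
        # Always clean up the temporary directory, even on failure.
        shutil.rmtree(conn_info_dir)


print(launch_and_read_port())
```

Unlike the excerpt, this sketch adds a timeout so a child that hangs without writing the file cannot block the parent forever.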

--------------------------------

### Quick Start: Sanitize Text with DataFog

Source: https://github.com/datafog/datafog-python/blob/dev/README.md

Use the sanitize function for a quick way to detect and redact PII in text using the regex engine.

```python
import datafog

text = "Contact john@example.com or call (555) 123-4567"
clean = datafog.sanitize(text, engine="regex")
print(clean)
# Contact [EMAIL_1] or call [PHONE_1]
```
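The numbered placeholders in the output above can be reproduced with a small substitution routine. This is a minimal sketch under stated assumptions: the patterns and the `redact` helper are illustrative, and reusing the same number for a repeated value is a choice made here, not confirmed DataFog behavior.

```python
import re

# Illustrative patterns; they are assumptions, not DataFog's own definitions.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}"),
}


def redact(text: str) -> str:
    """Replace each match with a numbered token such as [EMAIL_1]."""
    for label, pattern in PATTERNS.items():
        seen: dict[str, int] = {}

        def token(match: re.Match) -> str:
            # Reuse the same number when an identical value appears again.
            index = seen.setdefault(match.group(0), len(seen) + 1)
            return f"[{label}_{index}]"

        text = pattern.sub(token, text)
    return text


print(redact("Contact john@example.com or call (555) 123-4567"))
# Contact [EMAIL_1] or call [PHONE_1]
```

Numbering per entity type keeps distinct values distinguishable in the redacted text without revealing them.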

--------------------------------

### Image OCR Extraction with Tesseract

Source: https://context7.com/datafog/datafog-python/llms.txt

Asynchronously extract text from local image files using the ImageService with Tesseract OCR. Ensure `datafog[ocr]` is installed.

```python
import asyncio
from datafog.services.image_service import ImageService

async def extract_from_local():
    service = ImageService(use_tesseract=True)
    texts = await service.ocr_extract(["tests/files/input_files/nokia-statement.png"])
    for text in texts:
        print("Extracted:", text[:80])

asyncio.run(extract_from_local())
```

--------------------------------

### Get or Create Spark Context

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/01-coverage-baseline-term-missing.txt

Retrieves an existing Spark context or creates a new one with the provided configuration. This is a common entry point for PySpark applications.

```python
sc = SparkContext.getOrCreate(spark_conf)
```

--------------------------------

### Create and Activate Virtual Environment

Source: https://github.com/datafog/datafog-python/blob/dev/examples/image_processing.ipynb

Creates a new Python virtual environment named 'venv' and attempts to activate it. Note that direct activation in a script might not work as expected in all environments.

```bash
!python -m venv venv
!source venv/bin/activate
```
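In a notebook, each `!` command runs in its own subshell, so `!source venv/bin/activate` does not change the interpreter the kernel is already using. A quick way to see which environment the running interpreter belongs to is to compare `sys.prefix` with `sys.base_prefix`, as in this small sketch (the `in_virtualenv` helper name is our own):

```python
import sys


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a venv-style environment."""
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

If this reports `False` inside a notebook, the kernel itself needs to be started from the virtual environment.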

--------------------------------

### Get or Create SparkContext

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/01-coverage-baseline.md

This function retrieves an existing SparkContext or creates a new one using the provided SparkConf. It ensures the context is initialized with gateway settings if available.

```python
sc = SparkContext.getOrCreate(sparkConf)

```

--------------------------------

### Path C: TextService(engine="gliner").annotate_text_sync("some text")

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/03-architecture-review.md

This synchronous path initializes TextService with the GLiNER engine. It can fail if the GLiNER import or model loading fails during annotation, raising an ImportError.

```python
TextService(engine="gliner").annotate_text_sync("some text")
```

--------------------------------

### Discord Release Announcement Template

Source: https://github.com/datafog/datafog-python/blob/dev/templates/social_media_templates.md

A template for Discord announcements, featuring bold text for emphasis and a separate code block for installation commands. Encourages feedback in a specific channel.

````text
🚀 **DataFog {{version}} is live!**

**This week's highlights:**
{{highlights}}

**Performance:** {{speed_stat}}
**Package size:** {{size_stat}}

**Install now:**
```bash
pip install datafog=={{version}}
```

Drop your feedback in #general! 🙏
````

--------------------------------

### Twitter Release Announcement Template

Source: https://github.com/datafog/datafog-python/blob/dev/templates/social_media_templates.md

Use this template for concise Twitter updates about DataFog releases. It includes placeholders for version, key features, performance stats, and installation instructions.

```text
🚀 DataFog {{version}} is out!

{{key_feature}}

⚡ {{speed_stat}}
📦 {{size_stat}}
🔧 pip install datafog=={{version}}

#PII #DataProtection #Privacy #Python #OpenSource

{{github_link}}
```

--------------------------------

### Path D: TextService(engine="smart").annotate_text_sync("some text")

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/03-architecture-review.md

This synchronous path uses the 'smart' engine, which cascades through regex, GLiNER, and spaCy annotators. It silently degrades if ML dependencies are missing and can be short-circuited by regex false positives.

```python
TextService(engine="smart").annotate_text_sync("some text")
```
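The two failure modes described here, silent degradation and short-circuiting, are easier to see in a toy cascade. The sketch below is not DataFog's implementation: the `cascade` function and the three stage functions are hypothetical stand-ins that only mimic the described behavior.

```python
from typing import Callable

Annotator = Callable[[str], dict[str, list[str]]]


def cascade(text: str, annotators: list[tuple[str, Annotator]]) -> tuple[str, dict[str, list[str]]]:
    """Run annotators in order and return the first non-empty result."""
    for name, annotate in annotators:
        try:
            found = annotate(text)
        except ImportError:
            # Missing optional dependency: skip this stage without raising.
            continue
        if any(found.values()):
            return name, found
    return "none", {}


# Hypothetical stand-ins for the three stages.
def regex_stage(text: str) -> dict[str, list[str]]:
    return {"EMAIL": [word for word in text.split() if "@" in word]}


def gliner_stage(text: str) -> dict[str, list[str]]:
    raise ImportError("gliner is not installed")


def spacy_stage(text: str) -> dict[str, list[str]]:
    return {"PERSON": ["Emily Rodriguez"]} if "Emily Rodriguez" in text else {}


stages = [("regex", regex_stage), ("gliner", gliner_stage), ("spacy", spacy_stage)]
print(cascade("Patient: Emily Rodriguez", stages))
print(cascade("Emily Rodriguez <emily@example.com>", stages))
```

In the second call the regex stage finds something and the cascade stops there, so the person name is never reported. A loose regex match would hide later stages in the same way.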

--------------------------------

### Quick PII Presence Check with scan_text

Source: https://context7.com/datafog/datafog-python/llms.txt

Use `scan_text` for a fast boolean check if text contains any PII. Set `return_entities=True` to get a dictionary of detected entities instead of a boolean.

```python
from datafog import scan_text

# Boolean check
print(scan_text("Hello world"))           # False
print(scan_text("Email: a@b.com"))        # True
```

```python
# Get the detected entities as a dict
entities = scan_text("SSN 123-45-6789", return_entities=True)
print(entities)
```

```python
# Use in a conditional pipeline
def safe_log(msg: str) -> None:
    if scan_text(msg):
        msg = "[PII DETECTED - REDACTED]"
    print(f"LOG: {msg}")

safe_log("Server started")          # LOG: Server started
safe_log("Login: user@test.com")    # LOG: [PII DETECTED - REDACTED]
```
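The same guard can be attached to Python's standard `logging` module so that every record passes through it. This is a sketch under stated assumptions: `contains_pii` is a regex stand-in used so the example runs without DataFog, and you would swap in the boolean check of your choice.

```python
import io
import logging
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # stand-in detector


def contains_pii(text: str) -> bool:
    return EMAIL_RE.search(text) is not None


class RedactingFilter(logging.Filter):
    """Replace the whole message when the formatted text contains PII."""

    def filter(self, record: logging.LogRecord) -> bool:
        if contains_pii(record.getMessage()):
            record.msg = "[PII DETECTED - REDACTED]"
            record.args = ()
        return True  # Keep the record; only its content changes.


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RedactingFilter())

logger = logging.getLogger("pii_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("Server started")
logger.info("Login: %s", "user@test.com")
print(stream.getvalue(), end="")
```

Checking `record.getMessage()` rather than `record.msg` matters because PII often arrives through the format arguments.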

--------------------------------

### Entity-Type Filtering with Masking

Source: https://context7.com/datafog/datafog-python/llms.txt

Create a guardrail to specifically filter or mask certain entity types like EMAIL. This example demonstrates masking email addresses while leaving phone numbers untouched.

```python
import datafog

email_guard = datafog.create_guardrail(
    entity_types=["EMAIL"],
    engine="regex",
    strategy="mask",
    on_detect="redact",
)
result = email_guard.filter("Phone: (555) 000-1234, Email: x@y.com")
print(result.redacted_text)
```

--------------------------------

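### Constructing the Spark Gateway Launch Command

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/01-coverage-baseline.md

This snippet builds the command line used to launch the PySpark gateway script. It appends a `--conf key=value` pair for each configuration entry, then adds the arguments from `PYSPARK_SUBMIT_ARGS` (defaulting to `pyspark-shell`). When `SPARK_TESTING` is set, the Spark UI is disabled. The excerpt comes from the repository's coverage-baseline audit file.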

```python
command = [os.path.join(SPARK_HOME, script)]
if conf:
    for k, v in conf.getAll():
        command += ["--conf", "%s=%s" % (k, v)]

submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")

if os.environ.get("SPARK_TESTING"):
    submit_args = " ".join(["--conf spark.ui.enabled=false", submit_args])

command = command + shlex.split(submit_args)
```
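The role of `shlex.split` is easiest to see when the submit arguments contain quoted values. The sketch below wraps the same logic in a hypothetical `build_command` helper that takes its inputs explicitly, so it can be run without Spark installed. The helper name, its parameters, and the sample values are assumptions made for illustration.

```python
import shlex


def build_command(script: str, conf_pairs: list[tuple[str, str]], environ: dict[str, str]) -> list[str]:
    """Assemble an argv list from config pairs and a submit-args string."""
    command = [script]
    for key, value in conf_pairs:
        command += ["--conf", f"{key}={value}"]

    submit_args = environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
    if environ.get("SPARK_TESTING"):
        submit_args = " ".join(["--conf spark.ui.enabled=false", submit_args])
    # shlex.split honors shell-style quoting, so quoted values stay together.
    return command + shlex.split(submit_args)


argv = build_command(
    "spark-submit",
    [("spark.app.name", "demo")],
    {"PYSPARK_SUBMIT_ARGS": '--conf "spark.driver.extraJavaOptions=-Da=1 -Db=2" pyspark-shell'},
)
print(argv)
```

A plain `str.split` would break the quoted Java options into two arguments, which is why the original code uses `shlex.split`.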

--------------------------------

### Deprecation Warning for Class-Based Config

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/01-coverage-baseline.md

This warning indicates that support for class-based `config` is deprecated in Pydantic and will be removed in V3.0. Use `ConfigDict` instead. The warning includes a link to the migration guide.

```text
PydanticDeprecatedSince20: Support for class-based `config` is deprecated, use ConfigDict instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
```

--------------------------------

### Run Core Test Suite

Source: https://github.com/datafog/datafog-python/blob/dev/CONTRIBUTING.md

Execute the main test suite, excluding slow tests and specific integration tests, before submitting a pull request.

```bash
pytest tests/ -m "not slow" \
  --ignore=tests/test_gliner_annotator.py \
  --ignore=tests/test_image_service.py \
  --ignore=tests/test_ocr_integration.py \
  --ignore=tests/test_spark_integration.py \
  --ignore=tests/test_text_service_integration.py
```

--------------------------------

### Manage GLiNER Models via CLI

Source: https://github.com/datafog/datafog-python/blob/dev/Claude.md

Use the DataFog CLI to download a GLiNER model and to list the models available for the GLiNER engine.

```bash
# CLI model management
datafog download-model urchade/gliner_base --engine gliner
datafog list-models --engine gliner
```

--------------------------------

### Initialize DataFog Client and Apply Asyncio Patch

Source: https://github.com/datafog/datafog-python/blob/dev/examples/image_processing.ipynb

Initializes the DataFog client for 'extract' operations and applies the nest_asyncio patch to allow running asyncio code within environments that might already have a running event loop.

```python
import asyncio
import nest_asyncio
nest_asyncio.apply()
from datafog import DataFog
client = DataFog(operations="extract")
```
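The reason for the patch is that `asyncio.run` refuses to start when an event loop is already running, which is the normal situation inside a notebook cell. The following standard-library sketch reproduces that condition by calling `asyncio.run` from inside a coroutine; the `inner` and `outer` names are our own.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    coro = inner()
    try:
        # Mimics a notebook cell: an event loop is already running here.
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # Avoid a "coroutine was never awaited" warning.
        return f"RuntimeError: {exc}"


result = asyncio.run(outer())
print(result)
```

With `nest_asyncio.apply()` in effect, the nested call is allowed instead of raising, which is what lets the OCR examples call `asyncio.run` from a notebook.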