### DataFog Development Setup

Source: https://github.com/datafog/datafog-python/blob/dev/README.md

Clone the repository, set up a virtual environment, install dependencies, and run tests for development.

```bash
git clone https://github.com/datafog/datafog-python
cd datafog-python
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -e ".[all,dev]"
pip install -r requirements-dev.txt
pytest tests/
```

--------------------------------

### Install DataFog Core and Optional Extras

Source: https://github.com/datafog/datafog-python/blob/dev/docs/roadmap.md

Install the lightweight core package, or include optional extras for NLP or OCR capabilities. The `[all]` extra installs the full functionality.

```bash
pip install datafog
```

```bash
pip install datafog[nlp]
```

```bash
pip install datafog[ocr]
```

```bash
pip install datafog[all]
```

--------------------------------

### Install DataFog with Dev Dependencies

Source: https://github.com/datafog/datafog-python/blob/dev/Claude.md

Clone the repository, set up a virtual environment, and install the package with development dependencies. The last three commands install the optional ML extras.

```bash
git clone https://github.com/datafog/datafog-python.git
cd datafog-python
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]" && pip install -r requirements-dev.txt
pre-commit install
pip install -e ".[nlp]"
pip install -e ".[nlp-advanced]"
pip install -e ".[all]"
```

--------------------------------

### Install DataFog Python Packages

Source: https://github.com/datafog/datafog-python/blob/dev/README.md

Install the core DataFog library, optionally with NLP or advanced NLP support. Use `all` for a complete installation.
```bash
# Core install (regex engine)
pip install datafog
```

```bash
# Add spaCy support
pip install datafog[nlp]
```

```bash
# Add GLiNER + spaCy support
pip install datafog[nlp-advanced]
```

```bash
# Everything
pip install datafog[all]
```

--------------------------------

### Install DataFog Python Library

Source: https://context7.com/datafog/datafog-python/llms.txt

Install the DataFog library with different levels of support. Core includes the regex engine. `[nlp]` adds spaCy NER. `[nlp-advanced]` adds GLiNER ML-based NER. `[ocr]` adds OCR image processing. `[all]` installs everything.

```bash
pip install datafog
```

```bash
pip install datafog[nlp]
```

```bash
pip install datafog[nlp-advanced]
```

```bash
pip install datafog[ocr]
```

```bash
pip install datafog[all]
```

--------------------------------

### Clone and Set Up Local Development Environment

Source: https://github.com/datafog/datafog-python/blob/dev/CONTRIBUTING.md

Clone the repository, create a virtual environment, and install the project with development and CLI dependencies.

```bash
git clone https://github.com/datafog/datafog-python
cd datafog-python
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
python -m pip install --upgrade pip
pip install -e ".[dev,cli]"
```

--------------------------------

### DataFog Performance Showcase Setup

Source: https://github.com/datafog/datafog-python/blob/dev/examples/quick_start.ipynb

Prepare for a performance benchmark by defining a large, realistic document. This setup is crucial for evaluating DataFog's speed and efficiency on substantial text inputs.

```python
# Realistic business document (similar to what you'd process in production)
large_document = """
CONFIDENTIAL EMPLOYEE REPORT - Q1 2024

=== EXECUTIVE SUMMARY ===
Report generated by: Sarah Johnson (sarah.johnson@company.com)
Date: March 15, 2024
Department: Human Resources
Contact: (555) 100-HR00 ext. 1234

=== EMPLOYEE RECORDS ===
1. John Smith (ID: EMP-001)
   Email: john.smith@company.com
   Phone: (555) 123-4567
   SSN: 123-45-6789
   Address: 123 Oak Street, San Francisco, CA 94102
   Manager: David Chen (david.chen@company.com)
   Salary: $85,000 annually
   Start Date: January 15, 2020

2. Maria Rodriguez (ID: EMP-002)
   Email: maria.rodriguez@company.com
   Phone: (555) 987-6543
   SSN: 987-65-4321
   Address: 456 Pine Ave, Los Angeles, CA 90210
   Manager: Lisa Wang (lisa.wang@company.com)
   Emergency Contact: Carlos Rodriguez (555) 111-2233

3. Michael Johnson (ID: EMP-003)
   Email: michael.j@company.com
   Personal Email: mike.personal@gmail.com
   Phone: (555) 456-7890
   SSN: 456-78-9012
   Credit Card on file: 4532-1234-5678-9012 (expires 12/26)

=== PAYROLL INFORMATION ===
Bank routing: 123456789
Direct deposit accounts verified on 2024-03-01
Tax ID: 12-3456789

=== CONTACT INFORMATION ===
HR Helpline: (555) 888-4HR7
Benefits questions: benefits@company.com
IT Support: support@company.com
Office address: 789 Corporate Blvd, Suite 100, Business City, NY 10001

This document contains sensitive employee information and should be handled
according to company privacy policies and applicable laws including GDPR,
CCPA, and HIPAA where applicable.

Report ID: RPT-2024-Q1-001
Classification: CONFIDENTIAL
Retention: 7 years from creation date
"""

print("🚀 Performance Benchmark\n")
print(f"📄 Document size: {len(large_document):,} characters")
print(f"Lines of text: {len(large_document.splitlines())}")
print("=" * 60)
```

--------------------------------

### Install Optional NLP and OCR Dependencies

Source: https://github.com/datafog/datafog-python/blob/dev/CONTRIBUTING.md

Install additional dependency profiles for NLP or advanced NLP/OCR functionalities.
```bash
pip install -e ".[dev,cli,nlp]"
```

```bash
pip install -e ".[dev,cli,nlp,nlp-advanced]"
```

```bash
pip install -e ".[all,dev]"
```

--------------------------------

### Install DataFog with Optional ML Engines

Source: https://github.com/datafog/datafog-python/blob/dev/Claude.md

Install the core DataFog library and the optional extras for specific ML engines such as spaCy and GLiNER.

```bash
# Lightweight core (<2MB)
pip install datafog

# Optional ML engines
pip install datafog[nlp]           # spaCy (traditional NLP)
pip install datafog[nlp-advanced]  # GLiNER (modern NER)
pip install datafog[ocr]           # Image processing
pip install datafog[all]           # Everything
```

--------------------------------

### Install Pinned Local Tooling

Source: https://github.com/datafog/datafog-python/blob/dev/CONTRIBUTING.md

After installing the editable package, install the pinned development requirements for consistent local tooling.

```bash
pip install -r requirements-dev.txt
```

--------------------------------

### Install Tesseract OCR and Dependencies

Source: https://github.com/datafog/datafog-python/blob/dev/examples/image_processing.ipynb

Installs Tesseract OCR, its development libraries, and the nest_asyncio package from notebook cells. This is often a prerequisite for OCR tasks.

```bash
! apt install tesseract-ocr
! apt install libtesseract-dev
! pip install nest_asyncio
```

--------------------------------

### GitHub Release Description Template

Source: https://github.com/datafog/datafog-python/blob/dev/templates/social_media_templates.md

A template for GitHub release descriptions. It includes sections for 'What's New', 'Quick Start' with lightweight and full installation options, performance metrics, resources, and community engagement prompts.
````text
## What's New in {{version}}

{{changelog_content}}

## 🚀 Quick Start

### Lightweight Core (Recommended)

```bash
pip install datafog=={{version}}
```

### Full Features

```bash
pip install datafog[all]=={{version}}
```

## 📊 Performance Metrics

- **Processing Speed:** 190x faster than spaCy
- **Package Size:** ~2MB (core), ~8MB (full)
- **Install Time:** <15 seconds
- **Python Support:** 3.10, 3.11, 3.12

## 🔗 Resources

- [Documentation](https://docs.datafog.ai)
- [Quick Start Guide](https://github.com/datafog/datafog-python#quick-start)
- [Discord Community](https://discord.gg/bzDth394R4)

## 🙏 Community

Thanks to all contributors and users providing feedback!

⭐ Star us on GitHub if DataFog is helpful for your projects.
🐛 Report issues or request features in our [issue tracker](https://github.com/datafog/datafog-python/issues).

---

**Weekly releases every Friday** • Next release: {{next_friday}}
````

--------------------------------

### Install DataFog Library

Source: https://github.com/datafog/datafog-python/blob/dev/examples/image_processing.ipynb

Installs or upgrades the datafog library quietly using pip.

```bash
!pip install --upgrade datafog --quiet
```

--------------------------------

### Install DataFog via Pip

Source: https://github.com/datafog/datafog-python/blob/dev/docs/index.md

Installs the latest stable version of DataFog with CLI support. Use this command to set up the tool on your system.

```bash
pip install datafog
```

--------------------------------

### DataFog CLI: Get Help

Source: https://github.com/datafog/datafog-python/blob/dev/docs/index.md

Displays a list of all available operations and commands for the DataFog CLI. Useful for understanding the tool's capabilities.

```bash
datafog --help
```

--------------------------------

### Install DataFog Python Package

Source: https://github.com/datafog/datafog-python/blob/dev/templates/release_announcement.md

Use pip to install the DataFog Python package.
Choose the lightweight core or the full features version.

```bash
# Lightweight core
pip install datafog=={{version}}
```

```bash
# Full features
pip install datafog[all]=={{version}}
```

--------------------------------

### Simple PII Detection with DataFog

Source: https://github.com/datafog/datafog-python/blob/dev/examples/quick_start.ipynb

Quickly detect Personally Identifiable Information (PII) in a given text using DataFog's default regex engine. This is the fastest way to get started with PII scanning.

```python
from datafog import DataFog

# Create a DataFog instance
detector = DataFog()

# Sample text with various PII types
sample_text = """
Hi there! I'm Dr. Sarah Johnson, and you can reach me at sarah.johnson@hospital.com
or call my office at (555) 123-4567. My SSN is 123-45-6789 for verification.
I work at General Hospital located at 123 Main St, New York, NY 10001.
My credit card ending in 4111-1111-1111-1111 expires on 12/25.
"""

# Detect PII - this uses the fast regex engine by default
results = detector.scan_text(sample_text)

print("🔍 PII Detection Results:")
print(f"Found {len(results)} pieces of PII:")
for entity_type, entities in results.items():
    if entities:  # Only show types that were found
        print(f"  {entity_type}: {entities}")
```

--------------------------------

### Spark Service Initialization in DataFog

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/01-coverage-baseline-term-missing.txt

This code demonstrates the initialization of the Spark service within the DataFog project. It's part of the internal setup for creating Spark sessions.

```python
self.spark = self.create_spark_session()
```

--------------------------------

### Install DataFog with Advanced NLP Features

Source: https://github.com/datafog/datafog-python/blob/dev/examples/quick_start.ipynb

Install DataFog along with advanced Natural Language Processing features, including GLiNER and spaCy support.
This command is useful for enabling all ML-based detection capabilities.

```python
# Install DataFog with advanced ML features
!pip install datafog[nlp-advanced] --quiet

print("✅ DataFog installed successfully!")
```

--------------------------------

### Initialize Spark Context

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/01-coverage-baseline-term-missing.txt

Ensures the Spark context is initialized, optionally with a gateway. This method handles the setup of the Spark context, including determining the gateway configuration.

```python
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
```

--------------------------------

### DataFog Python SDK: Scan Text for PII

Source: https://github.com/datafog/datafog-python/blob/dev/docs/index.md

Uses the DataFog Python SDK to scan text for PII. This example demonstrates initializing the client, fetching text data, and running the pipeline synchronously.

```python
import requests

from datafog import DataFog

# For text annotation
client = DataFog(operations="scan")

# Fetch sample medical record
doc_url = "https://gist.githubusercontent.com/sidmohan0/b43b72693226422bac5f083c941ecfdb/raw/b819affb51796204d59987893f89dee18428ed5d/note1.txt"
response = requests.get(doc_url)
text_lines = [line for line in response.text.splitlines() if line.strip()]

# Run annotation
annotations = client.run_text_pipeline_sync(str_list=text_lines)
print(annotations)
```

--------------------------------

### DataFog Python SDK: Scan Image for PII with OCR

Source: https://github.com/datafog/datafog-python/blob/dev/docs/index.md

Uses the DataFog Python SDK to perform OCR and scan images for PII. This example shows how to initialize the client for extraction and scanning, and run the OCR pipeline asynchronously.
```python
import asyncio

from datafog import DataFog

# For OCR and PII annotation
ocr_client = DataFog(operations="extract,scan")


async def run_ocr_pipeline_demo():
    image_url = "https://s3.amazonaws.com/thumbnails.venngage.com/template/dc377004-1c2d-49f2-8ddf-d63f11c8d9c2.png"
    results = await ocr_client.run_ocr_pipeline(image_urls=[image_url])
    print("OCR Pipeline Results:", results)


# Run the async function
asyncio.run(run_ocr_pipeline_demo())
```

--------------------------------

### Build Documentation

Source: https://github.com/datafog/datafog-python/blob/dev/CONTRIBUTING.md

Build the HTML documentation using Sphinx for documentation-only changes.

```bash
sphinx-build -b html docs docs/_build/html
```

--------------------------------

### DataFog CLI: Show Configuration

Source: https://context7.com/datafog/datafog-python/llms.txt

Display the current DataFog configuration settings using the `show-config` command.

```bash
# Show current config
datafog show-config
```

--------------------------------

### Setting up temporary directory for connection info

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/01-coverage-baseline.md

This PySpark excerpt creates a temporary directory and reserves a unique file path inside it for the gateway server's connection information. The file is created and immediately unlinked, so only the path survives; it is passed to the gateway through the `_PYSPARK_DRIVER_CONN_INFO_PATH` environment variable.

```python
# Create a temporary directory where the gateway server should write the connection
# information.
conn_info_dir = tempfile.mkdtemp()
try:
    fd, conn_info_file = tempfile.mkstemp(dir=conn_info_dir)
    os.close(fd)
    os.unlink(conn_info_file)

    env = dict(os.environ)
    env["SPARK_CONNECT_MODE"] = "0"
    env["_PYSPARK_DRIVER_CONN_INFO_PATH"] = conn_info_file
    # Excerpt ends here; the enclosing try block continues in the original source.
```

--------------------------------

### Advanced Entity Detection with GLiNER

Source: https://github.com/datafog/datafog-python/blob/dev/examples/quick_start.ipynb

Use the GLiNER engine for advanced entity detection in complex text. Install with `pip install datafog[nlp-advanced]` if unavailable.
```python
import time

from datafog.text import TextService

complex_text = """
Medical Report - Patient: Emily Rodriguez, DOB: 03/15/1985
Dr. Michael Chen from Stanford Medical Center treated the patient.
Insurance ID: INS-789-456-123, Policy expires December 2024.
Emergency contact: Maria Rodriguez at (408) 555-9876.
Address: 1234 Oak Street, San Francisco, CA 94102
Lab results show glucose level of 120 mg/dL on 2024-01-15.
"""

try:
    # Use GLiNER for advanced entity detection
    gliner_service = TextService(engine="gliner")

    print("🧠 GLiNER Advanced Detection Results:")
    print("=" * 50)

    results = gliner_service.annotate_text_sync(complex_text)

    for entity_type, entities in results.items():
        if entities:  # Only show found entities
            print(f"\n{entity_type}:")
            for entity in entities:
                print(f"  • {entity}")

    print(f"\n✅ Total entity types detected: {len([k for k, v in results.items() if v])}")

except ImportError:
    print("❌ GLiNER not available. Install with: pip install datafog[nlp-advanced]")
except Exception as e:
    print(f"⚠️ GLiNER error: {e}")
    print("Falling back to regex engine...")

    # Fallback to regex
    regex_service = TextService(engine="regex")
    results = regex_service.annotate_text_sync(complex_text)

    print("\n🚀 Regex Detection Results:")
    for entity_type, entities in results.items():
        if entities:
            print(f"  {entity_type}: {entities}")
```

--------------------------------

### Launching the Java gateway process

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/01-coverage-baseline.md

This snippet demonstrates how to launch the Java gateway process using the `Popen` class, setting necessary environment variables and handling process termination. It includes logic for Windows and non-Windows systems.

```python
# Launch the Java gateway.
open_kwargs = {} if open_kwargs is None else open_kwargs
# We open a pipe to stdin so that the Java gateway can die when the pipe is broken
open_kwargs["stdin"] = PIPE
# We always set the necessary environment variables.
open_kwargs["env"] = env
if not on_windows:
    # Don't send ctrl-c / SIGINT to the Java gateway:
    def preexec_func():
        signal.signal(signal.SIGINT, signal.SIG_IGN)

    open_kwargs["preexec_fn"] = preexec_func
    proc = Popen(command, **open_kwargs)
else:
    # preexec_fn not supported on Windows
    proc = Popen(command, **open_kwargs)
```

--------------------------------

### Initialize TextService Engines

Source: https://github.com/datafog/datafog-python/blob/dev/Claude.md

Instantiate the TextService with different engine options: regex for speed, GLiNER and spaCy for ML capabilities, and smart/auto for cascading.

```python
from datafog.services.text_service import TextService

regex_service = TextService(engine="regex")    # 190x faster, structured PII

# ML engines (require extras)
gliner_service = TextService(engine="gliner")  # 32x faster, modern NER
spacy_service = TextService(engine="spacy")    # Comprehensive NLP

# Smart combinations
smart_service = TextService(engine="smart")    # Cascading: regex→GLiNER→spaCy
auto_service = TextService(engine="auto")      # Legacy: regex→spaCy
```

--------------------------------

### DataFog CLI: Health Check

Source: https://context7.com/datafog/datafog-python/llms.txt

Check if the DataFog service is running using the CLI. Requires `pip install datafog[cli]`.

```bash
# Check service is running
datafog health
# DataFog is running.
```

--------------------------------

### Initialize DataFog Client

Source: https://github.com/datafog/datafog-python/blob/dev/examples/text_annotation_example.ipynb

Initialize the DataFog client with the 'scan' operation. This client is used for data scanning tasks.

```python
from datafog import DataFog

client = DataFog(operations="scan")
```

--------------------------------

### DataFog CLI: Scan Text for PII

Source: https://github.com/datafog/datafog-python/blob/dev/docs/index.md

Scans the provided text for Personally Identifiable Information (PII). This is a basic usage example for text analysis.
```bash
datafog scan-text "Your text here"
```

--------------------------------

### Configure and Launch PySpark Java Gateway

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/01-coverage-baseline-term-missing.txt

This snippet shows how to set up environment variables and launch a Java gateway process for PySpark. It includes logic for handling different operating systems and ensuring the gateway process is properly managed.

```python
if conf:
    # Forward every SparkConf entry as a --conf key=value flag
    for k, v in conf.getAll():
        command += ["--conf", f"{k}={v}"]
submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
if os.environ.get("SPARK_TESTING"):
    submit_args = " ".join(["--conf spark.ui.enabled=false", submit_args])
command = command + shlex.split(submit_args)

# Create a temporary directory where the gateway server should write the connection
# information.
conn_info_dir = tempfile.mkdtemp()
try:
    fd, conn_info_file = tempfile.mkstemp(dir=conn_info_dir)
    os.close(fd)
    os.unlink(conn_info_file)

    env = dict(os.environ)
    env["SPARK_CONNECT_MODE"] = "0"
    env["_PYSPARK_DRIVER_CONN_INFO_PATH"] = conn_info_file

    # Launch the Java gateway.
    open_kwargs = {} if open_kwargs is None else open_kwargs
    # We open a pipe to stdin so that the Java gateway can die when the pipe is broken
    open_kwargs["stdin"] = PIPE
    # We always set the necessary environment variables.
    open_kwargs["env"] = env
    if not on_windows:
        # Don't send ctrl-c / SIGINT to the Java gateway:
        def preexec_func():
            signal.signal(signal.SIGINT, signal.SIG_IGN)

        open_kwargs["preexec_fn"] = preexec_func
        proc = Popen(command, **open_kwargs)
    else:
        # preexec_fn not supported on Windows
        proc = Popen(command, **open_kwargs)

    # Wait for the file to appear, or for the process to exit, whichever happens first.
    while not proc.poll() and not os.path.isfile(conn_info_file):
        time.sleep(0.1)

    if not os.path.isfile(conn_info_file):
        raise PySparkRuntimeError(
            errorClass="JAVA_GATEWAY_EXITED", messageParameters={}
        )
    # Excerpt ends here; the enclosing try block continues in the original source.
```

--------------------------------

### Quick Start: Sanitize Text with DataFog

Source: https://github.com/datafog/datafog-python/blob/dev/README.md

Use the sanitize function for a quick way to detect and redact PII in text using the regex engine.

```python
import datafog

text = "Contact john@example.com or call (555) 123-4567"
clean = datafog.sanitize(text, engine="regex")
print(clean)
# Contact [EMAIL_1] or call [PHONE_1]
```

--------------------------------

### Image OCR Extraction with Tesseract

Source: https://context7.com/datafog/datafog-python/llms.txt

Asynchronously extract text from local image files using the ImageService with Tesseract OCR. Ensure `datafog[ocr]` is installed.

```python
import asyncio

from datafog.services.image_service import ImageService


async def extract_from_local():
    service = ImageService(use_tesseract=True)
    texts = await service.ocr_extract(["tests/files/input_files/nokia-statement.png"])
    for text in texts:
        print("Extracted:", text[:80])


asyncio.run(extract_from_local())
```

--------------------------------

### Get or Create Spark Context

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/01-coverage-baseline-term-missing.txt

Retrieves an existing Spark context or creates a new one with the provided configuration. This is a common entry point for PySpark applications.

```python
sc = SparkContext.getOrCreate(spark_conf)
```

--------------------------------

### Create and Activate Virtual Environment

Source: https://github.com/datafog/datafog-python/blob/dev/examples/image_processing.ipynb

Creates a new Python virtual environment named 'venv' and attempts to activate it. In a notebook, each `!` command runs in its own subshell, so the activation does not persist into later cells.
```bash
!python -m venv venv
!source venv/bin/activate
```

--------------------------------

### Get or Create SparkContext

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/01-coverage-baseline.md

This call retrieves an existing SparkContext or creates a new one using the provided SparkConf. It ensures the context is initialized with gateway settings if available.

```python
sc = SparkContext.getOrCreate(sparkConf)
```

--------------------------------

### Path C: TextService(engine="gliner").annotate_text_sync("some text")

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/03-architecture-review.md

This synchronous path initializes TextService with the GLiNER engine. It can fail if the GLiNER import or model loading fails during annotation, raising an ImportError.

```python
TextService(engine="gliner").annotate_text_sync("some text")
```

--------------------------------

### Discord Release Announcement Template

Source: https://github.com/datafog/datafog-python/blob/dev/templates/social_media_templates.md

A template for Discord announcements, featuring bold text for emphasis and a separate code block for installation commands. Encourages feedback in a specific channel.

````text
🚀 **DataFog {{version}} is live!**

**This week's highlights:**
{{highlights}}

**Performance:** {{speed_stat}}
**Package size:** {{size_stat}}

**Install now:**
```bash
pip install datafog=={{version}}
```

Drop your feedback in #general! 🙏
````

--------------------------------

### Twitter Release Announcement Template

Source: https://github.com/datafog/datafog-python/blob/dev/templates/social_media_templates.md

Use this template for concise Twitter updates about DataFog releases. It includes placeholders for version, key features, performance stats, and installation instructions.

```text
🚀 DataFog {{version}} is out!

{{key_feature}}

⚡ {{speed_stat}}
📦 {{size_stat}}

🔧 pip install datafog=={{version}}

#PII #DataProtection #Privacy #Python #OpenSource

{{github_link}}
```

--------------------------------

### Path D: TextService(engine="smart").annotate_text_sync("some text")

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/03-architecture-review.md

This synchronous path uses the 'smart' engine, which cascades through regex, GLiNER, and spaCy annotators. It silently degrades if ML dependencies are missing and can be short-circuited by regex false positives.

```python
TextService(engine="smart").annotate_text_sync("some text")
```

--------------------------------

### Quick PII Presence Check with scan_text

Source: https://context7.com/datafog/datafog-python/llms.txt

Use `scan_text` for a fast boolean check if text contains any PII. Set `return_entities=True` to get a dictionary of detected entities instead of a boolean.

```python
from datafog import scan_text

# Boolean check
print(scan_text("Hello world"))     # False
print(scan_text("Email: a@b.com"))  # True
```

```python
# Get the detected entities as a dict
entities = scan_text("SSN 123-45-6789", return_entities=True)
print(entities)
```

```python
# Use in a conditional pipeline
def safe_log(msg: str) -> None:
    if scan_text(msg):
        msg = "[PII DETECTED - REDACTED]"
    print(f"LOG: {msg}")


safe_log("Server started")        # LOG: Server started
safe_log("Login: user@test.com")  # LOG: [PII DETECTED - REDACTED]
```

--------------------------------

### Entity-Type Filtering with Masking

Source: https://context7.com/datafog/datafog-python/llms.txt

Create a guardrail to specifically filter or mask certain entity types like EMAIL. This example demonstrates masking email addresses while leaving phone numbers untouched.
```python
import datafog

email_guard = datafog.create_guardrail(
    entity_types=["EMAIL"],
    engine="regex",
    strategy="mask",
    on_detect="redact",
)

result = email_guard.filter("Phone: (555) 000-1234, Email: x@y.com")
print(result.redacted_text)
```

--------------------------------

### Constructing the PySpark Gateway Launch Command

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/01-coverage-baseline.md

This PySpark excerpt, captured in the coverage baseline output, builds the command that launches the Java gateway. It starts from the script under `SPARK_HOME`, adds one `--conf` flag per SparkConf entry, and appends the arguments from the `PYSPARK_SUBMIT_ARGS` environment variable.

```python
command = [os.path.join(SPARK_HOME, script)]
if conf:
    for k, v in conf.getAll():
        command += ["--conf", "%s=%s" % (k, v)]
submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
if os.environ.get("SPARK_TESTING"):
    submit_args = " ".join(["--conf spark.ui.enabled=false", submit_args])
command = command + shlex.split(submit_args)
```

--------------------------------

### Deprecation Warning for Class-Based Config

Source: https://github.com/datafog/datafog-python/blob/dev/docs/audit/01-coverage-baseline.md

This warning indicates that support for class-based `config` is deprecated in Pydantic and will be removed in a future version. Users are advised to use `ConfigDict` instead. The warning includes a link to the migration guide.

```text
PydanticDeprecatedSince20: Support for class-based `config` is deprecated, use ConfigDict instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.12/migration/
```

--------------------------------

### Run Core Test Suite

Source: https://github.com/datafog/datafog-python/blob/dev/CONTRIBUTING.md

Execute the main test suite, excluding slow tests and specific integration tests, before submitting a pull request.
```bash
pytest tests/ -m "not slow" \
  --ignore=tests/test_gliner_annotator.py \
  --ignore=tests/test_image_service.py \
  --ignore=tests/test_ocr_integration.py \
  --ignore=tests/test_spark_integration.py \
  --ignore=tests/test_text_service_integration.py
```

--------------------------------

### Manage GLiNER Models via CLI

Source: https://github.com/datafog/datafog-python/blob/dev/Claude.md

Invoke the DataFog CLI from Python with `subprocess.run` to download and list the GLiNER models available to the library.

```python
import subprocess

# CLI model management
subprocess.run(["datafog", "download-model", "urchade/gliner_base", "--engine", "gliner"])
subprocess.run(["datafog", "list-models", "--engine", "gliner"])
```

--------------------------------

### Initialize DataFog Client and Apply Asyncio Patch

Source: https://github.com/datafog/datafog-python/blob/dev/examples/image_processing.ipynb

Initializes the DataFog client for 'extract' operations and applies the nest_asyncio patch so that asyncio code can run in environments that already have a running event loop, such as notebooks.

```python
import asyncio

import nest_asyncio

nest_asyncio.apply()

from datafog import DataFog

client = DataFog(operations="extract")
```
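
The `nest_asyncio.apply()` call in the last snippet matters because notebook kernels already run an event loop, and `asyncio.run()` refuses to start a second one inside it. The sketch below uses only the standard library to reproduce that failure. The names `fetch_value` and `notebook_like_cell` are hypothetical, chosen for illustration, and are not part of DataFog or nest_asyncio.

```python
import asyncio


async def fetch_value() -> int:
    # Stand-in for any awaitable call, such as an async OCR pipeline
    await asyncio.sleep(0)
    return 42


async def notebook_like_cell() -> str:
    # This coroutine already runs inside an event loop, as a notebook cell does.
    coro = fetch_value()
    try:
        asyncio.run(coro)  # a nested asyncio.run() is rejected by the stdlib
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(notebook_like_cell())
print(message)
```

Without the patch, the usual alternative inside a running loop is to `await` the coroutine directly instead of wrapping it in `asyncio.run()`.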