### Install dirhash-python

Source: https://github.com/andhus/dirhash-python/blob/master/README.md

Install the package using pip from PyPI or directly from source.

```commandline
pip install dirhash
```

```commandline
git clone git@github.com:andhus/dirhash-python.git
pip install dirhash-python/
```

--------------------------------

### CLI Examples

Source: https://context7.com/andhus/dirhash-python/llms.txt

Examples of using the dirhash command-line interface for various hashing scenarios.

```APIDOC
## CLI Examples

### Exclude multiple extensions
```bash
dirhash path/to/dir -a md5 --ignore "*.pyc" "*.log" "*.tmp"
```

### Include empty directories
```bash
dirhash path/to/dir -a sha256 --empty-dirs
```

### Skip symlinked directories and files
```bash
dirhash path/to/dir -a md5 --no-linked-dirs --no-linked-files
```

### Hash only file names/structure (ignore content)
```bash
dirhash path/to/dir -a sha256 --properties name
```

### Hash only file content (ignore names)
```bash
dirhash path/to/dir -a sha256 --properties data
```

### Include symlink status in hash
```bash
dirhash path/to/dir -a sha256 --properties name data is_link
```

### Allow cyclic symbolic links
```bash
dirhash path/to/dir -a sha256 --allow-cyclic-links
```

### Parallel hashing with 8 workers (~4–6x speed-up for large dirs)
```bash
dirhash path/to/large_dataset -a sha256 --jobs 8
```

### Tune chunk size for I/O (default 1 MiB)
```bash
dirhash path/to/dir -a sha256 --chunk-size 4194304
```

### List included paths (audit filters without computing the hash)
```bash
dirhash path/to/dir --list
dirhash path/to/dir -a md5 --ignore ".*" ".*/" --list
```

### Print version
```bash
dirhash --version
```
```

--------------------------------

### Print version

Source: https://context7.com/andhus/dirhash-python/llms.txt

Display the installed version of dirhash.
```bash
dirhash --version
```

--------------------------------

### Guaranteed hash algorithms

Source: https://context7.com/andhus/dirhash-python/llms.txt

Access the module-level set `algorithms_guaranteed` to see the hash algorithms that are always available on any Python 3 install.

```python
import dirhash

# Always available on any Python 3 install
print(dirhash.algorithms_guaranteed)
# e.g. {'md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512', ...}
```

--------------------------------

### List Paths Included by dirhash

Source: https://context7.com/andhus/dirhash-python/llms.txt

Use `included_paths` to get a list of relative paths that `dirhash` would include. This is useful for debugging filter configurations before computing a full hash. Empty directories are represented as `"subdir/."`, and cyclic symlinks are handled according to the `allow_cyclic_links` argument.

```python
from dirhash import included_paths

# List every path included by default
paths = included_paths("path/to/directory")
# ['dir/subdir/file.py', 'dir/subdir/data.csv', 'readme.txt', ...]
```

```python
# Preview which paths survive a filter before computing the hash
visible_paths = included_paths(
    "path/to/project",
    match=["*.py", "*.txt"],
    ignore=[".*", ".*/"],
    linked_dirs=True,
    linked_files=True,
    empty_dirs=False,
)
for p in visible_paths:
    print(p)
# dir/module.py
# dir/notes.txt
# main.py
```

```python
# Verify that cyclic links are handled (allow_cyclic_links=True shows them as dirs)
cycle_paths = included_paths("path/with/cycles", allow_cyclic_links=True)
# ['d1/link_back/.']
```

```python
# Confirm empty directories appear when requested
paths_with_empty = included_paths("path/to/directory", empty_dirs=True)
# ['data/results/', 'empty_dir/.', 'src/main.py']
```

--------------------------------

### Include empty directories

Source: https://context7.com/andhus/dirhash-python/llms.txt

Use the --empty-dirs flag to include empty directories in the hash calculation.
```bash
dirhash path/to/dir -a sha256 --empty-dirs
```

--------------------------------

### dirhash CLI: Basic Hashing and Filtering

Source: https://context7.com/andhus/dirhash-python/llms.txt

The `dirhash` command-line tool allows hashing directories directly from the shell. Use the `-a` flag for algorithms and `--match` or `--ignore` for filtering. The `-l` flag lists included paths.

```bash
# Basic: hash a directory with MD5
dirhash path/to/directory -a md5
# 3c631c7f5771468a2187494f802fad8f
```

```bash
# Use SHA-256
dirhash path/to/directory -a sha256
# ef7e95269fbc0e3478ad31fddd1c7d08907d189c61725332e8a2fd14448fe175
```

```bash
# Include only Python files
dirhash path/to/project -a md5 --match "*.py"
```

```bash
# Exclude hidden files and directories
dirhash path/to/project -a sha1 --ignore ".*" ".*/"
```

--------------------------------

### Available hash algorithms

Source: https://context7.com/andhus/dirhash-python/llms.txt

Access the module-level set `algorithms_available` to see all hash algorithms supported on the current platform, including OpenSSL-backed ones.

```python
import dirhash

# All algorithms on the current platform (superset of guaranteed)
print(dirhash.algorithms_available)
# {'md5', 'sha1', 'sha256', 'sha512', 'blake2b', 'blake2s', 'sha3_256', ...}
```

--------------------------------

### Compute Directory Hash with CLI

Source: https://github.com/andhus/dirhash-python/blob/master/README.md

Utilize the dirhash command-line interface to compute directory hashes. Options include specifying the algorithm (-a) and using --match and --ignore for file filtering.
```commandline
dirhash path/to/directory -a md5
dirhash path/to/directory -a md5 --match "*.py"
dirhash path/to/directory -a sha1 --ignore ".*" ".*/"
```

--------------------------------

### CLI - dirhash command-line interface

Source: https://context7.com/andhus/dirhash-python/llms.txt

The `dirhash` command-line tool provides shell access to the library's functionality, including basic hashing, filter application, and listing included files.

```APIDOC
## CLI: `dirhash` command-line interface

### Description
The installed console script `dirhash` exposes the full library API from the shell. The `-l` / `--list` flag prints included paths instead of computing the hash, making it easy to audit filters interactively.

### Usage Examples
```bash
# Basic: hash a directory with MD5
dirhash path/to/directory -a md5
# 3c631c7f5771468a2187494f802fad8f

# Use SHA-256
dirhash path/to/directory -a sha256
# ef7e95269fbc0e3478ad31fddd1c7d08907d189c61725332e8a2fd14448fe175

# Include only Python files
dirhash path/to/project -a md5 --match "*.py"

# Exclude hidden files and directories
dirhash path/to/project -a sha1 --ignore ".*" ".*/"
```
```

--------------------------------

### Supported Hash Algorithms

Source: https://context7.com/andhus/dirhash-python/llms.txt

Information about the supported hash algorithms available through the dirhash library.

```APIDOC
## `algorithms_guaranteed` and `algorithms_available`: Supported hash algorithms

Module-level sets exposing which hashing algorithm names are accepted by `dirhash()`. `algorithms_guaranteed` mirrors `hashlib.algorithms_guaranteed`; `algorithms_available` mirrors `hashlib.algorithms_available` and includes additional OpenSSL-backed algorithms on the current platform.

### Guaranteed Algorithms
```python
import dirhash

# Always available on any Python 3 install
print(dirhash.algorithms_guaranteed)
# Example output: {'md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512', ...}
```

### Available Algorithms
```python
import dirhash

# All algorithms on the current platform (superset of guaranteed)
print(dirhash.algorithms_available)
# Example output: {'md5', 'sha1', 'sha256', 'sha512', 'blake2b', 'blake2s', 'sha3_256', ...}
```

### Dynamic Algorithm Selection
```python
import dirhash

# Dynamically choose the strongest available algorithm
preferred = ["sha3_256", "sha256", "sha1", "md5"]
algorithm = next(a for a in preferred if a in dirhash.algorithms_available)
hash_value = dirhash.dirhash("path/to/directory", algorithm)
print(f"[{algorithm}] {hash_value}")
```

### Custom Hashlib Algorithms
```python
import dirhash

# Use a custom hashlib.new-compatible algorithm name
hash_value = dirhash.dirhash("path/to/directory", "blake2b")
```
```

--------------------------------

### Allow cyclic symbolic links

Source: https://context7.com/andhus/dirhash-python/llms.txt

Use --allow-cyclic-links to handle directories containing cyclic symbolic links.

```bash
dirhash path/to/dir -a sha256 --allow-cyclic-links
```

--------------------------------

### Skip symlinked directories and files

Source: https://context7.com/andhus/dirhash-python/llms.txt

Use --no-linked-dirs and --no-linked-files to prevent hashing symlinked directories and files.

```bash
dirhash path/to/dir -a md5 --no-linked-dirs --no-linked-files
```

--------------------------------

### Tune chunk size for I/O

Source: https://context7.com/andhus/dirhash-python/llms.txt

Use the --chunk-size option to tune the I/O chunk size used for hashing; the default is 1 MiB.
```bash
dirhash path/to/dir -a sha256 --chunk-size 4194304
```

--------------------------------

### Customizing File/Directory Filtering with Filter Class

Source: https://context7.com/andhus/dirhash-python/llms.txt

Create `Filter` instances to control which files and directories are included in the hash. You can specify glob patterns, control symlink following, and decide whether to include empty directories. Pass these to `dirhash_impl`.

```python
from dirhash import Filter, get_match_patterns

# Exclude hidden files and directories, follow symlinks, ignore empty dirs
filt = Filter(
    match_patterns=get_match_patterns(ignore_hidden=True),
    linked_dirs=True,
    linked_files=True,
    empty_dirs=False,
)
```

```python
# Only traverse real directories (no symlinks), include empty dirs
strict_filt = Filter(
    match_patterns=["*.csv", "*.json"],
    linked_dirs=False,
    linked_files=False,
    empty_dirs=True,
)
```

```python
from dirhash import dirhash_impl

result = dirhash_impl("path/to/data", "md5", filter_=strict_filt)
```

--------------------------------

### List included paths

Source: https://context7.com/andhus/dirhash-python/llms.txt

Use the --list flag to audit filters without computing the hash.

```bash
dirhash path/to/dir --list
```

```bash
dirhash path/to/dir -a md5 --ignore ".*" ".*/" --list
```

--------------------------------

### Dynamically choose strongest available algorithm

Source: https://context7.com/andhus/dirhash-python/llms.txt

Select the strongest available hashing algorithm from a preferred list for use with `dirhash()`.
```python
import dirhash

# Dynamically choose the strongest available algorithm
preferred = ["sha3_256", "sha256", "sha1", "md5"]
algorithm = next(a for a in preferred if a in dirhash.algorithms_available)
hash_value = dirhash.dirhash("path/to/directory", algorithm)
print(f"[{algorithm}] {hash_value}")
```

--------------------------------

### Include symlink status in hash

Source: https://context7.com/andhus/dirhash-python/llms.txt

Use --properties name data is_link to include file names, content, and symlink status in the hash.

```bash
dirhash path/to/dir -a sha256 --properties name data is_link
```

--------------------------------

### Hash only file content

Source: https://context7.com/andhus/dirhash-python/llms.txt

Use --properties data to hash only the file content, ignoring file names and directory structure.

```bash
dirhash path/to/dir -a sha256 --properties data
```

--------------------------------

### Configuring Hash Computation Properties with Protocol Class

Source: https://context7.com/andhus/dirhash-python/llms.txt

Use the `Protocol` class to define which entry properties (name, data, symlink status) contribute to the hash. This allows for fine-grained control over what changes are detected. Cyclic symlinks can be handled by setting `allow_cyclic_links=True`.
```python
from dirhash import Protocol

# Default: both file content and path names affect the hash
default_proto = Protocol(entry_properties=("name", "data"))

# Data-only: detect content changes regardless of renaming
data_proto = Protocol(entry_properties=("data",))

# Name-only: detect structural changes (add/move/remove) ignoring content
name_proto = Protocol(entry_properties=("name",))

# Full: name + data + symlink awareness
full_proto = Protocol(
    entry_properties=("name", "data", "is_link"),
    allow_cyclic_links=False,
)

# Allow cyclic symlinks (hash relative path to target instead of raising)
cyclic_proto = Protocol(
    entry_properties=("name", "data"),
    allow_cyclic_links=True,
)
```

```python
# Access property name constants
print(Protocol.EntryProperties.NAME)     # "name"
print(Protocol.EntryProperties.DATA)     # "data"
print(Protocol.EntryProperties.IS_LINK)  # "is_link"
```

```python
from dirhash import dirhash_impl, Filter

result = dirhash_impl("path/to/dir", "sha256", protocol=data_proto)
```

--------------------------------

### Low-level Hashing with dirhash_impl, Filter, and Protocol

Source: https://context7.com/andhus/dirhash-python/llms.txt

Use `dirhash_impl` for advanced hashing with custom `Filter` and `Protocol` instances. This is efficient for reusing configurations across multiple calls or when subclassing for custom behavior. Multiprocessing is also supported.
```python
from dirhash import dirhash_impl, Filter, Protocol, get_match_patterns

# Build reusable filter and protocol objects
filt = Filter(
    match_patterns=get_match_patterns(ignore=[".*", ".*/", "*.tmp"]),
    linked_dirs=True,
    linked_files=True,
    empty_dirs=False,
)
proto = Protocol(
    entry_properties=("name", "data"),
    allow_cyclic_links=False,
)

# Hash multiple directories with the same configuration efficiently
directories = ["dataset/train", "dataset/val", "dataset/test"]
hashes = {d: dirhash_impl(d, "sha256", filter_=filt, protocol=proto) for d in directories}
for path, h in hashes.items():
    print(f"{path}: {h}")
# dataset/train: a3f9...
# dataset/val: b71c...
# dataset/test: 9de2...
```

```python
fast_hash = dirhash_impl(
    "path/to/large_dir",
    "sha256",
    filter_=filt,
    protocol=proto,
    jobs=8,
    chunk_size=2 ** 22,  # 4 MiB chunks for large files
)
```

--------------------------------

### Use custom hashlib.new-compatible algorithm name

Source: https://context7.com/andhus/dirhash-python/llms.txt

Specify a custom algorithm name compatible with `hashlib.new` when calling `dirhash()`.

```python
import dirhash

hash_value = dirhash.dirhash("path/to/directory", "blake2b")
```

--------------------------------

### Parallel hashing

Source: https://context7.com/andhus/dirhash-python/llms.txt

Use the --jobs N flag to enable parallel hashing for faster processing of large directories.

```bash
dirhash path/to/large_dataset -a sha256 --jobs 8
```

--------------------------------

### Hash only file names/structure

Source: https://context7.com/andhus/dirhash-python/llms.txt

Use --properties name to hash only the file names and directory structure, ignoring file content.

```bash
dirhash path/to/dir -a sha256 --properties name
```

--------------------------------

### Compute Directory Hash with Python Module

Source: https://github.com/andhus/dirhash-python/blob/master/README.md

Use the dirhash function to compute the hash of a directory.
Specify the hashing algorithm and optionally use 'match' and 'ignore' arguments for file filtering.

```python
from dirhash import dirhash

dirpath = "path/to/directory"
dir_md5 = dirhash(dirpath, "md5")
pyfiles_md5 = dirhash(dirpath, "md5", match=["*.py"])
no_hidden_sha1 = dirhash(dirpath, "sha1", ignore=[".*", ".*/"])
```

--------------------------------

### Build Glob Patterns with get_match_patterns

Source: https://context7.com/andhus/dirhash-python/llms.txt

The `get_match_patterns` helper function constructs a deduplicated list of `.gitignore`-style glob patterns from various filtering options. The output is suitable for direct use with `dirhash` or `included_paths` as the `match` argument, with ignore patterns prefixed by `!`.

```python
from dirhash import get_match_patterns

# Default: match everything
patterns = get_match_patterns()
# ['*']
```

```python
# Include only certain paths, exclude others
patterns = get_match_patterns(match=["src/*", "tests/*"], ignore=["*.pyc"])
# ['src/*', 'tests/*', '!*.pyc']
```

--------------------------------

### dirhash(directory, algorithm, ...)

Source: https://context7.com/andhus/dirhash-python/llms.txt

Computes a single, collision-resistant hash of a directory based on its file structure and content. It supports various algorithms, glob filtering, symbolic link handling, and parallel processing.

```APIDOC
## dirhash(directory, algorithm, ...)

### Description
Computes a hexdigest string representing the hash of the directory tree, incorporating file content and/or path names depending on `entry_properties`. Raises `ValueError` for invalid arguments or an empty directory (when `empty_dirs=False`), and `SymlinkRecursionError` for cyclic symlinks (when `allow_cyclic_links=False`).

### Method
`dirhash`

### Parameters
#### Path Parameters
- **directory** (string) - Required - The path to the directory to hash.
- **algorithm** (string) - Required - The hashing algorithm to use (e.g., 'md5', 'sha256').

#### Keyword Arguments
- **entry_properties** (list of strings) - Optional - Specifies which properties of directory entries to include in the hash. Defaults to ['name', 'data']. Possible values: 'name', 'data', 'is_link'.
- **match** (list of strings) - Optional - Glob patterns to include files/directories.
- **ignore** (list of strings) - Optional - Glob patterns to exclude files/directories.
- **ignore_extensions** (list of strings) - Optional - List of file extensions to ignore.
- **ignore_hidden** (boolean) - Optional - If True, ignores hidden files and directories (starting with '.'). Defaults to False.
- **empty_dirs** (boolean) - Optional - If True, includes empty directories in the hash. Defaults to False.
- **allow_cyclic_links** (boolean) - Optional - If True, allows hashing of cyclic symbolic links. Defaults to False.
- **linked_dirs** (boolean) - Optional - If True, includes symbolic links to directories in the hash. Defaults to True.
- **linked_files** (boolean) - Optional - If True, includes symbolic links to files in the hash. Defaults to True.
- **jobs** (integer) - Optional - Number of worker processes to use for parallel hashing. Defaults to 1.

### Request Example
```python
from dirhash import dirhash

# Basic usage: hash everything with MD5
hash_value = dirhash("path/to/directory", "md5")

# Hash only Python source files using SHA-256
py_hash = dirhash("path/to/project", "sha256", match=["*.py"])

# Exclude hidden files and directories
clean_hash = dirhash("path/to/project", "sha1", ignore=[".*", ".*/"])

# Check only file names/structure, not content
name_hash = dirhash("path/to/directory", "sha256", entry_properties=["name"])

# Check only file content, not names
data_hash = dirhash("path/to/directory", "sha256", entry_properties=["data"])

# Include empty directories in the hash
hash_with_empty = dirhash("path/to/directory", "md5", empty_dirs=True)

# Parallelise hashing across 8 worker processes
fast_hash = dirhash("path/to/large_dataset", "sha256", jobs=8)

# Exclude symlinked files/dirs from the hash
strict_hash = dirhash("path/to/directory", "md5", linked_dirs=False, linked_files=False)

# Allow cyclic symlinks
safe_hash = dirhash("path/to/directory", "sha256", allow_cyclic_links=True)

# Combine multiple options
combined_hash = dirhash(
    "path/to/directory",
    "sha256",
    match=["*.txt"],
    ignore=[".*", ".*/"],
    entry_properties=["name", "data", "is_link"],
    jobs=4,
)
```

### Response
#### Success Response (200)
- **hash_value** (string) - The hexdigest string of the directory hash.
```

--------------------------------

### Compute Directory Hash with dirhash

Source: https://context7.com/andhus/dirhash-python/llms.txt

Use the primary `dirhash` function to compute a hash for a directory. Specify the hashing algorithm, and optionally use `match` and `ignore` for filtering, `entry_properties` to control what is hashed, `empty_dirs` to include empty directories, and `jobs` for parallel processing.

```python
from dirhash import dirhash

# Basic usage: hash everything with MD5
hash_value = dirhash("path/to/directory", "md5")
print(hash_value)  # e.g. "3c631c7f5771468a2187494f802fad8f"
```

```python
# Hash only Python source files using SHA-256
py_hash = dirhash("path/to/project", "sha256", match=["*.py"])
print(py_hash)  # e.g. "ef7e95269fbc0e3478ad31fddd1c7d08907d189c..."
```

```python
# Exclude hidden files and directories (like .git, .DS_Store)
clean_hash = dirhash("path/to/project", "sha1", ignore=[".*", ".*/"])
print(clean_hash)
```

```python
# Check only file names/structure, not content (detect renames/moves)
name_hash = dirhash("path/to/directory", "sha256", entry_properties=["name"])
```

```python
# Check only file content, not names (detect data changes regardless of renaming)
data_hash = dirhash("path/to/directory", "sha256", entry_properties=["data"])
```

```python
# Include empty directories in the hash
hash_with_empty = dirhash("path/to/directory", "md5", empty_dirs=True)
```

```python
# Parallelise hashing across 8 worker processes for large directories
fast_hash = dirhash("path/to/large_dataset", "sha256", jobs=8)
```

```python
# Exclude symlinked files/dirs from the hash
strict_hash = dirhash(
    "path/to/directory",
    "md5",
    linked_dirs=False,
    linked_files=False,
)
```

```python
# Allow cyclic symlinks (hashes the relative path to the link target instead)
safe_hash = dirhash("path/to/directory", "sha256", allow_cyclic_links=True)
```

```python
# Combine: only .txt files, ignore hidden, include symlink status, 4 workers
combined_hash = dirhash(
    "path/to/directory",
    "sha256",
    match=["*.txt"],
    ignore=[".*", ".*/"],
    entry_properties=["name", "data", "is_link"],
    jobs=4,
)
```

--------------------------------

### get_match_patterns(match, ignore, ignore_extensions, ignore_hidden)

Source: https://context7.com/andhus/dirhash-python/llms.txt

Composes a deduplicated list of `.gitignore`-style match patterns from higher-level filter options. The resulting list is suitable for passing directly as `match` to `dirhash()` or `included_paths()`.
```APIDOC
## get_match_patterns(match, ignore, ignore_extensions, ignore_hidden)

### Description
A helper that composes a deduplicated list of `.gitignore`-style match patterns from higher-level filter options. The resulting list is suitable for passing directly as `match` to `dirhash()` or `included_paths()`.

### Method
`get_match_patterns`

### Parameters
#### Keyword Arguments
- **match** (list of strings) - Optional - Glob patterns to include files/directories.
- **ignore** (list of strings) - Optional - Glob patterns to exclude files/directories.
- **ignore_extensions** (list of strings) - Optional - List of file extensions to ignore.
- **ignore_hidden** (boolean) - Optional - If True, ignores hidden files and directories (starting with '.'). Defaults to False.

### Request Example
```python
from dirhash import get_match_patterns

# Default: match everything
patterns = get_match_patterns()

# Include only certain paths, exclude others
patterns = get_match_patterns(match=["src/*", "tests/*"], ignore=["*.pyc"])
```

### Response
#### Success Response (200)
- **patterns** (list of strings) - A list of composed glob patterns.
```

--------------------------------

### Shorthand for Ignoring Files and Directories

Source: https://context7.com/andhus/dirhash-python/llms.txt

Use `get_match_patterns` for concise filtering. Leading dots are optional for extensions. Hidden files and directories can be ignored with `ignore_hidden=True`.
```python
patterns = get_match_patterns(ignore_extensions=["pyc", ".log", "tmp"])
# ['*', '!*.pyc', '!*.log', '!*.tmp']
```

```python
patterns = get_match_patterns(ignore_hidden=True)
# ['*', '!.*', '!.*/']
```

```python
patterns = get_match_patterns(
    match=["data/**"],
    ignore=[".*"],
    ignore_extensions=["tmp"],
    ignore_hidden=True,  # '.*' already present, won't be duplicated
)
# ['data/**', '!.*', '!*.tmp', '!.*/']
```

--------------------------------

### Protocol Class

Source: https://context7.com/andhus/dirhash-python/llms.txt

Controls which file/directory properties contribute to the hash value. Allows customization of hash computation based on name, data, and symlink status.

```APIDOC
## `Protocol` class

### Description
Controls which file/directory properties contribute to the hash value. The `entry_properties` set determines whether name, data, and/or symlink status are included. Provides a nested `EntryProperties` class with string constants `NAME`, `DATA`, and `IS_LINK`.

### Parameters
- **entry_properties** (tuple[str], optional) - A tuple of strings specifying which properties to include in the hash (e.g., "name", "data", "is_link").
- **allow_cyclic_links** (bool, optional) - Whether to allow cyclic symbolic links.

### Request Example
```python
from dirhash import Protocol

# Default: both file content and path names affect the hash
default_proto = Protocol(entry_properties=("name", "data"))

# Data-only: detect content changes regardless of renaming
data_proto = Protocol(entry_properties=("data",))

# Name-only: detect structural changes (add/move/remove) ignoring content
name_proto = Protocol(entry_properties=("name",))

# Full: name + data + symlink awareness
full_proto = Protocol(
    entry_properties=("name", "data", "is_link"),
    allow_cyclic_links=False,
)

# Allow cyclic symlinks (hash relative path to target instead of raising)
cyclic_proto = Protocol(
    entry_properties=("name", "data"),
    allow_cyclic_links=True,
)

# Access property name constants
print(Protocol.EntryProperties.NAME)     # "name"
print(Protocol.EntryProperties.DATA)     # "data"
print(Protocol.EntryProperties.IS_LINK)  # "is_link"

from dirhash import dirhash_impl, Filter

result = dirhash_impl("path/to/dir", "sha256", protocol=data_proto)
```

### Response
- **result** (str) - The computed hash value.
```

--------------------------------

### Exclude multiple extensions

Source: https://context7.com/andhus/dirhash-python/llms.txt

Use glob patterns to exclude specific file extensions from the hash calculation.

```bash
dirhash path/to/dir -a md5 --ignore "*.pyc" "*.log" "*.tmp"
```

--------------------------------

### Filter Class

Source: https://context7.com/andhus/dirhash-python/llms.txt

Encapsulates all filtering logic for directory traversal, including glob patterns, symlink following, and empty directory inclusion. Can be subclassed for custom behavior.

```APIDOC
## `Filter` class

### Description
Encapsulates all filtering logic: glob match patterns, symlink following, and empty-directory inclusion. Pass instances to `dirhash_impl()` for reuse or subclass for custom traversal logic.

### Parameters
- **match_patterns** (list[str], optional) - List of glob patterns for inclusion/exclusion.
- **linked_dirs** (bool, optional) - Whether to follow symbolic links to directories.
- **linked_files** (bool, optional) - Whether to follow symbolic links to files.
- **empty_dirs** (bool, optional) - Whether to include empty directories in the hash calculation.

### Request Example
```python
from dirhash import Filter, get_match_patterns

# Exclude hidden files and directories, follow symlinks, ignore empty dirs
filt = Filter(
    match_patterns=get_match_patterns(ignore_hidden=True),
    linked_dirs=True,
    linked_files=True,
    empty_dirs=False,
)

# Only traverse real directories (no symlinks), include empty dirs
strict_filt = Filter(
    match_patterns=["*.csv", "*.json"],
    linked_dirs=False,
    linked_files=False,
    empty_dirs=True,
)

from dirhash import dirhash_impl

result = dirhash_impl("path/to/data", "md5", filter_=strict_filt)
```

### Response
- **result** (str) - The computed hash value.
```

--------------------------------

### dirhash_impl

Source: https://context7.com/andhus/dirhash-python/llms.txt

Low-level hashing function that accepts pre-constructed Filter and Protocol instances. Useful for reusing configurations or implementing custom traversal logic.

```APIDOC
## `dirhash_impl(directory, algorithm, filter_=None, protocol=None, ...)`

### Description
Advanced entry point that accepts pre-constructed `Filter` and `Protocol` instances instead of individual keyword arguments. Useful when the same filter/protocol configuration is reused across many calls, or when subclassing `Filter` / `Protocol` for custom behavior.

### Parameters
- **directory** (str) - The path to the directory to hash.
- **algorithm** (str) - The hashing algorithm to use (e.g., "sha256", "md5").
- **filter_** (Filter, optional) - An instance of the `Filter` class to specify inclusion/exclusion patterns and traversal options.
- **protocol** (Protocol, optional) - An instance of the `Protocol` class to specify which entry properties contribute to the hash.
- **jobs** (int, optional) - Number of worker processes to use for multiprocessing.
- **chunk_size** (int, optional) - Size of chunks for processing large files.

### Request Example
```python
from dirhash import dirhash_impl, Filter, Protocol, get_match_patterns

filt = Filter(
    match_patterns=get_match_patterns(ignore=[".*", ".*/", "*.tmp"]),
    linked_dirs=True,
    linked_files=True,
    empty_dirs=False,
)
proto = Protocol(
    entry_properties=("name", "data"),
    allow_cyclic_links=False,
)

directories = ["dataset/train", "dataset/val", "dataset/test"]
hashes = {d: dirhash_impl(d, "sha256", filter_=filt, protocol=proto) for d in directories}
for path, h in hashes.items():
    print(f"{path}: {h}")

fast_hash = dirhash_impl(
    "path/to/large_dir",
    "sha256",
    filter_=filt,
    protocol=proto,
    jobs=8,
    chunk_size=2 ** 22,  # 4 MiB chunks for large files
)
```

### Response
- **hashes** (dict) - A dictionary mapping directory paths to their computed hash values.
- **fast_hash** (str) - The computed hash value for the large directory.
```

--------------------------------

### included_paths(directory, ...)

Source: https://context7.com/andhus/dirhash-python/llms.txt

Returns a sorted list of all paths (relative to the specified directory) that would be included by `dirhash()` given the same filtering arguments. Useful for debugging filters.

```APIDOC
## included_paths(directory, ...)

### Description
Returns a sorted list of all paths (relative to `directory`) that would be included by `dirhash()` given the same filtering arguments. Useful for debugging filters before committing to a hash computation. Empty directory entries appear as `"subdir/."`.

### Method
`included_paths`

### Parameters
#### Path Parameters
- **directory** (string) - Required - The path to the directory to inspect.

#### Keyword Arguments
- **entry_properties** (list of strings) - Optional - Specifies which properties of directory entries to include in the hash. Defaults to ['name', 'data'].
  Possible values: 'name', 'data', 'is_link'.
- **match** (list of strings) - Optional - Glob patterns to include files/directories.
- **ignore** (list of strings) - Optional - Glob patterns to exclude files/directories.
- **ignore_extensions** (list of strings) - Optional - List of file extensions to ignore.
- **ignore_hidden** (boolean) - Optional - If True, ignores hidden files and directories (starting with '.'). Defaults to False.
- **empty_dirs** (boolean) - Optional - If True, includes empty directories in the hash. Defaults to False.
- **allow_cyclic_links** (boolean) - Optional - If True, allows hashing of cyclic symbolic links. Defaults to False.
- **linked_dirs** (boolean) - Optional - If True, includes symbolic links to directories in the hash. Defaults to True.
- **linked_files** (boolean) - Optional - If True, includes symbolic links to files in the hash. Defaults to True.

### Request Example
```python
from dirhash import included_paths

# List every path included by default
paths = included_paths("path/to/directory")

# Preview which paths survive a filter before computing the hash
visible_paths = included_paths(
    "path/to/project",
    match=["*.py", "*.txt"],
    ignore=[".*", ".*/"],
    linked_dirs=True,
    linked_files=True,
    empty_dirs=False,
)

# Verify that cyclic links are handled
cycle_paths = included_paths("path/with/cycles", allow_cyclic_links=True)

# Confirm empty directories appear when requested
paths_with_empty = included_paths("path/to/directory", empty_dirs=True)
```

### Response
#### Success Response (200)
- **paths** (list of strings) - A sorted list of relative paths included in the directory.
```
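
--------------------------------

### Sketch: how the documented get_match_patterns outputs compose

The `get_match_patterns` examples earlier in this document show ignore patterns being prefixed with `!`, extensions accepted with or without a leading dot, and duplicates dropped. The following is a minimal standard-library sketch that reproduces those documented outputs. `compose_patterns` is a hypothetical name used only for illustration; it is not part of dirhash, and the library's actual implementation may differ.

```python
def compose_patterns(match=None, ignore=None, ignore_extensions=None, ignore_hidden=False):
    """Hypothetical stand-in mimicking the documented get_match_patterns outputs."""
    # No explicit match patterns means "match everything"
    patterns = ["*"] if match is None else list(match)
    ignored = [] if ignore is None else list(ignore)

    # The leading dot is optional: "pyc" and ".pyc" both become "*.pyc"
    for ext in ignore_extensions or []:
        ignored.append("*" + (ext if ext.startswith(".") else "." + ext))

    # Hidden files and hidden directories are two separate patterns
    if ignore_hidden:
        ignored.extend([".*", ".*/"])

    # Ignore patterns are appended after the match patterns, negated with "!"
    patterns.extend("!" + p for p in ignored)

    # dict.fromkeys drops duplicates while keeping first-occurrence order
    return list(dict.fromkeys(patterns))


print(compose_patterns())
# ['*']
print(compose_patterns(ignore_extensions=["pyc", ".log", "tmp"]))
# ['*', '!*.pyc', '!*.log', '!*.tmp']
print(compose_patterns(match=["data/**"], ignore=[".*"],
                       ignore_extensions=["tmp"], ignore_hidden=True))
# ['data/**', '!.*', '!*.tmp', '!.*/']
```

Keeping the first occurrence of each pattern is what makes the last call match the documented result: `'!.*'` from `ignore` is kept in place, and the duplicate produced by `ignore_hidden=True` is dropped.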