### Install CloudPathlib with Cloud SDKs

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/README.md

Installs cloudpathlib along with specific cloud service SDKs using pip extras. Use quotes if your shell requires it.

```bash
pip install cloudpathlib[s3,gs,azure]
```

```bash
pip install "cloudpathlib[s3,gs,azure]"
```

--------------------------------

### Install Development Version with All SDKs

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/README.md

Installs the latest development version of cloudpathlib from GitHub, including all available cloud service SDKs. The `git+` prefix tells pip to clone the repository rather than download the URL as an archive.

```bash
pip install git+https://github.com/drivendataorg/cloudpathlib.git#egg=cloudpathlib[all]
```

--------------------------------

### PR Checklist Example

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

This is a sample checklist provided in the PR template to ensure all necessary steps are completed before submitting a pull request.

```markdown
- [ ] I have read and understood `CONTRIBUTING.md`
- [ ] Confirmed an issue exists for the PR, and the text `Closes #issue` appears in the PR summary (e.g., `Closes #123`).
- [ ] Confirmed PR is rebased onto the latest base
- [ ] Confirmed failure before change and success after change
- [ ] Any generic new functionality is replicated across cloud providers if necessary
- [ ] Tested manually against live server backend for at least one provider
- [ ] Added tests for any new functionality
- [ ] Linting passes locally
- [ ] Tests pass locally
- [ ] Updated `HISTORY.md` with the issue that is addressed and the PR you are submitting. If the top section is not `## UNRELEASED`, then you need to add a new section to the top of the document for your change.
```

--------------------------------

### Get Client Instance for Rig

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Instantiate a client class for use with the testing rig.
This is necessary when testing functionality directly on the `*Client` classes.

```python
new_client = rig.client_class()
```

--------------------------------

### Setup Interactive Testing Environment

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Configure a Jupyter notebook for interactive testing of the library. Ensure autoreload is enabled to pick up code changes immediately.

```python
%load_ext autoreload
%autoreload 2
```

```python
from cloudpathlib import CloudPath

cp = CloudPath("s3://my-test-bucket/")
```

--------------------------------

### Install Development Requirements

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Installs all development dependencies and an editable version of cloudpathlib. Ensure your Python environment is active before running.

```bash
make reqs
```

--------------------------------

### Open and Write to Cloud Path

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/README.md

Opens a cloud path for writing and writes content to it. Ensure the necessary cloud SDK dependencies are installed.

```python
from cloudpathlib import CloudPath

with CloudPath("s3://bucket/filename.txt").open("w+") as f:
    f.write("Send my changes to the cloud!")
```

--------------------------------

### Serve Documentation Locally

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Start a local development server to preview documentation changes. The server auto-reloads for most file changes, but changes to `index.md`, `HISTORY.md`, or `CONTRIBUTING.md` require running `make docs-setup` and then restarting the server.

```bash
make docs-serve
```

--------------------------------

### Cross-platform path manipulation with pathlib

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/why_cloudpathlib.ipynb

Illustrates how pathlib handles cross-platform path construction, including getting the user's home directory and joining path components.
```python
path = Path.home()
path

docs = path / 'Documents'
docs
```

--------------------------------

### Get path information with pathlib

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/why_cloudpathlib.ipynb

Demonstrates how to retrieve various attributes of a file path, such as name, stem, suffix, parent, and read its content.

```python
notebook = Path("why_cloudpathlib.ipynb").resolve()

print(f"{'Path:':15}{notebook}")
print(f"{'Name:':15}{notebook.name}")
print(f"{'Stem:':15}{notebook.stem}")
print(f"{'Suffix:':15}{notebook.suffix}")
print(f"{'With suffix:':15}{notebook.with_suffix('.cpp')}")
print(f"{'Parent:':15}{notebook.parent}")
print(f"{'Read_text:'} {notebook.read_text()[:200]} ")
```

--------------------------------

### Install CloudPathlib with Conda

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/README.md

Installs cloudpathlib with specific cloud service SDKs using conda from conda-forge. The suffix indicates the SDK to install.

```bash
conda install cloudpathlib-s3 -c conda-forge
```

--------------------------------

### Using `os.path` Functions with Patched `CloudPath`

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/patching_builtins.ipynb

This example shows how `os.path` functions (`isdir`, `basename`, `dirname`, `join`) correctly handle `CloudPath` objects after `patch_os_functions` is applied. It demonstrates seamless integration with cloud paths.
```python
with patch_os_functions():
    result = os.path.isdir(folder)
    print("Patched version of `os.path.isdir` returns: ", result)

    print("basename:", os.path.basename(cp))
    print("dirname:", os.path.dirname(cp))
    joined = os.path.join(folder, "dir", "sub", "name.txt")
    print("join:", joined)
```

--------------------------------

### Set and Get Default S3Client

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/authentication.md

Illustrates how to set an explicitly instantiated client as the default for future CloudPath objects and how to retrieve the current default client.

```python
from cloudpathlib import S3Client

client = S3Client(aws_access_key_id="myaccesskey", aws_secret_access_key="mysecretkey")
client.set_as_default_client()

S3Client.get_default_client()
#>
```

--------------------------------

### Create CloudPath Instance (File Exists)

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Use `rig.create_cloud_path` to get a `CloudPath` instance that refers to a file expected to exist on the provider. This is useful for testing scenarios where the file's presence is a prerequisite.

```python
cp = rig.create_cloud_path("dir_0/file0_0.txt")
```

--------------------------------

### List Files in S3 Bucket Directory

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/caching.ipynb

Iterate through files in a specified S3 directory using `iterdir()`. This example shows how to list the first 5 images for a given incident.

```python
from cloudpathlib import CloudPath
from itertools import islice

ladi = CloudPath("s3://ladi/Images/FEMA_CAP/2020/70349")

# list first 5 images for this incident
for p in islice(ladi.iterdir(), 5):
    print(p)
```

--------------------------------

### Configuring S3 Client with ExtraArgs

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/authentication.md

You can configure an S3Client with various boto3 ExtraArgs that will be passed to upload, download, or copy operations.
This example demonstrates setting 'ChecksumMode' for downloads and 'ACL' for uploads on a client that will be used as the default.

```python
from cloudpathlib import S3Client

c = S3Client(extra_args={
    "ChecksumMode": "ENABLED",  # download extra arg, only used when downloading
    "ACL": "public-read",  # upload extra arg, only used when uploading
})

# use these extras for all CloudPaths
c.set_as_default_client()
```

--------------------------------

### Pandas to CSV (Using .open() Workaround)

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/patching_builtins.ipynb

Provides the recommended workaround for writing pandas DataFrames to `CloudPath` by using `CloudPath.open()` to get a file-like buffer, which pandas can then write to.

```python
# instead, use .open
with cloud_path.open("w") as f:
    df.to_csv(f)

assert cloud_path.exists()
print("Successfully wrote to ", cloud_path)
```

--------------------------------

### Using `library_function` with Patched `open` in Jupyter

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/patching_builtins.ipynb

This example demonstrates successful use of `library_function` with a `CloudPath` after patching the notebook's `open` function. It verifies that the file is written and can be read back correctly.

```python
from cloudpathlib import CloudPath, patch_open

# enable patch and rebind notebook's open
open = patch_open().patched

# create file to read
cp = CloudPath("s3://cloudpathlib-test-bucket/patching_builtins/file.txt")

library_function(cp)
assert cp.read_text() == "hello!"
print("Succeeded!")
```

--------------------------------

### Azure Blob Storage Path Operations with CloudPathLib

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/why_cloudpathlib.ipynb

This snippet demonstrates file operations on an Azure Blob Storage container. It mirrors the S3 example, showing path creation, writing, existence checks, and deletion.
```python
from cloudpathlib import CloudPath

# Changing this root path is the ONLY change!
cloud_directory = CloudPath("az://cloudpathlib-test-container/why_cloudpathlib/")

upload = cloud_directory / "user_upload.txt"
upload.write_text("A user made this file!")

assert upload.exists()
upload.unlink()
assert not upload.exists()
```

--------------------------------

### Configure Basic HTTP Authentication

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/http.md

Pass a urllib.request.BaseHandler implementation, such as HTTPBasicAuthHandler, to the HttpClient constructor to enable authentication for requests. This example shows how to add credentials for a specific realm and URI.

```python
import urllib.request

auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(
    realm="Some Realm",
    uri="http://www.example.com",
    user="username",
    passwd="password"
)

client = HttpClient(auth=auth_handler)
my_path = client.CloudPath("http://www.example.com/secret/data.txt")

# Now GET requests will include basic auth headers
content = my_path.read_text()
```

--------------------------------

### Accessing Requester Pays S3 Buckets

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/authentication.md

When accessing a Requester Pays S3 bucket, you must explicitly indicate that you will pay for the operations. The first block shows the error raised without that indication; the second lists the contents of such a bucket by creating an S3Client with the 'RequestPayer' extra argument set to 'requester'.

```python
from cloudpathlib import CloudPath

tars = list(CloudPath("s3://arxiv/src/").iterdir())
print(tars)

#> ClientError: An error occurred (AccessDenied) ...
```

```python
from cloudpathlib import S3Client

c = S3Client(extra_args={"RequestPayer": "requester"})

# use the client we created to build the path
tars = list(c.CloudPath("s3://arxiv/src/").iterdir())
print(tars)
```

--------------------------------

### Build Documentation

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Run this command in the project's root directory to build the latest version of the documentation.

```bash
make docs
```

--------------------------------

### Developer Commands with Make

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Lists available commands for local development and maintenance tasks. Run `make` to see all options.

```bash
clean                remove all build, test, coverage and Python artifacts
clean-build          remove build artifacts
clean-pyc            remove Python file artifacts
clean-test           remove test and coverage artifacts
dist                 builds source and wheel package
docs-setup           setup docs pages based on README.md and HISTORY.md
docs                 build the static version of the docs
docs-serve           serve documentation to livereload while you work
format               run black to format codebase
install              install the package to the active Python's site-packages
lint                 check style with black, flake8, and mypy
release              package and upload a release
reqs                 install development requirements
test                 run tests with mocked cloud SDKs
test-debug           rerun tests that failed in last run and stop with pdb at failures
test-live-cloud      run tests on live cloud backends
perf                 run performance measurement suite for s3 and save results to perf-results.csv
```

--------------------------------

### Joining Paths and Creating New File Paths

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/README.md

Demonstrates how to join path components to create a new file path within the cloud storage. The file does not need to exist beforehand.
```python
new_file_copy = root_dir / "nested_dir/copy_file.txt"
new_file_copy.exists()

new_file_copy.write_text(text_data)
```

--------------------------------

### Explicitly Instantiate S3Client with Credentials

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/authentication.md

Shows how to create an S3Client instance directly using access key and secret key. This is useful for authentication methods other than environment variables.

```python
from cloudpathlib import CloudPath, S3Client

client = S3Client(aws_access_key_id="myaccesskey", aws_secret_access_key="mysecretkey")

# these next two commands are equivalent
# use client's factory method
cp1 = client.CloudPath("s3://cloudpathlib-test-bucket/")
# or pass client as keyword argument
cp2 = CloudPath("s3://cloudpathlib-test-bucket/", client=client)
```

--------------------------------

### Get S3 file statistics

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/why_cloudpathlib.ipynb

Use the stat() method to retrieve file metadata, such as size and modification time, for an S3 object.

```python
stat = s3p.stat()
print(f"File size in bytes: {stat.st_size}")
stat
```

--------------------------------

### Run Live Cloud Tests

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Executes tests against actual live cloud provider servers. Ensure you have the necessary credentials configured for each provider before running.

```bash
make test-live-cloud
```

--------------------------------

### Create Cloud Paths with Test Rigs

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Use the `create_cloud_path` method of a test rig to instantiate cloud paths. This method can create paths to existing assets or non-existent locations. You can also obtain a client instance from the rig.
```python
def test_file_operations(rig):
    # Create a path to an existing file in the test assets
    cp = rig.create_cloud_path("dir_0/file0_0.txt")

    # Create a path to a non-existent file
    cp2 = rig.create_cloud_path("path/that/does/not/exist.txt")

    # Get a client instance
    client = rig.client_class()
```

--------------------------------

### Basic CloudPath Usage with HTTP URLs

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/http.md

Demonstrates creating CloudPath objects for HTTP/HTTPS URLs and performing common file operations like reading, joining paths, checking existence, and listing directory contents.

```python
from cloudpathlib import CloudPath

# Create a path object
path = CloudPath("https://example.com/data/file.txt")

# Read file contents
text = path.read_text()
binary = path.read_bytes()

# Get parent directory
parent = path.parent  # https://example.com/data/

# Join paths
subpath = path.parent / "other.txt"  # https://example.com/data/other.txt

# Check if file exists
if path.exists():
    print("File exists!")

# Get file name and suffix
print(path.name)  # "file.txt"
print(path.suffix)  # ".txt"

# List directory contents (if server supports directory listings)
data_dir = CloudPath("https://example.com/data/")
for child_path in data_dir.iterdir():
    print(child_path)
```

--------------------------------

### Run Full Test Suite (Mocked)

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Executes the complete test suite using mocked cloud SDKs, ensuring no network calls are made. This is the most common command during development.

```bash
make test
```

--------------------------------

### Open and display an image from cache

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/caching.ipynb

Opening a file with `.open()` downloads it to the cache if it's not already present. Subsequent opens will use the cached version, leading to faster access.
```python
%%time
with flood_image.open("rb") as f:
    i = Image.open(f)
    plt.imshow(i)
```

--------------------------------

### tmp_dir File Cache Mode Example

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/caching.ipynb

Demonstrates the 'tmp_dir' file cache mode. Cached files are available while the CloudPath object exists and after it's deleted, but are removed when the Client object is garbage collected.

```python
tmp_dir_client = S3Client()

flood_image = tmp_dir_client.CloudPath(
    "s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0002_a89f1b79-786f-4dac-9dcc-609fb1a977b1.jpg"
)

with flood_image.open("rb") as f:
    i = Image.open(f)
    print("Image loaded...")

# cache exists while the CloudPath object persists
local_cached_file = flood_image._local
print("Cache file exists after finished reading: ", local_cached_file.exists())

# decrement reference count so garbage collection runs
del flood_image

# file still exists
print("Cache file exists after CloudPath is no longer referenced: ", local_cached_file.exists())

# decrement reference count so garbage collector removes the client
del tmp_dir_client

# file is removed together with the client's temporary cache directory
print("Cache file exists after Client is no longer referenced: ", local_cached_file.exists())
```

--------------------------------

### Simulating Library Function Using `open`

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/patching_builtins.ipynb

This code simulates a function within a third-party library that uses the built-in `open` to write to a file. It demonstrates a scenario where patching `open` would be necessary for `CloudPath` compatibility.
```python
# Imagine that deep in a third-party library a function is implemented like this
def library_function(filepath: str):
    with open(filepath, "w") as f:
        f.write("hello!")
```

--------------------------------

### Instantiate CloudPath and Access Client

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/authentication.md

Demonstrates how a default client is automatically instantiated when creating a CloudPath object. Subsequent instances of the same service's paths will reuse this client.

```python
from cloudpathlib import CloudPath

cloud_path = CloudPath("s3://cloudpathlib-test-bucket/")  # same for S3Path(...)
cloud_path.client
#>
```

--------------------------------

### Instantiate HTTP/HTTPS Paths with AnyPath or CloudPath

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/http.md

Use `AnyPath` or `CloudPath` to automatically handle `http://` or `https://` URLs. `AnyPath` also supports local file paths.

```python
from cloudpathlib import AnyPath, CloudPath

# AnyPath will automatically detect "http://" or "https://" (or local file paths)
my_path = AnyPath("https://www.example.com/files/info.txt")

# CloudPath will dispatch to the correct subclass
my_path = CloudPath("https://www.example.com/files/info.txt")
```

--------------------------------

### Create CloudPath Instance (File May Not Exist)

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Use `rig.create_cloud_path` to create a `CloudPath` instance for a path that does not necessarily need to exist. This is suitable for testing operations that do not require the target file or directory to be present beforehand.

```python
cp2 = rig.create_cloud_path("path/that/does/not/exist.txt")
```

--------------------------------

### Persistent File Cache Mode Example

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/caching.ipynb

Illustrates the 'persistent' file cache mode.
Cached files persist even after both CloudPath and Client objects are deleted, requiring manual cleanup. This mode requires `local_cache_dir` to be specified.

```python
persistent_client = S3Client(local_cache_dir="./cache")

# cache mode set automatically to persistent if local_cache_dir and not explicit
print("Client cache mode set to: ", persistent_client.file_cache_mode)

# build the path from the persistent client so it uses that client's cache
flood_image = persistent_client.CloudPath(
    "s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0002_a89f1b79-786f-4dac-9dcc-609fb1a977b1.jpg"
)

with flood_image.open("rb") as f:
    i = Image.open(f)
    print("Image loaded...")

# cache exists while the CloudPath object persists
local_cached_file = flood_image._local
print("Cache file exists after finished reading: ", local_cached_file.exists())

# decrement reference count so garbage collection runs
del flood_image

# file still exists
print("Cache file exists after CloudPath is no longer referenced: ", local_cached_file.exists())

# decrement reference count so garbage collector removes the client
client_cache_dir = persistent_client._local_cache_dir
del persistent_client

# file still exists
print("Cache file exists after Client is no longer referenced: ", local_cached_file.exists())

# explicitly remove persistent cache file
import shutil

shutil.rmtree(client_cache_dir)
```

--------------------------------

### LocalClient

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/api-reference/local.md

A generic client for local file system interactions.

```APIDOC
## cloudpathlib.local.LocalClient

### Description
A generic client for interacting with the local file system.

### Usage
Instantiate this class to perform operations on local file system paths.
```

--------------------------------

### Performance Test Report Structure

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Example structure of a performance test report, showing metrics like Mean, Std, Max, and N Items for different test scenarios. This helps in understanding the performance impact of code changes.

```text
Performance suite results: (2023-10-08T13:18:04.774823)
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Test Name      ┃ Config Name                ┃ Iterations ┃           Mean ┃              Std ┃            Max ┃ N Items ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ List Folders   │ List shallow recursive     │         10 │ 0:00:00.862476 │ ± 0:00:00.020222 │ 0:00:00.898143 │   5,500 │
│ List Folders   │ List shallow non-recursive │         10 │ 0:00:00.884997 │ ± 0:00:00.086678 │ 0:00:01.117775 │   5,500 │
│ List Folders   │ List normal recursive      │         10 │ 0:00:01.248844 │ ± 0:00:00.095575 │ 0:00:01.506868 │   7,877 │
│ List Folders   │ List normal non-recursive  │         10 │ 0:00:00.060042 │ ± 0:00:00.003986 │ 0:00:00.064052 │     113 │
│ List Folders   │ List deep recursive        │         10 │ 0:00:02.004731 │ ± 0:00:00.130264 │ 0:00:02.353263 │   7,955 │
│ List Folders   │ List deep non-recursive    │         10 │ 0:00:00.054268 │ ± 0:00:00.003314 │ 0:00:00.062116 │      31 │
│ Glob scenarios │ Glob shallow recursive     │         10 │ 0:00:01.056946 │ ± 0:00:00.160470 │ 0:00:01.447082 │   5,500 │
│ Glob scenarios │ Glob shallow non-recursive │         10 │ 0:00:00.978217 │ ± 0:00:00.091849 │ 0:00:01.230822 │   5,500 │
│ Glob scenarios │ Glob normal recursive      │         10 │ 0:00:01.510334 │ ± 0:00:00.101108 │ 0:00:01.789393 │   7,272 │
│ Glob scenarios │ Glob normal non-recursive  │         10 │ 0:00:00.058301 │ ± 0:00:00.002621 │ 0:00:00.063299 │      12 │
│ Glob scenarios │ Glob deep recursive        │         10 │ 0:00:02.784629 │ ± 0:00:00.099764 │ 0:00:02.981882 │   7,650 │
│ Glob scenarios │ Glob deep non-recursive    │         10 │ 0:00:00.051322 │ ± 0:00:00.002653 │ 0:00:00.054844 │      25 │
│ Walk scenarios │ Walk shallow               │         10 │ 0:00:00.905571 │ ± 0:00:00.076332 │ 0:00:01.113957 │   5,500 │
│ Walk scenarios │ Walk normal                │         10 │ 0:00:01.441215 │ ± 0:00:00.014923 │ 0:00:01.470414 │   7,272 │
│ Walk scenarios │ Walk deep                  │         10 │ 0:00:02.461520 │ ± 0:00:00.031832 │ 0:00:02.539132 │   7,650 │
└────────────────┴────────────────────────────┴────────────┴────────────────┴──────────────────┴────────────────┴─────────┘
```

--------------------------------

### Run Code Formatting

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Execute the Black formatter to ensure code adheres to project style guidelines. This command should be run before submitting changes.

```bash
make format
```

--------------------------------

### Implement a Custom Client Class

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Define a new client class by inheriting from `cloudpathlib.client.Client`. This class will handle the specific interactions with a cloud storage provider.

```python
from cloudpathlib.client import Client

class MyClient(Client):
    # implementation here...
    ...
```

--------------------------------

### Custom Directory Listing Parser

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/http.md

Override the default directory listing parser by providing a custom function to HttpClient. This function should accept HTML content and yield strings representing file or directory names. This example uses BeautifulSoup to find links with a specific class.
```python
from typing import Iterable

from bs4 import BeautifulSoup


def my_parser(html_content: str) -> Iterable[str]:
    # for example, keep only <a> tags that have an href and the class "file-link",
    # using beautifulsoup
    soup = BeautifulSoup(html_content, "html.parser")
    for link in soup.find_all("a", class_="file-link"):
        yield link.get("href")

client = HttpClient(custom_list_page_parser=my_parser)
my_dir = client.CloudPath("http://example.com/public/")

for subpath, is_dir in my_dir.list_dir(recursive=False):
    print(subpath, "dir" if is_dir else "file")
```

--------------------------------

### Accessing cached file via fspath (read-only)

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/caching.ipynb

Use the `.fspath` property to get the local path to the cached file. This is useful for libraries that do not accept `PathLike` objects. Note that this operation downloads the file if it's not in the cache and should be treated as read-only, as changes will not be uploaded to the cloud.

```python
# Warning: Using the `.fspath` property will download the file from the cloud if it does not exist yet in the cache.
# Warning: Since we are no longer in control of opening/closing the file, we cannot upload any changes when the file is closed. Therefore, you should treat any code where you use fspath as _read only_. Writes directly to fspath will not be uploaded to the cloud.
```

--------------------------------

### Initialize S3Path

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/why_cloudpathlib.ipynb

Create an S3Path object representing a file in an S3 bucket. This requires the 's3://' prefix.

```python
from cloudpathlib import S3Path

s3p = S3Path("s3://cloudpathlib-test-bucket/why_cloudpathlib/file.txt")
s3p.name
```

--------------------------------

### Create an empty file on S3 with touch()

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/why_cloudpathlib.ipynb

The touch() method creates an empty file at the specified S3 path, similar to the pathlib equivalent.
```python
# Touch (just like with `pathlib.Path`)
s3p.touch()
```

--------------------------------

### Instantiate AnyPath for Local and Cloud Paths

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/anypath-polymorphism.md

Use AnyPath to create path objects. It automatically resolves to a pathlib.Path for local paths or a CloudPath subclass (e.g., S3Path) for cloud URIs.

```python
from cloudpathlib import AnyPath

path = AnyPath("mydir/myfile.txt")
path
#> PosixPath('mydir/myfile.txt')

cloud_path = AnyPath("s3://mybucket/myfile.txt")
cloud_path
#> S3Path('s3://mybucket/myfile.txt')

isinstance(path, AnyPath)
#> True
isinstance(cloud_path, AnyPath)
#> True
```

--------------------------------

### LocalGSImplementation

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/api-reference/local.md

Details the implementation for local Google Cloud Storage interactions.

```APIDOC
## cloudpathlib.local.local_gs_implementation

### Description
Provides the implementation details for interacting with Google Cloud Storage (GCS) in a local context.

### Usage
This module is typically used internally by cloudpathlib to handle GCS operations.
```

--------------------------------

### Initialize AzureBlobPath

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/why_cloudpathlib.ipynb

Create an AzureBlobPath object representing a file in an Azure Blob Storage container. This requires the 'az://' prefix.

```python
from cloudpathlib import AzureBlobPath

azp = AzureBlobPath("az://cloudpathlib-test-container/file.txt")
azp.name
```

--------------------------------

### List files in a directory with pathlib

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/why_cloudpathlib.ipynb

Use the glob method to list all files and directories within the current directory.
```python
list(Path(".").glob("*"))
```

--------------------------------

### Run Performance Tests

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Execute performance tests using 'make perf'. This command generates a report detailing the performance of various listing, globbing, and walking operations across different configurations. Include these results in your Pull Request description.

```bash
make perf
```

--------------------------------

### Demonstrating `os.path` Functions with `CloudPath` (Unpatched)

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/patching_builtins.ipynb

This snippet illustrates the failure when using `os.path` functions like `isdir` with a `CloudPath` object before patching. It highlights the need for patching to enable `CloudPath` compatibility.

```python
import os

from cloudpathlib import patch_os_functions, CloudPath

cp = CloudPath("s3://cloudpathlib-test-bucket/patching_builtins/file.txt")
folder = cp.parent

try:
    print(os.path.isdir(folder))
except Exception as e:
    print("Unpatched version fails:")
    print(e)
```

--------------------------------

### Import pathlib

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/why_cloudpathlib.ipynb

Import the Path object from the standard pathlib library.

```python
from pathlib import Path
```

--------------------------------

### Listing Files After Writing

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/README.md

Lists all text files in the specified directory and its subdirectories after a new file has been written. Confirms the new file is now present.

```python
list(root_dir.glob('**/*.txt'))
```

--------------------------------

### LocalGSClient

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/api-reference/local.md

Represents a client for interacting with Google Cloud Storage locally.
```APIDOC
## cloudpathlib.local.LocalGSClient

### Description
A client class for managing Google Cloud Storage (GCS) resources locally.

### Usage
Instantiate this class to perform operations on GCS paths.
```

--------------------------------

### List Available FileCacheMode Options

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/caching.ipynb

Prints all available file cache modes from the FileCacheMode enum. Use these strings or enum members when configuring the cache mode.

```python
from cloudpathlib.enums import FileCacheMode

print("\n".join(FileCacheMode))
```

```text
persistent
tmp_dir
cloudpath_object
close_file
```

--------------------------------

### Register Custom Client and Path Classes

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Register your custom client and path classes with `cloudpathlib` using decorators. This allows `CloudPath` to correctly dispatch to your provider based on the URI scheme.

```python
from cloudpathlib.client import Client, register_client_class
from cloudpathlib.cloudpath import CloudPath, register_path_class


@register_client_class("my-prefix")
class MyClient(Client):
    # implementation here...
    ...


@register_path_class("my-prefix")
class MyPath(CloudPath):
    cloud_prefix: str = "my-prefix://"
    client: "MyClient"

    # implementation here...
```

--------------------------------

### Pillow with CloudPath (Patched)

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/patching_builtins.ipynb

Demonstrates successful image saving to a `CloudPath` using Pillow after patching all built-ins with `patch_all_builtins()`. This confirms that patching enables compatibility.
```python
# Assumes `Image`, `img_path`, and `patch_all_builtins` are already defined,
# as in the unpatched Pillow example.

# Patched: success with patching builtins
with patch_all_builtins():
    Image.new("RGB", (10, 10), color=(255, 0, 0)).save(img_path)

assert img_path.read_bytes()
print("With patches, Pillow successfully writes to a CloudPath")
```

--------------------------------

### Implement a Custom Path Class

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Define a new path class by inheriting from `cloudpathlib.cloudpath.CloudPath`. This class represents paths within the custom cloud provider and should specify its `cloud_prefix` and `client` type.

```python
from cloudpathlib.cloudpath import CloudPath


class MyPath(CloudPath):
    cloud_prefix: str = "my-prefix://"
    client: "MyClient"

    # implementation here...
```

--------------------------------

### Run Linting and Type Checking

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Execute Flake8 for linting and MyPy for type checking to ensure code quality and correctness. This command should be run before submitting changes.

```bash
make lint
```

--------------------------------

### Cloud Provider Abstraction in Client Class

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

The generic functionality for cloud providers, such as setting defaults and caching, is implemented in the `Client` class. This class also defines the interface that provider-specific `*Client` backends must implement.

```python
from cloudpathlib.client import Client


class S3Client(Client):
    # ... implementation for S3 ...
    pass
```

--------------------------------

### Accessing Public S3 Bucket With `no_sign_request=True`

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/authentication.md

Instantiate an `S3Client` with `no_sign_request=True` to access public S3 buckets without credentials. Use the created client to instantiate `CloudPath` objects for operations.
```python
from cloudpathlib import S3Client

c = S3Client(no_sign_request=True)

# use this client object to create the CloudPath
c.CloudPath("s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0001_5a63d42e-27c6-448a-84f1-bfc632125b8e.jpg").exists()
#> True
```

--------------------------------

### LocalS3Implementation

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/api-reference/local.md

Details the implementation for local Amazon S3 interactions.

```APIDOC
## cloudpathlib.local.local_s3_implementation

### Description
Provides the implementation details for interacting with Amazon S3 in a local context.

### Usage
This module is typically used internally by cloudpathlib to handle S3 operations.
```

--------------------------------

### Glob with CloudPath (Patched)

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/patching_builtins.ipynb

Shows how `glob.glob` and `glob.iglob` work with `CloudPath` when `patch_glob` is used. This allows `CloudPath` objects to be used as patterns or `root_dir` arguments.

```python
from glob import glob

from cloudpathlib import CloudPath, patch_glob

with patch_glob():
    print("Patched succeeds:")
    print(glob(CloudPath("s3://cloudpathlib-test-bucket/manual-tests/**/*dir*/**/*")))

    # or equivalently (the `root_dir` argument requires Python 3.10+)
    print(glob("**/*dir*/**/*", root_dir=CloudPath("s3://cloudpathlib-test-bucket/manual-tests/")))
```

--------------------------------

### Update Support Table

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Execute this script and update `README.md` if you add or remove methods from the `CloudPath` class or its subclasses. This ensures the support table in the README is accurate.

```bash
python docs/make_support_table.py
```

--------------------------------

### Instantiate S3Client with a local cache directory

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/caching.ipynb

To maintain a persistent cache, instantiate `S3Client` with the `local_cache_dir` argument.
This ensures downloaded files are stored in the specified directory and remain available after Python restarts.

```python
from cloudpathlib import S3Client

# explicitly instantiate a client that always uses the local cache
client = S3Client(local_cache_dir="data")

ladi = client.CloudPath("s3://ladi/Images/FEMA_CAP/2020/70349")
```

--------------------------------

### Instantiate CloudPath

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/README.md

Creates a `CloudPath` object, which dispatches to the appropriate cloud service path class based on the URI prefix. Authentication is handled by default via environment variables.

```python
from cloudpathlib import CloudPath

# dispatches to S3Path based on prefix
root_dir = CloudPath("s3://drivendata-public-assets/")
root_dir
#> S3Path('s3://drivendata-public-assets/')
```

--------------------------------

### LocalAzureBlobImplementation

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/api-reference/local.md

Details the implementation for local Azure Blob storage interactions.

```APIDOC
## cloudpathlib.local.local_azure_blob_implementation

### Description
Provides the implementation details for interacting with Azure Blob storage in a local context.

### Usage
This module is typically used internally by cloudpathlib to handle Azure Blob operations.
```

--------------------------------

### Azure Live Backend Test Environment Variables

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/CONTRIBUTING.md

Set these environment variables to enable live testing against Azure Blob Storage and Azure Data Lake Storage Gen2. If `AZURE_STORAGE_GEN2_CONNECTION_STRING` is not set, only blob storage will be tested.
```bash
AZURE_STORAGE_CONNECTION_STRING=your_connection_string
AZURE_STORAGE_GEN2_CONNECTION_STRING=your_connection_string
```

--------------------------------

### Pillow with CloudPath (Unpatched)

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/patching_builtins.ipynb

Illustrates that third-party libraries like Pillow fail when attempting to save directly to a `CloudPath` without patching built-ins, because they expect string or bytes file paths.

```python
from cloudpathlib import CloudPath, patch_all_builtins
from PIL import Image

base = CloudPath("s3://cloudpathlib-test-bucket/patching_builtins/third_party/")
img_path = base / "pillow_demo.png"

# Unpatched: using CloudPath directly fails
try:
    Image.new("RGB", (10, 10), color=(255, 0, 0)).save(img_path)
except Exception as e:
    print("Pillow without patch: FAILED:", e)
```

--------------------------------

### Check Cache Status

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/caching.ipynb

Verify whether files have been downloaded to the local cache. Initially, the cache directory will be empty even after listing files. Run this in a Jupyter/IPython cell: the leading `!` is IPython's shell-escape syntax, and `{ladi.fspath}` is interpolated from the Python variable `ladi`.

```bash
!tree {ladi.fspath}
```

--------------------------------

### Instantiate S3Client with Custom Endpoint

Source: https://github.com/drivendataorg/cloudpathlib/blob/master/docs/docs/authentication.md

Use this snippet to create an `S3Client` instance that connects to a custom S3-compatible object store endpoint. You can then use this client to create `CloudPath` objects or set it as the default client for all future paths.
```python
from cloudpathlib import S3Client, CloudPath

# create a client pointing to the endpoint
client = S3Client(endpoint_url="http://my.s3.server:1234")

# option 1: use the client to create paths
cp1 = client.CloudPath("s3://cloudpathlib-test-bucket/")

# option 2: pass the client as keyword argument
cp2 = CloudPath("s3://cloudpathlib-test-bucket/", client=client)

# option 3: set this client as the default so it is used in any future paths
client.set_as_default_client()
cp3 = CloudPath("s3://cloudpathlib-test-bucket/")
```
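
The three options in the snippet differ only in how a path ends up with its client. As a rough mental model, and not cloudpathlib's actual implementation, the lookup can be sketched with a standard-library-only toy. `ToyClient`, `ToyPath`, `get_default_client`, and the `https://default.example` endpoint are hypothetical names invented for this illustration; the assumption is simply that an explicit client wins and that a default only affects paths created after it is set.

```python
from typing import Optional


class ToyClient:
    """Hypothetical stand-in for a client such as S3Client (not the real class)."""

    # Class-level default shared by every path that is created without a client.
    _default: Optional["ToyClient"] = None

    def __init__(self, endpoint_url: str = "https://default.example") -> None:
        self.endpoint_url = endpoint_url

    def set_as_default_client(self) -> None:
        # Paths created after this call, without an explicit client, use this one.
        ToyClient._default = self

    @classmethod
    def get_default_client(cls) -> "ToyClient":
        # Lazily create a default client the first time one is needed.
        if cls._default is None:
            cls._default = cls()
        return cls._default

    def path(self, uri: str) -> "ToyPath":
        # Mirrors option 1: the client builds the path and attaches itself.
        return ToyPath(uri, client=self)


class ToyPath:
    """Hypothetical stand-in for CloudPath."""

    def __init__(self, uri: str, client: Optional[ToyClient] = None) -> None:
        self.uri = uri
        # Mirrors option 2: an explicit client takes priority over the default.
        self.client = client if client is not None else ToyClient.get_default_client()


custom = ToyClient(endpoint_url="http://my.s3.server:1234")

p1 = custom.path("s3://bucket/")             # option 1
p2 = ToyPath("s3://bucket/", client=custom)  # option 2
p_before = ToyPath("s3://bucket/")           # created before any default was set
custom.set_as_default_client()
p3 = ToyPath("s3://bucket/")                 # option 3

print(p1.client is custom, p2.client is custom, p3.client is custom)  # True True True
print(p_before.client is custom)  # False
```

In this toy model, `p_before` keeps the client it was created with, which matches the wording "any future paths": setting a default does not rewire paths that already exist.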