### Install pynvml Package

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/module-overview.md

Install the pynvml package using pip. Alternatively, install from source after cloning the repository.

```bash
pip install pynvml
```

```bash
cd pynvml
pip install .
```

--------------------------------

### Query String Mapping Example

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/module-overview.md

Demonstrates the mapping of string names to enumeration constants for GPU queries, such as mapping "memory.total" to NVSMI_MEMORY_TOTAL.

```python
from pynvml_utils.smi import NVSMI_QUERY_GPU

# Maps "memory.total" -> NVSMI_MEMORY_TOTAL, etc.
```

--------------------------------

### Basic GPU Query

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/module-overview.md

Get a singleton instance of nvidia_smi and query GPU information as a dictionary or XML. Auto-initializes NVML.

```python
from pynvml_utils import nvidia_smi

# Get singleton instance (auto-initializes NVML)
nvsmi = nvidia_smi.getInstance()

# Query GPU information as dictionary
results = nvsmi.DeviceQuery('memory.total, memory.free, utilization.gpu')
print(results)

# Or query as XML (matching nvidia-smi -q -x format)
xml_output = nvsmi.XmlDeviceQuery('memory.total, memory.free')
print(xml_output)
```

--------------------------------

### Query Multiple GPU Metrics

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/README.md

Query multiple metrics simultaneously from a GPU. This example shows how to request several metrics in a single call to DeviceQuery.
```python
results = nvsmi.DeviceQuery('memory.total, utilization.gpu, temperature.gpu, power.draw')
gpu = results['gpu'][0]
```

--------------------------------

### Example DeviceQuery String Output

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/types.md

Illustrates the expected multi-line string output format for DeviceQuery results, showing various fields like timestamp, driver version, GPU count, and detailed GPU information.

```text
timestamp: 2024-07-02
driver_version: 535.104.05
count: 1
gpu: [1 of 1]
  id: 0000:1E:00.0
  product_name: NVIDIA A100
  temperature:
    gpu_temp: 45
    gpu_temp_max_threshold: 95
    unit: C
  utilization:
    gpu_util: 25
    memory_util: 15
    unit: %
  # ... etc
```

--------------------------------

### Start Periodic GPU Monitoring Loop

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/api-reference-nvidia-smi.md

Create an asynchronous loop to periodically query GPU devices. Use the callback to process results at each interval. Remember to cancel the task when done.

```python
def on_result(async_task, results):
    print(f"GPU Memory: {results['gpu'][0]['fb_memory_usage']}")

task = nvidia_smi.loop(time_in_milliseconds=5000, filter='memory.free', callback=on_result)

# ... later
task.cancel()
```

--------------------------------

### Access Temperature Values in Celsius

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/types.md

Demonstrates how to retrieve GPU temperature, maximum threshold, and slowdown threshold values. It also shows how to get the temperature unit.
```python
temp = gpu['temperature']
gpu_temp = temp['gpu_temp']                            # int, e.g., 45
max_threshold = temp['gpu_temp_max_threshold']         # int, e.g., 95
slowdown_threshold = temp['gpu_temp_slow_threshold']   # int, e.g., 70
unit = temp['unit']                                    # "C"
```

--------------------------------

### Access Utilization Values in Percent

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/types.md

Shows how to get GPU utilization, memory utilization, and optionally encoder utilization percentages. It also retrieves the unit for utilization metrics.

```python
util = gpu['utilization']
gpu_pct = util['gpu_util']              # int, 0-100
mem_pct = util['memory_util']           # int, 0-100
enc_pct = util.get('encoder_util', 0)   # int, optional
unit = util['unit']                     # "%"
```

--------------------------------

### Start and Cancel Async Monitoring

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/INDEX.md

Initiates continuous GPU metric polling asynchronously. The monitoring task can be cancelled when no longer needed.

```python
task = nvidia_smi.loop(
    time_in_milliseconds=1000,
    filter='memory.free',
    callback=lambda task, result: print(result)
)

# ... do other work ...

task.cancel()  # Stop monitoring
```

--------------------------------

### Get Singleton Instance

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/INDEX.md

Obtain the singleton instance of the nvidia_smi class. This instance should be reused throughout the program for consistency.

```python
nvsmi = nvidia_smi.getInstance()
results1 = nvsmi.DeviceQuery(...)  # Same instance
results2 = nvsmi.DeviceQuery(...)  # Same instance
```

--------------------------------

### Access Power Values in Watts

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/types.md

Provides examples for accessing the current power draw and the power limit from the 'power_readings' field. It also shows how to retrieve the unit for power measurements.
```python
power = gpu['power_readings']
current_draw = power['power_draw']   # float, e.g., 125.5
limit = power['power_limit']         # float, e.g., 300.0
unit = power['unit']                 # "W"
```

--------------------------------

### Generate Comprehensive GPU Report

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/module-overview.md

This script queries all available GPU metrics using pynvml_utils and prints a detailed report including memory usage, utilization, temperature, power, and clock speeds. Ensure pynvml_utils is installed and NVML is initialized.

```python
from pynvml_utils import nvidia_smi

def generate_gpu_report():
    nvsmi = nvidia_smi.getInstance()

    # Query all metrics
    results = nvsmi.DeviceQuery()

    print(f"=== GPU Report ({results['timestamp']}) ===")
    print(f"Driver: {results['driver_version']}")
    print(f"GPUs: {results['count']}")
    print()

    for i, gpu in enumerate(results['gpu']):
        print(f"--- GPU {i}: {gpu['product_name']} ---")
        print(f"UUID: {gpu['uuid']}")
        print(f"PCI: {gpu['id']}")
        print(f"Serial: {gpu['serial']}")
        print()

        mem = gpu['fb_memory_usage']
        print(f"Memory: {mem['used']:.0f} / {mem['total']:.0f} MiB ({100*mem['used']/mem['total']:.1f}%)")

        util = gpu['utilization']
        print(f"Utilization: GPU {util['gpu_util']}%, Memory {util['memory_util']}%")

        temp = gpu['temperature']
        print(f"Temperature: {temp['gpu_temp']}°C (max {temp['gpu_temp_max_threshold']}°C)")

        power = gpu['power_readings']
        print(f"Power: {power['power_draw']:.1f} / {power['power_limit']:.1f} W")

        clocks = gpu['clocks']
        print(f"Clocks: GPU {clocks['graphics_clock']} MHz, Mem {clocks['mem_clock']} MHz")
        print()

if __name__ == '__main__':
    generate_gpu_report()
```

--------------------------------

### Brand Names Mapping Example

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/module-overview.md

Illustrates the mapping of NVML brand constants to human-readable names, like NVML_BRAND_TESLA to "Tesla".
```python
from pynvml_utils.smi import NVSMI_BRAND_NAMES

# Maps NVML_BRAND_* constants to human-readable names
# e.g., NVML_BRAND_TESLA -> "Tesla"
```

--------------------------------

### Access Clock Speeds in MHz

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/types.md

Demonstrates how to retrieve graphics clock, memory clock, and SM clock speeds. It also shows how to get the unit for clock speed measurements.

```python
clocks = gpu['clocks']
graphics = clocks['graphics_clock']   # int, e.g., 1410
memory = clocks['mem_clock']          # int, e.g., 5001
sm = clocks['sm_clock']               # int, e.g., 1410
unit = clocks['unit']                 # "MHz"
```

--------------------------------

### Simple GPU Query

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/usage-patterns.md

Get a singleton instance of the NVIDIA SMI utility and query specific GPU fields like total memory, free memory, and GPU utilization. Access results for a specific GPU.

```python
from pynvml_utils import nvidia_smi

# Get singleton instance
nvsmi = nvidia_smi.getInstance()

# Query specific fields
results = nvsmi.DeviceQuery('memory.total, memory.free, utilization.gpu')

# Access GPU 0
gpu = results['gpu'][0]
print(f"GPU Memory: {gpu['fb_memory_usage']['free']} MiB free")
print(f"GPU Usage: {gpu['utilization']['gpu_util']}%")
```

--------------------------------

### Cache Query Results to Avoid Repeated Queries

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/usage-patterns.md

Implement a caching mechanism to avoid redundant queries if the GPU data does not change frequently. This example demonstrates a simple cache with a Time-To-Live (TTL) of 1 second.
```python
import time

from pynvml_utils import nvidia_smi

# Avoid repeated queries if data doesn't change frequently
class GPUMonitor:
    def __init__(self):
        self.nvsmi = nvidia_smi.getInstance()
        self.cache = None
        self.cache_time = None
        self.cache_ttl = 1.0  # 1 second

    def get_gpu_stats(self):
        now = time.time()
        # Serve the cached result while it is younger than the TTL
        if self.cache is not None and (now - self.cache_time) < self.cache_ttl:
            return self.cache

        self.cache = self.nvsmi.DeviceQuery('memory.free, utilization.gpu, temperature.gpu')
        self.cache_time = now
        return self.cache

monitor = GPUMonitor()
stats = monitor.get_gpu_stats()  # First call queries NVML and fills the cache
stats = monitor.get_gpu_stats()  # Returns cache (if < 1 sec)
```

--------------------------------

### Get Singleton Instance of NVML

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/SUMMARY.txt

Access the singleton instance of the nvidia_smi class. This is the primary entry point for interacting with the library.

```python
from pynvml_utils import nvidia_smi

nvsmi = nvidia_smi.getInstance()
```

--------------------------------

### Query GPU Memory Information

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/README.md

Use this snippet to query the free and total memory of a GPU. Ensure the nvidia-ml-py library is installed.

```python
from pynvml_utils import nvidia_smi

nvsmi = nvidia_smi.getInstance()
nvsmi.DeviceQuery('memory.free, memory.total')
```

--------------------------------

### Query GPU Device Information (Dictionary)

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/api-reference-nvidia-smi.md

Use DeviceQuery to get GPU details as a dictionary. Supports filtering by string or enumeration constants. Useful for programmatic access to GPU metrics.
```python
nvsmi = nvidia_smi.getInstance()

# Query all fields
results = nvsmi.DeviceQuery()

# Query specific fields by string
results = nvsmi.DeviceQuery('pci.bus_id, memory.total, memory.free')

# Query by enumeration
from pynvml_utils.smi import NVSMI_MEMORY_TOTAL, NVSMI_MEMORY_FREE
results = nvsmi.DeviceQuery([NVSMI_MEMORY_TOTAL, NVSMI_MEMORY_FREE])

# Get help
print(nvsmi.DeviceQuery('--help-query-gpu'))
```

--------------------------------

### XmlDeviceQuery Return Type Example

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/types.md

The XmlDeviceQuery function returns an XML-formatted string that mirrors the structure of the nvidia-smi -q -x command output. This format provides detailed GPU information in a structured XML document.

```xml
<nvidia_smi_log>
  <timestamp>2024-07-02</timestamp>
  <driver_version>535.104.05</driver_version>
  <attached_gpus>1</attached_gpus>
  <gpu id="0000:1E:00.0">
    <product_name>NVIDIA A100</product_name>
    <product_brand>Tesla</product_brand>
  </gpu>
</nvidia_smi_log>
```

--------------------------------

### Access Memory Values in MiB

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/types.md

Shows how to access total, used, and free memory in MiB from the 'fb_memory_usage' field. Includes an example of converting MiB to GiB.

```python
mem = gpu['fb_memory_usage']
total_mib = mem['total']   # float, e.g., 81920.0
used_mib = mem['used']     # float, e.g., 8192.0
free_mib = mem['free']     # float, e.g., 73728.0
unit = mem['unit']         # "MiB"

# Convert to GiB
total_gib = total_mib / 1024  # 80.0
```

--------------------------------

### Access PCI Bus ID Information

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/types.md

Provides examples for accessing the PCI bus ID, domain, bus, device, and device ID. All values are returned as strings, often in hexadecimal format.
```python
pci = gpu['pci']
bus_id = pci['pci_bus_id']         # str, e.g., "0000:1E:00.0"
domain = pci['pci_domain']         # str, e.g., "0000" (hex)
bus = pci['pci_bus']               # str, e.g., "1E" (hex)
device = pci['pci_device']         # str, e.g., "00" (hex)
device_id = pci['pci_device_id']   # str, e.g., "20B010DE" (hex)
```

--------------------------------

### NVSMI_POWER_LIMIT Constant

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/constants-reference.md

Represents the current power limit in watts. Use this constant to get the active power limit.

```python
NVSMI_POWER_LIMIT = 142
```

--------------------------------

### NVSMI_POWER_DRAW Constant

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/constants-reference.md

Represents the current power draw in watts. Use this constant to get the real-time power consumption.

```python
NVSMI_POWER_DRAW = 141
```

--------------------------------

### Import Main Class

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/module-overview.md

Import the main `nvidia_smi` class from the pynvml_utils package.

```python
from pynvml_utils import nvidia_smi
```

--------------------------------

### NVSMI_CLOCKS_VIDEO_CUR Constant

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/constants-reference.md

Represents the current video clock in MHz. Use this constant to get the active video clock speed.

```python
NVSMI_CLOCKS_VIDEO_CUR = 153
```

--------------------------------

### System Snapshot Query

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/query-filters.md

Fetches a snapshot of system-wide information including timestamp, device count, and driver version.
```python
# System snapshot
results = nvsmi.DeviceQuery('timestamp, count, driver_version')
```

--------------------------------

### NVSMI_CLOCKS_MEMORY_CUR Constant

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/constants-reference.md

Represents the current memory clock in MHz. Use this constant to get the active memory clock speed.

```python
NVSMI_CLOCKS_MEMORY_CUR = 152
```

--------------------------------

### Query GPU Information with Multiple Constants

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/constants-reference.md

Demonstrates how to query multiple GPU metrics simultaneously using a list of pynvml constants. This is useful for gathering related performance data efficiently.

```python
from pynvml_utils.smi import (
    NVSMI_MEMORY_TOTAL,
    NVSMI_MEMORY_FREE,
    NVSMI_UTILIZATION_GPU,
    NVSMI_TEMPERATURE_GPU,
    NVSMI_POWER_DRAW
)

results = nvsmi.DeviceQuery([
    NVSMI_MEMORY_TOTAL,
    NVSMI_MEMORY_FREE,
    NVSMI_UTILIZATION_GPU,
    NVSMI_TEMPERATURE_GPU,
    NVSMI_POWER_DRAW
])
```

--------------------------------

### NVSMI_CLOCKS_GRAPHICS_CUR Constant

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/constants-reference.md

Represents the current graphics clock in MHz. Use this constant to get the active graphics clock speed.

```python
NVSMI_CLOCKS_GRAPHICS_CUR = 150
```

--------------------------------

### Query BIOS and Board Information

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/query-filters.md

Retrieve the video BIOS version and board ID for a GPU. Ensure an `nvsmi` instance has been created with `nvidia_smi.getInstance()`.

```python
results = nvsmi.DeviceQuery('vbios_version, board_id')
gpu = results['gpu'][0]
# {'vbios_version': '84.04.7F.00.75', 'board_id': '0x1234'}
```

--------------------------------

### Query Display and Compute Modes

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/query-filters.md

Retrieve persistence and compute modes for a GPU.
Ensure an `nvsmi` instance has been created with `nvidia_smi.getInstance()`.

```python
results = nvsmi.DeviceQuery('persistence_mode, compute_mode')
gpu = results['gpu'][0]
# {'persistence_mode': 'Enabled', 'compute_mode': 'Default'}
```

--------------------------------

### NVSMI_CLOCKS_APPL_MEMORY_DEFAULT Constant

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/constants-reference.md

Represents the default application memory clock in MHz. Use this constant to get the default memory clock setting.

```python
NVSMI_CLOCKS_APPL_MEMORY_DEFAULT = 157
```

--------------------------------

### NVSMI_CLOCKS_APPL_GRAPHICS_DEFAULT Constant

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/constants-reference.md

Represents the default application graphics clock in MHz. Use this constant to get the default graphics clock setting.

```python
NVSMI_CLOCKS_APPL_GRAPHICS_DEFAULT = 156
```

--------------------------------

### Display nvidia-smi Help Information

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/README.md

This snippet displays the help information for the nvidia-smi query command. It's useful for understanding available query parameters.

```python
from pynvml_utils import nvidia_smi

nvsmi = nvidia_smi.getInstance()
print(nvsmi.DeviceQuery('--help-query-gpu'), end='\n')
```

--------------------------------

### Background Monitoring with Callback

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/SUMMARY.txt

Set up a background task to monitor GPU metrics and execute a callback function whenever new data is available. This is useful for real-time dashboards or alerts.

```python
import time

from pynvml_utils import nvidia_smi

def gpu_monitor_callback(async_task, results):
    gpu = results['gpu'][0]
    print(f"[{time.strftime('%H:%M:%S')}] GPU Usage: {gpu['utilization']['gpu_util']}%")

# loop() polls in a background thread, so the caller is not blocked.
# Keep the returned task so the loop can be cancelled later.
task = nvidia_smi.loop(
    time_in_milliseconds=2000,  # Check every 2 seconds
    filter='utilization.gpu',
    callback=gpu_monitor_callback
)

# ... later
task.cancel()
```

--------------------------------

### NVSMI_CLOCKS_SM_CUR Constant

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/constants-reference.md

Represents the current SM (Streaming Multiprocessor) clock in MHz. Use this constant to get the active SM clock speed.

```python
NVSMI_CLOCKS_SM_CUR = 151
```

--------------------------------

### Handle Unsupported Features

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/INDEX.md

Demonstrates how to check for unsupported features by inspecting query results. Errors or unsupported values are returned as strings, not exceptions.

```python
value = gpu.get('ecc_mode', {}).get('current_ecc')
if value == 'N/A':
    print("Feature not supported")
```

--------------------------------

### Get Compute Processes

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/query-filters.md

Retrieves a list of running compute processes on the GPU. This is useful for monitoring which applications are actively using GPU compute resources.

```python
results = nvsmi.DeviceQuery('compute-apps')
processes = results['gpu'][0].get('processes', [])
for proc in processes:
    print(f"PID {proc['pid']}: {proc['process_name']} using {proc['used_memory']} MiB")
```

--------------------------------

### Safely Access Optional GPU Fields

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/types.md

Demonstrates how to safely access optional fields from GPU data, providing default values when fields are missing or unsupported. It also shows how to handle lists, such as the processes list, by checking for emptiness before iteration.
```python
gpu = results['gpu'][0]

# Safe access patterns
temp = gpu.get('temperature', {}).get('gpu_temp', 'N/A')
power = gpu.get('power_readings', {}).get('power_draw', 'N/A')
ecc_mode = gpu.get('ecc_mode', {}).get('current_ecc', 'Not supported')

# Fall back to an empty list when 'processes' is missing or None
processes = gpu.get('processes') or []
for proc in processes:
    print(f"{proc['process_name']}: {proc['used_memory']} MiB")
```

--------------------------------

### Complete System Monitoring Query

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/query-filters.md

Performs a comprehensive system monitoring query using NVSMI_ALL. This can be invoked by passing no arguments or by passing the NVSMI_ALL constant.

```python
# Complete system monitoring (NVSMI_ALL)
results = nvsmi.DeviceQuery()
# or nvsmi.DeviceQuery([NVSMI_ALL])
```

--------------------------------

### Use Async Monitoring for GPU Metrics

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/README.md

Implement asynchronous monitoring of GPU metrics using the loop method. This allows for periodic updates via a callback function.

```python
def on_update(task, results):
    print(f"GPU: {results['gpu'][0]['utilization']['gpu_util']}%\n")

task = nvidia_smi.loop(time_in_milliseconds=1000, filter='utilization.gpu', callback=on_update)

# ... do other work ...

task.cancel()
```

--------------------------------

### XmlDeviceQuery()

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/api-reference-nvidia-smi.md

Query GPU device information and return the results as an XML string. Output format matches `nvidia-smi -q -x`.

```APIDOC
## XmlDeviceQuery()

### Description
Query GPU device information and return the results as an XML string. Output format matches `nvidia-smi -q -x`.

### Method Signature
```python
@classmethod
def XmlDeviceQuery(self, filter: list | str | None = None) -> str
```

### Parameters
#### filter
- **Type**: list | str | None
- **Default**: None
- **Description**: Query filter (same format as DeviceQuery).

### Returns
- **Type**: str
- **Description**: XML-formatted string with structure matching nvidia-smi XML output.

### Throws
- **Type**: NVMLError
- **Description**: If NVML operation fails (caught internally, error included in XML).

### Special Filter Arguments
- Same as DeviceQuery: `"--help"`, `"-h"`, `"--help-query-gpu"`

### Example
```python
nvsmi = nvidia_smi.getInstance()

# Get all info as XML
xml_output = nvsmi.XmlDeviceQuery()
print(xml_output)

# Get filtered fields as XML
xml_output = nvsmi.XmlDeviceQuery('pci.bus_id, memory.total, utilization.gpu')
print(xml_output)

# Get available fields
print(nvsmi.XmlDeviceQuery('--help-query-gpu'))
```
```

--------------------------------

### Manage and Display Multi-GPU Information

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/module-overview.md

Query all available GPUs to retrieve product name, PCI ID, and temperature. This is useful for managing multiple GPUs in a system.

```python
results = nvsmi.DeviceQuery()
for i, gpu in enumerate(results['gpu']):
    print(f"GPU {i}: {gpu['product_name']}, PCI: {gpu['id']}, Temp: {gpu['temperature']['gpu_temp']}°C")
```

--------------------------------

### Asynchronous GPU Monitoring

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/module-overview.md

Set up asynchronous monitoring of GPU metrics with a specified polling interval and a callback function to process updates. The monitoring runs in a background thread and can be cancelled.
```python
import time

def on_update(async_task, results):
    gpu = results['gpu'][0]
    print(f"GPU Usage: {gpu['utilization']['gpu_util']}%",
          f"Temp: {gpu['temperature']['gpu_temp']}°C")

# Poll every 5 seconds
task = nvidia_smi.loop(time_in_milliseconds=5000,
                       filter='utilization.gpu, temperature.gpu',
                       callback=on_update)

# Run for 30 seconds
time.sleep(30)
task.cancel()  # Stop the background thread
```

--------------------------------

### Query System Information with String Filters

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/query-filters.md

Retrieve system information such as timestamp, driver version, and GPU count using string-based filters.

```python
results = nvsmi.DeviceQuery('timestamp, driver_version, count')
# Output: {'timestamp': '2024-07-02', 'driver_version': '535.104.05', 'count': 1}
```

--------------------------------

### Track GPU-Utilizing Processes

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/module-overview.md

Employ DeviceQuery to list compute applications and accounted applications, enabling monitoring of which processes are consuming GPU resources.

```python
results = nvsmi.DeviceQuery('compute-apps, accounted-apps')
# Monitor which processes consume GPU resources
```

--------------------------------

### Aggregate Corrected ECC Error: Total

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/constants-reference.md

Represents the total number of corrected aggregate ECC errors. Use this constant to get an overall count of persistent, corrected ECC errors across all components.

```python
NVSMI_ECC_ERROR_CORRECTED_AGGREGATE_TOTAL = 95
```

--------------------------------

### Graceful Degradation with GPU Metrics

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/usage-patterns.md

This pattern demonstrates how to safely query GPU metrics and provide a fallback response in case of errors.
It's useful for applications that need to continue running even if GPU data is temporarily unavailable.

```python
def safe_get_gpu_metrics():
    """Get GPU metrics with fallback on error"""
    try:
        # getInstance() initializes NVML, so keep it inside the try block too
        nvsmi = nvidia_smi.getInstance()
        results = nvsmi.DeviceQuery('memory.used, utilization.gpu, temperature.gpu')
        return results
    except Exception as e:
        print(f"Error querying GPU: {e}")
        # Return minimal valid response
        return {
            'gpu': [{'error': str(e)}],
            'count': 0,
        }

# Usage
data = safe_get_gpu_metrics()
if data['gpu'][0].get('error'):
    print("GPU data unavailable")
else:
    print(f"GPU usage: {data['gpu'][0]['utilization']['gpu_util']}%")
```

--------------------------------

### Volatile Corrected ECC Error: Total

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/constants-reference.md

Represents the total number of corrected volatile ECC errors. Use this constant to get an overall count of temporary, corrected ECC errors across all components.

```python
NVSMI_ECC_ERROR_CORRECTED_VOLATILE_TOTAL = 85
```

--------------------------------

### Enumeration vs. String Queries

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/SUMMARY.txt

Shows the same query written with NVSMI enumeration constants and with string field names. Using constants is generally faster.
```python
from pynvml_utils import nvidia_smi
from pynvml_utils.smi import NVSMI_UTILIZATION_GPU

nvsmi = nvidia_smi.getInstance()

# Using enumeration constants (preferred for performance)
gpu_utilization_enum = nvsmi.DeviceQuery([NVSMI_UTILIZATION_GPU])

# Using string field name (less performant)
gpu_utilization_str = nvsmi.DeviceQuery('utilization.gpu')

print("Using Enum:", gpu_utilization_enum)
print("Using String:", gpu_utilization_str)
```

--------------------------------

### Iterate and Print Process Information

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/types.md

Shows how to iterate over the 'processes' list within GPU data to extract and print process ID, name, and memory usage in MiB.

```python
for proc in gpu.get('processes', []):
    pid = proc['pid']              # int
    name = proc['process_name']    # str
    memory = proc['used_memory']   # int (MiB)
    print(f"{pid}: {name} using {memory} MiB")
```

--------------------------------

### Get Latest Result from Asynchronous GPU Query Loop

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/api-reference-nvidia-smi.md

Retrieves the most recent GPU utilization data from an active asynchronous loop. Ensure the loop has had time to collect data before calling this method. Remember to cancel the task when done.

```python
import time

task = nvidia_smi.loop(time_in_milliseconds=100)
time.sleep(0.5)  # Give the loop time to collect at least one result
latest = task.result()
if latest:
    print(f"Latest GPU usage: {latest['gpu'][0]['utilization']}")
task.cancel()
```

--------------------------------

### format()

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/api-reference-nvidia-smi.md

Format a DeviceQuery results dictionary into a human-readable string. If the input is already a string, it is returned unchanged.

```APIDOC
## format()

### Description
Format a DeviceQuery results dictionary into a human-readable string. If the input is already a string, it is returned unchanged.

### Method Signature
```python
def format(self, results: dict | str) -> str
```

### Parameters
#### results
- **Type**: dict | str
- **Description**: Results from DeviceQuery() or raw string.

### Returns
- **Type**: str
- **Description**: Formatted string representation.

### Example
```python
nvsmi = nvidia_smi.getInstance()
results = nvsmi.DeviceQuery('memory.total, memory.free')
formatted = nvsmi.format(results)
print(formatted)
```
```

--------------------------------

### Handling 'N/A' Error Values in pynvml

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/types.md

pynvml represents unsupported features or errors as string values within result dictionaries, such as 'N/A' for unsupported features or specific error messages. This example demonstrates how to check for and handle these string-based error conditions.

```python
results = nvsmi.DeviceQuery('ecc.mode.current')
ecc_value = results['gpu'][0]['ecc_mode']['current_ecc']

if ecc_value == 'N/A':
    print("ECC not supported")
elif isinstance(ecc_value, str):
    print(f"Current ECC: {ecc_value}")
```

--------------------------------

### Thermal and Power Query

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/query-filters.md

Retrieves thermal and power-related metrics such as GPU temperature, fan speed, and power draw/limit.

```python
# Thermal and power
results = nvsmi.DeviceQuery('temperature.gpu, fan.speed, power.draw, power.limit')
```

--------------------------------

### Memory and Utilization Query

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/query-filters.md

Queries for total, used, and free memory, along with GPU and memory utilization percentages.
```python
# Memory and utilization
results = nvsmi.DeviceQuery('memory.total, memory.used, memory.free, utilization.gpu, utilization.memory')
```

--------------------------------

### Iterate Through All GPUs

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/usage-patterns.md

Query all available GPUs and iterate through the results to display information such as product name and UUID for each GPU. Handles cases where the 'gpu' key might be missing.

```python
results = nvsmi.DeviceQuery()

for i, gpu in enumerate(results.get('gpu', [])):
    name = gpu.get('product_name', 'Unknown')
    uuid = gpu.get('uuid', 'N/A')
    print(f"GPU {i}: {name} ({uuid})")
```

--------------------------------

### Query Driver Model Information

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/query-filters.md

Retrieve the current and pending driver models for a GPU. The results are nested under the 'driver_model' key.

```python
results = nvsmi.DeviceQuery('driver_model.current, driver_model.pending')
driver_info = results['gpu'][0]['driver_model']
# {'current_dm': 'TCC', 'pending_dm': 'TCC'}
```

--------------------------------

### Query GPU Memory and Utilization

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/module-overview.md

Use DeviceQuery to retrieve total memory, free memory, GPU utilization, memory utilization, GPU temperature, and power draw for system monitoring dashboards.

```python
results = nvsmi.DeviceQuery('memory.total, memory.free, utilization.gpu, utilization.memory, temperature.gpu, power.draw')
# Update dashboard with results['gpu'][0] fields
```

--------------------------------

### Finding Optimal GPU

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/SUMMARY.txt

Develop a strategy to select the 'best' GPU for a given task, considering factors such as utilization and available memory. This goes beyond simple load balancing.
```python
from pynvml_utils import nvidia_smi

nvsmi = nvidia_smi.getInstance()

def find_optimal_gpu(required_memory_mib=1024):
    results = nvsmi.DeviceQuery('memory.total, memory.used, utilization.gpu')
    optimal_gpu_index = -1
    best_score = -1

    # Per-GPU entries live under the 'gpu' key of the result dictionary
    for i, gpu in enumerate(results.get('gpu', [])):
        mem = gpu.get('fb_memory_usage', {})
        mem_total = mem.get('total', 0)
        mem_used = mem.get('used', 0)
        utilization = gpu.get('utilization', {}).get('gpu_util', 0)

        # Unsupported fields are reported as strings such as 'N/A'; skip those GPUs
        if any(isinstance(v, str) for v in (mem_total, mem_used, utilization)):
            continue

        # Simple criteria: GPU has enough free memory and is not heavily utilized
        if (mem_total - mem_used) >= required_memory_mib and utilization < 70:
            # Example scoring: prioritize lower utilization
            score = 100 - utilization
            if score > best_score:
                best_score = score
                optimal_gpu_index = i

    return optimal_gpu_index

# Example usage: Find a GPU with at least 2 GiB (2048 MiB) of memory available
task_gpu_index = find_optimal_gpu(required_memory_mib=2048)
if task_gpu_index != -1:
    print(f"Found optimal GPU: {task_gpu_index}")
else:
    print("No suitable GPU found for the task.")
```

--------------------------------

### Query GPU Information as XML

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/SUMMARY.txt

Retrieve detailed information about all GPUs in the system, returned as an XML string. Use this method when XML output is specifically required.

```python
from pynvml_utils import nvidia_smi

nvsmi = nvidia_smi.getInstance()
xml_output = nvsmi.XmlDeviceQuery()
print(xml_output)
```

--------------------------------

### Query GPU with Enumeration-based Filters

Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/query-filters.md

Use a list of integer constants defined in the pynvml_utils.smi module to query GPU information.
```python from pynvml_utils.smi import NVSMI_MEMORY_TOTAL, NVSMI_MEMORY_FREE, NVSMI_PCI_BUS_ID results = nvsmi.DeviceQuery([NVSMI_PCI_BUS_ID, NVSMI_MEMORY_TOTAL, NVSMI_MEMORY_FREE]) ``` -------------------------------- ### nvidia_smi.loop() Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/api-reference-nvidia-smi.md Create an asynchronous loop that periodically queries GPU devices at specified intervals. This method allows for continuous monitoring of GPU metrics with an optional callback function to process the results. ```APIDOC ## loop() ### Description Create an asynchronous loop that periodically queries GPU devices at specified intervals with optional callback. ### Method `@staticmethod def loop(time_in_milliseconds: int = 1, filter: list | str | None = None, callback: callable | None = None) -> loop_async` ### Parameters #### Query Parameters - **time_in_milliseconds** (int) - Optional - Polling interval in milliseconds. Defaults to 1. - **filter** (list | str | None) - Optional - Query filter (enumeration list or string). - **callback** (callable | None) - Optional - Callback function invoked with results. ### Returns `loop_async` - Async task object with `.cancel()` and `.result()` methods. ### Example ```python def on_result(async_task, results): print(f"GPU Memory: {results['gpu'][0]['fb_memory_usage']}") task = nvidia_smi.loop(time_in_milliseconds=5000, filter='memory.free', callback=on_result) # ... later task.cancel() ``` ``` -------------------------------- ### loop_async Methods Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/api-reference-nvidia-smi.md Methods for managing asynchronous GPU query tasks. ```APIDOC ## loop_async.cancel() ### Description Stop the async loop and wait for thread completion. ### Method Signature ```python def cancel(self) -> None ``` ### Example ```python task = nvidia_smi.loop(time_in_milliseconds=1000) # ... 
later task.cancel() # Blocks until thread exits ``` ``` ```APIDOC ## loop_async.is_aborted() ### Description Check if async loop has been aborted. ### Method Signature ```python def is_aborted(self) -> bool ``` ### Returns `bool` - True if abort flag is set, False otherwise. ``` ```APIDOC ## loop_async.result() ### Description Get the last result from the async loop. ### Method Signature ```python def result(self) -> dict | None ``` ### Returns `dict | None` - Last DeviceQuery result, or None if not yet available. ### Example ```python task = nvidia_smi.loop(time_in_milliseconds=100) time.sleep(0.5) latest = task.result() if latest: print(f"Latest GPU usage: {latest['gpu'][0]['utilization']}") task.cancel() ``` ``` -------------------------------- ### Check for ECC Support using Constants Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/constants-reference.md Demonstrates how to check for ECC (Error Correcting Code) support on a GPU using specific pynvml constants. It retrieves the current ECC mode and checks if it's supported. ```python results = nvsmi.DeviceQuery([ NVSMI_ECC_MODE_CUR, NVSMI_ECC_ERROR_CORRECTED_VOLATILE_TOTAL ]) ecc_value = results['gpu'][0].get('ecc_mode', {}).get('current_ecc') if ecc_value == 'N/A': print("ECC not supported") else: print(f"ECC mode: {ecc_value}") ``` -------------------------------- ### Check GPU Health Metrics Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/module-overview.md Use DeviceQuery to fetch GPU temperature, fan speed, corrected ECC errors, and active clock throttling reasons for health check scripts. This helps in monitoring thresholds and triggering alerts. 
```python results = nvsmi.DeviceQuery('temperature.gpu, fan.speed, ecc.errors.corrected.volatile.total, clocks_throttle_reasons.active') # Check thresholds and alert if exceeded ``` -------------------------------- ### Query GPU with String-based Filters Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/query-filters.md Use comma-separated query strings matching nvidia-smi --query-gpu syntax to select specific GPU information fields. ```python results = nvsmi.DeviceQuery('pci.bus_id, memory.total, memory.free') ``` -------------------------------- ### NVSMI_DRIVER_VERSION Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/constants-reference.md Constant for retrieving the NVIDIA driver version string. ```APIDOC ## NVSMI_DRIVER_VERSION ### Description NVIDIA driver version string. ### Example ```python results = nvsmi.DeviceQuery([NVSMI_DRIVER_VERSION]) # Returns: {'driver_version': '535.104.05'} ``` ``` -------------------------------- ### Import Query Constants Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/module-overview.md Import various query filter constants prefixed with NVSMI_ from the `smi` module. These constants are used for filtering GPU information. ```python from pynvml_utils.smi import ( NVSMI_ALL, NVSMI_MEMORY_TOTAL, NVSMI_MEMORY_FREE, NVSMI_UTILIZATION_GPU, NVSMI_TEMPERATURE_GPU, NVSMI_POWER_DRAW, # ... and many more ) ``` -------------------------------- ### NVSMI_ALL Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/constants-reference.md Special constant to query all available GPU fields. Equivalent to providing no filter or an empty list in DeviceQuery(). ```APIDOC ## NVSMI_ALL ### Description Special constant to query all available GPU fields. Equivalent to providing no filter or an empty list in `DeviceQuery()`. 
### Example ```python results = nvsmi.DeviceQuery([NVSMI_ALL]) # Same as DeviceQuery() ``` ``` -------------------------------- ### Query All GPU Metrics Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/module-overview.md Query all available GPU metrics and access specific GPU information from the results dictionary. Displays used/total memory, temperature, and power draw. ```python # Query everything results = nvsmi.DeviceQuery() # Access GPU 0 information gpu0 = results['gpu'][0] print(f"GPU: {gpu0['product_name']}") print(f"Memory: {gpu0['fb_memory_usage']['used']} / {gpu0['fb_memory_usage']['total']} MiB") print(f"Temperature: {gpu0['temperature']['gpu_temp']}°C") print(f"Power: {gpu0['power_readings']['power_draw']} W") ``` -------------------------------- ### Define Accounting Mode Constant Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/constants-reference.md Use NVSMI_ACCT_MODE to control accounting mode. When enabled, it tracks per-process GPU resource usage. ```python NVSMI_ACCT_MODE = 21 ``` -------------------------------- ### Query Thermal and Cooling Data Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/query-filters.md Fetches GPU temperature, fan speed, and performance state. Available filters include 'temperature.gpu', 'fan.speed', and 'pstate'. ```python results = nvsmi.DeviceQuery('temperature.gpu, fan.speed, pstate') gpu = results['gpu'][0] # {'temperature': {'gpu_temp': 45}, 'fan_speed': 40, 'performance_state': 'P8', 'fan_speed_unit': '%'} ``` -------------------------------- ### NVSMI_NAME Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/constants-reference.md Constant for retrieving the GPU product name. ```APIDOC ## NVSMI_NAME ### Description GPU product name (e.g., "NVIDIA A100", "GeForce RTX 3090"). 
``` -------------------------------- ### GPU Query Result Structure Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/INDEX.md Illustrates the nested dictionary structure of query results, including timestamp, driver version, and GPU-specific details. Note that many fields are optional. ```python { 'timestamp': str, 'driver_version': str, 'count': int, 'gpu': [ { 'id': str, 'product_name': str, 'fb_memory_usage': {'total': float, 'used': float, 'free': float}, # ... 30+ more optional fields } ] } ``` -------------------------------- ### DeviceQuery() Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/api-reference-nvidia-smi.md Query GPU device information and return results as nested dictionaries. Supports filtering by query string or enumeration constants. ```APIDOC ## DeviceQuery() ### Description Query GPU device information and return results as nested dictionaries. Supports filtering by query string or enumeration constants. ### Method Signature ```python @classmethod def DeviceQuery(self, filter: list | str | None = None) -> dict ``` ### Parameters #### filter - **Type**: list | str | None - **Default**: None - **Description**: Query filter. Can be: (1) None for all fields, (2) comma-separated string like `"pci.bus_id,memory.total"`, (3) list of NVSMI_* enumeration constants ### Returns - **Type**: dict - **Description**: Dictionary with GPU device information. Includes fields like timestamp, driver_version, count, and a list of GPU objects, each containing details such as id, product_name, memory usage, utilization, power readings, temperature, and clocks. Additional fields can be included based on the filter. ### Throws - **Type**: NVMLError - **Description**: If NVML operation fails (caught internally, returns error string in results) ### Special Filter Arguments - `"--help"` or `"-h"`: Returns method docstring. - `"--help-query-gpu"`: Returns available query fields. 
### Example ```python nvsmi = nvidia_smi.getInstance() # Query all fields results = nvsmi.DeviceQuery() # Query specific fields by string results = nvsmi.DeviceQuery('pci.bus_id, memory.total, memory.free') # Query by enumeration from pynvml_utils.smi import NVSMI_MEMORY_TOTAL, NVSMI_MEMORY_FREE results = nvsmi.DeviceQuery([NVSMI_MEMORY_TOTAL, NVSMI_MEMORY_FREE]) # Get help print(nvsmi.DeviceQuery('--help-query-gpu')) ``` ``` -------------------------------- ### Collect and Export GPU Time Series Metrics Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/usage-patterns.md Records GPU utilization, memory utilization, temperature, and power draw at regular intervals and exports the collected data as a CSV file. This pattern is useful for performance monitoring over time. ```python import time from datetime import datetime class TimedMetrics: def __init__(self): self.samples = [] def record(self, gpu_data): """Record GPU metrics with timestamp""" self.samples.append({ 'timestamp': datetime.now().isoformat(), 'gpu_util': gpu_data['utilization']['gpu_util'], 'mem_util': gpu_data['utilization']['memory_util'], 'temperature': gpu_data['temperature']['gpu_temp'], 'power': gpu_data['power_readings']['power_draw'], }) def export_csv(self, filename): """Export as CSV""" import csv if not self.samples: return with open(filename, 'w', newline='') as f: writer = csv.DictWriter(f, fieldnames=self.samples[0].keys()) writer.writeheader() writer.writerows(self.samples) # Usage nvsmi = nvidia_smi.getInstance() metrics = TimedMetrics() for _ in range(60): results = nvsmi.DeviceQuery('utilization.gpu, utilization.memory, temperature.gpu, power.draw') gpu = results['gpu'][0] metrics.record(gpu) time.sleep(1) metrics.export_csv('gpu_metrics.csv') ``` -------------------------------- ### Query GPU Power Draw and Limit Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/query-filters.md Use this snippet to query the current power draw and power 
limit of a GPU. The results include power management status. ```python results = nvsmi.DeviceQuery('power.draw, power.limit, power.management') power = results['gpu'][0]['power_readings'] # {'power_draw': 125.5, 'power_limit': 300.0, 'power_management': 'Supported', 'unit': 'W'} ``` -------------------------------- ### Query GPU Device Information (XML) Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/api-reference-nvidia-smi.md Use XmlDeviceQuery to retrieve GPU information in XML format, similar to `nvidia-smi -q -x`. Supports the same filtering as DeviceQuery. Useful for integrating with XML parsers. ```python nvsmi = nvidia_smi.getInstance() # Get all info as XML xml_output = nvsmi.XmlDeviceQuery() print(xml_output) # Get filtered fields as XML xml_output = nvsmi.XmlDeviceQuery('pci.bus_id, memory.total, utilization.gpu') print(xml_output) # Get available fields print(nvsmi.XmlDeviceQuery('--help-query-gpu')) ``` -------------------------------- ### Format GPU Query Results Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/api-reference-nvidia-smi.md Use the format method to convert DeviceQuery results (dictionary or raw string) into a human-readable string. Useful for displaying GPU information to users. ```python nvsmi = nvidia_smi.getInstance() results = nvsmi.DeviceQuery('memory.total, memory.free') formatted = nvsmi.format(results) print(formatted) ``` -------------------------------- ### Polling with Result Checking Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/usage-patterns.md This pattern is useful when you need to periodically check for GPU metric results and process them only when available. It's suitable for scenarios where immediate reaction to every update isn't critical, but you need to act on collected data within a certain timeframe. 
```python import time nvsmi = nvidia_smi.getInstance() task = nvsmi.loop( time_in_milliseconds=1000, filter='memory.free, temperature.gpu' ) for _ in range(10): time.sleep(1) result = task.result() if result: gpu = result['gpu'][0] mem_free = gpu['fb_memory_usage'].get('free', 0) temp = gpu['temperature'].get('gpu_temp', 0) if temp > 80: print(f"WARNING: GPU too hot! {temp}°C") if mem_free < 1024: print(f"WARNING: Low memory! {mem_free} MiB free") task.cancel() ``` -------------------------------- ### Compare String vs Enumeration for Device Queries Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/constants-reference.md Compares querying GPU information using string-based filters versus enumeration constants. Enumeration-based queries are generally faster for repeated use. ```python # These two are equivalent: results1 = nvsmi.DeviceQuery('memory.total, utilization.gpu, temperature.gpu') results2 = nvsmi.DeviceQuery([ NVSMI_MEMORY_TOTAL, NVSMI_UTILIZATION_GPU, NVSMI_TEMPERATURE_GPU ]) # Enumeration-based is faster for repeated queries ``` -------------------------------- ### Asynchronous Data Collection Source: https://github.com/gpuopenanalytics/pynvml/blob/master/_autodocs/usage-patterns.md Implement this pattern to collect GPU metrics over time and store them for later analysis, such as calculating averages or identifying maximum values. This is ideal for performance logging or post-execution analysis. 
```python
import time
from collections import deque

from pynvml_utils import nvidia_smi


class MetricsCollector:
    def __init__(self, max_samples=100):
        self.nvsmi = nvidia_smi.getInstance()
        # Bounded buffer: the oldest samples are dropped once max_samples is reached
        self.metrics = deque(maxlen=max_samples)
        self.task = self.nvsmi.loop(
            time_in_milliseconds=1000,
            filter='utilization.gpu, temperature.gpu, power.draw',
            callback=self._collect
        )

    def _collect(self, async_task, results):
        """Callback invoked by the loop with each DeviceQuery result."""
        gpu = results['gpu'][0]
        sample = {
            'timestamp': time.time(),
            'gpu_util': gpu['utilization'].get('gpu_util', 0),
            'temperature': gpu['temperature'].get('gpu_temp', 0),
            'power': gpu['power_readings'].get('power_draw', 0),
        }
        self.metrics.append(sample)

    def get_average_util(self):
        if not self.metrics:
            return 0
        return sum(m['gpu_util'] for m in self.metrics) / len(self.metrics)

    def get_max_temp(self):
        if not self.metrics:
            return 0
        return max(m['temperature'] for m in self.metrics)

    def stop(self):
        # Stops the background loop and waits for its thread to exit
        self.task.cancel()


# Usage
collector = MetricsCollector(max_samples=60)
time.sleep(60)  # Collect for 1 minute

print(f"Average GPU util: {collector.get_average_util()}%")
print(f"Max temperature: {collector.get_max_temp()}°C")

collector.stop()
```
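
--------------------------------

### Summarize Collected Samples (illustrative sketch)

The post-execution analysis mentioned above can be done on the stored samples alone, without querying the GPU again. The following is a minimal sketch, not part of pynvml: `summarize_samples` is a hypothetical helper, and the sample values are made up. It assumes each sample is a dictionary with the `gpu_util`, `temperature`, and `power` keys used by the collector, and that unsupported fields may appear as strings such as `'N/A'`, which it ignores.

```python
from statistics import mean


def summarize_samples(samples, keys=("gpu_util", "temperature", "power")):
    """Return {key: {'min', 'max', 'mean', 'count'}} over numeric sample values.

    Hypothetical helper for illustration; it is not provided by pynvml.
    """
    samples = list(samples)  # accept lists, deques, or other iterables
    summary = {}
    for key in keys:
        # Keep only real numbers; bool is excluded because it subclasses int
        values = [
            s[key] for s in samples
            if isinstance(s.get(key), (int, float)) and not isinstance(s.get(key), bool)
        ]
        if not values:
            # No usable readings for this key
            summary[key] = None
            continue
        summary[key] = {
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
            "count": len(values),
        }
    return summary


# Made-up samples in the shape produced by a collector callback
example_samples = [
    {"timestamp": 0.0, "gpu_util": 20, "temperature": 50, "power": 100.0},
    {"timestamp": 1.0, "gpu_util": 40, "temperature": 54, "power": 120.0},
    {"timestamp": 2.0, "gpu_util": "N/A", "temperature": 52, "power": 110.0},
]
example_summary = summarize_samples(example_samples)
print(example_summary["gpu_util"])
```

With a collector like the one above, the same helper could be called as `summarize_samples(collector.metrics)` after `collector.stop()`. Reporting `count` alongside each statistic makes it visible when readings were skipped.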