### Downloading and Preparing PyTorch Distributed Examples

Source: https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/slurm/setup_pcluster_slurm.md

This snippet navigates to `/shared`, clones the PyTorch examples repository (shallow clone), filters it to retain only the `distributed/ddp-tutorial-series` subdirectory, installs a specific version of `setuptools`, and then installs project dependencies from `requirements.txt`.

```Shell
cd /shared
git clone --depth 1 https://github.com/pytorch/examples;
cd /shared/examples
git filter-branch --prune-empty --subdirectory-filter distributed/ddp-tutorial-series
python3 -m pip install setuptools==59.5.0
pip install -r requirements.txt
```

--------------------------------

### Downloading Training Code and Installing Requirements (Shell)

Source: https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/slurm/setup_pcluster_slurm.md

These commands navigate to the `/shared` directory, clone the PyTorch examples repository, filter it to the `minGPT-ddp` distributed example, and then install specific Python dependencies, including `setuptools` and those listed in the `requirements.txt` file, to prepare the environment for training.

```Shell
cd /shared
git clone --depth 1 https://github.com/pytorch/examples;
cd /shared/examples
git filter-branch --prune-empty --subdirectory-filter distributed/minGPT-ddp
python3 -m pip install setuptools==59.5.0
pip install -r requirements.txt
```

--------------------------------

### Installing Python Dependencies (Bash)

Source: https://github.com/pytorch/examples/blob/main/distributed/FSDP/README.md

This command installs all required Python packages listed in the `requirements.txt` file using pip. It's essential for setting up the Python environment needed to run the T5 training script.

```bash
pip install -r requirements.txt
```

--------------------------------

### Installing Dependencies and Running MNIST Hogwild Example (Bash)

Source: https://github.com/pytorch/examples/blob/main/mnist_hogwild/README.md

This snippet provides the commands to install the necessary Python dependencies from `requirements.txt` and then execute the `main.py` script to start the MNIST Hogwild training example.

```bash
pip install -r requirements.txt
python main.py
```

--------------------------------

### Installing Dependencies and Running Main Script (Bash)

Source: https://github.com/pytorch/examples/blob/main/mnist_forward_forward/README.md

This snippet provides instructions to set up the project environment by installing required Python packages from `requirements.txt` and then executing the main training script `main.py`. This is the standard way to prepare and run the Forward-Forward algorithm example.

```Bash
pip install -r requirements.txt
python main.py
```

--------------------------------

### Starting FSDP T5 Training with Torchrun (Bash)

Source: https://github.com/pytorch/examples/blob/main/distributed/FSDP/README.md

This command starts distributed training of the T5 model using `torchrun`. It specifies one node and four processes per node; adjust `--nproc_per_node` to match the number of available GPUs. The `T5_training.py` script contains the core training logic.

```bash
torchrun --nnodes 1 --nproc_per_node 4 T5_training.py
```

--------------------------------

### Installing Dependencies for PyTorch RL Examples - Bash

Source: https://github.com/pytorch/examples/blob/main/reinforcement_learning/README.md

This command installs all necessary Python packages listed in the `requirements.txt` file. It is a prerequisite for running any of the reinforcement learning examples.

```bash
pip install -r requirements.txt
```

--------------------------------

### Installing Python Dependencies and Virtual Environment

Source: https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/slurm/setup_pcluster_slurm.md

This sequence of commands updates package lists, installs `python3-venv`, creates a Python virtual environment at `/shared/venv/`, activates it, installs the `wheel` package, and adds the activation command to `.bashrc` for persistent environment loading.

```Shell
sudo apt-get update
sudo apt-get install -y python3-venv
python3 -m venv /shared/venv/
source /shared/venv/bin/activate
pip install wheel
echo 'source /shared/venv/bin/activate' >> ~/.bashrc
```

--------------------------------

### Downloading Wikihow Dataset (Bash)

Source: https://github.com/pytorch/examples/blob/main/distributed/FSDP/README.md

This snippet executes a shell script to download the 'wikihow' dataset, which is a prerequisite for the T5 text summarization example. It ensures the necessary data is available before training.

```bash
sh download_dataset.sh
```

--------------------------------

### Running Distributed Pipeline Parallel Example

Source: https://github.com/pytorch/examples/blob/main/distributed/rpc/pipeline/README.md

This snippet provides the commands to set up the environment and execute the distributed pipeline parallel example. It first installs the necessary dependencies from `requirements.txt` and then runs the main script `main.py` to start the distributed application.

```Shell
pip install -r requirements.txt
python main.py
```

--------------------------------

### Installing Dependencies and Running the Main Script (Bash)

Source: https://github.com/pytorch/examples/blob/main/vae/README.md

This snippet provides the necessary commands to set up the project by installing its dependencies from 'requirements.txt' and then executing the main training script.
This is a standard procedure for initializing and running Python-based projects.

```bash
pip install -r requirements.txt
python main.py
```

--------------------------------

### Building Documentation for New PyTorch Examples (Shell)

Source: https://github.com/pytorch/examples/blob/main/CONTRIBUTING.md

This script navigates to the `docs` directory, sets up a Python virtual environment, installs documentation dependencies from `requirements.txt`, and builds the HTML documentation. It's a prerequisite for verifying documentation changes for new examples.

```Shell
cd docs
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
make html
```

--------------------------------

### Running Distributed DataParallel and RPC Example

Source: https://github.com/pytorch/examples/blob/main/distributed/rpc/ddp_rpc/README.md

This snippet provides the commands to set up the environment and run the PyTorch distributed example. It first installs necessary dependencies from `requirements.txt` and then executes the main training script `main.py`.

```Bash
pip install -r requirements.txt
python main.py
```

--------------------------------

### Running the Distributed Reinforcement Learning Example (Shell)

Source: https://github.com/pytorch/examples/blob/main/distributed/rpc/rl/README.md

This snippet provides the commands to set up and run the distributed reinforcement learning example. It first installs the necessary Python dependencies listed in `requirements.txt` and then executes the main application script `main.py`.

```Shell
pip install -r requirements.txt
python main.py
```

--------------------------------

### Installing Python Dependencies on Cluster (Shell)

Source: https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/slurm/setup_pcluster_slurm.md

This sequence of commands updates the package lists, installs `python3-venv`, creates a Python virtual environment at `/shared/venv/`, activates it, installs the `wheel` package, and appends the activation command to `~/.bashrc` so the virtual environment is activated automatically in new shells.

```Shell
sudo apt-get update
sudo apt-get install -y python3-venv
python3 -m venv /shared/venv/
source /shared/venv/bin/activate
pip install wheel
echo 'source /shared/venv/bin/activate' >> ~/.bashrc
```

--------------------------------

### Installing AWS CLI and ParallelCluster with pip

Source: https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/slurm/setup_pcluster_slurm.md

This snippet installs the AWS Command Line Interface (CLI) and AWS ParallelCluster using pip3. The `--user` flag ensures installation into the user's home directory, avoiding system-wide changes, and `-U` or `--upgrade` ensures the latest versions are installed.

```Shell
pip3 install awscli -U --user
pip3 install "aws-parallelcluster" --upgrade --user
```

--------------------------------

### Installing AWS CLI and ParallelCluster (Shell)

Source: https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/slurm/setup_pcluster_slurm.md

This snippet installs the AWS Command Line Interface (CLI) and AWS ParallelCluster using pip3. It upgrades both packages to their latest versions and installs them for the current user only, which is a prerequisite for managing AWS resources and clusters.

```Shell
pip3 install awscli -U --user
pip3 install "aws-parallelcluster" --upgrade --user
```

--------------------------------

### Previewing Sphinx Documentation Locally (Shell)

Source: https://github.com/pytorch/examples/blob/main/CONTRIBUTING.md

This command uses `sphinx-serve` to host the built Sphinx documentation locally, allowing contributors to preview their changes in a web browser. `sphinx-serve` must be installed separately.

```Shell
sphinx-serve -b build
```

--------------------------------

### Installing Dependencies and Running Main Script (Bash)

Source: https://github.com/pytorch/examples/blob/main/siamese_network/README.md

This snippet provides commands to install necessary Python dependencies from `requirements.txt` and to execute the main `main.py` script. It also includes an optional command to specify a GPU ID using `CUDA_VISIBLE_DEVICES` for execution on a specific device.

```bash
pip install -r requirements.txt
python main.py
# CUDA_VISIBLE_DEVICES=2 python main.py  # to specify GPU id to ex. 2
```

--------------------------------

### Installing Dependencies and Running Example

Source: https://github.com/pytorch/examples/blob/main/distributed/tensor_parallelism/README.md

This snippet outlines the necessary steps to prepare the environment by installing all required Python packages from 'requirements.txt' and then executing the main 'example.py' script to run the PyTorch Tensor Parallel demonstration.

```Shell
pip install -r requirements.txt
python example.py
```

--------------------------------

### Installing GCN Dependencies and Running Main Script - Bash

Source: https://github.com/pytorch/examples/blob/main/gcn/README.md

This snippet outlines the steps to set up the project environment and run the example. It first installs all required Python packages listed in `requirements.txt` using pip, and then runs the `main.py` script, the example's entry point.

```bash
pip install -r requirements.txt
python main.py
```

--------------------------------

### Running the PyTorch MNIST RNN Example (Bash)

Source: https://github.com/pytorch/examples/blob/main/mnist_rnn/README.md

This snippet provides commands to set up the environment and run the PyTorch MNIST RNN example. It includes installing dependencies and executing the main script, with an optional command to specify a GPU ID.

```bash
pip install -r requirements.txt
python main.py
# CUDA_VISIBLE_DEVICES=2 python main.py  # to specify GPU id to ex. 2
```

--------------------------------

### Running Synchronized Batch Update Parameter Server Example (Shell)

Source: https://github.com/pytorch/examples/blob/main/distributed/rpc/batch/README.md

This snippet provides the commands to set up and run the Synchronized Batch Update Parameter Server example. It installs necessary dependencies from `requirements.txt` and then executes the `parameter_server.py` script, which utilizes `@rpc.functions.async_execution` for parameter updates and retrieval.

```Shell
pip install -r requirements.txt
python parameter_server.py
```

--------------------------------

### Running Multi-Observer with Batch-Processing Agent Example (Shell)

Source: https://github.com/pytorch/examples/blob/main/distributed/rpc/batch/README.md

This snippet provides the commands to set up and run the Multi-Observer with Batch-Processing Agent example. It installs necessary dependencies from `requirements.txt` and then executes the `reinforce.py` script, which uses `@rpc.functions.async_execution` to process multiple observed states through a policy.

```Shell
pip install -r requirements.txt
python reinforce.py
```

--------------------------------

### Building the DCGAN Example with CMake and Make

Source: https://github.com/pytorch/examples/blob/main/cpp/dcgan/README.md

This snippet provides the shell commands required to build the DCGAN example using CMake and Make.
It navigates into the `dcgan` directory, creates a build directory, configures the project with CMake by specifying the `LibTorch` installation path, and then compiles the project using Make. Prerequisites include a C++ compiler, CMake, and the PyTorch LibTorch distribution.

```shell
$ cd dcgan
$ mkdir build
$ cd build
$ cmake -DCMAKE_PREFIX_PATH=/path/to/libtorch ..
$ make
```

--------------------------------

### Building PyTorch C++ Frontend Example

Source: https://github.com/pytorch/examples/blob/main/cpp/custom-dataset/README.md

This snippet provides the shell commands required to build the custom dataset example using CMake and Make. It assumes `libtorch` is installed and its path is provided via `CMAKE_PREFIX_PATH`. Troubleshooting tips for OpenCV compatibility are also mentioned.

```Shell
$ cd custom-dataset
$ mkdir build
$ cd build
$ cmake -DCMAKE_PREFIX_PATH=/path/to/libtorch ..
$ make
```

--------------------------------

### Running the Distributed RNN Example

Source: https://github.com/pytorch/examples/blob/main/distributed/rpc/rnn/README.md

This snippet provides the commands to install necessary Python dependencies and then execute the main application script for the distributed RNN model example.

```Shell
pip install -r requirements.txt
python main.py
```

--------------------------------

### Setting Up Project Dependencies

Source: https://github.com/pytorch/examples/blob/main/language_translation/README.md

These commands install all project dependencies listed in `requirements.txt` and download a specified Spacy language model, providing a quick start for the language translation project.

```bash
pip install -r requirements.txt
python3 -m spacy download
```

--------------------------------

### Installing Torchtext Library

Source: https://github.com/pytorch/examples/blob/main/language_translation/README.md

This command installs the Torchtext library, a dependency for handling text processing and datasets in the PyTorch language translation example.

```bash
pip install torchtext
```

--------------------------------

### Configuring AWS ParallelCluster

Source: https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/slurm/setup_pcluster_slurm.md

This command initiates the configuration process for AWS ParallelCluster, using `config.yaml` as the target configuration file. It guides the user through setting up cluster parameters, which can then be reviewed and modified in the specified YAML file.

```Shell
pcluster configure --config config.yaml
```

--------------------------------

### Running MNIST Example with PyTorch (Bash)

Source: https://github.com/pytorch/examples/blob/main/mnist/README.md

This snippet provides the necessary bash commands to install project dependencies from 'requirements.txt' and execute the main PyTorch MNIST script. It also includes an optional command to specify a particular GPU ID using the 'CUDA_VISIBLE_DEVICES' environment variable for execution.

```bash
pip install -r requirements.txt
python main.py
# CUDA_VISIBLE_DEVICES=2 python main.py  # to specify GPU id to ex. 2
```

--------------------------------

### Building and Installing OpenCV from Source on Linux

Source: https://github.com/pytorch/examples/blob/main/cpp/tools/InstallingOpenCV.md

This sequence of commands clones the OpenCV repositories, configures the build using CMake, compiles the project with parallel jobs, and installs it to the specified prefix. This is a standard procedure for building large C++ projects from source.

```shell
git clone https://github.com/opencv/opencv.git
git clone https://github.com/opencv/opencv_contrib.git
cd opencv && mkdir build && cd build
cmake -D CMAKE_BUILD_TYPE=Release -D CMAKE_INSTALL_PREFIX=/usr/local ..
make -j8  # runs 8 jobs in parallel
sudo make install
```

--------------------------------

### Installing Spacy Language Models

Source: https://github.com/pytorch/examples/blob/main/language_translation/README.md

These commands install language models for Spacy, which are required for tokenization in the language translation example. They show the generic download command, followed by the commands for the English and German models.

```bash
python3 -m spacy download
python3 -m spacy download en
python3 -m spacy download de
```

--------------------------------

### Cloning PyTorch GAT Example Repository (Bash)

Source: https://github.com/pytorch/examples/blob/main/gat/README.md

This command sequence clones the PyTorch examples repository from GitHub and navigates into the `examples/gat` directory, which contains the GAT model implementation. This is the initial step to set up the project locally.

```bash
git clone https://github.com/pytorch/examples.git
cd examples/gat
```

--------------------------------

### Example Output of PyTorch DDP Application Launch in Shell

Source: https://github.com/pytorch/examples/blob/main/distributed/ddp/README.md

This shell output illustrates the console logs generated when launching a PyTorch DDP application. It shows each of the eight processes initializing the process group with its own rank and the shared world size, and confirms the backend (NCCL) and device assignments for each process. This output helps verify the correct distributed setup.

```Shell
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[238627] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '0', 'WORLD_SIZE': '8'}
[238630] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '3', 'WORLD_SIZE': '8'}
[238628] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '1', 'WORLD_SIZE': '8'}
[238634] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '7', 'WORLD_SIZE': '8'}
[238631] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '4', 'WORLD_SIZE': '8'}
[238632] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '5', 'WORLD_SIZE': '8'}
[238629] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '2', 'WORLD_SIZE': '8'}
[238633] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '6', 'WORLD_SIZE': '8'}
[238633] world_size = 8, rank = 6, backend=nccl
[238628] world_size = 8, rank = 1, backend=nccl
[238629] world_size = 8, rank = 2, backend=nccl
[238631] world_size = 8, rank = 4, backend=nccl
[238630] world_size = 8, rank = 3, backend=nccl
[238632] world_size = 8, rank = 5, backend=nccl
[238634] world_size = 8, rank = 7, backend=nccl
[238627] world_size = 8, rank = 0, backend=nccl
[238633] rank = 6, world_size = 8, n = 1, device_ids = [6]
[238628] rank = 1, world_size = 8, n = 1, device_ids = [1]
[238632] rank = 5, world_size = 8, n = 1, device_ids = [5]
[238634] rank = 7, world_size = 8, n = 1, device_ids = [7]
[238629] rank = 2, world_size = 8, n = 1, device_ids = [2]
[238630] rank = 3, world_size = 8, n = 1, device_ids = [3]
[238631] rank = 4, world_size = 8, n = 1, device_ids = [4]
[238627] rank = 0, world_size = 8, n = 1, device_ids = [0]
```

--------------------------------

### Building PyTorch C++ MNIST Example (Shell)

Source: https://github.com/pytorch/examples/blob/main/cpp/mnist/README.md

These shell commands navigate into the `mnist` directory, create a `build` directory, change into it, configure the build with CMake, specifying the path to the LibTorch distribution, and then compile the project using `make`.

```Shell
$ cd mnist
$ mkdir build
$ cd build
$ cmake -DCMAKE_PREFIX_PATH=/path/to/libtorch ..
$ make
```

--------------------------------

### Building PyTorch C++ Autograd Example

Source: https://github.com/pytorch/examples/blob/main/cpp/autograd/README.md

This snippet provides the shell commands required to build the PyTorch C++ autograd example. It involves navigating to the `autograd` directory, creating a build directory, configuring the project with CMake, and compiling it using `make`. The `CMAKE_PREFIX_PATH` must point to the unzipped LibTorch distribution.

```shell
cd autograd
mkdir build
cd build
cmake -DCMAKE_PREFIX_PATH=/path/to/libtorch ..
make
```

--------------------------------

### Launching PyTorch Distributed Training with `launch.py`

Source: https://github.com/pytorch/examples/blob/main/distributed/ddp/README.md

This command executes a PyTorch distributed training script (`example.py`) using a launcher utility (`launch.py`). It configures a single node with one process per node, setting the local world size to one, simplifying the setup of distributed training environments.

```Shell
python /path/to/launch.py --nnode=1 --node_rank=0 --nproc_per_node=1 example.py --local_world_size=1
```

--------------------------------

### Building Linear Regression Example with CMake (Shell)

Source: https://github.com/pytorch/examples/blob/main/cpp/regression/README.md

This snippet provides the shell commands required to navigate into the regression example directory, create a build directory, configure the project with CMake, linking against the LibTorch distribution, and compile the project using `make`.
The `/path/to/libtorch` placeholder should be replaced with the actual path to your unzipped LibTorch distribution.

```shell
$ cd regression
$ mkdir build
$ cd build
$ cmake -DCMAKE_PREFIX_PATH=/path/to/libtorch ..
$ make
```

--------------------------------

### Launching RPC Master Node (Rank 0)

Source: https://github.com/pytorch/examples/blob/main/distributed/rpc/parameter_server/README.md

This specific command launches the master node (server) for the RPC-based training example with a `WORLD_SIZE` of 2 and a `RANK` of 0. It should be run in a separate terminal window to initiate the distributed environment.

```Shell
python rpc_parameter_server.py --world_size=2 --rank=0
```

--------------------------------

### Executing Script on GPU with Accelerator (Bash)

Source: https://github.com/pytorch/examples/blob/main/siamese_network/README.md

This command demonstrates how to execute the `main.py` script utilizing a detected GPU by adding the `--accel` argument. This enables accelerated computation for the Siamese network example.

```bash
python main.py --accel
```

--------------------------------

### Command-Line Arguments for main.py (Bash)

Source: https://github.com/pytorch/examples/blob/main/vae/README.md

This snippet lists the optional command-line arguments available for customizing the execution of the 'main.py' script. These arguments allow users to control various training parameters such as batch size, number of epochs, hardware acceleration, random seed, and logging frequency.

```bash
--batch-size     input batch size for training (default: 128)
--epochs         number of epochs to train (default: 10)
--accel          use accelerator
--seed           random seed (default: 1)
--log-interval   how many batches to wait before logging training status
```

--------------------------------

### Launching RPC Trainer Node (Rank 1)

Source: https://github.com/pytorch/examples/blob/main/distributed/rpc/parameter_server/README.md

This command launches a trainer node for the RPC-based training example with a `WORLD_SIZE` of 2 and a `RANK` of 1. It should be run in a separate terminal window to begin training with the server launched by the master node.

```Shell
python rpc_parameter_server.py --world_size=2 --rank=1
```

--------------------------------

### Running PyTorch Python Examples Locally (Shell)

Source: https://github.com/pytorch/examples/blob/main/CONTRIBUTING.md

This command executes the `run_python_examples.sh` script, ensuring it runs within a specified virtual environment (`.venv`). This is crucial for verifying bug fixes and ensuring all tests pass locally before submitting a pull request.

```Shell
VIRTUAL_ENV=.venv ./run_python_examples.sh
```

--------------------------------

### Executing PyTorch C++ MNIST Model Training (Shell)

Source: https://github.com/pytorch/examples/blob/main/cpp/mnist/README.md

This shell command executes the compiled `mnist` binary to start the model training process. The output shows the training progress across multiple epochs, including the training loss and the test-set average loss and accuracy.

```Shell
$ ./mnist
Train Epoch: 1 [59584/60000] Loss: 0.4232
Test set: Average loss: 0.1989 | Accuracy: 0.940
Train Epoch: 2 [59584/60000] Loss: 0.1926
Test set: Average loss: 0.1338 | Accuracy: 0.959
Train Epoch: 3 [59584/60000] Loss: 0.1390
Test set: Average loss: 0.0997 | Accuracy: 0.969
Train Epoch: 4 [59584/60000] Loss: 0.1239
Test set: Average loss: 0.0875 | Accuracy: 0.972
...
```

--------------------------------

### Listing AWS ParallelClusters

Source: https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/slurm/setup_pcluster_slurm.md

This command lists all AWS ParallelClusters associated with the current AWS account. It is used to track the status and details of existing clusters, including those currently being created or updated.

```Shell
pcluster list-clusters
```

--------------------------------

### Creating an AWS ParallelCluster

Source: https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/slurm/setup_pcluster_slurm.md

This command creates a new AWS ParallelCluster named `dist-ml` based on the settings defined in `config.yaml`. It provisions the necessary AWS resources, including compute instances and networking, to form the cluster.

```Shell
pcluster create-cluster --cluster-name dist-ml --cluster-configuration config.yaml
```

--------------------------------

### Executing DCGAN Training with Default Epochs

Source: https://github.com/pytorch/examples/blob/main/cpp/dcgan/README.md

This command executes the compiled DCGAN binary to start the training process. By default, it trains for 30 epochs, displaying loss values for the discriminator (D_loss) and generator (G_loss) at regular intervals, along with checkpoint indicators. This requires the `dcgan` binary to be successfully built and located in the current directory.

```shell
$ ./dcgan
[ 1/30][200/938] D_loss: 0.4953 | G_loss: 4.0195
-> checkpoint 1
[ 1/30][400/938] D_loss: 0.3610 | G_loss: 4.8148
-> checkpoint 2
[ 1/30][600/938] D_loss: 0.4072 | G_loss: 4.36760
-> checkpoint 3
[ 1/30][800/938] D_loss: 0.4444 | G_loss: 4.0250
-> checkpoint 4
[ 2/30][200/938] D_loss: 0.3761 | G_loss: 3.8790
-> checkpoint 5
[ 2/30][400/938] D_loss: 0.3977 | G_loss: 3.3315
-> checkpoint 6
[ 2/30][600/938] D_loss: 0.3815 | G_loss: 3.5696
-> checkpoint 7
[ 2/30][800/938] D_loss: 0.4039 | G_loss: 3.2759
-> checkpoint 8
[ 3/30][200/938] D_loss: 0.4236 | G_loss: 4.5132
-> checkpoint 9
[ 3/30][400/938] D_loss: 0.3645 | G_loss: 3.9759
-> checkpoint 10
...
```

--------------------------------

### Executing Linear Regression Example and Observing Output (Shell)

Source: https://github.com/pytorch/examples/blob/main/cpp/regression/README.md

This snippet shows how to execute the compiled linear regression binary and provides an example of the expected output. The output includes the final loss after a certain number of batches and a comparison between the learned polynomial function and the actual target function, demonstrating the model's accuracy.

```shell
$ ./regression
Loss: 0.000301158 after 584 batches
==> Learned function: y = 11.6441 x^4 -3.10164 x^3 2.19786 x^2 -3.83606 x^1 + 4.37066
==> Actual function: y = 11.669 x^4 -3.16023 x^3 2.19182 x^2 -3.81505 x^1 + 4.38219
...
```

--------------------------------

### Installing Optional Dependencies for OpenCV on Linux

Source: https://github.com/pytorch/examples/blob/main/cpp/tools/InstallingOpenCV.md

This command installs optional libraries that enhance OpenCV's functionality, such as Python bindings, TBB for parallel processing, and support for various image formats (JPEG, PNG, TIFF). These are highly recommended for a full-featured OpenCV installation.

```shell
sudo apt-get install python-dev python-numpy libtbb2 libtbb-dev libjpeg-dev libpng-dev libtiff-dev libjasper-dev libdc1394-22-dev
```

--------------------------------

### Executing PyTorch C++ Frontend Training

Source: https://github.com/pytorch/examples/blob/main/cpp/custom-dataset/README.md

This snippet shows how to execute the compiled binary for training the model and provides an example of the console output during the training process, including loss and accuracy metrics. The output indicates the device being used (e.g., CUDA) and progress per epoch.

```Shell
./custom-dataset
Running on: CUDA
Train Epoch: 1 16/7281 Loss: 0.314655 Acc: 0
Train Epoch: 1 176/7281 Loss: 0.532111 Acc: 0.0681818
Train Epoch: 1 336/7281 Loss: 0.538482 Acc: 0.0714286
Train Epoch: 1 496/7281 Loss: 0.535302 Acc: 0.0705645
Train Epoch: 1 656/7281 Loss: 0.536113 Acc: 0.0716463
Train Epoch: 1 816/7281 Loss: 0.537626 Acc: 0.0784314
Train Epoch: 1 976/7281 Loss: 0.537055 Acc: 0.079918
...
```

--------------------------------

### Command-Line Arguments for main.py (Bash)

Source: https://github.com/pytorch/examples/blob/main/mnist_forward_forward/README.md

This section lists the optional command-line arguments accepted by the `main.py` script, allowing users to customize training parameters such as epochs, learning rate, random seed, dataset sizes, and logging intervals. These arguments control the behavior and performance of the Forward-Forward algorithm training process.

```Bash
optional arguments:
  -h, --help            show this help message and exit
  --epochs EPOCHS       number of epochs to train (default: 1000)
  --lr LR               learning rate (default: 0.03)
  --no_accel            disables accelerator
  --seed SEED           random seed (default: 1)
  --save_model          For saving the current Model
  --train_size TRAIN_SIZE
                        size of training set
  --threshold THRESHOLD
                        threshold for training
  --test_size TEST_SIZE
                        size of test set
  --save-model          For Saving the current Model
  --log-interval LOG_INTERVAL
                        logging training status interval
```

--------------------------------

### Configuring AWS ParallelCluster (Shell)

Source: https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/slurm/setup_pcluster_slurm.md

This command initiates the configuration process for AWS ParallelCluster, prompting the user to define cluster settings and generating a `config.yaml` file. It's crucial to have a valid EC2 key-pair file for secure access to the cluster.

```Shell
pcluster configure --config config.yaml
```

--------------------------------

### Listing AWS ParallelClusters (Shell)

Source: https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/slurm/setup_pcluster_slurm.md

This command lists all active AWS ParallelClusters associated with the current AWS account. It is useful for monitoring the status of cluster creation or verifying the existence and state of deployed clusters.

```Shell
pcluster list-clusters
```

--------------------------------

### Creating AWS ParallelCluster (Shell)

Source: https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/slurm/setup_pcluster_slurm.md

This command creates an AWS ParallelCluster named 'dist-ml' based on the specifications in the `config.yaml` file. It provisions all necessary AWS resources, including compute instances and networking, to form the cluster.

```Shell
pcluster create-cluster --cluster-name dist-ml --cluster-configuration config.yaml
```

--------------------------------

### SSH into ParallelCluster Head Node

Source: https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/slurm/setup_pcluster_slurm.md

This command establishes an SSH connection to the head node of the `dist-ml` cluster. The `-i` flag specifies the private key file (`your-keyname-file`) required for authentication, allowing remote access to the cluster's primary control instance.

```Shell
pcluster ssh --cluster-name dist-ml -i your-keyname-file
```

--------------------------------

### Installing PyTorch GAT Dependencies (Bash)

Source: https://github.com/pytorch/examples/blob/main/gat/README.md

This command installs all necessary Python packages and libraries required to run the PyTorch GAT model. Dependencies are listed in the `requirements.txt` file, ensuring the correct environment for model execution.

```bash
pip install -r requirements.txt
```

--------------------------------

### Showcasing DCP API with FSDP2 - Bash

Source: https://github.com/pytorch/examples/blob/main/distributed/FSDP2/README.md

This command runs the FSDP2 training script to demonstrate the Distributed Checkpointing (DCP) API. It uses `torchrun` with 2 processes per node and includes the `--dcp-api` flag to activate and showcase the functionality of the DCP API.

```Bash
torchrun --nproc_per_node 2 train.py --dcp-api
```

--------------------------------

### Running REINFORCE Algorithm - Bash

Source: https://github.com/pytorch/examples/blob/main/reinforcement_learning/README.md

Executes the `reinforce.py` script, which implements the REINFORCE algorithm for reinforcement learning. This script trains a model using the specified algorithm.
```bash
python reinforce.py
```

--------------------------------

### Running FSDP2 on Transformer Model - Bash

Source: https://github.com/pytorch/examples/blob/main/distributed/FSDP2/README.md

This command sequence navigates to the FSDP2 example directory and then executes the `train.py` script using `torchrun` with 2 processes per node. The first run creates and saves state dictionaries to a 'checkpoints' folder, while subsequent runs load from these checkpoints.

```Bash
cd distributed/FSDP2
torchrun --nproc_per_node 2 train.py
```

--------------------------------

### Installing OpenCV on Arch Linux using Pacman

Source: https://github.com/pytorch/examples/blob/main/cpp/tools/InstallingOpenCV.md

This command installs OpenCV and essential development tools on Arch Linux using the `pacman` package manager. It ensures all necessary dependencies for building and running applications with OpenCV are met.

```shell
pacman -Syu base-devel opencv
```

--------------------------------

### Building Distributed MNIST Example with CMake and LibTorch (Shell)

Source: https://github.com/pytorch/examples/blob/main/cpp/distributed/README.md

These shell commands compile the `dist-mnist.cpp` example. It involves navigating to the `distributed` directory, creating a `build` directory, configuring CMake with the LibTorch path, and then building the project using `make`. A custom-compiled LibTorch with MPI headers is required for this example.

```Shell
$ cd distributed
$ mkdir build
$ cd build
$ cmake -DCMAKE_PREFIX_PATH=/path/to/libtorch ..
$ make
```

--------------------------------

### Running RPC Parameter Server Worker

Source: https://github.com/pytorch/examples/blob/main/distributed/rpc/parameter_server/README.md

This command launches a worker for the RPC-based distributed training example. It requires specifying the total `WORLD_SIZE` and the unique `RANK` of the current worker. This command is used for both server and trainer processes.
```Shell
python rpc_parameter_server.py --world_size=WORLD_SIZE --rank=RANK
```

--------------------------------

### Installing OpenCV on Fedora using DNF

Source: https://github.com/pytorch/examples/blob/main/cpp/tools/InstallingOpenCV.md

This command installs OpenCV and its development files on Fedora using the `dnf` package manager. The `opencv-devel` package provides header files and libraries required for compiling applications against OpenCV.

```shell
sudo dnf install opencv opencv-devel
```

--------------------------------

### Deactivating Python Virtual Environment (Shell)

Source: https://github.com/pytorch/examples/blob/main/CONTRIBUTING.md

This command deactivates the currently active Python virtual environment, returning the shell to its system-wide Python installation. It should be run after completing work within the virtual environment.

```Shell
deactivate
```

--------------------------------

### SSH into Cluster Headnode (Shell)

Source: https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/slurm/setup_pcluster_slurm.md

This command establishes an SSH connection to the head node of the specified AWS ParallelCluster. It requires the cluster name and the path to your EC2 key-pair file for secure authentication and access.

```Shell
pcluster ssh --cluster-name dist-ml -i your-keypair-file
```

--------------------------------

### Running Actor-Critic Algorithm - Bash

Source: https://github.com/pytorch/examples/blob/main/reinforcement_learning/README.md

Executes the `actor_critic.py` script, which implements the Actor-Critic algorithm for reinforcement learning. This script trains a model using the specified algorithm.

```bash
python actor_critic.py
```

--------------------------------

### Executing PyTorch C++ Autograd Examples

Source: https://github.com/pytorch/examples/blob/main/cpp/autograd/README.md

This snippet shows the command to execute the compiled PyTorch C++ autograd binary and its expected output.
The output demonstrates various autograd functionalities, including basic operations, higher-order gradient computations, and the use of custom autograd functions, showcasing tensor values and their gradients.

```shell
./autograd
====== Running: "Basic autograd operations" ======
 1  1
 1  1
[ CPUFloatType{2,2} ]
 3  3
 3  3
[ CPUFloatType{2,2} ]
AddBackward1
 27  27
 27  27
[ CPUFloatType{2,2} ]
MulBackward1
27
[ CPUFloatType{} ]
MeanBackward0
false
true
SumBackward0
 4.5000  4.5000
 4.5000  4.5000
[ CPUFloatType{2,2} ]
  813.6625
 1015.0142
 -664.8849
[ CPUFloatType{3} ]
MulBackward1
  204.8000
 2048.0000
    0.2048
[ CPUFloatType{3} ]
true
true
false
true
false
true
====== Running "Computing higher-order gradients in C++" ======
 0.0025  0.0946  0.1474  0.1387
 0.0238 -0.0018  0.0259  0.0094
 0.0513 -0.0549 -0.0604  0.0210
[ CPUFloatType{3,4} ]
====== Running "Using custom autograd function in C++" ======
-3.5513  3.7160  3.6477
-3.5513  3.7160  3.6477
[ CPUFloatType{2,3} ]
 0.3095  1.4035 -0.0349
 0.3095  1.4035 -0.0349
 0.3095  1.4035 -0.0349
 0.3095  1.4035 -0.0349
[ CPUFloatType{4,3} ]
 5.5000
 5.5000
[ CPUFloatType{2} ]
```

--------------------------------

### Training and Generation Commands for Language Models

Source: https://github.com/pytorch/examples/blob/main/word_language_model/README.md

This snippet provides common command-line examples for training language models using `main.py` and generating text using `generate.py`. It demonstrates how to specify model type (LSTM, Transformer), enable CUDA, set epochs, and tie weights for training, as well as how to run the text generation script.

```Bash
python main.py --cuda --epochs 6 # Train a LSTM on Wikitext-2 with CUDA.
python main.py --cuda --epochs 6 --tied # Train a tied LSTM on Wikitext-2 with CUDA.
python main.py --cuda --tied # Train a tied LSTM on Wikitext-2 with CUDA for 40 epochs.
python main.py --cuda --epochs 6 --model Transformer --lr 5 # Train a Transformer model on Wikitext-2 with CUDA.
python generate.py # Generate samples from the default model checkpoint.
```

--------------------------------

### Installing Required Build Dependencies for OpenCV on Linux

Source: https://github.com/pytorch/examples/blob/main/cpp/tools/InstallingOpenCV.md

This command installs essential build tools and libraries required to compile OpenCV from source on Debian/Ubuntu-based systems. It includes compilers, build systems, and multimedia libraries necessary for the build process.

```shell
sudo apt-get install build-essential cmake git libgtk2.0-dev pkg-config libavcodec-dev libavformat-dev libswscale-dev
```

--------------------------------

### Optimized Training Configurations for Language Models

Source: https://github.com/pytorch/examples/blob/main/word_language_model/README.md

This snippet provides examples of command-line arguments for `main.py` that are known to produce slower but better-performing language models. It demonstrates configurations with increased embedding and hidden unit sizes, higher dropout rates, and extended training epochs, both with and without tied weights, all utilizing CUDA.

```Bash
python main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40
python main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied
python main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --epochs 40
python main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --epochs 40 --tied
```

--------------------------------

### Command-Line Arguments for PyTorch MNIST Hogwild Script (Bash)

Source: https://github.com/pytorch/examples/blob/main/mnist_hogwild/README.md

This snippet lists the optional command-line arguments available for the `main.py` script, allowing users to customize training parameters such as batch size, epochs, learning rate, and enable CUDA or MPS training.
```bash
optional arguments:
  -h, --help             show this help message and exit
  --batch_size           input batch_size for training (default:64)
  --testing_batch_size   input batch size for testing (default: 1000)
  --epochs EPOCHS        number of epochs to train (default: 1000)
  --lr LR                learning rate (default: 0.03)
  --momentum             SGD momentum (default: 0.5)
  --seed SEED            random seed (default: 1)
  --mps                  enables macos GPU training
  --save_model           For saving the current Model
  --log_interval         how many batches to wait before logging training status
  --num_process          how many training processes to use (default: 2)
  --cuda                 enables CUDA training
  --dry-run              quickly check a single pass
  --save-model           For Saving the current Model
```

--------------------------------

### Launching PyTorch DDP Application with launch.py (Multi-process) in Shell

Source: https://github.com/pytorch/examples/blob/main/distributed/ddp/README.md

This shell command demonstrates how to launch a PyTorch DDP application using `launch.py`. It configures a single node with 8 GPUs, running one process per GPU, and explicitly passes `--local_world_size=8` to the `example.py` script. This setup is typical for distributing training across multiple GPUs on a single machine.

```Shell
python /path/to/launch.py --nnode=1 --node_rank=0 --nproc_per_node=8 example.py --local_world_size=8
```

--------------------------------

### Customizing Execution with Command-Line Arguments (Bash)

Source: https://github.com/pytorch/examples/blob/main/siamese_network/README.md

This snippet lists the command-line arguments available for customizing the execution of the `main.py` script. These arguments control the training and testing batch sizes, the number of epochs, the learning rate, the gamma factor for the learning-rate step, use of an accelerator, dry-run mode, the random seed, the logging interval, and model saving.
```bash
  --batch-size        input batch size for training (default: 64)
  --test-batch-size   input batch size for testing (default: 1000)
  --epochs            number of epochs to train (default: 14)
  --lr                learning rate (default: 1.0)
  --gamma             learning rate step gamma (default: 0.7)
  --accel             use accelerator
  --dry-run           quickly check a single pass
  --seed              random seed (default: 1)
  --log-interval      how many batches to wait before logging training status
  --save-model        Saving the current Model
```

--------------------------------

### Command-Line Arguments for PyTorch MNIST RNN (Bash)

Source: https://github.com/pytorch/examples/blob/main/mnist_rnn/README.md

This snippet lists the available command-line arguments for configuring the PyTorch MNIST RNN training and testing process. It details parameters such as batch size, epochs, learning rate, and options for saving the model or enabling accelerators.

```bash
optional arguments:
  -h, --help             show this help message and exit
  --batch_size           input batch_size for training (default:64)
  --testing_batch_size   input batch size for testing (default: 1000)
  --epochs EPOCHS        number of epochs to train (default: 14)
  --lr LR                learning rate (default: 0.1)
  --gamma                learning rate step gamma (default: 0.7)
  --accel                enables accelerator
  --seed SEED            random seed (default: 1)
  --save_model           For saving the current Model
  --log_interval         how many batches to wait before logging training status
  --dry-run              quickly check a single pass
```

--------------------------------

### Enabling Explicit Prefetching for FSDP2 - Bash

Source: https://github.com/pytorch/examples/blob/main/distributed/FSDP2/README.md

This command runs the FSDP2 training script with explicit prefetching enabled. It uses `torchrun` with 2 processes per node and passes the `--explicit-prefetch` flag. In FSDP, prefetching refers to gathering the sharded parameters of upcoming layers ahead of time so that communication can overlap with computation; it is not related to input data loading.
```Bash
torchrun --nproc_per_node 2 train.py --explicit-prefetch
```

--------------------------------

### Training with Dummy Data for Benchmarking in PyTorch

Source: https://github.com/pytorch/examples/blob/main/imagenet/README.md

This command runs the training script using dummy data instead of the full ImageNet dataset. This is useful for quick setup, testing, and benchmarking training speed, though the resulting loss and accuracy will not be meaningful.

```bash
python main.py -a resnet18 --dummy
```

--------------------------------

### Executing DCGAN Training with Custom Epochs

Source: https://github.com/pytorch/examples/blob/main/cpp/dcgan/README.md

This snippet demonstrates how to run the DCGAN training script and specify a custom number of training epochs using the `--epochs` flag. In this example, the model will train for 10 epochs instead of the default 30. This command requires the `dcgan` binary to be compiled and accessible.

```shell
$ ./dcgan --epochs 10
```

--------------------------------

### Running Distributed MNIST Example with MPI (Shell)

Source: https://github.com/pytorch/examples/blob/main/cpp/distributed/README.md

This shell command executes the compiled `dist-mnist` program using `mpirun`. The `{NUM-PROCS}` placeholder should be replaced with the desired number of processes for distributed training, allowing the application to leverage multiple MPI ranks.

```Shell
mpirun -np {NUM-PROCS} ./dist-mnist
```

--------------------------------

### Displaying Generated DCGAN Samples with Python

Source: https://github.com/pytorch/examples/blob/main/cpp/dcgan/README.md

This command uses the `display_samples.py` Python script to visualize the image samples generated during the DCGAN training. It takes the path to a saved sample tensor file (e.g., `dcgan-sample-10.pt`) as input using the `-i` flag and outputs a plot image named `out.png`. This requires Python and the necessary libraries for `display_samples.py`.
```shell
$ python display_samples.py -i dcgan-sample-10.pt
Saved out.png
```

--------------------------------

### Parsing DDP Command-Line Arguments in Python

Source: https://github.com/pytorch/examples/blob/main/distributed/ddp/README.md

This Python snippet demonstrates how a PyTorch DDP application parses command-line arguments using `argparse`. It specifically handles `--local_rank` and `--local_world_size`, which are crucial for distributed training setup. These arguments are then passed to the `spmd_main` entrypoint.

```Python
import argparse

# `spmd_main` is the entrypoint defined elsewhere in the example script.
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    parser.add_argument("--local_world_size", type=int, default=1)
    args = parser.parse_args()
    spmd_main(args.local_world_size, args.local_rank)
```

--------------------------------

### Enabling Mixed Precision for FSDP2 - Bash

Source: https://github.com/pytorch/examples/blob/main/distributed/FSDP2/README.md

This command executes the FSDP2 training script with mixed precision enabled. It utilizes `torchrun` with 2 processes per node and includes the `--mixed-precision` flag to leverage lower precision data types for potentially faster training and reduced memory usage.

```Bash
torchrun --nproc_per_node 2 train.py --mixed-precision
```
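
As a companion to the DDP argument-parsing snippet earlier in this section, the following is a minimal sketch of how `--local_rank` and `--local_world_size` might be turned into a set of device indices for each process. It is not the example's own code: `parse_args` and `device_ids_for` are hypothetical helper names, the even split of devices among local processes is an assumption, and the device count is passed in as a plain integer so that the sketch runs without PyTorch installed.

```python
import argparse


def parse_args(argv=None):
    """Parse the two launcher-related flags used by the DDP snippet."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    parser.add_argument("--local_world_size", type=int, default=1)
    return parser.parse_args(argv)


def device_ids_for(local_rank, local_world_size, num_devices):
    """Hypothetical helper: split num_devices evenly among local processes.

    Assumes num_devices is a multiple of local_world_size; any remainder
    devices are left unused.
    """
    if not 0 <= local_rank < local_world_size:
        raise ValueError("local_rank must be in [0, local_world_size)")
    per_process = num_devices // local_world_size
    start = local_rank * per_process
    return list(range(start, start + per_process))
```

With these assumptions, a process started with `--local_rank=1 --local_world_size=2` on a machine with 8 devices would be assigned devices 4 through 7, while the 8-process launch shown above would give each process exactly one device.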