How to Use Multiple GPUs In PyTorch?

15 minute read

In PyTorch, you can use multiple GPUs for faster training and inference by utilizing the torch.nn.DataParallel module. This module allows you to parallelize the computation across multiple GPUs, thereby taking advantage of their combined processing power.


To utilize multiple GPUs, follow these steps:

  1. Ensure that PyTorch is installed with GPU support (torch.cuda.is_available() should return True).
  2. Import the necessary modules: torch, torch.nn, and torch.nn.DataParallel.
  3. Create your model and move it to the GPU by calling model.cuda().
  4. Wrap your model with torch.nn.DataParallel: model = torch.nn.DataParallel(model).
  5. Prepare your data, and if required, move it to the GPU using input.cuda().
  6. Pass the data through the model as you would with a single GPU or CPU.
  7. Ensure that the output from the model is moved back to the CPU using output.cpu() if you need to perform further operations outside the model.
  8. If needed, you can access the individual model replicas and perform operations on them using model.module. For example, model.module.parameters() returns the parameters of the original model.


By following these steps, PyTorch distributes the computation across the available GPUs automatically. It splits the input batch along its first dimension into roughly equal chunks, sends one chunk to each GPU, runs the forward pass on each model replica independently, and gathers the results back into a single output on the primary GPU.


Using torch.nn.DataParallel can yield a significant speedup in training or inference time compared to a single GPU or the CPU, provided the per-GPU workload is large enough to outweigh the communication overhead.
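Here is a minimal sketch of steps 2 through 8 (the tiny Sequential model and the random batch are placeholders for your own model and data):

import torch
import torch.nn as nn

# A tiny stand-in model; substitute your own nn.Module here
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

model = model.cuda()                      # step 3: move the model to the GPU
model = nn.DataParallel(model)            # step 4: wrap it for multi-GPU execution

inputs = torch.randn(32, 128).cuda()      # step 5: move a batch of inputs to the GPU
outputs = model(inputs)                   # step 6: forward pass, split across the GPUs
result = outputs.cpu()                    # step 7: bring the output back to the CPU if needed

params = list(model.module.parameters())  # step 8: access the original, unwrapped model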


What is synchronized batch normalization and how does it relate to multi-GPU training in PyTorch?

Synchronized Batch Normalization (SyncBN) is an extension of the standard Batch Normalization technique. With standard Batch Normalization in multi-GPU training, each GPU computes the mean and variance of only its own portion of the mini-batch. When the per-GPU batch is small, these local statistics become noisy and differ across GPUs, which degrades the quality of the normalization and can harm the model's performance.


SyncBN addresses this problem by synchronizing the mean and variance statistics across all GPUs involved in the training process. It performs the necessary operations to exchange the mean and variance values between GPUs, ensuring that all GPUs have the same estimates. By doing so, SyncBN helps in maintaining accurate and consistent normalization during multi-GPU training.


In PyTorch, SyncBN is available through the torch.nn.SyncBatchNorm module, a drop-in replacement for the regular torch.nn.BatchNorm1d/2d/3d layers. Replacing those layers with SyncBatchNorm, or converting an existing model with torch.nn.SyncBatchNorm.convert_sync_batchnorm, enables synchronized Batch Normalization during multi-GPU training. Note that SyncBatchNorm only takes effect when the model is trained with torch.nn.parallel.DistributedDataParallel, with one process per GPU.
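As a rough sketch, an existing model's BatchNorm layers can be converted before the model is wrapped for distributed training (the small convolutional model is illustrative, and a torch.distributed process group is assumed to be initialized already):

import torch
import torch.nn as nn

# An illustrative model containing ordinary BatchNorm layers
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

# Convert every BatchNorm layer to its synchronized counterpart.
# SyncBatchNorm only takes effect under DistributedDataParallel, so this assumes
# torch.distributed has been initialized with one process per GPU.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)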


How to set the number of GPUs to use in PyTorch?

In PyTorch, you can set the number of GPUs to use by configuring the CUDA_VISIBLE_DEVICES environment variable. This variable allows you to control which GPUs are visible to a program.


Here are the steps to set the number of GPUs to use in PyTorch:

  1. Open a terminal and set the CUDA_VISIBLE_DEVICES environment variable. If you want to use only a specific GPU, specify its index. For example, to use GPU 0, run the following command:

export CUDA_VISIBLE_DEVICES=0


If you want to use multiple GPUs, specify their indices as a comma-separated list. For example, to use GPUs 0 and 1, run:

export CUDA_VISIBLE_DEVICES=0,1


  2. In your PyTorch code, call torch.cuda.device_count() to find out how many GPUs are visible, and call torch.cuda.set_device() to select which of them is the current device.


Here is an example:

import torch

# Check the number of available GPUs
num_gpus = torch.cuda.device_count()

# Set the device to the first GPU if available
if num_gpus > 0:
    torch.cuda.set_device(0)


By setting the device, you ensure that PyTorch uses the specified GPU(s) for computations.


Note that CUDA_VISIBLE_DEVICES affects the visibility of GPUs for all CUDA-capable applications launched from the current shell session, and it must be set before the process initializes CUDA.
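The variable can also be set from inside a Python script, as long as that happens before CUDA is initialized (a minimal sketch):

import os

# Must be set before PyTorch initializes CUDA (safest: before importing torch)
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch

print(torch.cuda.device_count())  # reports 2 if GPUs 0 and 1 exist on this machine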


What is the impact of different network architectures on multi-GPU training in PyTorch?

The choice of network architecture can have a significant impact on multi-GPU training in PyTorch. Here are some aspects to consider:

  1. Model size: Larger models with more parameters need more memory for their weights, activations, and gradients during training. With data parallelism, every GPU holds a full replica of the model, so a large model shrinks the batch size that fits on each GPU or may not fit at all.
  2. Parallelization strategy: PyTorch provides different parallelization strategies for multi-GPU training. With Data Parallelism, the model is replicated on each GPU and each GPU processes a different subset of the training data; this comes with communication overhead, since gradients must be synchronized across GPUs. Alternatively, Model Parallelism splits the model itself across multiple GPUs, with each GPU responsible for certain layers. This allows larger models to be trained, but requires careful design to keep the communication between GPUs efficient (see the sketch below).
  3. Network topology: The structure of the network architecture can impact the efficiency of multi-GPU training. Some architectures have more inter-layer dependencies, which can increase communication overhead between GPUs. On the other hand, architectures with more independent branches or skip connections can enable better model parallelism.
  4. Batch size: When training on multiple GPUs, the effective batch size is typically increased since each GPU processes a subset of the data. Larger batch sizes can provide better generalization and speed up training, but excessively large batches may lead to instability. The choice of network architecture should consider the optimal batch size for convergence and performance.
  5. Training time: Different network architectures can have varying training times, which can affect the overall time spent on multi-GPU training. Some architectures may converge faster, while others may require more iterations. The training time should be considered in relation to the available computational resources and time constraints.


Ultimately, the selection of a network architecture for multi-GPU training in PyTorch should take into account considerations such as memory requirements, parallelization strategy, communication overhead, and the desired trade-off between model performance and training time.
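As a minimal sketch of the model-parallel idea from point 2, the layers of a model can be pinned to different devices and the activations moved between them in the forward pass (the two-GPU split and the layer sizes here are arbitrary):

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Illustrative model split across two GPUs (model parallelism)."""

    def __init__(self):
        super().__init__()
        # First half of the network lives on GPU 0
        self.part1 = nn.Sequential(nn.Linear(1024, 512), nn.ReLU()).to("cuda:0")
        # Second half lives on GPU 1
        self.part2 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Move the intermediate activation to the second GPU
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(8, 1024))   # the output lives on cuda:1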


What is the recommended way to handle gradients when using multiple GPUs in PyTorch?

When using multiple GPUs in PyTorch, a common approach to handle gradients is to use the torch.nn.DataParallel wrapper.


The DataParallel wrapper allows you to wrap your model with it and parallelize the model's forward pass across multiple GPUs. It automatically divides the input batch across the available GPUs and computes the forward pass on each GPU in parallel. After the forward pass, it gathers the outputs from all GPUs and returns them as a single tensor.


To use DataParallel, you need to initialize your model, wrap it with DataParallel, and move it to the available GPUs. Here's an example:

import torch
import torch.nn as nn
from torch.nn import DataParallel

# Initialize your model
model = YourModel()

# Wrap the model with DataParallel
model = DataParallel(model)

# Move the model to available GPUs
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Define your loss criterion
criterion = nn.CrossEntropyLoss()

# Define the optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Inside your training loop
for inputs, labels in dataloader:
    inputs = inputs.to(device)
    labels = labels.to(device)

    # Forward pass
    outputs = model(inputs)

    # Compute loss
    loss = criterion(outputs, labels)

    # Backward pass
    loss.backward()

    # Update model parameters
    optimizer.step()
    optimizer.zero_grad()


By using DataParallel, PyTorch handles the gradients for you. During the backward pass, each GPU computes gradients for its own chunk of the batch, and those per-replica gradients are summed onto the parameters of the original module (model.module) on the primary GPU, giving the overall gradient that the optimizer uses to update the model.
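To see where the accumulated gradients end up, you can inspect the parameters of the wrapped module after the backward pass (purely illustrative, continuing the loop above):

# After loss.backward(), the summed gradients live on the parameters of the
# original module (model.module), on the primary GPU (cuda:0 by default).
for name, param in model.module.named_parameters():
    if param.grad is not None:
        print(name, param.grad.device, param.grad.norm().item())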


How to use multiple GPUs in PyTorch?

To use multiple GPUs in PyTorch, you can follow these steps:

  1. Set the device variable to "cuda" and call torch.cuda.device_count() to get the number of available GPUs.
device = "cuda" if torch.cuda.is_available() else "cpu"
num_gpus = torch.cuda.device_count()


  2. Define your model and wrap it with torch.nn.DataParallel. The DataParallel class splits the input batch across the available GPUs and runs the forward pass on each GPU in parallel.
model = YourModel()
if num_gpus > 1:
    model = torch.nn.DataParallel(model)
model.to(device)


  3. Convert your data and labels to tensors and move them to the GPU.
data = data.to(device)
labels = labels.to(device)


  4. Finally, perform the forward and backward passes as usual on your model.
outputs = model(data)
loss = criterion(outputs, labels)
loss.backward()


That's it! PyTorch will distribute the computations across the available GPUs, and the gradients will be synchronized automatically during the backward pass.


How to choose the optimal learning rate for multi-GPU training in PyTorch?

Choosing the optimal learning rate for multi-GPU training in PyTorch follows much the same process as for single-GPU training, with one caveat: the effective batch size grows with the number of GPUs, so the learning rate usually needs to be increased accordingly (scaling it linearly with the batch size is a common starting heuristic). Here are some steps you can follow:

  1. Start with a small learning rate: It is recommended to begin with a conservative learning rate. This allows for a cautious exploration of the parameter space.
  2. Use learning rate schedules: Gradually decreasing (or cyclically varying) the learning rate over time can improve convergence. Implement and experiment with schedules such as step decay, exponential decay, or cyclical learning rates (see the sketch after this list).
  3. Employ learning rate range tests: This technique involves gradually increasing the learning rate until the loss starts to rise, which helps determine a good learning rate range. The third-party torch_lr_finder library provides a convenient implementation of this test for PyTorch.
  4. Perform grid search: Conduct a grid search by trying different learning rates within the determined optimal learning rate range obtained from step 3. Monitor the model's performance using validation metrics such as accuracy or loss and select the learning rate that achieves the best results.
  5. Use learning rate warm-up: Gradually increasing the learning rate over the first few training iterations avoids abrupt changes to the weights early on and provides a better start. This is especially useful when training large models with large batch sizes.
  6. Perform cross-validation: If training hyperparameters allow, use k-fold cross-validation to evaluate the model's performance using different learning rates. This helps in finding the learning rate that generalizes well across different data splits.
  7. Experiment and observe: Ultimately, it is important to experiment with different learning rates and observe their impact on the model's convergence and performance. Monitor key model metrics during training to determine the optimal learning rate.
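As a concrete illustration of item 2, a step-decay schedule in PyTorch might look like the following (a minimal sketch; the model, optimizer, decay interval, and factor are arbitrary choices):

import torch
import torch.nn as nn

model = nn.Linear(128, 10)                                 # illustrative stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Step decay: multiply the learning rate by 0.1 every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... run one epoch of training here ...
    scheduler.step()
    print(epoch, scheduler.get_last_lr())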


Remember that the optimal learning rate may vary depending on the specific problem, dataset, model architecture, and batch size. It is recommended to iterate and fine-tune the learning rate selection process for multi-GPU training to achieve the best results.
