How to Create a Custom Dataset in PyTorch?

13 min read

Creating a custom dataset in PyTorch involves defining a class that inherits from the torch.utils.data.Dataset class and implementing two mandatory methods: __len__ and __getitem__.


The __len__ method should return the total number of samples in the dataset, while __getitem__ should take an index as input and return the sample at that index, typically as a tuple (or dictionary) containing the sample's features and label.


Within the class, you can define any necessary data preprocessing or transformations, such as resizing images or normalizing values. One-off setup belongs in the __init__ method, where you can initialize required variables, read data from files, or set up data paths; per-sample transforms are usually applied inside __getitem__, so each sample is processed only when it is requested.
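For example, a minimal image dataset might look like the sketch below. This is illustrative only: the folder layout, the label list, and the transform pipeline are assumptions, and it requires PIL and torchvision.

import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class ImageFolderDataset(Dataset):
    def __init__(self, image_dir, labels, transform=None):
        # One-off setup: record paths, labels, and the transform to apply later
        self.image_dir = image_dir
        self.filenames = sorted(os.listdir(image_dir))  # assumption: one image per file
        self.labels = labels                            # assumption: one integer label per image
        self.transform = transform

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, index):
        # Per-sample work: load one image and apply the transform lazily
        image = Image.open(os.path.join(self.image_dir, self.filenames[index])).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[index]

transform = transforms.Compose([
    transforms.Resize((224, 224)),  # resize every image to a fixed size
    transforms.ToTensor(),          # convert to a CHW float tensor in [0, 1]
])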


To use the custom dataset, instantiate an object of the dataset class and pass it to a torch.utils.data.DataLoader. The data loader lets you load the data efficiently in batches, shuffle it between epochs, and parallelize loading across worker processes.


Overall, creating a custom dataset in PyTorch gives you flexibility in handling your own data format or applying specific preprocessing steps before training or testing your models.

Best PyTorch Books to Read in 2024

  1. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (rated 5 out of 5)
      • Use scikit-learn to track an example ML project end to end
      • Explore several models, including support vector machines, decision trees, random forests, and ensemble methods
      • Exploit unsupervised learning techniques such as dimensionality reduction, clustering, and anomaly detection
      • Dive into neural net architectures, including convolutional nets, recurrent nets, generative adversarial networks, autoencoders, diffusion models, and transformers
      • Use TensorFlow and Keras to build and train neural nets for computer vision, natural language processing, generative models, and deep reinforcement learning
  2. Generative Deep Learning: Teaching Machines To Paint, Write, Compose, and Play (rated 4.9 out of 5)
  3. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (rated 4.8 out of 5)
  4. Time Series Forecasting using Deep Learning: Combining PyTorch, RNN, TCN, and Deep Neural Network Models to Provide Production-Ready Prediction Solutions (English Edition) (rated 4.7 out of 5)
  5. Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps (rated 4.6 out of 5)
  6. Tiny Python Projects: 21 small fun projects for Python beginners designed to build programming skill, teach new algorithms and techniques, and introduce software testing (rated 4.5 out of 5)
  7. Hands-On Machine Learning with C++: Build, train, and deploy end-to-end machine learning and deep learning pipelines (rated 4.4 out of 5)
  8. Deep Reinforcement Learning Hands-On: Apply modern RL methods to practical problems of chatbots, robotics, discrete optimization, web automation, and more, 2nd Edition (rated 4.3 out of 5)


What is cross-validation in machine learning?

Cross-validation is a technique used in machine learning for assessing the performance and generalizability of a model. It involves splitting the available data into multiple subsets, typically referred to as "folds." The model is trained on a combination of these folds and evaluated on the remaining fold. This process is repeated multiple times, with each fold serving as the validation set exactly once.


The main goal of cross-validation is to estimate how well a model will perform on unseen data by simulating the process of training and testing on different sets of data. It helps to detect issues like overfitting, where a model performs well on the training data but fails to generalize to new data.


One common approach to cross-validation is k-fold cross-validation, where the data is divided into k equal-sized folds. The model is trained on k-1 folds and evaluated on the remaining fold, and this process is repeated k times, with each fold serving as the validation set. The performance of the model is then averaged over the k iterations to obtain a more reliable estimate.


Cross-validation allows for better model selection and hyperparameter tuning, as it provides a more robust evaluation of performance compared to a single train-test split. It helps in choosing the best model or set of parameters that generalize well to new and unseen data.
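As a rough sketch, k-fold cross-validation can be wired up in PyTorch by letting scikit-learn's KFold generate the index splits. Here dataset is any map-style Dataset, and build_model, train, and evaluate are hypothetical stand-ins for your own code:

import numpy as np
from torch.utils.data import DataLoader, Subset
from sklearn.model_selection import KFold

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for fold, (train_idx, val_idx) in enumerate(kfold.split(np.arange(len(dataset)))):
    train_loader = DataLoader(Subset(dataset, train_idx), batch_size=32, shuffle=True)
    val_loader = DataLoader(Subset(dataset, val_idx), batch_size=32)
    model = build_model()                            # re-initialize the model for every fold
    train(model, train_loader)                       # hypothetical training routine
    fold_scores.append(evaluate(model, val_loader))  # hypothetical evaluation routine

print(sum(fold_scores) / len(fold_scores))  # average performance over the k folds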


How to load a custom dataset in PyTorch?

To load a custom dataset in PyTorch, you need to create a custom dataset class that inherits from the torch.utils.data.Dataset class. Here are the steps to do so:

  1. Import the necessary PyTorch libraries:
import torch
from torch.utils.data import Dataset, DataLoader


  2. Create a custom dataset class that inherits from Dataset:
class CustomDataset(Dataset):
    def __init__(self, data):
        self.data = data
    
    def __getitem__(self, index):
        # Implement how to retrieve data for a given index
        sample = self.data[index]
        return sample
    
    def __len__(self):
        # Return the total number of samples in the dataset
        return len(self.data)


  3. Implement the necessary methods:
  • __init__(self, data): Initialize the dataset with the given data.
  • __getitem__(self, index): Retrieve the data sample for a given index.
  • __len__(self): Return the total number of samples in the dataset.
  4. Create an instance of your custom dataset by passing your dataset data to the constructor:
dataset = CustomDataset(data)


  5. Use the DataLoader class to create a data loader for your dataset:
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)


Here, you can specify the batch_size and choose whether to shuffle the dataset during training.

  6. Iterate over the data loader to get the data samples in batches during training:
for batch in dataloader:
    # Access the data samples in the batch
    inputs, labels = batch
    # Perform your training/inference operations on the batched data


You can now use this custom dataset and data loader in your PyTorch code for training or inference.
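As a quick end-to-end check with the CustomDataset class defined above, using dummy tensors (the shapes and values are purely illustrative):

import torch
from torch.utils.data import DataLoader

# 100 samples with 10 features each, plus binary labels
features = torch.randn(100, 10)
targets = torch.randint(0, 2, (100,))
data = list(zip(features, targets))  # each sample is an (input, label) tuple

dataset = CustomDataset(data)
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)

inputs, labels = next(iter(dataloader))
print(inputs.shape, labels.shape)  # torch.Size([16, 10]) torch.Size([16])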


What are data augmentation techniques used in deep learning?

There are several data augmentation techniques used in deep learning to increase the diversity and size of training data. Some commonly used techniques include:

  1. Image flipping: Flipping the image horizontally or vertically to create a mirrored version of the image.
  2. Random cropping: Randomly cropping a section of the image, which helps to simulate different viewpoints or object scales.
  3. Rotation: Rotating the image by a certain angle to introduce variability in the visual orientation.
  4. Zooming: Zooming in or out of the image to include different levels of object details.
  5. Translation: Translating the image horizontally or vertically to simulate object movement or positioning.
  6. Scaling: Scaling the image up or down to mimic different object sizes.
  7. Gaussian noise: Adding random Gaussian noise to the image to improve model robustness against noise in real-world scenarios.
  8. Brightness adjustment: Adjusting the brightness or contrast of the image to account for different lighting conditions.
  9. Color jittering: Introducing random variations to the color channels of the image to simulate different color saturation, hue, or brightness.


These techniques help in reducing overfitting, improving generalization, and making the model more versatile in handling variations in real-world inputs.
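Many of these augmentations are available out of the box in torchvision.transforms. A sketch, with parameter values chosen purely for illustration:

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                    # image flipping
    transforms.RandomRotation(degrees=15),                     # rotation
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),       # random cropping, zooming, scaling
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # translation
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.1),           # brightness adjustment and color jittering
    transforms.ToTensor(),
])

Such a pipeline is typically passed as the transform argument of the dataset, so each sample is augmented on the fly as it is loaded.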


How to handle class imbalance in PyTorch custom datasets?

Class imbalance in PyTorch custom datasets can be handled in several ways, most commonly by resampling the data or reweighting the loss. Here are the main options:

  1. Oversampling: Identify the class with the lowest number of samples. Generate additional samples for the minority class by duplicating existing samples or using techniques like SMOTE (Synthetic Minority Over-sampling Technique). Use the oversampled dataset for training.
  2. Undersampling: Identify the class with the highest number of samples. Randomly remove samples from the majority class to balance the dataset. Use the undersampled dataset for training.
  3. Stratified Sampling: Divide the dataset into multiple subsets, each containing a balanced representation of classes. Randomly sample a fixed number of samples from each subset to create a balanced dataset. Use the stratified dataset for training.
  4. Weighted Loss Function: Assign higher weights to the minority class samples in the loss function. This helps in giving more importance to the minority class during optimization.
  5. Combine Undersampling and Oversampling: Perform undersampling on the majority class to reduce its class imbalance. Then perform oversampling techniques on the minority class to increase its representation. This combined approach can help create a more balanced dataset.


You can implement these approaches in the custom Dataset class by modifying the __getitem__ method to handle oversampling, undersampling, or stratified sampling based on your specific requirements. Additionally, you can use the WeightedRandomSampler with a custom weight distribution to handle imbalanced datasets in the training dataloader.
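As a sketch of the WeightedRandomSampler route, assuming each sample in dataset is a (features, label) pair with an integer label (how the labels are collected below is an assumption about your dataset):

import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

labels = torch.tensor([int(dataset[i][1]) for i in range(len(dataset))])  # assumption: sample = (features, label)
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()  # rarer classes get larger sampling weights
sampler = WeightedRandomSampler(weights=sample_weights,
                                num_samples=len(sample_weights),
                                replacement=True)
# Pass the sampler instead of shuffle=True; the two options are mutually exclusive
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)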


What is class imbalance in deep learning?

Class imbalance in deep learning refers to a scenario where the number of examples in one class is significantly higher or lower than the number of examples in another class within a dataset. This imbalance can create learning difficulties for a deep learning model, as it may become biased towards the majority class, ignoring or misclassifying instances from the minority class.


For example, in a binary classification problem where 90% of the samples belong to class A and only 10% belong to class B, the model may achieve high accuracy by simply always predicting class A. However, this would not be desirable if the true objective is to correctly classify instances from both classes.


Class imbalance can affect the training process and model performance because the model tends to gravitate toward the majority class, which is better represented. The model may struggle to recognize and learn patterns from the minority class, leading to poor predictive performance for that class.


Addressing class imbalance is important in deep learning to ensure fair and accurate predictions for all classes. Techniques such as oversampling the minority class, undersampling the majority class, or using more advanced methods like data augmentation, weighted loss functions, or synthetic minority oversampling technique (SMOTE) can help mitigate the impact of class imbalance on the deep learning model.
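For instance, a class-weighted loss for the 90%/10% example above is a one-line change with PyTorch's built-in cross-entropy loss:

import torch
import torch.nn as nn

# Inverse-frequency weights: mistakes on the 10% minority class cost roughly 9x more
class_weights = torch.tensor([1.0 / 0.9, 1.0 / 0.1])
criterion = nn.CrossEntropyLoss(weight=class_weights)
# loss = criterion(model_outputs, targets)  # drop-in replacement in the training loop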

