
Use a DataLoader in PyTorch
PyTorch is a popular open-source machine learning library. Data scientists, researchers, and developers use it widely to build AI/ML products. One of its most important features is the DataLoader class, which loads and batches data efficiently for neural network training. This article shows how to use the DataLoader in PyTorch.
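Before going through the full workflow, here is a minimal sketch of the idea. It uses the built-in TensorDataset purely for brevity (our own choice, not something the later examples rely on): the DataLoader wraps a dataset and yields it back in batches.

import torch
from torch.utils.data import TensorDataset, DataLoader

# A toy dataset: 100 samples with 4 features each and one target per sample
features = torch.randn(100, 4)
targets = torch.randn(100, 1)
dataset = TensorDataset(features, targets)

# The DataLoader takes care of batching and (optionally) shuffling
loader = DataLoader(dataset, batch_size=16, shuffle=True)

for batch_features, batch_targets in loader:
    print(batch_features.shape, batch_targets.shape)  # e.g. torch.Size([16, 4]) torch.Size([16, 1])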
Use DataLoader in PyTorch
We can follow these basic steps to perform data loading in Python using the PyTorch library:
- Data Preparation: Create a custom RandomDataset class that generates a random dataset of the desired size. Use DataLoader to create batches of data, specifying the batch size and enabling shuffling.
- Neural Network Definition: Define a neural network class, Net, with two fully connected layers and an activation function. Customize the architecture based on the desired number of units in each layer.
- Initialization and Optimization: Instantiate the Net class, set the mean squared error (MSE) loss criterion, and initialize the optimizer as stochastic gradient descent (SGD) with the desired learning rate.
- Training Loop: Iterate over the DataLoader for the desired number of epochs. For each batch of data, compute the network output, calculate the loss, backpropagate the gradients, update the weights, and track the running loss.
Example
The following code defines a simple neural network and a random dataset of 1000 data points with ten features each. It then creates a DataLoader from the dataset with a batch size of 32 and shuffling enabled. The network is trained with stochastic gradient descent and a mean squared error loss function. The training loop iterates over the DataLoader for ten epochs, computing the loss for each batch, backpropagating the gradients, and updating the network weights. The running loss is printed every ten batches to monitor training progress.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

# Dataset that returns random 10-feature vectors
class RandomDataset(Dataset):
    def __init__(self, size):
        self.data = torch.randn(size, 10)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]

# Simple two-layer fully connected network
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(10, 5)
        self.fc2 = nn.Linear(5, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = nn.functional.relu(x)
        x = self.fc2(x)
        return x

dataset = RandomDataset(1000)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

net = Net()
criterion = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.01)

for epoch in range(10):
    running_loss = 0.0
    for i, data in enumerate(dataloader, 0):
        inputs = data
        labels = torch.rand((data.shape[0], 1))  # random targets, just for demonstration
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if i % 10 == 9:  # print the average loss every ten batches
            print(f"[Epoch {epoch + 1}, Batch {i + 1}] loss: {running_loss / 10}")
            running_loss = 0.0
Output
[Epoch 1, Batch 10] loss: 0.25439725518226625
[Epoch 1, Batch 20] loss: 0.18304144889116286
[Epoch 1, Batch 30] loss: 0.1451663628220558
[Epoch 2, Batch 10] loss: 0.12896266356110572
[Epoch 2, Batch 20] loss: 0.11783223450183869
..................................................
[Epoch 10, Batch 30] loss: 0.09491728842258454
Data Sampling and Weighted Sampling
Data sampling refers to selecting only a subset of the data for processing. This is essential in machine learning and data analysis when the full dataset cannot fit into RAM; sampling lets us train, test, and validate batch-wise. Weighted sampling is a variant in which we assign a weight to each data point, so that points with more impact on the prediction are drawn with higher probability.
Syntax
weighted_sampler = WeightedRandomSampler(weights=<array-like of weights>, num_samples=len(dataset), other parameters...)
loader = DataLoader(dataset, batch_size=batch_size, sampler=weighted_sampler, other parameters...)
Here we define the weights as a list or array-like object and use them to create the WeightedRandomSampler. We then pass the dataset to the DataLoader, supplying the sampler through the sampler parameter to enable weighted sampling.
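One common way to build such a weight array (a sketch of our own, not part of the article's example) is to weight each sample by the inverse frequency of its class, so that under-represented classes are drawn more often:

import torch

# Hypothetical labels; in practice these come from your dataset
labels = torch.randint(0, 10, (1000,))

# Weight every sample by 1 / (number of samples in its class)
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()

The resulting sample_weights tensor can then be passed as the first argument to WeightedRandomSampler.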
Example
In the following example, we implement weighted sampling using DataLoader and WeightedRandomSampler. We pass the dataset and batch_size=32 to the DataLoader, so 32 samples are processed at a time. We use WeightedRandomSampler to assign a weight to each sample; since we set replacement=True, the same data point can appear in multiple batches.
import torch
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler

# Dataset of 1000 random 3x32x32 samples with labels from 10 classes
class CustomDataset(Dataset):
    def __init__(self):
        self.data = torch.randn((1000, 3, 32, 32))
        self.labels = torch.randint(0, 10, (1000,))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        data_sample = self.data[index]
        label = self.labels[index]
        return data_sample, label

dataset = CustomDataset()

# Give samples of class 0 twice the weight of every other class
weights = torch.where(dataset.labels == 0, torch.tensor(2.0), torch.tensor(1.0))
sampler = WeightedRandomSampler(weights, len(dataset), replacement=True)

dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)

for batch_data, batch_labels in dataloader:
    print(batch_data.shape, batch_labels.shape)
Output
torch.Size([32, 3, 32, 32]) torch.Size([32])
torch.Size([32, 3, 32, 32]) torch.Size([32])
..................................................
torch.Size([32, 3, 32, 32]) torch.Size([32])
torch.Size([32, 3, 32, 32]) torch.Size([32])
torch.Size([8, 3, 32, 32]) torch.Size([8])
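As a quick sanity check of our own (not part of the original example), we can count how often each class is drawn over one pass of the weighted DataLoader above. Since class 0 has weight 2.0 and every other class weight 1.0, class 0 should appear roughly twice as often as any single other class:

# Count the sampled labels across one epoch of the weighted DataLoader above
label_counts = torch.zeros(10, dtype=torch.long)
for _, batch_labels in dataloader:
    label_counts += torch.bincount(batch_labels, minlength=10)
print(label_counts)  # class 0 should account for roughly 18% of the draws instead of 10%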
Multi-threaded Data Loading
Multi-threaded loading speeds up loading and pre-processing of the data. The idea is to parallelize the data loading operations across multiple workers so that batches are prepared while the model is busy, making execution faster. In PyTorch we enable this with the num_workers parameter of DataLoader, which takes the number of worker processes as an integer.
Syntax
dataloader = DataLoader(dataset, num_workers=<number of workers>, other parameters...)
Here num_workers is the number of worker subprocesses used to load the data. A common choice is to set it to the number of CPU cores available; with the default of 0, the data is loaded in the main process instead.
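A reasonable, hedged starting point is to query the operating system for the number of CPU cores; the sketch below assumes a dataset defined as in the other examples, and the best value still depends on the data and the machine:

import os
from torch.utils.data import DataLoader

# Use one worker per CPU core; fall back to 0 (load in the main process) if the count is unknown
num_workers = os.cpu_count() or 0
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=num_workers)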
Example
In the following code, we set num_workers to 2, meaning that data loading and pre-processing happen in two worker processes in parallel. We keep the batch size at 32 and set shuffle=True (shuffling occurs before the batches are created).
import torch
from torch.utils.data import Dataset, DataLoader

# Dataset of random 3x64x64 samples with labels from 10 classes
class CustomDataset(Dataset):
    def __init__(self, num_samples):
        self.data = torch.randn((num_samples, 3, 64, 64))
        self.labels = torch.randint(0, 10, (num_samples,))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        data_sample = self.data[index]
        label = self.labels[index]
        return data_sample, label

dataset = CustomDataset(num_samples=3000)

# Two worker processes load and pre-process the data in parallel
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)

for batch_data, batch_labels in dataloader:
    print("Batch data shape:", batch_data.shape)
    print("Batch labels shape:", batch_labels.shape)
Output
Batch data shape: torch.Size([32, 3, 64, 64])
Batch labels shape: torch.Size([32])
Batch data shape: torch.Size([32, 3, 64, 64])
.......................................................
Batch data shape: torch.Size([24, 3, 64, 64])
Batch labels shape: torch.Size([24])
Shuffling and Batch Size
As the name suggests, shuffling randomly reorders the data points. This has several advantages, including removing ordering bias: shuffled batches are more uniform, which typically leads to better model fitting. Batch size, on the other hand, refers to grouping the data points so they are processed together, which matters because a large dataset may not always fit in memory at once.
Syntax
dataloader = DataLoader(dataset, batch_size=<set a number>, shuffle=<Boolean True or False>, other parameters...)
Here dataset is the data for which we want to set the batch size and enable shuffling. batch_size takes a positive integer, and shuffle accepts a boolean: if True, the data is shuffled; if False, it is not.
Example
In the following example, we pass two important parameters to the DataLoader, batch_size and shuffle. We set batch_size to 128, so 128 data points are processed at a time, and shuffle=True, so the data is reshuffled before each epoch. If shuffle were set to False, no shuffling would occur and the model could end up slightly biased.
import torch
from torch.utils.data import Dataset, DataLoader

# Dataset of 1000 random 3x32x32 samples with labels from 10 classes
class CustomDataset(Dataset):
    def __init__(self, num_samples):
        self.data = torch.randn((num_samples, 3, 32, 32))
        self.labels = torch.randint(0, 10, (num_samples,))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        data_sample = self.data[index]
        label = self.labels[index]
        return data_sample, label

dataset = CustomDataset(num_samples=1000)

# Batches of 128 shuffled samples, loaded with two worker processes
dataloader = DataLoader(dataset, batch_size=128, shuffle=True, num_workers=2)

for batch_data, batch_labels in dataloader:
    print("Batch data shape:", batch_data.shape)
    print("Batch labels shape:", batch_labels.shape)
Output
Batch data shape: torch.Size([128, 3, 32, 32])
Batch labels shape: torch.Size([128])
......................................................
Batch data shape: torch.Size([104, 3, 32, 32])
Batch labels shape: torch.Size([104])
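The final batch above holds only 104 samples because 1000 is not a multiple of 128. If every batch must have the same size, DataLoader also accepts a drop_last parameter that simply discards the incomplete final batch; the snippet below is a small sketch of that option using the dataset from the example above:

# drop_last=True discards the leftover 104 samples so that every batch has exactly 128 elements
dataloader = DataLoader(dataset, batch_size=128, shuffle=True, drop_last=True)

for batch_data, batch_labels in dataloader:
    print("Batch data shape:", batch_data.shape)  # always torch.Size([128, 3, 32, 32])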
Conclusion
In this article, we discussed how to use the DataLoader in PyTorch to prepare data for training neural networks. These classes are extremely useful when training on any existing model, saving time and giving good results thanks to the contributions of many developers and the open-source community. It is also important to remember that different models may require different hyperparameters, so the right choices of batch size, shuffling, sampling, and workers depend on the resources available and the characteristics of the data.