Author: Avishek Biswas
Deep learning is shaping our world as we speak. In fact, it has been slowly revolutionizing software since the early 2010s. In 2025, PyTorch is at the forefront of this revolution, emerging as one of the most important libraries for training neural networks.
Whether you are working with computer vision, building large language models (LLMs), training a reinforcement learning agent, or experimenting with graph neural networks – your path is going to cross through PyTorch once you enter deep learning city.
All images provided in this article have been produced by the author.
This guide will provide a whirlwind tour of PyTorch’s methodologies and design principles. Over the next hour, we’re going to cut through the noise and get straight to the heart of how neural networks are actually trained.
This article is about PyTorch’s foundational concepts and how to compose and train models — from simple linear regression all the way to a modern transformer block.
More important than the specific code examples presented here, the goal of this article is to teach the main ideas, project-level architectures, and abstractions you need to work with PyTorch.
In other words, how to think “the PyTorch way”.
Before we get that far, it is important to understand the basics. PyTorch is built on two core abstractions: tensors and automatic differentiation. Master these two — how tensors store data, and how gradients are used to train neural networks — and the rest of PyTorch will feel natural. Let’s discuss Tensors first.
A tensor is a multidimensional array with a dtype, a device, and optional gradient tracking. If you know NumPy arrays, think of tensors as NumPy arrays with a few major benefits: they can run on GPUs and other accelerators, they plug directly into PyTorch’s neural network layers, and they can automatically track gradients for backpropagation.
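For instance, here is a minimal sketch of creating tensors (the "cuda" line is commented out because it only applies if you have a GPU):

import torch

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]]) # a 2x2 tensor of 32-bit floats
print(a.dtype) # torch.float32
print(a.shape) # torch.Size([2, 2])

b = torch.tensor([1.0, 2.0], requires_grad=True) # this tensor will track gradients
# a = a.to("cuda") # move the tensor to a GPU, if one is available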
Neural networks in PyTorch construct a dynamic computation graph, and use it to compute gradients automatically. Let’s see a simple example to learn this.
Let us begin with a clean, scalar example so that shapes and values are easy to reason about. The following code computes z = x^2 + y^3 for scalar x and y, then calls backward() to obtain dz/dx and dz/dy.
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
# Forward pass: compute z = x^2 + y^3
z = x**2 + y**3
# Backward pass: compute gradients
z.backward()
dz_dx = x.grad # partial derivative wrt x
dz_dy = y.grad # partial derivative wrt y
What is happening:
- We create x and y with requires_grad=True. This tells autograd to track their operations.
- z.backward() triggers the reverse-mode autodiff: PyTorch computes the gradients and places them in x.grad and y.grad.

Running the block above, x.grad comes out to 4.0 and y.grad to 27.0.
If you did some mental math, here’s what the partial derivatives look like for that equation when calculated analytically (spoiler: it works!):
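For z = x^2 + y^3, the partial derivatives are dz/dx = 2x and dz/dy = 3y^2. Plugging in x = 2 and y = 3 gives dz/dx = 4 and dz/dy = 27, which is exactly what autograd computed:

print(dz_dx) # tensor(4.)
print(dz_dy) # tensor(27.)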
The chain rule in calculus is a fundamental formula for differentiating composite functions, which are essentially functions nested within other functions. In simpler terms, you work from the outside in, taking the derivative of each “layer” of the function and multiplying them together.
Let’s take a simple example of how the chain rule works in PyTorch. Let’s say you have the following three-step equation:
Eqn 1: y = x^2
Eqn 2: z = y + 1
Eqn 3: w = z^2
Basically, w depends on z, z depends on y, and y depends on x. A basic chain of compositionality. And let’s say you want to find the derivative of w with respect to x.

The chain rule states that to find dw/dx, we calculate the gradients up the chain of dependencies and multiply them. So:

dw/dx = dw/dz * dz/dy * dy/dx
Let’s see how PyTorch does this:
# requires_grad=True tells PyTorch to compute the gradients for this tensor
x = torch.tensor(2.0, requires_grad=True)
# Define the forward pass
y = x**2
z = y + 1
w = z**2
# Calculate the gradient
w.backward()
# print the gradient
print(x.grad) # 40
And that’s it! It just works.
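To verify the 40 by hand: dw/dz = 2z = 10 (since z = 5 at x = 2), dz/dy = 1, and dy/dx = 2x = 4. Multiplying them gives dw/dx = 10 * 1 * 4 = 40, exactly the value autograd places in x.grad.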
What’s even more special is that instead of defining x as a scalar like we did above, we can also define it as a multi-dimensional tensor.
Here’s what happens when we change the first line from initializing a scalar with torch.tensor(2.0) to a 1D tensor with torch.tensor([-1.0, 2.0]):
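Here is a quick sketch of that vector version. One detail worth noting: backward() expects a scalar output, so we sum w before calling it; each element of x still gets its own gradient.

x = torch.tensor([-1.0, 2.0], requires_grad=True)
y = x**2
z = y + 1
w = z**2
w.sum().backward() # backward() needs a scalar, so we reduce w to a single number first
print(x.grad) # tensor([-8., 40.]) -- one gradient per element of x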
This is what makes PyTorch so cool. You can compute gradients for multiple elements simultaneously (in parallel) just like that.
When working on deep learning projects, our inputs are generally multi-dimensional, so PyTorch does a lot of heavy lifting in the background by parallelizing the gradient computation!
As seen in the previous examples, the PyTorch game plan is pretty simple: create tensors that require gradients, run the forward pass, call backward(), and read the gradients off the .grad attributes.
Now that we understand the basics of auto differentiation, let’s see how linear regression works in PyTorch. The code below constructs a small housing-style dataset with two features (area and age), normalizes them to the range of [-1, 1], and prepares us for some good old-fashioned linear regression.
import pandas as pd

df = pd.DataFrame(
    {
        "area": [120, 180, 150, 210, 105],
        "age": [5, 2, 1, 2, 1],
        "price": [30, 90, 100, 180, 85]
    }
)
# normalize() scales every column to the range [-1, 1] (a possible implementation is sketched below)
df = normalize(df)
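The normalize helper is not part of PyTorch or pandas; one plausible way to write it, assuming a simple min-max rescaling of every column to [-1, 1], is:

def normalize(df):
    # Rescale each column to [-1, 1] using its own min and max
    return 2 * (df - df.min()) / (df.max() - df.min()) - 1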
To do anything with PyTorch, we must first transfer the data into tensors! Notice how the data tensors X and Y do not require gradients because they are constants (i.e., they don’t change during training).
The weights W and B are trainable, though. We will update them to fit our dataset. To make them trainable through backpropagation, we need to set requires_grad=True in their declarations.
Look at the code below:
# Note that these are constants, we are not going to update them
X = torch.tensor(df[["area", "age"]].values, dtype=torch.float32)
Y = torch.tensor(df[["price"]].values, dtype=torch.float32)
# These have requires_grad=True, so they are trainable weights.
W = torch.rand(size=(2, 1), requires_grad=True)
B = torch.rand(1, requires_grad=True)
Next, let’s generate a prediction! The forward pass uses the idiomatic matrix multiplication and addition, i.e., X @ W + B.
# Generate a prediction
pred = X @ W + B
The @ operator basically does matrix multiplication between X and W. The X @ W + B model performs a “linear transformation” of X. Our goal is to tune the trainable weights W and B so that the prediction is closer to our target ground truth.
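As a quick shape sanity check for our toy dataset:

print(X.shape) # torch.Size([5, 2]) -- 5 houses, 2 features
print(W.shape) # torch.Size([2, 1]) -- one weight per feature
print(pred.shape) # torch.Size([5, 1]) -- one predicted price per house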
Next, we calculate the error as the mean squared error loss. It measures the distance between our current prediction and the ground truth. If we call loss.backward(), we also get the gradients of the trainable variables in the graph (i.e., W and B).
loss = ((Y - pred) ** 2).mean() # Mean squared error between prediction and ground truth
loss.backward()
dW = W.grad # Tells us "how much W must change to reduce the loss"
dB = B.grad # and "how much B must change to reduce the loss"
dW and dB are the gradients of the loss with respect to W and B. We can apply “gradient descent” to nudge these trainable parameters a small step in the direction opposite to the gradient, which reduces the loss.
lr = 0.2 # Learning rate: tells us how much we should update the weights

with torch.no_grad():
    W -= lr * dW # Updating W with gradient descent (in-place, so W stays a trainable leaf tensor)
    B -= lr * dB # Updating B with gradient descent
Linear regression, loss calculation, and gradient descent are some of the pillars of machine learning and, by extension, deep learning. While updating the weights manually by subtracting the gradients is possible, it becomes infeasible in practice for deep neural networks with many layers of weights. If only there were a way to update weights automatically, without all this bookkeeping!
Side note
The above technique of taking small steps in the optimization space to iteratively learn the weights is called gradient descent. Note that there are better ways to learn the optimal W and B for small datasets, like the normal equation, which gives us an analytical solution that doesn’t require any steps or iteration. It is, however, computationally expensive for large datasets. For large datasets, the standard approach is to divide the data into minibatches and apply gradient descent to each minibatch in turn. This technique is known as Stochastic Gradient Descent (SGD).
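As a rough sketch of that minibatch idea (the batch size of 2 is arbitrary for our tiny dataset):

batch_size = 2
for start in range(0, X.shape[0], batch_size):
    xb = X[start:start + batch_size] # a small slice of the inputs
    yb = Y[start:start + batch_size] # and the matching targets
    # ...run one gradient-descent step using only this minibatch...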
PyTorch optimizers are algorithms (like SGD, Adam, or RMSprop) that adjust the model’s weights and biases based on the computed gradients to minimize the loss function.
Let’s see how the above linear regression code looks if we replace the manual weight updates with a PyTorch optimizer.
from torch.optim import SGD
...
W = torch.rand(size=(2, 1), requires_grad=True)
B = torch.rand(1, requires_grad=True)
optimizer = SGD(params = [W, B], lr=0.1)
for step in range(10):
    pred = X @ W + B # Forward pass
    loss = ((Y - pred) ** 2).mean() # Calculate loss
    loss.backward() # Calculate gradients
    optimizer.step() # Update W and B according to gradients
    optimizer.zero_grad() # Reset all gradients
The core loop for training models in PyTorch looks like this:
- Run the forward pass to compute the prediction pred.
- Compute the loss by finding the error between the prediction (pred) and the ground truth (Y).
- Call loss.backward() to populate W.grad and B.grad.
- Call optimizer.step() to update parameters.
- Call optimizer.zero_grad() to avoid accumulation.

SGD is a solid baseline for linear regression. As you scale up or face noisier gradients, adaptive optimizers can help. This is where PyTorch’s suite of open source optimizers comes into play. This includes adaptive optimizers, like Adam, that use techniques such as momentum and per-parameter learning rates to achieve faster and more stable convergence on these challenging tasks. Here is a flashcard comparing various popular ones:
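Swapping optimizers is usually a one-line change. For example, here is a quick sketch that uses Adam instead of SGD (the learning rate of 0.01 is just a common starting point, not a recommendation):

from torch.optim import Adam

optimizer = Adam(params=[W, B], lr=0.01) # the rest of the training loop stays exactly the same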
And it’s not just optimizers: Torch also provides a host of different loss functions too! Here are some examples:
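A few commonly used ones, as a quick sketch (the right choice depends on your task):

import torch.nn as nn

mse = nn.MSELoss() # regression: mean squared error
ce = nn.CrossEntropyLoss() # multi-class classification (expects raw logits and integer class labels)
bce = nn.BCEWithLogitsLoss() # binary classification (expects raw logits)

loss = mse(pred, Y) # equivalent to the manual ((Y - pred) ** 2).mean() we wrote above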
Just like we don’t need to write our own optimizers, we don’t need to declare raw tensors and matrix multiplication logic on our own (for the most part). PyTorch modules have us covered.
A PyTorch Module is the fundamental building block of all neural networks in PyTorch, acting as a container for layers, learnable parameters, and the logic for how data flows through them. For example, instead of the linear layer we wrote earlier, where we manually declared the weights and biases, we can use these two lines of code:
linear_model = nn.Linear(in_size, out_size) # Torch takes care of initializing weights
prediction = linear_model(input) # Forward pass
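Under the hood, the module creates and registers the weight and bias tensors for you. For example, with made-up sizes in_size=2 and out_size=1:

linear_model = nn.Linear(2, 1)
print(linear_model.weight.shape) # torch.Size([1, 2])
print(linear_model.bias.shape) # torch.Size([1])
print([p.requires_grad for p in linear_model.parameters()]) # [True, True] -- trainable out of the box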
We learned how to make linear models (yay!), but what we really need to learn is how to train larger and deeper neural networks. The simplest type of neural network is the multi-layer perceptron (MLP). An MLP is basically multiple linear layers with non-linear functions in between.
Creating MLPs is pretty straightforward in Torch. nn.Sequential is a common PyTorch module that is used to sequentially pass the input through multiple layers. Here is the code:
# A 2 layer MLP
mlp_2_layers = nn.Sequential(
    nn.Linear(in_size, hidden_units),
    nn.ReLU(),
    nn.Linear(hidden_units, out_size)
)

# A 3 layer MLP
mlp_3_layers = nn.Sequential(
    nn.Linear(in_size, hidden_units),
    nn.ReLU(),
    nn.Linear(hidden_units, hidden_units),
    nn.ReLU(),
    nn.Linear(hidden_units, out_size)
)
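Training such an MLP follows exactly the same loop as before; the only change is that the optimizer now receives model.parameters() instead of hand-declared tensors. A minimal sketch, reusing the X and Y tensors from the housing example (the hidden size of 16 is arbitrary):

mlp = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = SGD(mlp.parameters(), lr=0.1)

for step in range(100):
    pred = mlp(X) # Forward pass through every layer in the Sequential
    loss = ((Y - pred) ** 2).mean() # Same mean-squared-error loss as before
    loss.backward() # Backward pass populates .grad for every parameter
    optimizer.step()
    optimizer.zero_grad()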
Multi-layer perceptrons can learn compositional and non-linear functions! Here is an example of a zig-zag function and how a 2-layer MLP with ReLU learns it.
Torch has a vast array of awesome layers and modules that have inspired entire research papers. You can think of these as Lego blocks that you can fit together to compose any neural network.
Want a convolutional network layer for images? Use nn.Conv2d. A GRU layer to process sequential tokens? Use nn.GRU.
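To get a feel for their interfaces, here is a small sketch with made-up sizes:

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
images = torch.randn(8, 3, 32, 32) # (batch, channels, height, width)
print(conv(images).shape) # torch.Size([8, 16, 32, 32])

gru = nn.GRU(input_size=10, hidden_size=32, batch_first=True)
tokens = torch.randn(8, 20, 10) # (batch, seq_len, features)
output, hidden = gru(tokens)
print(output.shape) # torch.Size([8, 20, 32])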
But most often in research, you would want to write a custom neural network architecture from scratch. The recipe for this process is as follows:
- Subclass nn.Module.
- In the __init__ constructor, initialize all your layers and weights.
- Define forward(), where you write the logic of the forward pass.

Here is an example where we implement a classic ResNet residual block:
import torch.nn as nn
import torch.nn.functional as F

class ResNetBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(ResNetBlock, self).__init__()
        self.conv1 = nn.Conv2d(
            in_channels,
            out_channels,
            kernel_size=3,
            stride=stride,
            padding=1,
            bias=False,
        )
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(
            out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsample = downsample

    def forward(self, x):
        residual = x
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        if self.downsample:
            residual = self.downsample(x)
        out += residual
        out = F.relu(out)
        return out
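A quick usage sketch with made-up dimensions (64 channels and 56x56 feature maps, as in the early stages of a ResNet):

block = ResNetBlock(in_channels=64, out_channels=64)
feature_map = torch.randn(1, 64, 56, 56) # (batch, channels, height, width)
print(block(feature_map).shape) # torch.Size([1, 64, 56, 56]) -- same shape, thanks to padding and the residual add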
That’s it! You just initialize your layers and define the forward pass computation graph, and Torch will directly do the backward pass on its own.
Of course, you can use your custom layers and modules as parts of a much larger network! For example, here is how to write a single transformer block, and then an encoder that stacks several of them.
class AttentionLayer(nn.Module):
    def __init__(self, input_dim, attention_dim=64):
        super(AttentionLayer, self).__init__()
        # Linear layers for attention computation
        self.query = nn.Linear(input_dim, attention_dim)
        self.key = nn.Linear(input_dim, attention_dim)
        self.value = nn.Linear(input_dim, attention_dim)
        # Scaling factor
        self.scale = torch.sqrt(torch.FloatTensor([attention_dim]))

    def forward(self, x):
        # x shape: (batch_size, sequence_length, input_dim)
        batch_size, seq_len, input_dim = x.size()
        # Compute Q, K, V
        Q = self.query(x) # (batch_size, seq_len, attention_dim)
        K = self.key(x) # (batch_size, seq_len, attention_dim)
        V = self.value(x) # (batch_size, seq_len, attention_dim)
        attention_scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale # Scaled dot-product attention
        attention_weights = F.softmax(attention_scores, dim=-1) # Convert attention scores to probabilities
        attended_output = torch.matmul(attention_weights, V) # Apply attention to values
        return attended_output, attention_weights
class TransformerBlock(nn.Module):
    """
    A single transformer block composed of self-attention and a feed-forward network.
    """
    def __init__(self, embed_dim, ffn_hidden_dim):
        """
        Args:
            embed_dim (int): The dimensionality of the model's embeddings.
            ffn_hidden_dim (int): The dimensionality of the hidden layer in the FFN.
        """
        super(TransformerBlock, self).__init__()
        self.attention = AttentionLayer(embed_dim, embed_dim)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ffn_hidden_dim),
            nn.ReLU(),
            nn.Linear(ffn_hidden_dim, embed_dim)
        )

    def forward(self, x):
        """
        Forward pass for the transformer block.
        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, sequence_length, embed_dim).
        Returns:
            torch.Tensor: The output tensor of the transformer block.
        """
        # Self-attention part
        attended, _ = self.attention(x)
        # Add & Norm (residual connection)
        x = self.norm1(attended + x)
        # Feed-forward part
        ffn_out = self.ffn(x)
        # Add & Norm (residual connection)
        x = self.norm2(ffn_out + x)
        return x
class TransformerEncoder(nn.Module):
    """
    A transformer encoder that stacks multiple TransformerBlocks.
    """
    def __init__(self, num_layers, embed_dim, ffn_hidden_dim, seq_len, output_dim):
        """
        Args:
            num_layers (int): The number of transformer blocks to stack.
            embed_dim (int): The dimensionality of the model's embeddings.
            ffn_hidden_dim (int): The dimensionality of the hidden layer in the FFN.
            seq_len (int): The length of the input sequences.
            output_dim (int): The dimensionality of the final output (e.g., number of classes).
        """
        super(TransformerEncoder, self).__init__()
        # Create a list of transformer blocks
        self.layers = nn.ModuleList(
            [TransformerBlock(embed_dim, ffn_hidden_dim) for _ in range(num_layers)]
        )
        # Final classification head
        self.classifier = nn.Linear(embed_dim * seq_len, output_dim)

    def forward(self, x):
        """
        Forward pass for the full transformer encoder.
        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, sequence_length, embed_dim).
        Returns:
            torch.Tensor: The final output logits from the classifier.
        """
        # Pass input through all transformer blocks
        for layer in self.layers:
            x = layer(x)
        # Flatten the output for the classifier
        x = x.view(x.size(0), -1)
        # Final classification
        output = self.classifier(x)
        return output
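A quick usage sketch with made-up dimensions (batch of 8, sequence length 20, embedding size 64, 10 output classes):

encoder = TransformerEncoder(
    num_layers=4, embed_dim=64, ffn_hidden_dim=256, seq_len=20, output_dim=10
)
tokens = torch.randn(8, 20, 64) # (batch, seq_len, embed_dim)
logits = encoder(tokens)
print(logits.shape) # torch.Size([8, 10]) -- one logit per class for each sequence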
Notice how the first module, AttentionLayer, computes scaled dot-product attention. TransformerBlock applies layer norms and a feed-forward network on top of it. And finally, the TransformerEncoder module applies multiple transformer blocks in sequence! And just like that, we have a BERT-style encoder that stacks multiple bidirectional attention layers, along with layer norms and residual connections.
If you are a beginner and this part overwhelms you, this is very much expected! The cool thing with PyTorch is you get to choose the level of complexity you want to work with depending on your skill level.
When you are beginning, you may want to stick to the hundreds of ready-made modules PyTorch offers out of the box. You will slowly find the need to branch out and customize them for your own use case. And as you write a couple of custom ones on your own, you will grow more and more confident and proficient.
The goal of this section was to show you the capabilities and infinite customization you can do by combining modules together. Remember: you write the forward pass, and as long as the full graph is differentiable, Torch will always be able to do the auto-differentiation for you!
The features and concepts covered in this article were handpicked to provide a whirlwind tour of some of Torch’s most important capabilities. I have a YouTube video that explains all of these concepts, along with some additional ones like model deployment, dataloaders, distributions, and training methods.
That’s it for this article! Here are some links where you can learn more about my work. Thanks for the read!
Support me on Patreon: https://www.patreon.com/NeuralBreakdownwithAVB
My YouTube channel:
https://www.youtube.com/@avb_fj
Follow me on Twitter:
https://x.com/neural_avb
Read my articles:
https://towardsdatascience.com/author/neural-avb/