Author: Avishek Biswas
Deep learning is shaping our world as we speak. In fact, it has been slowly revolutionizing software since the early 2010s. In 2025, PyTorch is at the forefront of this revolution, emerging as one of the most important libraries for training neural networks.
Whether you are working with computer vision, building large language models (LLMs), training a reinforcement learning agent, or experimenting with graph neural networks – your path is going to cross through PyTorch once you enter deep learning city.
All images provided in this article have been produced by the author.
This guide will provide a whirlwind tour of PyTorch’s methodologies and design principles. Over the next hour, we’re going to cut through the noise and get straight to the heart of how neural networks are actually trained.
This article is about PyTorch’s foundational concepts and how to compose and train models — from simple linear regression all the way to a modern transformer block.
More important than the specific code examples presented here, the goal of this article is to teach the main ideas, project-level architectures, and abstractions you need to work with PyTorch.
In other words, how to think “the PyTorch way”.
Before we get that far, it is important to understand the basics. PyTorch is built on two core abstractions: tensors and automatic differentiation. Master these two — how tensors store data, and how gradients are used to train neural networks — and the rest of PyTorch will feel natural. Let’s discuss Tensors first.
A tensor is a multidimensional array with a dtype, a device, and optional gradient tracking. If you know NumPy arrays, think of tensors as NumPy arrays with a few major benefits: they can run on GPUs and other accelerators, they plug directly into PyTorch’s neural network layers, and they can automatically track gradients for backpropagation.
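For instance, here is a minimal sketch of creating tensors (the "cuda" line is commented out because it only applies if you have a GPU):

import torch

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]]) # a 2x2 tensor of 32-bit floats
print(a.dtype) # torch.float32
print(a.shape) # torch.Size([2, 2])

b = torch.tensor([1.0, 2.0], requires_grad=True) # this tensor will track gradients
# a = a.to("cuda") # move the tensor to a GPU, if one is available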
Neural networks in PyTorch construct a dynamic computation graph, and use it to compute gradients automatically. Let’s see a simple example to learn this.
Let us begin with a clean, scalar example so that shapes and values are easy to reason about. The following code computes z = x^2 + y^3 for scalar x and y, then calls backward() to obtain dz/dx and dz/dy.
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)
# Forward pass: compute z = x^2 + y^3
z = x**2 + y**3
# Backward pass: compute gradients
z.backward()
dz_dx = x.grad # partial derivative wrt x
dz_dy = y.grad # partial derivative wrt y
What is happening:
- We create x and y with requires_grad=True. This tells autograd to track their operations.
- z.backward() triggers the reverse-mode autodiff: PyTorch computes the gradients and places them in x.grad and y.grad.

Running the block above, x.grad comes out to 4.0 and y.grad to 27.0.
If you did some mental math, here’s what the partial derivatives look like for that equation when calculated analytically (spoiler: it works!):
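For z = x^2 + y^3, the partial derivatives are dz/dx = 2x and dz/dy = 3y^2. Plugging in x = 2 and y = 3 gives dz/dx = 4 and dz/dy = 27, which is exactly what autograd computed:

print(dz_dx) # tensor(4.)
print(dz_dy) # tensor(27.)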
The chain rule in calculus is a fundamental formula for differentiating composite functions, which are essentially functions nested within other functions. In simpler terms, you work from the outside in, taking the derivative of each “layer” of the function and multiplying them together.
Let’s take a simple example of how the chain rule works in PyTorch. Let’s say you have the following three-step equation:
Eqn 1: y = x^2
Eqn 2: z = y + 1
Eqn 3: w = z^2
Basically, w depends on z, z depends on y, and y depends on x. A basic chain of compositionality. And let’s say you want to find the derivative of w with respect to x.

The chain rule states that to find dw/dx, we calculate the gradients up the chain of dependencies and multiply them. So:

dw/dx = dw/dz * dz/dy * dy/dx
Let’s see how PyTorch does this:
# requires_grad=True tells PyTorch to compute the gradients for this tensor
x = torch.tensor(2.0, requires_grad=True)
# Define the forward pass
y = x**2
z = y + 1
w = z**2
# Calculate the gradient
w.backward()
# print the gradient
print(x.grad) # 40
And that’s it! It just works.
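To verify the 40 by hand: dw/dz = 2z = 10 (since z = 5 at x = 2), dz/dy = 1, and dy/dx = 2x = 4. Multiplying them gives dw/dx = 10 * 1 * 4 = 40, exactly the value autograd places in x.grad.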
What’s even more special is that instead of defining x as a scalar like we did above, we can also define it as a multi-dimensional tensor.
Here’s what happens when we change the first line from initializing a scalar with torch.tensor(2.0) to a 1D tensor with torch.tensor([-1.0, 2.0]):
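Here is a quick sketch of that vector version. One detail worth noting: backward() expects a scalar output, so we sum w before calling it; each element of x still gets its own gradient.

x = torch.tensor([-1.0, 2.0], requires_grad=True)
y = x**2
z = y + 1
w = z**2
w.sum().backward() # backward() needs a scalar, so we reduce w to a single number first
print(x.grad) # tensor([-8., 40.]) -- one gradient per element of x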
This is what makes PyTorch so cool. You can compute gradients for multiple elements simultaneously (in parallel) just like that.
When working on deep learning projects, our inputs are generally multi-dimensional, so PyTorch does a lot of heavy lifting in the background by parallelizing the gradient computation!
As seen in the previous examples, the PyTorch game plan is pretty simple: create tensors that require gradients, run the forward pass, call backward(), and read the gradients off the .grad attributes.
Now that we understand the basics of auto differentiation, let’s see how linear regression works in PyTorch. The code below constructs a small housing-style dataset with two features (area and age), normalizes them to the range of [-1, 1], and prepares us for some good old-fashioned linear regression.
import pandas as pd

df = pd.DataFrame(
    {
        "area": [120, 180, 150, 210, 105],
        "age": [5, 2, 1, 2, 1],
        "price": [30, 90, 100, 180, 85]
    }
)
# normalize() scales every column to the range [-1, 1] (a possible implementation is sketched below)
df = normalize(df)
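The normalize helper is not part of PyTorch or pandas; one plausible way to write it, assuming a simple min-max rescaling of every column to [-1, 1], is:

def normalize(df):
    # Rescale each column to [-1, 1] using its own min and max
    return 2 * (df - df.min()) / (df.max() - df.min()) - 1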
To do anything with PyTorch, we must first transfer the data into tensors! Notice how the data tensors X and Y do not require gradients because they are constants (i.e., they don’t change during training).
The weights W and B are trainable, though. We will update them to fit our dataset. To make them trainable through backpropagation, we need to set requires_grad=True in their declarations.
Look at the code below:
# Note that these are constants, we are not going to update them
X = torch.tensor(df[["area", "age"]].values, dtype=torch.float32)
Y = torch.tensor(df[["price"]].values, dtype=torch.float32)
# These have requires_grad=True, so they are trainable weights.
W = torch.rand(size=(2, 1), requires_grad=True)
B = torch.rand(1, requires_grad=True)
Next, let’s generate a prediction! The forward pass uses the idiomatic matrix multiplication and addition, i.e., X @ W + B.
# Generate a prediction
pred = X @ W + B
The @ operator basically does matrix multiplication between X and W. The X @ W + B model performs a “linear transformation” of X. Our goal is to tune the trainable weights W and B so that the prediction is closer to our target ground truth.
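As a quick shape sanity check for our toy dataset:

print(X.shape) # torch.Size([5, 2]) -- 5 houses, 2 features
print(W.shape) # torch.Size([2, 1]) -- one weight per feature
print(pred.shape) # torch.Size([5, 1]) -- one predicted price per house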
Next, we calculate the error as the mean squared error loss. It measures the distance between our current prediction and the ground truth. If we call loss.backward(), we also get the gradients of the trainable variables in the graph (i.e., W and B).
loss = ((Y - pred) ** 2).mean() # Mean squared error between prediction and ground truth
loss.backward()
dW = W.grad # Tells us "how much W must change to reduce the loss"
dB = B.grad # and "how much B must change to reduce the loss"
dW and dB are the gradients of the loss with respect to W and B. We can apply “gradient descent” to nudge these trainable parameters a small step in the direction opposite to the gradient, which reduces the loss.
lr = 0.2 # Learning rate: tells us how much we should update the weights

with torch.no_grad():
    W -= lr * dW # Updating W with gradient descent (in-place, so W stays a trainable leaf tensor)
    B -= lr * dB # Updating B with gradient descent
Linear regression, loss calculation, and gradient descent are some of the pillars of machine learning and, by extension, deep learning. While updating the weights manually by subtracting the gradients is possible, it becomes infeasible in practice for deep neural networks with many layers of weights. If only there were a way to update weights automatically, without all this bookkeeping!
Side note
The above technique of taking small steps in the optimization space to iteratively learn the weights is called gradient descent. Note that there are better ways to learn the optimal W and B for small datasets, like the normal equation, which gives us an analytical solution that doesn’t require any steps or iteration. It is, however, computationally expensive for large datasets. For large datasets, the standard approach is to divide the data into minibatches and apply gradient descent to each minibatch in turn. This technique is known as Stochastic Gradient Descent (SGD).
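As a rough sketch of that minibatch idea (the batch size of 2 is arbitrary for our tiny dataset):

batch_size = 2
for start in range(0, X.shape[0], batch_size):
    xb = X[start:start + batch_size] # a small slice of the inputs
    yb = Y[start:start + batch_size] # and the matching targets
    # ...run one gradient-descent step using only this minibatch...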
PyTorch optimizers are algorithms (like SGD, Adam, or RMSprop) that adjust the model’s weights and biases based on the computed gradients to minimize the loss function.
Let’s see how the above linear regression code looks if we replace the manual weight updates with a PyTorch optimizer.
from torch.optim import SGD
...
W = torch.rand(size=(2, 1), requires_grad=True)
B = torch.rand(1, requires_grad=True)
optimizer = SGD(params = [W, B], lr=0.1)
for step in range(10):
    pred = X @ W + B # Forward pass
    loss = ((Y - pred) ** 2).mean() # Calculate loss
    loss.backward() # Calculate gradients
    optimizer.step() # Update W and B according to gradients
    optimizer.zero_grad() # Reset all gradients
The core loop for training models in PyTorch looks like this:
- Run the forward pass to compute the prediction pred.
- Compute the loss by finding the error between the prediction (pred) and the ground truth (Y).
- Call loss.backward() to populate W.grad and B.grad.
- Call optimizer.step() to update parameters.
- Call optimizer.zero_grad() to avoid accumulation.

SGD is a solid baseline for linear regression. As you scale up or face noisier gradients, adaptive optimizers can help. This is where PyTorch’s suite of open source optimizers comes into play. This includes adaptive optimizers, like Adam, that use techniques such as momentum and per-parameter learning rates to achieve faster and more stable convergence on these challenging tasks. Here is a flashcard comparing various popular ones:
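Swapping optimizers is usually a one-line change. For example, here is a quick sketch that uses Adam instead of SGD (the learning rate of 0.01 is just a common starting point, not a recommendation):

from torch.optim import Adam

optimizer = Adam(params=[W, B], lr=0.01) # the rest of the training loop stays exactly the same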
And it’s not just optimizers: Torch also provides a host of different loss functions too! Here are some examples:
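A few commonly used ones, as a quick sketch (the right choice depends on your task):

import torch.nn as nn

mse = nn.MSELoss() # regression: mean squared error
ce = nn.CrossEntropyLoss() # multi-class classification (expects raw logits and integer class labels)
bce = nn.BCEWithLogitsLoss() # binary classification (expects raw logits)

loss = mse(pred, Y) # equivalent to the manual ((Y - pred) ** 2).mean() we wrote above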
Just like we don’t need to write our own optimizers, we don’t need to declare raw tensors and matrix multiplication logic on our own (for the most part). PyTorch modules have us covered.
A PyTorch Module is the fundamental building block of all neural networks in PyTorch, acting as a container for layers, learnable parameters, and the logic for how data flows through them. For example, instead of the linear layer we wrote earlier, where we manually declared the weights and biases, we can use these two lines of code:
linear_model = nn.Linear(in_size, out_size) # Torch takes care of initializing weights
prediction = linear_model(input) # Forward pass
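Under the hood, the module creates and registers the weight and bias tensors for you. For example, with made-up sizes in_size=2 and out_size=1:

linear_model = nn.Linear(2, 1)
print(linear_model.weight.shape) # torch.Size([1, 2])
print(linear_model.bias.shape) # torch.Size([1])
print([p.requires_grad for p in linear_model.parameters()]) # [True, True] -- trainable out of the box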
We learned how to make linear models (yay!), but what we really need to learn is how to train larger and deeper neural networks. The simplest type of neural network is the multi-layer perceptron (MLP). An MLP is basically multiple linear layers with non-linear functions in between.
Creating MLPs is pretty straightforward in Torch. nn.Sequential is a common PyTorch module that is used to sequentially pass the input through multiple layers. Here is the code:
# A 2 layer MLP
mlp_2_layers = nn.Sequential(
    nn.Linear(in_size, hidden_units),
    nn.ReLU(),
    nn.Linear(hidden_units, out_size)
)

# A 3 layer MLP
mlp_3_layers = nn.Sequential(
    nn.Linear(in_size, hidden_units),
    nn.ReLU(),
    nn.Linear(hidden_units, hidden_units),
    nn.ReLU(),
    nn.Linear(hidden_units, out_size)
)
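Training such an MLP follows exactly the same loop as before; the only change is that the optimizer now receives model.parameters() instead of hand-declared tensors. A minimal sketch, reusing the X and Y tensors from the housing example (the hidden size of 16 is arbitrary):

mlp = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = SGD(mlp.parameters(), lr=0.1)

for step in range(100):
    pred = mlp(X) # Forward pass through every layer in the Sequential
    loss = ((Y - pred) ** 2).mean() # Same mean-squared-error loss as before
    loss.backward() # Backward pass populates .grad for every parameter
    optimizer.step()
    optimizer.zero_grad()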
Multi-layer perceptrons can learn compositional and non-linear functions! Here is an example of a zig-zag function and how a 2-layer MLP with ReLU learns it.
Torch has a vast array of awesome layers and modules that have inspired entire research papers. You can think of these as Lego blocks that you can fit together to compose any neural network.
Want a convolutional network layer for images? Use nn.Conv2d. A GRU layer to process sequential tokens? Use nn.GRU.
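To get a feel for their interfaces, here is a small sketch with made-up sizes:

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
images = torch.randn(8, 3, 32, 32) # (batch, channels, height, width)
print(conv(images).shape) # torch.Size([8, 16, 32, 32])

gru = nn.GRU(input_size=10, hidden_size=32, batch_first=True)
tokens = torch.randn(8, 20, 10) # (batch, seq_len, features)
output, hidden = gru(tokens)
print(output.shape) # torch.Size([8, 20, 32])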
But most often in research, you would want to write a custom neural network architecture from scratch. The recipe for this process is as follows:
- Subclass nn.Module.
- In the __init__ constructor, initialize all your layers and weights.
- Define forward(), where you write the logic of the forward pass.

Here is an example where we implement a classic ResNet residual block:
import torch.nn as nn
import torch.nn.functional as F

class ResNetBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(ResNetBlock, self).__init__()
        self.conv1 = nn.Conv2d(
            in_channels,
            out_channels,
            kernel_size=3,
            stride=stride,
            padding=1,
            bias=False,
        )
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(
            out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False
        )
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsample = downsample

    def forward(self, x):
        residual = x
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        if self.downsample:
            residual = self.downsample(x)
        out += residual
        out = F.relu(out)
        return out
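A quick usage sketch with made-up dimensions (64 channels and 56x56 feature maps, as in the early stages of a ResNet):

block = ResNetBlock(in_channels=64, out_channels=64)
feature_map = torch.randn(1, 64, 56, 56) # (batch, channels, height, width)
print(block(feature_map).shape) # torch.Size([1, 64, 56, 56]) -- same shape, thanks to padding and the residual add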
That’s it! You just initialize your layers and define the forward pass computation graph, and Torch will directly do the backward pass on its own.
Of course, you can use your custom layers and modules as parts of a much larger network! For example, here is how to write a single transformer block, and then an encoder that stacks several of them.
class AttentionLayer(nn.Module):
    def __init__(self, input_dim, attention_dim=64):
        super(AttentionLayer, self).__init__()
        # Linear layers for attention computation
        self.query = nn.Linear(input_dim, attention_dim)
        self.key = nn.Linear(input_dim, attention_dim)
        self.value = nn.Linear(input_dim, attention_dim)
        # Scaling factor
        self.scale = torch.sqrt(torch.FloatTensor([attention_dim]))

    def forward(self, x):
        # x shape: (batch_size, sequence_length, input_dim)
        batch_size, seq_len, input_dim = x.size()
        # Compute Q, K, V
        Q = self.query(x) # (batch_size, seq_len, attention_dim)
        K = self.key(x) # (batch_size, seq_len, attention_dim)
        V = self.value(x) # (batch_size, seq_len, attention_dim)
        attention_scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale # Scaled dot-product attention
        attention_weights = F.softmax(attention_scores, dim=-1) # Convert attention scores to probabilities
        attended_output = torch.matmul(attention_weights, V) # Apply attention to values
        return attended_output, attention_weights
class TransformerBlock(nn.Module):
    """
    A single transformer block composed of self-attention and a feed-forward network.
    """
    def __init__(self, embed_dim, ffn_hidden_dim):
        """
        Args:
            embed_dim (int): The dimensionality of the model's embeddings.
            ffn_hidden_dim (int): The dimensionality of the hidden layer in the FFN.
        """
        super(TransformerBlock, self).__init__()
        self.attention = AttentionLayer(embed_dim, embed_dim)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ffn_hidden_dim),
            nn.ReLU(),
            nn.Linear(ffn_hidden_dim, embed_dim)
        )

    def forward(self, x):
        """
        Forward pass for the transformer block.
        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, sequence_length, embed_dim).
        Returns:
            torch.Tensor: The output tensor of the transformer block.
        """
        # Self-attention part
        attended, _ = self.attention(x)
        # Add & Norm (residual connection)
        x = self.norm1(attended + x)
        # Feed-forward part
        ffn_out = self.ffn(x)
        # Add & Norm (residual connection)
        x = self.norm2(ffn_out + x)
        return x
class TransformerEncoder(nn.Module):
    """
    A transformer encoder that stacks multiple TransformerBlocks.
    """
    def __init__(self, num_layers, embed_dim, ffn_hidden_dim, seq_len, output_dim):
        """
        Args:
            num_layers (int): The number of transformer blocks to stack.
            embed_dim (int): The dimensionality of the model's embeddings.
            ffn_hidden_dim (int): The dimensionality of the hidden layer in the FFN.
            seq_len (int): The length of the input sequences.
            output_dim (int): The dimensionality of the final output (e.g., number of classes).
        """
        super(TransformerEncoder, self).__init__()
        # Create a list of transformer blocks
        self.layers = nn.ModuleList(
            [TransformerBlock(embed_dim, ffn_hidden_dim) for _ in range(num_layers)]
        )
        # Final classification head
        self.classifier = nn.Linear(embed_dim * seq_len, output_dim)

    def forward(self, x):
        """
        Forward pass for the full transformer encoder.
        Args:
            x (torch.Tensor): Input tensor of shape (batch_size, sequence_length, embed_dim).
        Returns:
            torch.Tensor: The final output logits from the classifier.
        """
        # Pass input through all transformer blocks
        for layer in self.layers:
            x = layer(x)
        # Flatten the output for the classifier
        x = x.view(x.size(0), -1)
        # Final classification
        output = self.classifier(x)
        return output
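A quick usage sketch with made-up dimensions (batch of 8, sequence length 20, embedding size 64, 10 output classes):

encoder = TransformerEncoder(
    num_layers=4, embed_dim=64, ffn_hidden_dim=256, seq_len=20, output_dim=10
)
tokens = torch.randn(8, 20, 64) # (batch, seq_len, embed_dim)
logits = encoder(tokens)
print(logits.shape) # torch.Size([8, 10]) -- one logit per class for each sequence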
Notice how the first module, AttentionLayer, computes scaled dot-product attention. TransformerBlock applies layer norms and a feed-forward network on top of it. And finally, the TransformerEncoder module applies multiple transformer blocks in sequence! And just like that, we have a BERT-style encoder that stacks multiple bidirectional attention layers, along with layer norms and residual connections.
If you are a beginner and this part overwhelms you, this is very much expected! The cool thing with PyTorch is you get to choose the level of complexity you want to work with depending on your skill level.
When you are beginning, you may want to stick to the hundreds of ready-made modules PyTorch offers out of the box. You will slowly find the need to branch out and customize them for your own use case. And as you write a couple of custom ones on your own, you will grow more and more confident and proficient.
The goal of this section was to show you the capabilities and infinite customization you can do by combining modules together. Remember: you write the forward pass, and as long as the full graph is differentiable, Torch will always be able to do the auto-differentiation for you!
The features and concepts covered in this article were handpicked to provide a whirlwind tour of some of Torch’s most important capabilities. I have a YouTube video that explains all of these concepts, along with some additional ones like model deployment, dataloaders, distributions, and training methods.
That’s it for this article! Here are some links where you can learn more about my work. Thanks for the read!
Support me on Patreon: https://www.patreon.com/NeuralBreakdownwithAVB
My YouTube channel:
https://www.youtube.com/@avb_fj
Follow me on Twitter:
https://x.com/neural_avb
Read my articles:
https://towardsdatascience.com/author/neural-avb/