Intuitive deep learning: From linear regression to deep neural networks
How should I get started with deep learning? How can we evolve a linear regression model into a neural network? If you have these questions, this article might be helpful for you.
Deep learning is a subfield of machine learning that imitates the human brain with artificial neural networks. Over the past decades, deep learning has become ubiquitous in our daily lives, powering speech/image recognition, self-driving cars, and natural language processing, to name just a few applications. Its ability to approximate previously intractable functions and generate new data has also advanced a variety of scientific disciplines, e.g., identifying the structural determinants of protein biophysical properties, designing drugs with specific properties, or predicting the structure and function of genomic elements.
As deep learning and its applications thrive, an increasing amount of learning material and commentary has become available in recent years. However, it is not always immediately clear to beginners how to get started. In this article, I aim to share some hopefully straightforward ways to develop an intuition about neural networks. This article will be more helpful if you are familiar with linear regression and basic calculus. Additionally, this article is accompanied by a Google Colab notebook. If you’re interested in how the models introduced in this article are built (using PyTorch) and want more insights into their performance, I would highly recommend you check it out!
What is machine learning/deep learning?
So, what is the general goal of machine learning or deep learning? In a nutshell, machine/deep learning trains computers to find functions that adequately represent complex relationships, e.g.,
- A function that takes in voice signals and generates speech contents (speech recognition)
- A function that reads in an image of a cat and classifies the input as a cat (image recognition)
- Or a function that takes in the positions of all the stones on the board and figures out the next move in a game of Go (AlphaGo)
Broadly speaking, these functions come with at least the following two types:
- Regression: Functions that predict a continuous quantity output, usually a scalar. One example is the prediction of PM2.5 tomorrow by a function that takes in the PM2.5 today, the temperature, the concentration of ozone, etc.
- Classification: Functions that predict a discrete class label. One example is to tell whether an email is spam, or to suggest the next move in a Go game using AlphaGo, with each position on the board being a class.
In this article, I will only talk about regression models, but hopefully this is enough for you to develop some intuition about deep neural networks!
Problem to be solved
Here, our goal is to build a regression model that can predict COVID-19 daily new cases in the United States, given the past time series of daily new cases from January 23, 2020, to July 31, 2022 (921 days in total). This dataset was extracted from Our World in Data and then reformatted.

Our first regression model
To build a model for predicting COVID-19 daily new cases, we can follow a very general workflow described below.
Step 1: Define the model
Here, let’s start with the simplest regression model: a linear regression model

$y = b + wx_1$

where
- $y$ is the estimated number of new cases on Day $n$
- $x_1$ (feature) is the number of new cases on Day $n-1$
- $w$ is the weight applied to the feature
- $b$ is the bias
As can be seen, defining such a model requires deciding which features to consider, and hence domain knowledge relevant to the problem. In deep learning, however, such feature engineering is generally not required, as opposed to classical machine learning methods like linear regression.
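If you’re curious what this looks like in code, here is a minimal PyTorch sketch of such a one-feature linear model (the module structure in the accompanying notebook may differ):

```python
import torch.nn as nn

class LinearRegression(nn.Module):
    """y = b + w * x_1: one input feature, one output, with w and b
    as the trainable parameters held by nn.Linear(1, 1)."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)
```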
Step 2: Prepare datasets
When building a machine/deep learning model, we need to define the following 3 datasets:
- Training set: The dataset that the model is trained on. In our case, we will take the data before Day 800 as the training set.
- Validation set: The dataset split from the training set, which is used for validating the model during training. In our case, we consider 1 data point for validation for every 5 data points in the training set.
- Test set: The dataset that we will use to assess the trained model. Here we take the data after Day 800 as the test set.
Practically, if PyTorch is used, this is generally done with classes like Dataset and DataLoader, in which some data preprocessing can be carried out as needed. In our case, we will simply normalize the data.
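As an illustration, a minimal Dataset for our one-feature case might look like the sketch below (the class and variable names are my own, not necessarily those used in the notebook):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class CovidDataset(Dataset):
    """Turns a 1-D series of daily new cases into (feature, label) pairs:
    the feature is the count on Day n-1 and the label is the count on Day n."""
    def __init__(self, series):
        series = torch.tensor(series, dtype=torch.float32)
        # Normalize the data so that training is numerically well-behaved
        series = (series - series.mean()) / series.std()
        self.x = series[:-1].unsqueeze(1)  # cases on Day n-1, shape (N, 1)
        self.y = series[1:].unsqueeze(1)   # cases on Day n,   shape (N, 1)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

# Example: wrap the training data (before Day 800) in a DataLoader
# train_loader = DataLoader(CovidDataset(train_series), batch_size=32, shuffle=True)
```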
Step 3: Define the loss from training data
A loss function $L(w, b)$ is a measure of the quality of the parameters ($w$ and $b$ in our case). Writing $y_n$ for the model’s prediction and $\hat{y}_n$ for the true label of the $n$-th data point, commonly used loss functions include:
- Mean absolute error (MAE): $L = \frac{1}{N}\sum_{n=1}^{N}\left|y_n - \hat{y}_n\right|$
- Mean squared error (MSE): $L = \frac{1}{N}\sum_{n=1}^{N}\left(y_n - \hat{y}_n\right)^2$
- Cross entropy (given that $y$ and $\hat{y}$ are probability distributions): $L = -\sum_{n}\hat{y}_n \ln y_n$
In our case, we will adopt MAE as the loss function.
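These loss functions are all available off the shelf in PyTorch; for example, a quick sanity check of MAE:

```python
import torch
import torch.nn as nn

mae_loss = nn.L1Loss()   # mean absolute error, our choice here
mse_loss = nn.MSELoss()  # mean squared error, for comparison

pred = torch.tensor([1.0, 2.0, 3.0])
label = torch.tensor([1.5, 2.0, 2.0])
print(mae_loss(pred, label))  # tensor(0.5000) = (0.5 + 0.0 + 1.0) / 3
```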
Step 4: Train/Optimize the model
By training a model, we often mean optimizing its parameters to minimize the loss function, i.e., in our case, finding

$w^*, b^* = \underset{w,\, b}{\arg\min}\ L(w, b)$

To seek the point where the loss is (at least locally) minimal, the most common strategy is gradient descent: starting from randomly initialized parameters, we iteratively update each parameter in the direction opposite to the gradient of the loss, e.g.,

$w \leftarrow w - \eta \frac{\partial L}{\partial w}$

where the learning rate $\eta$ controls the step size of each update. A minimal sketch of such a training loop is given after the recommended materials below.
[Recommended materials]
- To see how gradient descent works and how the learning rate influences the result, I strongly recommend this interactive Google crash course.
- For a comprehensive review of gradient descent methods, please refer to this article by Ruder.
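And here is the promised minimal sketch of such a training loop in PyTorch (the synthetic data, learning rate, and epoch count are placeholders for illustration, not the actual settings used in the notebook):

```python
import torch

# Placeholder data standing in for the DataLoader built in Step 2
x_fake = torch.randn(100, 1)
train_loader = [(x_fake, 2 * x_fake + 1)]  # one batch with labels y = 2x + 1

model = torch.nn.Linear(1, 1)
loss_fn = torch.nn.L1Loss()                                # MAE, as in Step 3
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)   # gradient descent

for epoch in range(1000):                  # placeholder epoch count
    for x, y in train_loader:
        optimizer.zero_grad()              # clear gradients from the last step
        loss = loss_fn(model(x), y)        # forward pass + compute the loss
        loss.backward()                    # backpropagate: compute dL/dw, dL/db
        optimizer.step()                   # w <- w - lr * dL/dw, etc.
```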
As a result, the training finished after 1330 epochs. The average test loss was around 55202, indicating that the model is already moderately predictive given the large values of the labels (see the figures below). For more data analysis results, please refer to the notebook.

How about multiple linear regression (MLR) models?
Although our first model is already moderately predictive, we definitely want to improve the model performance if possible! To gain better performance, one could consider tuning hyperparameters or modifying the model based on the characteristics of the data. For example, the time series of daily new cases in our case actually has a 7-day period, reflecting the fact that the number of cases reported on weekends is generally much lower because fewer patients get tested on weekends. Therefore, considering only the daily cases of the previous day might not be sufficient. Instead, we should consider Days $n-7$ to $n-1$ all together, which turns our model into a multiple linear regression (MLR) model:

$y = b + \sum_{i=1}^{7} w_i x_i$
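In code, moving from one feature to seven is essentially a one-line change; a minimal sketch:

```python
import torch.nn as nn

# MLR: y = b + w_1 * x_1 + ... + w_7 * x_7, one weight per day of history.
# The input is now a 7-dimensional vector (cases on Days n-7 to n-1).
mlr_model = nn.Linear(7, 1)
```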
To see how the number of features (i.e., how far back we trace the historical data) influences the model performance, I’ve trained models that consider 7, 28, 56, or even 84 features, with all other hyperparameters remaining the same. (Again, check the notebook for more details!)

As shown above, with the daily new cases in the past week considered, the model is indeed more predictive. If we consider the daily new cases in the past month (28 days), the performance is even better, with an average test loss of 25213, less than half of the test loss of our first model! However, if we consider the data in the past 56 or 84 days, the model starts to overfit. That is, even though the training loss is driven lower, the model actually performs worse on the test data it has never seen. Therefore, the best MLR model we have is the one with 28 features!
How do we further improve the performance of the model?
One obvious disadvantage of linear regression models is that they assume either a strictly increasing or a strictly decreasing relationship between the output $y$ and each feature $x_i$, which is clearly too simple for a fluctuating time series like ours.
So how do we account for the fluctuations in the time series? To approach this, we can first consider a simpler example, where we want to model a piecewise linear curve (see the figure below).

As shown in the left panel of the figure above, the three segments of the piecewise linear curve can each be approximated by a scaled and shifted sigmoid function $c\,\sigma(b + wx)$, so the entire curve can be represented as a constant plus a sum of such sigmoid functions.
Back to our original problem, if we only consider 1 feature ($x_1$, the number of new cases on Day $n-1$), the same idea gives a model of the form

$y = b + \sum_{i} c_i\, \sigma\left(b_i + w_i x_1\right)$

where the sigmoid function $\sigma$ serves as what we call an activation function, introducing the nonlinearity that a plain linear model lacks.
Note that we are not restricted to only using the sigmoid function! For example, we can also use rectified linear unit (ReLU) functions, $\max(0,\, b + wx)$; two ReLUs can be combined to approximate one sigmoid-like segment.
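In PyTorch, a model with one layer of activation functions is a short nn.Sequential stack; a sketch, where the 28 input features match our best MLR model and the number of activation functions (32) is an arbitrary choice of mine:

```python
import torch.nn as nn

n_features, n_units = 28, 32  # 28 days of history; 32 activation functions

sigmoid_model = nn.Sequential(
    nn.Linear(n_features, n_units),  # b_i + sum_j w_ij * x_j for each unit i
    nn.Sigmoid(),                    # applies sigma(.) elementwise
    nn.Linear(n_units, 1),           # b + weighted sum of the activations
)

relu_model = nn.Sequential(
    nn.Linear(n_features, n_units),
    nn.ReLU(),                       # max(0, .) instead of the sigmoid
    nn.Linear(n_units, 1),
)
```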

So I’ve already tried these cool functions for you! Specifically, the figure below shows models of 28 features with different numbers of sigmoid or ReLU activation functions:

[Recommended materials]
- To see why ReLU is generally preferred to the sigmoid function, check this article.
- For other commonly used activation functions, please refer to this nicely written article!
We can make the model even more complicated! 😎
So we have made progress! At this point, we have learned 3 different types of models: a simple linear regression model, MLR models, and models that use activation functions. And guess what? You’ve already seen at least 3 different kinds of (oversimplified) neural networks!
Specifically, a simple linear regression model like our first model is essentially a neural network with no hidden layer (panel A below); an MLR model is the same network with more input features (panel B); and a model with activation functions is a neural network with one hidden layer, where each unit applying an activation function is called a neuron (panel C).

Now, based on the model in panel C shown above, we can make the model even more complicated if we want! For example, we can take the 3-dimensional output of the first hidden layer and use it as the 3 input features of the next layer of neurons, and keep stacking hidden layers in this way. The more such layers a network has, the "deeper" it is, which is exactly where the "deep" in deep learning comes from.
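A deeper network of this kind is again only a few lines in PyTorch; a sketch using 3 neurons per hidden layer as in the description above (the 28 input features match our best MLR model, and sigmoid is just one possible activation choice):

```python
import torch.nn as nn

# A deeper version of the panel-C model: each hidden layer has 3 neurons,
# and the 3-dimensional output of one layer feeds the next.
deep_model = nn.Sequential(
    nn.Linear(28, 3), nn.Sigmoid(),  # hidden layer 1 (28 input features)
    nn.Linear(3, 3), nn.Sigmoid(),   # hidden layer 2 takes layer 1's 3 outputs
    nn.Linear(3, 1),                 # output layer: predicted daily new cases
)
```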

Would multi-layer neural networks work better in our case? Well, as can be checked in our notebook, multi-layer neural networks in our case can hardly drive the test loss any lower, and they start to overfit very quickly (from the 4-layer network onward). While the neural networks we have covered here are actually not the best choice for time-series forecasting, hopefully it is now a little clearer how a linear regression model can be generalized all the way to a deep neural network! If you want some hands-on coding experience building the models we’ve mentioned in this article, don’t forget to check out our notebook!
This is the end of the article! 🎉🎉 If you enjoyed this article, you are welcome to share it or leave a comment below, so I will be more motivated to write more! Thank you for reading this far! 😃