Intuitive deep learning: From linear regression to deep neural networks
How should I get started with deep learning? How can a linear regression model evolve into a neural network? If you have questions like these, this article might be helpful for you.
Deep learning is a subfield of machine learning that imitates the human brain with artificial neural networks. Over the past decades, deep learning has become ubiquitous in our daily lives, powering speech/image recognition, self-driving cars, and natural language processing, to name just a few applications. Its ability to approximate previously intractable functions and to generate new data also advances many scientific disciplines, e.g., identifying the structural determinants of protein biophysical properties, designing drugs with specific properties, or predicting the structure and function of genomic elements.
As deep learning and its applications thrive, an increasing amount of learning material has become available in recent years. However, it is not always immediately clear to beginners how to get started. In this article, I aim to share some hopefully straightforward ways to develop an intuition about neural networks. The article will be more helpful if you are familiar with linear regression and basic calculus. Additionally, it is accompanied by a Google Colab notebook. If you are interested in how the models introduced in this article are built (using PyTorch) and want more insight into their performance, I highly recommend checking it out!
What is machine learning/deep learning?
So, what is the general goal of machine learning or deep learning? In a nutshell, machine/deep learning trains computers to find functions that adequately represent complex relationships, e.g.,
- A function that takes in voice signals and outputs the spoken content (speech recognition)
- A function that reads in an image of a cat and classifies the input as a cat (image recognition)
- Or a function fed with the positions of all the stones that figures out the next move in a Go game (AlphaGo)
Broadly speaking, these functions fall into at least the following two types:
- Regression: Functions that predict a continuous quantity output, usually a scalar. One example is the prediction of PM2.5 tomorrow by a function that takes in the PM2.5 today, the temperature, the concentration of ozone, etc.
- Classification: Functions that predict a discrete class label. One example is to tell whether an email is spam, or to suggest the next move in a Go game using AlphaGo, with each position on the Go board being a class.
In this article, I will only talk about regression models, but hopefully this is enough for you to build some intuition about deep neural networks!
Problem to be solved
Here, our goal is to build a regression model that can predict COVID-19 daily new cases in the United States, given the past time series of daily new cases from January 23, 2020, to July 31, 2022 (921 days in total). This dataset was extracted from Our World in Data and then reformatted.
Our first regression model
To build a model for predicting COVID-19 daily new cases, we can follow a very general workflow described below.
Step 1: Define the model
Here, let’s start with the simplest regression model: a linear regression model $y=b+wx_1$, where
- $y$ is the estimated number of new cases on Day $n$
- $x_1$ (feature) is the number of new cases on Day $n-1$
- $w$ is the weight applied to the feature $x_1$
- $b$ is the bias
As can be seen, defining such a model requires decisions about which features to consider, and hence domain knowledge relevant to the problem. In deep learning, however, such feature engineering is generally not required, in contrast to classical machine learning methods like linear regression.
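For concreteness, here is a minimal PyTorch sketch of such a single-feature linear model (the notebook may define it slightly differently):

```python
import torch.nn as nn

# y = b + w * x1: nn.Linear(1, 1) holds exactly one weight w and one bias b,
# both of which will be learned during training.
model = nn.Linear(in_features=1, out_features=1)
```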
Step 2: Prepare datasets
When building a machine/deep learning model, we need to define the following 3 datasets:
- Training set: The dataset that the model is trained on. In our case, we will take the data before Day 800 as the training set.
- Validation set: The dataset split from the training set, which is used for validating the model during training. In our case, we consider 1 data point for validation for every 5 data points in the training set.
- Test set: The dataset that we will use to assess the trained model. Here we take the data after Day 800 as the test set.
Practically, this is generally done with classes like DataLoader and Dataset if PyTorch is used, in which some data preprocessing can be carried out as needed. In our case, we will simply normalize $x_1$ and $y$, divide the training set into batches (batch size = 128), and shuffle them. For more data preprocessing techniques, I recommend this article.
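As an illustration, below is a hedged sketch (not necessarily the exact logic used in the notebook) of how the splits and loaders could be set up with `TensorDataset` and `DataLoader`, assuming `x` and `y` are already normalized tensors:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical tensors: x holds the normalized feature(s) and y the normalized labels,
# with x.shape == (num_days, num_features) and y.shape == (num_days, 1).
def make_loaders(x, y, train_end=800, batch_size=128):
    # Data before Day `train_end` is used for training/validation; the rest for testing.
    x_train, y_train = x[:train_end], y[:train_end]
    x_test, y_test = x[train_end:], y[train_end:]

    # Hold out 1 of every 5 training points for validation.
    val_mask = torch.arange(len(x_train)) % 5 == 0
    train_set = TensorDataset(x_train[~val_mask], y_train[~val_mask])
    val_set = TensorDataset(x_train[val_mask], y_train[val_mask])
    test_set = TensorDataset(x_test, y_test)

    # Shuffle and batch only the training set.
    return (
        DataLoader(train_set, batch_size=batch_size, shuffle=True),
        DataLoader(val_set, batch_size=batch_size),
        DataLoader(test_set, batch_size=batch_size),
    )
```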
Step 3: Define the loss from training data
A loss function is a measure of the quality of the parameters ($w$ and $b$ in our case). It is generally defined by comparing the values predicted by the model ($y_n$) with the “true values” ($\hat{y_n}$, the labels). For example, below are some common loss functions.
- Mean absolute error (MAE): $$L(b, w)=\frac{1}{N}\sum_{n}|y_n-\hat{y_n}|$$
- Mean squared error (MSE): $$L(b, w)=\frac{1}{N}\sum_{n}(y_n-\hat{y_n})^2$$
- Cross entropy (given that $y_n$ and $\hat{y_n}$ are probability distributions): $$L(b, w)=-\sum_n y_n\log \hat{y_n}$$
In our case, we will adopt MAE as the loss function.
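In PyTorch, MAE is available as `nn.L1Loss`; here is a quick sanity check against the formula above, using made-up numbers:

```python
import torch
import torch.nn as nn

criterion = nn.L1Loss()                   # mean absolute error

y_pred = torch.tensor([2.0, 4.0, 6.0])    # model predictions (illustrative values)
y_true = torch.tensor([1.0, 4.0, 8.0])    # labels
print(criterion(y_pred, y_true))          # tensor(1.) == (1 + 0 + 2) / 3
```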
Step 4: Train/Optimize the model
By training a model, we often mean optimizing its parameters to minimize the loss function, i.e., in our case, finding $w'$ and $b'$ such that $w',b'=\arg\min_{w,b} L$, which, for a differentiable loss, requires $$\frac{\partial L}{\partial w}|_{w=w', b=b'}=\frac{\partial L}{\partial b}|_{w=w', b=b'}=0$$
To seek the point $(w, b)=(w', b')$, one common way is to start from an initial guess $(w^0, b^0)$ that is randomly chosen or determined by more sophisticated methods, and then update the parameters with an iterative scheme, such as the gradient descent method expressed as follows: $$w^n=w^{n-1}-\eta \frac{\partial L}{\partial w} |_{w=w^{n-1},b=b^{n-1}}$$
$$b^n=b^{n-1}-\eta \frac{\partial L}{\partial b}|_{w=w^{n-1},b=b^{n-1}}$$
where $\eta$ is the learning rate. In practice, we stop the iterations when the desired number of iterations is reached or when the validation loss has not decreased for a number of epochs (i.e., early stopping), since the derivatives are almost never exactly 0 in real-world problems. In our case, we use the stochastic gradient descent (SGD) method with a learning rate of 0.001. The training is terminated early if the validation loss does not decrease for 1000 epochs; otherwise, it stops after 5000 epochs.
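Putting Steps 1-4 together, a training loop along these lines might look like the sketch below (SGD with a learning rate of 0.001, early stopping with a patience of 1000 epochs, and a cap of 5000 epochs). Here, `model`, `criterion`, `train_loader`, and `val_loader` are assumed to come from the earlier steps, and the notebook's implementation may differ in the details:

```python
import torch

# Optimizer implementing the update rules above: w <- w - lr * dL/dw, b <- b - lr * dL/db.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

best_val, patience, max_epochs, counter = float("inf"), 1000, 5000, 0
for epoch in range(max_epochs):
    model.train()
    for x_batch, y_batch in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x_batch), y_batch)
        loss.backward()        # compute dL/dw and dL/db
        optimizer.step()       # apply the gradient descent update

    # Evaluate on the validation set (no gradients needed).
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(xb), yb).item() for xb, yb in val_loader)

    if val_loss < best_val:
        best_val, counter = val_loss, 0
    else:
        counter += 1
        if counter >= patience:   # early stopping
            break
```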
[Recommended materials]
- To see how gradient descent works and how the learning rate influences the result, I strongly recommend this interactive Google crash course.
- For a comprehensive review of gradient descent methods, please refer to this article by Ruder.
As a result, the training finished after 1330 epochs. The average test loss was around 55202, meaning the model is already moderately predictive given the large values of the labels (see the figures below). For more data analysis results, please refer to the notebook.
How about multiple linear regression (MLR) models?
Although our first model is already moderately predictive, we definitely want to improve its performance if possible! To achieve better performance, one could consider tuning hyperparameters or modifying the model based on the characteristics of the data. For example, the time series of daily new cases in our case actually has a 7-day period, which reflects the fact that the number of cases on weekends is generally much lower because fewer patients get tested then. Therefore, considering only the daily cases of the previous day might not be sufficient. Instead, we should consider Days $n-1$, $n-2$, up to $n-7$ (i.e., $y=b + \sum_{i=1}^7 w_i x_i$, which has 7 features instead of 1) or even more historical data!
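One possible way to construct such lagged features from the raw daily-case series is sketched below; the helper name and the toy series are made up for illustration and are not necessarily how the notebook does it:

```python
import numpy as np

# Turn a 1-D daily-case series into an MLR dataset: each row of X holds the cases on
# the k previous days (Days n-k through n-1), and y holds the cases on Day n.
def make_lagged_features(cases, k=7):
    X = np.stack([cases[i : len(cases) - k + i] for i in range(k)], axis=1)
    y = cases[k:]
    return X, y

cases = np.arange(10.0)          # toy series: 0, 1, ..., 9
X, y = make_lagged_features(cases, k=7)
print(X.shape, y.shape)          # (3, 7) (3,)
```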
To see how the number of features (how far back we trace the history) influences model performance, I've trained models that consider 7, 28, 56, or even 84 features, with all other hyperparameters kept the same. (Again, check the notebook for more details!)
As shown above, with the daily new cases in the past week considered, the model is indeed more predictive. If we consider the daily new cases in the past month (28 days), the performance is even better, with an average test loss of 25213, less than half of the test loss of our first model! However, if we consider the past 56 or 84 days, the model starts to overfit. That is, even though the training loss is driven lower, the model actually performs worse on data it has never seen. Therefore, the best MLR model we have is the one with 28 features!
How do we further improve the performance of the model?
One obvious disadvantage of linear regression models is that they assume a strictly increasing or decreasing relationship between the output $y$ and each of the input features $x_i$. As an example, our first trained model $\left(\frac{y-\bar{y}}{\sigma_y} \right)=0.9451\left(\frac{x_1-\bar{x_1}}{\sigma_{x_1}} \right)-0.0287$ assumes that the Z-score of $y$ is always just a scaled and shifted version of the Z-score of $x_1$, which might not always hold. This inability of the method/model to capture the true relationship is what we call the model bias. Although a larger number of features can indeed weaken the model bias, as in the MLR models explored above, we should consider other possibilities if we want to account for the fluctuations in the time series with more flexibility.
So how do we account for the fluctuations in the time series? To approach this, we can first consider a simpler example below, where we want to model $f(x)$.
As shown in the left panel of the figure above, the three segments of $f(x)$ are basically shifted versions of $g_1(x)$, $g_2(x)$, and $g_3(x)$, respectively, which are piecewise linear functions (or more specifically, hard sigmoid functions). Therefore, $f(x)$ can easily be modeled as the sum of $g_1(x)$, $g_2(x)$, and $g_3(x)$, shifted by a constant value. If we want to model $f(x)$ with smoother functions, we could consider sigmoid functions like $h_1(x)$, $h_2(x)$, and $h_3(x)$, whose general form is $$y=\frac{c}{1 + \exp(-(b+wx))}=c\cdot \text{sigmoid}(b+wx)$$ That is, we can define the model for approximating $f(x)$ as $$y = b + \sum_{i=1}^{3}c_i \cdot \text{sigmoid}(b_i + w_ix)$$
Back to our original problem, if we only consider 1 feature ($x_1$, the new cases of the previous day), the model can be defined as $y=b+\sum_{i=1}^n c_i \cdot \sigma(b_i+w_ix_1)$, where $\sigma$ is the sigmoid function and $n$ is the number of sigmoid functions, another hyperparameter that we need to decide. If multiple features are considered, we can define the model as follows: $$y=b+\sum_{i=1}^n c_i \cdot \sigma(b_i+\sum_{j=1}^m w_{ij} x_j)=b+{\bf c}^T \sigma({\bf b} + W {\bf x})$$ where $\bf x$, $\bf b$, and $\bf c$ are all column vectors, $W$ is a matrix, and $m$ is the number of features. If you find the term ${\bf b} + W {\bf x}$ not intuitive, try writing out its matrix form, e.g., the term ${\bf r} = {\bf b} + W {\bf x}$ of a model considering 3 features and using 2 sigmoid functions can be written as $$ \begin{bmatrix} r_1 \\ r_2 \end{bmatrix}= \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} + \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23}\end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$$
Note that we are not restricted to the sigmoid function! For example, we can also use rectified linear unit (ReLU) functions $y=c\cdot \max(0, b+wx)$. Two ReLU functions can be combined into one hard sigmoid function (see the left panel below), or can model other functions with more flexibility (see the right panel below). With ReLU functions, we could define the model as $y=b+\sum_{i=1}^n c_i \max(0, b_i + \sum_{j=1}^m w_{ij}x_j)$.
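A minimal PyTorch sketch of such a one-hidden-layer model could look like the following, where the number of units (100) is just an example and `nn.Sigmoid()` can be swapped in for `nn.ReLU()`:

```python
import torch.nn as nn

# Sketch of y = b + sum_i c_i * ReLU(b_i + sum_j w_ij x_j) with
# m = 28 input features and n = 100 ReLU units (both numbers are illustrative).
model = nn.Sequential(
    nn.Linear(28, 100),   # computes b_i + sum_j w_ij x_j for each of the n units
    nn.ReLU(),            # the activation function; replace with nn.Sigmoid() if desired
    nn.Linear(100, 1),    # computes b + sum_i c_i * (activated output)
)
```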
So I've already tried these cool functions for you! Specifically, the figure below shows models with 28 features and different numbers ($n$) of ReLU functions used to model the COVID-19 daily new cases. Generally, the ReLU regression models perform better than the MLR model, with the performance roughly improving as $n$ increases. However, if $n$ is increased to 500, the model starts to show signs of overfitting. With these models, the lowest test loss we could get is 22319. Importantly, in deep learning, functions like ReLU or sigmoid functions are called activation functions.
[Recommended materials]
- To see why ReLU is generally preferred to the sigmoid function, check this article.
- For other commonly used activation functions, please refer to this nicely written article!
We can make the model even more complicated! 😎
So we have made progress! At this point, we have learned 3 different types of models, including a simple linear regression model, MLR models, and models that use activation functions. And guess what? Now you’ve already seen at least 3 different kinds of (oversimplified) neural networks!
Specifically, a simple linear regression model like $y=ax+b$ or an MLR model like $y=a_1x_1 + a_2x_2 + a_3x_3 + b$ can be represented as panels A and B in the figure below, respectively; both only have an input layer and an output layer, with no activation functions. (Some people also like to think of this as a single-layer perceptron where the input feature $x$ goes through a linear activation.) As you might imagine, the models that we built in the previous section are actually neural networks with 1 hidden layer. For example, panel C in the figure below can represent the model that considers 5 features ($m=5$) and uses 3 ReLU functions ($n=3$) to predict a one-dimensional output (the number of COVID-19 daily new cases). In this case, the one and only hidden layer is where the 3 activation functions (corresponding to 3 nodes) come into play.
Now, based on the model in panel C shown above, we can make the model even more complicated if we want! For example, we can take the 3-dimensional output of the first hidden layer and use it as the 3 input features of the next layer of $k$ nodes that implement $k$ activation functions (e.g., sigmoid functions), where $k$ can be any positive integer. Mathematically, we can express this as ${\bf x}^{(1)}=\sigma({\bf b}^{(1)} + W^{(1)} {\bf x}^{(0)})$ and $y=b+{\bf c}^T \sigma({\bf b}^{(2)} + W^{(2)} {\bf x}^{(1)})$, where $\sigma$ can be any kind of activation function and each layer has its own weights and biases. The figure below shows an example with $k=4$ in the second hidden layer. In practice, we can arbitrarily increase the depth of the neural network, hence the name deep neural networks (DNNs)! Notably, the depth of a neural network is the number of layers with tunable weights (i.e., hidden layers and the output layer), so the network below is a 3-layer neural network.
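Sticking with PyTorch, a network like the one described above (3 ReLU units in the first hidden layer, $k=4$ sigmoid units in the second, and a 1-dimensional output) could be sketched as follows; the widths and activations here are illustrative choices rather than the notebook's exact architecture:

```python
import torch.nn as nn

# A 3-layer network for 28 input features: two hidden layers followed by the output layer.
model = nn.Sequential(
    nn.Linear(28, 3),    # first hidden layer: 3 units with ReLU activations
    nn.ReLU(),
    nn.Linear(3, 4),     # second hidden layer: k = 4 units with sigmoid activations
    nn.Sigmoid(),
    nn.Linear(4, 1),     # output layer
)
```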
Would multi-layer neural networks work better in our case? Well, as can be checked in our notebook, multi-layer neural networks in our case can hardly drive the test loss any lower, and they start to overfit very quickly (from the 4-layer neural network onward). While the neural networks we have learned about here are actually not the best choice for time-series forecasting, hopefully it is now a little clearer how a linear regression model can be generalized all the way to a deep neural network! If you want to get some hands-on coding experience building the models we've mentioned in this article, don't forget to check out our notebook!
This is the end of the article! 🎉🎉 If you enjoyed this article, you are welcome to share it or leave a comment below, so I will be more motivated to write more! Thank you for reading this far! 😃