Neural Networks

A concise neural network walk-through

Anthony J. Clark

\[ \def\i{{^{(i)}}} \def\vx{{\mathbf{x}}} \def\vy{{\mathbf{y}}} \def\vw{{\mathbf{w}}} \def\vb{{\mathbf{b}}} \def\vz{{\mathbf{z}}} \def\va{{\mathbf{a}}} \def\yhat{{\hat y}} \def\vyhat{{\mathbf{\hat y}}} \def\mae{{||\vyhat - \vy||_1}} \def\vhmse{{\frac{1}{2N} ||(\vyhat - \vy)^2||_1}} \def\vbce{{-\frac{1}{N}\sum_{i=1}^N (y\i \log{\yhat\i} + (1 - y\i)\log{(1-\yhat\i)})}} \]

1 Introduction

Any sufficiently advanced technology is indistinguishable from magic.

– Arthur C. Clarke

Goal: provide a concise walk-through of all fundamental neural network (including modern deep learning) techniques.

I will not discuss every possible analogy, angle, or topic here. Instead, I will provide links to external resources so that you can choose which topics you want to investigate more closely. I will provide minimal code examples when appropriate.

Useful prior knowledge:

1.1 Background

Artificial neurons date back to the 1940s and neural networks to the 1980s. These are not new techniques, but, surprisingly, we still have a lot to learn about how they work and how best to create them. Research into neural networks accelerated in the late 2000s as we found ourselves with more data and more compute power.

Neural networks (NN) are a type of machine learning (ML) technique. ML falls under the artificial intelligence (AI) umbrella. AI is a broad area, and it doesn’t hurt to think of it as including any system that appears to do something useful or complex.

You’ll find that techniques often start out as AI, but then we remove that label after we start to better understand them. ML is just one type of AI, and it comprises all techniques that automatically learn from data. NNs learn from data, and specifically they do so using very little input from the designer.

NNs are a subset of ML, which is a subset of AI.

All of this is a bit vague, so let’s discuss some specific applications. Maybe we want a NN to:

This is just a small subset of what we could do. Nearly all applications have the same basic flow:

General flow of data in a NN.

The core of the NN is the ability to take an input, perform some mathematical computations, and then produce an output. The “learning” part includes comparing the output to a known-to-be-correct output (aka the “label” or “target”) and then using this comparison to iteratively improve the NN.

This setup, where we know the correct output, is known as “supervised learning.” Later parts of this guide will touch on “unsupervised learning” and “reinforcement learning,” but it is safe to say that most ML applications are in the area of supervised learning.

Question: What might be the input, output, label, and criterion if we want an NN to distinguish between pictures of cats and pictures of dogs?
Answer: The input would be an image, the output would be a guess of cat or dog, the label would be the actual contents of the image, and the criterion would have something to do with whether the output guess was correct.

1.2 Additional Material

2 Ethics

A bit of the maker goes into that which they make.

– Unknown

How is ethics important to NN (and AI in general)? It can help us answer questions such as:

Question: Who is in charge of enforcing ethics in AI?
Answer: Everyone and no one. We do not have a special ethics force to guide us. The problem, of course, is that when everyone is responsible, nobody thinks they need to act.
Question: When should you start to consider ethical implications?

Answer: From the very beginning. This will make it easier to:

  • avoid pitfalls,
  • analyze results from an ethical lens,
  • avoid wasting time, and
  • ensure the system is ethical.

Being ethical sounds like it should be easy; however, we all come to the table with our own value systems, opinions, motivations, and power. What would you do if you were directed to build something you knew to be unethical? How does your answer change if your choices are to build or to quit?

2.1 Key Topics

This is not going to be an exhaustive discussion on ethics in AI and NNs. Instead, I’ll point you to the resources in the Additional Material section. The topics below are taken from Ethics — fastbook.

Topics to consider:

  1. Recourse and accountability: who is responsible (and liable) for the developed system? The user, developer, manager, owner, company, other?
  2. Feedback loops: does the system control creation of the next round of input data (such as a video recommendation system)?
  3. Bias: all systems have bias; what bias is in your system? Is the source of bias historical, from measurement, from aggregation, from the representation, other?
  4. Disinformation: can your system be used for nefarious goals?

2.2 Strategies

Here are some questions you can ask to prevent running into trouble:

When should you ask these questions? The Markkula Center for Applied Ethics recommends scheduling regular meetings in which you perform ethical risk sweeping. See their Ethical Toolkit for more information.

2.3 Additional Material

3 Data

Tidy datasets are all alike but every messy dataset is messy in its own way.

– Hadley Wickham

Perhaps the most important aspect of a neural network is the dataset. Let

\[\mathcal{D} = \{X, Y\}\]

denote a dataset comprising input features \(X\) and output targets \(Y\). Although \(X\) and \(Y\) can come in many shapes, I am going to be opinionated here and use a specific (and consistent) convention. Let’s use \(N\) to denote the size of the paired dataset. (Note, not all problems have output targets, but herein I am talking about supervised learning unless otherwise specified.)

We will frequently take a dataset and split it into examples used for training, validation, and evaluation. We’ll discuss these terms near the end of this section.

\(X\) is a matrix (indicated by capitalization) containing all features of all input examples. A single input example \(\vx\i\) is often represented as a column vector (indicated by boldface):

\[\vx\i = \begin{bmatrix} x\i_{1} \\ x\i_{2} \\ \vdots \\ x\i_{n_x} \\ \end{bmatrix} \]

where subscripts denote the feature index, \(n_x\) is the number of features, and the superscript \(i\) denotes that this is the \(i^{\mathit{th}}\) training example. We do not always put the input features into a column vector (see sec. 9 for more information), but it is fairly standard.

Each row in \(X\) is a single input example (also referred to as an instance or sample), and when you stack all \(N\) examples on top of each other (first transposing them into row vectors), you end up with:

\[X = \begin{bmatrix} \rule[.5ex]{1em}{0.4pt}\vx^{(1)T}\rule[.5ex]{1em}{0.4pt} \\ \rule[.5ex]{1em}{0.4pt}\vx^{(2)T}\rule[.5ex]{1em}{0.4pt} \\ \vdots \\ \rule[.5ex]{1em}{0.4pt}\vx^{(N)T}\rule[.5ex]{1em}{0.4pt} \\ \end{bmatrix} = \begin{bmatrix} x^{(1)}_{1} & x^{(1)}_{2} & \cdots & x^{(1)}_{n_x} \\ x^{(2)}_{1} & x^{(2)}_{2} & \cdots & x^{(2)}_{n_x} \\ \vdots & \vdots & \ddots & \vdots \\ x^{(N)}_{1} & x^{(N)}_{2} & \cdots & x^{(N)}_{n_x} \end{bmatrix} \]

We transpose each example column vector (i.e., \(\vx^{(i)T}\)) into a row vector so that the first dimension of \(X\) corresponds to the number of examples \(N\) and the second dimension is the number of features \(n_x\). Compare the column vector above to each row in the matrix.

Let’s denote matrix dimensions with \((r \times c)\) (the number of rows \(r\) by the number of columns \(c\) in the matrix). I will, in text and in code, refer to matrix dimensions as the “shape” of the matrix.

Question: What is the shape of \(X\)?
Answer: We say that \(\vx\i \in \mathcal{R}^{n_x}\) (each input example is \(n_x\) real values) and \(X \in \mathcal{R}^{N \times n_x}\). Therefore, the shape of \(X\) is \((N \times n_x)\).
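As a quick sketch of this convention (using PyTorch tensors, which we will meet properly in sec. 3.1, and made-up sizes):

import torch

N, nx = 100, 4          # made-up number of examples and features
X = torch.randn(N, nx)  # one example per row
print(X.shape)          # torch.Size([100, 4]), i.e., (N x n_x)
print(X[0].shape)       # torch.Size([4]), the features of one example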

\(Y\) contains the targets (also referred to as labels or the true/correct/actual/expected output values). Here is a single target column vector:

\[\vy\i = \begin{bmatrix} y\i_{1} \\ y\i_{2} \\ \vdots \\ y\i_{n_y} \\ \end{bmatrix} \]

And here is the entire target matrix including all examples:

\[Y = \begin{bmatrix} \rule[.5ex]{1em}{0.4pt}\vy^{(1)T}\rule[.5ex]{1em}{0.4pt} \\ \rule[.5ex]{1em}{0.4pt}\vy^{(2)T}\rule[.5ex]{1em}{0.4pt} \\ \vdots \\ \rule[.5ex]{1em}{0.4pt}\vy^{(N)T}\rule[.5ex]{1em}{0.4pt} \\ \end{bmatrix} = \begin{bmatrix} y^{(1)}_{1} & y^{(1)}_{2} & \cdots & y^{(1)}_{n_y} \\ y^{(2)}_{1} & y^{(2)}_{2} & \cdots & y^{(2)}_{n_y} \\ \vdots & \vdots & \ddots & \vdots \\ y^{(N)}_{1} & y^{(N)}_{2} & \cdots & y^{(N)}_{n_y} \end{bmatrix} \]

Question: What is the shape of \(Y\)?
Answer: The shape of \(Y\) is \((N \times n_y)\).

Let’s use the MNIST dataset as an example. This dataset comprises a training partition including 60,000 images and a validation partition including 10,000 images. Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each image depicts a single handwritten digit (a number in the range zero through nine). Here is a small sample of these images:

MNIST Sample. Image from Wikipedia.
Question: What is the shape of the training partition of the input \(X_{train}\)?
Answer: \(X_{train}\) is \((60000 \times 784)\): \[X = \begin{bmatrix} x^{(1)}_{1} & x^{(1)}_{2} & \cdots & x^{(1)}_{784} \\ x^{(2)}_{1} & x^{(2)}_{2} & \cdots & x^{(2)}_{784} \\ \vdots & \vdots & \ddots & \vdots \\ x^{(60000)}_{1} & x^{(60000)}_{2} & \cdots & x^{(60000)}_{784} \end{bmatrix} \] The first row includes all 784 pixels of the first training image, and subsequent rows likewise contain the pixel data for a single image.
Question: What is the shape of the training partition of the targets \(Y_{train}\)?
Answer: \(Y_{train}\) is \((60000 \times 10)\): \[Y = \begin{bmatrix} y^{(1)}_{1} & y^{(1)}_{2} & \cdots & y^{(1)}_{10} \\ y^{(2)}_{1} & y^{(2)}_{2} & \cdots & y^{(2)}_{10} \\ \vdots & \vdots & \ddots & \vdots \\ y^{(60000)}_{1} & y^{(60000)}_{2} & \cdots & y^{(60000)}_{10} \end{bmatrix} \] Each row in this matrix is one-hot encoded, meaning that exactly one item in each row is “1” and all other items in the row are “0”. Here is an example of a one-hot encoded target for an input image representing the digit “2”: \[y^T = \begin{bmatrix} 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0\end{bmatrix}\] For efficiency’s sake, we often represent a one-hot encoded vector using just the index of the “hot” item. For example, the previous vector can be represented by the integer 2.
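For example, PyTorch provides a helper for this encoding, torch.nn.functional.one_hot (a quick sketch):

import torch
import torch.nn.functional as F

label = torch.tensor(2)
print(F.one_hot(label, num_classes=10))
# tensor([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])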
Question: What are the shapes of \(X_{valid}\) and \(Y_{valid}\)?
Answer: \(X_{valid}\) and \(Y_{valid}\) are \((10000 \times 784)\) and \((10000 \times 10)\), respectively.

You might now wonder why we split a dataset into training/validation/evaluation partitions. It is reasonable to think that we would be better off using all 70,000 images to train a neural network. However, we need some method for measuring how well a model is performing. That is the purpose of the validation set: to measure performance.

If we measure performance directly on the training dataset, we might trick ourselves into thinking that the neural network will perform very well when it is eventually deployed as part of an application (for example, as a mobile app in which we convert an image of someone’s handwritten notes into a text document), when in reality the network might only perform well on the examples found in the training dataset. We will discuss this issue more in sec. 8 when we cover overfitting and generalization.

Similarly, the evaluation partition is only used to compare performance after hyper-parameter tuning, which we’ll discuss in sec. 12.

3.1 Loading MNIST Using PyTorch

We’ve discussed notation and general concepts, but how would we write this out in code? Here is an example of how to load the MNIST dataset using PyTorch.

   1 #!/usr/bin/env python
   2 
   3 from torch.utils.data import DataLoader
   4 from torchvision.datasets import MNIST
   5 from torchvision.transforms import Compose, Normalize, ToTensor
   6 
   7 # Location in which to store downloaded data
   8 data_dir = "../Data"
   9 
  10 # I used torch.std_mean to find the values given to Normalize
  11 # We will discuss normalization in section 4
  12 mnist_xforms = Compose([ToTensor(), Normalize((0.1307,), (0.3081,))])
  13 
  14 # Load data files (training and validation partitions)
  15 train_data = MNIST(root=data_dir, train=True, download=True, transform=mnist_xforms)
  16 valid_data = MNIST(root=data_dir, train=False, download=True, transform=mnist_xforms)
  17 
  18 # Data loaders provide an easy interface for interacting with data
  19 train_loader = DataLoader(train_data, batch_size=len(train_data))
  20 valid_loader = DataLoader(valid_data, batch_size=len(valid_data))
  21 
  22 # This odd bit of code forces the train loader to give us all inputs and targets
  23 X_train, y_train = next(iter(train_loader))
  24 X_valid, y_valid = next(iter(valid_loader))
  25 
  26 # Let's start by simply printing out some basic information
  27 print("Training input shape    :", X_train.shape)
  28 print("Training target shape   :", y_train.shape)
  29 print("Validation input shape  :", X_valid.shape)
  30 print("Validation target shape :", y_valid.shape)
Question: What do you expect to see as this program’s output?
Answer:
Training input shape    : torch.Size([60000, 1, 28, 28])
Training target shape   : torch.Size([60000])
Validation input shape  : torch.Size([10000, 1, 28, 28])
Validation target shape : torch.Size([10000])

This is slightly different from what we discussed. PyTorch expects us to use this dataset with a convolutional neural network. When we get to sec. 9 we’ll make more sense of this data format.
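If you want the \((60000 \times 784)\) matrix from our earlier discussion, you can flatten each image; a quick sketch using the X_train tensor from the listing above:

X_train_flat = X_train.reshape(-1, 28 * 28)
print(X_train_flat.shape)  # torch.Size([60000, 784])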

3.2 Similarity Digit Classifier

Before we get into training NNs, we will start with a non-ML classifier. This will provide a nice comparison, and show that ML must be learning something beyond simple comparisons.

Let’s try to solve the following problem:

Question: Given the MNIST dataset and also an image of an unknown digit, how would you decide which digit is represented in the unknown image?
Answer: One method would be to find an “average” image for the ten separate digits, and then compare the unknown image to the ten averages and assign the unknown label as that of the closest average image.

For reference, here is what the “average” looks like for each of the ten digits.

Average of the ten MNIST digits from the training dataset.

Before we show a solution, however, we should take a guess at how well a random guesser might perform.

Question: What percent of the time would you be correct in guessing digits if you were guessing at random?
Answer: If you are equally likely to guess any of the ten digits, then you would be right around 10% of the time \(\left(\frac{1}{10}\right)\). How might this change if you were to always guess the same thing? How about if the dataset has mostly ones and sevens?

And now some code for finding the most similar digit.

   1 #!/usr/bin/env python
   2 
   3 from math import inf
   4 from matplotlib import pyplot as plt
   5 
   6 import torch
   7 from torch.utils.data import DataLoader
   8 from torchvision.datasets import MNIST
   9 from torchvision.transforms import Compose, Normalize, ToTensor
  10 
  11 # Location in which to store downloaded data
  12 data_dir = "../Data"
  13 
  14 # I used torch.std_mean to find the values given to Normalize
  15 # We will discuss normalization in section 4
  16 mnist_xforms = Compose([ToTensor(), Normalize((0.1307,), (0.3081,))])
  17 
  18 # Load data files (training and validation partitions)
  19 train_data = MNIST(root=data_dir, train=True, download=True, transform=mnist_xforms)
  20 valid_data = MNIST(root=data_dir, train=False, download=True, transform=mnist_xforms)
  21 
  23 # Data loaders provide an easy interface for interacting with data
  23 train_loader = DataLoader(train_data, batch_size=len(train_data))
  24 valid_loader = DataLoader(valid_data, batch_size=len(valid_data))
  25 
  26 # This odd bit of code forces the train loader to give us all inputs and targets
  27 X_train, y_train = next(iter(train_loader))
  28 X_valid, y_valid = next(iter(valid_loader))
  29 
  30 # Let's get the average for each digit based on all training examples
  31 digit_averages = {}
  32 for digit in range(10):
  33     digit_averages[digit] = X_train[y_train == digit].mean(dim=0).squeeze()
  34 
  35 
  36 # Next up we need to compare "unknown" images to our average images
  37 def get_most_similar(image: torch.Tensor, averages: dict):
  38     """Compare the image to each of the averaged images.
  39 
  40     Args:
  41         image (torch.Tensor): an image represented as a tensor
  42         averages (dict): a dictionary of averaged images
  43 
  44     Returns:
  45         the most similar label
  46     """
  47     closest_label = None
  48     closest_distance = inf
  49     for label in averages:
  50         distance = (image - averages[label]).abs().mean()
  51         if distance < closest_distance:
  52             closest_label = label
  53             closest_distance = distance
  54     return closest_label
  55 
  56 
  57 # Now we can get the most similar label for each validation image
  58 num_correct = 0
  59 for image, label in zip(X_valid, y_valid):
  60     num_correct += label == get_most_similar(image, digit_averages)
  61 
  62 print(f"Percent guessed correctly: {num_correct/len(X_valid)*100:.2f}%")
Question: Take a guess at the accuracy of our similarity-based model.
Answer: This model is correct about 66.85% of the time.

4 Single Neuron

A single neuron in the brain is an incredibly complex machine that even today we don’t understand. A single “neuron” in a neural network is an incredibly simple mathematical function that captures a minuscule fraction of the complexity of a biological neuron.

– Andrew Ng

When our model is a single neuron we can only produce a single output, so \(n_y=1\) for this section. Sticking with our MNIST digits example from above, we could train a single neuron to distinguish between two different classes of digits (e.g., “1” vs “7”, “0” vs “non-zero”, etc.).

4.1 Notation and Diagram

Here is a diagram representing a single neuron (as we’ll see later, some neural networks are just many of these neurons interconnected):

A neuron model with separate nodes for linear and activation computations.

The diagram represents the following equations (note that I removed the parenthesized superscripts from the diagram to make it a bit easier to read):

\[\begin{align} z\i &= \sum_{k=1}^{n_x} x_k\i w_k + b\\ a\i &= g(z\i) \end{align}\]

For these two equations:

Question: Why do \(w_k\) and \(b\) not have superscripts?
Answer: The parameters \(w_k\) and \(b\) do not change as the input \(x_k\i\) changes. These parameters are the neuron, and they are used to produce the output \(\yhat\i\) for any given input; we use the same parameter values regardless of input.

For this model, we want to find parameters \(w_k\) and \(b\) such that the neuron outputs \(\yhat\i \approx y\i\) for any input. Before we discuss optimization we should take a moment to code up this single neuron model.

(Below is a more common representation of a neuron model. The image above separates the linear and activation components into distinct nodes, but it is more common to show them together as below.)

A neuron model.

4.2 Neuron with Python Standard Libraries

This code does not include any “learning” (i.e., optimization), but it is worth showing just how simple it is to write a single neuron from scratch. Most of the code below is necessary only to create some fake input data.

   1 #!/usr/bin/env python
   2 
   3 from math import exp
   4 from random import gauss
   5 
   6 
   7 def sigmoid(z: float) -> float:
   8     """The sigmoid/logistic activation function."""
   9     return 1 / (1 + exp(-z))
  10 
  11 
  12 # The number of examples in our dataset
  13 N = 100
  14 
  15 # Randomly generate some input data
  16 nx = 4
  17 x1 = [gauss(0, 1) for _ in range(N)]
  18 x2 = [gauss(0, 1) for _ in range(N)]
  19 x3 = [gauss(0, 1) for _ in range(N)]
  20 x4 = [gauss(0, 1) for _ in range(N)]
  21 
  22 # Generate random neuron parameters
  23 w1 = gauss(0, 1)
  24 w2 = gauss(0, 1)
  25 w3 = gauss(0, 1)
  26 w4 = gauss(0, 1)
  27 b = 0
  28 
  29 # Compute neuron output for each of the N examples
  30 for x1i, x2i, x3i, x4i in zip(x1, x2, x3, x4):
  31     zi = w1 * x1i + w2 * x2i + w3 * x3i + w4 * x4i + b
  32     ai = sigmoid(zi)

In this code listing I use the sigmoid activation function (when not using a specific activation function we use \(g(\mathord{\cdot})\) in most equations). This function is plotted below.

Sigmoid activation function and its derivative.

Some nice properties of this function include:

We often use the sigmoid activation function for binary classification (i.e., models trained to predict whether an input belongs to one of two classes). If the output is \(≤0.5\) we say the neuron predicts class \(A\); otherwise it predicts class \(B\).
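In code, this thresholding is a one-liner (a quick sketch, where yhat holds sigmoid outputs and 0/1 stand in for classes \(A\)/\(B\)):

import torch

yhat = torch.tensor([0.2, 0.7, 0.5])
predictions = (yhat > 0.5).long()  # 0 = class A (output <= 0.5), 1 = class B
print(predictions)                 # tensor([0, 1, 0])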

Question: Can you think of any downsides for this function (hint: look at the derivative curve)?
Answer: While this function was once widely used, it has fallen out of favor because it can often lead to slower learning due to small derivative values for any input \(z\) outside of the range [-4, 4]. ReLU is a more commonly used activation function for hidden layer neurons.

4.3 The Dot-Product

We compute \(z\i\) using a summation, but we can express this same bit of math using the dot-product from linear algebra.

\[ z\i = \sum_{k=1}^{n_x} x_k\i w_k + b = \vx^{(i)T} \vw + b \]

The \(\vx^{(i)T} \vw\) part of the equation computes the dot-product between \(\vx^{(i)T}\) and \(\vw\). We need to transpose \(\vx\i\) to make the dimensions work (i.e., we need to multiply a row vector by a column vector).

This not only turns out to be easier to write and type, but it is also more efficiently computed by a neural network library. The code listing below uses PyTorch to compute \(z\i\) (zi). Libraries like PyTorch and TensorFlow make use of both vectorized CPU instructions and graphics cards (GPUs) to quickly compute the output of matrix multiplications.

 1   1  #!/usr/bin/env python
 2   2  
 3      from math import exp
 4      from random import gauss
 5      
 6      
 7      def sigmoid(z: float) -> float:
 8          """The sigmoid/logistic activation function."""
 9          return 1 / (1 + exp(-z))
     3  import torch
 10  4  
 11  5  
 12  6  # The number of examples in our dataset
 13  7  N = 100
 14  8  
 15  9  # Randomly generate some input data
 16  10 nx = 4
 17     x1 = [gauss(0, 1) for _ in range(N)]
 18     x2 = [gauss(0, 1) for _ in range(N)]
 19     x3 = [gauss(0, 1) for _ in range(N)]
 20     x4 = [gauss(0, 1) for _ in range(N)]
     11 X = torch.randn(N, nx)
 21  12 
 22  13 # Generate random neuron parameters
 23     w1 = gauss(0, 1)
 24     w2 = gauss(0, 1)
 25     w3 = gauss(0, 1)
 26     w4 = gauss(0, 1)
     14 w = torch.randn(nx)
 27  15 b = 0
 28  16 
 29  17 # Compute neuron output for each of the N examples
 30     for x1i, x2i, x3i, x4i in zip(x1, x2, x3, x4):
 31         zi = w1 * x1i + w2 * x2i + w3 * x3i + w4 * x4i + b
 32         ai = sigmoid(zi)
     18 for xi in X:
     19     zi = xi @ w + b
     20     ai = torch.sigmoid(zi)

The code snippet above shows a diff between the previous code snippet and an updated one using the dot product. You will see many diffs throughout this document. The key points are: (1) red indicates text or entire lines that have been removed, and (2) green indicates updated or newly added lines.

We do not need to transpose xi in code because when we iterate through X we get row vectors. As it happens, we can improve efficiency even further.

4.4 Vectorizing Inputs

In addition to using a dot-product in place of a summation, we can use a matrix multiplication in place of looping over all examples in the dataset. In the two equations below we perform a matrix multiplication that computes the output of the network for all examples at once. A neural network library can turn this into highly efficient CPU or GPU operations.

\[\begin{align} \vz &= X \vw + \mathbf{1} b \\ \va &= g(\vz) \end{align}\]

 15  15 b = 0
 16  16 
 17  17 # Compute neuron output for each of the N examples
 18     for xi in X:
 19         zi = xi @ w + b
 20         ai = torch.sigmoid(zi)
     18 z = X @ w + b
     19 yhat = torch.sigmoid(z)
Question: What are the dimensions of \(\vz\) and \(\va\) (aka, \(\vyhat\))?

Answer: We are computing a single output value for each input, so the shapes of these vectors are \((N \times 1)\). PyTorch will treat these as arrays with \(N\) elements instead of as column vectors. \[\begin{align} \vz &= \begin{bmatrix} \vx^{(1)T} \vw + b \\ \vx^{(2)T} \vw + b \\ \vdots \\ \vx^{(N)T} \vw + b \\ \end{bmatrix} \\ \va &= \begin{bmatrix} g(z^{(1)}) \\ g(z^{(2)}) \\ \vdots \\ g(z^{(N)}) \\ \end{bmatrix} \end{align}\]

In the code snippet above, a matrix multiplication is indicated in PyTorch using the @ symbol (a * is used for element-wise multiplications). A key to understanding matrix math is to examine the shapes of all matrices involved. Above, \(X\) has a shape of \((N \times n_x)\), \(\vw\) has a shape of \((n_x \times 1)\), and \(b\) is a scalar multiplied by an appropriately-shaped matrix of all ones (so that we can add \(b\) to each element of the \(X\vw\) result). Inner dimensions (the last dimension of the left matrix and the first dimension of the right matrix) must be the same for any valid matrix multiplication.

In the code snippet, the scalar \(b\) is added element-wise to every element in the final matrix due to broadcasting (this is a common library feature, not necessarily standard linear algebra).
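You can check the shapes and the broadcast of b with a few lines (a quick sketch with made-up sizes):

import torch

N, nx = 5, 3
X = torch.randn(N, nx)
w = torch.randn(nx)
b = 2.0
z = X @ w + b   # X @ w has shape (5,); the scalar b is added to every element
print(z.shape)  # torch.Size([5])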

So far, we have random parameters and we ignore the output. But what if we want to train the neuron so that the output mimics a real function or process? The next subsection tackles this very problem.

4.5 Optimization with Batch Gradient Descent

We must find values for parameters \(\vw\) and \(b\) to make \(\yhat\i \approx y\i\). As you might expect from the title of this subsection, we are going to use gradient descent to optimize the parameters. This means that we are going to need an objective function (something to minimize) and to compute some derivatives.

But what is an appropriate objective function (I’ll refer to this as the loss function going forward)? How about the mean-difference?

\[ℒ(\vyhat, \vy) = \sum_{i=1}^N (\yhat\i - y\i) \quad \color{red}{\text{Don't use this loss function.}}\]

Question: What is problematic about this loss function?

Answer:

Let’s start by looking at the output of the function for different values of the inputs.

\(\yhat\i\)   \(y\i\)   \(\yhat\i - y\i\)
0.1           0         0.1
0.1           1         -0.9
0.9           0         0.9
0.9           1         -0.1
The table indicates that loss can be positive or negative. But how should we interpret negative loss? We see that \(ℒ\) is minimized in row 2 of the table, but this is not an ideal result. The sign of loss is not helpful—as we’ll see shortly, we will use the sign of the derivative.

A quick “fix” for the above loss function is to change it into the mean-absolute-error (MAE):

\[\begin{align} ℒ(\vyhat, \vy) &= \sum_{i=1}^N |\yhat\i - y\i|\\ &= \mae \end{align}\]

The second line shows a vectorized version using the L1-norm, which is the sum of the absolute values of the given vector. MAE is a good choice if your dataset includes outliers. MAE is also simple to interpret: it is the average deviation between your model’s guess and the correct answer.

A common choice for a loss function when training a regression model is half mean-squared-error (half-MSE):

\[\begin{align} ℒ(\vyhat, \vy) &= \frac{1}{2N} \sum_{i=1}^N (\yhat\i - y\i)^2\\ &= \vhmse \end{align}\]

We are again using the L1-norm, but this time the vector we are norming is the element-wise squared values of the difference between the vectors \(\vyhat\) and \(\vy\). Interpreting Half-MSE is a bit harder than MAE—you should multiply the result by two and then take the square-root.

Question: Why might we compute the half-MSE instead of MSE or sum-square-error (SSE)?
Answer: The factor of \(\frac{1}{2}\) cancels the 2 that comes down when we take the derivative of the squared term. This scaling factor is unimportant since we will later multiply the gradient by a learning rate, which we can use to achieve whatever effect we want.
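To see the cancellation, differentiate half-MSE with respect to a single prediction \(\yhat\i\):

\[\frac{∂}{∂ \yhat\i} \frac{1}{2N} \sum_{j=1}^N (\yhat^{(j)} - y^{(j)})^2 = \frac{1}{2N} \cdot 2 (\yhat\i - y\i) = \frac{1}{N} (\yhat\i - y\i)\]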

The standard choice when performing classification with a neuron is binary cross-entropy (BCE):

\[\begin{align} ℒ(\vyhat, \vy) &= \vbce\\ &= -\text{mean}_0\left(\vy \cdot \log{\vyhat} + (1 - \vy) \cdot \log{(1 - \vyhat)}\right) \end{align}\]

In the vectorized version of BCE, I’ve used the non-standard \(\text{mean}_0\) notation to indicate that we’re taking the average across the rows, dimension zero. This is closer to the code that you’ll actually write. Very rarely will you want to put a summation loop in your code.
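Here is a vectorized BCE in PyTorch matching the \(\text{mean}_0\) form above (a sketch; in practice you might use the built-in torch.nn.functional.binary_cross_entropy):

import torch

yhat = torch.tensor([0.1, 0.1, 0.9, 0.9])
y = torch.tensor([0.0, 1.0, 0.0, 1.0])
bce = -(y * torch.log(yhat) + (1 - y) * torch.log(1 - yhat)).mean()
print(bce)  # tensor(1.2040)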

Question: Take some time to examine this loss function. What happens for various values of \(\yhat\i\), \(y\i\)?

Answer:

\(\yhat\i\)   \(y\i\)   \(\log{\yhat\i}\)   \(\log{(1-\yhat\i)}\)   \(ℒ\i\)
0.1           0         -2.3                -0.1                    0.1
0.1           1         -2.3                -0.1                    2.3
0.9           0         -0.1                -2.3                    2.3
0.9           1         -0.1                -2.3                    0.1

The table shows that a larger difference between \(\yhat\i\) and \(y\i\) (rows 2 and 3) results in a larger loss, which is exactly what we’d like to see.

Let’s move forward using binary cross-entropy loss and the sigmoid activation function.

We can only reduce loss by adjusting parameters. It doesn’t make sense, for example, to minimize loss by changing the input values \(X\) or the output targets \(Y\). Take a look at the following fictitious loss landscape.

The effect on loss ℒ of adjusting parameter w_k.

The diagram above shows a curve for loss as a function of a single parameter, \(w_k\). For this figure, we’ll momentarily ignore that we might have dozens (or thousands or millions) of parameters. We want to find a new value for \(w_k\) such that loss is reduced. You might wonder why I said “loss is reduced” instead of “loss is minimized.” You might be familiar with techniques for finding an exact answer using an analytical (aka closed-form) solution.

Question: What should we do if we wanted to minimize loss with respect to the parameter using an analytical solution?

Answer: We should take the derivative, set it equal to zero, and then solve the resulting set of linear equations. Consider linear regression, which is very similar to our single neuron. Our model is:

\[\vyhat = X \mathbf{θ},\]

where \(θ\) is our vector of parameters. Here is our loss function (half-SSE):

\[ℒ(\vyhat, \vy) = \frac{1}{2} ||(\vyhat - \vy)^2||_1.\]

Now we can take the partial derivative of loss with respect to the parameters \(θ\). (Note that I substitute for \(\vyhat\) on the third line and write the final gradient as a column vector.)

\[\begin{align} \frac{∂ ℒ}{∂ \mathbf{θ}} &= \frac{∂}{∂ \mathbf{θ}} \frac{1}{2} ||(\vyhat - \vy)^2||_1 \\ &= (\vyhat - \vy)^T \frac{∂ \vyhat}{∂ \mathbf{θ}} \\ &= (X \mathbf{θ} - \vy)^T \frac{∂ X \mathbf{θ}}{∂ \mathbf{θ}} \\ &= (X \mathbf{θ} - \vy)^T X \\ &= X^T X \mathbf{θ} - X^T \vy \end{align}\]

We can now set this derivative to zero and solve for \(\mathbf{θ}\).

\[\begin{align} \frac{∂ ℒ}{∂ \mathbf{θ}} &= 0 \\ X^T X \mathbf{θ} - X^T \vy &= 0 \\ X^T X \mathbf{θ} &= X^T \vy \end{align}\]

And now, assuming that \(X^T X\) is invertible (i.e., that the columns of \(X\) are linearly independent), we can solve for \(\mathbf{θ}\).

\[\mathbf{θ}^* = (X^T X)^{-1} X^T \vy\]

We now have an optimal solution (called \(\mathbf{θ}^*\)) that minimizes loss. (See Ordinary least squares - Wikipedia for more details.)
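As a sketch of this solution in PyTorch (torch.linalg.solve and torch.linalg.lstsq are the library routines; the data here is made up for illustration):

import torch

# Fake regression data: y = X theta + noise
N, n = 100, 3
X = torch.randn(N, n)
theta_true = torch.tensor([[1.0], [-2.0], [0.5]])
y = X @ theta_true + 0.01 * torch.randn(N, 1)

# Normal equations: theta* = (X^T X)^{-1} X^T y
theta_normal = torch.linalg.solve(X.T @ X, X.T @ y)

# The same solution from a more numerically stable library routine
theta_lstsq = torch.linalg.lstsq(X, y).solution

print(theta_normal.squeeze(), theta_lstsq.squeeze(), sep="\n")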

For complex models, such as a neural network, analytical solutions are sometimes too slow or complicated to compute. Instead, we use an iterative (aka numerical) solution. You can think of numerical solutions as finding a good enough approximate solution as opposed to the exact correct solution. Surprisingly, the numerical solution is often more general than the exact solution—we’ll discuss this in later sections.

To determine how we should adjust parameters, we start the same way as in the analytical approach: take the partial derivative of loss with respect to each parameter. For the single neuron with binary cross-entropy loss and the sigmoid activation function, the chain rule in matrix form is as follows.

\[\begin{align} \frac{∂ ℒ}{∂ \vw} &= \frac{∂ ℒ}{∂ \vyhat} \frac{∂ \vyhat}{∂ \vz} \frac{∂ \vz}{∂ \vw} \\ &= \frac{1}{N} X^T (\vyhat - \vy)\\[20pt] \frac{∂ ℒ}{∂ b} &= \frac{∂ ℒ}{∂ \vyhat} \frac{∂ \vyhat}{∂ \vz} \frac{∂ \vz}{∂ b} \\ &= \frac{1}{N} \sum_{i=1}^N (\yhat\i - y\i) \end{align}\]

Don’t worry too much about the derivations or notation at this stage. We’ll go into more details when we follow the same process for a full neural network in the next section.

Question: Why is it necessary to apply the chain rule? And why did the chain rule appear as it does above?

Answer: First, we cannot directly compute the partial derivative of \(ℒ\) with respect to \(\vw\) (or \(b\)). Second, we only apply the chain rule to equations that have some form of dependency on the term in the first denominator (\(\vw\) and \(b\)). It is useful to look at the loss function when we substitute in values for \(\yhat\) and then \(z\).

\[ℒ(\vyhat, \vy) = -\frac{1}{N}\sum_{i=1}^N (y\i \log{σ(\vx^{(i)T}\vw + b)} + (1 - y\i)\log{(1-σ(\vx^{(i)T}\vw + b))})\]

In the above equation we can more easily see how the chain-rule comes into play. The parameter \(\vw\) is nested within a call to \(σ\) which is nested within a call to \(\log\) when computing \(\frac{∂ ℒ}{∂ \vw}\).

Question: What do we do with the partial derivatives \(\frac{∂ ℒ}{∂ \vw}\) and \(\frac{∂ ℒ}{∂ b}\)?

Answer: We use these terms to update model parameters.

\[\begin{align} \vw &:= \vw - η \frac{∂ ℒ}{∂ \vw} \\ b &:= b - η \frac{∂ ℒ}{∂ b} \end{align}\]

Question: What is the derivative of the sigmoid function, \(σ\)?

Answer:

\[\begin{align} \frac{d}{dz} σ(z) &= \frac{d}{dz} \left(\frac{1}{1 + e^{-z}}\right) \\ &= \frac{d}{dz} \left(1 + e^{-z} \right)^{-1} \\ &= -(1 + e^{-z})^{-2}(-e^{-z}) \\ &= \frac{e^{-z}}{\left(1 + e^{-z}\right)^2} \\ &= \frac{1}{1 + e^{-z}\ } \cdot \frac{e^{-z}}{1 + e^{-z}} \\ &= \frac{1}{1 + e^{-z}\ } \cdot \frac{e^{-z} + 1 - 1}{1 + e^{-z}} \\ &= \frac{1}{1 + e^{-z}\ } \cdot \left( \frac{1 + e^{-z}}{1 + e^{-z}} - \frac{1}{1 + e^{-z}} \right) \\ &= \frac{1}{1 + e^{-z}\ } \cdot \left( 1 - \frac{1}{1 + e^{-z}} \right) \\ &= σ(z) \cdot (1 - σ(z)) \end{align}\]
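We can sanity-check this identity numerically with PyTorch’s automatic differentiation (covered properly in sec. 5); a quick sketch:

import torch

z = torch.linspace(-4, 4, 9, requires_grad=True)
s = torch.sigmoid(z)
s.sum().backward()  # populates z.grad with ds/dz for each element
print(torch.allclose(z.grad, (s * (1 - s)).detach()))  # True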

With the two update equations shown in the previous answer we have everything we need to train our neuron model. Looking at these two equations you might wonder about the purpose of \(η\) (i.e., the “learning rate”). This factor enables us to tune how fast or slow we learn. If \(η\) is set too high we might not be able to learn, and if it is set too low we might learn prohibitively slowly. We will go into more details on optimization in sec. 7.

4.6 Neuron Batch Gradient Descent

Here is a complete example in which we train a neuron to classify images as either being of the digit 1 or the digit 7. Data processing details are hidden in the get_binary_mnist_one_batch function, but you can find that code in the repository for this guide.

   1 #!/usr/bin/env python
   2 
   3 from utilities import get_binary_mnist_one_batch, format_duration_with_prefix
   4 from timeit import default_timer as timer
   5 import torch
   6 
   7 
   8 def compute_accuracy(yhat, y):
   9     valid_N = y.shape[0]
  10     return 1 - (torch.round(yhat) - y).abs().sum() / valid_N
  11 
  12 
  13 # Get training and validation data loaders for classes A and B
  14 data_dir = "../../Data"
  15 classA, classB = 1, 7
  16 flatten = True
  17 train_X, train_y, valid_X, valid_y = get_binary_mnist_one_batch(
  18     data_dir, classA, classB, flatten
  19 )
  20 
  21 # Neuron parameters
  22 nx = 28 * 28
  23 w = torch.randn(nx) * 0.01
  24 b = torch.zeros(1)
  25 
  26 # Batch gradient descent hyper-parameters
  27 num_epochs = 4
  28 learning_rate = 0.01
  29 
  30 # Compute initial accuracy (should be around 50%)
  31 valid_yhat = torch.sigmoid(valid_X @ w + b)
  32 valid_accuracy = compute_accuracy(valid_yhat, valid_y)
  33 print(f"Accuracy before training: {valid_accuracy:.2f}")
  34 
  35 # Learn values for w and b that minimize loss
  36 for epoch in range(num_epochs):
  37 
  38     start = timer()
  39 
  40     # Make predictions given current parameters and then compute loss
  41     yhat = torch.sigmoid(train_X @ w + b)
  42     losses = -(train_y * torch.log(yhat) + (1 - train_y) * torch.log(1 - yhat))
  43 
  44     # Compute derivatives for w and b (dz is common to both derivatives)
  45     dz = yhat - train_y
  46     dw = (1 / train_y.shape[0]) * (dz @ train_X)
  47     db = dz.mean()
  48 
  49     # Update parameters
  50     w -= learning_rate * dw
  51     b -= learning_rate * db
  52 
  53     # Report on progress
  54     valid_yhat = torch.sigmoid(valid_X @ w + b)
  55     valid_accuracy = compute_accuracy(valid_yhat, valid_y)
  56 
  57     info = f"{epoch+1:>2}/{num_epochs}"
  58     info += f", Loss={losses.mean():0.1f}"
  59     info += f", Accuracy={valid_accuracy:.2f}"
  60     info += f", Time={format_duration_with_prefix(timer()-start)}"
  61     print(info)
Question: Which lines of code correspond to \(\frac{∂ ℒ}{∂ \vw}\) and \(\frac{∂ ℒ}{∂ b}\)?
Answer: Lines 46 and 47.
Question: What is an epoch?
Answer: It turns out that we might need to update our weights more than once to get useful results. Each time we update parameters based on all training examples we mark the end of an epoch. In the code above we iterate through four epochs.
Question: What do you expect to see for the output?

Answer:

Accuracy before training: 0.54
 1/4, Loss=0.7, Accuracy=0.97, Time=5.5 ms
 2/4, Loss=0.5, Accuracy=0.96, Time=4.8 ms
 3/4, Loss=0.4, Accuracy=0.96, Time=4.6 ms
 4/4, Loss=0.3, Accuracy=0.96, Time=4.4 ms

5 Neural Networks and Backpropagation

Once your computer is pretending to be a neural net, you get it to be able to do a particular task by just showing it a whole lot of examples.

– Geoffrey Hinton

Below is our first neural network (aka multi-layer perceptron, MLP). We’ll start by using this diagram to formulate terminology and conventions.

A two-layer neural network.

Notation:

Notice how we have all the same components as we did for the single neuron. We’ve just added additional notation to distinguish among layers and neurons in the same layer.

Question: Given some hypothetical deep neural network, how would you denote the linear computation of the third neuron in the fifth layer for training example 6123?

Answer: \[z_3^{[5](6123)}\]

  • \(z\)”: linear computation
  • \([5]\)” superscript: fifth layer
  • \((6123)\)” superscript: example 6123
  • \(3\)” subscript: third neuron

5.1 Vectorized Equations For a Neural Network

Let’s start with showing the notation for parameters from any layer \(l = 1, 2, ..., L\) where \(L\) is the number of layers in the network.

\[\begin{align} W^{[l]} &= \begin{bmatrix} w_{1,1}^{[l]} & w_{1,2}^{[l]} & \cdots & w_{1,n_{l-1}}^{[l]} \\ w_{2,1}^{[l]} & w_{2,2}^{[l]} & \cdots & w_{2,n_{l-1}}^{[l]} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n_l,1}^{[l]} & w_{n_l,2}^{[l]} & \cdots & w_{n_l,n_{l-1}}^{[l]} \end{bmatrix} \\ \vb^{[l]} &= \begin{bmatrix} b_{1}^{[l]} \\ b_{2}^{[l]} \\ \vdots \\ b_{n_l}^{[l]} \\ \end{bmatrix} \end{align}\]

Compare these equations to the diagram above. Notice how the top neuron in layer 1 would have its associated parameters in the first row of \(W^{[1]}\) and the first value in \(\vb^{[1]}\).

Next we have the vectorized linear and activation equations for each neuron in a layer (these are for all training examples):

\[\begin{align} Z^{[l]} &= A^{[l-1]} W^{[l]T} + \mathbf{1} \vb^{[l]T}\\ A^{[l]} &= g^{[l]}(Z^{[l]}) \end{align}\]

Question: Why do we have \(\mathbf{1} \vb^{[l]T}\)?

Answer: This ensures that the dimensions are correct between the added matrices. Try this out in Python:

import torch
N, nl = 10, 4
b = torch.randn(nl, 1)
ONE = torch.ones(N, 1)
print(ONE @ b.T)

Note that most neural network frameworks handle this for you in the form of broadcasting.

Question: What is the shape of \(Z^{[l]}\)?

Answer: \(Z^{[l]}\) is \((N \times n_l)\). \[Z^{[l]} = \begin{bmatrix} z_{1}^{[l](1)} & z_{2}^{[l](1)} & \cdots & z_{n_l}^{[l](1)} \\ z_{1}^{[l](2)} & z_{2}^{[l](2)} & \cdots & z_{n_l}^{[l](2)} \\ \vdots & \vdots & \ddots & \vdots \\ z_{1}^{[l](N)} & z_{2}^{[l](N)} & \cdots & z_{n_l}^{[l](N)} \end{bmatrix} \]

We compute this matrix by multiplying an \((N \times n_{l-1})\) matrix by an \((n_{l-1} \times n_l)\) matrix (the transposed parameter matrix) and adding an \((N \times n_l)\) matrix.

Question: What is the shape of \(A^{[l]}\)?

Answer: \(A^{[l]}\) is \((N \times n_l)\). \[\begin{align} A^{[l]} &= \begin{bmatrix} a_{1}^{[l](1)} & a_{2}^{[l](1)} & \cdots & a_{n_l}^{[l](1)} \\ a_{1}^{[l](2)} & a_{2}^{[l](2)} & \cdots & a_{n_l}^{[l](2)} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1}^{[l](N)} & a_{2}^{[l](N)} & \cdots & a_{n_l}^{[l](N)} \end{bmatrix} \\ \\ &= \begin{bmatrix} g_{1}^{[l]}(z_{1}^{[l](1)}) & g_{2}^{[l]}(z_{2}^{[l](1)}) & \cdots & g_{n_l}^{[l]}(z_{n_l}^{[l](1)}) \\ g_{1}^{[l]}(z_{1}^{[l](2)}) & g_{2}^{[l]}(z_{2}^{[l](2)}) & \cdots & g_{n_l}^{[l]}(z_{n_l}^{[l](2)}) \\ \vdots & \vdots & \ddots & \vdots \\ g_{1}^{[l]}(z_{1}^{[l](N)}) & g_{2}^{[l]}(z_{2}^{[l](N)}) & \cdots & g_{n_l}^{[l]}(z_{n_l}^{[l](N)}) \end{bmatrix} \\ \\ &= \begin{bmatrix} g_{1}^{[l]}(\va^{[l-1](1)} \vw_{1}^{[l]T} + b_{1}^{[l]}) & g_{2}^{[l]}(\va^{[l-1](1)} \vw_{2}^{[l]T} + b_{2}^{[l]}) & \cdots & g_{n_l}^{[l]}(\va^{[l-1](1)} \vw_{n_l}^{[l]T} + b_{n_l}^{[l]}) \\ g_{1}^{[l]}(\va^{[l-1](2)} \vw_{1}^{[l]T} + b_{1}^{[l]}) & g_{2}^{[l]}(\va^{[l-1](2)} \vw_{2}^{[l]T} + b_{2}^{[l]}) & \cdots & g_{n_l}^{[l]}(\va^{[l-1](2)} \vw_{n_l}^{[l]T} + b_{n_l}^{[l]}) \\ \vdots & \vdots & \ddots & \vdots \\ g_{1}^{[l]}(\va^{[l-1](N)} \vw_{1}^{[l]T} + b_{1}^{[l]}) & g_{2}^{[l]}(\va^{[l-1](N)} \vw_{2}^{[l]T} + b_{2}^{[l]}) & \cdots & g_{n_l}^{[l]}(\va^{[l-1](N)} \vw_{n_l}^{[l]T} + b_{n_l}^{[l]}) \end{bmatrix} \end{align}\]

You should also think about the shapes of \(\va^{[l-1](i)}\) and \(\vw_{j}^{[l]}\).
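Here is a quick shape check for an arbitrary layer in PyTorch (a sketch with made-up sizes and sigmoid as the activation):

import torch

N, n_prev, n_l = 100, 6, 4       # made-up sizes
A_prev = torch.randn(N, n_prev)  # A^[l-1]
W = torch.randn(n_l, n_prev)     # W^[l]
b = torch.randn(n_l)             # b^[l]
Z = A_prev @ W.T + b             # broadcasting plays the role of 1 b^T
A = torch.sigmoid(Z)
print(Z.shape, A.shape)          # both torch.Size([100, 4])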

5.2 Backpropagation

Just like for the single neuron, we want to find values for \(W^{[l]}\) and \(\vb^{[l]}\) (for \(l = 1, 2, ..., L\)) such that \(A^{[L]} \approx Y\) (\(A^{[L]}\) is another name for \(\hat Y\)). Instead of looking at a more general case, let’s work through gradient descent for the two-layer network above where we

  • use the sigmoid activation function in both layers, and
  • use binary cross-entropy loss.

You can imagine that we are performing multi-label classification. For this network, we need to compute these partial derivatives:

\[ \frac{∂ℒ}{∂ W^{[1]}}^①, \frac{∂ℒ}{∂ \vb^{[1]}}^②, \frac{∂ℒ}{∂ W^{[2]}}^③, \frac{∂ℒ}{∂ \vb^{[2]}}^④ \]

We are going to start at layer 2 and work backward through the network to layer 1. As we compute these derivatives, ask yourself: why do we work backward through the network?

This process of computing derivatives backward through the network is why it is referred to as backpropagation: we compute values and propagate them backward to earlier layers in the network. This is easier to see when viewing the compute graph. A compute graph depicts the flow of activations (during the forward pass) and gradients (during the backward pass) through the network.

Compute graph for two-layer network.

Notice how the input flows forward from top-to-bottom in the compute graph, but gradients flow backward (from bottom-to-top). This image corresponds to the network above if you rotate it 90 degrees anti-clockwise (mostly just so I had space for the image on this page).

5.2.1 Layer 2 Parameters

Let’s start with the terms labeled ③ and ④ above, which correspond to layer 2. The chain-rule requires us to derive three components.

\[\begin{align} \frac{∂ℒ}{∂ W^{[2]}}^③ &= \textcolor{blue}{\frac{∂ ℒ}{∂ A^{[2]}}} \textcolor{green}{\frac{∂ A^{[2]}}{∂ Z^{[2]}}} \frac{∂ Z^{[2]}}{∂ W^{[2]}}\\ \frac{∂ ℒ}{∂ \vb^{[2]}}^④ &= \textcolor{blue}{\frac{∂ ℒ}{∂ A^{[2]}}} \textcolor{green}{\frac{∂ A^{[2]}}{∂ Z^{[2]}}} \frac{∂ Z^{[2]}}{∂ \vb^{[2]}} \end{align}\]

These equations share the first two terms. In fact, we’ll see these again for the first layer, so it makes sense to give them their own symbol, \(∂_{Z^{[2]}}=\textcolor{blue}{{\frac{∂ ℒ}{∂ A^{[2]}}}}\textcolor{green}{\frac{∂ A^{[2]}}{∂ Z^{[2]}}}\). (You might notice that I am leaving out the \(\text{mean}_0\) operation from BCE; this is intentional, as it will be handled by a matrix multiplication and averaging in the partial derivatives below.)

\[\begin{align} \textcolor{blue}{\frac{∂ ℒ}{∂ A^{[2]}}} &= -\frac{∂}{∂ A^{[2]}} \left(Y \cdot \log{A^{[2]}} + (1 - Y) \cdot \log{\left(1 - A^{[2]}\right)}\right)\\ &= \left( \frac{1-Y}{1-A^{[2]}} - \frac{Y}{A^{[2]}} \right)\\[20pt] \textcolor{green}{\frac{∂ A^{[2]}}{∂ Z^{[2]}}} &= \frac{∂}{∂ Z^{[2]}} σ(Z^{[2]})\\ &= σ(Z^{[2]}) \cdot (1 - σ(Z^{[2]}))\\ &= A^{[2]} \cdot (1 - A^{[2]}) \end{align}\]

Now substituting to solve for \(∂_{Z^{[2]}}\).

\[\begin{align} ∂_{Z^{[2]}} &= \left(\frac{1-Y}{1-A^{[2]}} - \frac{Y}{A^{[2]}}\right) A^{[2]} \cdot (1 - A^{[2]})\\ &= (1-Y) \cdot A^{[2]} - Y \cdot (1 - A^{[2]})\\ &= A^{[2]} - Y \cdot A^{[2]} - Y + Y \cdot A^{[2]}\\ &= A^{[2]} - Y \end{align}\]

Next we can solve the third terms in equations ③ and ④.

\[\begin{align} \frac{∂ Z^{[2]}}{∂ W^{[2]}} &= \frac{∂}{∂ W^{[2]}} A^{[1]} W^{[2]T} + \mathbf{1} \vb^{[2]T}\\ &= A^{[1]}\\[20pt] \frac{∂ Z^{[2]}}{∂ \vb^{[2]}} &= \frac{∂}{∂ \vb^{[2]}} A^{[1]} W^{[2]T} + \mathbf{1} \vb^{[2]T}\\ &= \mathbf{1} \end{align}\]

And that leaves us with the following partial derivatives for ③ and ④.

\[\begin{align} \frac{∂ℒ}{∂ W^{[2]}}^③ &= \frac{1}{N} ∂_{Z^{[2]}}^T A^{[1]}\\ \frac{∂ ℒ}{∂ \vb^{[2]}}^④ &= \text{mean}_0 (∂_{Z^{[2]}}) \end{align}\]

5.2.2 Layer 1 Parameters

Now we can continue to layer 1 and derive equations for terms ① and ②.

\[\begin{align} \frac{∂ ℒ}{∂ W^{[1]}}^① &= \textcolor{blue}{\frac{∂ ℒ}{∂ A^{[2]}}} \textcolor{green}{\frac{∂ A^{[2]}}{∂ Z^{[2]}}} \textcolor{cyan}{\frac{∂ Z^{[2]}}{∂ A^{[1]}}} \textcolor{lime}{\frac{∂ A^{[1]}}{∂ Z^{[1]}}} \frac{∂ Z^{[1]}}{∂ W^{[1]}}\\ &= ∂_{Z^{[2]}} \frac{∂ Z^{[2]}}{∂ A^{[1]}} \frac{∂ A^{[1]}}{∂ Z^{[1]}} \frac{∂ Z^{[1]}}{∂ W^{[1]}}\\[20pt] \frac{∂ ℒ}{∂ \vb^{[1]}}^② &= \textcolor{blue}{\frac{∂ ℒ}{∂ A^{[2]}}} \textcolor{green}{\frac{∂ A^{[2]}}{∂ Z^{[2]}}} \textcolor{cyan}{\frac{∂ Z^{[2]}}{∂ A^{[1]}}} \textcolor{lime}{\frac{∂ A^{[1]}}{∂ Z^{[1]}}} \frac{∂ Z^{[1]}}{∂ \vb^{[1]}}\\ &= ∂_{Z^{[2]}} \textcolor{cyan}{\frac{∂ Z^{[2]}}{∂ A^{[1]}}} \textcolor{lime}{\frac{∂ A^{[1]}}{∂ Z^{[1]}}} \frac{∂ Z^{[1]}}{∂ \vb^{[1]}} \end{align}\]

All four derivations share the first two terms, \(∂_{Z^{[2]}}\). The first-layer parameters additionally share the next two terms. We’ll group the first four terms together just like we did for layer 2: \(∂_{Z^{[1]}}=∂_{Z^{[2]}}\textcolor{cyan}{\frac{∂ Z^{[2]}}{∂ A^{[1]}}}\textcolor{lime}{\frac{∂ A^{[1]}}{∂ Z^{[1]}}}\).

Let’s start by deriving this shared term.

\[\begin{align} \textcolor{cyan}{\frac{∂ Z^{[2]}}{∂ A^{[1]}}} &= \frac{∂}{∂ A^{[1]}} (A^{[1]} W^{[2]T} + \mathbf{1} \vb^{[2]T})\\ &= W^{[2]}\\[20pt] \textcolor{lime}{\frac{∂ A^{[1]}}{∂ Z^{[1]}}} &= \frac{∂}{∂ Z^{[1]}} σ(Z^{[1]})\\ &= σ(Z^{[1]}) \cdot (1 - σ(Z^{[1]}))\\ &= A^{[1]} \cdot (1-A^{[1]}) \end{align}\]

Now substituting to solve for \(∂_{Z^{[1]}}\).

\[∂_{Z^{[1]}} = ∂_{Z^{[2]}} W^{[2]} \cdot A^{[1]} \cdot (1 - A^{[1]})\]

We still have one term remaining for each of ① and ②.

\[\begin{align} \frac{∂ Z^{[1]}}{∂ W^{[1]}} &= \frac{∂}{∂ W^{[1]}} A^{[0]} W^{[1]T} + \mathbf{1} \vb^{[1]T}\\ &= A^{[0]}\\[20pt] \frac{∂ Z^{[1]}}{∂ \vb^{[1]}} &= \frac{∂}{∂ \vb^{[1]}} A^{[0]} W^{[1]T} + \mathbf{1} \vb^{[1]T}\\ &= \mathbf{1} \end{align}\]

And that leaves us with the following partial derivatives for ① and ②.

\[\begin{align} \frac{∂ℒ}{∂ W^{[1]}}^① &= \frac{1}{N} ∂_{Z^{[1]}}^T A^{[0]}\\ \frac{∂ ℒ}{∂ \vb^{[1]}}^② &= \text{mean}_0 (∂_{Z^{[1]}}) \end{align}\]

5.2.3 Parameter Update Equations

We can now write our update equations for all network parameters.

\[\begin{align} W^{[1]} &:= W^{[1]} - η \frac{∂ℒ}{∂ W^{[1]}} \\ &:= W^{[1]} - η \frac{1}{N} ∂_{Z^{[1]}}^T A^{[0]} \\[20pt] \vb^{[1]} &:= \vb^{[1]} - η \frac{∂ℒ}{∂ \vb^{[1]}} \\ &:= \vb^{[1]} - η\;\text{mean}_0 (∂_{Z^{[1]}}) \\[20pt] W^{[2]} &:= W^{[2]} - η \frac{∂ℒ}{∂ W^{[2]}} \\ &:= W^{[2]} - η \frac{1}{N} ∂_{Z^{[2]}}^T A^{[1]}\\[20pt] \vb^{[2]} &:= \vb^{[2]} - η \frac{∂ℒ}{∂ \vb^{[2]}} \\ &:= \vb^{[2]} - η\;\text{mean}_0 (∂_{Z^{[2]}}) \end{align}\]

Question: Do these update equations need to be altered if we want to change the loss function, activation functions, or network architecture?
Answer: Yes. Each of these factors plays a part in the derivations above.

5.3 Neural Network Batch Gradient Descent

Let’s put this together into an example similar to the single-neuron example from sec. 4.6. The training loop below applies the parameter update equations we just derived.

   1 #!/usr/bin/env python
   2 
   3 from timeit import default_timer as timer
   4 
   5 import torch
   6 
   7 from utilities import format_duration_with_prefix, get_binary_mnist_one_batch
   8 
   9 
  10 def compute_accuracy(prediction, target):
  11     valid_N = target.shape[0]
  12     return 1 - (torch.round(prediction) - target).abs().sum() / valid_N
  13 
  14 
  15 # Get training and validation data loaders for classes A and B
  16 data_dir = "../../Data"
  17 classA, classB = 1, 7
  18 flatten = True
  19 train_X, train_y, valid_X, valid_y = get_binary_mnist_one_batch(
  20     data_dir, classA, classB, flatten
  21 )
  22 
  23 # Neural network layer sizes for MNIST
  24 n0 = 28 * 28
  25 n1 = 2
  26 n2 = 1
  27 
  28 # Network parameters
  29 W1 = torch.randn(n1, n0)
  30 b1 = torch.randn(n1)
  31 W2 = torch.randn(n2, n1)
  32 b2 = torch.randn(n2)
  33 
  34 
  35 def model(A0):
  36     Z1 = A0 @ W1.T + b1
  37     A1 = torch.sigmoid(Z1)
  38     Z2 = A1 @ W2.T + b2
  39     A2 = torch.sigmoid(Z2)
  40     return Z1, A1, Z2, A2.squeeze()
  41 
  42 
  43 # Batch gradient descent hyper-parameters
  44 num_epochs = 4
  45 learning_rate = 0.01
  46 
  47 # Compute initial accuracy (should be around 50%)
  48 _, _, _, valid_preds = model(valid_X)
  49 valid_accuracy = compute_accuracy(valid_preds, valid_y)
  50 print(f"Accuracy before training: {valid_accuracy:.2f}")
  51 
  52 # Learn parameter values that minimize loss
  53 for epoch in range(num_epochs):
  54 
  55     start = timer()
  56 
  57     # Forward pass on all training examples
  58     Z1, A1, Z2, yhat = model(train_X)
  59 
  60     # Backpropagation using the update equations derived above
  61     N = train_y.shape[0]
  62     dZ2 = (yhat - train_y).unsqueeze(1)
  63     dW2 = (1 / N) * dZ2.T @ A1
  64     db2 = dZ2.mean(dim=0)
  65     dZ1 = (dZ2 @ W2) * A1 * (1 - A1)
  66     dW1 = (1 / N) * dZ1.T @ train_X
  67     db1 = dZ1.mean(dim=0)
  68 
  69     # Update parameters
  70     W1 -= learning_rate * dW1
  71     b1 -= learning_rate * db1
  72     W2 -= learning_rate * dW2
  73     b2 -= learning_rate * db2
  74 
  75     # Report on progress
  76     _, _, _, valid_preds = model(valid_X)
  77     valid_accuracy = compute_accuracy(valid_preds, valid_y)
  78 
  79     info = f"{epoch+1:>2}/{num_epochs}"
  80     info += f", Accuracy={valid_accuracy:.2f}"
  81     info += f", Time={format_duration_with_prefix(timer()-start)}"
  82     print(info)

5.4 Automatic Differentiation

Let’s agree we should avoid computing those derivatives by hand. The process is time consuming and error prone. Instead, let’s rely on a technique known as automatic differentiation, which is built into PyTorch and most machine learning frameworks.

An automatic differentiation library:

  1. Creates a compute graph from your tensor operations.
  2. Performs a topological sort on the compute graph.
  3. Computes gradients and backpropagates them to all tensors.

Let’s take a look at an example.

   1 #!/usr/bin/env python3
   2 
   3 import torch
   4 
   5 N, n0, n1, n2 = 20, 10, 7, 13
   6 
   7 # Some fake input
   8 A0 = torch.randn(N, n0)
   9 Y = torch.randn(N, n2)
  10 
  11 # Layer 1
  12 W1 = torch.randn(n1, n0, requires_grad=True)
  13 b1 = torch.randn(n1, requires_grad=True)
  14 temp1 = A0 @ W1.T
  15 Z1 = temp1 + b1
  16 A1 = torch.sigmoid(Z1)
  17 
  18 # Layer 2
  19 W2 = torch.randn(n2, n1, requires_grad=True)
  20 b2 = torch.randn(n2, requires_grad=True)
  21 temp2 = A1 @ W2.T
  22 Z2 = temp2 + b2
  23 A2 = torch.sigmoid(Z2)
  24 
  25 # Loss and backward propagation
  26 temp3 = A2 - Y
  27 temp4 = temp3 ** 2
  28 loss = temp4.mean()
  29 loss.backward()

Two key points from the listing above: (1) requires_grad=True tells PyTorch to create the compute graph and compute partial derivatives with respect to the given tensor, and (2) I’ve ensured that each line of code contains a single operation, which makes it easier to match with the diagram below (I’ve provided this one in a bit more detail).

Compute graph for two-layer network.

This diagram is (often) constructed dynamically as operations are performed. Edges indicate the flow of gradients in the backward direction. We start at graph source nodes (e.g., the “loss” node) and compute partial derivatives with respect to their inputs until we reach graph sinks (e.g., parameters). This diagram (and the corresponding code) maps directly to the hand-computed derivatives from the previous section. Take some time to see how they map to one another.
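Once loss.backward() has run, every tensor created with requires_grad=True holds its partial derivatives in a .grad attribute, and we can apply the update equations from earlier (a quick sketch continuing the listing above):

# Gradients have the same shapes as their tensors
print(W1.grad.shape)  # torch.Size([7, 10])
print(b2.grad.shape)  # torch.Size([13])

# One gradient descent step (no graph building while updating)
learning_rate = 0.01
with torch.no_grad():
    for param in (W1, b1, W2, b2):
        param -= learning_rate * param.grad
        param.grad.zero_()  # reset before the next backward pass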

If you’d like to see how an automatic differentiation library is coded, please take a look at my simple Match library, which tries to closely mimic the PyTorch interface.

5.4.1 Alternatives

In addition to the technique above, known as reverse mode automatic differentiation, you might also hear about:

  • forward mode automatic differentiation,
  • symbolic differentiation, and
  • numerical differentiation (finite differences).

These techniques are related, and each has its own advantages and deficiencies. Most modern libraries implement reverse mode automatic differentiation.

5.5 Why “Deep” Neural Networks?

5.6 The Role of an Activation Function

6 Gradient Descent

Optimizing a neural network follows this process:

  1. prepare dataset(s) (e.g., proxy, training, validation, evaluation, etc.),
  2. set hyperparameters (e.g., learning rate, number of epochs, etc.),
  3. create the model, and
  4. train the model.

We’ve discussed each of these areas in general, but now we’ll go into more details starting with the final step, training the model.

6.1 Batch Gradient Descent

All examples thus far have used batch gradient descent (BGD). All gradient descent methods are iterative, meaning we continually make small changes to the parameters until we are satisfied or run out of time. BGD looks something like this:

for each epoch
    1. compute gradient with respect to all examples
    2. average gradients across all examples
    3. update parameters using averaged gradients

In all variants of gradient descent, an epoch refers to the process by which we update the parameters with respect to all training examples. In batch gradient descent, we compute all gradients at once and average them across all examples, resulting in the parameters being updated a single time each epoch. This has the advantage of smoothing out the effect of any outliers and of leveraging the parallel nature of modern CPUs and GPUs. On the other hand, it can be a waste of resources (mainly time) to update the parameters only once each epoch.

6.2 Stochastic Gradient Descent

In stochastic gradient descent (SGD) we update parameters \(N\) times per epoch—once per example. This means that we update parameters more frequently than in BGD.

The stochastic part of SGD refers to the random shuffling of the examples each epoch. This tends to reduce loss “cycling,” where some repeated sequence of examples increases and then decreases loss.

for each epoch
    randomly shuffle all examples
    for each example
        1. compute gradient with respect to single example
        2. update parameters using gradient

Although we update the parameters more frequently, not all updates are good. Outliers will make the model perform worse in the general case. Moreover, SGD does not take advantage of parallel computations.

6.3 Mini-Batch Stochastic Gradient Descent

Mini-Batch SGD provides a middle ground. We chunk the input into some number of batches and take the average gradient over each batch.

for each epoch
    randomly distribute examples into batches
    for each batch
        1. compute gradient with respect to all examples in batch
        2. average gradients across all examples in batch
        3. update parameters using averaged gradients

This enables us to get the best of both worlds:

Question: What batch size turns Mini-Batch SGD into BGD? What batch size turns Mini-Batch SGD into SGD?
Answer: \(N\) and \(1\), respectively.
Question: Will all batches be the same size?
Answer: No. The last batch is frequently smaller than all other batches. It contains the leftovers.

The sketch below shows how few changes are needed to convert our BGD example into Mini-Batch SGD.
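This is a minimal sketch, assuming the parameters (w, b) and data from the sec. 4.6 listing; torch.randperm provides the random shuffle:

batch_size = 128  # a new hyper-parameter

for epoch in range(num_epochs):
    # Randomly distribute examples into batches
    indices = torch.randperm(train_X.shape[0])
    for start in range(0, len(indices), batch_size):
        batch = indices[start : start + batch_size]
        X_batch, y_batch = train_X[batch], train_y[batch]

        # Gradients averaged over this batch only
        yhat = torch.sigmoid(X_batch @ w + b)
        dz = yhat - y_batch
        dw = (1 / y_batch.shape[0]) * (dz @ X_batch)
        db = dz.mean()

        # Update parameters once per batch
        w -= learning_rate * dw
        b -= learning_rate * db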

7 Optimization Techniques

7.1 Momentum

7.1.1 Nesterov’s Accelerated Gradients

7.2 RMSProp

7.3 Adam

7.4 AMSGrad

8 Overfitting and Generalization

Being revised

9 Convolutional Neural Networks

Being revised

10 Recurrent Neural Networks

Being revised

11 Attention and Transformers

Being revised

12 Advanced Topics

Being revised

Terminology

Machine Learning

Artificial Intelligence (AI): computer systems that are capable of completing tasks that typically require a human. This is a moving bar–as something becomes easier for a computer, we tend to stop considering it as AI (how long until deep learning is not AI?).

Machine Learning (ML): learn a predictive model from data (e.g., deep learning and random forests). ML is related to data mining and pattern recognition.

Deep Learning (DL): learn a neural network model with two or more hidden layers.

Supervised Learning: learn a mapping from input features to output values using labeled examples (e.g., image classification).

Unsupervised Learning: extract relationships among data examples (e.g., clustering).

Reinforcement Learning (RL): learn a model that maximizes rewards provided by the environment (or minimize penalties).

Hybrid Learning: combine methods from supervised, unsupervised, and reinforcement learning (e.g., semi-supervised learning).

Classification: given a set of input features, produce a discrete output value (e.g., predict whether a written review is negative, neutral, or positive).

Regression: given a set of input features, produce a continuous output value (e.g., predict the price of a house from the square footage, location, etc.).

Clustering: a grouping of input examples such that those that are most similar are in the same group.

Model: (predictor, prediction function, hypothesis, classifier) a function that maps inputs to predictions, along with its learned parameters.

Example: (instance, sample, observation, training pair) a single training/validation/testing input (along with its label in the case of supervised learning).

Input: (features, feature vector, attributes, covariates, independent variables) values used to make predictions.

Channel: subset of an input–typically refers to the red, green, or blue values of an image.

Output: (label, dependent variable, class, prediction) a prediction provided by the model.

Linear Separability: two sets of inputs can be divided by a hyperplane (a line in the case of two dimensions). This is the easiest case for learning a binary classification.

Parameter: (weights and biases, beta, etc.) any model values that are learned during training.

Hyperparameter (learning rate, number of epochs, architecture, etc.): any value that affects training results but is not directly learned during training.

Neural Network Terms

Neural Network (NN): (multi-layer perceptron (MLP), artificial NN (ANN)) a machine learning model (very loosely) based on biological nervous systems.

Perceptron: a single layer, binary classification NN (only capable of learning linearly separable patterns).

Neuron: (node) a single unit of computation in a NN. A neuron typically refers to a linear (affine) computation followed by a nonlinear activation.

Activation: (activation function, squashing function, nonlinearity) a neuron function that provides a nonlinear transformation (see this Stack Exchange Post for some examples and equations).

Layer: many NNs are simply a sequence of layers, where each layer contains some number of neurons.

Input Layer: the input features of a NN (the first “layer”). These can be the raw values or scaled values–we typically normalize inputs or scale them to either [0, 1] or [-1, 1].

Hidden Layer: a NN layer for which we do not directly observe the values during inference (all layers that are not an input or output layer).

Output Layer: the final layer of a NN. The output of this layer is (are) the prediction(s).

Architecture: a specific instance of a NN, where the types of neurons and connectivity of those neurons are specified (e.g., VGG16, ResNet34, etc.). The architecture sometimes includes optimization techniques such as batch normalization.

Forward Propagation: the process of computing the output from the input.

Training: the process of learning model parameters.

Inference: (deployment, application) the process of using a trained model.

Dataset: (training, validation/development, testing) a set of data used for training a model. Typically a dataset is split into a set used for training (the training set), a set for computing metrics (the validation/development set), and a set for evaluation (the testing set).

Convolutional Neural Network (CNN): a NN using convolutional filters. These are best suited for problems where the input features have geometric properties–mainly images (see 3D Visualization of a Convolutional Neural Network).

Filter: a convolution filter is a matrix that can be used to detect features in an image; they will normally produce a two-dimensional output (see Image Kernels Explained Visually, Convolution Visualizer, and Receptive Field Calculator). Filters will typically have a kernel size, padding size, dilation amount, and stride.

Pooling: (average-pooling, max-pooling, pooling layer) a pooling layer is typically used to reduce the size of a filter output.

Autoencoder: a common type of NN used to learn new or compressed representations.

Recurrent Neural Network (RNN): a NN in which neurons can maintain an internal state or have backward connections, and thus exhibit temporal dynamics. A related model is the recursive neural network, which operates over tree structures.

Long Short-Term Memory (LSTM): a common type of RNN developed in part to deal with the vanishing gradient problem (see Understanding LSTM Networks and Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) (YouTube)).

Learning Terms

Loss: (loss function) a function that we minimize during learning. We take the gradient of loss with respect to each parameter and then move down the slope. Loss is frequently defined as the error for a single example in supervised learning.

Cost: (cost function) similar to loss, this is a function that we try to minimize. Cost is frequently defined as the sum of loss for all examples.

Generalization: how well a model extrapolates to unseen data.

Overfitting: how much the model has memorized characteristics of the training input (instead of generalizing).

Regularization: a set of methods meant to prevent overfitting. Regularization reduces overfitting by shrinking parameter values (larger parameters typically mean more overfitting).

Bias: when a model has larger-than-expected training and validation loss.

Variance: when a model has a much larger validation error compared to the training error (an indication of overfitting).

Uncertainty: some models can estimate a confidence in a given prediction.

Embedding: a vector representation of a discrete variable (e.g., a method for representing an English language word as an input feature).

Activation Terms

Affine: (affine layer, affine transformation) the combination of a linear transformation and a translation, e.g., \(\vw^T \vx + b\); often loosely called “linear” in deep learning.

Nonlinear: a function for which the change in the output is not proportional to the change in the input.

Sigmoid: (sigmoid curve, logistic curve/function) a common activation function that is mostly used in the output layer of a binary classifier. The gradient is small whenever the input value is too far from 0.

Sigmoid Activation Function

Hyperbolic Tangent: (tanh) another (formerly) common activation function (better than sigmoid, but typically worse than ReLU). The gradient is small whenever the input value is too far from zero.

Hyperbolic Tangent Activation Function

ReLU: (rectified linear unit, rectifier) the most widely used activation function.

ReLU Activation Function

Leaky ReLU: a slightly modified version of ReLU where there is a non-zero derivative when the input is less than zero.

Leaky ReLU Activation Function
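
As a compact reference, here are minimal NumPy definitions of the four activations above (the 0.01 slope for leaky ReLU is a common default, not a universal one):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def tanh(z):
        return np.tanh(z)

    def relu(z):
        return np.maximum(0.0, z)

    def leaky_relu(z, slope=0.01):
        return np.where(z > 0, z, slope * z)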

Softmax: (softmax function, softargmax) a standard activation function for the last layer of a multi-class NN classifier. It turns the outputs of several nodes into a probability distribution (see The Softmax function and its derivative).
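
For reference, given a vector of outputs \(\vz\) with components \(z_j\), the softmax is

\[ \mathrm{softmax}(\vz)_j = \frac{e^{z_j}}{\sum_k e^{z_k}} \]

so every output is positive and the outputs sum to one.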

Learning Techniques

Data Augmentation: the process of altering inputs each epoch thereby increasing the effective training set size.

Transfer Learning: use a trained model (or part of it) on inputs from a different distribution. Most frequently this also involves fine-tuning.

Fine-tuning: training/learning only a subset of all parameters (usually only those nearest the output layer).

Dropout: a regularization technique in which neurons are randomly zeroed out during training.
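
A minimal sketch of the common “inverted dropout” variant, in which surviving activations are rescaled during training so that inference requires no change (the drop probability p is a hyperparameter):

    import numpy as np

    def dropout(a, p=0.5, rng=np.random.default_rng()):
        # Zero each activation with probability p, then rescale the
        # survivors so the expected value is unchanged (training only).
        mask = rng.random(a.shape) >= p
        return a * mask / (1.0 - p)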

Batch Normalization: a technique that speeds up training by normalizing the values of hidden layers across input batches. Normalizing hidden neuron values keeps derivatives higher on average.
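
A hedged sketch of the training-time computation only (real implementations also track running statistics for use during inference; gamma and beta are learned parameters):

    import numpy as np

    def batch_norm(Z, gamma, beta, eps=1e-5):
        # Normalize each hidden unit across the batch (axis 0), then
        # rescale and shift with the learned parameters gamma and beta.
        Z_hat = (Z - Z.mean(axis=0)) / np.sqrt(Z.var(axis=0) + eps)
        return gamma * Z_hat + beta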

Attention: (attention mechanism, neural attention) a technique that enables a NN to focus on a subset of the input data (see Attention in Neural Networks and How to Use It).

Optimization

Gradient Descent (GD): (batch GD (BGD), stochastic GD (SGD), mini-batch GD) a first-order optimization algorithm that can be used to learn parameters for a model.

Backpropagation: an application of the calculus chain rule to compute the gradients of a NN’s loss with respect to its parameters.

Learning Rate: a hyperparameter that adjusts the training speed (too high will lead to divergence).

Vanishing Gradients: an issue for deeper NNs where gradients saturate (become close to zero) and training is effectively halted.

Exploding Gradients: an issue for deeper NNs where gradients accumulate and result in large updates causing gradient descent to diverge.

Batch: a subset of the input dataset used to update the NN parameters (as opposed to using the entire input dataset at once).

Epoch: each time a NN is updated using all inputs (whether all at once or using all batches).

Momentum: an SGD add-on that speeds up training when derivatives stay the same sign each update.
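
A minimal sketch of one common formulation (the velocity form; exact details vary by library):

    def momentum_step(w, v, g, lr=0.01, beta=0.9):
        # Gradients that keep the same sign accumulate into a larger
        # velocity, speeding up progress along consistent directions.
        v = beta * v + g
        w = w - lr * v
        return w, v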

AdaGrad: a variant of SGD with an adaptive learning rate (see Papers With Code: AdaGrad).

AdaDelta: a variant of SGD/AdaGrad (see Papers With Code: AdaDelta).

RMSProp: a variant of SGD with an adaptive learning rate (see Papers With Code: RMSProp).

Adam: a variant of SGD with momentum and scaling (see Papers With Code: Adam).

Automatic Differentiation (AD): a technique to automatically evaluate the derivative of a function.

Cross-Entropy Loss: (negative log likelihood (NLL), logistic loss) a loss function commonly used for classification.
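
For a binary classifier this takes the form

\[ \vbce \]

where \(y\i\) is the label and \(\yhat\i\) is the predicted probability for example \(i\).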

Backpropagation Through Time: (BPTT) a gradient-based optimization technique for recurrent neural networks.