CS231n Assignment 2: Fully-Connected Neural Nets

GitHub repo: https://github.com/ZJUFangzh/cs231n

Assignment 2 is mainly about building a modular neural network framework (working up to convolutional networks), plus basic TensorFlow usage.

First, we build the basic framework for a fully-connected neural network.

The two-layer network we built earlier was fairly simple, but once the model grows the code becomes hard to reuse, so it is worth building a proper neural network framework.

Each layer is generally split into two parts, forward and backward, handled layer by layer, so the two functions simply come in pairs:

def layer_forward(x, w):
    """ Receive inputs x and weights w """
    # Do some computations ...
    z = # ... some intermediate value
    # Do some more computations ...
    out = # the output

    cache = (x, w, z, out) # Values we need to compute gradients

    return out, cache
def layer_backward(dout, cache):
    """
    Receive derivative of loss with respect to outputs and cache,
    and compute derivative with respect to inputs.
    """
    # Unpack cache values
    x, w, z, out = cache

    # Use values in cache to compute derivatives
    dx = # Derivative of loss with respect to x
    dw = # Derivative of loss with respect to w

    return dx, dw
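
Since layer_forward / layer_backward above are only an interface sketch, here is a toy but fully runnable instance of the same pattern (not part of the assignment): a scalar "scale" layer out = x * w, chained twice, with one cache pushed per layer on the way forward and consumed in reverse on the way back.

def scale_forward(x, w):
    out = x * w
    cache = (x, w)
    return out, cache

def scale_backward(dout, cache):
    x, w = cache
    dx = dout * w   # d(out)/dx = w
    dw = dout * x   # d(out)/dw = x
    return dx, dw

x = 3.0
h1, c1 = scale_forward(x, 2.0)        # h1 = 6
out, c2 = scale_forward(h1, 4.0)      # out = 24
dh1, dw2 = scale_backward(1.0, c2)    # dh1 = 4, dw2 = 6
dx, dw1 = scale_backward(dh1, c1)     # dx = 8, dw1 = 12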

Affine layer: forward

In cs231n/layers.py, fill in the affine_forward function, i.e. the forward pass of a plain fully-connected layer.

def affine_forward(x, w, b):
    """
    Computes the forward pass for an affine (fully-connected) layer.

    The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N
    examples, where each example x[i] has shape (d_1, ..., d_k). We will
    reshape each input into a vector of dimension D = d_1 * ... * d_k, and
    then transform it to an output vector of dimension M.

    Inputs:
    - x: A numpy array containing input data, of shape (N, d_1, ..., d_k)
    - w: A numpy array of weights, of shape (D, M)
    - b: A numpy array of biases, of shape (M,)

    Returns a tuple of:
    - out: output, of shape (N, M)
    - cache: (x, w, b)
    """
    out = None
    ###########################################################################
    # TODO: Implement the affine forward pass. Store the result in out. You   #
    # will need to reshape the input into rows.                               #
    ###########################################################################
    # Flatten each example to a row of length D, then apply the affine map.
    out = x.reshape(x.shape[0], -1).dot(w) + b
    ###########################################################################
    #                            END OF YOUR CODE                             #
    ###########################################################################
    cache = (x, w, b)
    return out, cache
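
A quick shape check of that flatten-then-multiply, assuming the affine_forward above is importable (e.g. from cs231n.layers):

import numpy as np

x = np.random.randn(2, 3, 4)   # N = 2 examples, each of shape (3, 4), so D = 12
w = np.random.randn(12, 5)     # maps D = 12 inputs to M = 5 outputs
b = np.zeros(5)
out, cache = affine_forward(x, w, b)
print(out.shape)               # (2, 5)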

Then fill in the affine_backward function.

def affine_backward(dout, cache):
    """
    Computes the backward pass for an affine layer.

    Inputs:
    - dout: Upstream derivative, of shape (N, M)
    - cache: Tuple of:
      - x: Input data, of shape (N, d_1, ... d_k)
      - w: Weights, of shape (D, M)
      - b: Biases, of shape (M,)

    Returns a tuple of:
    - dx: Gradient with respect to x, of shape (N, d1, ..., d_k)
    - dw: Gradient with respect to w, of shape (D, M)
    - db: Gradient with respect to b, of shape (M,)
    """
    x, w, b = cache
    dx, dw, db = None, None, None
    ###########################################################################
    # TODO: Implement the affine backward pass.                               #
    ###########################################################################
    dx = dout.dot(w.T).reshape(x.shape)         # dL/dx, restored to the input shape
    dw = x.reshape(x.shape[0], -1).T.dot(dout)  # dL/dw = x_flat^T . dout
    db = np.sum(dout, axis=0)                   # dL/db sums dout over the minibatch
    ###########################################################################
    #                            END OF YOUR CODE                             #
    ###########################################################################
    return dx, dw, db
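
Since out = x_flat·w + b, the gradients are dx = dout·wᵀ reshaped back to x's shape, dw = x_flatᵀ·dout, and db sums dout over the batch. These formulas are easy to sanity-check with finite differences; a minimal check of db, assuming affine_forward and affine_backward above are importable (the same pattern works for dx and dw):

import numpy as np

np.random.seed(0)
x = np.random.randn(3, 4)
w = np.random.randn(4, 2)
b = np.random.randn(2)
dout = np.random.randn(3, 2)

out, cache = affine_forward(x, w, b)
dx, dw, db = affine_backward(dout, cache)

# Numerical gradient of the scalar "loss" sum(out * dout) with respect to b.
h = 1e-6
db_num = np.zeros_like(b)
for i in range(b.size):
    bp, bm = b.copy(), b.copy()
    bp[i] += h
    bm[i] -= h
    fp = np.sum(affine_forward(x, w, bp)[0] * dout)
    fm = np.sum(affine_forward(x, w, bm)[0] * dout)
    db_num[i] = (fp - fm) / (2 * h)
print(np.max(np.abs(db - db_num)))   # should be tiny, around 1e-9 or less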

Next come the relu_forward and relu_backward functions.

def relu_forward(x):
    """
    Computes the forward pass for a layer of rectified linear units (ReLUs).

    Input:
    - x: Inputs, of any shape

    Returns a tuple of:
    - out: Output, of the same shape as x
    - cache: x
    """
    out = None
    ###########################################################################
    # TODO: Implement the ReLU forward pass.                                  #
    ###########################################################################
    # Elementwise max(0, x).
    out = np.maximum(0, x)
    ###########################################################################
    #                            END OF YOUR CODE                             #
    ###########################################################################
    cache = x
    return out, cache
def relu_backward(dout, cache):
    """
    Computes the backward pass for a layer of rectified linear units (ReLUs).

    Input:
    - dout: Upstream derivatives, of any shape
    - cache: Input x, of same shape as dout

    Returns:
    - dx: Gradient with respect to x
    """
    dx, x = None, cache
    ###########################################################################
    # TODO: Implement the ReLU backward pass.                                 #
    ###########################################################################
    # Pass the gradient through only where the input was positive.
    dx = (x > 0) * dout
    ###########################################################################
    #                            END OF YOUR CODE                             #
    ###########################################################################
    return dx
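
A tiny worked example (assuming relu_forward / relu_backward above are importable): the upstream gradient only flows through entries whose input was positive.

import numpy as np

x = np.array([[-2.0, 3.0],
              [0.5, -1.0]])
dout = np.full_like(x, 10.0)
out, cache = relu_forward(x)
dx = relu_backward(dout, cache)
print(out)   # [[0.  3. ]
             #  [0.5 0. ]]
print(dx)    # [[ 0. 10.]
             #  [10.  0.]]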

Then a "sandwich" layer is defined in layer_utils.py, which simply chains an affine layer and a ReLU together.

def affine_relu_forward(x, w, b):
    """
    Convenience layer that performs an affine transform followed by a ReLU

    Inputs:
    - x: Input to the affine layer
    - w, b: Weights for the affine layer

    Returns a tuple of:
    - out: Output from the ReLU
    - cache: Object to give to the backward pass
    """
    a, fc_cache = affine_forward(x, w, b)
    out, relu_cache = relu_forward(a)
    cache = (fc_cache, relu_cache)
    return out, cache
def affine_relu_backward(dout, cache):
    """
    Backward pass for the affine-relu convenience layer
    """
    fc_cache, relu_cache = cache
    da = relu_backward(dout, relu_cache)
    dx, dw, db = affine_backward(da, fc_cache)
    return dx, dw, db
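
A quick round trip through the sandwich (assuming the two functions above are importable, e.g. from cs231n.layer_utils): its cache is just the two sub-caches bundled into a tuple and unpacked again on the way back.

import numpy as np

np.random.seed(1)
x = np.random.randn(4, 6)
w = np.random.randn(6, 5)
b = np.zeros(5)

out, cache = affine_relu_forward(x, w, b)
fc_cache, relu_cache = cache   # (x, w, b) from the affine part, its output from the ReLU part
dx, dw, db = affine_relu_backward(np.ones_like(out), cache)
print(out.shape, dx.shape, dw.shape, db.shape)   # (4, 5) (4, 6) (6, 5) (5,)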

With these basic functions in place, we can assemble a simple neural network in fc_net.py:

First finish the initialization, then call these building blocks inside loss to compute the loss, and finally compute the gradients.

class TwoLayerNet(object):
    """
    A two-layer fully-connected neural network with ReLU nonlinearity and
    softmax loss that uses a modular layer design. We assume an input dimension
    of D, a hidden dimension of H, and perform classification over C classes.

    The architecture should be affine - relu - affine - softmax.

    Note that this class does not implement gradient descent; instead, it
    will interact with a separate Solver object that is responsible for running
    optimization.

    The learnable parameters of the model are stored in the dictionary
    self.params that maps parameter names to numpy arrays.
    """

    def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10,
                 weight_scale=1e-3, reg=0.0):
        """
        Initialize a new network.

        Inputs:
        - input_dim: An integer giving the size of the input
        - hidden_dim: An integer giving the size of the hidden layer
        - num_classes: An integer giving the number of classes to classify
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - reg: Scalar giving L2 regularization strength.
        """
        self.params = {}
        self.reg = reg

        ############################################################################
        # TODO: Initialize the weights and biases of the two-layer net. Weights    #
        # should be initialized from a Gaussian with standard deviation equal to   #
        # weight_scale, and biases should be initialized to zero. All weights and  #
        # biases should be stored in the dictionary self.params, with first layer  #
        # weights and biases using the keys 'W1' and 'b1' and second layer weights #
        # and biases using the keys 'W2' and 'b2'.                                 #
        ############################################################################
        self.params['W1'] = np.random.randn(input_dim, hidden_dim) * weight_scale
        self.params['b1'] = np.zeros((hidden_dim,))
        self.params['W2'] = np.random.randn(hidden_dim, num_classes) * weight_scale
        self.params['b2'] = np.zeros((num_classes,))
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################


    def loss(self, X, y=None):
        """
        Compute loss and gradient for a minibatch of data.

        Inputs:
        - X: Array of input data of shape (N, d_1, ..., d_k)
        - y: Array of labels, of shape (N,). y[i] gives the label for X[i].

        Returns:
        If y is None, then run a test-time forward pass of the model and return:
        - scores: Array of shape (N, C) giving classification scores, where
          scores[i, c] is the classification score for X[i] and class c.

        If y is not None, then run a training-time forward and backward pass and
        return a tuple of:
        - loss: Scalar value giving the loss
        - grads: Dictionary with the same keys as self.params, mapping parameter
          names to gradients of the loss with respect to those parameters.
        """
        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the two-layer net, computing the    #
        # class scores for X and storing them in the scores variable.              #
        ############################################################################
        # affine - relu, then a final affine to produce the class scores.
        A1, A1_cache = affine_relu_forward(X, self.params['W1'], self.params['b1'])
        scores, out_cache = affine_forward(A1, self.params['W2'], self.params['b2'])
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # If y is None then we are in test mode so just return scores
        if y is None:
            return scores

        loss, grads = 0, {}
        ############################################################################
        # TODO: Implement the backward pass for the two-layer net. Store the loss  #
        # in the loss variable and gradients in the grads dictionary. Compute data #
        # loss using softmax, and make sure that grads[k] holds the gradients for  #
        # self.params[k]. Don't forget to add L2 regularization!                   #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        loss, dout = softmax_loss(scores, y)
        loss += 0.5 * self.reg * (np.sum(self.params['W1'] * self.params['W1']) +
                                  np.sum(self.params['W2'] * self.params['W2']))
        # Backprop through the last affine layer, then through the affine-relu pair.
        da1, dw2, db2 = affine_backward(dout, out_cache)
        grads['W2'] = dw2 + self.reg * self.params['W2']
        grads['b2'] = db2
        _, dw1, db1 = affine_relu_backward(da1, A1_cache)
        grads['W1'] = dw1 + self.reg * self.params['W1']
        grads['b1'] = db1
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads
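
Before handing the model to a Solver, a quick sanity check is useful (toy shapes chosen here for illustration, not the notebook's numbers): with tiny random weights and reg=0.0, the initial softmax loss should be close to log(C).

import numpy as np

np.random.seed(0)
model = TwoLayerNet(input_dim=5*5, hidden_dim=7, num_classes=10, reg=0.0)
X = np.random.randn(20, 5, 5)
y = np.random.randint(10, size=20)
loss, grads = model.loss(X, y)
print(loss, np.log(10))        # both roughly 2.3
print(sorted(grads.keys()))    # ['W1', 'W2', 'b1', 'b2']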

Back in the notebook, we use the Solver class that has already been written for us, with TwoLayerNet() as the model.

model = TwoLayerNet()
solver = None

##############################################################################
# TODO: Use a Solver instance to train a TwoLayerNet that achieves at least  #
# 50% accuracy on the validation set.                                        #
##############################################################################
# data = {
#     'X_train': X_train,
#     'y_train': y_train,
#     'X_val': X_val,
#     'y_val': y_val,
# }
solver = Solver(model, data,
                update_rule='sgd',
                optim_config={
                    'learning_rate': 1e-3,
                },
                lr_decay=0.9,
                num_epochs=10, batch_size=100,
                print_every=100)
solver.train()
##############################################################################
#                             END OF YOUR CODE                               #
##############################################################################
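
After training, the Solver keeps the histories around; the notebook then plots them along the lines of the following (loss_history, train_acc_history and val_acc_history are attributes of the provided Solver class):

import matplotlib.pyplot as plt

plt.subplot(2, 1, 1)
plt.title('Training loss')
plt.plot(solver.loss_history, 'o')
plt.xlabel('Iteration')

plt.subplot(2, 1, 2)
plt.title('Accuracy')
plt.plot(solver.train_acc_history, '-o', label='train')
plt.plot(solver.val_acc_history, '-o', label='val')
plt.plot([0.5] * len(solver.val_acc_history), 'k--')   # the 50% target line
plt.xlabel('Epoch')
plt.legend(loc='lower right')
plt.show()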

Once that works, we can build a multi-layer network in the same way, in the FullyConnectedNet class of fc_net.py. Don't worry about batchnorm and dropout for now; those functions will be implemented later.

class FullyConnectedNet(object):
    """
    A fully-connected neural network with an arbitrary number of hidden layers,
    ReLU nonlinearities, and a softmax loss function. This will also implement
    dropout and batch normalization as options. For a network with L layers,
    the architecture will be

    {affine - [batch norm] - relu - [dropout]} x (L - 1) - affine - softmax

    where batch normalization and dropout are optional, and the {...} block is
    repeated L - 1 times.

    Similar to the TwoLayerNet above, learnable parameters are stored in the
    self.params dictionary and will be learned using the Solver class.
    """

    def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
                 dropout=0, use_batchnorm=False, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32, seed=None):
        """
        Initialize a new FullyConnectedNet.

        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=0 then
          the network should not use dropout at all.
        - use_batchnorm: Whether or not the network should use batch normalization.
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed using
          this datatype. float32 is faster but less accurate, so you should use
          float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers. This
          will make the dropout layers deterministic so we can gradient check the
          model.
        """
        self.use_batchnorm = use_batchnorm
        self.use_dropout = dropout > 0
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in    #
        # the self.params dictionary. Store weights and biases for the first layer #
        # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
        # initialized from a normal distribution with standard deviation equal to  #
        # weight_scale and biases should be initialized to zero.                   #
        #                                                                          #
        # When using batch normalization, store scale and shift parameters for the #
        # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
        # beta2, etc. Scale parameters should be initialized to one and shift      #
        # parameters should be initialized to zero.                                #
        ############################################################################
        n_i_prev = input_dim
        for i, n_i in enumerate(hidden_dims):
            self.params['W' + str(i+1)] = np.random.randn(n_i_prev, n_i) * weight_scale
            self.params['b' + str(i+1)] = np.zeros((n_i,))
            # Scale/shift parameters are only needed when batchnorm is enabled.
            if self.use_batchnorm:
                self.params['gamma' + str(i+1)] = np.ones((n_i,))
                self.params['beta' + str(i+1)] = np.zeros((n_i,))

            n_i_prev = n_i

        self.params['W' + str(self.num_layers)] = np.random.randn(n_i_prev, num_classes) * weight_scale
        self.params['b' + str(self.num_layers)] = np.zeros((num_classes,))
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the mode
        # (train / test). You can pass the same dropout_param to each dropout layer.
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed

        # With batch normalization we need to keep track of running means and
        # variances, so we need to pass a special bn_param object to each batch
        # normalization layer. You should pass self.bn_params[0] to the forward pass
        # of the first batch normalization layer, self.bn_params[1] to the forward
        # pass of the second batch normalization layer, etc.
        self.bn_params = []
        if self.use_batchnorm:
            self.bn_params = [{'mode': 'train'} for i in range(self.num_layers - 1)]

        # Cast all parameters to the correct datatype
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)


    def loss(self, X, y=None):
        """
        Compute loss and gradient for the fully-connected net.

        Input / output: Same as TwoLayerNet above.
        """
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param['mode'] = mode
        if self.use_batchnorm:
            for bn_param in self.bn_params:
                bn_param['mode'] = mode

        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the fully-connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        #                                                                          #
        # When using batch normalization, you'll need to pass self.bn_params[0] to #
        # the forward pass for the first batch normalization layer, pass           #
        # self.bn_params[1] to the forward pass for the second batch normalization #
        # layer, etc.                                                              #
        ############################################################################
        # First version, without dropout or batchnorm:
        # A_prev = X
        # fc_mix_cache = []
        # for i in range(self.num_layers - 1):
        #     W, b = self.params['W' + str(i+1)], self.params['b' + str(i+1)]
        #     A, A_cache = affine_relu_forward(A_prev, W, b)
        #     A_prev = A
        #     fc_mix_cache.append(A_cache)
        # W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
        # ZL, ZL_cache = affine_forward(A_prev, W, b)
        # scores = ZL

        # Full version, with optional batchnorm and dropout:
        A_prev = X
        fc_mix_cache = []
        drop_cache = []
        for i in range(self.num_layers - 1):
            W, b = self.params['W' + str(i+1)], self.params['b' + str(i+1)]
            if self.use_batchnorm:
                gamma = self.params['gamma' + str(i+1)]
                beta = self.params['beta' + str(i+1)]
                A, A_cache = affine_bn_relu_forword(A_prev, W, b, gamma, beta, self.bn_params[i])
            else:
                A, A_cache = affine_relu_forward(A_prev, W, b)

            if self.use_dropout:
                A, drop_ch = dropout_forward(A, self.dropout_param)
                drop_cache.append(drop_ch)
            A_prev = A
            fc_mix_cache.append(A_cache)
        W, b = self.params['W' + str(self.num_layers)], self.params['b' + str(self.num_layers)]
        ZL, ZL_cache = affine_forward(A_prev, W, b)
        scores = ZL
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # If test mode return early
        if mode == 'test':
            return scores

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the backward pass for the fully-connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        #                                                                          #
        # When using batch normalization, you don't need to regularize the scale   #
        # and shift parameters.                                                    #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        # First version, without dropout or batchnorm:
        # loss, dout = softmax_loss(scores, y)
        # # add the L2 regularization of the last layer to the loss
        # loss += 0.5 * self.reg * (np.sum(self.params['W' + str(self.num_layers)]**2))
        # # gradients of the last (affine) layer
        # dA_prev, dwl, dbl = affine_backward(dout, ZL_cache)
        # grads['W' + str(self.num_layers)] = dwl + self.reg * self.params['W' + str(self.num_layers)]
        # grads['b' + str(self.num_layers)] = dbl
        # # loop backwards through the hidden layers
        # for i in range(self.num_layers - 1, 0, -1):
        #     loss += 0.5 * self.reg * np.sum(self.params['W' + str(i)]**2)
        #     dA_prev, dw, db = affine_relu_backward(dA_prev, fc_mix_cache[i-1])
        #     grads['W' + str(i)] = dw + self.reg * self.params['W' + str(i)]
        #     grads['b' + str(i)] = db

        loss, dout = softmax_loss(scores, y)
        # Add the L2 regularization of the last layer to the loss.
        loss += 0.5 * self.reg * (np.sum(self.params['W' + str(self.num_layers)]**2))
        # Gradients of the last (affine) layer.
        dA_prev, dwl, dbl = affine_backward(dout, ZL_cache)
        grads['W' + str(self.num_layers)] = dwl + self.reg * self.params['W' + str(self.num_layers)]
        grads['b' + str(self.num_layers)] = dbl
        # Loop backwards through the hidden layers.
        for i in range(self.num_layers - 1, 0, -1):
            loss += 0.5 * self.reg * np.sum(self.params['W' + str(i)]**2)
            if self.use_dropout:
                dA_prev = dropout_backward(dA_prev, drop_cache[i-1])
            if self.use_batchnorm:
                dA_prev, dw, db, dgamma, dbeta = affine_bn_relu_backward(dA_prev, fc_mix_cache[i-1])
            else:
                dA_prev, dw, db = affine_relu_backward(dA_prev, fc_mix_cache[i-1])

            grads['W' + str(i)] = dw + self.reg * self.params['W' + str(i)]
            grads['b' + str(i)] = db

            if self.use_batchnorm:
                grads['gamma' + str(i)] = dgamma
                grads['beta' + str(i)] = dbeta
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads
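
The loss above also relies on an affine - batchnorm - relu sandwich, affine_bn_relu_forword / affine_bn_relu_backward (the "forword" spelling is the author's), which this post defines in layer_utils.py but does not show. A minimal sketch of what such a pair can look like, assuming the assignment's batchnorm_forward and batchnorm_backward are implemented:

def affine_bn_relu_forword(x, w, b, gamma, beta, bn_param):
    # affine -> batch normalization -> ReLU, bundling the three caches together
    a, fc_cache = affine_forward(x, w, b)
    a_bn, bn_cache = batchnorm_forward(a, gamma, beta, bn_param)
    out, relu_cache = relu_forward(a_bn)
    cache = (fc_cache, bn_cache, relu_cache)
    return out, cache

def affine_bn_relu_backward(dout, cache):
    # Undo the three steps in reverse order.
    fc_cache, bn_cache, relu_cache = cache
    da_bn = relu_backward(dout, relu_cache)
    da, dgamma, dbeta = batchnorm_backward(da_bn, bn_cache)
    dx, dw, db = affine_backward(da, fc_cache)
    return dx, dw, db, dgamma, dbeta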

Then a three-layer model is built to overfit 50 training examples:

# TODO: Use a three-layer Net to overfit 50 training examples.

num_train = 50
small_data = {
    'X_train': data['X_train'][:num_train],
    'y_train': data['y_train'][:num_train],
    'X_val': data['X_val'],
    'y_val': data['y_val'],
}

weight_scale = 1e-2
learning_rate = 8e-3
model = FullyConnectedNet([100, 100],
                          weight_scale=weight_scale, dtype=np.float64)
solver = Solver(model, small_data,
                print_every=10, num_epochs=20, batch_size=25,
                update_rule='sgd',
                optim_config={
                    'learning_rate': learning_rate,
                },
                )
solver.train()

plt.plot(solver.loss_history, 'o')
plt.title('Training loss history')
plt.xlabel('Iteration')
plt.ylabel('Training loss')
plt.show()

And a five-layer one:

# TODO: Use a five-layer Net to overfit 50 training examples.

num_train = 50
small_data = {
    'X_train': data['X_train'][:num_train],
    'y_train': data['y_train'][:num_train],
    'X_val': data['X_val'],
    'y_val': data['y_val'],
}

learning_rate = 3e-4
weight_scale = 1e-1
model = FullyConnectedNet([100, 100, 100, 100],
                          weight_scale=weight_scale, dtype=np.float64)
solver = Solver(model, small_data,
                print_every=10, num_epochs=20, batch_size=25,
                update_rule='sgd',
                optim_config={
                    'learning_rate': learning_rate,
                }
                )
solver.train()
# print(model.params)
plt.plot(solver.loss_history, 'o')
plt.title('Training loss history')
plt.xlabel('Iteration')
plt.ylabel('Training loss')
plt.show()

Next comes the momentum update rule, in optim.py.
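
For reference, the update that the code below implements keeps a running velocity and steps along it:

$$v \leftarrow \mu\, v - \alpha\, dw, \qquad w \leftarrow w + v,$$

where $\mu$ is config['momentum'] and $\alpha$ is config['learning_rate'].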

def sgd_momentum(w, dw, config=None):
    """
    Performs stochastic gradient descent with momentum.

    config format:
    - learning_rate: Scalar learning rate.
    - momentum: Scalar between 0 and 1 giving the momentum value.
      Setting momentum = 0 reduces to sgd.
    - velocity: A numpy array of the same shape as w and dw used to store a
      moving average of the gradients.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))

    next_w = None
    ###########################################################################
    # TODO: Implement the momentum update formula. Store the updated value in #
    # the next_w variable. You should also use and update the velocity v.     #
    ###########################################################################
    # Update the velocity with the gradient, then step along the velocity.
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v
    ###########################################################################
    #                            END OF YOUR CODE                             #
    ###########################################################################
    config['velocity'] = v

    return next_w, config

Then try two more update rules, RMSprop and Adam.
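
For reference, the two rules that the code below implements are, with names matching the config dictionaries: RMSprop keeps a moving average of squared gradients,

$$\text{cache} \leftarrow \rho\,\text{cache} + (1-\rho)\,dx^2, \qquad x \leftarrow x - \alpha\,\frac{dx}{\sqrt{\text{cache}} + \epsilon},$$

and Adam additionally keeps a moving average of the gradient itself plus bias-corrected estimates:

$$m \leftarrow \beta_1 m + (1-\beta_1)\,dx,\quad v \leftarrow \beta_2 v + (1-\beta_2)\,dx^2,\quad \hat m = \frac{m}{1-\beta_1^t},\quad \hat v = \frac{v}{1-\beta_2^t},\quad x \leftarrow x - \alpha\,\frac{\hat m}{\sqrt{\hat v} + \epsilon}.$$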

def rmsprop(x, dx, config=None):
    """
    Uses the RMSProp update rule, which uses a moving average of squared
    gradient values to set adaptive per-parameter learning rates.

    config format:
    - learning_rate: Scalar learning rate.
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
      gradient cache.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - cache: Moving average of second moments of gradients.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(x))

    next_x = None
    ###########################################################################
    # TODO: Implement the RMSprop update formula, storing the next value of x #
    # in the next_x variable. Don't forget to update cache value stored in    #
    # config['cache'].                                                        #
    ###########################################################################
    # Decay the moving average of squared gradients, then take a scaled step.
    config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dx**2
    next_x = x - config['learning_rate'] * dx / (np.sqrt(config['cache']) + config['epsilon'])
    ###########################################################################
    #                            END OF YOUR CODE                             #
    ###########################################################################

    return next_x, config


def adam(x, dx, config=None):
    """
    Uses the Adam update rule, which incorporates moving averages of both the
    gradient and its square and a bias correction term.

    config format:
    - learning_rate: Scalar learning rate.
    - beta1: Decay rate for moving average of first moment of gradient.
    - beta2: Decay rate for moving average of second moment of gradient.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - m: Moving average of gradient.
    - v: Moving average of squared gradient.
    - t: Iteration number.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(x))
    config.setdefault('v', np.zeros_like(x))
    config.setdefault('t', 1)

    next_x = None
    ###########################################################################
    # TODO: Implement the Adam update formula, storing the next value of x in #
    # the next_x variable. Don't forget to update the m, v, and t variables   #
    # stored in config.                                                       #
    ###########################################################################
    # Advance the step counter, update both moment estimates, then apply the
    # bias-corrected update.
    config['t'] += 1
    config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx
    mt = config['m'] / (1 - config['beta1']**config['t'])
    config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * dx**2
    vt = config['v'] / (1 - config['beta2']**config['t'])
    next_x = x - config['learning_rate'] * mt / (np.sqrt(vt) + config['epsilon'])
    ###########################################################################
    #                            END OF YOUR CODE                             #
    ###########################################################################

    return next_x, config
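
These update rules plug into the Solver by name, since the Solver looks the update_rule string up in optim.py. For example (hyperparameters here are only illustrative; the notebook sweeps its own):

solver = Solver(model, data,
                num_epochs=5, batch_size=100,
                update_rule='adam',
                optim_config={'learning_rate': 1e-3},
                verbose=True)
solver.train()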