Fangzh's Personal Blog | AI Saves the World

cs231n Assignment: assignment2 - Convolutional Networks

2018-10-22T05:49:41.000Z

This is where the assignment finally enters CNN territory.

First, write the forward pass with the most basic loops:

def conv_forward_naive(x, w, b, conv_param):
    """
    A naive implementation of the forward pass for a convolutional layer.

    The input consists of N data points, each with C channels, height H and
    width W. We convolve each input with F different filters, where each filter
    spans all C channels and has height HH and width WW.

    Input:
    - x: Input data of shape (N, C, H, W)
    - w: Filter weights of shape (F, C, HH, WW)
    - b: Biases, of shape (F,)
    - conv_param: A dictionary with the following keys:
      - 'stride': The number of pixels between adjacent receptive fields in the
        horizontal and vertical directions.
      - 'pad': The number of pixels that will be used to zero-pad the input.

    Returns a tuple of:
    - out: Output data, of shape (N, F, H', W') where H' and W' are given by
      H' = 1 + (H + 2 * pad - HH) / stride
      W' = 1 + (W + 2 * pad - WW) / stride
    - cache: (x, w, b, conv_param)
    """
    out = None
    ###########################################################################
    # TODO: Implement the convolutional forward pass.                         #
    # Hint: you can use the function np.pad for padding.                      #
    ###########################################################################
    # N samples, C channels, height H, width W
    N, C, H, W = x.shape
    # F filters, C channels, kernel height HH, kernel width WW
    F, C, HH, WW = w.shape
    # Stride
    stride = conv_param['stride']
    # Number of zero-padding pixels on each side
    pad = conv_param['pad']

    # Spatial size of the output after the convolution
    new_H = 1 + (H + 2 * pad - HH) // stride
    new_W = 1 + (W + 2 * pad - WW) // stride
    out = np.zeros([N, F, new_H, new_W])

    # Convolve each of the N samples
    for n in range(N):
        for f in range(F):
            # Start from the bias of filter f
            conv_newH_new_W = np.ones([new_H, new_W]) * b[f]
            for c in range(C):
                # Zero-pad channel c of the input image
                padded_x = np.pad(x[n, c], pad_width=pad, mode='constant', constant_values=0)
                # Each output pixel is the filter multiplied elementwise with the
                # matching patch of the input, then summed
                for i in range(new_H):
                    for j in range(new_W):
                        conv_newH_new_W[i, j] += np.sum(padded_x[i * stride:i * stride + HH, j * stride:j * stride + WW] * w[f, c, :, :])
            # The contributions of all C channels are now accumulated, giving the
            # output for one image and one filter
            out[n, f] = conv_newH_new_W

    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    cache = (x, w, b, conv_param)
    return out, cache
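
The output-size formula in the docstring can be checked with a few lines of arithmetic. This is only a minimal sketch: `conv_output_size` is a hypothetical helper that is not part of the assignment code, and it assumes the filter must tile the padded input evenly, which is stricter than the truncating division used above.

```python
def conv_output_size(in_size, filter_size, pad, stride):
    """Output size of a convolution along one spatial dimension."""
    span = in_size + 2 * pad - filter_size
    # Assumption: reject configurations where the filter does not fit evenly.
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not tile the padded input evenly")
    return 1 + span // stride

# A 32-pixel side with a 7-pixel filter, pad 3, stride 1 keeps its size.
same_size = conv_output_size(32, 7, pad=3, stride=1)
# A 4-pixel side with a 4-pixel filter, pad 1, stride 2 shrinks to 2.
small_size = conv_output_size(4, 4, pad=1, stride=2)
print(same_size, small_size)
```

The first case is the setting the three-layer network relies on later: with stride 1 and pad `(filter_size - 1) // 2`, an odd-sized filter preserves the spatial size.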

The backward pass is as follows:

def conv_backward_naive(dout, cache):
    """
    A naive implementation of the backward pass for a convolutional layer.

    Inputs:
    - dout: Upstream derivatives.
    - cache: A tuple of (x, w, b, conv_param) as in conv_forward_naive

    Returns a tuple of:
    - dx: Gradient with respect to x
    - dw: Gradient with respect to w
    - db: Gradient with respect to b
    """
    dx, dw, db = None, None, None
    ###########################################################################
    # TODO: Implement the convolutional backward pass.                        #
    ###########################################################################
    # Unpack the cached values and shapes
    x, w, b, conv_param = cache
    pad = conv_param['pad']
    stride = conv_param['stride']
    F, C, HH, WW = w.shape
    N, C, H, W = x.shape
    N, F, new_H, new_W = dout.shape

    # Mimic the convolution again, starting by zero-padding x.
    padded_x = np.pad(x,
                      ((0, 0), (0, 0), (pad, pad), (pad, pad)),
                      mode='constant',
                      constant_values=0)
    padded_dx = np.zeros_like(padded_x)  # Gradient of the padded x; strip the padding later to get dx
    dw = np.zeros_like(w)
    db = np.zeros_like(b)
    
    for n in range(N):  # n-th image
        for f in range(F):  # f-th filter
            for i in range(new_H):
                for j in range(new_W):
                    # Every output pixel shares the same filter parameters, so dw and db
                    # sum the contributions of all output pixels
                    db[f] += dout[n, f, i, j]  # d(out)/d(b) is 1, so the term is 1 * dout
                    dw[f] += padded_x[n, :, i*stride : HH + i*stride, j*stride : WW + j*stride] * dout[n, f, i, j]
                    padded_dx[n, :, i*stride : HH + i*stride, j*stride : WW + j*stride] += w[f] * dout[n, f, i, j]
    # Strip the padding
    dx = padded_dx[:, :, pad:pad + H, pad:pad + W]

    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dx, dw, db
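
A numerical gradient check is a common way to test a backward pass like this one. The sketch below is self-contained and does not reuse the assignment's helpers: `conv_forward`, `conv_backward_dw_db` and `numeric_grad` are hypothetical names, and the forward pass is a compact re-implementation under the same shape conventions. It is not the code above.

```python
import numpy as np

def conv_forward(x, w, b, stride, pad):
    """Compact convolution: loop over output pixels, contract over (C, HH, WW)."""
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="constant")
    out_h = 1 + (H + 2 * pad - HH) // stride
    out_w = 1 + (W + 2 * pad - WW) // stride
    out = np.zeros((N, F, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = xp[:, :, i * stride:i * stride + HH, j * stride:j * stride + WW]
            # (N, C, HH, WW) contracted with (F, C, HH, WW) gives (N, F)
            out[:, :, i, j] = np.tensordot(patch, w, axes=([1, 2, 3], [1, 2, 3])) + b
    return out

def conv_backward_dw_db(dout, x, w, stride, pad):
    """Analytic gradients with respect to w and b."""
    _, _, HH, WW = w.shape
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="constant")
    dw = np.zeros_like(w)
    for i in range(dout.shape[2]):
        for j in range(dout.shape[3]):
            patch = xp[:, :, i * stride:i * stride + HH, j * stride:j * stride + WW]
            # (N, F) contracted with (N, C, HH, WW) over N gives (F, C, HH, WW)
            dw += np.tensordot(dout[:, :, i, j], patch, axes=([0], [0]))
    db = dout.sum(axis=(0, 2, 3))
    return dw, db

def numeric_grad(f, a, dout, h=1e-5):
    """Central-difference gradient of sum(f() * dout) with respect to array a."""
    grad = np.zeros_like(a)
    for idx in np.ndindex(*a.shape):
        old = a[idx]
        a[idx] = old + h
        pos = f()
        a[idx] = old - h
        neg = f()
        a[idx] = old  # restore the original value
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 5, 5))
w = rng.standard_normal((2, 3, 3, 3))
b = rng.standard_normal(2)
dout = rng.standard_normal((2, 2, 3, 3))  # stride 2, pad 1: 1 + (5 + 2 - 3) // 2 = 3

dw, db = conv_backward_dw_db(dout, x, w, stride=2, pad=1)
forward = lambda: conv_forward(x, w, b, stride=2, pad=1)
max_err_dw = np.max(np.abs(dw - numeric_grad(forward, w, dout)))
max_err_db = np.max(np.abs(db - numeric_grad(forward, b, dout)))
print(max_err_dw, max_err_db)
```

Because the convolution is linear in `w` and `b`, the central difference is exact up to floating-point rounding, so both errors should be tiny.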

Next is the max pool layer:

def max_pool_forward_naive(x, pool_param):
    """
    A naive implementation of the forward pass for a max pooling layer.

    Inputs:
    - x: Input data, of shape (N, C, H, W)
    - pool_param: dictionary with the following keys:
      - 'pool_height': The height of each pooling region
      - 'pool_width': The width of each pooling region
      - 'stride': The distance between adjacent pooling regions

    Returns a tuple of:
    - out: Output data
    - cache: (x, pool_param)
    """
    out = None
    ###########################################################################
    # TODO: Implement the max pooling forward pass                            #
    ###########################################################################
    N, C, H, W = x.shape
    pool_height = pool_param['pool_height']  # Height of the pooling window
    pool_width  = pool_param['pool_width']   # Width of the pooling window
    pool_stride = pool_param['stride']       # Stride
    new_H = 1 + int((H - pool_height) / pool_stride)    # Height of the pooled output
    new_W = 1 + int((W - pool_width) / pool_stride)     # Width of the pooled output
    out = np.zeros([N, C, new_H, new_W])
    for n in range(N):
        for c in range(C):
            for i in range(new_H):
                for j in range(new_W):
                    out[n,c,i,j] = np.max(x[n, c, i*pool_stride : i*pool_stride+pool_height, j*pool_stride : j*pool_stride+pool_width])


    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    cache = (x, pool_param)
    return out, cache


def max_pool_backward_naive(dout, cache):
    """
    A naive implementation of the backward pass for a max pooling layer.

    Inputs:
    - dout: Upstream derivatives
    - cache: A tuple of (x, pool_param) as in the forward pass.

    Returns:
    - dx: Gradient with respect to x
    """
    dx = None
    ###########################################################################
    # TODO: Implement the max pooling backward pass                           #
    ###########################################################################
    # This part is hard
    x, pool_param = cache
    N, C, H, W = x.shape
    pool_height = pool_param['pool_height']
    pool_width  = pool_param['pool_width']
    pool_stride = pool_param['stride']
    new_H = 1 + int((H - pool_height) / pool_stride)
    new_W = 1 + int((W - pool_width) / pool_stride)
    dx = np.zeros_like(x)
    for n in range(N):
        for c in range(C):
            for i in range(new_H):
                for j in range(new_W):
                    window = x[n, c, i * pool_stride: i * pool_stride + pool_height,j * pool_stride: j * pool_stride + pool_width]
                    # Accumulate with += so that overlapping windows (stride < pool size) all contribute
                    dx[n, c, i * pool_stride: i * pool_stride + pool_height, j * pool_stride: j * pool_stride + pool_width] += (window == np.max(window))*dout[n,c,i,j]

    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dx
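
When the stride is smaller than the pooling window, windows overlap and one input can be the maximum of several windows, so its gradient has to accumulate across them. The following is a minimal single-channel sketch of that case; `max_pool_forward` and `max_pool_backward` are hypothetical simplified helpers with a square window, not the assignment functions.

```python
import numpy as np

def max_pool_forward(x, size, stride):
    """Max pool one 2-D feature map with a square window."""
    out_h = 1 + (x.shape[0] - size) // stride
    out_w = 1 + (x.shape[1] - size) // stride
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + size, j * stride:j * stride + size].max()
    return out

def max_pool_backward(dout, x, size, stride):
    """Route each upstream gradient to the maximum of its window, accumulating."""
    dx = np.zeros_like(x)
    for i in range(dout.shape[0]):
        for j in range(dout.shape[1]):
            rows = slice(i * stride, i * stride + size)
            cols = slice(j * stride, j * stride + size)
            window = x[rows, cols]
            dx[rows, cols] += (window == window.max()) * dout[i, j]
    return dx

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 9.0, 5.0],
              [6.0, 7.0, 8.0]])
out = max_pool_forward(x, size=2, stride=1)   # four overlapping 2x2 windows
dx = max_pool_backward(np.ones((2, 2)), x, size=2, stride=1)
print(out)
print(dx)
```

Here the center value 9 is the maximum of all four windows, so it collects all four upstream gradients and `dx[1, 1]` is 4. With plain assignment instead of accumulation, later windows would overwrite earlier contributions.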

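The above only explores the most basic CNN and max pool structures. They are not used in practice, because more efficient versions exist.

Next, the efficient versions are used to define the sandwich layers: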

def conv_relu_forward(x, w, b, conv_param):
    """
    A convenience layer that performs a convolution followed by a ReLU.

    Inputs:
    - x: Input to the convolutional layer
    - w, b, conv_param: Weights and parameters for the convolutional layer

    Returns a tuple of:
    - out: Output from the ReLU
    - cache: Object to give to the backward pass
    """
    a, conv_cache = conv_forward_fast(x, w, b, conv_param)
    out, relu_cache = relu_forward(a)
    cache = (conv_cache, relu_cache)
    return out, cache


def conv_relu_backward(dout, cache):
    """
    Backward pass for the conv-relu convenience layer.
    """
    conv_cache, relu_cache = cache
    da = relu_backward(dout, relu_cache)
    dx, dw, db = conv_backward_fast(da, conv_cache)
    return dx, dw, db

The three-layer ConvNet is completed in cnn.py:


class ThreeLayerConvNet(object):
    """
    A three-layer convolutional network with the following architecture:

    conv - relu - 2x2 max pool - affine - relu - affine - softmax

    The network operates on minibatches of data that have shape (N, C, H, W)
    consisting of N images, each with height H and width W and with C input
    channels.
    """

    def __init__(self, input_dim=(3, 32, 32), num_filters=32, filter_size=7,
                 hidden_dim=100, num_classes=10, weight_scale=1e-3, reg=0.0,
                 dtype=np.float32):
        """
        Initialize a new network.

        Inputs:
        - input_dim: Tuple (C, H, W) giving size of input data
        - num_filters: Number of filters to use in the convolutional layer
        - filter_size: Size of filters to use in the convolutional layer
        - hidden_dim: Number of units to use in the fully-connected hidden layer
        - num_classes: Number of scores to produce from the final affine layer.
        - weight_scale: Scalar giving standard deviation for random initialization
          of weights.
        - reg: Scalar giving L2 regularization strength
        - dtype: numpy datatype to use for computation.
        """
        self.params = {}
        self.reg = reg
        self.dtype = dtype

        ############################################################################
        # TODO: Initialize weights and biases for the three-layer convolutional    #
        # network. Weights should be initialized from a Gaussian with standard     #
        # deviation equal to weight_scale; biases should be initialized to zero.   #
        # All weights and biases should be stored in the dictionary self.params.   #
        # Store weights and biases for the convolutional layer using the keys 'W1' #
        # and 'b1'; use keys 'W2' and 'b2' for the weights and biases of the       #
        # hidden affine layer, and keys 'W3' and 'b3' for the weights and biases   #
        # of the output affine layer.                                              #
        ############################################################################
        C, H, W = input_dim
        # W1: parameters of the first (conv) layer
        self.params['W1'] = weight_scale * np.random.randn(num_filters, C, filter_size, filter_size)
        self.params['b1'] = np.zeros(num_filters)
        # W2: max pool output -> hidden layer
        self.params['W2'] = weight_scale * np.random.randn((H // 2) * (W // 2) * num_filters, hidden_dim)
        self.params['b2'] = np.zeros(hidden_dim)
        # W3: hidden layer -> output
        self.params['W3'] = weight_scale * np.random.randn(hidden_dim, num_classes)
        self.params['b3'] = np.zeros(num_classes)
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)


    def loss(self, X, y=None):
        """
        Evaluate loss and gradient for the three-layer convolutional network.

        Input / output: Same API as TwoLayerNet in fc_net.py.
        """
        W1, b1 = self.params['W1'], self.params['b1']
        W2, b2 = self.params['W2'], self.params['b2']
        W3, b3 = self.params['W3'], self.params['b3']

        # pass conv_param to the forward pass for the convolutional layer
        filter_size = W1.shape[2]
        conv_param = {'stride': 1, 'pad': (filter_size - 1) // 2}

        # pass pool_param to the forward pass for the max-pooling layer
        pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}

        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the three-layer convolutional net,  #
        # computing the class scores for X and storing them in the scores          #
        # variable.                                                                #
        ############################################################################
        conv_forward_out_1, cache_forward_1 = conv_relu_pool_forward(X, W1, b1, conv_param, pool_param)
        affine_out_2, cache_forward_2 = affine_relu_forward(conv_forward_out_1, W2, b2)
        scores, cache_forward_3 = affine_forward(affine_out_2, W3, b3)

        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        if y is None:
            return scores

        loss, grads = 0, {}
        ############################################################################
        # TODO: Implement the backward pass for the three-layer convolutional net, #
        # storing the loss and gradients in the loss and grads variables. Compute  #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        ############################################################################
        loss, dscore = softmax_loss(scores, y)
        # da2: gradient with respect to affine_out_2
        da2, grads['W3'], grads['b3'] = affine_backward(dscore, cache_forward_3)
        # da1: gradient with respect to the output of the first conv-relu-pool layer
        da1, grads['W2'], grads['b2'] = affine_relu_backward(da2, cache_forward_2)
        _, grads['W1'], grads['b1'] = conv_relu_pool_backward(da1, cache_forward_1)

        loss += 0.5 * self.reg * (np.sum(W1 ** 2) + np.sum(W2 **2) + np.sum(W3 ** 2))

        grads['W1'] += self.reg * W1
        grads['W2'] += self.reg * W2
        grads['W3'] += self.reg * W3

        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads
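
The first dimension of `W2` follows from the layer shapes: the convolution uses stride 1 and pad `(filter_size - 1) // 2`, so it preserves H and W for odd filter sizes, and the 2x2 max pool with stride 2 halves them. A small sketch of that arithmetic, where `flattened_pool_size` is a hypothetical helper for illustration only:

```python
def flattened_pool_size(input_dim, num_filters, filter_size):
    """Number of features entering the hidden affine layer."""
    C, H, W = input_dim
    pad = (filter_size - 1) // 2
    # Convolution with stride 1
    conv_h = 1 + (H + 2 * pad - filter_size)
    conv_w = 1 + (W + 2 * pad - filter_size)
    # 2x2 max pool with stride 2
    pool_h = 1 + (conv_h - 2) // 2
    pool_w = 1 + (conv_w - 2) // 2
    return num_filters * pool_h * pool_w

# Default arguments of ThreeLayerConvNet: 32 filters of size 7 on 3x32x32 input.
fan_in = flattened_pool_size((3, 32, 32), num_filters=32, filter_size=7)
print(fan_in)  # 32 * 16 * 16
```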

cs231n Assignment: assignment2 - Batch Normalization and Dropout

2018-10-22T05:07:45.000Z

Batch Normalization

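Batch normalization amounts to a normalization preprocessing step placed before the activation function of each layer of the network.

First, write batchnorm_forward: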

def batchnorm_forward(x, gamma, beta, bn_param):
    """
    Forward pass for batch normalization.

    During training the sample mean and (uncorrected) sample variance are
    computed from minibatch statistics and used to normalize the incoming data.
    During training we also keep an exponentially decaying running mean of the
    mean and variance of each feature, and these averages are used to normalize
    data at test-time.

    At each timestep we update the running averages for mean and variance using
    an exponential decay based on the momentum parameter:

    running_mean = momentum * running_mean + (1 - momentum) * sample_mean
    running_var = momentum * running_var + (1 - momentum) * sample_var

    Note that the batch normalization paper suggests a different test-time
    behavior: they compute sample mean and variance for each feature using a
    large number of training images rather than using a running average. For
    this implementation we have chosen to use running averages instead since
    they do not require an additional estimation step; the torch7
    implementation of batch normalization also uses running averages.

    Input:
    - x: Data of shape (N, D)
    - gamma: Scale parameter of shape (D,)
    - beta: Shift parameter of shape (D,)
    - bn_param: Dictionary with the following keys:
      - mode: 'train' or 'test'; required
      - eps: Constant for numeric stability
      - momentum: Constant for running mean / variance.
      - running_mean: Array of shape (D,) giving running mean of features
      - running_var: Array of shape (D,) giving running variance of features

    Returns a tuple of:
    - out: of shape (N, D)
    - cache: A tuple of values needed in the backward pass
    """
    mode = bn_param['mode']
    eps = bn_param.get('eps', 1e-5)
    momentum = bn_param.get('momentum', 0.9)

    N, D = x.shape
    running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype))
    running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype))

    out, cache = None, None
    if mode == 'train':
        #######################################################################
        # TODO: Implement the training-time forward pass for batch norm.      #
        # Use minibatch statistics to compute the mean and variance, use      #
        # these statistics to normalize the incoming data, and scale and      #
        # shift the normalized data using gamma and beta.                     #
        #                                                                     #
        # You should store the output in the variable out. Any intermediates  #
        # that you need for the backward pass should be stored in the cache   #
        # variable.                                                           #
        #                                                                     #
        # You should also use your computed sample mean and variance together #
        # with the momentum variable to update the running mean and running   #
        # variance, storing your result in the running_mean and running_var   #
        # variables.                                                          #
        #######################################################################
        
        sample_mean = np.mean(x, axis=0)  # Mean of each column (feature)
        sample_var = np.var(x, axis=0)    # Variance of each column (feature)
        x_hat = (x - sample_mean) / (np.sqrt(sample_var + eps))  # Normalized input
        out = gamma * x_hat + beta   # Rescale to the learned scale and shift
        cache = (gamma, x, sample_mean, sample_var, eps, x_hat)
        # Update the exponentially weighted averages of the mean and variance at every step
        running_mean = momentum * running_mean + (1 - momentum) * sample_mean
        running_var = momentum * running_var + (1 - momentum) * sample_var
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################
    elif mode == 'test':
        #######################################################################
        # TODO: Implement the test-time forward pass for batch normalization. #
        # Use the running mean and variance to normalize the incoming data,   #
        # then scale and shift the normalized data using gamma and beta.      #
        # Store the result in the out variable.                               #
        #######################################################################
        scale = gamma / np.sqrt(running_var + eps)
        out = x * scale + (beta - running_mean * scale)
        #######################################################################
        #                          END OF YOUR CODE                           #
        #######################################################################
    else:
        raise ValueError('Invalid forward batchnorm mode "%s"' % mode)

    # Store the updated running means back into bn_param
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var

    return out, cache
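
A quick way to see what the training-mode branch does: after normalization, each feature of the output has mean close to `beta` and standard deviation close to `gamma`. The sketch below repeats the same arithmetic on made-up data; the input scale, `gamma` and `beta` values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.standard_normal((200, 4))  # features far from zero mean, unit variance
eps = 1e-5
gamma = np.array([1.0, 2.0, 3.0, 4.0])
beta = np.array([0.0, 1.0, 2.0, 3.0])

mean = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mean) / np.sqrt(var + eps)
out = gamma * x_hat + beta

out_mean = out.mean(axis=0)  # close to beta
out_std = out.std(axis=0)    # close to gamma (slightly smaller because of eps)
print(out_mean, out_std)
```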

The backward pass is hard; see the figure for the formulas:

def batchnorm_backward(dout, cache):
    """
    Backward pass for batch normalization.

    For this implementation, you should write out a computation graph for
    batch normalization on paper and propagate gradients backward through
    intermediate nodes.

    Inputs:
    - dout: Upstream derivatives, of shape (N, D)
    - cache: Variable of intermediates from batchnorm_forward.

    Returns a tuple of:
    - dx: Gradient with respect to inputs x, of shape (N, D)
    - dgamma: Gradient with respect to scale parameter gamma, of shape (D,)
    - dbeta: Gradient with respect to shift parameter beta, of shape (D,)
    """
    dx, dgamma, dbeta = None, None, None
    ###########################################################################
    # TODO: Implement the backward pass for batch normalization. Store the    #
    # results in the dx, dgamma, and dbeta variables.                         #
    ###########################################################################
    gamma, x, mean, var, eps, x_hat = cache
    N = x.shape[0]
    dgamma = np.sum(dout * x_hat, axis=0)   # Formula line 5
    dbeta = np.sum(dout * 1.0, axis=0)      # Formula line 6
    dx_hat = dout * gamma                   # Formula line 1
    dx_hat_numerator = dx_hat / np.sqrt(var + eps)      # Line 3, first term (before negating and summing)
    dx_hat_denominator = np.sum(dx_hat * (x - mean), axis=0)    # Line 2, first half
    dx_1 = dx_hat_numerator                 # Line 4, first term
    dvar = -0.5 * ((var + eps) ** (-1.5)) * dx_hat_denominator  # Formula line 2
    # Note var is also a function of mean
    dmean = -1.0 * np.sum(dx_hat_numerator, axis=0) + \
              dvar * np.mean(-2.0 * (x - mean), axis=0)  # Formula line 3 (remaining part)
    dx_var = dvar * 2.0 / N * (x - mean)    # Line 4, second term
    dx_mean = dmean * 1.0 / N               # Line 4, third term
    # with shape (D,), no trouble with broadcast
    dx = dx_1 + dx_var + dx_mean            # Formula line 4

    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return dx, dgamma, dbeta
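
The numbered formulas that the comments refer to are not reproduced in the text. Assuming they are the six backward equations of the batch normalization paper, in the usual order, they read as follows, with $m$ the batch size (N in the code), $\mu$ and $\sigma^2$ the batch mean and variance, and $\ell$ the loss:

```latex
\begin{align}
\frac{\partial \ell}{\partial \hat{x}_i} &= \frac{\partial \ell}{\partial y_i} \cdot \gamma \\
\frac{\partial \ell}{\partial \sigma^2} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot (x_i - \mu) \cdot \left(-\tfrac{1}{2}\right) (\sigma^2 + \epsilon)^{-3/2} \\
\frac{\partial \ell}{\partial \mu} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}}
  + \frac{\partial \ell}{\partial \sigma^2} \cdot \frac{1}{m} \sum_{i=1}^{m} -2 (x_i - \mu) \\
\frac{\partial \ell}{\partial x_i} &= \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma^2 + \epsilon}}
  + \frac{\partial \ell}{\partial \sigma^2} \cdot \frac{2 (x_i - \mu)}{m}
  + \frac{\partial \ell}{\partial \mu} \cdot \frac{1}{m} \\
\frac{\partial \ell}{\partial \gamma} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i} \cdot \hat{x}_i \\
\frac{\partial \ell}{\partial \beta} &= \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}
\end{align}
```

Under that assumption, `dx_hat`, `dvar`, `dmean`, `dx`, `dgamma` and `dbeta` in the code correspond to lines 1 through 6 in order.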

An alternative backward pass:

def batchnorm_backward_alt(dout, cache):
    """
    Alternative backward pass for batch normalization.

    For this implementation you should work out the derivatives for the batch
    normalization backward pass on paper and simplify as much as possible. You
    should be able to derive a simple expression for the backward pass.

    Note: This implementation should expect to receive the same cache variable
    as batchnorm_backward, but might not use all of the values in the cache.

    Inputs / outputs: Same as batchnorm_backward
    """
    dx, dgamma, dbeta = None, None, None
    ###########################################################################
    # TODO: Implement the backward pass for batch normalization. Store the    #
    # results in the dx, dgamma, and dbeta variables.                         #
    #                                                                         #
    # After computing the gradient with respect to the centered inputs, you   #
    # should be able to compute gradients with respect to the inputs in a     #
    # single statement; our implementation fits on a single 80-character line.#
    ###########################################################################
    gamma, x, sample_mean, sample_var, eps, x_hat = cache
    N = x.shape[0]
    dx_hat = dout * gamma
    dvar = np.sum(dx_hat* (x - sample_mean) * -0.5 * np.power(sample_var + eps, -1.5), axis = 0)
    dmean = np.sum(dx_hat * -1 / np.sqrt(sample_var +eps), axis = 0) + dvar * np.mean(-2 * (x - sample_mean), axis =0)
    dx = 1 / np.sqrt(sample_var + eps) * dx_hat + dvar * 2.0 / N * (x-sample_mean) + 1.0 / N * dmean
    dgamma = np.sum(x_hat * dout, axis = 0)
    dbeta = np.sum(dout , axis = 0)
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return dx, dgamma, dbeta
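
The docstring asks for a gradient simplified down to a single expression. One common simplification, shown here as a sketch rather than as the code above, substitutes dvar and dmean back into dx so that only `dgamma` and `dbeta` remain. The function names and the reduced cache layout are assumptions made for this example, and the result is compared against a numerical gradient.

```python
import numpy as np

def batchnorm_forward_train(x, gamma, beta, eps=1e-5):
    """Training-mode forward pass with a reduced cache (assumed layout)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta, (gamma, x_hat, var, eps)

def batchnorm_backward_simplified(dout, cache):
    """dx written as one expression in terms of dgamma and dbeta."""
    gamma, x_hat, var, eps = cache
    N = dout.shape[0]
    dgamma = np.sum(dout * x_hat, axis=0)
    dbeta = np.sum(dout, axis=0)
    dx = gamma / (N * np.sqrt(var + eps)) * (N * dout - dbeta - x_hat * dgamma)
    return dx, dgamma, dbeta

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 3))
gamma = rng.standard_normal(3)
beta = rng.standard_normal(3)
dout = rng.standard_normal((8, 3))

out, cache = batchnorm_forward_train(x, gamma, beta)
dx, dgamma, dbeta = batchnorm_backward_simplified(dout, cache)

# Central-difference check of dx
h = 1e-5
dx_num = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    old = x[idx]
    x[idx] = old + h
    pos, _ = batchnorm_forward_train(x, gamma, beta)
    x[idx] = old - h
    neg, _ = batchnorm_forward_train(x, gamma, beta)
    x[idx] = old  # restore the original value
    dx_num[idx] = np.sum((pos - neg) * dout) / (2 * h)

max_err = np.max(np.abs(dx - dx_num))
print(max_err)
```

The mean term drops out because the centered inputs sum to zero, which is why the final expression needs no separate dmean.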

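Then fill in use_batchnorm in the earlier FullyConnectedNet. That was already written before, so it is not repeated here.

Dropout

Define a mask: draw uniform random numbers in [0, 1), turn them into booleans by comparing against a threshold, then multiply the input by the mask so that some neurons are dropped and the rest stay active.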

def dropout_forward(x, dropout_param):
    """
    Performs the forward pass for (inverted) dropout.

    Inputs:
    - x: Input data, of any shape
    - dropout_param: A dictionary with the following keys:
      - p: Dropout parameter. We drop each neuron output with probability p.
      - mode: 'test' or 'train'. If the mode is train, then perform dropout;
        if the mode is test, then just return the input.
      - seed: Seed for the random number generator. Passing seed makes this
        function deterministic, which is needed for gradient checking but not
        in real networks.

    Outputs:
    - out: Array of the same shape as x.
    - cache: tuple (dropout_param, mask). In training mode, mask is the dropout
      mask that was used to multiply the input; in test mode, mask is None.
    """
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])

    mask = None
    out = None

    if mode == 'train':
        #######################################################################
        # TODO: Implement training phase forward pass for inverted dropout.   #
        # Store the dropout mask in the mask variable.                        #
        #######################################################################
        mask = np.random.rand(*x.shape) >= p
        mask = mask / (1 - p)
        out = mask * x
        #######################################################################
        #                           END OF YOUR CODE                          #
        #######################################################################
    elif mode == 'test':
        #######################################################################
        # TODO: Implement the test phase forward pass for inverted dropout.   #
        #######################################################################
        out = x
        #######################################################################
        #                            END OF YOUR CODE                         #
        #######################################################################

    cache = (dropout_param, mask)
    out = out.astype(x.dtype, copy=False)

    return out, cache


def dropout_backward(dout, cache):
    """
    Perform the backward pass for (inverted) dropout.

    Inputs:
    - dout: Upstream derivatives, of any shape
    - cache: (dropout_param, mask) from dropout_forward.
    """
    dropout_param, mask = cache
    mode = dropout_param['mode']

    dx = None
    if mode == 'train':
        #######################################################################
        # TODO: Implement training phase backward pass for inverted dropout   #
        #######################################################################
        dx = dout * mask
        #######################################################################
        #                          END OF YOUR CODE                           #
        #######################################################################
    elif mode == 'test':
        dx = dout
    return dx
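
The division by `1 - p` is what makes this *inverted* dropout: the surviving activations are scaled up during training so the expected output equals the input, and the test-time pass can return `x` unchanged. A minimal sketch of that property, using an arbitrary drop probability and an all-ones input chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3                      # probability of dropping a unit, as in the docstring
x = np.ones((1000, 1000))

mask = (rng.random(x.shape) >= p) / (1 - p)  # kept units are scaled by 1 / (1 - p)
out = mask * x

kept_fraction = np.mean(mask > 0)  # close to 1 - p
mean_activation = out.mean()       # close to the input mean of 1.0
print(kept_fraction, mean_activation)
```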

Andrew Ng's Coursera (DeepLearning.ai) Notes and Assignments: Summary Post

2018-10-18T12:01:05.000Z

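A summary of the notes and assignments for Andrew Ng's Coursera (DeepLearning.ai) courses.

After a little over a month, I have finally finished all five of Ng's courses, along with the assignments and notes. Here is the summary: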

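Course 1: Neural Networks and Deep Learning

Covers the basic concepts of neural networks, gradient descent from machine learning, and vectorization, then moves on to implementing shallow and deep neural networks.

The first two weeks are too easy; Ng already covered all of it in his earlier machine learning course, so they are skipped here.
Week 3: implementing a shallow neural network
- Notes: Shallow Neural Networks
- Assignment: Shallow Neural Networks
Week 4: implementing a deep neural network
- Notes: Deep Neural Networks
- Assignment: Deep Neural Networks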

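Course 2: Improving Neural Networks

Introduces methods for improving neural networks, such as regularization, hyperparameter tuning, and optimization algorithms.

Week 1: splitting the data set, regularization, dropout
- Notes: Practical Aspects of Deep Learning
- Assignment: Practical Aspects of Deep Learning
Week 2: Mini-batch, Momentum, RMS, Adam, learning rate decay
- Notes: Optimization Algorithms
- Assignment: Optimization Algorithms
Week 3: hyperparameter tuning, BatchNorm, softmax
- Notes: Hyperparameter Tuning
- Assignment: Hyperparameter Tuning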

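Course 3: Structuring Machine Learning Projects

Covers some strategies for machine learning projects.

Week 1: ML strategy, orthogonalization, optimizing metrics, splitting the data set, bias
- Notes: Machine Learning Strategy (1)
Week 2: error analysis, mismatched data distributions, transfer learning, multi-task learning, end-to-end learning
- Notes: Machine Learning Strategy (2)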

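Course 4: Convolutional Neural Networks

Covers a very important application of neural networks to images: convolutional neural networks.

Week 1: padding, stride, pooling, convolution
- Notes: Convolutional Neural Networks
- Assignment: Convolutional Neural Networks
Week 2: some important network architectures, such as VGG, ResNet, and Inception
- Notes: Deep Convolutional Models: Case Studies
- Assignment: Deep Convolutional Models: Case Studies
Week 3: object detection, Bounding Box, IOU, NMS
- Notes: Object Detection
- Assignment: Object Detection
Week 4: face recognition and neural style transfer
- Notes: Face Recognition and Neural Style Transfer
- Assignment: Face Recognition and Neural Style Transfer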

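Course 5: Sequence Models

Covers applications of neural networks to language, using RNN models.

Week 1: the basic RNN, GRU, and LSTM
- Notes: Recurrent Neural Networks
- Assignment: building an RNN, character-level dinosaur name generation, jazz generation with an LSTM
Week 2: natural language processing and word embeddings
- Notes: Natural Language Processing and Word Embeddings
- Assignment: word vector operations and emoji
Week 3: sequence models and the attention mechanism
- Notes: Sequence Models and Attention Mechanism
- Assignment: machine translation and trigger word detection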

DeepLearning.ai Assignment: (5-3) -- Sequence Models and Attention Mechanism

2018-10-18T10:39:15.000Z

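This week's assignment has two parts:

Machine translation
Trigger word detection

Part 1: Machine Translation

You will build a neural machine translation (NMT) model that converts human-readable dates ("June 25, 2009") into machine-readable dates ("2009-06-25"). You will do this with an attention mechanism, one of the most sophisticated sequence-to-sequence models.

The model you will build could be used to translate from one language to another, such as from English to Hindi. However, language translation requires massive datasets and usually takes days of GPU training. To let you experiment with these models without massive datasets, we use the simpler "date translation" task instead.

The network takes dates written in a variety of possible formats (e.g. "August 29, 1958", "03/30/1968", "24 June 1987") as input and converts them into standardized, machine-readable dates (e.g. "1958-08-29", "1968-03-30", "1987-06-24"), learning to output dates in the common machine-readable format YYYY-MM-DD.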

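X: the processed human-readable dates of the training set, where each character is replaced by the index it maps to in human_vocab. Each date is further padded with a special character to length Tx. X.shape = (m, Tx)
Y: the processed machine-readable dates of the training set, where each character is replaced by the index it maps to in machine_vocab. You should have Y.shape = (m, Ty).
Xoh: the one-hot version of X, Xoh.shape = (m, Tx, len(human_vocab))
Yoh: the one-hot version of Y, Yoh.shape = (m, Ty, len(machine_vocab))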
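
The relation between the index arrays and their one-hot versions can be sketched with NumPy alone. The sizes below (m, Tx, Ty and both vocabulary sizes) are made-up values for illustration, the index arrays are random rather than real dates, and `to_one_hot` is a hypothetical helper. Since Yoh encodes Y, its second dimension is Ty.

```python
import numpy as np

def to_one_hot(indices, vocab_size):
    """Map an integer array of shape (...) to one-hot vectors of shape (..., vocab_size)."""
    return np.eye(vocab_size)[indices]

# Assumed sizes, chosen only for this example
m, Tx, Ty = 4, 30, 10
human_vocab_size, machine_vocab_size = 37, 11

rng = np.random.default_rng(0)
X = rng.integers(0, human_vocab_size, size=(m, Tx))    # character indices, padded to Tx
Y = rng.integers(0, machine_vocab_size, size=(m, Ty))  # character indices of YYYY-MM-DD

Xoh = to_one_hot(X, human_vocab_size)
Yoh = to_one_hot(Y, machine_vocab_size)
print(Xoh.shape, Yoh.shape)
```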

Machine translation with an attention mechanism

Define some layers:

# Defined shared layers as global variables
repeator = RepeatVector(Tx)
concatenator = Concatenate(axis=-1)
densor1 = Dense(10, activation = "tanh")
densor2 = Dense(1, activation = "relu")
activator = Activation(softmax, name='attention_weights') # We are using a custom softmax(axis = 1) loaded in this notebook
dotor = Dot(axes = 1)

Then compute the context from a and s:

# GRADED FUNCTION: one_step_attention

def one_step_attention(a, s_prev):
    """
    Performs one step of attention: Outputs a context vector computed as a dot product of the attention weights
    "alphas" and the hidden states "a" of the Bi-LSTM.
    
    Arguments:
    a -- hidden state output of the Bi-LSTM, numpy-array of shape (m, Tx, 2*n_a)
    s_prev -- previous hidden state of the (post-attention) LSTM, numpy-array of shape (m, n_s)
    
    Returns:
    context -- context vector, input of the next (post-attention) LSTM cell
    """
    
    ### START CODE HERE ###
    # Use repeator to repeat s_prev to be of shape (m, Tx, n_s) so that you can concatenate it with all hidden states "a" (≈ 1 line)
    s_prev = repeator(s_prev)
    # Use concatenator to concatenate a and s_prev on the last axis (≈ 1 line)
    concat = concatenator([a, s_prev])
    # Use densor1 to propagate concat through a small fully-connected neural network to compute the "intermediate energies" variable e. (≈1 lines)
    e = densor1(concat)
    # Use densor2 to propagate e through a small fully-connected neural network to compute the "energies" variable energies. (≈1 lines)
    energies = densor2(e)
    # Use "activator" on "energies" to compute the attention weights "alphas" (≈ 1 line)
    alphas = activator(energies)
    # Use dotor together with "alphas" and "a" to compute the context vector to be given to the next (post-attention) LSTM-cell (≈ 1 line)
    context = dotor([alphas, a])
    ### END CODE HERE ###
    
    return context
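
To make the tensor shapes in one attention step concrete, the same sequence of operations can be traced in plain NumPy. This is only an illustrative sketch: the weights are random stand-ins for the two Dense layers, all sizes are made up, and `softmax` is a local helper rather than the custom Keras softmax loaded in the notebook.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
m, Tx, n_a, n_s = 2, 5, 3, 4                       # assumed small sizes
a = rng.standard_normal((m, Tx, 2 * n_a))          # Bi-LSTM hidden states
s_prev = rng.standard_normal((m, n_s))             # previous decoder state

# Random stand-ins for densor1 (10 units, tanh) and densor2 (1 unit, relu)
W1, b1 = rng.standard_normal((2 * n_a + n_s, 10)), np.zeros(10)
W2, b2 = rng.standard_normal((10, 1)), np.zeros(1)

s_rep = np.repeat(s_prev[:, None, :], Tx, axis=1)  # (m, Tx, n_s), like RepeatVector(Tx)
concat = np.concatenate([a, s_rep], axis=-1)       # (m, Tx, 2*n_a + n_s)
e = np.tanh(concat @ W1 + b1)                      # (m, Tx, 10)
energies = np.maximum(0.0, e @ W2 + b2)            # (m, Tx, 1)
alphas = softmax(energies, axis=1)                 # weights over the Tx input steps
context = np.einsum("mtk,mtd->mkd", alphas, a)     # (m, 1, 2*n_a), like Dot(axes=1)
print(context.shape)
```

The softmax runs over axis 1, the Tx axis, so for each example the attention weights over the input steps sum to 1 and the context is a weighted average of the Bi-LSTM states.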

Implement model():

n_a = 32
n_s = 64
post_activation_LSTM_cell = LSTM(n_s, return_state = True)
output_layer = Dense(len(machine_vocab), activation=softmax)

# GRADED FUNCTION: model

def model(Tx, Ty, n_a, n_s, human_vocab_size, machine_vocab_size):
    """
    Arguments:
    Tx -- length of the input sequence
    Ty -- length of the output sequence
    n_a -- hidden state size of the Bi-LSTM
    n_s -- hidden state size of the post-attention LSTM
    human_vocab_size -- size of the python dictionary "human_vocab"
    machine_vocab_size -- size of the python dictionary "machine_vocab"

    Returns:
    model -- Keras model instance
    """
    
    # Define the inputs of your model with a shape (Tx, human_vocab_size)
    # Define s0 and c0, initial hidden state for the decoder LSTM of shape (n_s,)
    X = Input(shape=(Tx, human_vocab_size))
    s0 = Input(shape=(n_s,), name='s0')
    c0 = Input(shape=(n_s,), name='c0')
    s = s0
    c = c0
    
    # Initialize empty list of outputs
    outputs = []
    
    ### START CODE HERE ###
    
    # Step 1: Define your pre-attention Bi-LSTM. Remember to use return_sequences=True. (≈ 1 line)
    a = Bidirectional(LSTM(n_a, return_sequences=True))(X)
    
    # Step 2: Iterate for Ty steps
    for t in range(Ty):
    
        # Step 2.A: Perform one step of the attention mechanism to get back the context vector at step t (≈ 1 line)
        context = one_step_attention(a ,s)
        
        # Step 2.B: Apply the post-attention LSTM cell to the "context" vector.
        # Don't forget to pass: initial_state = [hidden state, cell state] (≈ 1 line)
        s, _, c = post_activation_LSTM_cell(context, initial_state=[s, c])
        
        # Step 2.C: Apply Dense layer to the hidden state output of the post-attention LSTM (≈ 1 line)
        out = output_layer(s)
        
        # Step 2.D: Append "out" to the "outputs" list (≈ 1 line)
        outputs.append(out)
    
    # Step 3: Create model instance taking three inputs and returning the list of outputs. (≈ 1 line)
    model = Model(inputs=[X,s0,c0], outputs=outputs)
    
    ### END CODE HERE ###
    
    return model

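Part 2: Trigger Word Detection

This part detects a trigger keyword.

X: each audio clip is 10 s long, and those 10 s are divided into 5511 small time steps, so Tx = 5511.

Y: Ty = 1375. Each y is a Boolean value (0 or 1) that records whether the trigger keyword has just been heard.

Generating a training example

The audio samples fall into three kinds: background audio, positive clips (the trigger word), and negative clips (other words). A training example is synthesized as follows:

Randomly pick a 10-second background audio clip
Randomly insert 0-4 positive audio clips into this 10-second clip
Randomly insert 0-2 negative audio clips into this 10-second clip

After synthesis the result looks something like this:

Define a function that picks a random start and end position for an inserted clip: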

def get_random_time_segment(segment_ms):
    """
    Gets a random time segment of duration segment_ms in a 10,000 ms audio clip.
    
    Arguments:
    segment_ms -- the duration of the audio clip in ms ("ms" stands for "milliseconds")
    
    Returns:
    segment_time -- a tuple of (segment_start, segment_end) in ms
    """
    
    segment_start = np.random.randint(low=0, high=10000-segment_ms)   # Make sure segment doesn't run past the 10sec background 
    segment_end = segment_start + segment_ms - 1
    
    return (segment_start, segment_end)

Next, check whether the chosen time slot is already occupied by a previously inserted clip:

# GRADED FUNCTION: is_overlapping

def is_overlapping(segment_time, previous_segments):
    """
    Checks if the time of a segment overlaps with the times of existing segments.
    
    Arguments:
    segment_time -- a tuple of (segment_start, segment_end) for the new segment
    previous_segments -- a list of tuples of (segment_start, segment_end) for the existing segments
    
    Returns:
    True if the time segment overlaps with any of the existing segments, False otherwise
    """
    
    segment_start, segment_end = segment_time
    
    ### START CODE HERE ### (≈ 4 line)
    # Step 1: Initialize overlap as a "False" flag. (≈ 1 line)
    overlap = False
    
    # Step 2: loop over the previous_segments start and end times.
    # Compare start/end times and set the flag to True if there is an overlap (≈ 3 lines)
    for previous_start, previous_end in previous_segments:
        if segment_start <= previous_end and segment_end >= previous_start:
            overlap = True
    ### END CODE HERE ###

    return overlap
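
As a quick sanity check of the overlap condition, here is a compact variant together with two test cases. The helper name `overlaps_any` is hypothetical; it applies the same closed-interval comparison as the loop above, only written with `any()`.

```python
def overlaps_any(segment_time, previous_segments):
    """Return True if segment_time shares at least one ms with any earlier segment."""
    start, end = segment_time
    return any(
        start <= prev_end and end >= prev_start
        for prev_start, prev_end in previous_segments
    )

overlap1 = overlaps_any((950, 1430), [(2000, 2550), (260, 949)])
overlap2 = overlaps_any((2305, 2950), [(824, 1532), (1900, 2305), (3424, 3656)])
print(overlap1, overlap2)  # prints: False True
```

The second case overlaps only at the single millisecond 2305, which shows why the comparison uses `<=` and `>=` instead of strict inequalities.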

Insert a clip into the background to build the input audio:

# GRADED FUNCTION: insert_audio_clip

def insert_audio_clip(background, audio_clip, previous_segments):
    """
    Insert a new audio segment over the background noise at a random time step, ensuring that the 
    audio segment does not overlap with existing segments.
    
    Arguments:
    background -- a 10 second background audio recording.  
    audio_clip -- the audio clip to be inserted/overlaid. 
    previous_segments -- times where audio segments have already been placed
    
    Returns:
    new_background -- the updated background audio
    segment_time -- the (segment_start, segment_end) tuple, in ms, where the clip was placed
    """
    
    # Get the duration of the audio clip in ms
    segment_ms = len(audio_clip)
    
    ### START CODE HERE ### 
    # Step 1: Use one of the helper functions to pick a random time segment onto which to insert 
    # the new audio clip. (≈ 1 line)
    segment_time = get_random_time_segment(segment_ms)
    
    # Step 2: Check if the new segment_time overlaps with one of the previous_segments. If so, keep 
    # picking new segment_time at random until it doesn't overlap. (≈ 2 lines)
    while is_overlapping(segment_time,previous_segments):
        segment_time = get_random_time_segment(segment_ms)

    # Step 3: Add the new segment_time to the list of previous_segments (≈ 1 line)
    previous_segments.append(segment_time)
    ### END CODE HERE ###
    
    # Step 4: Superpose audio segment and background
    new_background = background.overlay(audio_clip, position = segment_time[0])
    
    return new_background, segment_time

Generating the y labels:

# GRADED FUNCTION: insert_ones

def insert_ones(y, segment_end_ms):
    """
    Update the label vector y. The labels of the 50 output steps strictly after the end of the segment 
    should be set to 1. By strictly we mean that the label of segment_end_y should be 0 while, the
    50 following labels should be ones.
    
    
    Arguments:
    y -- numpy array of shape (1, Ty), the labels of the training example
    segment_end_ms -- the end time of the segment in ms
    
    Returns:
    y -- updated labels
    """
    
    # duration of the background (in terms of spectrogram time-steps)
    segment_end_y = int(segment_end_ms * Ty / 10000.0)
    
    # Add 1 to the correct index in the background label (y)
    ### START CODE HERE ### (≈ 3 lines)
    for i in range(segment_end_y+1, segment_end_y+51):
        if i < Ty:
            y[0, i] = 1
    ### END CODE HERE ###
    
    return y
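
The index arithmetic in `insert_ones` can be checked by hand. The sketch below only computes which label indices would be set, assuming Ty = 1375 and a 10,000 ms clip as stated earlier; `ones_range` is a hypothetical helper and not part of the assignment.

```python
TY = 1375        # number of output time steps (Ty)
CLIP_MS = 10000  # length of the background clip in ms

def ones_range(segment_end_ms, ty=TY, n_ones=50):
    """Return (start, stop) so that y[0, start:stop] would be set to 1."""
    segment_end_y = int(segment_end_ms * ty / CLIP_MS)
    start = segment_end_y + 1                # strictly after the segment end
    stop = min(start + n_ones, ty)           # clip at the end of the label vector
    return start, stop

middle = ones_range(5000)   # clip ends in the middle of the recording
late = ones_range(9700)     # clip ends close to the 10 s mark
print(middle, late)
```

A clip that ends at 5000 ms maps to output step 687, so steps 688 to 737 are labeled 1. A clip that ends at 9700 ms maps to step 1333, and only the 41 steps up to index 1374 fit before the label vector ends.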

# GRADED FUNCTION: create_training_example

def create_training_example(background, activates, negatives):
    """
    Creates a training example with a given background, activates, and negatives.
    
    Arguments:
    background -- a 10 second background audio recording
    activates -- a list of audio segments of the word "activate"
    negatives -- a list of audio segments of random words that are not "activate"
    
    Returns:
    x -- the spectrogram of the training example
    y -- the label at each time step of the spectrogram
    """
    
    # Set the random seed
    np.random.seed(18)
    
    # Make background quieter
    background = background - 20

    ### START CODE HERE ###
    # Step 1: Initialize y (label vector) of zeros (≈ 1 line)
    y = np.zeros((1, Ty))

    # Step 2: Initialize segment times as empty list (≈ 1 line)
    previous_segments = []
    ### END CODE HERE ###
    
    # Select 0-4 random "activate" audio clips from the entire list of "activates" recordings
    number_of_activates = np.random.randint(0, 5)
    random_indices = np.random.randint(len(activates), size=number_of_activates)
    random_activates = [activates[i] for i in random_indices]
    
    ### START CODE HERE ### (≈ 3 lines)
    # Step 3: Loop over randomly selected "activate" clips and insert in background
    for random_activate in random_activates:
        # Insert the audio clip on the background
        background, segment_time = insert_audio_clip(background, random_activate, previous_segments)
        # Retrieve segment_start and segment_end from segment_time
        segment_start, segment_end = segment_time
        # Insert labels in "y"
        y = insert_ones(y, segment_end)
    ### END CODE HERE ###

    # Select 0-2 random negatives audio recordings from the entire list of "negatives" recordings
    number_of_negatives = np.random.randint(0, 3)
    random_indices = np.random.randint(len(negatives), size=number_of_negatives)
    random_negatives = [negatives[i] for i in random_indices]

    ### START CODE HERE ### (≈ 2 lines)
    # Step 4: Loop over randomly selected negative clips and insert in background
    for random_negative in random_negatives:
        # Insert the audio clip on the background 
        background, _ = insert_audio_clip(background, random_negative, previous_segments)
    ### END CODE HERE ###
    
    # Standardize the volume of the audio clip 
    background = match_target_amplitude(background, -20.0)

    # Export new training example 
    file_handle = background.export("train" + ".wav", format="wav")
    print("File (train.wav) was saved in your directory.")
    
    # Get and plot spectrogram of the new recording (background with superposition of positive and negatives)
    x = graph_spectrogram("train.wav")
    
    return x, y

Implementing model()

# GRADED FUNCTION: model

def model(input_shape):
    """
    Function creating the model's graph in Keras.
    
    Argument:
    input_shape -- shape of the model's input data (using Keras conventions)

    Returns:
    model -- Keras model instance
    """
    
    X_input = Input(shape = input_shape)
    
    ### START CODE HERE ###
    
    # Step 1: CONV layer (≈4 lines)
    X = Conv1D(filters=196,kernel_size=15,strides=4)(X_input)                                 # CONV1D
    X = BatchNormalization()(X)                                 # Batch normalization
    X = Activation('relu')(X)                                 # ReLu activation
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)

    # Step 2: First GRU Layer (≈4 lines)
    X = GRU(units = 128, return_sequences = True)(X)                                 # GRU (use 128 units and return the sequences)
    X = Dropout(0.8)(X)                                  # dropout (use 0.8)
    X = BatchNormalization()(X)                                 # Batch normalization
    
    # Step 3: Second GRU Layer (≈4 lines)
    X = GRU(units = 128, return_sequences = True)(X)                                 # GRU (use 128 units and return the sequences)
    X = Dropout(0.8)(X)                                  # dropout (use 0.8)
    X = BatchNormalization()(X)                                 # Batch normalization
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)
    
    # Step 4: Time-distributed dense layer (≈1 line)
    X = TimeDistributed(Dense(1, activation = "sigmoid"))(X) # time distributed  (sigmoid)

    ### END CODE HERE ###

    model = Model(inputs = X_input, outputs = X)
    
    return model
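
The Conv1D layer is what reduces the 5511 input steps to the 1375 output steps of the labels. A small sketch of the arithmetic, assuming the layer uses no padding ("valid" convolution, which is the Keras default); `conv1d_output_length` is an illustrative helper:

```python
def conv1d_output_length(input_length, kernel_size, stride):
    """Output length of a 1D convolution without padding."""
    return (input_length - kernel_size) // stride + 1

ty = conv1d_output_length(input_length=5511, kernel_size=15, stride=4)
print(ty)  # prints: 1375
```

This is why the kernel size 15 and stride 4 are fixed in the assignment: (5511 - 15) / 4 + 1 = 1375, which matches Ty.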

Here we load a pretrained model, so there is no need to train for a long time ourselves:

model = load_model('./models/tr_model.h5')
opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])
model.fit(X, Y, batch_size = 5, epochs=1)

DeepLearning.ai Notes: (5-3) -- Sequence Models and Attention Mechanism

2018-10-18T10:39:10.000Z

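Basic models

Sequence-to-sequence models:

The most common application of sequence-to-sequence models is machine translation. Suppose we want to translate French into English.

For a sequence-to-sequence translation model, a large corpus of sentence pairs is enough to train an effective translator. The first half of the model is an encoder network that encodes the input French sentence; the second half is a decoder network that generates the corresponding English translation. The network structure is shown in the figure below:

The same idea works with an image as input and a sentence describing the image as output:

Picking the most likely sentence

Machine translation: a conditional language model

Machine translation is very similar to the language models covered in the previous sections, but there are differences.

A language model estimates the probability of a sentence, which lets us generate new sentences. It always starts from a zero vector, that is, the input at its first time step can simply be the zero vector.

A machine translation model consists of an encoder network and a decoder network, and the decoder has a structure similar to a language model. The translation model takes the sequence of vectors for the words of the input sentence as its input, so compared with a language model it can be called a conditional language model: the probability it assigns to an output sentence is conditional on the input sentence.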

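Beam search

The beam search algorithm:

We again use French-to-English machine translation as the example:

Step 1: Feed the French sentence into the encoder network to obtain its encoding, then use a softmax layer to compute the probability of every word in the vocabulary being the first output word. With a beam width of, say, 3, keep the 3 words with the highest probability.
Step 2: For each of the words kept in step 1, compute the probability of every vocabulary word following it, and multiply by the step-1 probability to get the probability of the two-word pair. With a 10,000-word vocabulary this gives 3×10000 candidates, from which we again keep the 3 most probable pairs (the beam width).

Step 3 to Step T: Same as step 2, repeated until the end-of-sentence token is produced.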
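
The steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the course implementation: the decoder is replaced by a hypothetical lookup table `TABLE`, `toy_next_log_probs` stands in for the softmax output of the decoder, and log-probabilities are summed instead of multiplying probabilities.

```python
import math

def beam_search(next_log_probs, beam_width, max_len, eos="<eos>"):
    """Keep the beam_width best prefixes at every step; return (tokens, log_prob)."""
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            if prefix[-1] == eos:
                finished.append((prefix, score))   # sentence is complete
            else:
                beams.append((prefix, score))      # keep expanding
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# Hypothetical next-word probabilities; any other prefix ends the sentence.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "y": 0.1},
}

def toy_next_log_probs(prefix):
    probs = TABLE.get(prefix, {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

greedy = beam_search(toy_next_log_probs, beam_width=1, max_len=5)
wide = beam_search(toy_next_log_probs, beam_width=2, max_len=5)
print(greedy[0], math.exp(greedy[1]))
print(wide[0], math.exp(wide[1]))
```

With a beam width of 1 (greedy search) the first word "a" is locked in and the best reachable sentence has probability 0.3. With a beam width of 2 the prefix "b" survives the first step and leads to "b z" with probability 0.36.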

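Improving beam search

The beam search above has a problem: every factor in the product is a probability smaller than 1, so the longer the sentence, the smaller its probability. The search therefore tends to prefer short sentences, which is undesirable.

First, to keep the product from becoming so small that it underflows numerically, take logarithms, which turns the product into a sum.

Then multiply the sum by a normalizing coefficient:

$$\frac{1}{T_{y}^{\alpha}}$$

With $\alpha = 1$ the score is the log-probability averaged over the sentence length; with $\alpha = 0$ there is no normalization at all. In practice $\alpha = 0.7$ is a common choice.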
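
A short sketch of the length-normalized score, with made-up per-word probabilities chosen only to show the effect; `normalized_score` is an illustrative name:

```python
import math

def normalized_score(word_probs, alpha=0.7):
    """Sum of log-probabilities divided by (sentence length ** alpha)."""
    log_sum = sum(math.log(p) for p in word_probs)
    return log_sum / (len(word_probs) ** alpha)

short_sentence = [0.5, 0.5]            # 2 words, each fairly uncertain
long_sentence = [0.7, 0.7, 0.7, 0.7]   # 4 words, each more confident

for alpha in (0.0, 0.7, 1.0):
    print(alpha,
          normalized_score(short_sentence, alpha),
          normalized_score(long_sentence, alpha))
```

Without normalization (alpha = 0) the short sentence scores higher simply because it has fewer factors. With alpha = 0.7 the longer sentence, whose individual words are more probable, wins.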

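Discussion of beam search:

Beam width: the larger B is, the more candidates are considered, but the more computation is required. Production systems commonly use B = 10; larger values (such as 100 or 1000) should be chosen according to the application domain and scenario.

Compared with exact search algorithms such as BFS or DFS, beam search runs much faster, but it is not guaranteed to find the exact maximum.

Error analysis for beam search

Beam search is an approximate search algorithm, also called a heuristic search algorithm; it is not an exact search.

What should we do when a system built on beam search makes a mistake? How do we tell whether the search algorithm or the model is at fault? This is where error analysis for beam search comes in.

The system has two parts:

RNN part: encoder network + decoder network
Beam search part: keeps the few highest-scoring candidates

Error analysis

Compute the probability of the human translation, P(y∗|x), and the probability of the model's translation, P(ŷ|x):

P(y∗|x) > P(ŷ|x): beam search chose ŷ even though y∗ has the higher probability, so beam search is at fault;
P(y∗|x) <= P(ŷ|x): y∗ is a better translation than ŷ, yet the RNN model assigns y∗ a probability no higher than that of ŷ, so the RNN model is at fault.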
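
The two cases translate directly into a small decision rule that can be tallied over a set of mistranslated examples. The sketch below uses invented probabilities purely for illustration; `attribute_error` is a hypothetical helper.

```python
from collections import Counter

def attribute_error(p_human, p_model_output):
    """Blame beam search if the model itself prefers the human translation."""
    if p_human > p_model_output:
        return "beam search"
    return "RNN model"

# (P(y*|x), P(y_hat|x)) for a few mistranslated sentences; values are made up.
examples = [(2e-10, 1e-10), (1e-12, 5e-11), (3e-9, 3e-9), (4e-8, 1e-9)]
tally = Counter(attribute_error(p_star, p_hat) for p_star, p_hat in examples)
print(tally)
```

Whichever component collects the larger share of the blame is the one worth improving first.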

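Bleu score (optional)

Skipped.

Intuition for the attention model

Our earlier translation model consisted of an encoder network and a decoder network: it memorizes the whole sentence first and then translates. This works well for short sentences, but translation quality drops for very long ones.

Think of how a human translates a long sentence: part by part, taking into account how the surrounding context affects each part. The attention mechanism does the same. It translates piece by piece, and at each step assigns different attention weights to the current part and its context, while also taking into account what has already been translated, until the whole sentence is done.

Attention model

A bidirectional RNN reads the French sentence, and the model produces the corresponding English sentence. Each RNN unit is an LSTM or GRU unit.

Forward and backward propagation through the bidirectional RNN yield a forward activation and a backward activation at every time step; we use a single symbol to denote the two combined.

We then obtain an attention weight for every input word:

The formula is:

Here $e^{<t,t'>}$ is computed by a small one-layer neural network, and its value depends on the activation of the output RNN at the previous step, $s^{<t-1>}$, and the activation of the input RNN at step $t'$, $a^{<t'>}$. This small network can be trained with backpropagation to learn the mapping.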
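
For reference, the attention weights and the context vector are usually written as follows. The figure with the original formula is not reproduced in this text, so this is the standard softmax form from the course, stated here as an assumption; $T_x$ is the input length, $\alpha^{<t,t'>}$ is the weight that output step $t$ puts on input step $t'$, and $c^{<t>}$ is the resulting context vector.

$$\alpha^{<t,t'>} = \frac{\exp(e^{<t,t'>})}{\sum_{t'=1}^{T_x} \exp(e^{<t,t'>})}, \qquad c^{<t>} = \sum_{t'=1}^{T_x} \alpha^{<t,t'>} a^{<t'>}$$

The softmax guarantees that the weights for each output step are non-negative and sum to 1.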

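Speech recognition

Speech recognition converts an audio clip into the corresponding text.

Earlier systems recognized speech through phonemes. End-to-end models no longer need phonemes, but they do need a lot of data: common speech datasets are 300 h or 3000 h in size, or even larger.

Speech recognition with an attention model

Speech recognition with the CTC loss

Another approach that works well is a speech recognition model trained with the CTC loss (CTC, Connectionist Temporal Classification).

The model has many input and output steps: a 10 s audio clip can yield 1000 input feature frames, while the desired output is often only a few words.

The CTC loss allows the RNN to output repeated characters and to insert blank symbols, which forces the output to have the same length as the input.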
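
The frame-level output is turned back into text by merging repeated characters and then dropping the blanks. A minimal sketch of that collapse rule, assuming `_` is the blank symbol (an arbitrary choice for this example); `ctc_collapse` is an illustrative name:

```python
from itertools import groupby

def ctc_collapse(frames, blank="_"):
    """Merge consecutive repeats, then remove blank symbols."""
    merged = [ch for ch, _ in groupby(frames)]
    return "".join(ch for ch in merged if ch != blank)

print(ctc_collapse("ttt_h_eee___ ___qqq__"))
print(ctc_collapse("hel_lo"))
```

The blank between the two `l` frames in the second call is what keeps the double letter in "hello"; without it the repeated frames would collapse into a single `l`.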

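Trigger word detection

Trigger word detection means waking a device with a spoken keyword.

A simple trigger word detection algorithm uses an RNN: convert the audio signal into a spectrogram (or other audio features) and feed the features to the RNN as input. For the output labels, every time step that does not follow a trigger word is labeled 0, and the time step right after the trigger word has been said is labeled 1.

The drawback of this labeling is that the 0 and 1 labels are heavily imbalanced: there are far more 0s than 1s. A crude but simple fix is to label several consecutive time steps after the trigger word as 1, which improves the accuracy of the system to some extent.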

DeepLearning.ai Assignment: (5-2) -- Natural Language Processing and Word Embeddings

2018-10-18T09:00:21.000Z

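This week's assignment has two parts:

Operations on word vectors
Emojify (emoji prediction)

Part 1: Operations on word vectors

Training word embeddings is computationally expensive and time-consuming, so most machine learning practitioners load a pretrained embedding model.

In this assignment we use 50-dimensional GloVe vectors to represent words. Load the data:

words, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')

words: the set of words in the vocabulary
word_to_vec_map: a map from each word to its vector.

One-hot vectors are poor at expressing similarity (the inner product of any two distinct one-hot vectors is 0). GloVe vectors carry much more information about each word. Let's see how to compute similarity with GloVe vectors.

$$\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{||u||_2 \, ||v||_2} = \cos(\theta)$$

The numerator is the inner product of the two vectors and the denominator is the product of their norms. $\theta$ is the angle between the vectors: the closer the vectors, the smaller the angle and the larger the cosine.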

# GRADED FUNCTION: cosine_similarity

def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v
        
    Arguments:
        u -- a word vector of shape (n,)          
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v defined by the formula above.
    """
    
    distance = 0.0
    
    ### START CODE HERE ###
    # Compute the dot product between u and v (≈1 line)
    dot = np.dot(u,v)
    # Compute the L2 norm of u (≈1 line)
    norm_u = np.sqrt(np.dot(u,u))
    
    # Compute the L2 norm of v (≈1 line)
    norm_v = np.sqrt(np.dot(v,v))
    # Compute the cosine similarity defined by formula (1) (≈1 line)
    cosine_similarity = dot / (norm_u * norm_v)
    ### END CODE HERE ###
    
    return cosine_similarity

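Word analogy

The analogy task is to complete "a is to b as c is to __", for example "man is to woman as king is to queen". We need to find the word d such that e_b − e_a ≈ e_d − e_c,
that is, the two difference vectors should be similar (again measured with cosine similarity).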

# GRADED FUNCTION: complete_analogy

def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """
    Performs the word analogy task as explained above: a is to b as c is to ____. 
    
    Arguments:
    word_a -- a word, string
    word_b -- a word, string
    word_c -- a word, string
    word_to_vec_map -- dictionary that maps words to their corresponding vectors. 
    
    Returns:
    best_word --  the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity
    """
    
    # convert words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    
    ### START CODE HERE ###
    # Get the word embeddings v_a, v_b and v_c (≈1-3 lines)
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]
    ### END CODE HERE ###
    
    words = word_to_vec_map.keys()
    max_cosine_sim = -100              # Initialize max_cosine_sim to a large negative number
    best_word = None                   # Initialize best_word with None, it will help keep track of the word to output

    # loop over the whole word vector set
    for w in words:        
        # to avoid best_word being one of the input words, pass on them.
        if w in [word_a, word_b, word_c] :
            continue
        
        ### START CODE HERE ###
        # Compute cosine similarity between the vector (e_b - e_a) and the vector ((w's vector representation) - e_c)  (≈1 line)
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)
        
        # If the cosine_sim is more than the max_cosine_sim seen so far,
            # then: set the new max_cosine_sim to the current cosine_sim and the best_word to the current word (≈3 lines)
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
        ### END CODE HERE ###
        
    return best_word
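
The same search can be tried on a tiny hand-made vocabulary without loading GloVe. The 2-dimensional vectors below are invented for illustration (first coordinate loosely "gender", second loosely "royalty"), and `toy_analogy` is a hypothetical stand-in for `complete_analogy`.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sub(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

toy_vectors = {
    "man": [1.0, 0.0], "woman": [-1.0, 0.0],
    "king": [1.0, 1.0], "queen": [-1.0, 1.0],
    "apple": [0.0, -1.0],
}

def toy_analogy(a, b, c, vectors):
    """Find d maximizing cos(e_b - e_a, e_d - e_c), skipping the input words."""
    target = sub(vectors[b], vectors[a])
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, sub(vectors[w], vectors[c])))

print(toy_analogy("man", "woman", "king", toy_vectors))
```

Here `queen - king` points in exactly the same direction as `woman - man`, so its cosine similarity is 1 and it beats the unrelated word.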

Debiasing word vectors (optional)

def neutralize(word, g, word_to_vec_map):
    """
    Removes the bias of "word" by projecting it on the space orthogonal to the bias axis. 
    This function ensures that gender neutral words are zero in the gender subspace.
    
    Arguments:
        word -- string indicating the word to debias
        g -- numpy-array of shape (50,), corresponding to the bias axis (such as gender)
        word_to_vec_map -- dictionary mapping words to their corresponding vectors.
    
    Returns:
        e_debiased -- neutralized word vector representation of the input "word"
    """
    
    ### START CODE HERE ###
    # Select word vector representation of "word". Use word_to_vec_map. (≈ 1 line)
    e = word_to_vec_map[word]
    
    # Compute e_biascomponent using the formula give above. (≈ 1 line)
    e_biascomponent = np.dot(e, g) / np.square(np.linalg.norm(g)) * g
 
    # Neutralize e by substracting e_biascomponent from it 
    # e_debiased should be equal to its orthogonal projection. (≈ 1 line)
    e_debiased = e - e_biascomponent
    ### END CODE HERE ###
    
    return e_debiased
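
A quick way to convince yourself that neutralizing works is to check that the result is orthogonal to the bias axis. The sketch below repeats the projection with plain lists and made-up 3-dimensional vectors; `neutralize_vector` is an illustrative helper, not the graded function.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def neutralize_vector(e, g):
    """Subtract from e its projection onto the bias axis g."""
    scale = dot(e, g) / dot(g, g)
    return [ei - scale * gi for ei, gi in zip(e, g)]

e = [0.5, -1.0, 2.0]   # some word vector
g = [1.0, 0.0, 1.0]    # bias axis
e_debiased = neutralize_vector(e, g)
print(e_debiased, dot(e_debiased, g))
```

The component along `g` is removed while the component orthogonal to it (the middle coordinate in this example) is left untouched.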

def equalize(pair, bias_axis, word_to_vec_map):
    """
    Debias gender specific words by following the equalize method described in the figure above.
    
    Arguments:
    pair -- pair of strings of gender specific words to debias, e.g. ("actress", "actor") 
    bias_axis -- numpy-array of shape (50,), vector corresponding to the bias axis, e.g. gender
    word_to_vec_map -- dictionary mapping words to their corresponding vectors
    
    Returns
    e_1 -- word vector corresponding to the first word
    e_2 -- word vector corresponding to the second word
    """
    
    ### START CODE HERE ###
    # Step 1: Select word vector representation of "word". Use word_to_vec_map. (≈ 2 lines)
    w1, w2 = pair
    e_w1, e_w2 = word_to_vec_map[w1], word_to_vec_map[w2]
    
    # Step 2: Compute the mean of e_w1 and e_w2 (≈ 1 line)
    mu = (e_w1 + e_w2) / 2

    # Step 3: Compute the projections of mu over the bias axis and the orthogonal axis (≈ 2 lines)
    mu_B = np.dot(mu, bias_axis) / np.square(np.linalg.norm(bias_axis)) * bias_axis
    mu_orth = mu - mu_B

    # Step 4: Use equations (7) and (8) to compute e_w1B and e_w2B (≈2 lines)
    e_w1B = np.dot(e_w1, bias_axis) / np.square(np.linalg.norm(bias_axis)) * bias_axis
    e_w2B = np.dot(e_w2, bias_axis) / np.square(np.linalg.norm(bias_axis)) * bias_axis
        
    # Step 5: Adjust the Bias part of e_w1B and e_w2B using the formulas (9) and (10) given above (≈2 lines)
    corrected_e_w1B = np.sqrt(np.abs(1-np.sum(mu_orth**2))) * (e_w1B - mu_B)/np.linalg.norm(e_w1-mu_orth-mu_B)
    corrected_e_w2B = np.sqrt(np.abs(1-np.sum(mu_orth**2))) * (e_w2B - mu_B)/np.linalg.norm(e_w2-mu_orth-mu_B)

    # Step 6: Debias by equalizing e1 and e2 to the sum of their corrected projections (≈2 lines)
    e1 = corrected_e_w1B + mu_orth
    e2 = corrected_e_w2B + mu_orth
                                                                
    ### END CODE HERE ###
    
    return e1, e2

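Part 2: Emojify!

Have you ever wanted to make your text messages more expressive? The emojifier app will help you do that. So rather than writing "Congratulations on the promotion! Lets get coffee and talk. Love you!" the emojifier can automatically turn this into "Congratulations on the promotion! 👍 Lets get coffee and talk. ☕️ Love you! ❤️"

Also, if you are not interested in emojis but have friends who send you crazy messages with too many of them, you can use the emojifier to reply to them.

You will implement a model which takes a sentence as input ("Let's go see the baseball game tonight!") and finds the most appropriate emoji for it (⚾️). In many emoji interfaces you need to remember that ❤️ is the "heart" symbol rather than the "love" symbol. With word vectors, however, you will see that even if your training set explicitly associates only a few words with a particular emoji, your algorithm can generalize and associate related words in the test set with the same emoji, even when those words never appear in the training set. This lets you build an accurate classifier mapping sentences to emojis even from a small training set.

In this exercise you will start with a baseline model using word embeddings (Emojifier-V1), then build a more sophisticated model that incorporates an LSTM (Emojifier-V2).

First, try averaging the word vectors: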

# GRADED FUNCTION: sentence_to_avg

def sentence_to_avg(sentence, word_to_vec_map):
    """
    Converts a sentence (string) into a list of words (strings). Extracts the GloVe representation of each word
    and averages its value into a single vector encoding the meaning of the sentence.
    
    Arguments:
    sentence -- string, one training example from X
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    
    Returns:
    avg -- average vector encoding information about the sentence, numpy-array of shape (50,)
    """
    
    ### START CODE HERE ###
    # Step 1: Split sentence into list of lower case words (≈ 1 line)
    words = sentence.lower().split()

    # Initialize the average word vector, should have the same shape as your word vectors.
    avg = np.zeros(word_to_vec_map[words[0]].shape)
    
    # Step 2: average the word vectors. You can loop over the words in the list "words".
    for w in words:
        avg += word_to_vec_map[w]
    avg = avg / len(words)
    
    ### END CODE HERE ###
    
    return avg

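Next, train the Emojifier-V1 model, a softmax classifier on the averaged vectors (the RNN version follows as Emojifier-V2):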

# GRADED FUNCTION: model

def model(X, Y, word_to_vec_map, learning_rate = 0.01, num_iterations = 400):
    """
    Model to train word vector representations in numpy.
    
    Arguments:
    X -- input data, numpy array of sentences as strings, of shape (m, 1)
    Y -- labels, numpy array of integers between 0 and 7, numpy-array of shape (m, 1)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    learning_rate -- learning_rate for the stochastic gradient descent algorithm
    num_iterations -- number of iterations
    
    Returns:
    pred -- vector of predictions, numpy-array of shape (m, 1)
    W -- weight matrix of the softmax layer, of shape (n_y, n_h)
    b -- bias of the softmax layer, of shape (n_y,)
    """
    
    np.random.seed(1)

    # Define number of training examples
    m = Y.shape[0]                          # number of training examples
    n_y = 5                                 # number of classes  
    n_h = 50                                # dimensions of the GloVe vectors 
    
    # Initialize parameters using Xavier initialization
    W = np.random.randn(n_y, n_h) / np.sqrt(n_h)
    b = np.zeros((n_y,))
    
    # Convert Y to Y_onehot with n_y classes
    Y_oh = convert_to_one_hot(Y, C = n_y) 
    
    # Optimization loop
    for t in range(num_iterations):                       # Loop over the number of iterations
        for i in range(m):                                # Loop over the training examples
            
            ### START CODE HERE ### (≈ 4 lines of code)
            # Average the word vectors of the words from the i'th training example
            avg = sentence_to_avg(X[i], word_to_vec_map)

            # Forward propagate the avg through the softmax layer
            z = np.dot(W, avg) + b
            a = softmax(z)

            # Compute cost using the i'th training label's one hot representation and "A" (the output of the softmax)
            cost = -np.sum(Y_oh[i] * np.log(a))
            ### END CODE HERE ###
            
            # Compute gradients 
            dz = a - Y_oh[i]
            dW = np.dot(dz.reshape(n_y,1), avg.reshape(1, n_h))
            db = dz

            # Update parameters with Stochastic Gradient Descent
            W = W - learning_rate * dW
            b = b - learning_rate * db
        
        if t % 100 == 0:
            print("Epoch: " + str(t) + " --- cost = " + str(cost))
            pred = predict(X, Y, W, b, word_to_vec_map)

    return pred, W, b

Emojifier-V2: Using LSTMs in Keras:

# GRADED FUNCTION: sentences_to_indices

def sentences_to_indices(X, word_to_index, max_len):
    """
    Converts an array of sentences (strings) into an array of indices corresponding to words in the sentences.
    The output shape should be such that it can be given to `Embedding()` (described in Figure 4). 
    
    Arguments:
    X -- array of sentences (strings), of shape (m, 1)
    word_to_index -- a dictionary containing the each word mapped to its index
    max_len -- maximum number of words in a sentence. You can assume every sentence in X is no longer than this. 
    
    Returns:
    X_indices -- array of indices corresponding to words in the sentences from X, of shape (m, max_len)
    """
    
    m = X.shape[0]                                   # number of training examples
    
    ### START CODE HERE ###
    # Initialize X_indices as a numpy matrix of zeros and the correct shape (≈ 1 line)
    X_indices = np.zeros((m, max_len))
    
    for i in range(m):                               # loop over training examples
        
        # Convert the ith training sentence in lower case and split is into words. You should get a list of words.
        sentence_words =X[i].lower().split()
        
        # Initialize j to 0
        j = 0
        
        # Loop over the words of sentence_words
        for w in sentence_words:
            # Set the (i,j)th entry of X_indices to the index of the correct word.
            X_indices[i, j] = word_to_index[w]
            # Increment j to j + 1
            j = j + 1
            
    ### END CODE HERE ###
    
    return X_indices
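
To see the zero-padding that `sentences_to_indices` produces, here is a small list-based sketch with an invented five-word vocabulary. `to_padded_indices` and `toy_index` are hypothetical names, and like the function above the sketch assumes no sentence is longer than `max_len`.

```python
def to_padded_indices(sentences, word_to_index, max_len):
    """Map each sentence to word indices and right-pad with zeros to max_len."""
    rows = []
    for sentence in sentences:
        indices = [word_to_index[w] for w in sentence.lower().split()]
        rows.append(indices + [0] * (max_len - len(indices)))
    return rows

toy_index = {"i": 1, "love": 2, "you": 3, "funny": 4, "lol": 5}
padded = to_padded_indices(["I love you", "funny lol"], toy_index, max_len=4)
print(padded)  # prints: [[1, 2, 3, 0], [4, 5, 0, 0]]
```

Index 0 is reserved for padding here, which is consistent with the `+ 1` added to the vocabulary size when the embedding layer is built.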

# GRADED FUNCTION: pretrained_embedding_layer

def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.
    
    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    embedding_layer -- pretrained layer Keras instance
    """
    
    vocab_len = len(word_to_index) + 1                  # adding 1 to fit Keras embedding (requirement)
    emb_dim = word_to_vec_map["cucumber"].shape[0]      # define dimensionality of your GloVe word vectors (= 50)
    
    ### START CODE HERE ###
    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, dimensions of word vectors = emb_dim)
    emb_matrix = np.zeros((vocab_len, emb_dim))
    
    # Set each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]

    # Define Keras embedding layer with the correct output/input sizes and make it non-trainable. Use Embedding(...) with trainable=False.
    embedding_layer = Embedding(vocab_len,emb_dim, trainable=False)
    ### END CODE HERE ###

    # Build the embedding layer, it is required before setting the weights of the embedding layer. Do not modify the "None".
    embedding_layer.build((None,))
    
    # Set the weights of the embedding layer to the embedding matrix. Your layer is now pretrained.
    embedding_layer.set_weights([emb_matrix])
    
    return embedding_layer

Building the Emojifier-V2

# GRADED FUNCTION: Emojify_V2

def Emojify_V2(input_shape, word_to_vec_map, word_to_index):
    """
    Function creating the Emojify-v2 model's graph.
    
    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    model -- a model instance in Keras
    """
    
    ### START CODE HERE ###
    # Define sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(shape= input_shape, dtype='int32')
    
    # Create the embedding layer pretrained with GloVe Vectors (≈1 line)
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    
    # Propagate sentence_indices through your embedding layer, you get back the embeddings
    embeddings = embedding_layer(sentence_indices)   
    
    # Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a batch of sequences.
    X = LSTM(128, return_sequences=True)(embeddings)
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    # Propagate X trough another LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a single hidden state, not a batch of sequences.
    X = LSTM(128, return_sequences=False)(X)
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    # Propagate X through a Dense layer to get back a batch of 5-dimensional vectors (softmax is applied in the next step).
    X = Dense(5)(X)
    # Add a softmax activation
    X = Activation('softmax')(X)
    
    # Create Model instance which converts sentence_indices into X.
    model = Model(inputs=sentence_indices ,outputs=X)
    
    ### END CODE HERE ###
    
    return model

DeepLearning.ai Notes: (5-2) -- Natural Language Processing and Word Embeddings

2018-10-18T09:00:17.000Z

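This week covers NLP and word embeddings.

Word representation

So far we have represented vocabulary directly as English words, but a computer cannot recognize words as such. To let a computer understand language better and to build better language models, we need a numerical representation for words. Here are several ways to represent words:

One-hot representation:

In an earlier lesson we already used one-hot vectors to represent the words in the model's dictionary: the position corresponding to the word is 1 and every other position is 0, as shown in the figure below:

Drawback of the one-hot representation: it treats every word in isolation, so the model generalizes poorly across related words. The distance between any two word vectors is the same and every inner product is 0, so similarity and relatedness between words cannot be captured.

Featurized representation: word embeddings

Represent each word by a set of features; each word takes its own value on each feature, as in the example below:

Similar words then end up clustered together: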

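Word embeddings

Word embeddings give every word a featurized representation. How do we apply this representation in natural language processing?

Take the figure below as an example: given a sentence, identify the person names. From training, the model knows that "orange farmer" refers to a person, so the subject Sally Johnson must be a person name and the corresponding outputs are 1.

What if we replace orange with apple? The word embeddings tell us that the two words are similar, and the word is again followed by farmer, so the model can also conclude that Robert Lin is a person name.

Let's go one step further and replace apple farmer with the much rarer durian cultivator. The training set may not contain the word durian at all, and cultivator is also uncommon. What then? We can use transfer learning.

Learn word embeddings from a large text corpus (typically 1 billion to 100 billion words), or download pretrained embeddings
Transfer the learned embeddings to a task with a relatively small training set (for example, 100,000 words). This is where embeddings show their advantage over one-hot vectors: with one-hot, each word is a 1×100000 vector, whereas with embeddings of, say, 300 features, a 1×300 vector is enough.
(Optional) Fine-tune the embeddings on the new data.

There is an interesting connection between word embeddings and face encoding. In face recognition, a face image is encoded into a vector that represents that face, and the encodings are then compared during recognition. Word embeddings are similar to face encodings in this respect.

The difference is that in face recognition we can feed any previously unseen face image into the network and get an encoding for it, whereas a word embedding model learns encodings, and the relationships between them, only for the words of a fixed vocabulary.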

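Properties of word embeddings

We can obtain analogies such as man is to woman as King is to Queen.

With word embeddings we can compute distances between words and use them for analogy reasoning.

Word similarity can be computed with cosine similarity.

The squared Euclidean distance can also be used:

$$||u - v||^2$$

Embedding matrix

As shown in the figure below, the embedding matrix is on the left: each column is the feature vector of one word, and each row holds the values of all words on one feature. Denote this matrix by $E$ and suppose its dimensions are (300, 10000).

In the original one-hot scheme each word is a 10000-dimensional vector; in the embedding matrix each word becomes a 300-dimensional vector.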
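
The link between the two representations is that multiplying $E$ by the one-hot vector of word $j$ simply selects column $j$ of $E$. A tiny sketch with a made-up 3×4 matrix instead of (300, 10000); `matvec` and `one_hot` are illustrative helpers:

```python
def matvec(matrix, vector):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def one_hot(index, size):
    """One-hot vector of the given size with a 1 at position index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# 3 features x 4 words; the values are arbitrary.
E = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
]
j = 2
e_j = matvec(E, one_hot(j, 4))      # E times the one-hot vector of word j
column_j = [row[j] for row in E]    # direct column lookup
print(e_j, column_j)
```

In practice the lookup is done by indexing instead of an actual matrix product, since almost all terms of the product are multiplications by zero.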

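Learning word embeddings

The figure below shows how a word is predicted, given a sentence with one word missing:

"I want a glass of orange ___"

The feature vectors of the known words are fed into a neural network, and after a series of computations they reach a softmax classification layer, which in this example has 10000 output nodes. juice gets the highest probability, so the prediction is

"I want a glass of orange juice"

In this training setup, all the preceding words are used to predict the last word, and backpropagation then updates the embedding matrix E.

Let W be the word to predict and E the embedding matrix. Note that training the embedding matrix and predicting W are two different tasks.

If the task is to predict W, the best context is the n words preceding W.

If the task is to train E, then besides all the words before W we can also use: the 4 words on each side, just the one previous word, or a single random word from the nearby context (this last option is the idea behind the skip-gram algorithm). All of these give good results.

Word2Vec

"word2vec" refers to turning a word into a vector. This is usually done with a shallow neural network such as CBOW or skip-gram, and the process can also be viewed as building the embedding matrix E.

Skip-grams

The figure below shows skip-grams in detail. Suppose the context word is orange; the target (the word to predict) is obtained through a window setting. For example, if the window is set to the immediately following word, the target is juice; other window settings yield other targets.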
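
As a rough sketch of how the window produces training pairs, the function below lists every (context, target) pair within a given window. Note the simplification: the skip-gram procedure described above samples one random target per context, whereas this sketch enumerates all of them so that the output is deterministic. `skipgram_pairs` is an illustrative name.

```python
def skipgram_pairs(tokens, window):
    """List all (context, target) pairs whose distance is at most window."""
    pairs = []
    for i, context in enumerate(tokens):
        first = max(0, i - window)
        last = min(len(tokens), i + window + 1)
        for j in range(first, last):
            if j != i:
                pairs.append((context, tokens[j]))
    return pairs

tokens = "I want a glass of orange juice".split()
pairs = skipgram_pairs(tokens, window=1)
print([p for p in pairs if p[0] == "orange"])
```

With a window of 1 the context word orange is paired with its two neighbors, of and juice; a larger window would add glass and a as further possible targets.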

注意这个过程是用来构建词嵌表的，而不是为了真正的去预测，所以如果预测效果不好并不用担心。

上面在使用Softmax的时候有一个很明显的问题，那就是计算量过于繁琐，所以为了解决计算量大的问题，提出了如下图所示的方法，即Hierachical Softmax(分层的Softmax)

简单的来说就是通过使用二叉树的形式来减少运算量。

例如一些常见的单词，如the、of等就可以在很浅的层次得到，而像durian这种少用的单词则在较深的层次得到。

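Negative sampling

Another question for the skip-gram model is how to sample the context word c and the target word t. Sampling according to the natural distribution of the corpus would pick very frequent words such as “the, of, a, it, I, …” over and over, and training on those repeatedly is of little value.

How can a chosen word such as orange be trained effectively? The training set can be built with negative sampling: in the table below, the first row is a positive example (label 1) obtained with the same window method as above, and the remaining rows pair the context with words drawn at random from the dictionary and are labeled negative (0). Training the skip-gram model on examples built this way is much more efficient.

The number of negative samples k depends on the size of the dataset; it is 4 in the example above. In practice k = 2 ~ 5 for large datasets, and somewhat larger, k = 5 ~ 20, for small ones.

With negative sampling, training changes from a softmax that predicts a probability for every word in the vocabulary into classic binary logistic regression problems. The probability is computed with a sigmoid instead of a softmax, which speeds things up considerably.

Empirical formula for the probability of picking a word: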
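
The heuristic usually quoted for word2vec negative sampling raises the word frequencies to the 3/4 power, $P(w_i) = \frac{f(w_i)^{3/4}}{\sum_j f(w_j)^{3/4}}$, which sits between sampling by raw frequency and sampling uniformly. Assuming that is the formula meant here, a minimal sketch with invented word counts and a hypothetical helper `make_examples`:

```python
import numpy as np

# Hypothetical word counts; real counts would come from the corpus.
counts = {"the": 5000, "of": 3000, "orange": 50, "juice": 40, "king": 30, "durian": 2}
words = list(counts)
freq = np.array([counts[w] for w in words], dtype=float)
freq /= freq.sum()

# P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4)
p_neg = freq ** 0.75
p_neg /= p_neg.sum()

def make_examples(context, target, k, rng):
    """One positive (context, target, 1) pair plus k negatives drawn from p_neg."""
    examples = [(context, target, 1)]
    candidates = [w for w in words if w != target]  # never relabel the true target as negative
    p = np.array([p_neg[words.index(w)] for w in candidates])
    p /= p.sum()
    for w in rng.choice(candidates, size=k, p=p):
        examples.append((context, str(w), 0))
    return examples

rng = np.random.default_rng(0)
examples = make_examples("orange", "juice", k=4, rng=rng)
for row in examples:
    print(row)
```

Compared with raw frequencies, the 3/4 power takes some probability away from words like the and gives a little more to rare words like durian.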

GloVe word vectors

GloVe (Global Vectors for Word Representation) is not as popular as Word2Vec, but it has its own advantage: simplicity.

I won't cover it here; I don't really understand it yet.

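Sentiment classification

Sentiment classification means deciding from a piece of text whether the writer likes or dislikes what is being discussed. It is one of the most important building blocks in NLP.

The model in the figure below first maps each word of the review to its feature vector using an embedding matrix (usually trained on a very large corpus, e.g. 100 billion words), then sums or averages all the word vectors, feeds the result to a softmax classifier, and finally outputs the star rating.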
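
A minimal sketch of this averaging model, with random weights standing in for pretrained embeddings and a trained classifier (the vocabulary, sizes and the helper name `predict_stars` are assumptions for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
vocab = ["the", "dessert", "is", "excellent", "good", "lacking", "service"]
word_to_ix = {w: i for i, w in enumerate(vocab)}

n_features, n_classes = 8, 5                       # toy sizes; 5 classes = 1 to 5 stars
E = rng.standard_normal((n_features, len(vocab)))  # random stand-in for pretrained embeddings
W = rng.standard_normal((n_classes, n_features))   # untrained classifier weights
b = np.zeros(n_classes)

def predict_stars(review):
    """Average the word embeddings of the review, then apply a softmax classifier."""
    ids = [word_to_ix[w] for w in review.lower().split() if w in word_to_ix]
    avg = E[:, ids].mean(axis=1)
    return softmax(W @ avg + b)

p1 = predict_stars("the dessert is excellent")
p2 = predict_stars("excellent is the dessert")
print(np.allclose(p1, p2))
```

Shuffling the words leaves the output unchanged, because averaging throws away word order.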

This model has a problem, though. Reviews that contain words like “good” or “excellent” usually come with high star ratings, but the model is helpless on a review like this one:

“Completely lacking in good taste, good service, and good ambience.”

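The model has this weakness because it ignores word order, so we can build the model with an RNN to fix the problem.

Using an RNN has another benefit. Suppose a review in the test set reads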

“Completely absent of good taste, good service, and good ambience.”

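This review only replaces lacking in with absent of. Even if absent never appeared in the training set, the embedding matrix was trained on a huge corpus and does contain it, so the algorithm still knows that absent is similar to lacking and the output remains correct.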

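Debiasing word embeddings

Machine learning is now used in many areas, such as bank loan decisions and résumé screening. Because machines learn from people, they pick up the bad along with the good, including bias and discrimination.

As shown in the figure below:

Given Man : Computer programmer, the algorithm answers Woman : Homemaker, which is clearly biased.

Likewise, for Man : Doctor the algorithm suggests Woman : Nurse, which is also clearly discriminatory and biased.

The examples above are all about gender, but word embeddings can also reflect bias by age, sexual orientation, race and so on.

People already get this wrong, so the algorithms should be adjusted to reduce the bias.

Steps for removing bias:

Identify the bias direction, e.g. gender:
- Take the differences of many gender-opposite word pairs and average them: $e_{he}-e_{she}$, $e_{male}-e_{female}$, ⋯;
- The averaged vector gives one or more dimensions related to the bias, and a large number of unrelated dimensions;
Neutralize: for every word that is not gendered by definition, such as doctor or babysitter, remove the bias by shrinking its component along the bias direction;
Equalize: adjust symmetric pairs such as grandmother and grandfather so that both are equally far from words like babysitter, which leaves such words in a neutral position and removes the bias.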
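
The first two steps can be sketched in a few lines: average the pair differences to get a bias direction g, then subtract from a word vector its projection onto g. The 3-dimensional vectors are invented for illustration, and projecting the component all the way to zero is one concrete choice of how much to shrink it.

```python
import numpy as np

def neutralize(e, g):
    """Remove from e its component along the bias direction g."""
    e_bias = (e @ g) / (g @ g) * g   # projection of e onto g
    return e - e_bias

# Toy embeddings (assumed values, not learned ones).
e_he, e_she = np.array([1.0, 0.2, 0.1]), np.array([-1.0, 0.2, 0.1])
e_male, e_female = np.array([0.9, 0.0, 0.3]), np.array([-0.9, 0.1, 0.3])
e_doctor = np.array([0.3, 0.8, 0.1])

# Step 1: bias direction as the average of the pair differences.
g = np.mean([e_he - e_she, e_male - e_female], axis=0)

# Step 2: neutralize a word that is not gendered by definition.
e_doctor_debiased = neutralize(e_doctor, g)

print(e_doctor @ g, e_doctor_debiased @ g)
```

After neutralizing, doctor has no component left along g, while its components orthogonal to g are unchanged.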

DeepLearning.ai assignment: (5-1) -- Recurrent Neural Networks (3)

2018-10-18T08:20:36.000Z

The third assignment uses an LSTM to generate jazz.

Part3:Improvise a Jazz Solo with an LSTM Network

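The music data has already been preprocessed and is represented as "values". Informally, each "value" can be thought of as a note, with a pitch and a duration. For example, if you press a particular piano key for 0.5 seconds, you have just played one note. In music theory a "value" is actually more complicated than that; in particular, it also captures the information needed to play several notes at once, as when two piano keys are pressed together (playing several notes at the same time produces a so-called chord). We don't need to care about the music theory details here, though. All you need to know for this assignment is that we get a dataset of "values" and will train an RNN model to generate sequences of "values".

Our music generation system uses 78 unique values.

X: an array of shape (m, Tx, 78). m is the number of examples and Tx the number of time steps (the sequence length). At each time step the input is one of 78 possible values, represented as a one-hot vector, so X[i, t, :] is the one-hot vector for the value of example i at time t.
Y: essentially the same as X, but shifted one step to the left (into the past). As in the dinosaur assignment, previous values are used to predict the next one, so the sequence model tries to predict y⟨t⟩ given x⟨1⟩, …, x⟨t⟩. The data in Y is reordered to shape (Ty, m, 78), where Ty = Tx, which makes it more convenient to feed to the LSTM later.
n_values: the number of distinct "values" in the dataset, 78 here
indices_values: Python dictionary whose keys are 0-77 and whose values are the corresponding musical values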
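
To get a feel for the two layouts, here is a sketch that builds X and Y with the shapes described above from random value indices. The real arrays come from the assignment's data loader; the construction below is only my assumption of how the shift and the reordering relate.

```python
import numpy as np

m, Tx, n_values = 60, 30, 78
rng = np.random.default_rng(0)

# Random stand-in for real music: m sequences of Tx + 1 value indices.
seq = rng.integers(0, n_values, size=(m, Tx + 1))
one_hot = np.eye(n_values)[seq]            # (m, Tx + 1, 78)

X = one_hot[:, :-1, :]                     # (m, Tx, 78): values at steps 0 .. Tx-1
Y = np.swapaxes(one_hot[:, 1:, :], 0, 1)   # (Ty, m, 78): values at steps 1 .. Tx, time first

print(X.shape, Y.shape)
```

With this layout Y[t, i] is the value that follows X[i, t], and list(Y) gives one (m, 78) target array per time step, which is the form a model with one output per time step expects.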

The model structure is as follows:

Three Keras layer objects are defined here:

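from keras.layers import Dense, Input, LSTM, Reshape, Lambda
from keras.models import Model

n_a, n_values = 64, 78                             # LSTM units / unique values, as in the djmodel() call below
reshapor = Reshape((1, 78))                        # Used in Step 2.B of djmodel(), below
LSTM_cell = LSTM(n_a, return_state = True)         # Used in Step 2.C
densor = Dense(n_values, activation='softmax')     # Used in Step 2.D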

# GRADED FUNCTION: djmodel

def djmodel(Tx, n_a, n_values):
    """
    Implement the model

    Arguments:
    Tx -- length of the sequence in a corpus
    n_a -- the number of activations used in our model
    n_values -- number of unique values in the music data 

    Returns:
    model -- a keras model with the 
    """

    # Define the input of your model with a shape 
    X = Input(shape=(Tx, n_values))

    # Define s0, initial hidden state for the decoder LSTM
    a0 = Input(shape=(n_a,), name='a0')
    c0 = Input(shape=(n_a,), name='c0')
    a = a0
    c = c0

    ### START CODE HERE ### 
    # Step 1: Create empty list to append the outputs while you iterate (≈1 line)
    outputs = []

    # Step 2: Loop
    for t in range(Tx):

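        # Step 2.A: select the "t"th time step vector from X.
        # Bind t as a default argument and slice the layer's own input, so the
        # Lambda does not depend on late-bound closure variables.
        x = Lambda(lambda z, t=t: z[:, t, :])(X)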
        # Step 2.B: Use reshapor to reshape x to be (1, n_values) (≈1 line)
        x = reshapor(x)
        # Step 2.C: Perform one step of the LSTM_cell
        a, _, c = LSTM_cell(x, initial_state=[a, c])
        # Step 2.D: Apply densor to the hidden state output of LSTM_Cell
        out = densor(a)
        # Step 2.E: add the output to "outputs"
        outputs.append(out)

    # Step 3: Create model instance
    model = Model(inputs=[X, a0, c0], outputs=outputs)

    ### END CODE HERE ###

    return model

model = djmodel(Tx = 30 , n_a = 64, n_values = 78)

opt = Adam(lr=0.01, beta_1=0.9, beta_2=0.999, decay=0.01)

model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

m = 60
a0 = np.zeros((m, n_a))
c0 = np.zeros((m, n_a))

model.fit([X, a0, c0], list(Y), epochs=100)

Music generation model

# GRADED FUNCTION: music_inference_model

def music_inference_model(LSTM_cell, densor, n_values = 78, n_a = 64, Ty = 100):
    """
    Uses the trained "LSTM_cell" and "densor" from model() to generate a sequence of values.
    
    Arguments:
    LSTM_cell -- the trained "LSTM_cell" from model(), Keras layer object
    densor -- the trained "densor" from model(), Keras layer object
    n_values -- integer, number of unique values
    n_a -- number of units in the LSTM_cell
    Ty -- integer, number of time steps to generate
    
    Returns:
    inference_model -- Keras model instance
    """
    
    # Define the input of your model with a shape 
    x0 = Input(shape=(1, n_values))
    
    # Define s0, initial hidden state for the decoder LSTM
    a0 = Input(shape=(n_a,), name='a0')
    c0 = Input(shape=(n_a,), name='c0')
    a = a0
    c = c0
    x = x0

    ### START CODE HERE ###
    # Step 1: Create an empty list of "outputs" to later store your predicted values (≈1 line)
    outputs = []
    
    # Step 2: Loop over Ty and generate a value at every time step
    for t in range(Ty):
        
        # Step 2.A: Perform one step of LSTM_cell (≈1 line)
        a, _, c = LSTM_cell(x, initial_state=[a, c])
        
        # Step 2.B: Apply Dense layer to the hidden state output of the LSTM_cell (≈1 line)
        out = densor(a)

        # Step 2.C: Append the prediction "out" to "outputs". out.shape = (None, 78) (≈1 line)
        outputs.append(out)
        
        # Step 2.D: Select the next value according to "out", and set "x" to be the one-hot representation of the
        #           selected value, which will be passed as the input to LSTM_cell on the next step. We have provided 
        #           the line of code you need to do this. 
        x = Lambda(one_hot)(out)
        
    # Step 3: Create model instance with the correct "inputs" and "outputs" (≈1 line)
    inference_model = Model(inputs=[x0, a0, c0], outputs=outputs)
    
    ### END CODE HERE ###
    
    return inference_model

inference_model = music_inference_model(LSTM_cell, densor, n_values = 78, n_a = 64, Ty = 50)

x_initializer = np.zeros((1, 1, 78))
a_initializer = np.zeros((1, n_a))
c_initializer = np.zeros((1, n_a))

# GRADED FUNCTION: predict_and_sample

def predict_and_sample(inference_model, x_initializer = x_initializer, a_initializer = a_initializer, 
                       c_initializer = c_initializer):
    """
    Predicts the next value of values using the inference model.
    
    Arguments:
    inference_model -- Keras model instance for inference time
    x_initializer -- numpy array of shape (1, 1, 78), one-hot vector initializing the values generation
    a_initializer -- numpy array of shape (1, n_a), initializing the hidden state of the LSTM_cell
    c_initializer -- numpy array of shape (1, n_a), initializing the cell state of the LSTM_cell
    
    Returns:
    results -- numpy-array of shape (Ty, 78), matrix of one-hot vectors representing the values generated
    indices -- numpy-array of shape (Ty, 1), matrix of indices representing the values generated
    """
    
    ### START CODE HERE ###
    # Step 1: Use your inference model to predict an output sequence given x_initializer, a_initializer and c_initializer.
    pred = inference_model.predict([x_initializer, a_initializer, c_initializer])
    # Step 2: Convert "pred" into an np.array() of indices with the maximum probabilities
    indices = np.argmax(pred, axis=-1)
    # Step 3: Convert indices to one-hot vectors, the shape of the results should be (1, )
    results = to_categorical(indices, num_classes=x_initializer.shape[-1])
    ### END CODE HERE ###
    
    return results, indices

out_stream = generate_music(inference_model)

DeepLearning.ai assignment: (5-1) -- Recurrent Neural Networks (2)

2018-10-18T08:20:33.000Z

Assignment 2 builds a character-level language model to generate dinosaur names.

Part2:Character level language model - Dinosaurus land

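Model structure

Initialize the parameters
Run the optimization loop
- Forward propagation to compute the loss
- Backward propagation to compute the gradients of the loss with respect to the parameters
- Clip the gradients to avoid exploding gradients
- Use the gradients to update the parameters with gradient descent
Return the learned parameters

Gradient clipping

This makes sure the gradients do not explode.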

### GRADED FUNCTION: clip

def clip(gradients, maxValue):
    '''
    Clips the gradients' values between minimum and maximum.
    
    Arguments:
    gradients -- a dictionary containing the gradients "dWaa", "dWax", "dWya", "db", "dby"
    maxValue -- everything above this number is set to this number, and everything less than -maxValue is set to -maxValue
    
    Returns: 
    gradients -- a dictionary with the clipped gradients.
    '''
    
    dWaa, dWax, dWya, db, dby = gradients['dWaa'], gradients['dWax'], gradients['dWya'], gradients['db'], gradients['dby']
   
    ### START CODE HERE ###
    # clip to mitigate exploding gradients, loop over [dWax, dWaa, dWya, db, dby]. (≈2 lines)
    for gradient in [dWax, dWaa, dWya, db, dby]:
        np.clip(gradient, -1 * maxValue, maxValue,out=gradient)
    ### END CODE HERE ###
    
    gradients = {"dWaa": dWaa, "dWax": dWax, "dWya": dWya, "db": db, "dby": dby}
    
    return gradients
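
The function relies on np.clip writing in place when out= is given, which is why it can rebuild the dictionary from the same arrays afterwards. A tiny standalone check with made-up numbers:

```python
import numpy as np

grad = np.array([[-12.0, 0.3],
                 [4.9, 7.5]])
same_buffer = grad                 # second reference to the same array

# out=grad makes np.clip overwrite grad instead of allocating a new array.
np.clip(grad, -5, 5, out=grad)

print(grad)
print(grad is same_buffer)
```

Values inside [-5, 5] are untouched; only the components outside the range are moved to the boundary.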

Sampling

Now assume the model has been trained and you want to use it to generate new characters. The process is as follows:

# GRADED FUNCTION: sample

def sample(parameters, char_to_ix, seed):
    """
    Sample a sequence of characters according to a sequence of probability distributions output of the RNN

    Arguments:
    parameters -- python dictionary containing the parameters Waa, Wax, Wya, by, and b. 
    char_to_ix -- python dictionary mapping each character to an index.
    seed -- used for grading purposes. Do not worry about it.

    Returns:
    indices -- a list of length n containing the indices of the sampled characters.
    """
    
    # Retrieve parameters and relevant shapes from "parameters" dictionary
    Waa, Wax, Wya, by, b = parameters['Waa'], parameters['Wax'], parameters['Wya'], parameters['by'], parameters['b']
    vocab_size = by.shape[0]
    n_a = Waa.shape[1]
    
    ### START CODE HERE ###
    # Step 1: Create the one-hot vector x for the first character (initializing the sequence generation). (≈1 line)
    x = np.zeros((vocab_size, 1))
    # Step 1': Initialize a_prev as zeros (≈1 line)
    a_prev = np.zeros((n_a, 1))
    
    # Create an empty list of indices, this is the list which will contain the list of indices of the characters to generate (≈1 line)
    indices = []
    
    # Idx is a flag to detect a newline character, we initialize it to -1
    idx = -1 
    
    # Loop over time-steps t. At each time-step, sample a character from a probability distribution and append 
    # its index to "indices". We'll stop if we reach 50 characters (which should be very unlikely with a well 
    # trained model), which helps debugging and prevents entering an infinite loop. 
    counter = 0
    newline_character = char_to_ix['\n']
    
    while (idx != newline_character and counter != 50):
        
        # Step 2: Forward propagate x using the equations (1), (2) and (3)
        a = np.tanh(np.dot(Wax, x) + np.dot(Waa, a_prev) + b)
        z = np.dot(Wya, a) + by
        y = softmax(z)
        
        # for grading purposes
        np.random.seed(counter+seed) 
        
        # Step 3: Sample the index of a character within the vocabulary from the probability distribution y
        idx = np.random.choice(range(len(y)),p = y.ravel())

        # Append the index to "indices"
        indices.append(idx)
        
        # Step 4: Overwrite the input character as the one corresponding to the sampled index.
        x = np.zeros((vocab_size, 1))
        x[idx] = 1
        
        # Update "a_prev" to be "a"
        a_prev = a
        
        # for grading purposes
        seed += 1
        counter +=1
        
    ### END CODE HERE ###

    if (counter == 50):
        indices.append(char_to_ix['\n'])
    
    return indices

Building the model

The helper functions are all provided already.

# GRADED FUNCTION: optimize

def optimize(X, Y, a_prev, parameters, learning_rate = 0.01):
    """
    Execute one step of the optimization to train the model.
    
    Arguments:
    X -- list of integers, where each integer is a number that maps to a character in the vocabulary.
    Y -- list of integers, exactly the same as X but shifted one index to the left.
    a_prev -- previous hidden state.
    parameters -- python dictionary containing:
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        b --  Bias, numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
    learning_rate -- learning rate for the model.
    
    Returns:
    loss -- value of the loss function (cross-entropy)
    gradients -- python dictionary containing:
                        dWax -- Gradients of input-to-hidden weights, of shape (n_a, n_x)
                        dWaa -- Gradients of hidden-to-hidden weights, of shape (n_a, n_a)
                        dWya -- Gradients of hidden-to-output weights, of shape (n_y, n_a)
                        db -- Gradients of bias vector, of shape (n_a, 1)
                        dby -- Gradients of output bias vector, of shape (n_y, 1)
    a[len(X)-1] -- the last hidden state, of shape (n_a, 1)
    """
    
    ### START CODE HERE ###
    
    # Forward propagate through time (≈1 line)
    loss, cache = rnn_forward(X, Y, a_prev, parameters)
    
    # Backpropagate through time (≈1 line)
    gradients, a = rnn_backward(X, Y, parameters, cache)
    
    # Clip your gradients between -5 (min) and 5 (max) (≈1 line)
    gradients = clip(gradients, 5)
    
    # Update parameters (≈1 line)
    parameters = update_parameters(parameters, gradients, learning_rate)
    
    ### END CODE HERE ###
    
    return loss, gradients, a[len(X)-1]

Training the model

# GRADED FUNCTION: model

def model(data, ix_to_char, char_to_ix, num_iterations = 35000, n_a = 50, dino_names = 7, vocab_size = 27):
    """
    Trains the model and generates dinosaur names. 
    
    Arguments:
    data -- text corpus
    ix_to_char -- dictionary that maps the index to a character
    char_to_ix -- dictionary that maps a character to an index
    num_iterations -- number of iterations to train the model for
    n_a -- number of units of the RNN cell
    dino_names -- number of dinosaur names you want to sample at each iteration. 
    vocab_size -- number of unique characters found in the text, size of the vocabulary
    
    Returns:
    parameters -- learned parameters
    """
    
    # Retrieve n_x and n_y from vocab_size
    n_x, n_y = vocab_size, vocab_size
    
    # Initialize parameters
    parameters = initialize_parameters(n_a, n_x, n_y)
    
    # Initialize loss (this is required because we want to smooth our loss, don't worry about it)
    loss = get_initial_loss(vocab_size, dino_names)
    
    # Build list of all dinosaur names (training examples).
    with open("dinos.txt") as f:
        examples = f.readlines()
    examples = [x.lower().strip() for x in examples]
    
    # Shuffle list of all dinosaur names
    np.random.seed(0)
    np.random.shuffle(examples)
    
    # Initialize the hidden state of your LSTM
    a_prev = np.zeros((n_a, 1))
    
    # Optimization loop
    for j in range(num_iterations):
        
        ### START CODE HERE ###
        
        # Use the hint above to define one training example (X,Y) (≈ 2 lines)
        index = j % len(examples)
        X = [None] + [char_to_ix[ch] for ch in examples[index]]
        Y = X[1:] + [char_to_ix['\n']]
        
        # Perform one optimization step: Forward-prop -> Backward-prop -> Clip -> Update parameters
        # Choose a learning rate of 0.01
        curr_loss, gradients, a_prev = optimize(X, Y, a_prev, parameters, learning_rate=0.01)
        
        ### END CODE HERE ###
        
        # Use a latency trick to keep the loss smooth. It happens here to accelerate the training.
        loss = smooth(loss, curr_loss)

        # Every 2000 Iteration, generate "n" characters thanks to sample() to check if the model is learning properly
        if j % 2000 == 0:
            
            print('Iteration: %d, Loss: %f' % (j, loss) + '\n')
            
            # The number of dinosaur names to print
            seed = 0
            for name in range(dino_names):
                
                # Sample indices and print them
                sampled_indices = sample(parameters, char_to_ix, seed)
                print_sample(sampled_indices, ix_to_char)
                
                seed += 1  # To get the same result for grading purposed, increment the seed by one. 
      
            print('\n')
        
    return parameters

DeepLearning.ai assignment: (5-1) -- Recurrent Neural Networks (1)

2018-10-18T02:26:56.000Z

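This week's assignment has three parts:

Build an RNN model by hand
Build a character-level language model to generate dinosaur names
Generate jazz with an LSTM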

Part1:Building a recurrent neural network - step by step

Let's build an RNN.

1 - Forward propagation for the basic Recurrent Neural Network

We start with forward propagation. To build the network, first build the RNN cell that handles a single time step:

RNN cell

Compute the hidden state with tanh activation: $a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$
Using your new hidden state $a^{\langle t \rangle}$, compute the prediction $\hat{y}^{\langle t \rangle} = softmax(W_{ya} a^{\langle t \rangle} + b_y)$. We provided you a function: softmax.
Store $(a^{\langle t \rangle}, a^{\langle t-1 \rangle}, x^{\langle t \rangle}, parameters)$ in cache
Return $a^{\langle t \rangle}$ , $y^{\langle t \rangle}$ and cache

We will vectorize over $m$ examples. Thus, $x^{\langle t \rangle}$ will have dimension $(n_x,m)$, and $a^{\langle t \rangle}$ will have dimension $(n_a,m)$.

# GRADED FUNCTION: rnn_cell_forward

def rnn_cell_forward(xt, a_prev, parameters):
    """
    Implements a single forward step of the RNN-cell as described in Figure (2)

    Arguments:
    xt -- your input data at timestep "t", numpy array of shape (n_x, m).
    a_prev -- Hidden state at timestep "t-1", numpy array of shape (n_a, m)
    parameters -- python dictionary containing:
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        ba --  Bias, numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
    Returns:
    a_next -- next hidden state, of shape (n_a, m)
    yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
    cache -- tuple of values needed for the backward pass, contains (a_next, a_prev, xt, parameters)
    """
    
    # Retrieve parameters from "parameters"
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Wya = parameters["Wya"]
    ba = parameters["ba"]
    by = parameters["by"]
    
    ### START CODE HERE ### (≈2 lines)
    # compute next activation state using the formula given above
    a_next = np.tanh(np.dot(Waa, a_prev) + np.dot(Wax, xt) + ba)
    # compute output of the current cell using the formula given above
    yt_pred = softmax(np.dot(Wya, a_next) + by)    
    ### END CODE HERE ###
    
    # store values you need for backward propagation in cache
    cache = (a_next, a_prev, xt, parameters)
    
    return a_next, yt_pred, cache
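
A quick way to sanity-check a single step is to run the two formulas on random data and look at the shapes. The assignment provides its own softmax; the column-wise version below is my stand-in for it, and the sizes are arbitrary.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax (assumed stand-in for the provided helper)."""
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(1)
n_x, n_a, n_y, m = 3, 5, 2, 10

xt = rng.standard_normal((n_x, m))
a_prev = rng.standard_normal((n_a, m))
Wax = rng.standard_normal((n_a, n_x))
Waa = rng.standard_normal((n_a, n_a))
Wya = rng.standard_normal((n_y, n_a))
ba = rng.standard_normal((n_a, 1))   # broadcast across the m examples
by = rng.standard_normal((n_y, 1))

a_next = np.tanh(Waa @ a_prev + Wax @ xt + ba)
yt_pred = softmax(Wya @ a_next + by)

print(a_next.shape, yt_pred.shape)
```

The hidden state keeps shape (n_a, m), and each column of yt_pred is a probability distribution over the n_y outputs.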

RNN forward pass

The approach is:

First initialize a and y_pred with zeros
Then set a_next = a0
Then loop over the Tx time steps, computing a, y and the cache at each step

# GRADED FUNCTION: rnn_forward

def rnn_forward(x, a0, parameters):
    """
    Implement the forward propagation of the recurrent neural network described in Figure (3).

    Arguments:
    x -- Input data for every time-step, of shape (n_x, m, T_x).
    a0 -- Initial hidden state, of shape (n_a, m)
    parameters -- python dictionary containing:
                        Waa -- Weight matrix multiplying the hidden state, numpy array of shape (n_a, n_a)
                        Wax -- Weight matrix multiplying the input, numpy array of shape (n_a, n_x)
                        Wya -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        ba --  Bias numpy array of shape (n_a, 1)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)

    Returns:
    a -- Hidden states for every time-step, numpy array of shape (n_a, m, T_x)
    y_pred -- Predictions for every time-step, numpy array of shape (n_y, m, T_x)
    caches -- tuple of values needed for the backward pass, contains (list of caches, x)
    """
    
    # Initialize "caches" which will contain the list of all caches
    caches = []
    
    # Retrieve dimensions from shapes of x and parameters["Wya"]
    n_x, m, T_x = x.shape
    n_y, n_a = parameters["Wya"].shape
    
    ### START CODE HERE ###
    
    # initialize "a" and "y" with zeros (≈2 lines)
    a = np.zeros((n_a, m, T_x))
    y_pred = np.zeros((n_y, m, T_x))
    
    # Initialize a_next (≈1 line)
    a_next = a0
    
    # loop over all time-steps
    for t in range(T_x):
        # Update next hidden state, compute the prediction, get the cache (≈1 line)
        a_next, yt_pred, cache = rnn_cell_forward(x[:, :, t], a_next, parameters)
        # Save the value of the new "next" hidden state in a (≈1 line)
        a[:,:,t] = a_next
        # Save the value of the prediction in y (≈1 line)
        y_pred[:,:,t] = yt_pred
        # Append "cache" to "caches" (≈1 line)
        caches.append(cache)
        
    ### END CODE HERE ###
    
    # store values needed for backward propagation in cache
    caches = (caches, x)
    
    return a, y_pred, caches

2 - Long Short-Term Memory (LSTM) network

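Next we build an LSTM network.

Forget gate:

Suppose we are reading the words of a text and want the LSTM to track grammatical structure, such as whether the subject is singular or plural. When the subject changes from singular to plural, we need a way to get rid of the previously stored singular/plural state.

In an LSTM, the forget gate lets us do this: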

$$\Gamma_f^{\langle t \rangle} = \sigma(W_f[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f)$$

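Update gate:

Once we have forgotten that the subject being discussed is singular, we need a way to update the state to reflect that the new subject is plural.

$$\Gamma_u^{\langle t \rangle} = \sigma(W_u[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)$$

Combining the two gates gives the update of the cell state: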

$$ \tilde{c}^{\langle t \rangle} = \tanh(W_c[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c) $$

$$ c^{\langle t \rangle} = \Gamma_f^{\langle t \rangle} * c^{\langle t-1 \rangle} + \Gamma_u^{\langle t \rangle} * \tilde{c}^{\langle t \rangle} $$

Output gate:

To decide the output, we use the following two formulas:

$$ \Gamma_o^{\langle t \rangle}= \sigma(W_o[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o)$$
$$ a^{\langle t \rangle} = \Gamma_o^{\langle t \rangle}* \tanh(c^{\langle t \rangle}) $$

LSTM cell

First concatenate $a^{\langle t-1 \rangle}$ and $x^{\langle t \rangle}$ into $concat = \begin{bmatrix} a^{\langle t-1 \rangle} \\ x^{\langle t \rangle} \end{bmatrix}$
Compute the 6 formulas above
Then predict the output y

# GRADED FUNCTION: lstm_cell_forward

def lstm_cell_forward(xt, a_prev, c_prev, parameters):
    """
    Implement a single forward step of the LSTM-cell as described in Figure (4)

    Arguments:
    xt -- your input data at timestep "t", numpy array of shape (n_x, m).
    a_prev -- Hidden state at timestep "t-1", numpy array of shape (n_a, m)
    c_prev -- Memory state at timestep "t-1", numpy array of shape (n_a, m)
    parameters -- python dictionary containing:
                        Wf -- Weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                        bf -- Bias of the forget gate, numpy array of shape (n_a, 1)
                        Wi -- Weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                        bi -- Bias of the update gate, numpy array of shape (n_a, 1)
                        Wc -- Weight matrix of the first "tanh", numpy array of shape (n_a, n_a + n_x)
                        bc --  Bias of the first "tanh", numpy array of shape (n_a, 1)
                        Wo -- Weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                        bo --  Bias of the output gate, numpy array of shape (n_a, 1)
                        Wy -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
                        
    Returns:
    a_next -- next hidden state, of shape (n_a, m)
    c_next -- next memory state, of shape (n_a, m)
    yt_pred -- prediction at timestep "t", numpy array of shape (n_y, m)
    cache -- tuple of values needed for the backward pass, contains (a_next, c_next, a_prev, c_prev, xt, parameters)
    
    Note: ft/it/ot stand for the forget/update/output gates, cct stands for the candidate value (c tilde),
          c stands for the memory value
    """

    # Retrieve parameters from "parameters"
    Wf = parameters["Wf"]
    bf = parameters["bf"]
    Wi = parameters["Wi"]
    bi = parameters["bi"]
    Wc = parameters["Wc"]
    bc = parameters["bc"]
    Wo = parameters["Wo"]
    bo = parameters["bo"]
    Wy = parameters["Wy"]
    by = parameters["by"]
    
    # Retrieve dimensions from shapes of xt and Wy
    n_x, m = xt.shape
    n_y, n_a = Wy.shape

    ### START CODE HERE ###
    # Concatenate a_prev and xt (≈3 lines)
    concat = np.zeros((n_x + n_a, m))
    concat[: n_a, :] = a_prev  
    concat[n_a :, :] = xt 

    # Compute values for ft, it, cct, c_next, ot, a_next using the formulas given figure (4) (≈6 lines)
    ft = sigmoid(np.dot(Wf, concat) + bf)
    it = sigmoid(np.dot(Wi, concat) + bi)
    cct = np.tanh(np.dot(Wc, concat) + bc)
    c_next = ft * c_prev + it * cct
    ot = sigmoid(np.dot(Wo, concat) + bo)
    a_next = ot * np.tanh(c_next)
    
    # Compute prediction of the LSTM cell (≈1 line)
    yt_pred = softmax(np.dot(Wy, a_next) + by)
    ### END CODE HERE ###

    # store values needed for backward propagation in cache
    cache = (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters)

    return a_next, c_next, yt_pred, cache

Forward pass for LSTM

# GRADED FUNCTION: lstm_forward

def lstm_forward(x, a0, parameters):
    """
    Implement the forward propagation of the recurrent neural network using an LSTM-cell described in Figure (3).

    Arguments:
    x -- Input data for every time-step, of shape (n_x, m, T_x).
    a0 -- Initial hidden state, of shape (n_a, m)
    parameters -- python dictionary containing:
                        Wf -- Weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                        bf -- Bias of the forget gate, numpy array of shape (n_a, 1)
                        Wi -- Weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                        bi -- Bias of the update gate, numpy array of shape (n_a, 1)
                        Wc -- Weight matrix of the first "tanh", numpy array of shape (n_a, n_a + n_x)
                        bc -- Bias of the first "tanh", numpy array of shape (n_a, 1)
                        Wo -- Weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                        bo -- Bias of the output gate, numpy array of shape (n_a, 1)
                        Wy -- Weight matrix relating the hidden-state to the output, numpy array of shape (n_y, n_a)
                        by -- Bias relating the hidden-state to the output, numpy array of shape (n_y, 1)
                        
    Returns:
    a -- Hidden states for every time-step, numpy array of shape (n_a, m, T_x)
    y -- Predictions for every time-step, numpy array of shape (n_y, m, T_x)
    c -- Memory states for every time-step, numpy array of shape (n_a, m, T_x)
    caches -- tuple of values needed for the backward pass, contains (list of all the caches, x)
    """

    # Initialize "caches", which will track the list of all the caches
    caches = []
    
    ### START CODE HERE ###
    # Retrieve dimensions from shapes of x and parameters['Wy'] (≈2 lines)
    n_x, m, T_x = x.shape
    n_y, n_a = parameters['Wy'].shape

    # initialize "a", "c" and "y" with zeros (≈3 lines)
    a = np.zeros((n_a, m, T_x))
    c = np.zeros((n_a, m, T_x))
    y = np.zeros((n_y, m, T_x))
    
    # Initialize a_next and c_next (≈2 lines)
    a_next = a0
    c_next = np.zeros((n_a, m))
    
    # loop over all time-steps
    for t in range(T_x):
        # Update next hidden state, next memory state, compute the prediction, get the cache (≈1 line)
        a_next, c_next, yt, cache = lstm_cell_forward(x[:, :, t], a_next, c_next, parameters)
        # Save the value of the new "next" hidden state in a (≈1 line)
        a[:,:,t] = a_next
        # Save the value of the prediction in y (≈1 line)
        y[:,:,t] = yt
        # Save the value of the next cell state (≈1 line)
        c[:,:,t]  = c_next
        # Append the cache into caches (≈1 line)
        caches.append(cache)
        
    ### END CODE HERE ###
    
    # store values needed for backward propagation in cache
    caches = (caches, x)

    return a, y, c, caches

3 - Backpropagation in recurrent neural networks

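Next comes backpropagation for the RNN. Frameworks normally implement this for us, so a read-through is enough here. The formulas are also fairly involved.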

RNN backward pass

def rnn_cell_backward(da_next, cache):
    """
    Implements the backward pass for the RNN-cell (single time-step).

    Arguments:
    da_next -- Gradient of loss with respect to next hidden state
    cache -- tuple containing useful values (output of rnn_cell_forward())

    Returns:
    gradients -- python dictionary containing:
                        dxt -- Gradients of input data, of shape (n_x, m)
                        da_prev -- Gradients of previous hidden state, of shape (n_a, m)
                        dWax -- Gradients of input-to-hidden weights, of shape (n_a, n_x)
                        dWaa -- Gradients of hidden-to-hidden weights, of shape (n_a, n_a)
                        dba -- Gradients of bias vector, of shape (n_a, 1)
    """
    
    # Retrieve values from cache
    (a_next, a_prev, xt, parameters) = cache
    
    # Retrieve values from parameters
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Wya = parameters["Wya"]
    ba = parameters["ba"]
    by = parameters["by"]

    ### START CODE HERE ###
    # compute the gradient of tanh with respect to a_next (≈1 line)
    dtanh = (1 - a_next**2) * da_next

    # compute the gradient of the loss with respect to Wax (≈2 lines)
    dxt = np.dot(Wax.T, dtanh)
    dWax = np.dot(dtanh, xt.T)

    # compute the gradient with respect to Waa (≈2 lines)
    da_prev = np.dot(Waa.T, dtanh)
    dWaa = np.dot(dtanh, a_prev.T)

    # compute the gradient with respect to b (≈1 line)
    dba = np.sum(dtanh, keepdims=True, axis=-1)

    ### END CODE HERE ###
    
    # Store the gradients in a python dictionary
    gradients = {"dxt": dxt, "da_prev": da_prev, "dWax": dWax, "dWaa": dWaa, "dba": dba}
    
    return gradients
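
Since the formulas are easy to get wrong, a numerical gradient check is a cheap safeguard. The sketch below re-implements only the tanh step with random data and compares the analytic dWax from the formulas above against a central-difference estimate. The scalar loss is an artificial one, chosen so that its gradient with respect to a_next is exactly da_next.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_a, m = 3, 4, 5
xt = rng.standard_normal((n_x, m))
a_prev = rng.standard_normal((n_a, m))
Wax = rng.standard_normal((n_a, n_x))
Waa = rng.standard_normal((n_a, n_a))
ba = rng.standard_normal((n_a, 1))
da_next = rng.standard_normal((n_a, m))   # pretend upstream gradient

def forward(Wax_):
    return np.tanh(Waa @ a_prev + Wax_ @ xt + ba)

def loss(Wax_):
    # Artificial scalar loss whose gradient w.r.t. a_next is exactly da_next.
    return np.sum(forward(Wax_) * da_next)

# Analytic gradient, same formulas as in rnn_cell_backward.
a_next = forward(Wax)
dtanh = (1 - a_next ** 2) * da_next
dWax = dtanh @ xt.T

# Central-difference estimate, one entry of Wax at a time.
eps = 1e-6
dWax_num = np.zeros_like(Wax)
for i in range(n_a):
    for j in range(n_x):
        Wp, Wm = Wax.copy(), Wax.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        dWax_num[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

max_err = np.max(np.abs(dWax - dWax_num))
print(max_err)
```

The same pattern works for dWaa, dba, dxt and da_prev by perturbing the corresponding input instead of Wax.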

def rnn_backward(da, caches):
    """
    Implement the backward pass for a RNN over an entire sequence of input data.

    Arguments:
    da -- Upstream gradients of all hidden states, of shape (n_a, m, T_x)
    caches -- tuple containing information from the forward pass (rnn_forward)
    
    Returns:
    gradients -- python dictionary containing:
                        dx -- Gradient w.r.t. the input data, numpy-array of shape (n_x, m, T_x)
                        da0 -- Gradient w.r.t the initial hidden state, numpy-array of shape (n_a, m)
                        dWax -- Gradient w.r.t the input's weight matrix, numpy-array of shape (n_a, n_x)
                        dWaa -- Gradient w.r.t the hidden state's weight matrix, numpy-arrayof shape (n_a, n_a)
                        dba -- Gradient w.r.t the bias, of shape (n_a, 1)
    """
        
    ### START CODE HERE ###
    
    # Retrieve values from the first cache (t=1) of caches (≈2 lines)
    (caches, x) = caches
    (a1, a0, x1, parameters) = caches[0]
    
    # Retrieve dimensions from da's and x1's shapes (≈2 lines)
    n_a, m, T_x = da.shape
    n_x, m = x1.shape
    
    # initialize the gradients with the right sizes (≈6 lines)
    dx = np.zeros((n_x, m, T_x))
    dWax = np.zeros((n_a, n_x))
    dWaa = np.zeros((n_a, n_a))
    dba = np.zeros((n_a, 1))
    da0 = np.zeros((n_a, m))
    da_prevt = np.zeros((n_a, m))

    # Loop through all the time steps
    for t in reversed(range(T_x)):
        # Compute gradients at time step t. Choose wisely the "da_next" and the "cache" to use in the backward propagation step. (≈1 line)
        gradients = rnn_cell_backward(da[:, :, t] + da_prevt, caches[t])
        # Retrieve derivatives from gradients (≈ 1 line)
        dxt, da_prevt, dWaxt, dWaat, dbat = gradients["dxt"], gradients["da_prev"], gradients["dWax"], gradients["dWaa"], gradients["dba"]
        # Increment global derivatives w.r.t parameters by adding their derivative at time-step t (≈4 lines)
        dx[:, :, t] = dxt
        dWax += dWaxt
        dWaa += dWaat
        dba += dbat

    # Set da0 to the gradient of a which has been backpropagated through all time-steps (≈1 line) 
    da0 = da_prevt
    ### END CODE HERE ###

    # Store the gradients in a python dictionary
    gradients = {"dx": dx, "da0": da0, "dWax": dWax, "dWaa": dWaa,"dba": dba}
    
    return gradients

LSTM backward pass

def lstm_cell_backward(da_next, dc_next, cache):
    """
    Implement the backward pass for the LSTM-cell (single time-step).

    Arguments:
    da_next -- Gradients of next hidden state, of shape (n_a, m)
    dc_next -- Gradients of next cell state, of shape (n_a, m)
    cache -- cache storing information from the forward pass

    Returns:
    gradients -- python dictionary containing:
                        dxt -- Gradient of input data at time-step t, of shape (n_x, m)
                        da_prev -- Gradient w.r.t. the previous hidden state, numpy array of shape (n_a, m)
                        dc_prev -- Gradient w.r.t. the previous memory state, of shape (n_a, m)
                        dWf -- Gradient w.r.t. the weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                        dWi -- Gradient w.r.t. the weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                        dWc -- Gradient w.r.t. the weight matrix of the memory gate, numpy array of shape (n_a, n_a + n_x)
                        dWo -- Gradient w.r.t. the weight matrix of the output gate, numpy array of shape (n_a, n_a + n_x)
                        dbf -- Gradient w.r.t. biases of the forget gate, of shape (n_a, 1)
                        dbi -- Gradient w.r.t. biases of the update gate, of shape (n_a, 1)
                        dbc -- Gradient w.r.t. biases of the memory gate, of shape (n_a, 1)
                        dbo -- Gradient w.r.t. biases of the output gate, of shape (n_a, 1)
    """

    # Retrieve information from "cache"
    (a_next, c_next, a_prev, c_prev, ft, it, cct, ot, xt, parameters) = cache
    
    ### START CODE HERE ###
    # Retrieve dimensions from xt's and a_next's shape (≈2 lines)
    n_x, m = xt.shape
    n_a, m = a_next.shape

    # Compute gates related derivatives; their values can be found by looking carefully at equations (7) to (10) (≈4 lines)
    dot = da_next * np.tanh(c_next) * ot * (1-ot)
    dcct = (dc_next*it+ot*(1-np.square(np.tanh(c_next)))*it*da_next)*(1-np.square(cct))
    dit = (dc_next*cct+ot*(1-np.square(np.tanh(c_next)))*cct*da_next)*it*(1-it)
    dft = (dc_next*c_prev+ot*(1-np.square(np.tanh(c_next)))*c_prev*da_next)*ft*(1-ft) 

    # Code equations (7) to (10) (≈4 lines)
    # dit = None
    # dft = None
    # dot = None
    # dcct = None

    # Compute parameters related derivatives. Use equations (11)-(14) (≈8 lines)
    dWf = np.dot(dft, np.concatenate((a_prev, xt), axis=0).T)
    dWi = np.dot(dit, np.concatenate((a_prev, xt), axis=0).T)
    dWc = np.dot(dcct, np.concatenate((a_prev, xt), axis=0).T)
    dWo = np.dot(dot, np.concatenate((a_prev, xt), axis=0).T)
    dbf = np.sum(dft, axis=1, keepdims=True)
    dbi = np.sum(dit, axis=1, keepdims=True)
    dbc = np.sum(dcct, axis=1, keepdims=True)
    dbo = np.sum(dot, axis=1, keepdims=True)

    # Compute derivatives w.r.t previous hidden state, previous memory state and input. Use equations (15)-(17). (≈3 lines)
    da_prev = np.dot(parameters['Wf'][:,:n_a].T, dft) + np.dot(parameters['Wi'][:,:n_a].T, dit) + np.dot(parameters['Wc'][:,:n_a].T, dcct) + np.dot(parameters['Wo'][:,:n_a].T, dot)
    dc_prev = dc_next*ft + ot*(1-np.square(np.tanh(c_next)))*ft*da_next
    dxt = np.dot(parameters['Wf'][:,n_a:].T,dft)+np.dot(parameters['Wi'][:,n_a:].T,dit)+np.dot(parameters['Wc'][:,n_a:].T,dcct)+np.dot(parameters['Wo'][:,n_a:].T,dot) 
    ### END CODE HERE ###

    # Save gradients in dictionary
    gradients = {"dxt": dxt, "da_prev": da_prev, "dc_prev": dc_prev, "dWf": dWf,"dbf": dbf, "dWi": dWi,"dbi": dbi,
                "dWc": dWc,"dbc": dbc, "dWo": dWo,"dbo": dbo}

    return gradients

def lstm_backward(da, caches):
    
    """
    Implement the backward pass for the RNN with LSTM-cell (over a whole sequence).

    Arguments:
    da -- Gradients w.r.t the hidden states, numpy-array of shape (n_a, m, T_x)
    dc -- Gradients w.r.t the memory states, numpy-array of shape (n_a, m, T_x)
    caches -- cache storing information from the forward pass (lstm_forward)

    Returns:
    gradients -- python dictionary containing:
                        dx -- Gradient of inputs, of shape (n_x, m, T_x)
                        da0 -- Gradient w.r.t. the previous hidden state, numpy array of shape (n_a, m)
                        dWf -- Gradient w.r.t. the weight matrix of the forget gate, numpy array of shape (n_a, n_a + n_x)
                        dWi -- Gradient w.r.t. the weight matrix of the update gate, numpy array of shape (n_a, n_a + n_x)
                        dWc -- Gradient w.r.t. the weight matrix of the memory gate, numpy array of shape (n_a, n_a + n_x)
                        dWo -- Gradient w.r.t. the weight matrix of the save gate, numpy array of shape (n_a, n_a + n_x)
                        dbf -- Gradient w.r.t. biases of the forget gate, of shape (n_a, 1)
                        dbi -- Gradient w.r.t. biases of the update gate, of shape (n_a, 1)
                        dbc -- Gradient w.r.t. biases of the memory gate, of shape (n_a, 1)
                        dbo -- Gradient w.r.t. biases of the save gate, of shape (n_a, 1)
    """

    # Retrieve values from the first cache (t=1) of caches.
    (caches, x) = caches
    (a1, c1, a0, c0, f1, i1, cc1, o1, x1, parameters) = caches[0]
    
    ### START CODE HERE ###
    # Retrieve dimensions from da's and x1's shapes (≈2 lines)
    n_a, m, T_x = da.shape
    n_x, m = x1.shape

    # initialize the gradients with the right sizes (≈12 lines)
    dx = np.zeros((n_x, m, T_x))
    da0 = np.zeros((n_a, m))
    da_prevt = np.zeros((n_a, m))
    dc_prevt = np.zeros((n_a, m))
    dWf = np.zeros((n_a, n_a+n_x))
    dWi = np.zeros((n_a, n_a+n_x))
    dWc = np.zeros((n_a, n_a+n_x))
    dWo = np.zeros((n_a, n_a+n_x))
    dbf = np.zeros((n_a, 1))
    dbi = np.zeros((n_a, 1))
    dbc = np.zeros((n_a, 1))
    dbo = np.zeros((n_a, 1))

    # loop back over the whole sequence
    for t in reversed(range(T_x)):
        # Compute all gradients using lstm_cell_backward
        gradients = lstm_cell_backward(da[:, :, t] + da_prevt, dc_prevt, caches[t])
        # Store or add the gradient to the parameters' previous step's gradient
        dx[:,:,t] = gradients['dxt']
        dWf = dWf + gradients['dWf']
        dWi = dWi + gradients['dWi']
        dWc = dWc + gradients['dWc']
        dWo = dWo + gradients['dWo']
        dbf = dbf + gradients['dbf']
        dbi = dbi + gradients['dbi']
        dbc = dbc + gradients['dbc']
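        dbo = dbo + gradients['dbo']
        # Carry the hidden-state and cell-state gradients back to the previous time step
        da_prevt = gradients['da_prev']
        dc_prevt = gradients['dc_prev']
    # Set the first activation's gradient to the backpropagated gradient da_prev.
    da0 = gradients['da_prev']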

    ### END CODE HERE ###

    # Store the gradients in a python dictionary
    gradients = {"dx": dx, "da0": da0, "dWf": dWf,"dbf": dbf, "dWi": dWi,"dbi": dbi,
                "dWc": dWc,"dbc": dbc, "dWo": dWo,"dbo": dbo}
    
    return gradients

DeepLearning.ai Notes: (5-1) -- Recurrent Neural Networks

2018-10-18T02:26:52.000Z

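The fifth course covers sequence models: RNNs and their variants such as GRU and LSTM, applied to areas such as word embeddings, sentiment classification, and speech recognition.

Week 1 covers the basic RNN algorithm.

Applications of sequence models

Sequence models are used in many areas, such as speech recognition, music generation, sentiment classification, DNA sequence analysis, machine translation, video activity recognition, and named entity recognition.

Notation

First, the notation Andrew Ng uses throughout the course.

An input sentence $x$ is split into individual words; the $t$-th word is written $x^{<t>}$ and the corresponding output is written $y^{<t>}$.

The input sequence has length $T_x$ and the output sequence has length $T_y$.

When there are many training sequences, $x^{(i)<t>}$ denotes the $t$-th word of the $i$-th example.

How is $x^{<t>}$ represented numerically? With one-hot encoding: if the vocabulary contains 10000 words, $x^{<t>}$ is a vector of length 10000 in which exactly one entry is 1 and all the others are 0.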
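
The one-hot representation can be sketched in a few lines of NumPy. The five-word vocabulary and the helper name `one_hot` are made up for illustration; a real vocabulary would have on the order of 10000 entries.

```python
import numpy as np

def one_hot(index, vocab_size):
    """Return a column vector of length vocab_size with a single 1 at `index`."""
    vec = np.zeros((vocab_size, 1))
    vec[index, 0] = 1.0
    return vec

# Hypothetical toy vocabulary, used only for this example
vocab = ["a", "cat", "harry", "potter", "zulu"]
word_to_index = {word: i for i, word in enumerate(vocab)}

x_t = one_hot(word_to_index["harry"], len(vocab))
print(x_t.ravel())
```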

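Recurrent neural networks

Suppose we used a standard neural network instead: the input passes through N layers to produce the output y.

This does not work well, because:

Inputs and outputs can have different lengths in different examples (each sentence can have a different length).
A plain network like this does not share features learned at different positions in the text (whereas a convolutional network can quickly generalize a feature it has learned to other positions in an image).

A recurrent neural network therefore computes one time step at a time: it takes the input $x^{<t>}$ and the memory $a^{<t-1>}$ carried over from the previous step, and produces this step's output $y^{<t>}$ and the memory $a^{<t>}$ for the next step.

Note that at time zero we have to make up an activation $a^{<0>}$; usually this is a zero vector, although some researchers initialize it randomly. The folded RNN diagram on the right of the figure above is equivalent to the unrolled one on the left.

An RNN scans the data from left to right and shares its parameters across all time steps.

$W_{ax}$ governs the connection from the input $x^{<t>}$ to the hidden layer; every time step uses the same $W_{ax}$, and the same holds for the two matrices below;
$W_{aa}$ governs the connection from the previous activation $a^{<t-1>}$ to the hidden layer;
$W_{ya}$ governs the connection from the hidden layer to the output $y^{<t>}$.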

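Forward propagation in an RNN

The forward-propagation formulas are shown in the figure. $W_{aa}$ and $W_{ax}$ can be merged into a single matrix $W_a$, with $a^{<t-1>}$ and $x^{<t>}$ stacked into a single vector $[a^{<t-1>}, x^{<t>}]$.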
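
As a quick check of the merged form, the sketch below computes the hidden state both ways and compares them. It assumes the usual cell $a^{<t>} = \tanh(W_{aa}a^{<t-1>} + W_{ax}x^{<t>} + b_a)$; the sizes and random values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x, m = 4, 3, 2          # hidden size, input size, batch size (arbitrary)

Waa = rng.standard_normal((n_a, n_a))
Wax = rng.standard_normal((n_a, n_x))
ba = rng.standard_normal((n_a, 1))
a_prev = rng.standard_normal((n_a, m))
xt = rng.standard_normal((n_x, m))

# Separate form: a<t> = tanh(Waa a<t-1> + Wax x<t> + ba)
a_next_separate = np.tanh(Waa @ a_prev + Wax @ xt + ba)

# Merged form: Wa = [Waa | Wax], applied to the vertical stack [a<t-1>; x<t>]
Wa = np.concatenate((Waa, Wax), axis=1)          # shape (n_a, n_a + n_x)
stacked = np.concatenate((a_prev, xt), axis=0)   # shape (n_a + n_x, m)
a_next_merged = np.tanh(Wa @ stacked + ba)

print(np.allclose(a_next_separate, a_next_merged))
```

The same stacking order (hidden state first, then input) is what the LSTM code on this page relies on when it slices `[:, :n_a]` and `[:, n_a:]`.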

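Backpropagation in an RNN

Define a loss function, then propagate the gradients backward through the time steps (backpropagation through time).

Different types of RNN

Different problems call for different input and output structures.

One to many: e.g. music generation, where a genre or an empty input produces a piece of music.
Many to one: e.g. sentiment classification, where a sequence goes in and a single score comes out.
Many to many ($T_x = T_y$): the input and output sequences have the same length.
Many to many ($T_x \neq T_y$): e.g. machine translation, where the network first reads the whole input and then generates the output, so the two lengths need not match.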

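Language models and sequence generation

What is a language model?

In the lecture's example, two sentences sound alike but differ in meaning and correctness. How can a speech recognition system output the right one? A language model does this: given a sentence, it estimates the probability of each word and, from those, the probability of the whole sentence.

Building a language model with an RNN:

Training set: a large corpus of text in the language;
Tokenize: map each sentence to tokens using a vocabulary; words that are not in the vocabulary are replaced by "UNK";
Step 1: predict from a zero-vector input, i.e. estimate the probability of each word being the first word;
Step 2: given the words so far, predict the probability of the next word, one step at a time.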

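Sampling novel sequences

Once a model is trained, an informal way to see what it has learned is to sample new sequences from it:

At each step the output $y^{<t>}$ is produced by a softmax; sample one value at random from that distribution, which corresponds to one character or one word.

Feed the sampled value in as the input of the next step (i.e. $x^{<t+1>}=y^{<t>}$), and stop once the end-of-sequence token is produced or the output exceeds a preset length $n$.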
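
The sampling loop can be sketched as follows. This is a minimal illustration with random, untrained weights, so the sampled indices are meaningless; `sample_sequence`, `eos_index` and `max_len` are names assumed here, and the cell is assumed to be a plain tanh RNN with a softmax output.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sample_sequence(Waa, Wax, Wya, ba, by, eos_index, max_len, rng):
    """Sample token indices until eos_index appears or max_len tokens are drawn."""
    vocab_size, n_a = Wya.shape
    x = np.zeros((vocab_size, 1))      # first input is a zero vector
    a = np.zeros((n_a, 1))             # initial hidden state a<0>
    indices = []
    while len(indices) < max_len:
        a = np.tanh(Waa @ a + Wax @ x + ba)
        probs = softmax(Wya @ a + by)
        idx = int(rng.choice(vocab_size, p=probs.ravel()))
        indices.append(idx)
        if idx == eos_index:
            break
        # The sampled token becomes the next one-hot input
        x = np.zeros((vocab_size, 1))
        x[idx, 0] = 1.0
    return indices

rng = np.random.default_rng(1)
vocab_size, n_a = 6, 5                 # arbitrary toy sizes
params = dict(
    Waa=rng.standard_normal((n_a, n_a)),
    Wax=rng.standard_normal((n_a, vocab_size)),
    Wya=rng.standard_normal((vocab_size, n_a)),
    ba=np.zeros((n_a, 1)),
    by=np.zeros((vocab_size, 1)),
)
sampled = sample_sequence(**params, eos_index=0, max_len=10, rng=rng)
print(sampled)
```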

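Vanishing gradients in RNNs

RNNs suffer from vanishing gradients. For example:

The cat, which already ate ..., was full;
The cats, which already ate ..., were full.

"cat" and "cats" determine "was" and "were" only after a long run of intervening words, but as the activation $a^{<t>}$ is passed along it struggles to retain content from that far back; in practice an output depends mostly on the few words just before it.

Gradients can also become very large at every step, which gives exploding gradients, but that problem can be handled by clipping gradients at a threshold; vanishing gradients are the harder problem. As with very deep feedforward networks, backpropagation has trouble influencing the parameters of the earliest layers (here, the earliest time steps).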
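
Clipping at a threshold can be done element-wise with `np.clip`. The helper `clip_gradients` and the numbers below are illustrative assumptions, not taken from the course code.

```python
import numpy as np

def clip_gradients(gradients, max_value):
    """Clip every gradient entry to the range [-max_value, max_value]."""
    return {name: np.clip(grad, -max_value, max_value)
            for name, grad in gradients.items()}

# Hypothetical gradients with a few exploding entries
grads = {
    "dWaa": np.array([[12.0, -0.5], [-30.0, 2.0]]),
    "dba": np.array([[7.0], [-7.0]]),
}
clipped = clip_gradients(grads, max_value=5.0)
print(clipped["dWaa"])
```

Entries already inside the range are left untouched, so clipping only affects the steps where gradients blow up.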

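GRU (Gated Recurrent Unit)

How can the vanishing-gradient problem be addressed? A GRU can capture much longer-range dependencies and so mitigates vanishing gradients.

The original RNN unit is shown in the figure:

The GRU adds a variable $c$ (the memory cell), which provides long-term memory.

The computation is shown in the original post's figure.

The full GRU has one more gate, the relevance gate $\Gamma_r$, which controls how strongly the candidate $\bar c^{<t>}$ depends on the previous $c^{<t-1>}$.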
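
For reference, the full GRU is usually written as below. This is a reconstruction from the standard formulation taught in the course, not a copy of the missing figure; the weight and bias names ($W_c, W_u, W_r, b_c, b_u, b_r$) and the relevance gate symbol $\Gamma_r$ are assumed here, $\bar c^{<t>}$ is the candidate memory value, and $*$ is element-wise multiplication.

$$\bar c^{<t>} = \tanh(W_c[\Gamma_r * c^{<t-1>}, x^{<t>}] + b_c)$$

$$\Gamma_u = \sigma(W_u[c^{<t-1>}, x^{<t>}] + b_u)$$

$$\Gamma_r = \sigma(W_r[c^{<t-1>}, x^{<t>}] + b_r)$$

$$c^{<t>} = \Gamma_u * \bar c^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$$

$$a^{<t>} = c^{<t>}$$

When $\Gamma_u$ is close to 0, $c^{<t>} \approx c^{<t-1>}$, so the memory cell is carried over many time steps almost unchanged, which is what helps against vanishing gradients.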

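LSTM (Long short-term memory)

The GRU lets us learn longer-range dependencies in a sequence; the long short-term memory (LSTM) unit is often even more effective than the GRU at capturing them.

The GRU has only two gates, whereas the LSTM has three: the update gate, the forget gate, and the output gate, $\Gamma_u,\Gamma_f, \Gamma_o$.

Update gate: decides how much of the candidate $\bar c^{<t>}$ is written into the memory cell.

Forget gate: decides how much of the previous $c^{<t-1>}$ is forgotten.

Output gate: decides how much of $c^{<t>}$ is output.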
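
Written out, the three gates act as follows. This is the standard LSTM formulation as taught in the course, reconstructed here rather than copied from the post; the weight and bias names are assumed, and $*$ is element-wise multiplication.

$$\bar c^{<t>} = \tanh(W_c[a^{<t-1>}, x^{<t>}] + b_c)$$

$$\Gamma_u = \sigma(W_u[a^{<t-1>}, x^{<t>}] + b_u)$$

$$\Gamma_f = \sigma(W_f[a^{<t-1>}, x^{<t>}] + b_f)$$

$$\Gamma_o = \sigma(W_o[a^{<t-1>}, x^{<t>}] + b_o)$$

$$c^{<t>} = \Gamma_u * \bar c^{<t>} + \Gamma_f * c^{<t-1>}$$

$$a^{<t>} = \Gamma_o * \tanh(c^{<t>})$$

In the assignment code on this page, `it`, `ft`, `ot` and `cct` appear to play the roles of $\Gamma_u$, $\Gamma_f$, $\Gamma_o$ and $\bar c^{<t>}$, with `Wi` as the update-gate weight matrix.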

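Bidirectional RNN

A bidirectional RNN (BRNN) lets the model use, at any point in the sequence, information from both earlier and later in the sequence.

In the unidirectional RNN example in the figure, whether the unit is a basic RNN, a GRU, or an LSTM, it is hard to tell whether the third word, "Teddy", is part of a person's name: the two preceding words are not enough and the words that follow are needed, but a unidirectional RNN has no access to future inputs.

A bidirectional RNN removes this limitation. Besides the forward layer running left to right, a BRNN has a backward layer running right to left.

Deep RNN

Like a deep feedforward network, a deep RNN stacks several recurrent layers. The difference is that a conventional network may have dozens or even hundreds of layers, whereas for an RNN three layers is already a lot: the time dimension makes the unrolled network large enough as it is. The original post illustrates this with a figure.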

DeepLearning.ai Assignment: (4-4) -- Special Applications: Face Recognition and Neural Style Transfer

2018-10-12T10:55:20.000Z

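This week's assignment has two parts:

Face recognition
Neural style transfer

Part 1: Face recognition

Training FaceNet ourselves is impractical, so the model is already trained; we only study the loss function and then call the model to do some simple recognition.

First implement the triplet_loss function, in 4 steps: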

# GRADED FUNCTION: triplet_loss

def triplet_loss(y_true, y_pred, alpha = 0.2):
    """
    Implementation of the triplet loss as defined by formula (3)
    
    Arguments:
    y_true -- true labels, required when you define a loss in Keras, you don't need it in this function.
    y_pred -- python list containing three objects:
            anchor -- the encodings for the anchor images, of shape (None, 128)
            positive -- the encodings for the positive images, of shape (None, 128)
            negative -- the encodings for the negative images, of shape (None, 128)
    
    Returns:
    loss -- real number, value of the loss
    """
    
    anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]
    
    ### START CODE HERE ### (≈ 4 lines)
    # Step 1: Compute the (encoding) distance between the anchor and the positive, you will need to sum over axis=-1
    pos_dist = tf.reduce_sum(tf.square(anchor - positive),axis=-1)
    # Step 2: Compute the (encoding) distance between the anchor and the negative, you will need to sum over axis=-1
    neg_dist = tf.reduce_sum(tf.square(anchor - negative),axis=-1)
    # Step 3: subtract the two previous distances and add alpha.
    basic_loss = tf.add(tf.subtract(pos_dist,neg_dist), alpha)

    # Step 4: Take the maximum of basic_loss and 0.0. Sum over the training examples.
    loss = tf.reduce_sum(tf.maximum(basic_loss, 0.))
    ### END CODE HERE ###
    
    return loss

Verify a single face:

# GRADED FUNCTION: verify

def verify(image_path, identity, database, model):
    """
    Function that verifies if the person on the "image_path" image is "identity".
    
    Arguments:
    image_path -- path to an image
    identity -- string, name of the person you'd like to verify the identity. Has to be a resident of the Happy house.
    database -- python dictionary mapping names of allowed people's names (strings) to their encodings (vectors).
    model -- your Inception model instance in Keras
    
    Returns:
    dist -- distance between the image_path and the image of "identity" in the database.
    door_open -- True, if the door should open. False otherwise.
    """
    
    ### START CODE HERE ###
    
    # Step 1: Compute the encoding for the image. Use img_to_encoding() see example above. (≈ 1 line)
    encoding = img_to_encoding(image_path,model)
    
    # Step 2: Compute distance with identity's image (≈ 1 line)
    dist = np.linalg.norm(encoding-database[identity])
    
    # Step 3: Open the door if dist < 0.7, else don't open (≈ 3 lines)
    if dist < 0.7:
        print("It's " + str(identity) + ", welcome home!")
        door_open = True
    else:
        print("It's not " + str(identity) + ", please go away")
        door_open = False
        
    ### END CODE HERE ###
        
    return dist, door_open

Run face recognition:

# GRADED FUNCTION: who_is_it

def who_is_it(image_path, database, model):
    """
    Implements face recognition for the happy house by finding who is the person on the image_path image.
    
    Arguments:
    image_path -- path to an image
    database -- database containing image encodings along with the name of the person on the image
    model -- your Inception model instance in Keras
    
    Returns:
    min_dist -- the minimum distance between image_path encoding and the encodings from the database
    identity -- string, the name prediction for the person on image_path
    """
    
    ### START CODE HERE ### 
    
    ## Step 1: Compute the target "encoding" for the image. Use img_to_encoding() see example above. ## (≈ 1 line)
    encoding = img_to_encoding(image_path,model)
    
    ## Step 2: Find the closest encoding ##
    
    # Initialize "min_dist" to a large value, say 100 (≈1 line)
    min_dist = 100
    
    # Loop over the database dictionary's names and encodings.
    for (name, db_enc) in database.items():
        
        # Compute L2 distance between the target "encoding" and the current "emb" from the database. (≈ 1 line)
        dist = np.linalg.norm(encoding - db_enc)

        # If this distance is less than the min_dist, then set min_dist to dist, and identity to name. (≈ 3 lines)
        if dist < min_dist:
            min_dist = dist
            identity = name

    ### END CODE HERE ###
    
    if min_dist > 0.7:
        print("Not in the database.")
    else:
        print ("it's " + str(identity) + ", the distance is " + str(min_dist))
        
    return min_dist, identity

Part 2: Neural style transfer

The model is again pretrained; it is a VGG-19 network. Here we only get a feel for implementing the cost function.

Computing J_content(C,G)

$$J_{content}(C,G) = \frac{1}{4 \times n_H \times n_W \times n_C}\sum _{ \text{all entries}} (a^{(C)} - a^{(G)})^2 $$

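In the code, the 3D activation volume is first unrolled into a 2D matrix (the content cost can be computed without unrolling, but the style cost needs the unrolled form).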


def compute_content_cost(a_C, a_G):
    """
    Computes the content cost
    
    Arguments:
    a_C -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image C 
    a_G -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image G
    
    Returns: 
    J_content -- scalar that you compute using equation 1 above.
    """
    
    ### START CODE HERE ###
    # Retrieve dimensions from a_G (≈1 line)
    m, n_H, n_W, n_C = a_G.get_shape().as_list()
    
    # Reshape a_C and a_G (≈2 lines)
    a_C_unrolled = tf.reshape(a_C,[n_H * n_W, n_C])
    a_G_unrolled = tf.reshape(a_G,[n_H * n_W, n_C])
    
    # compute the cost with tensorflow (≈1 line)
    J_content = tf.reduce_sum(tf.square(a_C_unrolled - a_G_unrolled)) / (n_H * n_W * n_C * 4)
    ### END CODE HERE ###
    
    return J_content

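Computing J_style(S,G)

Unroll the 3D volume, transpose it, and multiply it by its own transpose to obtain the Gram matrix of channel correlations.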

# GRADED FUNCTION: gram_matrix

def gram_matrix(A):
    """
    Argument:
    A -- matrix of shape (n_C, n_H*n_W)
    
    Returns:
    GA -- Gram matrix of A, of shape (n_C, n_C)
    """
    
    ### START CODE HERE ### (≈1 line)
    GA = tf.matmul(A,tf.transpose(A))
    ### END CODE HERE ###
    
    return GA

$$J_{style}^{[l]}(S,G) = \frac{1}{4 \times n_{C}^{2} \times (n_H \times n_W)^2} \sum_{i=1}^{n_C} \sum_{j=1}^{n_C} (G^{(S)}_{ij} - G^{(G)} _ {ij})^{2} $$

# GRADED FUNCTION: compute_layer_style_cost

def compute_layer_style_cost(a_S, a_G):
    """
    Arguments:
    a_S -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image S 
    a_G -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image G
    
    Returns: 
    J_style_layer -- tensor representing a scalar value, style cost defined above by equation (2)
    """
    
    ### START CODE HERE ###
    # Retrieve dimensions from a_G (≈1 line)
    m, n_H, n_W, n_C = a_G.get_shape().as_list()
    
    # Reshape the images to have them of shape (n_C, n_H*n_W) (≈2 lines)
    a_S = tf.transpose(tf.reshape(a_S,[n_H*n_W, n_C]))
    a_G = tf.transpose(tf.reshape(a_G,[n_H*n_W, n_C]))

    # Computing gram_matrices for both images S and G (≈2 lines)
    GS = gram_matrix(a_S)
    GG = gram_matrix(a_G)

    # Computing the loss (≈1 line)
    J_style_layer = 1 / (4 * (n_C*n_W*n_H)**2) * tf.reduce_sum(tf.square(tf.subtract(GS,GG)))
    
    ### END CODE HERE ###
    
    return J_style_layer

# GRADED FUNCTION: total_cost

def total_cost(J_content, J_style, alpha = 10, beta = 40):
    """
    Computes the total cost function
    
    Arguments:
    J_content -- content cost coded above
    J_style -- style cost coded above
    alpha -- hyperparameter weighting the importance of the content cost
    beta -- hyperparameter weighting the importance of the style cost
    
    Returns:
    J -- total cost as defined by the formula above.
    """
    
    ### START CODE HERE ### (≈1 line)
    J = alpha * J_content + beta * J_style
    ### END CODE HERE ###
    
    return J

### START CODE HERE ### (1 line)
J = total_cost(J_content, J_style, alpha = 10, beta = 40)
### END CODE HERE ###

def model_nn(sess, input_image, num_iterations = 200):
    
    # Initialize global variables (you need to run the session on the initializer)
    ### START CODE HERE ### (1 line)
    sess.run(tf.global_variables_initializer())
    ### END CODE HERE ###
    
    # Run the noisy input image (initial generated image) through the model. Use assign().
    ### START CODE HERE ### (1 line)
    generated_image = sess.run(model['input'].assign(input_image))
    ### END CODE HERE ###
    
    for i in range(num_iterations):
    
        # Run the session on the train_step to minimize the total cost
        ### START CODE HERE ### (1 line)
        sess.run(train_step)
        ### END CODE HERE ###
        
        # Compute the generated image by running the session on the current model['input']
        ### START CODE HERE ### (1 line)
        generated_image = sess.run(model['input'])
        ### END CODE HERE ###

        # Print every 20 iteration.
        if i%20 == 0:
            Jt, Jc, Js = sess.run([J, J_content, J_style])
            print("Iteration " + str(i) + " :")
            print("total cost = " + str(Jt))
            print("content cost = " + str(Jc))
            print("style cost = " + str(Js))
            
            # save current generated image in the "/output" directory
            save_image("output/" + str(i) + ".png", generated_image)
    
    # save last generated image
    save_image('output/generated_image.jpg', generated_image)
    
    return generated_image

DeepLearning.ai Notes: (4-4) -- Special Applications: Face Recognition and Neural Style Transfer

2018-10-12T10:55:15.000Z

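This week covers two special applications of CNNs: face recognition and neural style transfer.

Face recognition

Face Verification and Face Recognition

Face recognition and face verification are not the same thing.

Face verification takes an image together with a person's ID or name and decides whether the image shows that person: a 1-to-1 problem.

Face recognition has a database of K people; given a face image of unknown identity, it outputs which of the K people it is: a 1-to-K problem.

Face recognition is therefore harder and needs higher accuracy: if each one-to-one comparison is 99% accurate, the chance of making at least one mistake grows roughly K-fold when comparing against K people, so the underlying verification accuracy should be 99.9% or higher.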
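
A back-of-the-envelope calculation shows how quickly errors compound. It assumes the K comparisons are independent, which is a simplification, and K = 100 is a made-up database size.

```python
per_comparison_accuracy = 0.99
K = 100  # hypothetical database size

# Probability that all K independent comparisons are correct
p_all_correct_99 = per_comparison_accuracy ** K
p_all_correct_999 = 0.999 ** K

print(round(p_all_correct_99, 3))    # 0.366
print(round(p_all_correct_999, 3))   # 0.905
```

Under these assumptions a 99% verifier gets all 100 comparisons right only about a third of the time, while a 99.9% verifier does so about 90% of the time.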

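One shot learning

A face recognition system usually has only a single example image of each person and still has to recognize that person. This is one-shot learning: recognizing someone from just one photo.

With a single sample per person, the earlier approach of training a classifier does not work. What is needed is a similarity function.

Similarity function:

$d(img1,img2)$ expresses the degree of difference between two images: if $d$ is above some threshold, the images are considered to show different people; if it is below the threshold, they are considered to show the same person.

Siamese network

How is $d(img1,img2)$ computed?

With a Siamese network.

As shown in the figure, two images $x^{(1)},x^{(2)}$ are each fed through the same convolutional network with the final softmax layer removed, giving N-dimensional vectors $f(x^{(1)}),f(x^{(2)})$ (say N = 128). Each vector acts as an encoding of its input image.

Then compare the two vectors:

$$d(x^{(1)},x^{(2)}) = ||f(x^{(1)}) - f(x^{(2)})||^{2}_{2}$$

If the distance $d$ is small, the two images are similar and are taken to show the same person.

If the distance $d$ is large, the two images differ a lot and are taken to show different people.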
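
The distance itself is a one-liner once the encodings exist. In the sketch below the 128-dimensional encodings are random stand-ins rather than outputs of a real network, and `encoding_distance` is a name chosen for this example.

```python
import numpy as np

def encoding_distance(f1, f2):
    """Squared L2 distance between two encodings."""
    return float(np.sum(np.square(f1 - f2)))

rng = np.random.default_rng(0)
f_a = rng.standard_normal(128)                   # stand-in encoding of image A
f_same = f_a + 0.01 * rng.standard_normal(128)   # slightly perturbed copy ("same person")
f_other = rng.standard_normal(128)               # unrelated encoding ("different person")

d_same = encoding_distance(f_a, f_same)
d_other = encoding_distance(f_a, f_other)
print(d_same < d_other)
```

Note that the `verify` function in the assignment post compares `np.linalg.norm(...)`, the unsquared distance, against its threshold, so thresholds are not interchangeable between the two forms.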

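Triplet loss

As noted above, the encoding $f(x)$ comes from a convolutional network, so how are the network's parameters determined? By minimizing the triplet loss with gradient descent.

Each training example is a triplet of images (Anchor, Positive, Negative), abbreviated (A,P,N):

Anchor: the reference image

Positive: an image of the same person as the anchor

Negative: an image of a different person from the anchor

We want the distance between A and P to be small and the distance between A and N to be large, which gives the inequality:

$$||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + \alpha \leq 0$$

This resembles the SVM loss: $\alpha$ is the margin, which enforces a gap between $d(A,P)$ and $d(A,N)$.

If the left-hand side is at most 0, the triplet already satisfies the requirement; if it is greater than 0, it contributes to the loss. The triplet loss is therefore:

$$L(A,P,N) = \max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + \alpha,0)$$

The cost for the whole network sums the loss over all triplets:

$$J = \sum L(A,P,N)$$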
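
A small worked example makes the role of the margin concrete. This is a NumPy sketch of the formula above with hand-picked 2-dimensional encodings; `triplet_loss_np` is a name chosen here and is separate from the TensorFlow version in the assignment.

```python
import numpy as np

def triplet_loss_np(anchor, positive, negative, alpha=0.2):
    """Sum over triplets of max(||A-P||^2 - ||A-N||^2 + alpha, 0)."""
    pos_dist = np.sum(np.square(anchor - positive), axis=-1)
    neg_dist = np.sum(np.square(anchor - negative), axis=-1)
    return float(np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0)))

# Two hand-picked triplets (one per row)
anchor   = np.array([[0.0, 0.0], [0.0, 0.0]])
positive = np.array([[0.1, 0.0], [0.5, 0.0]])
negative = np.array([[1.0, 0.0], [0.6, 0.0]])

loss = triplet_loss_np(anchor, positive, negative)
print(loss)
```

The first triplet gives 0.01 - 1.0 + 0.2 < 0 and contributes nothing. The second gives 0.25 - 0.36 + 0.2 = 0.09: the negative is farther away than the positive, but not by the full margin, so it is still penalized.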

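Choosing the triplets

Triplets need to be chosen with care. If the anchor is a woman and the Negative is an old man, the constraint is satisfied trivially and the network learns little. Training triplets should therefore be "hard", with a negative that looks similar to the anchor, i.e.

$$d(A,P) \approx d(A,N) $$

The examples in the original post's figure show that every triplet compares fairly similar-looking faces.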

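Face verification and binary classification

Besides training with the triplet loss, a Siamese network can also be trained as a binary classifier.

As in the figure, feed in two images, compute their encodings, take the difference between the encodings, and pass it through a sigmoid unit to get a binary classification: output 1 if the two images show the same person and 0 otherwise. The weights $W,b$ of that unit are learned in training.

Face recognition then becomes a supervised learning problem: every training pair of images needs a corresponding label y.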
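
One common way to write that sigmoid unit is $\hat y = \sigma(\sum_k w_k |f(x^{(i)})_k - f(x^{(j)})_k| + b)$. The sketch below uses that form with hand-set, untrained parameters and 4-dimensional toy encodings, so the numbers only illustrate the mechanics.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def same_person_probability(f1, f2, w, b):
    """sigmoid(sum_k w_k * |f1_k - f2_k| + b) for two encodings."""
    return float(sigmoid(np.dot(w, np.abs(f1 - f2)) + b))

# Hypothetical parameters: negative weights so that larger differences lower the score
w = -np.ones(4)
b = 2.0

f_a = np.array([0.1, 0.4, -0.3, 0.8])
f_same = f_a + 0.05      # nearly identical encoding
f_other = f_a + 1.5      # very different encoding

p_same = same_person_probability(f_a, f_same, w, b)
p_other = same_person_probability(f_a, f_other, w, b)
print(p_same > 0.5, p_other < 0.5)
```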

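Neural style transfer

Neural style transfer takes two images, a content image and a style image, and outputs an image G that has the content of the first and the style of the second.

What do convolutional networks learn?

Before doing style transfer it helps to understand what the network is actually learning, by inspecting the hidden units in its intermediate layers.

Given a convolutional network (shown in the figure above), how do we see what the hidden units in different layers compute? For each layer in turn:

Pick one hidden unit in the layer;
Go through the training set and find the images or image patches that maximize that unit's activation;
Repeat for the other units in the layer.

A hidden unit in the first layer sees only a small part of the input image, so the patches that maximally activate first-layer units are small. First-layer units typically look for simple features such as edges or shades of color.

As depth increases, hidden units see larger regions and respond to increasingly complex patterns.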

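Cost function

For neural style transfer the goal is to generate the stylized image G from the content image C and the style image S. Define the cost function as:

$$J(G) = \alpha J_{content}(C,G) + \beta J_{style}(S,G)$$

$J_{content}(C, G)$: measures how much the content of the generated image G differs from the content of the content image C
$J_{style}(S, G)$: measures how much the style of the generated image G differs from the style of the style image S
$\alpha, \beta$: two hyperparameters that weight the two terms

First initialize the pixels of G randomly, then run gradient descent on $J(G)$ with respect to the pixels of G.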

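Content cost function

Suppose layer $l$ is used to compute $J_{content}(C, G)$. $l$ is usually a middle layer, neither very shallow nor very deep: with a very shallow layer the cost forces G to match the content image almost pixel by pixel, while a very deep layer only captures high-level content (for example, that there is a dog somewhere in the image).
Use a pretrained convolutional network (e.g. VGG or another).
$a^{[l] (C)}$ and $a^{[l] (G)}$ are the layer-$l$ activations of the content image C and the generated image G;
Content cost: $J_{content}(C,G) = \frac{1}{2}||a^{[l] (C)} - a^{[l] (G)}||^2$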

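Style cost function

Choose a middle layer $l$ of the network and define "style" as the correlation between the activations of the different channels in layer $l$.

How is this correlation computed?

Suppose layer $l$ has 5 channels:

Different channels correspond to features learned by different neurons; for instance the first channel (red) might respond to vertical texture and the second channel (yellow) to orange-tinted regions.

The correlation between these two channels then indicates how likely vertical texture and orange tint are to appear together in the image.

This gives the matrix of correlations, the "Gram matrix" (its formula is given in the original post's figure).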
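
Reconstructed from the verbal description in this section (multiply two channels position by position and sum), the Gram matrix entry for channels $k$ and $k^{\prime}$ of layer $l$ can be written as below; the exact index notation is an assumption, since the original figure is not available.

$$G^{[l]}_{kk^{\prime}} = \sum_{i=1}^{n_H^{[l]}} \sum_{j=1}^{n_W^{[l]}} a^{[l]}_{ijk} \, a^{[l]}_{ijk^{\prime}}$$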

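$i,j,k$ index the height, width, and channel of a unit. For each pair of channels $k,k^{\prime}$, multiply the activations of channel $k$ by the activations of channel $k^{\prime}$ at the same positions and sum over all positions to get their correlation. The matrix has dimensions $(n_{c}^{[l]},n_{c}^{[l]})$: the number of channels in layer $l$ by the number of channels.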

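The style cost is then the sum of squared differences between the Gram matrices of the two images, scaled by a normalization constant.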
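
The Gram matrix and the per-layer style cost can be sketched in NumPy as follows. The function names and the random activations are assumptions made for this example; the normalization $\frac{1}{4 n_C^2 (n_H n_W)^2}$ follows the formula used in the assignment post on this page.

```python
import numpy as np

def gram_matrix_np(a):
    """Gram matrix of activations with shape (n_H, n_W, n_C); result is (n_C, n_C)."""
    n_H, n_W, n_C = a.shape
    unrolled = a.reshape(n_H * n_W, n_C).T      # shape (n_C, n_H * n_W)
    return unrolled @ unrolled.T

def layer_style_cost_np(a_S, a_G):
    """Normalized squared difference between the two Gram matrices."""
    n_H, n_W, n_C = a_S.shape
    GS, GG = gram_matrix_np(a_S), gram_matrix_np(a_G)
    return float(np.sum((GS - GG) ** 2) / (4 * n_C ** 2 * (n_H * n_W) ** 2))

rng = np.random.default_rng(0)
a_S = rng.standard_normal((4, 3, 2))   # stand-in style activations
a_G = rng.standard_normal((4, 3, 2))   # stand-in generated-image activations

G_S = gram_matrix_np(a_S)
cost = layer_style_cost_np(a_S, a_G)
print(G_S.shape, cost >= 0.0)
```

Entry `G_S[0, 1]` equals `np.sum(a_S[:, :, 0] * a_S[:, :, 1])`, which is the position-by-position product and sum described above.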

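1D and 3D convolution

Images use 2D convolution, but the same operation generalizes to 1D and 3D data.

A typical 1D case is signal processing.

A typical 3D case is a CT scan, which is a stack of 2D slices.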
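
For the 1D case, a minimal NumPy sketch looks like this. The signal, the filter values and their lengths (14 and 5) are toy choices; with no padding and stride 1 the output length is 14 - 5 + 1 = 10.

```python
import numpy as np

signal = np.arange(14, dtype=float)            # toy 1D input of length 14
kernel = np.array([1.0, 0.0, -1.0, 0.0, 1.0])  # toy filter of length 5

# Deep-learning "convolution" is cross-correlation (no kernel flip),
# so np.correlate is used here rather than np.convolve.
feature_map = np.correlate(signal, kernel, mode="valid")
print(feature_map.shape)   # (10,)
```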

DeepLearning.ai Assignment: (4-3) -- Object Detection

2018-10-11T12:15:58.000Z

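This week's assignment implements the YOLO algorithm and applies it to object detection for autonomous driving.

Model details

Input: (m, 608, 608, 3)

Output: (m, 19, 19, 5, 85)

IMAGE (m, 608, 608, 3) -> DEEP CNN -> ENCODING (m, 19, 19, 5, 85)

That is, each of the 19 x 19 grid cells has 5 anchor boxes and there are 80 classes; each box is described by 85 numbers: the box confidence, 4 box coordinates, and 80 class probabilities.

The score of a box for each class is therefore the box confidence multiplied by the probability of that class.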
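
For a single box this is just a scalar times a vector. The confidence and class probabilities below are made-up numbers, and 0.6 is the default threshold used by `yolo_filter_boxes` in this post.

```python
import numpy as np

p_c = 0.60                       # hypothetical box confidence
class_probs = np.zeros(80)       # hypothetical class probabilities for one box
class_probs[[0, 2, 5]] = [0.07, 0.73, 0.20]

box_scores = p_c * class_probs               # one score per class
best_class = int(np.argmax(box_scores))      # index of the most likely class
best_score = float(box_scores[best_class])   # 0.6 * 0.73
keep = best_score >= 0.6                     # compare with the filtering threshold

print(best_class, round(best_score, 3), keep)
```

Even though class 2 is clearly the best guess for this box, its score of about 0.438 falls below the threshold, so the box would be filtered out.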

Filtering with a threshold on class scores

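Now write a function that, for each box, finds the class with the highest score, keeps that score and the box position, and discards the boxes whose best score is below the threshold.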

# GRADED FUNCTION: yolo_filter_boxes

def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold = .6):
    """Filters YOLO boxes by thresholding on object and class confidence.
    
    Arguments:
    box_confidence -- tensor of shape (19, 19, 5, 1)
    boxes -- tensor of shape (19, 19, 5, 4)
    box_class_probs -- tensor of shape (19, 19, 5, 80)
    threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box
    
    Returns:
    scores -- tensor of shape (None,), containing the class probability score for selected boxes
    boxes -- tensor of shape (None, 4), containing (b_x, b_y, b_h, b_w) coordinates of selected boxes
    classes -- tensor of shape (None,), containing the index of the class detected by the selected boxes
    
    Note: "None" is here because you don't know the exact number of selected boxes, as it depends on the threshold. 
    For example, the actual output size of scores would be (10,) if there are 10 boxes.
    """
    
    # Step 1: Compute box scores
    ### START CODE HERE ### (≈ 1 line)
    box_scores = box_confidence * box_class_probs
    ### END CODE HERE ###
    
    # Step 2: Find the box_classes thanks to the max box_scores, keep track of the corresponding score
    ### START CODE HERE ### (≈ 2 lines)
    box_classes = K.argmax(box_scores, axis=-1)    # class index of each box, shape (19, 19, 5)
    box_class_scores = K.max(box_scores, axis=-1, keepdims=False)  # score of that class for each box, shape (19, 19, 5)
    ### END CODE HERE ###
    
    # Step 3: Create a filtering mask based on "box_class_scores" by using "threshold". The mask should have the
    # same dimension as box_class_scores, and be True for the boxes you want to keep (with probability >= threshold)
    ### START CODE HERE ### (≈ 1 line)
    filtering_mask = box_class_scores >= threshold
    ### END CODE HERE ###
    
    # Step 4: Apply the mask to scores, boxes and classes
    ### START CODE HERE ### (≈ 3 lines)
    scores = tf.boolean_mask(box_class_scores, filtering_mask)
    boxes = tf.boolean_mask(boxes, filtering_mask)
    classes = tf.boolean_mask(box_classes, filtering_mask)
    ### END CODE HERE ###
    
    return scores, boxes, classes

Non-max suppression

After these boxes are found they still need to be filtered further. First implement IoU:

# GRADED FUNCTION: iou

def iou(box1, box2):
    """Implement the intersection over union (IoU) between box1 and box2
    
    Arguments:
    box1 -- first box, list object with coordinates (x1, y1, x2, y2)
    box2 -- second box, list object with coordinates (x1, y1, x2, y2)
    """

    # Calculate the (x1, y1, x2, y2) coordinates of the intersection of box1 and box2. Calculate its Area.
    ### START CODE HERE ### (≈ 5 lines)
    xi1 = np.maximum(box1[0], box2[0])
    yi1 = np.maximum(box1[1], box2[1])
    xi2 = np.minimum(box1[2], box2[2])
    yi2 = np.minimum(box1[3], box2[3])
    inter_area = max(xi2 - xi1,0) * max(yi2 - yi1,0)
    ### END CODE HERE ###    

    # Calculate the Union area by using Formula: Union(A,B) = A + B - Inter(A,B)
    ### START CODE HERE ### (≈ 3 lines)
    box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union_area = box1_area + box2_area - inter_area
    ### END CODE HERE ###
    
    # compute the IoU
    ### START CODE HERE ### (≈ 1 line)
    iou = inter_area / union_area
    ### END CODE HERE ###
    
    return iou

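TensorFlow already provides non-max suppression (tf.image.non_max_suppression), so the iou function written above is not needed here:

The idea is to keep the highest-scoring box and remove the other boxes that overlap it with a high IoU.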
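
To make the idea concrete, here is a plain NumPy sketch of greedy non-max suppression. It is an illustration of the technique, not the TensorFlow implementation; the function names and the three toy boxes are assumptions, and boxes are taken to be (x1, y1, x2, y2) corners.

```python
import numpy as np

def iou_one_to_many(box, others):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    xi1 = np.maximum(box[0], others[:, 0])
    yi1 = np.maximum(box[1], others[:, 1])
    xi2 = np.minimum(box[2], others[:, 2])
    yi2 = np.minimum(box[3], others[:, 3])
    inter = np.maximum(xi2 - xi1, 0) * np.maximum(yi2 - yi1, 0)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, max_boxes=10, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = np.argsort(scores)[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0 and len(keep) < max_boxes:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Drop every remaining box that overlaps the kept box too much
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

boxes = np.array([
    [0.0, 0.0, 2.0, 2.0],    # best box
    [0.1, 0.1, 2.1, 2.1],    # near-duplicate of the first (IoU about 0.82)
    [3.0, 3.0, 5.0, 5.0],    # separate object
])
scores = np.array([0.9, 0.8, 0.7])

kept = greedy_nms(boxes, scores)
print(kept)   # [0, 2]
```

The near-duplicate is suppressed because it overlaps the best box heavily, while the third box survives because it does not overlap at all.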

# GRADED FUNCTION: yolo_non_max_suppression

def yolo_non_max_suppression(scores, boxes, classes, max_boxes = 10, iou_threshold = 0.5):
    """
    Applies Non-max suppression (NMS) to set of boxes
    
    Arguments:
    scores -- tensor of shape (None,), output of yolo_filter_boxes()
    boxes -- tensor of shape (None, 4), output of yolo_filter_boxes() that have been scaled to the image size (see later)
    classes -- tensor of shape (None,), output of yolo_filter_boxes()
    max_boxes -- integer, maximum number of predicted boxes you'd like
    iou_threshold -- real value, "intersection over union" threshold used for NMS filtering
    
    Returns:
    scores -- tensor of shape (, None), predicted score for each box
    boxes -- tensor of shape (4, None), predicted box coordinates
    classes -- tensor of shape (, None), predicted class for each box
    
    Note: The "None" dimension of the output tensors has obviously to be less than max_boxes. Note also that this
    function will transpose the shapes of scores, boxes, classes. This is made for convenience.
    """
    
    max_boxes_tensor = K.variable(max_boxes, dtype='int32')     # tensor to be used in tf.image.non_max_suppression()
    K.get_session().run(tf.variables_initializer([max_boxes_tensor])) # initialize variable max_boxes_tensor
    
    # Use tf.image.non_max_suppression() to get the list of indices corresponding to boxes you keep
    ### START CODE HERE ### (≈ 1 line)
    nms_indices = tf.image.non_max_suppression(boxes,scores,max_boxes,iou_threshold)
    ### END CODE HERE ###
    
    # Use K.gather() to select only nms_indices from scores, boxes and classes
    ### START CODE HERE ### (≈ 3 lines)
    scores = K.gather(scores,nms_indices)
    boxes = K.gather(boxes,nms_indices)
    classes = K.gather(classes,nms_indices)
    ### END CODE HERE ###
    
    return scores, boxes, classes

Then combine the functions above: first drop the low-scoring boxes, then run NMS.

# GRADED FUNCTION: yolo_eval

def yolo_eval(yolo_outputs, image_shape = (720., 1280.), max_boxes=10, score_threshold=.6, iou_threshold=.5):
    """
    Converts the output of YOLO encoding (a lot of boxes) to your predicted boxes along with their scores, box coordinates and classes.
    
    Arguments:
    yolo_outputs -- output of the encoding model (for image_shape of (608, 608, 3)), contains 4 tensors:
                    box_confidence: tensor of shape (None, 19, 19, 5, 1)
                    box_xy: tensor of shape (None, 19, 19, 5, 2)
                    box_wh: tensor of shape (None, 19, 19, 5, 2)
                    box_class_probs: tensor of shape (None, 19, 19, 5, 80)
    image_shape -- tensor of shape (2,) containing the input shape, in this notebook we use (608., 608.) (has to be float32 dtype)
    max_boxes -- integer, maximum number of predicted boxes you'd like
    score_threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box
    iou_threshold -- real value, "intersection over union" threshold used for NMS filtering
    
    Returns:
    scores -- tensor of shape (None, ), predicted score for each box
    boxes -- tensor of shape (None, 4), predicted box coordinates
    classes -- tensor of shape (None,), predicted class for each box
    """
    
    ### START CODE HERE ### 
    
    # Retrieve outputs of the YOLO model (≈1 line)
    box_confidence, box_xy, box_wh, box_class_probs = yolo_outputs

    # Convert boxes to be ready for filtering functions 
    boxes = yolo_boxes_to_corners(box_xy, box_wh)

    # Use one of the functions you've implemented to perform Score-filtering with a threshold of score_threshold (≈1 line)
    scores, boxes, classes = yolo_filter_boxes(box_confidence, boxes, box_class_probs, score_threshold)
    
    # Scale boxes back to original image shape.
    boxes = scale_boxes(boxes, image_shape)

    # Use one of the functions you've implemented to perform Non-max suppression with a threshold of iou_threshold (≈1 line)
    scores, boxes, classes = yolo_non_max_suppression(scores, boxes, classes, max_boxes, iou_threshold )
    
    ### END CODE HERE ###
    
    return scores, boxes, classes

Run the prediction:

def predict(sess, image_file):
    """
    Runs the graph stored in "sess" to predict boxes for "image_file". Prints and plots the predictions.
    
    Arguments:
    sess -- your tensorflow/Keras session containing the YOLO graph
    image_file -- name of an image stored in the "images" folder.
    
    Returns:
    out_scores -- tensor of shape (None, ), scores of the predicted boxes
    out_boxes -- tensor of shape (None, 4), coordinates of the predicted boxes
    out_classes -- tensor of shape (None, ), class index of the predicted boxes
    
    Note: "None" actually represents the number of predicted boxes, it varies between 0 and max_boxes. 
    """

    # Preprocess your image
    image, image_data = preprocess_image("images/" + image_file, model_image_size = (608, 608))

    # Run the session with the correct tensors and choose the correct placeholders in the feed_dict.
    # You'll need to use feed_dict={yolo_model.input: ... , K.learning_phase(): 0})
    ### START CODE HERE ### (≈ 1 line)
    out_scores, out_boxes, out_classes = sess.run([scores, boxes, classes], feed_dict = {yolo_model.input:image_data, K.learning_phase(): 0})
    ### END CODE HERE ###

    # Print predictions info
    print('Found {} boxes for {}'.format(len(out_boxes), image_file))
    # Generate colors for drawing bounding boxes.
    colors = generate_colors(class_names)
    # Draw bounding boxes on the image file
    draw_boxes(image, out_scores, out_boxes, out_classes, class_names, colors)
    # Save the predicted bounding box on the image
    image.save(os.path.join("out", image_file), quality=90)
    # Display the results in the notebook
    output_image = scipy.misc.imread(os.path.join("out", image_file))
    imshow(output_image)
    
    return out_scores, out_boxes, out_classes

DeepLearning.ai Notes: (4-3) -- Object detection

2018-10-11T09:02:09.000Z

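This week covers a further application of convolutional neural networks: object detection.

Topics: object localization, landmark detection, object detection, sliding windows, Bounding Box, IoU, NMS, Anchor Boxes, and the YOLO algorithm.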

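Object localization

Before tackling object detection, we first need to understand object localization.

Detection has to predict where the object actually is, so the training data must be labeled with the object's true position.

Suppose we classify three classes: pedestrian, car, and bicycle; if none is present, the image is background. The label y then has 8 components:

$P_c$: whether an object is present
$b_x,b_y,b_h,b_w$: the object's position (x, y) and its height and width (h, w)
$c_1,c_2,c_3$: pedestrian, car, bicycle

If $P_c = 0$, there is no object, and we don't care about the remaining components.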

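Landmark detection

To detect a face, we can annotate a number of landmarks on it. With 64 landmarks there are 128 coordinates (an x and a y for each), plus one output for whether a face is present, for 129 outputs in total.

To detect human pose, we can likewise annotate landmarks on the limbs.

Note that all of these landmarks have to be labeled by hand.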

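Object detection

Sliding windows

Object detection is commonly done with sliding windows: take a window of a given size, slide it across the whole image with a given stride, then repeat with a larger window, and so on. Each window is effectively a small image that is passed to a classifier to get a prediction.

The obvious problem is that every window needs its own classifier pass, which is far too slow. This motivates the convolutional implementation of sliding windows. Before that, we need to see how to turn fully connected layers into convolutional layers.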

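Turning fully connected layers into convolutional layers

As the figure shows, after max pooling we have a $5 \times 5 \times 16$ volume, which the first FC layer maps to 400 units.

Instead, we can convolve with 400 filters of size $5\times5$ (each spanning all 16 channels), which yields a $1\times1\times400$ volume.

Applying 400 filters of size $1\times1$ then yields another $1\times1\times400$ volume, so the fully connected layers have been converted into convolutional layers.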
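The equivalence between the first FC layer and a $5\times5$ convolution can be checked numerically. The following is a minimal NumPy sketch, not the course's code: the random weights and the names `W_fc` and `W_conv` are placeholders chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pooled feature map from the example: height 5, width 5, 16 channels.
x = rng.standard_normal((5, 5, 16))

# A fully connected layer with 400 units acting on the flattened volume.
W_fc = rng.standard_normal((400, 5 * 5 * 16))
b = rng.standard_normal(400)
fc_out = W_fc @ x.reshape(-1) + b                    # shape (400,)

# The same weights viewed as 400 filters of shape 5 x 5 x 16.
W_conv = W_fc.reshape(400, 5, 5, 16)

# A "valid" convolution of a 5x5 input with a 5x5 filter has exactly one
# output position, so each filter reduces to a single dot product.
conv_out = np.einsum('fhwc,hwc->f', W_conv, x) + b   # the 1 x 1 x 400 volume

print(np.allclose(fc_out, conv_out))
```

On a larger input the same filters would slide over several positions, which is what makes the convolutional form useful for sliding windows.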

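Convolutional implementation of sliding windows

The earlier sliding-window approach runs a separate computation for every window, which is too slow. Using the FC-to-conv conversion above, the results for all windows can be computed in a single pass.

For ease of viewing, the 3D volumes are drawn as flat 2D pictures here.

Suppose the sliding window is $14\times14\times3$ and the original image is $16\times16\times3$.

Blue marks the sliding window. With a stride of 2, the output is a $2\times2$ map: the top-left blue cell of the final output is the result for the blue region of the original image, and so on for the other cells.

In other words, a single forward pass over the original image produces the outputs for several sliding windows at once.

A concrete example is shown in the figure below:

For a $28\times28$ image we end up with $8\times8 = 64$ sliding windows.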
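The window counts in both examples follow from the usual output-size formula. A small sketch, assuming a $14\times14$ window, a stride of 2, and no padding in both cases (the helper name `num_window_positions` is made up for illustration):

```python
def num_window_positions(image_size, window_size, stride):
    """Number of window positions along one axis, without padding."""
    return (image_size - window_size) // stride + 1

for image_size in (16, 28):
    n = num_window_positions(image_size, window_size=14, stride=2)
    print(f"{image_size}x{image_size} image -> {n}x{n} = {n * n} windows")
# 16x16 image -> 2x2 = 4 windows
# 28x28 image -> 8x8 = 64 windows
```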

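Bounding Box

The sliding-window approach above has a problem: in many cases no window matches the object's exact position.

So how do we find an accurate bounding box? A fast algorithm called YOLO (you only look once) detects object positions with a single pass.

As the figure shows, the image is first divided into $n \times n$ cells, here $3\times3=9$, and each cell is described by a vector y.

The final output is therefore a $3\times3\times8$ volume.

To obtain this $3\times3\times8$ volume, we just pick a convolutional network whose output has that shape. Each cell has its own target label y, in which $b_x,b_y$ are positions relative to the cell and lie between 0 and 1, while $b_h,b_w$ can exceed 1 because a large object may extend beyond the cell.

In practice a finer grid is used, such as $19\times19$.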

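Intersection over Union (IoU)

How do we judge whether a predicted box is correct?

In the figure, the red box is the ground-truth box of the car and the purple box is the detection. Does the purple box count as having found the car?

We can decide with the intersection over union: the ratio of the intersection of the two boxes to their union.

$$IoU = \frac{\text{area of intersection}}{\text{area of union}}$$

By convention a detection counts as correct when $IoU \geq 0.5$, although the threshold can be chosen freely.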
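The IoU formula can be written in a few lines. This is a minimal sketch that assumes boxes are given by their corners as (x1, y1, x2, y2), with (x1, y1) the top-left and (x2, y2) the bottom-right corner; that box format is an assumption of the example, not something fixed by the notes.

```python
def iou(box1, box2):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    xi1 = max(box1[0], box2[0])
    yi1 = max(box1[1], box2[1])
    xi2 = min(box1[2], box2[2])
    yi2 = min(box1[3], box2[3])
    # Width and height are clamped at 0 so that disjoint boxes give 0 area.
    inter = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)

    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    return inter / union if union > 0 else 0.0

print(iou((2, 1, 4, 3), (1, 2, 3, 4)))   # intersection 1, union 7
```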

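Non-max suppression (NMS)

In practice many boxes are likely to detect the same object. How do we decide which boxes refer to the same object and keep only one?

1. Every box comes with a probability $P_c$. First discard the low-probability boxes, e.g. those with $P_c \leq 0.55$.
2. Among the remaining boxes, pick the one with the largest $P_c$, then go through the other boxes and discard every box whose IoU with it is greater than 0.5 (an IoU above 0.5 suggests the same object).
3. Repeat step 2 on the boxes that are left.

With multiple object classes, run the NMS procedure above separately for each class.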
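The procedure above can be sketched for a single class in NumPy. The corner box format (x1, y1, x2, y2), the function names, and the toy boxes are assumptions made for this example; the thresholds 0.55 and 0.5 are the ones used in the text.

```python
import numpy as np

def box_iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def non_max_suppression(boxes, scores, score_threshold=0.55, iou_threshold=0.5):
    """Return the indices of the boxes that survive NMS, best score first."""
    # Discard low-probability boxes.
    candidates = np.flatnonzero(scores > score_threshold)
    # Order the survivors by descending score.
    candidates = candidates[np.argsort(-scores[candidates])]
    keep = []
    while candidates.size > 0:
        # Keep the best remaining box, drop everything overlapping it too much.
        best, rest = candidates[0], candidates[1:]
        keep.append(int(best))
        overlaps = box_iou(boxes[best], boxes[rest])
        candidates = rest[overlaps <= iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10],     # strong detection
                  [1, 1, 11, 11],     # overlaps the first box heavily
                  [20, 20, 30, 30],   # a separate object
                  [0, 0, 5, 5]],      # low score, removed by the threshold
                 dtype=float)
scores = np.array([0.9, 0.8, 0.7, 0.3])
print(non_max_suppression(boxes, scores))   # [0, 2]
```

For several classes, the same function would be called once per class on that class's boxes and scores.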

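Anchor Box

What if one grid cell contains several objects? This is where anchor boxes come in: they allow several objects to be detected in the same cell.

We predefine two anchor boxes with different shapes, e.g. a tall one for pedestrians and a wide one for cars, and redefine the target vector y accordingly:

The output volume then changes from $3\times3\times8$ to $3\times3\times16$, which can also be arranged as $3\times3\times2\times8$.

During computation the outputs are then produced per anchor box; a "?" marks a value we don't care about.

Problems:

With two anchor boxes, a cell containing three objects cannot be handled and needs some other mechanism.
If two objects in the same cell match the same anchor box shape, that also needs some other mechanism.

Both cases are rare, so they matter little in practice.

Notes:

Anchor box shapes are chosen by hand; typically 5 to 10 different shapes are used to cover the objects we want to detect.
A more advanced approach runs k-means clustering on the object shapes to obtain a representative set of boxes.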

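The YOLO algorithm

Suppose we want to detect three classes (pedestrian, car, motorcycle) using two anchor boxes.

On the training set:

The inputs X are images of the same size.
The output Y for each image is a $3\times3\times16$ volume.
The outputs Y are labeled by hand.

Prediction:

Feed in an image of the same size as the training images and obtain a $3\times3\times16$ output.

This yields many boxes: with two anchor boxes there are $2\times9=18$ predicted boxes. The useless ones have to be discarded, which is where the NMS algorithm above comes in.

Running NMS:

Discard the boxes whose $P_c$ is below some threshold.
For each class, run NMS separately to obtain the final bounding boxes.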

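Region proposals

The course also introduces other object detection algorithms, though they all seem to be slower.

R-CNN:

With plain sliding windows, only a few regions contain an object and most are background, so a lot of computation is wasted. R-CNN runs the classifier only on a set of proposed windows instead of sliding over the entire image.

R-CNN uses an image segmentation algorithm that splits the image into many color blobs, which cuts down the number of windows.

Each proposed region is then classified, and the output is a label together with a bounding box.

Fast R-CNN:

Regions are still proposed, but all of them are classified with a convolutional implementation of sliding windows.

Faster R-CNN:

Uses a convolutional neural network rather than image segmentation to propose the regions.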

DeepLearning.ai Assignment: (4-2) -- Deep convolutional models: case studies

2018-10-09T11:20:57.000Z

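This week's assignment has two parts: one on the basics of Keras, the other on building a ResNet.

Part1: Keras – Tutorial

Keras is a high-level API on top of TensorFlow that makes building neural networks much more efficient.

First, import the libraries: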

import numpy as np
from keras import layers
from keras.layers import Input, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D
from keras.layers import AveragePooling2D, MaxPooling2D, Dropout, GlobalMaxPooling2D, GlobalAveragePooling2D
from keras.models import Model
from keras.preprocessing import image
from keras.utils import layer_utils
from keras.utils.data_utils import get_file
from keras.applications.imagenet_utils import preprocess_input
import pydot
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
from keras.utils import plot_model
from kt_utils import *

import keras.backend as K
K.set_image_data_format('channels_last')
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow

%matplotlib inline

Build the model:

def HappyModel(input_shape):
    """
    Implementation of the HappyModel.
    
    Arguments:
    input_shape -- shape of the images of the dataset

    Returns:
    model -- a Model() instance in Keras
    """
    
    ### START CODE HERE ###
    # Feel free to use the suggested outline in the text above to get started, and run through the whole
    # exercise (including the later portions of this notebook) once. Then come back and try out other
    # network architectures as well. 
    X_input = Input(input_shape)
    X = ZeroPadding2D((3, 3))(X_input)
    X = Conv2D(32,(7,7),strides=(1,1),name="Conv0")(X)
    X = BatchNormalization(axis = 3, name = 'bn0')(X)
    X = Activation('relu')(X)
    X = MaxPooling2D((2, 2), name='max_pool')(X)
    X = Flatten()(X)
    X = Dense(1, activation='sigmoid', name='fc')(X)
    model = Model(inputs = X_input, outputs = X, name='HappyModel')
    
    
    ### END CODE HERE ###
    
    return model

Then instantiate the model:

### START CODE HERE ### (1 line)
happyModel = HappyModel(X_train.shape[1:])
### END CODE HERE ###

Choose the optimizer and the loss:

### START CODE HERE ### (1 line)
happyModel.compile(optimizer='Adam',loss='binary_crossentropy',metrics=['accuracy'])
### END CODE HERE ###

Train:

### START CODE HERE ### (1 line)
happyModel.fit(x=X_train,y = Y_train,epochs=10,batch_size=32)
### END CODE HERE ###

Evaluate on the test set:

### START CODE HERE ### (1 line)
preds = happyModel.evaluate(X_test,Y_test)
### END CODE HERE ###
print()
print ("Loss = " + str(preds[0]))
print ("Test Accuracy = " + str(preds[1]))

summary() prints the details of the model:

happyModel.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 64, 64, 3)         0         
_________________________________________________________________
zero_padding2d_1 (ZeroPaddin (None, 70, 70, 3)         0         
_________________________________________________________________
Conv0 (Conv2D)               (None, 64, 64, 32)        4736      
_________________________________________________________________
bn0 (BatchNormalization)     (None, 64, 64, 32)        128       
_________________________________________________________________
activation_1 (Activation)    (None, 64, 64, 32)        0         
_________________________________________________________________
max_pool (MaxPooling2D)      (None, 32, 32, 32)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 32768)             0         
_________________________________________________________________
fc (Dense)                   (None, 1)                 32769     
=================================================================
Total params: 37,633
Trainable params: 37,569
Non-trainable params: 64
_________________________________________________________________
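The parameter counts in the summary can be reproduced by hand. The arithmetic below is a quick sanity check based on the layer shapes printed above; it assumes the standard Keras BatchNormalization layout of four vectors per channel (gamma, beta, moving mean, moving variance), of which the two moving statistics are not trainable.

```python
# Conv0: 32 filters of size 7x7 over 3 input channels, plus one bias per filter.
conv0 = 7 * 7 * 3 * 32 + 32

# bn0: gamma, beta, moving mean and moving variance for each of the 32 channels.
bn0 = 4 * 32
bn0_non_trainable = 2 * 32          # moving mean and moving variance

# fc: 32 * 32 * 32 flattened inputs feeding one sigmoid unit, plus a bias.
fc = 32 * 32 * 32 * 1 + 1

total = conv0 + bn0 + fc
print(conv0, bn0, fc)                                        # 4736 128 32769
print(total, total - bn0_non_trainable, bn0_non_trainable)   # 37633 37569 64
```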

plot_model() produces a detailed graph of the model:

plot_model(happyModel, to_file='HappyModel.png')
SVG(model_to_dot(happyModel).create(prog='dot', format='svg'))

Part2: Residual Networks

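There are two main steps:

Build the basic ResNet blocks.
Put the blocks together into a network for image classification.

1 - The problem of very deep neural networks

This part covers a problem of very deep networks: the gradients can shrink toward zero or explode, which makes training converge very slowly. Residual networks mitigate this problem effectively.

2 - Building a Residual Network

Depending on whether the input and output dimensions match, there are two kinds of block:

1. identity block

As the figure shows, the two ends of an identity block have the same dimensions, so they can be added directly.

Here we implement a block whose shortcut skips three layers.

The basic structure is: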

First component of main path:

The first CONV2D has $F_1$ filters of shape (1,1) and a stride of (1,1). Its padding is “valid” and its name should be conv_name_base + '2a'. Use 0 as the seed for the random initialization.
The first BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2a'.
Then apply the ReLU activation function. This has no name and no hyperparameters.

Second component of main path:

The second CONV2D has $F_2$ filters of shape $(f,f)$ and a stride of (1,1). Its padding is “same” and its name should be conv_name_base + '2b'. Use 0 as the seed for the random initialization.
The second BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2b'.
Then apply the ReLU activation function. This has no name and no hyperparameters.

Third component of main path:

The third CONV2D has $F_3$ filters of shape (1,1) and a stride of (1,1). Its padding is “valid” and its name should be conv_name_base + '2c'. Use 0 as the seed for the random initialization.
The third BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2c'. Note that there is no ReLU activation function in this component.

Final step:

The shortcut and the input are added together.
Then apply the ReLU activation function. This has no name and no hyperparameters.

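Note that the skip addition must use the Keras layer Add() rather than the plus operator; otherwise it leads to errors.

Here f is the kernel size of the middle convolution, filters is a list with the number of filters of the three convolutional layers, stage names the stage of the network the block belongs to (used for layer names, which matter later), and block identifies the block within the stage with a letter such as a, b, c, d.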

# GRADED FUNCTION: identity_block

def identity_block(X, f, filters, stage, block):
    """
    Implementation of the identity block as defined in Figure 3
    
    Arguments:
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    stage -- integer, used to name the layers, depending on their position in the network
    block -- string/character, used to name the layers, depending on their position in the network
    
    Returns:
    X -- output of the identity block, tensor of shape (n_H, n_W, n_C)
    """
    
    # defining name basis
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'
    
    # Retrieve Filters
    F1, F2, F3 = filters
    
    # Save the input value. You'll need this later to add back to the main path. 
    X_shortcut = X
    
    # First component of main path
    X = Conv2D(filters = F1, kernel_size = (1, 1), strides = (1,1), padding = 'valid', name = conv_name_base + '2a', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2a')(X)
    X = Activation('relu')(X)
    
    ### START CODE HERE ###
    
    # Second component of main path (≈3 lines)
    X = Conv2D(filters = F2, kernel_size = (f, f), strides= (1,1), padding = 'same', name = conv_name_base + '2b', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2b')(X)
    X = Activation('relu')(X)

    # Third component of main path (≈2 lines)
    X = Conv2D(filters = F3, kernel_size = (1, 1), strides= (1,1), padding = 'valid', name = conv_name_base + '2c', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2c')(X)

    # Final step: Add shortcut value to main path, and pass it through a RELU activation (≈2 lines)
    X = Add()([X,X_shortcut])
    X = Activation('relu')(X)
    
    ### END CODE HERE ###
    
    return X

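2. The convolutional block

When the two ends have different dimensions, a convolution can be added on the shortcut path to convert the dimensions; this shortcut convolution is not followed by an activation function.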

First component of main path:

The first CONV2D has $F_1$ filters of shape (1,1) and a stride of (s,s). Its padding is “valid” and its name should be conv_name_base + '2a'.
The first BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2a'.
Then apply the ReLU activation function. This has no name and no hyperparameters.

Second component of main path:

The second CONV2D has $F_2$ filters of shape (f,f) and a stride of (1,1). Its padding is “same” and its name should be conv_name_base + '2b'.
The second BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2b'.
Then apply the ReLU activation function. This has no name and no hyperparameters.

Third component of main path:

The third CONV2D has $F_3$ filters of shape (1,1) and a stride of (1,1). Its padding is “valid” and its name should be conv_name_base + '2c'.
The third BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2c'. Note that there is no ReLU activation function in this component.

Shortcut path:

The CONV2D has $F_3$ filters of shape (1,1) and a stride of (s,s). Its padding is “valid” and its name should be conv_name_base + '1'.
The BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '1'.

Final step:

The shortcut and the main path values are added together.
Then apply the ReLU activation function. This has no name and no hyperparameters.

The new parameter s is the stride.

def convolutional_block(X, f, filters, stage, block, s = 2):
    """
    Implementation of the convolutional block as defined in Figure 4
    
    Arguments:
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    stage -- integer, used to name the layers, depending on their position in the network
    block -- string/character, used to name the layers, depending on their position in the network
    s -- Integer, specifying the stride to be used
    
    Returns:
    X -- output of the convolutional block, tensor of shape (n_H, n_W, n_C)
    """
    
    # defining name basis
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'
    
    # Retrieve Filters
    F1, F2, F3 = filters
    
    # Save the input value
    X_shortcut = X


    ##### MAIN PATH #####
    # First component of main path 
    X = Conv2D(F1, (1, 1), strides = (s,s), name = conv_name_base + '2a', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2a')(X)
    X = Activation('relu')(X)
    
    ### START CODE HERE ###

    # Second component of main path (≈3 lines)
    X = Conv2D(F2, (f, f), strides = (1,1), name = conv_name_base + '2b', padding = 'same', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2b')(X)
    X = Activation('relu')(X)

    # Third component of main path (≈2 lines)
    X = Conv2D(F3, (1, 1), strides = (1,1), name = conv_name_base + '2c', padding = 'valid', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2c')(X)

    ##### SHORTCUT PATH #### (≈2 lines)
    X_shortcut = Conv2D(F3, (1, 1), strides = (s,s), name = conv_name_base + '1', padding = 'valid', kernel_initializer = glorot_uniform(seed=0))(X_shortcut)
    X_shortcut = BatchNormalization(axis = 3, name = bn_name_base + '1')(X_shortcut)

    # Final step: Add shortcut value to main path, and pass it through a RELU activation (≈2 lines)
    X = Add()([X,X_shortcut])
    X = Activation('relu')(X)
    
    ### END CODE HERE ###
    
    return X

3 - Building your first ResNet model (50 layers)

We build a 50-layer network organized into 5 stages, with the following structure:

The details of this ResNet-50 model are:

Zero-padding pads the input with a pad of (3,3)
Stage 1:
- The 2D Convolution has 64 filters of shape (7,7) and uses a stride of (2,2). Its name is “conv1”.
- BatchNorm is applied to the channels axis of the input.
- MaxPooling uses a (3,3) window and a (2,2) stride.
Stage 2:
- The convolutional block uses three set of filters of size [64,64,256], “f” is 3, “s” is 1 and the block is “a”.
- The 2 identity blocks use three set of filters of size [64,64,256], “f” is 3 and the blocks are “b” and “c”.
Stage 3:
- The convolutional block uses three set of filters of size [128,128,512], “f” is 3, “s” is 2 and the block is “a”.
- The 3 identity blocks use three set of filters of size [128,128,512], “f” is 3 and the blocks are “b”, “c” and “d”.
Stage 4:
- The convolutional block uses three set of filters of size [256, 256, 1024], “f” is 3, “s” is 2 and the block is “a”.
- The 5 identity blocks use three set of filters of size [256, 256, 1024], “f” is 3 and the blocks are “b”, “c”, “d”, “e” and “f”.
Stage 5:
- The convolutional block uses three set of filters of size [512, 512, 2048], “f” is 3, “s” is 2 and the block is “a”.
- The 2 identity blocks use three set of filters of size [512, 512, 2048], “f” is 3 and the blocks are “b” and “c”.
The 2D Average Pooling uses a window of shape (2,2) and its name is “avg_pool”.
The flatten doesn’t have any hyperparameters or name.
The Fully Connected (Dense) layer reduces its input to the number of classes using a softmax activation. Its name should be 'fc' + str(classes).

Exercise: Implement the ResNet with 50 layers described in the figure above. We have implemented Stages 1 and 2. Please implement the rest. (The syntax for implementing Stages 3-5 should be quite similar to that of Stage 2.) Make sure you follow the naming convention in the text above.

You’ll need to use this function:

Average pooling see reference

# GRADED FUNCTION: ResNet50

def ResNet50(input_shape = (64, 64, 3), classes = 6):
    """
    Implementation of the popular ResNet50 the following architecture:
    CONV2D -> BATCHNORM -> RELU -> MAXPOOL -> CONVBLOCK -> IDBLOCK*2 -> CONVBLOCK -> IDBLOCK*3
    -> CONVBLOCK -> IDBLOCK*5 -> CONVBLOCK -> IDBLOCK*2 -> AVGPOOL -> TOPLAYER

    Arguments:
    input_shape -- shape of the images of the dataset
    classes -- integer, number of classes

    Returns:
    model -- a Model() instance in Keras
    """
    
    # Define the input as a tensor with shape input_shape
    X_input = Input(input_shape)

    
    # Zero-Padding
    X = ZeroPadding2D((3, 3))(X_input)
    
    # Stage 1
    X = Conv2D(64, (7, 7), strides = (2, 2), name = 'conv1', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = 'bn_conv1')(X)
    X = Activation('relu')(X)
    X = MaxPooling2D((3, 3), strides=(2, 2))(X)

    # Stage 2
    X = convolutional_block(X, f = 3, filters = [64, 64, 256], stage = 2, block='a', s = 1)
    X = identity_block(X, 3, [64, 64, 256], stage=2, block='b')
    X = identity_block(X, 3, [64, 64, 256], stage=2, block='c')


    ### START CODE HERE ###

    # Stage 3 (≈4 lines)

    X = convolutional_block(X, f = 3, filters = [128, 128, 512], stage = 3, block='a', s = 2)
    X = identity_block(X, 3, [128, 128, 512], stage=3, block='b')
    X = identity_block(X, 3, [128, 128, 512], stage=3, block='c')
    X = identity_block(X, 3, [128, 128, 512], stage=3, block='d')



    # Stage 4 (≈6 lines)
    X = convolutional_block(X, f = 3, filters = [256, 256, 1024], stage = 4, block='a', s = 2)
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='b')
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='c')
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='d')
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='e')
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='f')


    # Stage 5 (≈3 lines)
    X = convolutional_block(X, f = 3, filters = [512, 512, 2048], stage = 5, block='a', s = 2)
    X = identity_block(X, 3, [512, 512, 2048], stage=5, block='b')
    X = identity_block(X, 3, [512, 512, 2048], stage=5, block='c')



    # AVGPOOL (≈1 line). Use "X = AveragePooling2D(...)(X)"
    X = AveragePooling2D(pool_size=(2,2), strides=(1,1), padding='valid', name='avg_pool')(X)

    
    ### END CODE HERE ###

    # output layer
    X = Flatten()(X)
    X = Dense(classes, activation='softmax', name='fc' + str(classes), kernel_initializer = glorot_uniform(seed=0))(X)
    
    
    # Create model
    model = Model(inputs = X_input, outputs = X, name='ResNet50')

    return model
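To see why a (2,2) average pool is enough at the end, it helps to trace the spatial size through the network for the default 64×64 input. The following is a rough sketch that applies the standard output-size formula to the layers described above; `conv_out` is a helper written for this example, and the layer count uses the common convention of counting only the main-path convolutions and the final dense layer.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output size of a convolution or pooling window along one axis."""
    return (size + 2 * pad - kernel) // stride + 1

size = 64                                           # input_shape = (64, 64, 3)
size = conv_out(size, kernel=7, stride=2, pad=3)    # ZeroPadding2D((3, 3)) + conv1
size = conv_out(size, kernel=3, stride=2)           # max pool
trace = [size]

# Only the first 1x1 convolution of each convolutional block changes the
# spatial size; identity blocks and "same" convolutions preserve it.
for s in (1, 2, 2, 2):                              # s for stages 2, 3, 4, 5
    size = conv_out(size, kernel=1, stride=s)
    trace.append(size)

final = conv_out(size, kernel=2, stride=1)          # average pool
print(trace, final)                                 # [15, 15, 8, 4, 2] 1

# Layer count: conv1, three convolutions in each of the 3 + 4 + 6 + 3 blocks,
# and the final dense layer.
layers = 1 + 3 * (3 + 4 + 6 + 3) + 1
print(layers)                                       # 50
```

With a 1×1 spatial size and 2048 channels after stage 5, Flatten produces a 2048-dimensional vector for the final Dense layer.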

DeepLearning.ai Notes: (4-2) -- Deep convolutional models: case studies

2018-10-09T09:17:04.000Z

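This week covers several deep convolutional architectures and related techniques: LeNet, AlexNet, VGGNet, ResNet, Inception, 1×1 convolutions, transfer learning, and more.

Classic networks

There are three classic convolutional networks: LeNet, AlexNet, and VGGNet.

LeNet-5

LeNet-5 was designed for recognizing single-channel handwritten characters. It dates from the 1990s (the LeNet-5 paper was published in 1998). It used no padding, and its pooling was average pooling, whereas max pooling is the usual choice today.

The paper used sigmoid and tanh nonlinearities, whereas today the output layer typically uses softmax.

AlexNet

AlexNet was proposed in 2012 for processing color images. Its overall structure is very similar to LeNet-5, but the network is larger and has many more parameters.

It already used ReLU as the activation function and was trained on multiple GPUs.

VGG-16

VGG-16 comes from a 2015 paper. Its simplification is that all convolutional layers share one configuration and all pooling layers share another: every convolution is 3×3 with stride 1 and same padding, and every pooling layer is 2×2 max pooling with stride 2. Only the number of channels changes from layer to layer.

The network is large, with about 138 million parameters.

Suggested reading order for the papers: AlexNet -> VGG -> LeNet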

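Residual Network

A residual network is built from a number of residual blocks.

Very deep networks suffer from vanishing and exploding gradients; skip connections were introduced to address this, and they are what a residual network is built on.

The figure above shows the basic idea of a residual block: on top of the normal forward path (the main path), a connection from $a^{[l]}$ to $z^{[l+2]}$ is added, called a 'short cut' or 'skip connection'.

The output therefore becomes: $a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$

A residual network consists of many such residual blocks:

Comparing a plain network with a residual network, ResNet performs better as the number of layers grows:

Why does ResNet work?

Suppose the input has already passed through a large network, Big NN, producing $a^{[l]}$.

Passing it through two more layers gives $a^{[l+2]}$, with

$$a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})$$

With regularization the weights become small. Suppose $W^{[l+2]} = 0$ and $b^{[l+2]} = 0$. Since the activation is ReLU and $a^{[l]}$ is itself a ReLU output (so $a^{[l]} \geq 0$), we get

$$a^{[l+2]} = g(a^{[l]}) = a^{[l]}$$

So with a residual block, the deeper network at worst matches the shallower one, and it may well do better.

With two plain layers instead, the result could be better, but it could also be worse.

Note that $a^{[l+2]}$ must have the same dimensions as $a^{[l]}$; same padding can be used to preserve them.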
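The identity argument can be checked with a tiny dense example. This is an illustrative NumPy sketch using fully connected layers of width 4 with made-up random weights, not the convolutional blocks of an actual ResNet.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(0)

# a^{[l]} comes out of a ReLU, so it is non-negative.
a_l = relu(rng.standard_normal(4))

# Layer l+1 with arbitrary weights.
W1, b1 = rng.standard_normal((4, 4)), rng.standard_normal(4)
a_l1 = relu(W1 @ a_l + b1)

# Layer l+2 with weights and bias shrunk all the way to zero.
W2, b2 = np.zeros((4, 4)), np.zeros(4)
z_l2 = W2 @ a_l1 + b2

# Residual block output: g(z^{[l+2]} + a^{[l]}).
a_l2 = relu(z_l2 + a_l)

print(np.array_equal(a_l2, a_l))   # the block reduces to the identity
```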

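1×1 convolutions

A 1×1 convolution can reduce the number of channels and thereby the number of parameters.

Inception Network

The main benefit of Inception is that we no longer have to choose the filter size by hand, or decide whether to add a pooling layer.

As the figure shows, filters of several sizes and a max pool are all applied in parallel, and the network learns which of them to rely on.

The problem is the computational cost. Take the 32 filters of size $5 \times 5 \times 192$ above: they require $28\times28\times32\times5\times5\times192\approx120M$ multiplications, which is a lot.

How can this be fixed? With the 1×1 convolution introduced earlier.

After compressing the channel dimension, the number of multiplications drops by about a factor of ten.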
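The saving can be worked out explicitly. The direct count uses the numbers given above; the bottleneck width of 16 channels is an assumption taken from the course lecture's example, since these notes do not state it.

```python
# Direct 5x5 convolution: 28x28x192 input -> 28x28x32 output.
direct = 28 * 28 * 32 * 5 * 5 * 192

# With a 1x1 bottleneck (assumed to have 16 channels):
bottleneck_channels = 16
reduce_1x1 = 28 * 28 * bottleneck_channels * 1 * 1 * 192   # 192 -> 16 channels
conv_5x5 = 28 * 28 * 32 * 5 * 5 * bottleneck_channels      # 16 -> 32 channels
with_bottleneck = reduce_1x1 + conv_5x5

print(f"{direct:,}")                        # 120,422,400
print(f"{with_bottleneck:,}")               # 12,443,648
print(round(direct / with_bottleneck, 1))   # 9.7
```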

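The Inception network

A single Inception module looks like this:

And the resulting GoogLeNet looks like this:

Using open-source implementations

Existing implementations are already very good. I think reinventing the wheel is unnecessary and a waste of time, and your own version is unlikely to be better anyway. It is more productive to take someone else's network and adapt it slightly, which lets you realize your idea quickly.

Find a network architecture you like on GitHub, fork it, and study it carefully!

Transfer learning

Transfer learning was covered earlier: take a network someone else has trained, keep its trained parameters fixed, and apply it to your own training set to finish training.

If you have very little data, just replace the final softmax layer of the existing network (for example, the original model had 1000 classes and you only need 3) and freeze all parameters of the preceding hidden layers. It is then as if you were training a very shallow network: the hidden layers act as a fixed feature-mapping function, and only the final softmax layer is trained.

If you have a moderate amount of data, you can freeze fewer layers: train the last few hidden layers, or design your own hidden layers for that part.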

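Data augmentation

When data is scarce, data augmentation is very useful.

Options include:

Mirroring
Random cropping
Color shifting (e.g. on the three channels: R+20, G-20, B+20), and so on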
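The three augmentations listed above are easy to sketch in NumPy. The example below is only an illustration: it uses a random array as a stand-in for a real H x W x RGB image, and the function names and crop size are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a real image: height 32, width 32, RGB, values 0..255.
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

def mirror(img):
    """Flip the image left to right."""
    return img[:, ::-1, :]

def random_crop(img, crop_h, crop_w, rng):
    """Cut a random crop_h x crop_w window out of the image."""
    top = rng.integers(0, img.shape[0] - crop_h + 1)
    left = rng.integers(0, img.shape[1] - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w, :]

def color_shift(img, shift=(20, -20, 20)):
    """Add a per-channel offset (R, G, B) and clip back to the valid range."""
    shifted = img.astype(np.int16) + np.array(shift, dtype=np.int16)
    return np.clip(shifted, 0, 255).astype(np.uint8)

augmented = color_shift(random_crop(mirror(image), 24, 24, rng))
print(augmented.shape, augmented.dtype)   # (24, 24, 3) uint8
```

The cast to a wider integer type before adding the offset avoids uint8 wrap-around, which would otherwise turn bright pixels dark.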

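tips:

In data competitions:

Ensembling: train several network models and average their outputs, or take a weighted average.
Multi-crop at test time: turn a single test image into many via data augmentation, run the classifier on each, and average the results.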

cs231n Assignment: assignment2 - Fully-Connected Neural Nets

2018-09-30T09:52:30.000Z

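GitHub: https://github.com/ZJUFangzh/cs231n

Assignment 2 is mainly about building a convolutional neural network framework, plus the basics of TensorFlow.

We start by building the basic framework for a fully connected neural network.

The 2-layer networks built earlier were fairly simple, but once a model grows, that style of code becomes hard to reuse, so building a modular framework is well worth it.

Each layer is generally split into a forward and a backward part, one layer at a time, so the two functions simply come in pairs.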

def layer_forward(x, w):
  """ Receive inputs x and weights w """
  # Do some computations ...
  z = ...    # some intermediate value
  # Do some more computations ...
  out = ...  # the output

  cache = (x, w, z, out) # Values we need to compute gradients

  return out, cache

def layer_backward(dout, cache):
  """
  Receive derivative of loss with respect to outputs and cache,
  and compute derivative with respect to inputs.
  """
  # Unpack cache values
  x, w, z, out = cache

  # Use values in cache to compute derivatives
  dx = ...  # Derivative of loss with respect to x
  dw = ...  # Derivative of loss with respect to w

  return dx, dw

Affine layer: forward

Modify the affine_forward function in cs231n/layers.py; this is simply the forward pass of a fully connected layer.

def affine_forward(x, w, b):
    """
    Computes the forward pass for an affine (fully-connected) layer.

    The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N
    examples, where each example x[i] has shape (d_1, ..., d_k). We will
    reshape each input into a vector of dimension D = d_1 * ... * d_k, and
    then transform it to an output vector of dimension M.

    Inputs:
    - x: A numpy array containing input data, of shape (N, d_1, ..., d_k)
    - w: A numpy array of weights, of shape (D, M)
    - b: A numpy array of biases, of shape (M,)

    Returns a tuple of:
    - out: output, of shape (N, M)
    - cache: (x, w, b)
    """
    out = None
    ###########################################################################
    # TODO: Implement the affine forward pass. Store the result in out. You   #
    # will need to reshape the input into rows.                               #
    ###########################################################################
    out = x.reshape(x.shape[0],-1).dot(w) + b
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    cache = (x, w, b)
    return out, cache

Next, modify the affine_backward function:

def affine_backward(dout, cache):
    """
    Computes the backward pass for an affine layer.

    Inputs:
    - dout: Upstream derivative, of shape (N, M)
    - cache: Tuple of:
      - x: Input data, of shape (N, d_1, ... d_k)
      - w: Weights, of shape (D, M)

    Returns a tuple of:
    - dx: Gradient with respect to x, of shape (N, d1, ..., d_k)
    - dw: Gradient with respect to w, of shape (D, M)
    - db: Gradient with respect to b, of shape (M,)
    """
    x, w, b = cache
    dx, dw, db = None, None, None
    ###########################################################################
    # TODO: Implement the affine backward pass.                               #
    ###########################################################################
    dx = dout.dot(w.T).reshape(x.shape)
    dw = x.reshape(x.shape[0],-1).T.dot(dout)
    db = np.sum(dout,axis=0)
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dx, dw, db
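A numerical gradient check is a quick way to confirm the backward pass. The sketch below repeats the two affine functions so that it runs on its own; `numerical_gradient` is a hypothetical helper written for this example (the assignment ships its own gradient-check utilities), and the shapes are arbitrary.

```python
import numpy as np

def affine_forward(x, w, b):
    out = x.reshape(x.shape[0], -1).dot(w) + b
    return out, (x, w, b)

def affine_backward(dout, cache):
    x, w, b = cache
    dx = dout.dot(w.T).reshape(x.shape)
    dw = x.reshape(x.shape[0], -1).T.dot(dout)
    db = np.sum(dout, axis=0)
    return dx, dw, db

def numerical_gradient(f, x, dout, h=1e-5):
    """Centered-difference gradient of sum(f(x) * dout) with respect to x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + h
        pos = f(x)
        x[idx] = old - h
        neg = f(x)
        x[idx] = old                      # restore the original value
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 2, 2))        # N = 3 examples, D = 2 * 2 = 4
w = rng.standard_normal((4, 5))           # D = 4, M = 5
b = rng.standard_normal(5)
dout = rng.standard_normal((3, 5))        # upstream gradient

_, cache = affine_forward(x, w, b)
dx, dw, db = affine_backward(dout, cache)

dx_num = numerical_gradient(lambda v: affine_forward(v, w, b)[0], x, dout)
dw_num = numerical_gradient(lambda v: affine_forward(x, v, b)[0], w, dout)
db_num = numerical_gradient(lambda v: affine_forward(x, w, v)[0], b, dout)

for name, analytic, numeric in (("dx", dx, dx_num), ("dw", dw, dw_num), ("db", db, db_num)):
    print(name, "max abs difference:", np.max(np.abs(analytic - numeric)))
```

The differences should be tiny, since the affine layer is linear in each of its inputs.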

Then come the relu_forward and relu_backward functions.

def relu_forward(x):
    """
    Computes the forward pass for a layer of rectified linear units (ReLUs).

    Input:
    - x: Inputs, of any shape

    Returns a tuple of:
    - out: Output, of the same shape as x
    - cache: x
    """
    out = None
    ###########################################################################
    # TODO: Implement the ReLU forward pass.                                  #
    ###########################################################################
    out = np.maximum(0,x)
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    cache = x
    return out, cache

def relu_backward(dout, cache):
    """
    Computes the backward pass for a layer of rectified linear units (ReLUs).

    Input:
    - dout: Upstream derivatives, of any shape
    - cache: Input x, of same shape as dout

    Returns:
    - dx: Gradient with respect to x
    """
    dx, x = None, cache
    ###########################################################################
    # TODO: Implement the ReLU backward pass.                                 #
    ###########################################################################
    dx = (x > 0) * dout
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dx

The assignment then defines a sandwich layer that chains affine and relu together. It lives in layer_utils.py.

def affine_relu_forward(x, w, b):
    """
    Convenience layer that performs an affine transform followed by a ReLU

    Inputs:
    - x: Input to the affine layer
    - w, b: Weights for the affine layer

    Returns a tuple of:
    - out: Output from the ReLU
    - cache: Object to give to the backward pass
    """
    a, fc_cache = affine_forward(x, w, b)
    out, relu_cache = relu_forward(a)
    cache = (fc_cache, relu_cache)
    return out, cache

def affine_relu_backward(dout, cache):
    """
    Backward pass for the affine-relu convenience layer
    """
    fc_cache, relu_cache = cache
    da = relu_backward(dout, relu_cache)
    dx, dw, db = affine_backward(da, fc_cache)
    return dx, dw, db

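With these basic functions in place, we can build a simple neural network, in fc_net.py:

First complete the initialization, then call the basic functions in loss to compute the loss, and then compute the gradients.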

class TwoLayerNet(object):
    """
    A two-layer fully-connected neural network with ReLU nonlinearity and
    softmax loss that uses a modular layer design. We assume an input dimension
    of D, a hidden dimension of H, and perform classification over C classes.

    The architecture should be affine - relu - affine - softmax.

    Note that this class does not implement gradient descent; instead, it
    will interact with a separate Solver object that is responsible for running
    optimization.

    The learnable parameters of the model are stored in the dictionary
    self.params that maps parameter names to numpy arrays.
    """

    def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10,
                 weight_scale=1e-3, reg=0.0):
        """
        Initialize a new network.

        Inputs:
        - input_dim: An integer giving the size of the input
        - hidden_dim: An integer giving the size of the hidden layer
        - num_classes: An integer giving the number of classes to classify
        - dropout: Scalar between 0 and 1 giving dropout strength.
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - reg: Scalar giving L2 regularization strength.
        """
        self.params = {}
        self.reg = reg

        ############################################################################
        # TODO: Initialize the weights and biases of the two-layer net. Weights    #
        # should be initialized from a Gaussian with standard deviation equal to   #
        # weight_scale, and biases should be initialized to zero. All weights and  #
        # biases should be stored in the dictionary self.params, with first layer  #
        # weights and biases using the keys 'W1' and 'b1' and second layer weights #
        # and biases using the keys 'W2' and 'b2'.                                 #
        ############################################################################
        self.params['W1'] = np.random.randn(input_dim,hidden_dim) * weight_scale
        self.params['b1'] = np.zeros((hidden_dim,))
        self.params['W2'] = np.random.randn(hidden_dim,num_classes) * weight_scale
        self.params['b2'] = np.zeros((num_classes,))

        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################


    def loss(self, X, y=None):
        """
        Compute loss and gradient for a minibatch of data.

        Inputs:
        - X: Array of input data of shape (N, d_1, ..., d_k)
        - y: Array of labels, of shape (N,). y[i] gives the label for X[i].

        Returns:
        If y is None, then run a test-time forward pass of the model and return:
        - scores: Array of shape (N, C) giving classification scores, where
          scores[i, c] is the classification score for X[i] and class c.

        If y is not None, then run a training-time forward and backward pass and
        return a tuple of:
        - loss: Scalar value giving the loss
        - grads: Dictionary with the same keys as self.params, mapping parameter
          names to gradients of the loss with respect to those parameters.
        """
        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the two-layer net, computing the    #
        # class scores for X and storing them in the scores variable.              #
        ############################################################################
        A1, A1_cache = affine_relu_forward(X,self.params['W1'],self.params['b1'])
        scores , out_cache = affine_forward(A1,self.params['W2'],self.params['b2'])
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # If y is None then we are in test mode so just return scores
        if y is None:
            return scores

        loss, grads = 0, {}
        ############################################################################
        # TODO: Implement the backward pass for the two-layer net. Store the loss  #
        # in the loss variable and gradients in the grads dictionary. Compute data #
        # loss using softmax, and make sure that grads[k] holds the gradients for  #
        # self.params[k]. Don't forget to add L2 regularization!                   #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        loss, dout = softmax_loss(scores,y)
        loss += 0.5 * self.reg * (np.sum(self.params['W1']*self.params['W1']) + np.sum(self.params['W2']*self.params['W2']))
        da1, dw2, db2 = affine_backward(dout,out_cache)
        grads['W2'] = dw2 + self.reg * self.params['W2']
        grads['b2'] = db2
        _ , dw1, db1 = affine_relu_backward(da1, A1_cache)
        grads['W1'] = dw1 + self.reg * self.params['W1']
        grads['b1'] = db1
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads
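The factor of 0.5 in the regularization term is what lets the gradient contribution be written simply as reg * W, as in grads['W1'] and grads['W2'] above. The following is a minimal sketch that checks this numerically with central differences. It is independent of the assignment code, and the helper names l2_penalty and numerical_grad are hypothetical.

```python
import numpy as np

def l2_penalty(W, reg):
    """0.5 * reg * sum(W^2), the same form as the regularization term above."""
    return 0.5 * reg * np.sum(W * W)

def numerical_grad(f, W, h=1e-5):
    """Central-difference gradient of a scalar function f with respect to W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + h
        f_plus = f(W)
        W[idx] = old - h
        f_minus = f(W)
        W[idx] = old  # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
reg = 0.7

analytic = reg * W  # what the backward pass adds to dW
numeric = numerical_grad(lambda M: l2_penalty(M, reg), W)
max_err = np.max(np.abs(analytic - numeric))
print("max abs difference:", max_err)
```

Without the 0.5, the analytic term would have to be 2 * reg * W instead.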

Back in the notebook, we use the Solver class that is already provided, with TwoLayerNet() as the model.

model = TwoLayerNet()
solver = None

##############################################################################
# TODO: Use a Solver instance to train a TwoLayerNet that achieves at least  #
# 50% accuracy on the validation set.                                        #
##############################################################################
# data = {
#     'X_train': X_train,
#       'y_train': y_train,
#       'X_val': X_val,
#       'y_val': y_val,
# }
solver = Solver(model,data,
                update_rule='sgd',
                optim_config={
                    'learning_rate': 1e-3,
                    },
                lr_decay=0.9,
                num_epochs=10,batch_size=100,
                print_every=100
               )
solver.train()
##############################################################################
#                             END OF YOUR CODE                               #
##############################################################################
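To get a feel for what lr_decay=0.9 does over 10 epochs, here is a small sketch of the resulting schedule. It assumes the Solver multiplies the learning rate by lr_decay once at the end of every epoch; learning_rate_schedule is a hypothetical helper, not part of the Solver.

```python
def learning_rate_schedule(base_lr, lr_decay, num_epochs):
    """Learning rate in effect during each epoch, assuming one decay per epoch."""
    return [base_lr * lr_decay ** epoch for epoch in range(num_epochs)]

schedule = learning_rate_schedule(base_lr=1e-3, lr_decay=0.9, num_epochs=10)
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch:2d}: learning rate {lr:.2e}")
```

Under that assumption the last epoch runs at roughly 39% of the initial learning rate.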

With that done, we can build a multi-layer network in the same way, in the FullyConnectedNet class of fc_net.py. Ignore batchnorm and dropout for now; those functions are implemented later.

class FullyConnectedNet(object):
    """
    A fully-connected neural network with an arbitrary number of hidden layers,
    ReLU nonlinearities, and a softmax loss function. This will also implement
    dropout and batch normalization as options. For a network with L layers,
    the architecture will be

    {affine - [batch norm] - relu - [dropout]} x (L - 1) - affine - softmax

    where batch normalization and dropout are optional, and the {...} block is
    repeated L - 1 times.

    Similar to the TwoLayerNet above, learnable parameters are stored in the
    self.params dictionary and will be learned using the Solver class.
    """

    def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
                 dropout=0, use_batchnorm=False, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32, seed=None):
        """
        Initialize a new FullyConnectedNet.

        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=0 then
          the network should not use dropout at all.
        - use_batchnorm: Whether or not the network should use batch normalization.
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed using
          this datatype. float32 is faster but less accurate, so you should use
          float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers. This
          will make the dropout layers deterministic so we can gradient check the
          model.
        """
        self.use_batchnorm = use_batchnorm
        self.use_dropout = dropout > 0
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        ############################################################################
        # TODO: Initialize the parameters of the network, storing all values in    #
        # the self.params dictionary. Store weights and biases for the first layer #
        # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be #
        # initialized from a normal distribution with standard deviation equal to  #
        # weight_scale and biases should be initialized to zero.                   #
        #                                                                          #
        # When using batch normalization, store scale and shift parameters for the #
        # first layer in gamma1 and beta1; for the second layer use gamma2 and     #
        # beta2, etc. Scale parameters should be initialized to one and shift      #
        # parameters should be initialized to zero.                                #
        ############################################################################

        n_i_prev = input_dim
        for i, n_i in enumerate(hidden_dims):
            self.params['W' + str(i+1)] = np.random.randn(n_i_prev,n_i) * weight_scale
            self.params['b' + str(i+1)] = np.zeros((n_i,))
            # Add scale and shift parameters when batchnorm is enabled
            if self.use_batchnorm:
                self.params['gamma' +str(i+1)] = np.ones((n_i,))
                self.params['beta' + str(i+1)] = np.zeros((n_i,))

            n_i_prev = n_i

        self.params['W' + str(self.num_layers)] = np.random.randn(n_i_prev,num_classes) * weight_scale
        self.params['b' + str(self.num_layers)] = np.zeros((num_classes,))

        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the mode
        # (train / test). You can pass the same dropout_param to each dropout layer.
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed

        # With batch normalization we need to keep track of running means and
        # variances, so we need to pass a special bn_param object to each batch
        # normalization layer. You should pass self.bn_params[0] to the forward pass
        # of the first batch normalization layer, self.bn_params[1] to the forward
        # pass of the second batch normalization layer, etc.
        self.bn_params = []
        if self.use_batchnorm:
            self.bn_params = [{'mode': 'train'} for i in range(self.num_layers - 1)]

        # Cast all parameters to the correct datatype
        for k, v in self.params.items():
            self.params[k] = v.astype(dtype)


    def loss(self, X, y=None):
        """
        Compute loss and gradient for the fully-connected net.

        Input / output: Same as TwoLayerNet above.
        """
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.use_dropout:
            self.dropout_param['mode'] = mode
        if self.use_batchnorm:
            for bn_param in self.bn_params:
                bn_param['mode'] = mode

        scores = None
        ############################################################################
        # TODO: Implement the forward pass for the fully-connected net, computing  #
        # the class scores for X and storing them in the scores variable.          #
        #                                                                          #
        # When using dropout, you'll need to pass self.dropout_param to each       #
        # dropout forward pass.                                                    #
        #                                                                          #
        # When using batch normalization, you'll need to pass self.bn_params[0] to #
        # the forward pass for the first batch normalization layer, pass           #
        # self.bn_params[1] to the forward pass for the second batch normalization #
        # layer, etc.                                                              #
        ############################################################################
        # Version without dropout or batchnorm
        # A_prev = X
        # fc_mix_cache = []
        # for i in range(self.num_layers - 1):
        #     W, b = self.params['W' + str(i+1)],self.params['b' + str(i+1)]
        #     A, A_cache = affine_relu_forward(A_prev, W, b)
        #     A_prev = A
        #     fc_mix_cache.append(A_cache)
        # W, b = self.params['W' + str(self.num_layers)],self.params['b' + str(self.num_layers)]
        # ZL, ZL_cache = affine_forward(A_prev,W,b)
        # scores = ZL

        # Version with optional batchnorm and dropout
        A_prev = X
        fc_mix_cache = []
        drop_cache = []
        for i in range(self.num_layers - 1):
            W, b = self.params['W' + str(i+1)],self.params['b' + str(i+1)]
            if self.use_batchnorm:
                gamma = self.params['gamma'+str(i+1)]
                beta = self.params['beta'+str(i+1)]
                A, A_cache = affine_bn_relu_forword(A_prev, W, b,gamma,beta,self.bn_params[i])
            else:
                A, A_cache = affine_relu_forward(A_prev, W, b)

            if self.use_dropout:
                A, drop_ch = dropout_forward(A, self.dropout_param) 
                drop_cache.append(drop_ch)
            A_prev = A
            fc_mix_cache.append(A_cache)
        W, b = self.params['W' + str(self.num_layers)],self.params['b' + str(self.num_layers)]
        ZL, ZL_cache = affine_forward(A_prev,W,b)
        scores = ZL
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        # If test mode return early
        if mode == 'test':
            return scores

        loss, grads = 0.0, {}
        ############################################################################
        # TODO: Implement the backward pass for the fully-connected net. Store the #
        # loss in the loss variable and gradients in the grads dictionary. Compute #
        # data loss using softmax, and make sure that grads[k] holds the gradients #
        # for self.params[k]. Don't forget to add L2 regularization!               #
        #                                                                          #
        # When using batch normalization, you don't need to regularize the scale   #
        # and shift parameters.                                                    #
        #                                                                          #
        # NOTE: To ensure that your implementation matches ours and you pass the   #
        # automated tests, make sure that your L2 regularization includes a factor #
        # of 0.5 to simplify the expression for the gradient.                      #
        ############################################################################
        # loss, dout = softmax_loss(scores,y)
        # # First add the regularization term for the last layer to the loss
        # loss += 0.5 * self.reg * (np.sum(self.params['W'+  str(self.num_layers)]**2))
        # # Compute the gradients of the last layer
        # dA_prev, dwl, dbl = affine_backward(dout, ZL_cache)
        # grads['W' + str(self.num_layers)] = dwl + self.reg * self.params['W'+ str(self.num_layers)]
        # grads['b' + str(self.num_layers)] = dbl
        # # Loop backwards over the hidden layers
        # for i in range(self.num_layers-1, 0,-1):
        #     loss += 0.5 * self.reg * np.sum(self.params['W'+ str(i)]**2)
        #     dA_prev, dw, db = affine_relu_backward(dA_prev, fc_mix_cache[i-1])
        #     grads['W'+str(i)] = dw + self.reg * self.params['W'+ str(i)]
        #     grads['b'+str(i)] = db

        loss, dout = softmax_loss(scores,y)
        # First add the regularization term for the last layer to the loss
        loss += 0.5 * self.reg * (np.sum(self.params['W'+  str(self.num_layers)]**2))
        # Compute the gradients of the last layer
        dA_prev, dwl, dbl = affine_backward(dout, ZL_cache)
        grads['W' + str(self.num_layers)] = dwl + self.reg * self.params['W'+ str(self.num_layers)]
        grads['b' + str(self.num_layers)] = dbl
        # Loop backwards over the hidden layers
        for i in range(self.num_layers-1, 0,-1):
            loss += 0.5 * self.reg * np.sum(self.params['W'+ str(i)]**2)
            if self.use_dropout:
                dA_prev = dropout_backward(dA_prev, drop_cache[i-1])
            if self.use_batchnorm:
                dA_prev, dw, db, dgamma, dbeta = affine_bn_relu_backward(dA_prev, fc_mix_cache[i-1])
            else:
                dA_prev, dw, db = affine_relu_backward(dA_prev, fc_mix_cache[i-1])

            grads['W'+str(i)] = dw + self.reg * self.params['W'+ str(i)]
            grads['b'+str(i)] = db

            if self.use_batchnorm:
                grads['gamma' + str(i)] = dgamma
                grads['beta' + str(i)] = dbeta

            
        ############################################################################
        #                             END OF YOUR CODE                             #
        ############################################################################

        return loss, grads
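When debugging a network like this, it helps to know in advance which parameters should exist and what shape each one has. The sketch below mirrors the initialization loop above without allocating any arrays. fc_param_shapes is a hypothetical helper written for illustration only.

```python
import math

def fc_param_shapes(input_dim, hidden_dims, num_classes, use_batchnorm=False):
    """Return {name: shape} for the parameters the initialization above creates."""
    dims = [input_dim] + list(hidden_dims) + [num_classes]
    num_layers = len(dims) - 1
    shapes = {}
    for i in range(1, num_layers + 1):
        shapes['W%d' % i] = (dims[i - 1], dims[i])
        shapes['b%d' % i] = (dims[i],)
        # The last affine layer has no batchnorm, so no gamma/beta there.
        if use_batchnorm and i < num_layers:
            shapes['gamma%d' % i] = (dims[i],)
            shapes['beta%d' % i] = (dims[i],)
    return shapes

shapes = fc_param_shapes(3 * 32 * 32, [100, 100], 10, use_batchnorm=True)
total = sum(math.prod(shape) for shape in shapes.values())
for name, shape in shapes.items():
    print(name, shape)
print("total parameters:", total)
```

Comparing this dictionary against the shapes in model.params is a quick way to spot an off-by-one in the layer indices.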

Next, build a three-layer model:

# TODO: Use a three-layer Net to overfit 50 training examples.

num_train = 50
small_data = {
  'X_train': data['X_train'][:num_train],
  'y_train': data['y_train'][:num_train],
  'X_val': data['X_val'],
  'y_val': data['y_val'],
}

weight_scale = 1e-2
learning_rate = 8e-3
model = FullyConnectedNet([100, 100],
              weight_scale=weight_scale, dtype=np.float64)
solver = Solver(model, small_data,
                print_every=10, num_epochs=20, batch_size=25,
                update_rule='sgd',
                optim_config={
                  'learning_rate': learning_rate,
                    
                },
                
         )
solver.train()

plt.plot(solver.loss_history, 'o')
plt.title('Training loss history')
plt.xlabel('Iteration')
plt.ylabel('Training loss')
plt.show()

And a five-layer one:

# TODO: Use a five-layer Net to overfit 50 training examples.

num_train = 50
small_data = {
  'X_train': data['X_train'][:num_train],
  'y_train': data['y_train'][:num_train],
  'X_val': data['X_val'],
  'y_val': data['y_val'],
}

learning_rate = 3e-4
weight_scale = 1e-1
model = FullyConnectedNet([100, 100, 100, 100],
                weight_scale=weight_scale, dtype=np.float64)
solver = Solver(model, small_data,
                print_every=10, num_epochs=20, batch_size=25,
                update_rule='sgd',
                optim_config={
                  'learning_rate': learning_rate,
                }
         )
solver.train()
# print(model.params)
plt.plot(solver.loss_history, 'o')
plt.title('Training loss history')
plt.xlabel('Iteration')
plt.ylabel('Training loss')
plt.show()

Next comes the momentum update rule, implemented in optim.py:

def sgd_momentum(w, dw, config=None):
    """
    Performs stochastic gradient descent with momentum.

    config format:
    - learning_rate: Scalar learning rate.
    - momentum: Scalar between 0 and 1 giving the momentum value.
      Setting momentum = 0 reduces to sgd.
    - velocity: A numpy array of the same shape as w and dw used to store a
      moving average of the gradients.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))

    next_w = None
    ###########################################################################
    # TODO: Implement the momentum update formula. Store the updated value in #
    # the next_w variable. You should also use and update the velocity v.     #
    ###########################################################################
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    config['velocity'] = v

    return next_w, config
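To see the effect of the velocity term, the following sketch applies the same update rule to a toy quadratic f(w) = 0.5 * ||w||^2, whose gradient is simply w. The functions momentum_step and final_loss are hypothetical and self-contained; they do not call the optim.py code.

```python
import numpy as np

def momentum_step(w, dw, v, learning_rate=1e-2, momentum=0.9):
    """One update with the same formula as sgd_momentum above."""
    v = momentum * v - learning_rate * dw
    return w + v, v

def final_loss(momentum, steps=100):
    """Minimize f(w) = 0.5 * ||w||^2 and return the loss after `steps` updates."""
    w = np.array([5.0, -3.0])
    v = np.zeros_like(w)
    for _ in range(steps):
        dw = w  # gradient of 0.5 * ||w||^2
        w, v = momentum_step(w, dw, v, momentum=momentum)
    return 0.5 * np.sum(w * w)

loss_plain = final_loss(momentum=0.0)     # reduces to plain SGD
loss_momentum = final_loss(momentum=0.9)
print("plain SGD:", loss_plain)
print("momentum :", loss_momentum)
```

With the same learning rate and number of steps, the momentum run ends much closer to the minimum on this problem.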

Then try two more adaptive gradient-descent variants, RMSprop and Adam:

def rmsprop(x, dx, config=None):
    """
    Uses the RMSProp update rule, which uses a moving average of squared
    gradient values to set adaptive per-parameter learning rates.

    config format:
    - learning_rate: Scalar learning rate.
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
      gradient cache.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - cache: Moving average of second moments of gradients.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(x))

    next_x = None
    ###########################################################################
    # TODO: Implement the RMSprop update formula, storing the next value of x #
    # in the next_x variable. Don't forget to update cache value stored in    #
    # config['cache'].                                                        #
    ###########################################################################
    config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dx**2
    next_x = x - config['learning_rate'] * dx / (np.sqrt(config['cache']) + config['epsilon'])
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return next_x, config


def adam(x, dx, config=None):
    """
    Uses the Adam update rule, which incorporates moving averages of both the
    gradient and its square and a bias correction term.

    config format:
    - learning_rate: Scalar learning rate.
    - beta1: Decay rate for moving average of first moment of gradient.
    - beta2: Decay rate for moving average of second moment of gradient.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - m: Moving average of gradient.
    - v: Moving average of squared gradient.
    - t: Iteration number.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(x))
    config.setdefault('v', np.zeros_like(x))
    config.setdefault('t', 1)

    next_x = None
    ###########################################################################
    # TODO: Implement the Adam update formula, storing the next value of x in #
    # the next_x variable. Don't forget to update the m, v, and t variables   #
    # stored in config.                                                       #
    ###########################################################################
    config['t'] += 1
    config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx
    mt = config['m'] / (1 - config['beta1']**config['t'])
    config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * dx**2
    vt = config['v'] / (1 - config['beta2']**config['t'])
    next_x = x - config['learning_rate'] * mt / (np.sqrt(vt) + config['epsilon'])

    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################

    return next_x, config
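The bias correction matters most in the first few iterations, when m and v are still close to their zero initialization. The sketch below computes the size of the very first update for gradients of very different magnitudes. In the code above t defaults to 1 and is incremented before use, so the first update there uses t = 2; the sketch shows both t = 1 and t = 2 for comparison. adam_first_update is a hypothetical helper.

```python
import numpy as np

def adam_first_update(dx, t, learning_rate=1e-3, beta1=0.9, beta2=0.999,
                      epsilon=1e-8):
    """Step taken by the first Adam update when m and v start at zero and the
    bias correction is evaluated at iteration number t."""
    m = (1 - beta1) * dx
    v = (1 - beta2) * dx ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)

dx = np.array([1e-4, 1.0, 1e4])  # gradients spanning eight orders of magnitude
step_t1 = adam_first_update(dx, t=1)
step_t2 = adam_first_update(dx, t=2)
print("t = 1:", step_t1)
print("t = 2:", step_t2)
```

In both cases the step size is nearly independent of the gradient scale. With t = 1 it is approximately the learning rate itself; with t = 2 it comes out at roughly three quarters of it.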

DeepLearning.ai assignment (4-1): Convolutional Neural Networks (Foundations of CNN)

2018-09-30T08:07:23.000Z

This week's assignment has two parts:

- Building a convolutional neural network model
- Training a convolutional neural network with TensorFlow

Part1：Convolutional Neural Networks: Step by Step

Main contents:

Convolution functions:
- Zero Padding
- Convolve window
- Convolution forward
- Convolution backward (optional)
Pooling functions:
- Pooling forward
- Create mask
- Distribute value
- Pooling backward (optional)

Convolutional Neural Networks

The main functions for building a CNN

1. Zero Padding

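First create a padding function that takes a batch of images X and returns the padded images. It uses np.pad(), for example:

a = np.pad(a, ((0,0), (1,1), (0,0), (3,3), (0,0)), 'constant', constant_values = (..,..))

Here a has 5 dimensions: axis 1 (the second dimension) is padded with 1 element on each side and axis 3 (the fourth dimension) with 3 elements on each side, while the other axes are left unpadded. constant_values gives the values to fill in on the two sides.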
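A quick way to confirm which axes get padded is to run the call on a small array and look at the resulting shape. This is a minimal sketch with an arbitrary 5-dimensional array of ones and zero as the fill value:

```python
import numpy as np

a = np.ones((2, 2, 2, 2, 2))
padded = np.pad(a, ((0, 0), (1, 1), (0, 0), (3, 3), (0, 0)),
                'constant', constant_values=0)

print(a.shape)       # original shape
print(padded.shape)  # axis 1 grows by 1 + 1, axis 3 grows by 3 + 3
```

Only axes 1 and 3 change size, and the sum of the array stays the same because the padding is all zeros.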


def zero_pad(X, pad):
    """
    Pad with zeros all images of the dataset X. The padding is applied to the height and width of an image, 
    as illustrated in Figure 1.
    
    Argument:
    X -- python numpy array of shape (m, n_H, n_W, n_C) representing a batch of m images
    pad -- integer, amount of padding around each image on vertical and horizontal dimensions
    
    Returns:
    X_pad -- padded image of shape (m, n_H + 2*pad, n_W + 2*pad, n_C)
    """
    
    ### START CODE HERE ### (≈ 1 line)
    X_pad = np.pad(X, ((0,0),(pad,pad),(pad,pad),(0,0)), 'constant', constant_values=(0,0))
    ### END CODE HERE ###
    
    return X_pad

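2. Single step of convolution

Implement a single convolution step: take one slice of the input with the same size as the filter, multiply it element-wise with the filter, sum the result, and finally add the bias term.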

# GRADED FUNCTION: conv_single_step

def conv_single_step(a_slice_prev, W, b):
    """
    Apply one filter defined by parameters W on a single slice (a_slice_prev) of the output activation 
    of the previous layer.
    
    Arguments:
    a_slice_prev -- slice of input data of shape (f, f, n_C_prev)
    W -- Weight parameters contained in a window - matrix of shape (f, f, n_C_prev)
    b -- Bias parameters contained in a window - matrix of shape (1, 1, 1)
    
    Returns:
    Z -- a scalar value, result of convolving the sliding window (W, b) on a slice x of the input data
    """

    ### START CODE HERE ### (≈ 2 lines of code)
    # Element-wise product between a_slice and W. Do not add the bias yet.
    s = a_slice_prev * W
    # Sum over all entries of the volume s.
    Z = np.sum(s)
    # Add bias b to Z. Cast b to a float() so that Z results in a scalar value.
    Z = Z + float(b)
    ### END CODE HERE ###

    return Z

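3. Convolutional Neural Networks - Forward pass

Implement the full convolution by looping the single step above over the output volume. When slicing, pay attention to the boundaries vert_start, vert_end, horiz_start and horiz_end.

Before writing the loops, make sure the shapes of A_prev, A, W and b are clear; the hyperparameters are stride and pad.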

$$ n_H = \lfloor \frac{n_{H_{prev}} - f + 2 \times pad}{stride} \rfloor +1 $$
$$ n_W = \lfloor \frac{n_{W_{prev}} - f + 2 \times pad}{stride} \rfloor +1 $$
$$ n_C = \text{number of filters used in the convolution}$$
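The size formula is easy to check on a few concrete cases. The sketch below uses a hypothetical helper, conv_output_size, with integer floor division in place of int(); the two agree for the non-negative values that occur here.

```python
def conv_output_size(n_prev, f, pad, stride):
    """Output height or width: floor((n_prev - f + 2 * pad) / stride) + 1."""
    return (n_prev + 2 * pad - f) // stride + 1

# 32x32 input, 3x3 filter, pad 1, stride 1: size is preserved ("same" convolution)
same = conv_output_size(32, f=3, pad=1, stride=1)
# 32x32 input, 5x5 filter, no padding, stride 1: size shrinks by f - 1
valid = conv_output_size(32, f=5, pad=0, stride=1)
# 8x8 input, 3x3 filter, no padding, stride 2: (8 - 3) / 2 = 2.5 is floored to 2
strided = conv_output_size(8, f=3, pad=0, stride=2)

print(same, valid, strided)
```

The pooling layer uses the same expression with pad fixed at 0.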

# GRADED FUNCTION: conv_forward

def conv_forward(A_prev, W, b, hparameters):
    """
    Implements the forward propagation for a convolution function
    
    Arguments:
    A_prev -- output activations of the previous layer, numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)
    W -- Weights, numpy array of shape (f, f, n_C_prev, n_C)
    b -- Biases, numpy array of shape (1, 1, 1, n_C)
    hparameters -- python dictionary containing "stride" and "pad"
        
    Returns:
    Z -- conv output, numpy array of shape (m, n_H, n_W, n_C)
    cache -- cache of values needed for the conv_backward() function
    """
    
    ### START CODE HERE ###
    # Retrieve dimensions from A_prev's shape (≈1 line)  
    (m, n_H_prev, n_W_prev, n_C_prev) = A_prev.shape
    
    # Retrieve dimensions from W's shape (≈1 line)
    (f, f, n_C_prev, n_C) = W.shape
    
    # Retrieve information from "hparameters" (≈2 lines)
    stride = hparameters['stride']
    pad = hparameters['pad']
    
    # Compute the dimensions of the CONV output volume using the formula given above. Hint: use int() to floor. (≈2 lines)
    n_H = int((n_H_prev + 2 * pad - f) / stride + 1)
    n_W = int((n_W_prev + 2 * pad - f) / stride + 1)

    # Initialize the output volume Z with zeros. (≈1 line)
    Z = np.zeros((m, n_H, n_W, n_C))
    
    # Create A_prev_pad by padding A_prev
    A_prev_pad = zero_pad(A_prev, pad)
    
    for i in range(m):                               # loop over the batch of training examples
        a_prev_pad = A_prev_pad[i]                               # Select ith training example's padded activation
        for h in range(n_H):                           # loop over vertical axis of the output volume
            for w in range(n_W):                       # loop over horizontal axis of the output volume
                for c in range(n_C):                   # loop over channels (= #filters) of the output volume
                    
                    # Find the corners of the current "slice" (≈4 lines)
                    vert_start = h * stride
                    vert_end = h * stride + f
                    horiz_start = w * stride
                    horiz_end = w * stride + f
                    
                    # Use the corners to define the (3D) slice of a_prev_pad (See Hint above the cell). (≈1 line)
                    a_slice_prev = a_prev_pad[vert_start : vert_end, horiz_start : horiz_end]
                    
                    # Convolve the (3D) slice with the correct filter W and bias b, to get back one output neuron. (≈1 line)
                    Z[i, h, w, c] = conv_single_step(a_slice_prev,W[:,:,:,c],b[:,:,:,c])
                                        
    ### END CODE HERE ###
    
    # Making sure your output shape is correct
    assert(Z.shape == (m, n_H, n_W, n_C))
    
    # Save information in "cache" for the backprop
    cache = (A_prev, W, b, hparameters)
    
    return Z, cache

Pooling layer

Implement the pooling layer. Note that the output dimensions must be rounded down, using int() to convert the float result.

$$ n_H = \lfloor \frac{n_{H_{prev}} - f}{stride} \rfloor +1 $$
$$ n_W = \lfloor \frac{n_{W_{prev}} - f}{stride} \rfloor +1 $$
$$ n_C = n_{C_{prev}}$$

As before, first take the slice; then there are two modes, max and average, computed with np.max and np.mean respectively.

def pool_forward(A_prev, hparameters, mode = "max"):
    """
    Implements the forward pass of the pooling layer
    
    Arguments:
    A_prev -- Input data, numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)
    hparameters -- python dictionary containing "f" and "stride"
    mode -- the pooling mode you would like to use, defined as a string ("max" or "average")
    
    Returns:
    A -- output of the pool layer, a numpy array of shape (m, n_H, n_W, n_C)
    cache -- cache used in the backward pass of the pooling layer, contains the input and hparameters 
    """
    
    # Retrieve dimensions from the input shape
    (m, n_H_prev, n_W_prev, n_C_prev) = A_prev.shape
    
    # Retrieve hyperparameters from "hparameters"
    f = hparameters["f"]
    stride = hparameters["stride"]
    
    # Define the dimensions of the output
    n_H = int(1 + (n_H_prev - f) / stride)
    n_W = int(1 + (n_W_prev - f) / stride)
    n_C = n_C_prev
    
    # Initialize output matrix A
    A = np.zeros((m, n_H, n_W, n_C))              
    
    ### START CODE HERE ###
    for i in range(m):                         # loop over the training examples
        for h in range(n_H):                     # loop on the vertical axis of the output volume
            for w in range(n_W):                 # loop on the horizontal axis of the output volume
                for c in range (n_C):            # loop over the channels of the output volume
                    
                    # Find the corners of the current "slice" (≈4 lines)
                    vert_start = h * stride
                    vert_end = vert_start + f
                    horiz_start = w * stride
                    horiz_end = horiz_start + f
                    
                    # Use the corners to define the current slice on the ith training example of A_prev, channel c. (≈1 line)
                    a_prev_slice = A_prev[i, vert_start : vert_end, horiz_start : horiz_end, c]
                    
                    # Compute the pooling operation on the slice. Use an if statement to differentiate the modes. Use np.max/np.mean.
                    if mode == "max":
                        A[i, h, w, c] = np.max(a_prev_slice)
                    elif mode == "average":
                        A[i, h, w, c] = np.mean(a_prev_slice)
    
    ### END CODE HERE ###
    
    # Store the input and hparameters in "cache" for pool_backward()
    cache = (A_prev, hparameters)
    
    # Making sure your output shape is correct
    assert(A.shape == (m, n_H, n_W, n_C))
    
    return A, cache

Backpropagation in convolutional neural networks

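Differentiating through a convolutional neural network is fairly hard to grasp. Two backward passes are involved: the convolutional layer and the pooling layer.

1. Convolutional layer backward pass

Suppose the output of the convolutional layer is $Z = W \times A + b$.

Backpropagation then has to produce $dA, dW, db$, where $dA$ is the gradient with respect to the layer's input and so has one entry for every pixel of the input image.

The gradient $dZ$ coming from the later layers is assumed to be known at this point.

1. Computing dA

From the formula, $dA = W \times dZ$. More concretely, each slice of $dA$ accumulates $W_c$ times $dZ$, summed over every pixel of the output. In matrix terms, each product $W_c \times dZ_{hw}$ maps a single output pixel back to the input slice (the same size as the filter $W_c$) that produced it. Hence:

$$ dA += \sum_{h=0}^{n_H - 1} \sum_{w=0}^{n_W - 1} W_c \times dZ_{hw} $$

da_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :] += W[:,:,:,c] * dZ[i, h, w, c]

2. Computing dW

$dW = A \times dZ$. More concretely, because $W$ contributes to every pixel of $Z$, the gradient is each input slice multiplied by the derivative of the corresponding output pixel, summed over all output pixels:

$$ dW_c += \sum_{h=0}^{n_H - 1} \sum_{w=0}^{n_W - 1} a_{slice} \times dZ_{hw} $$

dW[:,:,:,c] += a_slice * dZ[i, h, w, c]

3. Computing db

$$ db = \sum_h \sum_w dZ_{hw} $$

db[:,:,:,c] += dZ[i, h, w, c]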
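The three update rules can be checked numerically on a small case. The sketch below is not part of the assignment: it assumes a single example with one input channel and one filter, and `conv_single` and `conv_single_backward` are hypothetical helper names. It compares the accumulated `dW` against a finite-difference estimate for the loss $L = \sum Z \cdot dZ$.

```python
import numpy as np

def conv_single(a, W, b, stride=1):
    """Valid convolution (cross-correlation) of a 2-D input with one 2-D filter."""
    f = W.shape[0]
    n_H = (a.shape[0] - f) // stride + 1
    n_W = (a.shape[1] - f) // stride + 1
    Z = np.zeros((n_H, n_W))
    for h in range(n_H):
        for w in range(n_W):
            vs, hs = h * stride, w * stride
            Z[h, w] = np.sum(a[vs:vs + f, hs:hs + f] * W) + b
    return Z

def conv_single_backward(dZ, a, W, stride=1):
    """Accumulate da, dW, db with the same three updates as in the text."""
    f = W.shape[0]
    da = np.zeros_like(a)
    dW = np.zeros_like(W)
    db = 0.0
    for h in range(dZ.shape[0]):
        for w in range(dZ.shape[1]):
            vs, hs = h * stride, w * stride
            da[vs:vs + f, hs:hs + f] += W * dZ[h, w]   # dA update
            dW += a[vs:vs + f, hs:hs + f] * dZ[h, w]   # dW update
            db += dZ[h, w]                             # db update
    return da, dW, db

rng = np.random.default_rng(0)
a = rng.standard_normal((5, 5))
W = rng.standard_normal((3, 3))
b = 0.5
dZ = rng.standard_normal((3, 3))   # stand-in for the upstream gradient

da, dW, db = conv_single_backward(dZ, a, W)

# Finite-difference estimate of dL/dW[0, 0] for L = sum(Z * dZ)
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric_dW00 = (np.sum(conv_single(a, W_plus, b) * dZ)
                - np.sum(conv_single(a, W_minus, b) * dZ)) / (2 * eps)
print(abs(numeric_dW00 - dW[0, 0]))   # should be very close to 0
```

The same comparison can be repeated for any entry of `da`, which is a quick way to catch indexing mistakes in the loops.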

Putting these together gives:

def conv_backward(dZ, cache):
    """
    Implement the backward propagation for a convolution function
    
    Arguments:
    dZ -- gradient of the cost with respect to the output of the conv layer (Z), numpy array of shape (m, n_H, n_W, n_C)
    cache -- cache of values needed for the conv_backward(), output of conv_forward()
    
    Returns:
    dA_prev -- gradient of the cost with respect to the input of the conv layer (A_prev),
               numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)
    dW -- gradient of the cost with respect to the weights of the conv layer (W)
          numpy array of shape (f, f, n_C_prev, n_C)
    db -- gradient of the cost with respect to the biases of the conv layer (b)
          numpy array of shape (1, 1, 1, n_C)
    """
    
    ### START CODE HERE ###
    # Retrieve information from "cache"
    (A_prev, W, b, hparameters) = cache
    
    # Retrieve dimensions from A_prev's shape
    (m, n_H_prev, n_W_prev, n_C_prev) = A_prev.shape
    
    # Retrieve dimensions from W's shape
    (f, f, n_C_prev, n_C) = W.shape
    
    # Retrieve information from "hparameters"
    stride = hparameters['stride']
    pad = hparameters['pad']
    
    # Retrieve dimensions from dZ's shape
    (m, n_H, n_W, n_C) = dZ.shape
    
    # Initialize dA_prev, dW, db with the correct shapes
    dA_prev = np.zeros(A_prev.shape)                           
    dW = np.zeros(W.shape)
    db = np.zeros(b.shape)

    # Pad A_prev and dA_prev
    A_prev_pad = zero_pad(A_prev, pad)
    dA_prev_pad = zero_pad(dA_prev, pad)
    
    for i in range(m):                       # loop over the training examples
        
        # select ith training example from A_prev_pad and dA_prev_pad
        a_prev_pad = A_prev_pad[i]
        da_prev_pad = dA_prev_pad[i]
        
        for h in range(n_H):                   # loop over vertical axis of the output volume
            for w in range(n_W):               # loop over horizontal axis of the output volume
                for c in range(n_C):           # loop over the channels of the output volume
                    
                    # Find the corners of the current "slice"
                    vert_start = h * stride
                    vert_end = h * stride + f
                    horiz_start = w * stride
                    horiz_end = w * stride + f
                    
                    # Use the corners to define the slice from a_prev_pad
                    a_slice = a_prev_pad[vert_start : vert_end, horiz_start : horiz_end, : ]

                    # Update gradients for the window and the filter's parameters using the code formulas given above
                    da_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :] += W[:,:,:,c] * dZ[ i, h, w ,c]

                    dW[:,:,:,c] += a_slice * dZ[ i, h, w ,c]
                    db[:,:,:,c] += dZ[ i, h, w ,c]
                    
        # Set the ith training example's dA_prev to the unpadded da_prev_pad.
        # Explicit end indices are used instead of X[pad:-pad, pad:-pad, :] so that pad == 0 also works.
        dA_prev[i, :, :, :] = da_prev_pad[pad:pad + n_H_prev, pad:pad + n_W_prev, :]
    ### END CODE HERE ###
    
    # Making sure your output shape is correct
    assert(dA_prev.shape == (m, n_H_prev, n_W_prev, n_C_prev))
    
    return dA_prev, dW, db

Pooling layer - backward pass

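Max pooling and average pooling have to be handled separately here.

1. Max pooling - backward pass

Suppose the pool size is $2 \times 2$. Only one of the four pixels survives the forward pass; the other three have no effect on the output. In max pooling, therefore, the gradient coming from the later layers is passed on unchanged to the element that was the maximum, and the gradient of the other three elements is 0.

Create a mask matrix that is 1 at the maximum and 0 everywhere else; it then serves as the mapping that routes the gradient back to the input.

$$ X = \begin{bmatrix}1 & 3 \\ 4 & 2 \end{bmatrix} \quad \rightarrow \quad M =\begin{bmatrix}
0 & 0 \\
1 & 0
\end{bmatrix}$$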

def create_mask_from_window(x):
    """
    Creates a mask from an input matrix x, to identify the max entry of x.
    
    Arguments:
    x -- Array of shape (f, f)
    
    Returns:
    mask -- Array of the same shape as window, contains a True at the position corresponding to the max entry of x.
    """
    
    ### START CODE HERE ### (≈1 line)
    mask = (x == np.max(x))
    ### END CODE HERE ###
    
    return mask
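As a quick illustration, the mask can be applied to the $2 \times 2$ matrix $X$ from the formula above. The upstream gradient value `dA_hw = 5.0` and the tied example are made up for this sketch.

```python
import numpy as np

def create_mask_from_window(x):
    """Same one-line mask as above: True where x equals its maximum."""
    return x == np.max(x)

X = np.array([[1, 3],
              [4, 2]])
mask = create_mask_from_window(X)
print(mask)            # [[False False]
                       #  [ True False]]

# Route a made-up upstream gradient to the position of the maximum.
dA_hw = 5.0
dX = mask * dA_hw
print(dX)              # [[0. 0.]
                       #  [5. 0.]]

# If the window contains ties, every tied maximum is marked.
tied = np.array([[4, 4],
                 [1, 0]])
tied_mask = create_mask_from_window(tied)
print(tied_mask.sum())  # 2
```

With tied maxima this mask sends the full gradient to each tied position, which is worth keeping in mind when inputs are integers or heavily quantized.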

2. Average pooling - backward pass

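Unlike max pooling, average pooling splits the value passed back into $n_H \times n_W$ equal parts, where $(n_H, n_W)$ is the shape of the pooling window, so every element of the window receives the same share of the gradient instead of only the maximum receiving it. More entries of the gradient are therefore nonzero than in max pooling, although neither kind of pooling has any trainable parameters. In practice max pooling is the more commonly used of the two.

$$ dZ = 1 \quad \rightarrow \quad dZ =\begin{bmatrix}
1/4 & 1/4 \\
1/4 & 1/4
\end{bmatrix}$$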

def distribute_value(dz, shape):
    """
    Distributes the input value in the matrix of dimension shape
    
    Arguments:
    dz -- input scalar
    shape -- the shape (n_H, n_W) of the output matrix for which we want to distribute the value of dz
    
    Returns:
    a -- Array of size (n_H, n_W) for which we distributed the value of dz
    """
    
    ### START CODE HERE ###
    # Retrieve dimensions from shape (≈1 line)
    (n_H, n_W) = shape
    
    # Compute the value to distribute on the matrix (≈1 line)
    average = n_H * n_W
    
    # Create a matrix where every entry is the "average" value (≈1 line)
    a = dz / average * np.ones((n_H, n_W))
    ### END CODE HERE ###
    
    return a
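A short usage sketch of the same function, reproducing the $1/4$ example from the formula above. The second call with `dz = 6.0` is an arbitrary value chosen to show that the distributed entries always sum back to `dz`.

```python
import numpy as np

def distribute_value(dz, shape):
    """Spread the scalar dz evenly over a matrix of the given shape."""
    n_H, n_W = shape
    return dz / (n_H * n_W) * np.ones((n_H, n_W))

quarter = distribute_value(1.0, (2, 2))
print(quarter)         # [[0.25 0.25]
                       #  [0.25 0.25]]

spread = distribute_value(6.0, (2, 3))
print(spread.sum())    # 6.0
```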

Combining the two methods:

def pool_backward(dA, cache, mode = "max"):
    """
    Implements the backward pass of the pooling layer
    
    Arguments:
    dA -- gradient of cost with respect to the output of the pooling layer, same shape as A
    cache -- cache output from the forward pass of the pooling layer, contains the layer's input and hparameters 
    mode -- the pooling mode you would like to use, defined as a string ("max" or "average")
    
    Returns:
    dA_prev -- gradient of cost with respect to the input of the pooling layer, same shape as A_prev
    """
    
    ### START CODE HERE ###
    
    # Retrieve information from cache (≈1 line)
    (A_prev, hparameters) = cache
    
    # Retrieve hyperparameters from "hparameters" (≈2 lines)
    stride = hparameters['stride']
    f = hparameters['f']
    
    # Retrieve dimensions from A_prev's shape and dA's shape (≈2 lines)
    m, n_H_prev, n_W_prev, n_C_prev = A_prev.shape
    m, n_H, n_W, n_C = dA.shape
    
    # Initialize dA_prev with zeros (≈1 line)
    dA_prev = np.zeros(A_prev.shape)
    
    for i in range(m):                       # loop over the training examples
        
        # select training example from A_prev (≈1 line)
        a_prev = A_prev[i]
        
        for h in range(n_H):                   # loop on the vertical axis
            for w in range(n_W):               # loop on the horizontal axis
                for c in range(n_C):           # loop over the channels (depth)
                    
                    # Find the corners of the current "slice" (≈4 lines)
                    vert_start = h * stride
                    vert_end = vert_start + f
                    horiz_start = w * stride
                    horiz_end = horiz_start + f
                    
                    # Compute the backward propagation in both modes.
                    if mode == "max":
                        
                        # Use the corners and "c" to define the current slice from a_prev (≈1 line)
                        a_prev_slice = a_prev[vert_start : vert_end, horiz_start : horiz_end, c]
                        # Create the mask from a_prev_slice (≈1 line)
                        mask = create_mask_from_window(a_prev_slice)
                        # Set dA_prev to be dA_prev + (the mask multiplied by the correct entry of dA) (≈1 line)
                        dA_prev[i, vert_start: vert_end, horiz_start: horiz_end, c] += mask * dA[i, h, w, c]
                        
                    elif mode == "average":
                        
                        # Get the value a from dA (≈1 line)
                        da = dA[i, h, w, c]
                        # Define the shape of the filter as fxf (≈1 line)
                        shape = (f, f)
                        # Distribute it to get the correct slice of dA_prev. i.e. Add the distributed value of da. (≈1 line)
                        dA_prev[i, vert_start: vert_end, horiz_start: horiz_end, c] += distribute_value(da, shape)
                        
    ### END CODE ###
    
    # Making sure your output shape is correct
    assert(dA_prev.shape == A_prev.shape)
    
    return dA_prev
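To see both modes side by side, here is a reduced sketch for a single example and a single channel. `pool_backward_2d` is a hypothetical helper, not the assignment's function, and the input values are made up. With non-overlapping windows and no ties, both modes preserve the total amount of gradient.

```python
import numpy as np

def pool_backward_2d(dA, A_prev, f, stride, mode="max"):
    """Single-example, single-channel version of the loop in pool_backward."""
    dA_prev = np.zeros_like(A_prev, dtype=float)
    n_H, n_W = dA.shape
    for h in range(n_H):
        for w in range(n_W):
            vs, hs = h * stride, w * stride
            window = A_prev[vs:vs + f, hs:hs + f]
            if mode == "max":
                mask = (window == np.max(window))
                dA_prev[vs:vs + f, hs:hs + f] += mask * dA[h, w]
            elif mode == "average":
                dA_prev[vs:vs + f, hs:hs + f] += dA[h, w] / (f * f)
    return dA_prev

A_prev = np.array([[1., 3., 2., 0.],
                   [4., 2., 1., 5.],
                   [0., 1., 7., 2.],
                   [6., 3., 4., 8.]])
dA = np.array([[1., 2.],
               [3., 4.]])

d_max = pool_backward_2d(dA, A_prev, f=2, stride=2, mode="max")
d_avg = pool_backward_2d(dA, A_prev, f=2, stride=2, mode="average")

print(d_max)
# [[0. 0. 0. 0.]
#  [1. 0. 0. 2.]
#  [0. 0. 0. 0.]
#  [3. 0. 0. 4.]]
print(d_avg[0, 0], d_avg[3, 3])   # 0.25 1.0
```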

Part 2: Convolutional Neural Networks: Application

Build a convolutional neural network with TensorFlow.

1. Create placeholders

First create the placeholders, which are used to feed X and Y during training.


def create_placeholders(n_H0, n_W0, n_C0, n_y):
    """
    Creates the placeholders for the tensorflow session.
    
    Arguments:
    n_H0 -- scalar, height of an input image
    n_W0 -- scalar, width of an input image
    n_C0 -- scalar, number of channels of the input
    n_y -- scalar, number of classes
        
    Returns:
    X -- placeholder for the data input, of shape [None, n_H0, n_W0, n_C0] and dtype "float"
    Y -- placeholder for the input labels, of shape [None, n_y] and dtype "float"
    """

    ### START CODE HERE ### (≈2 lines)
    X = tf.placeholder(tf.float32, shape=(None,n_H0, n_W0, n_C0))
    Y = tf.placeholder(tf.float32, shape=(None,n_y))
    ### END CODE HERE ###
    
    return X, Y

2. Initialize parameters

This initializes the parameters, mainly W1 and W2; no bias b is used here.

Use W = tf.get_variable("W", [1,2,3,4], initializer = ...)

For the initializer, use tf.contrib.layers.xavier_initializer

# GRADED FUNCTION: initialize_parameters

def initialize_parameters():
    """
    Initializes weight parameters to build a neural network with tensorflow. The shapes are:
                        W1 : [4, 4, 3, 8]
                        W2 : [2, 2, 8, 16]
    Returns:
    parameters -- a dictionary of tensors containing W1, W2
    """
    
    tf.set_random_seed(1)                              # so that your "random" numbers match ours
        
    ### START CODE HERE ### (approx. 2 lines of code)
    W1 = tf.get_variable('W1', [4, 4, 3, 8],initializer= tf.contrib.layers.xavier_initializer(seed = 0 ))
    W2 = tf.get_variable('W2', [2, 2, 8, 16],initializer= tf.contrib.layers.xavier_initializer(seed = 0))
    ### END CODE HERE ###

    parameters = {"W1": W1,
                  "W2": W2}
    
    return parameters

Remember that this only builds the graph; the parameters are not actually initialized yet. At run time you still need:

init = tf.global_variables_initializer()

sess_test.run(init)

3. Forward propagation

The model is: CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FULLYCONNECTED

- Conv2D: stride 1, padding is "SAME"
- ReLU
- Max pool: Use an 8 by 8 filter size and an 8 by 8 stride, padding is "SAME"
- Conv2D: stride 1, padding is "SAME"
- ReLU
- Max pool: Use a 4 by 4 filter size and a 4 by 4 stride, padding is "SAME"
- Flatten the previous output.
- FULLYCONNECTED (FC) layer: no activation function is needed here, because softmax is applied later when the cost is computed.
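The layer list above can be traced shape by shape. The sketch below is plain Python rather than TensorFlow; it assumes the $64 \times 64 \times 3$ input and the filter shapes W1 = [4, 4, 3, 8] and W2 = [2, 2, 8, 16] used in this assignment, and relies on the rule that with "SAME" padding the spatial output size is ceil(n / stride). `same_out` is a hypothetical helper name.

```python
import math

def same_out(n, stride):
    """Spatial size after an op with 'SAME' padding: ceil(n / stride)."""
    return math.ceil(n / stride)

n, c = 64, 3                      # input image: 64 x 64 x 3
shapes = [(n, n, c)]

n = same_out(n, 1)                # CONV2D with W1, stride 1
c = 8                             # W1 has 8 filters
shapes.append((n, n, c))

n = same_out(n, 8)                # MAXPOOL 8x8, stride 8
shapes.append((n, n, c))

n = same_out(n, 1)                # CONV2D with W2, stride 1
c = 16                            # W2 has 16 filters
shapes.append((n, n, c))

n = same_out(n, 4)                # MAXPOOL 4x4, stride 4
shapes.append((n, n, c))

flat = n * n * c                  # FLATTEN
print(shapes)   # [(64, 64, 3), (64, 64, 8), (8, 8, 8), (8, 8, 16), (2, 2, 16)]
print(flat)     # 64
```

So under these assumptions the fully connected layer maps 64 features per example to the 6 output classes.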

Functions used:

tf.nn.conv2d(X, W1, strides = [1,s,s,1], padding = 'SAME'): given an input $X$ and a group of filters $W1$, this function convolves $W1$'s filters on X. The third input ([1,s,s,1]) represents the strides for each dimension of the input (m, n_H_prev, n_W_prev, n_C_prev).
tf.nn.max_pool(A, ksize = [1,f,f,1], strides = [1,s,s,1], padding = 'SAME'): given an input A, this function uses a window of size (f, f) and strides of size (s, s) to carry out max pooling over each window.
tf.nn.relu(Z1): computes the elementwise ReLU of Z1 (which can be any shape).
tf.contrib.layers.flatten(P): given an input P, this function flattens each example into a 1D vector while maintaining the batch size. It returns a flattened tensor with shape [batch_size, k].
tf.contrib.layers.fully_connected(F, num_outputs): given the flattened input F, it returns the output computed using a fully connected layer.

In the last function above (tf.contrib.layers.fully_connected), the fully connected layer automatically initializes weights in the graph and keeps on training them as you train the model. Hence, you did not need to initialize those weights when initializing the parameters.

# GRADED FUNCTION: forward_propagation

def forward_propagation(X, parameters):
    """
    Implements the forward propagation for the model:
    CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FULLYCONNECTED
    
    Arguments:
    X -- input dataset placeholder, of shape (number of examples, n_H0, n_W0, n_C0)
    parameters -- python dictionary containing your parameters "W1", "W2"
                  the shapes are given in initialize_parameters

    Returns:
    Z3 -- the output of the last LINEAR unit
    """
    
    # Retrieve the parameters from the dictionary "parameters" 
    W1 = parameters['W1']
    W2 = parameters['W2']
    
    ### START CODE HERE ###
    # CONV2D: stride of 1, padding 'SAME'
    Z1 = tf.nn.conv2d(X, filter=W1, strides=[1,1,1,1],padding='SAME')
    # RELU
    A1 = tf.nn.relu(Z1)
    # MAXPOOL: window 8x8, stride 8, padding 'SAME'
    P1 = tf.nn.max_pool(A1,ksize=[1, 8, 8, 1], strides=[1, 8, 8, 1],padding='SAME')
    # CONV2D: filters W2, stride 1, padding 'SAME'
    Z2 = tf.nn.conv2d(P1, filter=W2, strides=[1, 1, 1, 1],padding='SAME')
    # RELU
    A2 = tf.nn.relu(Z2)
    # MAXPOOL: window 4x4, stride 4, padding 'SAME'
    P2 = tf.nn.max_pool(A2,ksize=[1, 4, 4, 1], strides=[1, 4, 4, 1],padding='SAME')
    # FLATTEN
    P2 = tf.contrib.layers.flatten(P2)
    # FULLY-CONNECTED without non-linear activation function (do not call softmax).
    # 6 neurons in output layer. Hint: one of the arguments should be "activation_fn=None" 
    Z3 = tf.contrib.layers.fully_connected(P2, 6,activation_fn=None)
    ### END CODE HERE ###

    return Z3

4. Compute cost

tf.nn.softmax_cross_entropy_with_logits(logits = Z3, labels = Y): computes the softmax cross-entropy loss. This function computes both the softmax activation and the resulting cross-entropy loss, so the two steps do not have to be written separately.
tf.reduce_mean: computes the mean of elements across dimensions of a tensor. Use this to average the losses over all the examples to get the overall cost.
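To make the two steps concrete, here is a NumPy sketch of what the combination computes. It is an illustration of the math under the assumption of one-hot labels, not TensorFlow's implementation, and `softmax_cross_entropy` is a hypothetical helper name.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Per-example cross-entropy between softmax(logits) and one-hot labels."""
    # Subtract the row maximum for numerical stability; softmax is unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

# Two examples, six classes (made-up logits).
logits = np.array([[0., 0., 0., 0., 0., 0.],
                   [10., 0., 0., 0., 0., 0.]])
labels = np.array([[1., 0., 0., 0., 0., 0.],
                   [1., 0., 0., 0., 0., 0.]])

losses = softmax_cross_entropy(logits, labels)
cost = losses.mean()              # the role tf.reduce_mean plays in compute_cost
print(losses[0], np.log(6))       # uniform prediction over 6 classes gives ln 6
```

The first example shows the loss of an untrained, uniform prediction; the second shows that a confident correct prediction drives the loss toward 0.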

# GRADED FUNCTION: compute_cost 

def compute_cost(Z3, Y):
    """
    Computes the cost
    
    Arguments:
    Z3 -- output of forward propagation (output of the last LINEAR unit), of shape (number of examples, 6)
    Y -- "true" labels vector placeholder, same shape as Z3
    
    Returns:
    cost - Tensor of the cost function
    """
    
    ### START CODE HERE ### (1 line of code)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=Z3,labels=Y))
    ### END CODE HERE ###
    
    return cost

5. Model

Combine the functions above into one complete model.

random_mini_batches() is already provided, and the optimizer used is

optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

# GRADED FUNCTION: model

def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.009,
          num_epochs = 100, minibatch_size = 64, print_cost = True):
    """
    Implements a three-layer ConvNet in Tensorflow:
    CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FULLYCONNECTED
    
    Arguments:
    X_train -- training set, of shape (None, 64, 64, 3)
    Y_train -- training set labels, of shape (None, n_y = 6)
    X_test -- test set, of shape (None, 64, 64, 3)
    Y_test -- test set labels, of shape (None, n_y = 6)
    learning_rate -- learning rate of the optimization
    num_epochs -- number of epochs of the optimization loop
    minibatch_size -- size of a minibatch
    print_cost -- True to print the cost every 5 epochs
    
    Returns:
    train_accuracy -- real number, accuracy on the train set (X_train)
    test_accuracy -- real number, testing accuracy on the test set (X_test)
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    
    ops.reset_default_graph()                         # to be able to rerun the model without overwriting tf variables
    tf.set_random_seed(1)                             # to keep results consistent (tensorflow seed)
    seed = 3                                          # to keep results consistent (numpy seed)
    (m, n_H0, n_W0, n_C0) = X_train.shape             
    n_y = Y_train.shape[1]                            
    costs = []                                        # To keep track of the cost
    
    # Create Placeholders of the correct shape
    ### START CODE HERE ### (1 line)
    X, Y = create_placeholders(n_H0, n_W0,n_C0,n_y)
    ### END CODE HERE ###

    # Initialize parameters
    ### START CODE HERE ### (1 line)
    parameters = initialize_parameters()
    ### END CODE HERE ###
    
    # Forward propagation: Build the forward propagation in the tensorflow graph
    ### START CODE HERE ### (1 line)
    Z3 = forward_propagation(X,parameters)
    ### END CODE HERE ###
    
    # Cost function: Add cost function to tensorflow graph
    ### START CODE HERE ### (1 line)
    cost = compute_cost(Z3, Y)
    ### END CODE HERE ###
    
    # Backpropagation: Define the tensorflow optimizer. Use an AdamOptimizer that minimizes the cost.
    ### START CODE HERE ### (1 line)
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
    ### END CODE HERE ###
    
    # Initialize all the variables globally
    init = tf.global_variables_initializer()
     
    # Start the session to compute the tensorflow graph
    with tf.Session() as sess:
        
        # Run the initialization
        sess.run(init)
        
        # Do the training loop
        for epoch in range(num_epochs):

            minibatch_cost = 0.
            num_minibatches = int(m / minibatch_size) # number of minibatches of size minibatch_size in the train set
            seed = seed + 1
            minibatches = random_mini_batches(X_train, Y_train, minibatch_size, seed)

            for minibatch in minibatches:

                # Select a minibatch
                (minibatch_X, minibatch_Y) = minibatch
                # IMPORTANT: The line that runs the graph on a minibatch.
                # Run the session to execute the optimizer and the cost, the feedict should contain a minibatch for (X,Y).
                ### START CODE HERE ### (1 line)
                _ , temp_cost = sess.run([optimizer,cost],feed_dict={X:minibatch_X,Y:minibatch_Y})
                ### END CODE HERE ###
                
                minibatch_cost += temp_cost / num_minibatches
                

            # Print the cost every 5 epochs and record it every epoch
            if print_cost == True and epoch % 5 == 0:
                print ("Cost after epoch %i: %f" % (epoch, minibatch_cost))
            if print_cost == True and epoch % 1 == 0:
                costs.append(minibatch_cost)
        
        
        # plot the cost
        plt.plot(np.squeeze(costs))
        plt.ylabel('cost')
        plt.xlabel('epochs')
        plt.title("Learning rate =" + str(learning_rate))
        plt.show()

        # Calculate the correct predictions
        predict_op = tf.argmax(Z3, 1)
        correct_prediction = tf.equal(predict_op, tf.argmax(Y, 1))
        
        # Calculate accuracy on the test set
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
        print(accuracy)
        train_accuracy = accuracy.eval({X: X_train, Y: Y_train})
        test_accuracy = accuracy.eval({X: X_test, Y: Y_test})
        print("Train Accuracy:", train_accuracy)
        print("Test Accuracy:", test_accuracy)
                
        return train_accuracy, test_accuracy, parameters

The result is shown in the figure:

DeepLearning.ai notes: (4-1) -- Convolutional Neural Networks (Foundations of CNN)

2018-09-30T02:20:54.000Z

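The fourth course turns to an important application of deep learning in computer vision: the convolutional neural network.

The first week introduces the basic building blocks and principles of convolutional neural networks.

Computer vision

Computer vision is a very important application of deep learning, for example image classification, object detection, and image style transfer.

With a conventional (fully connected) deep learning model, suppose you have a $64×64$ cat picture with three RGB channels. That is $64×64×3=12288$ values, so the input layer has dimension 12288, which is still manageable because the image is small. With a $1000×1000$ photo, however, the input vector has 3 million entries. With 1000 hidden units, the first-layer weight matrix $W$ alone has 3 billion parameters, which would take forever to train. So a conventional fully connected model is not practical here.

Edge detection

As the figure shows, horizontal and vertical edge detection give different results.

Vertical detection is shown in the figure below: a $3×3$ filter, also called a convolution kernel, is slid over the $6×6$ image. At each position the overlapping entries are multiplied element-wise and summed, which gives a $4×4$ image.

The vertical-edge filter picks out the edge in the middle of the original image: the brightest part of the rightmost image is the detected edge.

Of course, if the bright and dark halves of the left image are swapped, the darkest part of the output marks the edge.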
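The vertical-edge case can be reproduced in a few lines of NumPy. The pixel values (10 for bright, 0 for dark) and the filter entries are assumptions chosen for illustration, since the original figures are not available here; `conv2d_valid` is a hypothetical helper name.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image; multiply element-wise and sum at each position."""
    f = kernel.shape[0]
    n_H = image.shape[0] - f + 1
    n_W = image.shape[1] - f + 1
    out = np.zeros((n_H, n_W))
    for h in range(n_H):
        for w in range(n_W):
            out[h, w] = np.sum(image[h:h + f, w:w + f] * kernel)
    return out

# 6x6 image: bright left half (10), dark right half (0).
image = np.hstack([np.full((6, 3), 10.0), np.zeros((6, 3))])

# 3x3 vertical-edge filter.
vertical = np.array([[1., 0., -1.]] * 3)

edges = conv2d_valid(image, vertical)
print(edges[0])      # [ 0. 30. 30.  0.]

# Swapping the bright and dark halves flips the sign of the response.
flipped = conv2d_valid(image[:, ::-1], vertical)
print(flipped[0])    # [  0. -30. -30.   0.]
```

Every row of the $4 \times 4$ output is identical here because the image does not vary vertically.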

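Naturally there is also a horizontal edge detector.

There are more complex filters as well, but we do not need to design them by hand, because training lets the machine learn these parameters itself.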

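padding

Padding means filling in.

As the earlier example shows, every convolution shrinks the image, from $6×6$ to $4×4$, so the image keeps getting smaller until almost nothing is left.
In addition, because of how convolution works, pixels at the edges and corners take part in fewer windows, so useful information is lost.

Hence the idea of padding: adding extra pixels around the border of the image.

For example, a $6\times6$ image padded to $8\times8$ is still $6\times6$ after a $3\times3$ convolution.

The size is computed as follows.

With $n \times n$ input data, an $f \times f$ filter, and padding $p$,

the resulting matrix has size $(n + 2p -f +1)\times(n + 2p -f +1)$

There are two kinds of padding:

valid: no padding
same: the output has the same size as the input, $p=(f - 1) / 2$; $f$ is usually odd, so $p$ is an integer

stride

The stride of a convolution is the distance the filter moves after each step; so far we have always used stride=1.

With stride=2 we get:

The resulting matrix has size

$$\lfloor \frac{n+2p-f}{s}+1\rfloor \times \lfloor \frac{n+2p-f}{s}+1\rfloor$$

The result is rounded down (floor), e.g. $\lfloor 59/60 \rfloor = 0$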
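The size formula, including padding and stride, fits in a one-line helper. `conv_output_size` is a hypothetical name; the example values other than the $6 \times 6$ / $3 \times 3$ case are made up.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))            # 4  (valid: no padding)
print(conv_output_size(6, 3, p=1))       # 6  (same: p = (f - 1) / 2)
print(conv_output_size(7, 3, s=2))       # 3  (stride 2)
print(conv_output_size(8, 3, s=2))       # 3  (5 / 2 is rounded down)
```

Integer floor division `//` implements the rounding down in the formula directly.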

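Convolution over volumes

So far the convolutions were on single-channel images; with three RGB colors we need convolution over a volume.

The kernel now becomes a three-dimensional $3 \times 3 \times 3$ kernel with 27 parameters. At each position it is multiplied against the 27 corresponding RGB values of the image, and the products are summed to give one pixel of the output. Because there is only one kernel, the output is still a $4 \times 4 \times 1$ image.

Multiple kernels

Different kernels extract different image features, so many kernels can be used at once, for example to extract the horizontal and vertical edges separately.

With two kernels, the output is a two-channel image, $4\times 4 \times 2$.

Two concepts must be kept apart here, the number of channels of a kernel and the number of kernels:

Number of channels: the kernel acts on the input image, which has $n_c$ channels; the kernel must have the same number of channels as the input image
Number of kernels: how many such kernels are used, written $n_{c}^{\prime}$; the number of kernels is also the number of channels of the output image, e.g. two kernels produce a $4\times 4 \times 2$ image, where 2 is the number of kernels
That is, an $n \times n \times n_c$ image convolved with $n_{c}^{\prime}$ kernels of size $ f \times f \times n_c$ gives a new image of size $(n -f +1)\times (n - f +1 ) \times n_{c}^{\prime}$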
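The shape rule can be confirmed with a naive NumPy loop. This is a sketch with random data and no padding or stride; `conv_volume` is a hypothetical helper name, and the kernel layout (f, f, n_c, n_c') is chosen to match the W layout used in the assignment code.

```python
import numpy as np

def conv_volume(image, kernels):
    """image: (n, n, n_c); kernels: (f, f, n_c, n_c_out) -> (n-f+1, n-f+1, n_c_out)."""
    n = image.shape[0]
    f, _, n_c, n_c_out = kernels.shape
    assert image.shape[2] == n_c, "kernel channels must match image channels"
    n_out = n - f + 1
    out = np.zeros((n_out, n_out, n_c_out))
    for h in range(n_out):
        for w in range(n_out):
            window = image[h:h + f, w:w + f, :]        # f x f x n_c block
            for k in range(n_c_out):                   # one output channel per kernel
                out[h, w, k] = np.sum(window * kernels[:, :, :, k])
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6, 3))       # 6 x 6 RGB image
kernels = rng.standard_normal((3, 3, 3, 2))  # two 3 x 3 x 3 kernels

out = conv_volume(image, kernels)
print(out.shape)    # (4, 4, 2)
```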

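Convolutional neural networks

One layer of a convolutional network

The figure shows the basic process of a single convolutional layer: the input is convolved with two kernels, then the bias is added and the ReLU activation is applied.

Suppose a convolutional layer has 10 kernels of size $3 \times 3 \times 3$. It then has $(3\times3\times3+1) \times10=280$ parameters, where the +1 is the bias of each kernel.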
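The count above, and the fully connected count from the computer vision section, can be reproduced with simple arithmetic. `conv_layer_params` is a hypothetical helper name.

```python
def conv_layer_params(f, n_c, n_filters):
    """Each filter has f * f * n_c weights plus one bias."""
    return (f * f * n_c + 1) * n_filters

conv_params = conv_layer_params(f=3, n_c=3, n_filters=10)
print(conv_params)      # 280

# First-layer weights of a fully connected net on a 1000 x 1000 RGB image
# with 1000 hidden units (biases not counted).
fc_params = 1000 * 1000 * 3 * 1000
print(fc_params)        # 3000000000
```

The convolutional layer's parameter count does not depend on the image size, which is the point of the comparison.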

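The notation for each parameter is summarized here:

Simple neural network

The layer types in a convolutional neural network are typically:

convolution layer
pooling layer
fully connected layer

Pooling layer

Pooling compresses the data, speeds up computation, and makes the extracted features more robust

Max pooling

Take the maximum within the window

Average Pooling

Take the average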
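Both pooling modes on a small made-up $4 \times 4$ input, with a $2 \times 2$ window and stride 2. This is a single-channel sketch and `pool2d` is a hypothetical helper name.

```python
import numpy as np

def pool2d(x, f, stride, mode="max"):
    """Max or average pooling over f x f windows of a 2-D array."""
    n_H = (x.shape[0] - f) // stride + 1
    n_W = (x.shape[1] - f) // stride + 1
    out = np.zeros((n_H, n_W))
    for h in range(n_H):
        for w in range(n_W):
            window = x[h * stride:h * stride + f, w * stride:w * stride + f]
            out[h, w] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[1., 3., 2., 1.],
              [2., 9., 1., 1.],
              [1., 3., 2., 3.],
              [5., 6., 1., 2.]])

max_out = pool2d(x, f=2, stride=2, mode="max")
avg_out = pool2d(x, f=2, stride=2, mode="average")
print(max_out)   # [[9. 2.]
                 #  [6. 3.]]
print(avg_out)   # [[3.75 1.25]
                 #  [3.75 2.  ]]
```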

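Convolutional neural network example

A conv layer is usually followed by pooling, so conv and pooling can be treated as one layer.

The figure above shows a convolutional neural network with the structure $conv-pool-conv-pool-fc-fc-fc-softmax$.

The parameters of each layer are as follows:

As you can see, the convolutional layers have very few parameters, the pooling layers have none, and most of the parameters are in the fully connected layers.

Why use convolutional neural networks?

Two main reasons are given here:

Parameter sharing: the kernel's parameters are shared across all pixel positions of the input image, which greatly reduces the number of parameters
Sparsity of connections: each output value depends on only a small number of inputs.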