backward in PyTorch Autograd

First, the key points in brief.

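The autograd package is at the core of all neural networks in PyTorch.
It provides automatic differentiation for all operations on tensors.
torch.Tensor is the central class of the package. If you set .requires_grad = True on a tensor, autograd tracks every operation on it. When the computation is finished, calling .backward() computes all gradients automatically, and the gradient for this tensor is accumulated into its .grad attribute.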
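The accumulation into .grad is easy to overlook, so here is a minimal sketch of it. The tensor x and the function sum(x^2) are my own illustrative choices, not taken from the text above.

```python
import torch

# x is created by the user with requires_grad=True, so autograd tracks it.
x = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d(sum(x^2))/dx = 2x
(x ** 2).sum().backward()
first = x.grad.clone()

# Second backward pass on a freshly built graph:
# the new gradient is added to the existing .grad, not written over it.
(x ** 2).sum().backward()
second = x.grad.clone()

print(first)   # tensor([2., 4.])
print(second)  # tensor([4., 8.])
```

This is why gradients usually have to be reset (for example with x.grad.zero_()) between independent backward passes.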

Another important class in the autograd package is Function.

Tensor and Function are interconnected and build up an acyclic graph, that encodes a complete history of computation. Each tensor has a .grad_fn attribute that references a Function that has created the Tensor (except for Tensors created by the user - their grad_fn is None).
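To make the grad_fn description concrete, the following sketch compares a user-created tensor with one produced by an operation. The names leaf and result are illustrative only.

```python
import torch

leaf = torch.tensor([2.0, 3.0], requires_grad=True)  # created by the user
result = leaf * 2 + 1                                 # created by operations

print(leaf.grad_fn)    # None
print(leaf.is_leaf)    # True
print(result.grad_fn)  # a backward node object that records the last operation
print(result.is_leaf)  # False

# next_functions links a node to the nodes that produced its inputs,
# which is how the acyclic graph is chained together.
print(result.grad_fn.next_functions)
```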

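To compute derivatives, call .backward() on a Tensor. If the Tensor is a scalar (that is, it holds a single element), backward() needs no arguments; if it has more elements, you must pass a gradient argument, a tensor whose shape matches that of the Tensor.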
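A short sketch of this rule, with an illustrative three-element tensor of my own: calling backward() without arguments on a non-scalar raises a RuntimeError, while passing a gradient tensor of matching shape works.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2  # non-scalar output with three elements

try:
    y.backward()  # no gradient argument for a non-scalar tensor
    raised = False
except RuntimeError as err:
    raised = True
    print(err)

# Passing a gradient tensor with the same shape as y works.
y.backward(torch.ones_like(y))
print(x.grad)  # tensor([2., 2., 2.])
```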

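At first these descriptions left me completely lost, because there were too many terms and parameters I did not understand. That is fine, though: Google gives us Colab, and it would be a pity not to use such a handy tool. We still need a few basics first. For example, backward can be called without arguments only on a scalar (a tensor holding a single element); otherwise we must pass a tensor with the same shape as the output. One more note on the autograd package: many people still wrap tensors in a Variable to get automatic differentiation, but Variable has been deprecated since version 0.4, and tensors can now be used directly.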

The Variable API has been deprecated: Variables are no longer necessary to use autograd with tensors. Autograd automatically supports Tensors with requires_grad set to True. Below please find a quick guide on what has changed:

Variable(tensor) and Variable(tensor, requires_grad) still work as expected, but they return Tensors instead of Variables.

var.data is the same thing as tensor.data.

Methods such as var.backward(), var.detach(), var.register_hook() now work on tensors with the same method names.

In addition, one can now create tensors with requires_grad=True using factory methods such as torch.randn(), torch.zeros(), torch.ones(), and others like the following:
autograd_tensor = torch.randn((2, 3, 4), requires_grad=True)

Here are some experiments:

import torch

# Leaf tensor; requires_grad=True tells autograd to track operations on it
a = torch.tensor([2, 3], dtype=torch.float, requires_grad=True)
b = a + 3
c = b * b * 3
out = c.mean()  # scalar, so backward() needs no argument
out.backward()
print('input')
print(a.data)
print('compute result is')
print(out.data)
print('input gradients are')
print(a.grad)

The output looks like this:

input
tensor([2., 3.])
compute result is
tensor(91.5000)
input gradients are
tensor([15., 18.])

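A simple problem like this can be checked by hand. Say the input is $m=(m_1=2,m_2 = 3)$. One thing to note: every tensor has a requires_grad attribute that defaults to False. Here we set requires_grad=True explicitly, so that autograd tracks this leaf tensor and stores its gradient in .grad. From the code we can derive
$$
out=\frac{3\left(\left(m_{1}+3\right)^{2}+\left(m_{2}+3\right)^{2}\right)}{2}
$$
Next we take the partial derivatives
$$
\frac{\partial out}{\partial m_{1}}=3\left(m_{1}+3\right)\Big|_{m_{1}=2}=15 , \quad \frac{\partial out}{\partial m_{2}}=3\left(m_{2}+3\right)\Big|_{m_{2}=3}=18
$$
These are the values we wanted, and they agree with the program output, so the check passes.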
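The hand calculation can also be checked in code. This is a small sketch that recomputes out in a single expression and compares a.grad with the formula 3(m_i + 3); the variable analytic is my own name for the hand-derived result.

```python
import torch

a = torch.tensor([2.0, 3.0], requires_grad=True)
out = (3 * (a + 3) ** 2).mean()
out.backward()

# Hand-derived formula: d(out)/d(m_i) = 3 * (m_i + 3)
analytic = 3 * (a.detach() + 3)

print(a.grad)                            # tensor([15., 18.])
print(analytic)                          # tensor([15., 18.])
print(torch.allclose(a.grad, analytic))  # True
```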

Next we look at calling backward on a non-scalar. Here is the experiment:

m = torch.tensor([[2, 3]], dtype=torch.float, requires_grad=True)
print(m)
n = torch.zeros(1, 2)
print(n)
# Fill n element by element so that n = (m1^2, m2^3)
n[0, 0] = m[0, 0] ** 2
n[0, 1] = m[0, 1] ** 3
print(n)
print(m.grad_fn)  # None: m was created by the user
print(n.grad_fn)
print(n.requires_grad)
n.backward(m.data)  # pass m's values as the gradient argument
print(m.grad)

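As you can see, the program prints a lot. This makes it easy to check the structure of the network (although it can hardly be called a network... never mind).

Output: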

tensor([[2., 3.]], requires_grad=True)
tensor([[0., 0.]])
tensor([[ 4., 27.]], grad_fn=<CopySlices>)
None
<CopySlices object at 0x7f5882fa0e80>
True
tensor([[ 8., 81.]])

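First we define the input $m=\left(x_{1}, x_{2}\right)=(2,3)$ and compute $n=\left(x_{1}^{2}, x_{2}^{3}\right)$. The partial derivatives are easy to find: $\frac{\partial n_{1}}{\partial x_{1}}=2 x_{1}=4, \frac{\partial n_{2}}{\partial x_{2}}=3 x_{2}^{2}=27$. Yet the gradient printed here is... completely wrong! Looking at the numbers, the printed gradient differs from the expected one only by a multiplicative factor, and that factor is exactly the m.data tensor we passed to backward: $(2 \times 4, 3 \times 27)=(8, 81)$. After a few more tests, it turns out that the tensor we pass in acts as a coefficient on each derivative. So if we pass a tensor whose elements are all 1, we should get the correct gradient. Let us change the program slightly.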

m = torch.tensor([[2, 3]], dtype=torch.float, requires_grad=True)
print(m)
n = torch.zeros(1, 2)
print(n)
# Same construction as before: n = (m1^2, m2^3)
n[0, 0] = m[0, 0] ** 2
n[0, 1] = m[0, 1] ** 3
print(n)
print(m.grad_fn)
print(n.grad_fn)
print(n.requires_grad)
print(m.data)
k = torch.tensor([[1, 1]])
print(k)
# This time the gradient argument is a tensor of ones
n.backward(torch.tensor([[1, 1]], dtype=torch.float))
print(m.grad)

tensor([[2., 3.]], requires_grad=True)
tensor([[0., 0.]])
tensor([[ 4., 27.]], grad_fn=<CopySlices>)
None
<CopySlices object at 0x7f5882fa0e80>
True
tensor([[2., 3.]])
tensor([[1, 1]])
tensor([[ 4., 27.]])

Sure enough, we get the right answer. (The result matches for now, but this is not yet the correct explanation.)
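The scaling observed in these two runs can be reproduced with any weights. The sketch below is a variation of the experiment, not the original code: it uses 1-D tensors, builds n with torch.stack instead of in-place assignment, and wraps everything in a hypothetical helper grad_with_weights.

```python
import torch

def grad_with_weights(weights):
    """Return m.grad after n.backward(weights), for n = (m1^2, m2^3)."""
    m = torch.tensor([2.0, 3.0], requires_grad=True)
    n = torch.stack([m[0] ** 2, m[1] ** 3])
    n.backward(weights)
    return m.grad

ones = grad_with_weights(torch.tensor([1.0, 1.0]))
same_as_m = grad_with_weights(torch.tensor([2.0, 3.0]))
other = grad_with_weights(torch.tensor([10.0, 100.0]))

print(ones)       # values 4 and 27: the plain derivatives
print(same_as_m)  # values 8 and 81: each derivative times 2 and 3
print(other)      # values 40 and 2700: each derivative times 10 and 100
```

In every case each entry of m.grad is the plain derivative multiplied by the matching weight.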

The case above is fully decoupled: each output depends on only one input. Now let us see what changes when the outputs are coupled.

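# The original wrapped these tensors in the deprecated Variable (aliased as v);
# plain tensors work directly.
m = torch.tensor([[2, 3]], dtype=torch.float, requires_grad=True)
j = torch.zeros(2, 2)  # will hold the Jacobian later
k = torch.zeros(1, 2)
print(m)
print(j)
print(k)
k[0, 0] = m[0, 0] ** 2 + 3 * m[0, 1]
k[0, 1] = m[0, 1] ** 2 + 2 * m[0, 0]
print(k)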

The setup is the input $m=\left(x_{1}=2, x_{2}=3\right)$ and $k=\left(x_{1}^{2}+3 x_{2}, x_{2}^{2}+2 x_{1}\right)$. Let us first compute the result by hand:
$$
\frac{\partial\left(x_{1}^{2}+3 x_{2}\right)}{\partial x_{1}}=2 x_{1}=4, \frac{\partial\left(x_{1}^{2}+3 x_{2}\right)}{\partial x_{2}}=3
$$

$$
\frac{\partial\left(x_{2}^{2}+2 x_{1}\right)}{\partial x_{1}}=2, \frac{\partial\left(x_{2}^{2}+2 x_{1}\right)}{\partial x_{2}}=2 x_{2}=6
$$

So the Jacobian we obtain is
$$
\mathbb{J} =
\begin{bmatrix}
4 & 3\\
2 & 6
\end{bmatrix}
$$
Now let us verify it.

k.backward(torch.tensor([[1, 1]], dtype=torch.float))
print(m.grad)

tensor([[6., 9.]])

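This is clearly not the Jacobian; even the shape is different. The problem lies in how we understand the tensor passed to backward. The argument in k.backward(gradient) must have the same shape as k, which [[1, 1]] does. So what does this tensor actually mean?

Let us go to the official documentation, where we find: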

backward(gradient=None, retain_graph=None, create_graph=False)

Computes the gradient of current tensor w.r.t. graph leaves.

The graph is differentiated using the chain rule. If the tensor is non-scalar (i.e. its data has more than one element) and requires gradient, the function additionally requires specifying gradient. It should be a tensor of matching type and location, that contains the gradient of the differentiated function w.r.t. self.

This function accumulates gradients in the leaves - you might need to zero them before calling it.

Parameters

gradient (Tensor or None) – Gradient w.r.t. the tensor. If it is a tensor, it will be automatically converted to a Tensor that does not require grad unless create_graph is True. None values can be specified for scalar Tensors or ones that don’t require grad. If a None value would be acceptable then this argument is optional.

retain_graph (bool, optional) – If False, the graph used to compute the grads will be freed. Note that in nearly all cases setting this option to True is not needed and often can be worked around in a much more efficient way. Defaults to the value of create_graph.

create_graph (bool, optional) – If True, graph of the derivative will be constructed, allowing to compute higher order derivative products. Defaults to False.

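The key point of this documentation is this: when we call backward on a scalar, no gradient argument is needed, but for a non-scalar we must pass a gradient tensor whose shape matches the tensor that backward is called on. Why? Differentiating a scalar is easy to picture. Differentiating a two-element tensor with respect to a two-element tensor, as in the example above, is also doable: the result is simply a Jacobian matrix, nothing special. In deep learning, however, the tensors we meet are usually of higher order. Suppose we differentiate a fourth-order tensor with respect to another tensor: can you picture the shape of the result? So PyTorch makes a simple rule: a tensor is never differentiated with respect to a tensor. Only a scalar is differentiated with respect to a tensor, and the result naturally has the same shape as that tensor.

With that in mind, the role of the gradient argument becomes obvious: it turns the tensor that backward is called on into a scalar. It does so by forming a weighted sum of the tensor's elements, where the weights are the corresponding elements of the gradient tensor we pass in.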
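This weighted-sum reading can be tested directly. The sketch below compares k.backward(v) with an explicit (v * k).sum().backward() on the coupled example; the helper make_k and the 1-D layout are my own simplifications of the original code.

```python
import torch

def make_k(m):
    # Same coupled function as in the example: k = (x1^2 + 3*x2, x2^2 + 2*x1)
    return torch.stack([m[0] ** 2 + 3 * m[1], m[1] ** 2 + 2 * m[0]])

v = torch.tensor([1.0, 1.0])  # the gradient argument, i.e. the weights

# Route 1: backward on the non-scalar k with gradient argument v
m1 = torch.tensor([2.0, 3.0], requires_grad=True)
make_k(m1).backward(v)

# Route 2: build the weighted sum explicitly, then backward on the scalar
m2 = torch.tensor([2.0, 3.0], requires_grad=True)
(v * make_k(m2)).sum().backward()

print(m1.grad)  # tensor([6., 9.])
print(m2.grad)  # tensor([6., 9.])
```

Both routes give the same gradient, which is what the weighted-sum explanation predicts.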

Let us apply this to the example above, with $m=\left(x_{1}=2, x_{2}=3\right)$ and $k=\left(x_{1}^{2}+3 x_{2}, x_{2}^{2}+2 x_{1}\right)$. We passed the gradient argument $(1, 1)$ to backward, which gives the sum
$$
\sum = 1\times (x_1^2+3x_2)+1\times (x_2^2+2x_1)
$$
Next, we take the partial derivatives of this sum with respect to $x_1$ and $x_2$:
$$
\frac{\partial\sum}{\partial x_1}=2x_1+2|_{x_1=2}=6
$$

$$
\frac{\partial\sum}{\partial x_2}=2x_2+3|_{x_2=3}=9
$$

This reproduces the answer that looked wrong earlier.
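The same result can be read as a product of the gradient argument, written as a row vector, with the hand-computed Jacobian $\mathbb{J}$. The symbol $v$ for the gradient argument is my own notation:

$$
v\,\mathbb{J} =
\begin{bmatrix} 1 & 1 \end{bmatrix}
\begin{bmatrix}
4 & 3\\
2 & 6
\end{bmatrix}
=
\begin{bmatrix} 4+2 & 3+6 \end{bmatrix}
=
\begin{bmatrix} 6 & 9 \end{bmatrix}
$$

Choosing $v=(1,0)$ or $v=(0,1)$ instead picks out the first or second row of $\mathbb{J}$.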

Once this is understood, computing the Jacobian is straightforward. Change the backward part of the code as follows:

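# Weights (1, 0) select the first output; keep the graph for the second pass
k.backward(torch.tensor([[1, 0]], dtype=torch.float), retain_graph=True)
j[0] = m.grad
# Reset the accumulated gradient before the next backward pass
m.grad = torch.zeros_like(m.grad)
# Weights (0, 1) select the second output
k.backward(torch.tensor([[0, 1]], dtype=torch.float))
j[1] = m.grad
print('jacobian is')
print(j)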

jacobian is
tensor([[4., 3.],
        [2., 6.]])

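We have successfully printed the Jacobian. Note the extra argument retain_graph=True in the first backward call. It defaults to False, and according to the official documentation the memory used by the computation graph is freed after a backward pass, so a second backward pass through the same graph would not be possible. That is why we set it to True here. The documentation also says that setting it to True is almost never needed, so it is usually best left at its default for better performance.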
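Two follow-up checks, as a hedged sketch rather than part of the original experiment. Part 1 shows that a second backward call through the same graph fails when retain_graph is left at its default. Part 2 uses torch.autograd.functional.jacobian, available in recent PyTorch releases, to get the whole matrix in one call. The helper k_fn and the 1-D layout are my own simplifications.

```python
import torch
from torch.autograd.functional import jacobian

def k_fn(x):
    # Same coupled function as in the example: k = (x1^2 + 3*x2, x2^2 + 2*x1)
    return torch.stack([x[0] ** 2 + 3 * x[1], x[1] ** 2 + 2 * x[0]])

# Part 1: without retain_graph=True, the second backward call fails.
m = torch.tensor([2.0, 3.0], requires_grad=True)
k = k_fn(m)
k.backward(torch.tensor([1.0, 0.0]))      # graph buffers are freed here
try:
    k.backward(torch.tensor([0.0, 1.0]))  # needs the freed buffers
    second_call_failed = False
except RuntimeError:
    second_call_failed = True
print(second_call_failed)  # True

# Part 2: compute the full Jacobian in one call.
J = jacobian(k_fn, torch.tensor([2.0, 3.0]))
print(J)  # row i holds the partial derivatives of k_i: (4, 3) and (2, 6)
```

With jacobian there is no need to manage retain_graph or to reset m.grad by hand, which makes it a convenient cross-check for the manual loop.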

References

Why does PyTorch's backward have a grad_variables parameter?, https://zhuanlan.zhihu.com/p/29923090
Understanding PyTorch's backward(), https://blog.csdn.net/douhaoexia/article/details/78821428
PyTorch Handbook-GitHub, https://github.com/zergtant/pytorch-handbook/blob/master/chapter1/2_autograd_tutorial.ipynb
AUTOMATIC DIFFERENTIATION PACKAGE - TORCH.AUTOGRAD, https://pytorch.org/docs/stable/autograd.html#variable-deprecated