Vector form of the multivariable chain rule
Vector Calculus – Chain Rule

## University

University of Kerala

## Chain Rule

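Consider a function $f : \mathbb{R}^2 \to \mathbb{R}$ of two variables $x_1, x_2$. Furthermore, $x_1(t)$ and $x_2(t)$ are themselves functions of $t$. To compute the gradient of $f$ with respect to $t$, we apply the chain rule for multivariate functions:

$$
\frac{df}{dt}
= \frac{\partial f}{\partial x}\frac{\partial x}{\partial t}
= \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} \end{bmatrix}
\begin{bmatrix} \dfrac{\partial x_1(t)}{\partial t} \\[2mm] \dfrac{\partial x_2(t)}{\partial t} \end{bmatrix}
= \frac{\partial f}{\partial x_1}\frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2}\frac{\partial x_2}{\partial t}
$$

where $d$ denotes the gradient and $\partial$ partial derivatives.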

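Consider $f(x_1, x_2) = x_1^2 + 2x_2$, where $x_1 = \sin t$ and $x_2 = \cos t$. Then

$$
\begin{aligned}
\frac{df}{dt}
&= \frac{\partial f}{\partial x_1}\frac{\partial x_1}{\partial t} + \frac{\partial f}{\partial x_2}\frac{\partial x_2}{\partial t} \\
&= 2\sin t \,\frac{\partial \sin t}{\partial t} + 2\,\frac{\partial \cos t}{\partial t} \\
&= 2\sin t\cos t - 2\sin t = 2\sin t\,(\cos t - 1)
\end{aligned}
$$

is the corresponding derivative of $f$ with respect to $t$.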
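As a quick sanity check, the chain-rule result can be compared with a central finite difference. The sketch below is only illustrative: the test point `t = 0.7`, the step size `eps`, and the helper names are assumptions and not part of the original notes.

```python
import math

def f(x1, x2):
    # f(x1, x2) = x1^2 + 2*x2
    return x1 ** 2 + 2 * x2

def df_dt_chain(t):
    # Partial derivatives of f, evaluated at x1 = sin t, x2 = cos t
    df_dx1 = 2 * math.sin(t)
    df_dx2 = 2.0
    # Derivatives of the inner functions
    dx1_dt = math.cos(t)
    dx2_dt = -math.sin(t)
    # Chain rule: sum of products
    return df_dx1 * dx1_dt + df_dx2 * dx2_dt

def df_dt_numeric(t, eps=1e-6):
    # Central finite difference of t -> f(sin t, cos t)
    comp = lambda s: f(math.sin(s), math.cos(s))
    return (comp(t + eps) - comp(t - eps)) / (2 * eps)

t = 0.7  # arbitrary test point
closed_form = 2 * math.sin(t) * (math.cos(t) - 1)
print(df_dt_chain(t), closed_form, df_dt_numeric(t))
```

All three printed numbers should agree up to the accuracy of the finite difference.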

## Gradients of Vector-Valued Functions

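We are given $f(x) = Ax$, with $f(x) \in \mathbb{R}^M$, $A \in \mathbb{R}^{M \times N}$, and $x \in \mathbb{R}^N$.

To compute the gradient $df/dx$, we first determine its dimension: since $f : \mathbb{R}^N \to \mathbb{R}^M$, it follows that $df/dx \in \mathbb{R}^{M \times N}$. Second, we determine the partial derivatives of $f$ with respect to every $x_j$:

$$
f_i(x) = \sum_{j=1}^{N} A_{ij} x_j \implies \frac{\partial f_i}{\partial x_j} = A_{ij}.
$$

We collect the partial derivatives in the Jacobian and obtain the gradient

$$
\frac{df}{dx}
= \begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_N} \\
\vdots & & \vdots \\
\dfrac{\partial f_M}{\partial x_1} & \cdots & \dfrac{\partial f_M}{\partial x_N}
\end{bmatrix}
= \begin{bmatrix}
A_{11} & \cdots & A_{1N} \\
\vdots & & \vdots \\
A_{M1} & \cdots & A_{MN}
\end{bmatrix}
= A \in \mathbb{R}^{M \times N}.
$$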
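The statement that the Jacobian of $f(x) = Ax$ equals $A$ can be checked numerically. The following is a minimal sketch using plain Python lists; the matrix `A`, the point `x`, and the helper names `matvec` and `numerical_jacobian` are hypothetical choices made for illustration.

```python
def matvec(A, x):
    # f(x) = A x for a matrix stored as a list of rows
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def numerical_jacobian(func, x, eps=1e-6):
    # J[i][j] approximates d f_i / d x_j by central differences
    fx = func(x)
    M, N = len(fx), len(x)
    J = [[0.0] * N for _ in range(M)]
    for j in range(N):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = func(x_plus), func(x_minus)
        for i in range(M):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return J

A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]      # M = 2, N = 3
x = [0.5, -1.0, 2.0]       # arbitrary evaluation point

J = numerical_jacobian(lambda v: matvec(A, v), x)
for row in J:
    print(row)
```

The printed rows should match the rows of `A` up to floating-point error, independent of the chosen `x`, because $f$ is linear.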

#### Chain Rule

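Consider the function $h : \mathbb{R} \to \mathbb{R}$, $h(t) = (f \circ g)(t)$, with $f : \mathbb{R}^2 \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}^2$,

$$
f(x) = \exp(x_1 x_2^2), \qquad
x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = g(t) = \begin{bmatrix} t\cos t \\ t\sin t \end{bmatrix},
$$

and compute the gradient of $h$ with respect to $t$. Since $f : \mathbb{R}^2 \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}^2$, we note that

$$
\frac{\partial f}{\partial x} \in \mathbb{R}^{1 \times 2}, \qquad \frac{\partial g}{\partial t} \in \mathbb{R}^{2 \times 1}.
$$

The desired gradient is computed by applying the chain rule:

$$
\begin{aligned}
\frac{dh}{dt}
&= \frac{\partial f}{\partial x}\frac{\partial x}{\partial t}
= \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} \end{bmatrix}
\begin{bmatrix} \dfrac{\partial x_1}{\partial t} \\[2mm] \dfrac{\partial x_2}{\partial t} \end{bmatrix} \\
&= \begin{bmatrix} \exp(x_1 x_2^2)\, x_2^2 & 2\exp(x_1 x_2^2)\, x_1 x_2 \end{bmatrix}
\begin{bmatrix} \cos t - t\sin t \\ \sin t + t\cos t \end{bmatrix} \\
&= \exp(x_1 x_2^2)\left( x_2^2(\cos t - t\sin t) + 2 x_1 x_2 (\sin t + t\cos t) \right),
\end{aligned}
$$

where $x_1 = t\cos t$ and $x_2 = t\sin t$.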
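The composition $h = f \circ g$ with $f(x) = \exp(x_1 x_2^2)$ and $g(t) = (t\cos t,\ t\sin t)$ can be differentiated by multiplying the $1 \times 2$ row $\partial f/\partial x$ with the $2 \times 1$ column $\partial g/\partial t$. The sketch below does this and compares against a finite difference; the test point and helper names are assumptions made for illustration.

```python
import math

def g(t):
    # Inner function g: R -> R^2
    return t * math.cos(t), t * math.sin(t)

def f(x1, x2):
    # Outer function f: R^2 -> R
    return math.exp(x1 * x2 ** 2)

def dh_dt(t):
    x1, x2 = g(t)
    e = f(x1, x2)
    # Row vector df/dx (1 x 2)
    df_dx = (e * x2 ** 2, 2 * e * x1 * x2)
    # Column vector dg/dt (2 x 1)
    dg_dt = (math.cos(t) - t * math.sin(t),
             math.sin(t) + t * math.cos(t))
    # Row times column gives a scalar
    return df_dx[0] * dg_dt[0] + df_dx[1] * dg_dt[1]

def dh_dt_numeric(t, eps=1e-6):
    # Central finite difference of h(t) = f(g(t))
    return (f(*g(t + eps)) - f(*g(t - eps))) / (2 * eps)

t = 0.8  # arbitrary test point
print(dh_dt(t), dh_dt_numeric(t))
```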

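Now consider the gradient of $f = Ax$ with respect to the matrix $A \in \mathbb{R}^{M \times N}$. The components of $f$ are

$$
f_i = \sum_{j=1}^{N} A_{ij} x_j, \qquad i = 1, \dots, M,
$$

and the partial derivatives are given as

$$
\frac{\partial f_i}{\partial A_{iq}} = x_q.
$$

This allows us to compute the partial derivatives of $f_i$ with respect to a row of $A$, which are given as

$$
\frac{\partial f_i}{\partial A_{i,:}} = x^\top \in \mathbb{R}^{1 \times 1 \times N}, \qquad
\frac{\partial f_i}{\partial A_{k \neq i,:}} = 0^\top \in \mathbb{R}^{1 \times 1 \times N},
$$

where we have to pay attention to the correct dimensionality. Since $f_i$ maps onto $\mathbb{R}$ and each row of $A$ is of size $1 \times N$, we obtain a $1 \times 1 \times N$-sized tensor as the partial derivative of $f_i$ with respect to a row of $A$. We stack the partial derivatives and obtain the desired gradient via

$$
\frac{\partial f_i}{\partial A}
= \begin{bmatrix} 0^\top \\ \vdots \\ 0^\top \\ x^\top \\ 0^\top \\ \vdots \\ 0^\top \end{bmatrix}
\in \mathbb{R}^{1 \times (M \times N)},
$$

where $x^\top$ appears in the $i$th row.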
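To make the stacking concrete, the gradient of $f = Ax$ with respect to $A$ can be stored as a nested list indexed as `grad[i][k][q]`, holding $\partial f_i / \partial A_{kq}$. This is a minimal sketch; the function name `grad_Ax_wrt_A` and the index layout are illustrative assumptions.

```python
def grad_Ax_wrt_A(x, M):
    # grad[i][k][q] = d f_i / d A_kq
    # Row k of slice i equals x^T when k == i, and is zero otherwise.
    N = len(x)
    return [[[x[q] if k == i else 0.0 for q in range(N)]
             for k in range(M)]
            for i in range(M)]

x = [0.5, -1.0, 2.0]   # N = 3
grad = grad_Ax_wrt_A(x, M=2)

# Slice i is the M x N array d f_i / d A
for i, slice_i in enumerate(grad):
    print("d f_%d / dA =" % (i + 1), slice_i)
```

Each slice contains a single nonzero row, which reflects that $f_i$ depends only on the $i$th row of $A$.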

### Gradient of Matrices with Respect to Matrices

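Consider a matrix $R \in \mathbb{R}^{M \times N}$ and $f : \mathbb{R}^{M \times N} \to \mathbb{R}^{N \times N}$ with $f(R) = R^\top R =: K \in \mathbb{R}^{N \times N}$, where we seek the gradient $dK/dR$. To solve this hard problem, let us first write down what we already know. The gradient has the dimensions

$$
\frac{dK}{dR} \in \mathbb{R}^{(N \times N) \times (M \times N)},
$$

which is a tensor. Moreover,

$$
\frac{dK_{pq}}{dR} \in \mathbb{R}^{1 \times M \times N}
$$

for $p, q = 1, \dots, N$, where $K_{pq}$ is the $(p, q)$th entry of $K = f(R)$. Denoting the $i$th column of $R$ by $r_i$, every entry of $K$ is given by the dot product of two columns of $R$, i.e.,

$$
K_{pq} = r_p^\top r_q = \sum_{m=1}^{M} R_{mp} R_{mq}.
$$

When we now compute the partial derivative $\partial K_{pq} / \partial R_{ij}$, we obtain

$$
\frac{\partial K_{pq}}{\partial R_{ij}} = \sum_{m=1}^{M} \frac{\partial}{\partial R_{ij}} R_{mp} R_{mq} = \partial_{pqij},
$$

$$
\partial_{pqij} =
\begin{cases}
R_{iq} & \text{if } j = p,\ p \neq q \\
R_{ip} & \text{if } j = q,\ p \neq q \\
2R_{iq} & \text{if } j = p,\ p = q \\
0 & \text{otherwise.}
\end{cases}
$$

We know that the desired gradient has the dimension $(N \times N) \times (M \times N)$, and every single entry of this tensor is given by $\partial_{pqij}$, where $p, q, j = 1, \dots, N$ and $i = 1, \dots, M$.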
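The case distinction for $\partial K_{pq} / \partial R_{ij}$ with $K = R^\top R$ is easy to get wrong, so a numerical check is useful. The sketch below implements the four cases with zero-based indices and compares every entry with a central finite difference; the matrix `R` and the helper names are hypothetical.

```python
def gram(R):
    # K = R^T R, with K[p][q] = sum_m R[m][p] * R[m][q]
    M, N = len(R), len(R[0])
    return [[sum(R[m][p] * R[m][q] for m in range(M)) for q in range(N)]
            for p in range(N)]

def dK_dR_entry(R, p, q, i, j):
    # Closed-form partial derivative d K_pq / d R_ij (zero-based indices)
    if p != q:
        if j == p:
            return R[i][q]
        if j == q:
            return R[i][p]
        return 0.0
    return 2 * R[i][q] if j == p else 0.0

def numeric_entry(R, p, q, i, j, eps=1e-6):
    # Central finite difference with respect to the single entry R_ij
    R_plus = [row[:] for row in R]
    R_minus = [row[:] for row in R]
    R_plus[i][j] += eps
    R_minus[i][j] -= eps
    return (gram(R_plus)[p][q] - gram(R_minus)[p][q]) / (2 * eps)

R = [[1.0, 2.0, -1.0],
     [0.5, 3.0, 4.0]]   # M = 2, N = 3
M, N = len(R), len(R[0])

max_err = max(abs(dK_dR_entry(R, p, q, i, j) - numeric_entry(R, p, q, i, j))
              for p in range(N) for q in range(N)
              for i in range(M) for j in range(N))
print("largest deviation:", max_err)
```

The loop visits all $N \cdot N \cdot M \cdot N$ entries of the gradient tensor, matching its stated dimension $(N \times N) \times (M \times N)$.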

### Useful Identities for Computing Gradients

In the following, we list some useful gradients that are frequently required in a machine learning context (Petersen and Pedersen, 2012). Here, we use $\mathrm{tr}(\cdot)$ as the trace, $\det(\cdot)$ as the determinant, and $f(X)^{-1}$ as the inverse of $f(X)$, assuming it exists.

Consider the function

$$
f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\left(x^2 + \exp(x^2)\right).
$$

By application of the chain rule, and noting that differentiation is linear, we compute the gradient

$$
\begin{aligned}
\frac{df}{dx}
&= \frac{2x + 2x\exp(x^2)}{2\sqrt{x^2 + \exp(x^2)}} - \sin\left(x^2 + \exp(x^2)\right)\left(2x + 2x\exp(x^2)\right) \\
&= 2x\left(\frac{1}{2\sqrt{x^2 + \exp(x^2)}} - \sin\left(x^2 + \exp(x^2)\right)\right)\left(1 + \exp(x^2)\right).
\end{aligned}
$$

Writing out the gradient in this explicit way is often impractical because it tends to produce a very lengthy expression for the derivative. In practice, this means that, if we are not careful, the implementation of the gradient could be significantly more expensive than computing the function, which imposes unnecessary overhead. For training deep neural network models, the backpropagation algorithm (Kelley, 1960; Bryson, 1961; Dreyfus, 1962; Rumelhart et al., 1986) is an efficient way to compute the gradient of an error function with respect to the parameters of the model.

## Automatic Differentiation

It turns out that backpropagation is a special case of a general technique in numerical analysis called automatic differentiation. We can think of automatic differentiation as a set of techniques to numerically (in contrast to symbolically) evaluate the exact (up to machine precision) gradient of a function by working with intermediate variables and applying the chain rule. Automatic differentiation applies a series of elementary arithmetic operations, e.g., addition and multiplication, and elementary functions, e.g., sin, cos, exp, log. By applying the chain rule to these operations, the gradient of quite complicated functions can be computed automatically. Automatic differentiation applies to general computer programs and has forward and reverse modes. Baydin et al. (2018) give a great overview of automatic differentiation in machine learning.
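One way to see how the chain rule is applied to elementary operations is forward-mode differentiation with dual numbers, where every value carries its derivative with respect to the input. The sketch below is a toy illustration and not the interface of any particular library; the class `Dual`, the helpers `dexp` and `dsin`, and the test function `func` are all hypothetical.

```python
import math

class Dual:
    """A value paired with its derivative with respect to one input."""

    def __init__(self, val, dot=0.0):
        self.val = val   # function value
        self.dot = dot   # derivative value

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Sum rule
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__

def dexp(u):
    # Chain rule for exp
    return Dual(math.exp(u.val), math.exp(u.val) * u.dot)

def dsin(u):
    # Chain rule for sin
    return Dual(math.sin(u.val), math.cos(u.val) * u.dot)

def func(x):
    # Hypothetical test function: sin(x^2) + 3 exp(x)
    return dsin(x * x) + 3 * dexp(x)

x = Dual(0.5, 1.0)   # seed derivative dx/dx = 1
y = func(x)
expected = 2 * 0.5 * math.cos(0.25) + 3 * math.exp(0.5)
print(y.dot, expected)
```

No symbolic expression for the derivative is ever formed: each elementary operation only updates a pair of numbers.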

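Suppose the input $x$ is mapped to the output $y$ via intermediate variables $a$ and $b$. If we were to compute the derivative $dy/dx$, we would apply the chain rule and obtain

$$
\frac{dy}{dx} = \frac{dy}{db}\frac{db}{da}\frac{da}{dx}.
$$

Intuitively, the forward and reverse mode differ in the order of multiplication. Due to the associativity of matrix multiplication, we can choose between

$$
\frac{dy}{dx} = \left(\frac{dy}{db}\frac{db}{da}\right)\frac{da}{dx},
\qquad
\frac{dy}{dx} = \frac{dy}{db}\left(\frac{db}{da}\frac{da}{dx}\right).
$$

The first grouping is the reverse mode, because gradients are propagated backward from the output; the second is the forward mode, where gradients flow with the data from the input.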
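When the derivatives in the chain are Jacobian matrices, both groupings give the same product but can differ in cost. The sketch below counts scalar multiplications for a hypothetical chain $x \in \mathbb{R}^4 \to a \in \mathbb{R}^3 \to b \in \mathbb{R}^2 \to y \in \mathbb{R}$; the Jacobian entries are made-up numbers chosen only for illustration.

```python
def matmul(A, B):
    # Return A @ B together with the number of scalar multiplications
    n, k, m = len(A), len(B), len(B[0])
    C = [[sum(A[i][l] * B[l][j] for l in range(k)) for j in range(m)]
         for i in range(n)]
    return C, n * k * m

# Hypothetical Jacobians of the three stages
da_dx = [[1.0, 2.0, 0.0, -1.0],
         [0.5, 0.0, 1.0, 3.0],
         [2.0, -1.0, 1.0, 0.0]]    # 3 x 4
db_da = [[1.0, 0.0, 2.0],
         [-1.0, 1.0, 0.5]]         # 2 x 3
dy_db = [[3.0, -2.0]]              # 1 x 2

# Grouping (dy/db db/da) da/dx: start at the output
tmp, c1 = matmul(dy_db, db_da)
rev, c2 = matmul(tmp, da_dx)
reverse_cost = c1 + c2

# Grouping dy/db (db/da da/dx): start at the input
tmp, c3 = matmul(db_da, da_dx)
fwd, c4 = matmul(dy_db, tmp)
forward_cost = c3 + c4

print(rev, reverse_cost)   # [[14.0, 5.0, 3.0, -11.0]] 18
print(fwd, forward_cost)   # [[14.0, 5.0, 3.0, -11.0]] 32
```

With a scalar output, starting from the output keeps every intermediate product a single row, which is why that order needs fewer multiplications here.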

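In the following, we focus on reverse mode automatic differentiation, which is backpropagation. In the context of neural networks, the input dimensionality is often much higher than the dimensionality of the output, which is the setting in which reverse mode is computationally cheaper than forward mode.

To evaluate the function $f(x) = \sqrt{x^2 + \exp(x^2)} + \cos\left(x^2 + \exp(x^2)\right)$ on a computer, we can introduce the intermediate variables

$$
a = x^2, \quad b = \exp(a), \quad c = a + b, \quad d = \sqrt{c}, \quad e = \cos(c), \quad f = d + e.
$$

The corresponding computation graph describes the flow of data and computations required to obtain the function value $f$.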

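We can directly compute the derivatives of the intermediate variables with respect to their corresponding inputs by recalling the definition of the derivative of elementary functions. We obtain the following:

$$
\begin{aligned}
\frac{\partial a}{\partial x} &= 2x, &
\frac{\partial b}{\partial a} &= \exp(a), &
\frac{\partial c}{\partial a} &= 1 = \frac{\partial c}{\partial b}, \\
\frac{\partial d}{\partial c} &= \frac{1}{2\sqrt{c}}, &
\frac{\partial e}{\partial c} &= -\sin(c), &
\frac{\partial f}{\partial d} &= 1 = \frac{\partial f}{\partial e}.
\end{aligned}
$$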

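We can compute $\partial f/\partial x$ by working backward from the output and obtain

$$
\begin{aligned}
\frac{\partial f}{\partial c} &= \frac{\partial f}{\partial d}\frac{\partial d}{\partial c} + \frac{\partial f}{\partial e}\frac{\partial e}{\partial c}, \\
\frac{\partial f}{\partial b} &= \frac{\partial f}{\partial c}\frac{\partial c}{\partial b}, \\
\frac{\partial f}{\partial a} &= \frac{\partial f}{\partial b}\frac{\partial b}{\partial a} + \frac{\partial f}{\partial c}\frac{\partial c}{\partial a}, \\
\frac{\partial f}{\partial x} &= \frac{\partial f}{\partial a}\frac{\partial a}{\partial x}.
\end{aligned}
$$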

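Note that we implicitly applied the chain rule to obtain $\partial f/\partial x$. By substituting the results of the derivatives of the elementary functions, we get

$$
\begin{aligned}
\frac{\partial f}{\partial c} &= 1 \cdot \frac{1}{2\sqrt{c}} + 1 \cdot (-\sin(c)), \\
\frac{\partial f}{\partial b} &= \frac{\partial f}{\partial c} \cdot 1, \\
\frac{\partial f}{\partial a} &= \frac{\partial f}{\partial b}\exp(a) + \frac{\partial f}{\partial c} \cdot 1, \\
\frac{\partial f}{\partial x} &= \frac{\partial f}{\partial a} \cdot 2x.
\end{aligned}
$$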

By thinking of each of the derivatives above as a variable, we observe that the computation required for calculating the derivative is of similar complexity as the computation of the function itself. This is quite counterintuitive since the mathematical expression for the derivative $\partial f/\partial x$ is significantly more complicated than the mathematical expression of the function $f(x)$.
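The observation about complexity can be made concrete by coding the forward pass through the intermediate variables and then the backward pass that reuses them. The sketch below follows the variable names $a, \dots, f$ used for $f(x) = \sqrt{x^2 + \exp(x^2)} + \cos(x^2 + \exp(x^2))$; the function names, the test point, and the comparison against the explicit derivative are illustrative assumptions.

```python
import math

def f_and_grad(x):
    # Forward pass: intermediate variables of the computation graph
    a = x * x
    b = math.exp(a)
    c = a + b
    d = math.sqrt(c)
    e = math.cos(c)
    f = d + e

    # Backward pass: propagate derivatives from the output to the input
    df_dd = 1.0
    df_de = 1.0
    df_dc = df_dd * (1.0 / (2.0 * math.sqrt(c))) + df_de * (-math.sin(c))
    df_db = df_dc * 1.0
    df_da = df_db * math.exp(a) + df_dc * 1.0
    df_dx = df_da * 2.0 * x
    return f, df_dx

def explicit_derivative(x):
    # Written-out derivative of f, used only as a reference
    s = x * x + math.exp(x * x)
    return (2 * x * (1.0 / (2.0 * math.sqrt(s)) - math.sin(s))
            * (1 + math.exp(x * x)))

x0 = 0.3  # arbitrary test point
value, grad = f_and_grad(x0)
print(value, grad, explicit_derivative(x0))
```

The backward pass uses about as many operations as the forward pass, and both derivative values should agree up to rounding.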

### Exercise:

- Practice problems on finding the gradient of the given functions
- Problems for finding partial derivatives using the chain rule
- Problems for finding the gradient in polar coordinates
