Q.5
By recalling the definition of the VC dimension, we know that:
$d_{vc}(\mathcal{H})$ is the size of the largest set of points that $\mathcal{H}$ can shatter.
Consider a set of points $S$ of size $n = d_{vc}(\mathcal{H}_1 \cup \mathcal{H}_2)$, which can be shattered by $\mathcal{H}_1 \cup \mathcal{H}_2$.
For any dichotomy of $S$, there must exist an $h \in \mathcal{H}_1 \cup \mathcal{H}_2$ that realizes this dichotomy.
Split $S$ into two subsets:
$S_1$: points for which the dichotomy can be realized by hypotheses in $\mathcal{H}_1$
$S_2$: points for which the dichotomy can only be realized by hypotheses in $\mathcal{H}_2$
Clearly, $|S_1| \leq d_{vc}(\mathcal{H}_1)$, because $\mathcal{H}_1$ can shatter at most $d_{vc}(\mathcal{H}_1)$ points.
Similarly, $|S_2| \leq d_{vc}(\mathcal{H}_2)$.
Since $S = S_1 \cup S_2$, the cardinality of a union is at most the sum of the cardinalities, so:
$$n = |S| = |S_1 \cup S_2| \leq |S_1| + |S_2| \leq d_{vc}(\mathcal{H}_1) + d_{vc}(\mathcal{H}_2)$$
Therefore, $d_{vc}(\mathcal{H}_1 \cup \mathcal{H}_2) = n \leq d_{vc}(\mathcal{H}_1) + d_{vc}(\mathcal{H}_2)$.
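To see the bound in action, here is a small brute-force sketch (an added illustration, not part of the original argument; the hypothesis sets, thresholds, and helper names are chosen just for this example): take $\mathcal{H}_1$ = positive rays and $\mathcal{H}_2$ = negative rays on a few 1-D points. Each has VC dimension 1, and their union shatters at most 2 points, consistent with the bound.

```python
from itertools import combinations

# Thresholds used to generate ray hypotheses on a small 1-D domain.
THRESHOLDS = [x + 0.5 for x in range(-1, 6)]

def positive_ray(a):
    return lambda x: +1 if x > a else -1

def negative_ray(a):
    return lambda x: +1 if x < a else -1

H1 = [positive_ray(a) for a in THRESHOLDS]   # d_vc = 1
H2 = [negative_ray(a) for a in THRESHOLDS]   # d_vc = 1

def dichotomies(H, points):
    """All label patterns H can realize on the given points."""
    return {tuple(h(x) for x in points) for h in H}

def vc_dimension(H, domain):
    """Largest k such that some k-subset of `domain` is shattered by H (brute force)."""
    best = 0
    for k in range(1, len(domain) + 1):
        if any(len(dichotomies(H, S)) == 2 ** k for S in combinations(domain, k)):
            best = k
    return best

domain = list(range(5))
d1, d2 = vc_dimension(H1, domain), vc_dimension(H2, domain)
d_union = vc_dimension(H1 + H2, domain)
print(d1, d2, d_union)        # 1 1 2
assert d_union <= d1 + d2
```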
Q.6
For a given $\mathbf{x}$, we now suppose that a false negative is 10 times more costly than a false positive. Hence, we have the following:
If we predict $+1$, the expected penalty of being wrong (a false positive) is:
$$E_1 = 1 \cdot P(y=-1|\mathbf{x})$$
If we predict $-1$, the expected cost of being wrong (a false negative) is:
$$E_{-1} = 10 \cdot P(y=+1|\mathbf{x})$$
According to the description of the problem, we can construct the following rules:
We should predict $+1$ when $E_1 < E_{-1}$, that is, when $P(y=-1|\mathbf{x}) < 10P(y=+1|\mathbf{x})$.
We should predict $-1$ when $E_1 > E_{-1}$, that is, when $P(y=-1|\mathbf{x}) > 10P(y=+1|\mathbf{x})$.
Hence, we can further simplify the condition for predicting $+1$:
$$\begin{aligned}P(y=-1|\mathbf{x}) &< 10P(y=+1|\mathbf{x}) \\ 1 - P(y=+1|\mathbf{x}) &< 10P(y=+1|\mathbf{x})\\ 1 &< 11P(y=+1|\mathbf{x})\\ P(y=+1|\mathbf{x}) &> \frac{1}{11}\end{aligned}$$
Therefore, we can restate our rule as:
Predict $+1$ when $P(y=+1|\mathbf{x}) > \frac{1}{11}$
Predict $-1$ when $P(y=+1|\mathbf{x}) < \frac{1}{11}$
Thus, we get $f_{\text{MKT}}(\mathbf{x}) = \text{sign}\left(P(y=+1|\mathbf{x}) - \frac{1}{11}\right)$.
This also shows that $\alpha = \frac{1}{11}$.
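As a quick sanity check (a minimal sketch assuming only the 10:1 false-negative/false-positive costs from the problem; the function and variable names are ours), the snippet below sweeps $p = P(y=+1|\mathbf{x})$ over a grid and confirms that the cost-minimizing prediction flips exactly at $p = \frac{1}{11}$:

```python
import numpy as np

# Costs assumed from the problem statement: false negative = 10, false positive = 1.
C_FN, C_FP = 10.0, 1.0

def expected_cost(pred, p_pos):
    """Expected misclassification cost of predicting `pred` when P(y=+1|x) = p_pos."""
    if pred == +1:
        return C_FP * (1.0 - p_pos)   # wrong only if y = -1
    return C_FN * p_pos               # wrong only if y = +1

for p in np.linspace(0.0, 1.0, 2001):
    best = +1 if expected_cost(+1, p) < expected_cost(-1, p) else -1
    threshold_rule = +1 if p > 1.0 / 11.0 else -1
    # Away from the tie point p = 1/11 the two rules must agree.
    if not np.isclose(p, 1.0 / 11.0, atol=5e-4):
        assert best == threshold_rule, (p, best, threshold_rule)

print("cost-minimizing rule matches sign(P(y=+1|x) - 1/11)")
```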
Q.7
According to the original definition,
$$E^{(2)}_\text{out}(h) = \mathbb{E}_{\mathbf{x}\sim P(\mathbf{x}),\,y\sim P(y|\mathbf{x})}[\![h(\mathbf{x}) \neq y]\!]$$
the expectation is taken over $\mathbf{x}$ and over $y$ given $\mathbf{x}$, so we may evaluate it as an iterated expectation: first over $y$ given $\mathbf{x}$, then over $\mathbf{x}$.
Therefore, we can rewrite the equation as
$$E^{(2)}_\text{out}(h) = \mathbb{E}_{\mathbf{x}\sim P(\mathbf{x})}\left[\mathbb{E}_{y\sim P(y|\mathbf{x})}[\![h(\mathbf{x}) \neq y]\!]\right]$$
As for $\mathbb{E}_{y\sim P(y|\mathbf{x})}[\![h(\mathbf{x}) \neq y]\!]$, it is in fact $P(y\neq h(\mathbf{x})|\mathbf{x})$: for every fixed $\mathbf{x}$, it is the probability that $y \neq h(\mathbf{x})$.
Hence, we now have
$$E^{(2)}_\text{out}(h) = \mathbb{E}_{\mathbf{x} \sim P(\mathbf{x})}[P(y\neq h(\mathbf{x})|\mathbf{x})]$$
Let
$A$ be the event "$y \neq h(\mathbf{x})$",
$B$ be the event "$f(\mathbf{x}) = h(\mathbf{x})$".
Then, by the law of total probability, we get
$$P(A|\mathbf{x}) = P(A|B,\mathbf{x})P(B|\mathbf{x}) + P(A|B^c,\mathbf{x})P(B^c|\mathbf{x})$$
Note that $f(\mathbf{x}) = \arg\max_{y\in\{-1,+1\}}P(y|\mathbf{x}) = \text{sign}\left(P(y=+1|\mathbf{x}) - \frac{1}{2}\right)$; the definition implies that $f(\mathbf{x})$ always chooses the more probable label.
That is to say, for any hypothesis $h$: if $h(\mathbf{x}) \neq f(\mathbf{x})$, then $h(\mathbf{x})$ chooses the less probable label, so its probability of misclassification is at least that of $f(\mathbf{x})$; if $h(\mathbf{x}) = f(\mathbf{x})$, the two probabilities are equal.
Thus, we have shown that
$$P(y\neq f(\mathbf{x})|\mathbf{x}) \leq P(y\neq h(\mathbf{x})|\mathbf{x})$$
holds for any $h$ and $\mathbf{x}$, by the definition of $f(\mathbf{x})$.
Furthermore, we have
$$P(A|B,\mathbf{x}) = P(y\neq f(\mathbf{x})|\mathbf{x}), \qquad P(A|B^c,\mathbf{x}) \geq P(y\neq f(\mathbf{x})|\mathbf{x}).$$
For $B$ (that is, $f(\mathbf{x}) = h(\mathbf{x})$): the event $y \neq h(\mathbf{x})$ coincides with $y \neq f(\mathbf{x})$, so
$$P(A|B,\mathbf{x}) = P(y\neq f(\mathbf{x})|\mathbf{x})$$
For $B^c$ (that is, $f(\mathbf{x}) \neq h(\mathbf{x})$):
$$\because P(y\neq f(\mathbf{x})|\mathbf{x}) \leq P(y\neq h(\mathbf{x})|\mathbf{x}) \qquad \therefore P(A|B^c,\mathbf{x}) = P(y\neq h(\mathbf{x})|f(\mathbf{x})\neq h(\mathbf{x}),\mathbf{x}) \geq P(y\neq f(\mathbf{x})|\mathbf{x})$$
Hence,
$$\begin{aligned} P(A | \mathbf{x}) &= P(y \neq f(\mathbf{x})|\mathbf{x}) \cdot P(B|\mathbf{x}) + P(A|B^c,\mathbf{x})\cdot P(B^c|\mathbf{x}) \\ & \leq P(y \neq f(\mathbf{x})| \mathbf{x}) + P(B^c|\mathbf{x}) \end{aligned}$$
where the inequality holds because $P(B|\mathbf{x}) \leq 1$ and $P(A|B^c,\mathbf{x}) \leq 1$.
Note that
$$\mathbb{E}_{\mathbf{x}\sim P(\mathbf{x})}[P(y\neq f(\mathbf{x})|\mathbf{x})] = E^{(2)}_\text{out}(f)$$
$$\mathbb{E}_{\mathbf{x}\sim P(\mathbf{x})}[P(B^c|\mathbf{x})] = \mathbb{E}_{\mathbf{x}\sim P(\mathbf{x})}[\![f(\mathbf{x})\neq h(\mathbf{x})]\!] = E^{(1)}_\text{out}(h)$$
since, for each fixed $\mathbf{x}$, whether $f(\mathbf{x})\neq h(\mathbf{x})$ is deterministic, so $P(B^c|\mathbf{x})$ is just the indicator $[\![f(\mathbf{x})\neq h(\mathbf{x})]\!]$.
Therefore,
$$\begin{aligned}
E^{(2)}_\text{out}(h) &= \mathbb{E}_{\mathbf{x} \sim P(\mathbf{x})}[P(y\neq h(\mathbf{x})|\mathbf{x})] \\
&\leq \mathbb{E}_{\mathbf{x}\sim P(\mathbf{x})}[P(y\neq f(\mathbf{x})|\mathbf{x})] + \mathbb{E}_{\mathbf{x}\sim P(\mathbf{x})}[\![f(\mathbf{x})\neq h(\mathbf{x})]\!] \\
&= E^{(2)}_\text{out}(f) + E^{(1)}_\text{out}(h)
\end{aligned}$$
Then, we have proved the inequality
$$E^{(2)}_\text{out}(h) \leq E^{(1)}_\text{out}(h) + E^{(2)}_\text{out}(f)$$
Thus we complete the proof.
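A small numerical illustration (not part of the proof; the finite input space, random distributions, and variable names are arbitrary choices) that the inequality holds when the expectations are computed exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

for _ in range(1000):
    n_points = 8
    p_x = rng.dirichlet(np.ones(n_points))        # P(x) over a finite input space
    p_pos = rng.uniform(size=n_points)            # P(y = +1 | x) for each x
    f = np.where(p_pos >= 0.5, 1, -1)             # Bayes-optimal target f(x)
    h = rng.choice([-1, 1], size=n_points)        # an arbitrary hypothesis h(x)

    # E_out^(2): probability that y != h(x) under the noisy labels.
    e2_h = np.sum(p_x * np.where(h == 1, 1 - p_pos, p_pos))
    e2_f = np.sum(p_x * np.where(f == 1, 1 - p_pos, p_pos))
    # E_out^(1): probability that h disagrees with the deterministic target f.
    e1_h = np.sum(p_x * (h != f))

    assert e2_h <= e1_h + e2_f + 1e-12

print("E_out^(2)(h) <= E_out^(1)(h) + E_out^(2)(f) held in all trials")
```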
Q.8
According to Lecture 9 p.10, for the original data, we have
$$w_{LIN} = (X^TX)^{-1}X^Ty$$
Now, change every $x_0$ from $1$ to $1126$. This amounts to multiplying the first column of $X$ by $1126$, i.e., right-multiplying $X$ by the diagonal matrix
$$D = \begin{pmatrix}
1126 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{pmatrix}$$
Hence, the new data matrix is $XD$, and we have $w_{LUCKY}$:
$$w_{LUCKY} = ((XD)^T(XD))^{-1}(XD)^Ty$$
By expanding the equation, we learn that
$$w_{LUCKY} = (D^TX^TXD)^{-1}D^TX^Ty$$
Note that $D$ is a diagonal matrix with nonzero diagonal entries, and thus has the following properties:
$D^T = D$
$D$ is invertible, with $D^{-1} = \begin{pmatrix} \frac{1}{1126} & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1\end{pmatrix}$
Hence, back to $w_{LUCKY}$,
$$\begin{aligned} w_{LUCKY} &= ((XD)^T(XD))^{-1}(XD)^Ty \\
&= \left[ D^T(X^TX)D\right]^{-1} D^TX^Ty \\
&= \left[ D(X^TX)D\right]^{-1} DX^Ty \qquad (\because D^T=D)\\
&= D^{-1}(X^TX)^{-1}D^{-1}DX^Ty \\
&= D^{-1}(X^TX)^{-1}X^Ty
\end{aligned}$$
As for $w_{LIN} = (X^TX)^{-1}X^Ty$, we can therefore rewrite $w_{LUCKY}$ as:
$$w_{LUCKY} = D^{-1}w_{LIN}$$
Then, we have
$$w_{LIN} = Dw_{LUCKY}$$
Hence, we complete the proof.
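As a quick numerical sanity check (a sketch on an arbitrary random dataset; the variable names are ours), we can scale the $x_0$ column by $1126$ and confirm $w_{LIN} = D\,w_{LUCKY}$:

```python
import numpy as np

rng = np.random.default_rng(1126)
N, d = 50, 3

# Design matrix with x_0 = 1 in the first column.
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = rng.normal(size=N)

D = np.diag([1126.0] + [1.0] * d)      # scales only the x_0 feature
X_lucky = X @ D                        # every x_0 becomes 1126

w_lin = np.linalg.solve(X.T @ X, X.T @ y)
w_lucky = np.linalg.solve(X_lucky.T @ X_lucky, X_lucky.T @ y)

# The claimed relation: w_LIN = D w_LUCKY (equivalently, w_LUCKY = D^{-1} w_LIN).
assert np.allclose(w_lin, D @ w_lucky)
print("w_LIN == D @ w_LUCKY verified numerically")
```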
Q.9
Note that we now use the new logistic function $\tilde{\theta}(s) = \frac{1}{2}\left(\frac{s}{\sqrt{1+s^2}}+1\right)$ in place of the usual $\theta(s) = \frac{1}{1+\exp(-s)}$, and write $\tilde{h}(\mathbf{x}_n) = \tilde{\theta}(\mathbf{w}^T\mathbf{x}_n)$.
First, we need to construct a likelihood. Hence, we must clarify that:
$$\tilde{h}(\mathbf{x}_n) = P(y_n = +1 | \mathbf{x}_n)$$
$$1 - \tilde{h}(\mathbf{x}_n) = P(y_n = -1 | \mathbf{x}_n)$$
Then, for a sample $(\mathbf{x}_n, y_n)$ where $y_n \in \{-1, +1\}$, we get the likelihood of a single example as
$$P(y_n|\mathbf{x}_n) = \tilde{h}(\mathbf{x}_n)^{\frac{y_n+1}{2}}(1-\tilde{h}(\mathbf{x}_n))^{\frac{1-y_n}{2}}$$
(The exponent $\frac{y_n+1}{2}$ is $1$ when $y_n = +1$ and $0$ when $y_n = -1$, so exactly one of the two factors is active.)
Second, according to p.11 of the handout, the log-likelihood is
$$\ln L = \sum_{n=1}^N \left[\frac{y_n+1}{2}\ln\tilde{h}(\mathbf{x}_n) + \frac{1-y_n}{2}\ln(1-\tilde{h}(\mathbf{x}_n))\right]$$
Third, construct $\tilde{E}_{in}(\mathbf{w})$ as:
$$\tilde{E}_{in}(\mathbf{w}) = -\frac{1}{N}\ln L = -\frac{1}{N}\sum_{n=1}^N \left[\frac{y_n+1}{2}\ln\tilde{h}(\mathbf{x}_n) + \frac{1-y_n}{2}\ln(1-\tilde{h}(\mathbf{x}_n))\right]$$
Fourth, since we want the gradient, we differentiate with respect to $\mathbf{w}$:
$$\frac{\partial \tilde{E}_{in}}{\partial \mathbf{w}} = -\frac{1}{N}\sum_{n=1}^N \left[\frac{y_n+1}{2}\frac{1}{\tilde{h}(\mathbf{x}_n)}\frac{\partial \tilde{h}(\mathbf{x}_n)}{\partial \mathbf{w}} - \frac{1-y_n}{2}\frac{1}{1-\tilde{h}(\mathbf{x}_n)}\frac{\partial \tilde{h}(\mathbf{x}_n)}{\partial \mathbf{w}}\right]$$
where
$$\frac{\partial \tilde{h}(\mathbf{x}_n)}{\partial \mathbf{w}} = \frac{1}{2}\frac{\partial}{\partial \mathbf{w}}\left(\frac{\mathbf{w}^T\mathbf{x}_n}{\sqrt{1+(\mathbf{w}^T\mathbf{x}_n)^2}}+1\right) = \frac{1}{2}\frac{\mathbf{x}_n}{(1+(\mathbf{w}^T\mathbf{x}_n)^2)^{3/2}}$$
Last but not least, combining the results of step four (note the extra factor $\frac{1}{2}$ coming from $\frac{\partial \tilde{h}}{\partial \mathbf{w}}$), we get
$$\nabla \tilde{E}_{in}(\mathbf{w}) = -\frac{1}{4N}\sum_{n=1}^N \left[\frac{y_n+1}{\tilde{h}(\mathbf{x}_n)} - \frac{1-y_n}{1-\tilde{h}(\mathbf{x}_n)}\right]\frac{\mathbf{x}_n}{(1+(\mathbf{w}^T\mathbf{x}_n)^2)^{3/2}}$$
Hence, we complete the proof.
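Finally, the gradient formula, including the $\frac{1}{4N}$ constant, can be checked against finite differences on a random dataset (a sketch added for verification; the dataset and helper names are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
N, d = 20, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # x_0 = 1
y = rng.choice([-1.0, 1.0], size=N)
w = rng.normal(size=d + 1)

def h_tilde(w, X):
    s = X @ w
    return 0.5 * (s / np.sqrt(1.0 + s**2) + 1.0)

def E_in(w):
    h = h_tilde(w, X)
    return -np.mean((y + 1) / 2 * np.log(h) + (1 - y) / 2 * np.log(1 - h))

def grad_E_in(w):
    s = X @ w
    h = h_tilde(w, X)
    coeff = (y + 1) / h - (1 - y) / (1 - h)        # bracket from the derivation
    return -(1.0 / (4 * N)) * (X * (coeff / (1 + s**2) ** 1.5)[:, None]).sum(axis=0)

# Central finite-difference check of the analytic gradient.
eps = 1e-6
num_grad = np.array([
    (E_in(w + eps * e) - E_in(w - eps * e)) / (2 * eps)
    for e in np.eye(d + 1)
])
assert np.allclose(grad_E_in(w), num_grad, atol=1e-6)
print("analytic gradient matches finite differences")
```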