Latex TRY


Q.5

  1. Recall the definition of the VC dimension: $d_{vc}(\mathcal{H})$ is the size of the largest set of points that $\mathcal{H}$ can shatter.

  2. Consider a set of points $S$ of size $n = d_{vc}(\mathcal{H}_1 \cup \mathcal{H}_2)$ that is shattered by $\mathcal{H}_1 \cup \mathcal{H}_2$.

  3. For any dichotomy of $S$, there must exist an $h \in \mathcal{H}_1 \cup \mathcal{H}_2$ that realizes this dichotomy.

  4. Split $S$ into two subsets:

    1. $S_1$: points for which the dichotomy can be realized by hypotheses in $\mathcal{H}_1$
    2. $S_2$: points for which the dichotomy can only be realized by hypotheses in $\mathcal{H}_2$

  5. Clearly, $|S_1| \leq d_{vc}(\mathcal{H}_1)$, because $\mathcal{H}_1$ can shatter at most $d_{vc}(\mathcal{H}_1)$ points.

  6. Similarly, $|S_2| \leq d_{vc}(\mathcal{H}_2)$.

  7. Since $S = S_1 \cup S_2$ and the size of a union is at most the sum of the sizes, we have:

$$n = |S| = |S_1 \cup S_2| \leq |S_1| + |S_2| \leq d_{vc}(\mathcal{H}_1) + d_{vc}(\mathcal{H}_2)$$

  8. Therefore, $d_{vc}(\mathcal{H}_1 \cup \mathcal{H}_2) = n \leq d_{vc}(\mathcal{H}_1) + d_{vc}(\mathcal{H}_2)$.
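To make the counting argument concrete, here is a minimal Python sketch that brute-forces the VC dimension of small finite hypothesis sets over a toy domain; the four-point domain and the two hypothesis sets (positive and negative rays) are invented purely for illustration.

```python
from itertools import combinations

def vc_dimension(hypotheses, n_points):
    """Brute-force VC dimension over the finite domain {0, ..., n_points-1}.
    Each hypothesis is a tuple of +/-1 labels, one entry per point."""
    best = 0
    for k in range(1, n_points + 1):
        for subset in combinations(range(n_points), k):
            patterns = {tuple(h[i] for i in subset) for h in hypotheses}
            if len(patterns) == 2 ** k:   # this subset is shattered
                best = k
                break
    return best

n = 4
H1 = [tuple(+1 if i >= t else -1 for i in range(n)) for t in range(n + 1)]  # positive rays
H2 = [tuple(-1 if i >= t else +1 for i in range(n)) for t in range(n + 1)]  # negative rays

d1, d2 = vc_dimension(H1, n), vc_dimension(H2, n)
d_union = vc_dimension(H1 + H2, n)
print(d1, d2, d_union)        # 1 1 2 on this toy domain
assert d_union <= d1 + d2     # the bound holds in this example
```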

Q.6

For a given $\mathbf{x}$, we now suppose that a false negative is 10 times more costly than a false positive. Hence, we have the following expected costs:

  1. If we predict $+1$, the expected cost of misclassification is $E_{1} = 1 \cdot P(y=-1|\mathbf{x})$.
  2. If we predict $-1$, the expected cost of misclassification is $E_{-1} = 10 \cdot P(y=+1|\mathbf{x})$.

According to the description of the problem, we can construct the following rules:

  • We should predict $+1$ when $E_{1} < E_{-1}$. That is, predict $+1$ when $P(y=-1|\mathbf{x}) < 10P(y=+1|\mathbf{x})$.
  • We should predict $-1$ when $E_{1} > E_{-1}$. That is, predict $-1$ when $P(y=-1|\mathbf{x}) > 10P(y=+1|\mathbf{x})$.

Hence, we can further simplify the first condition:

$$\begin{aligned}&P(y=-1|\mathbf{x}) < 10P(y=+1|\mathbf{x}) \\&1 - P(y=+1|\mathbf{x}) < 10P(y=+1|\mathbf{x})\\&1 < 11P(y=+1|\mathbf{x})\\&P(y=+1|\mathbf{x}) > \frac{1}{11}\end{aligned}$$

Therefore, we can restate our rule as:

  • Predict $+1$ when $P(y=+1|\mathbf{x}) > \frac{1}{11}$
  • Predict $-1$ when $P(y=+1|\mathbf{x}) < \frac{1}{11}$

Thus, we get $f_{\text{MKT}}(\mathbf{x}) = \text{sign}\left(P(y=+1|\mathbf{x}) - \frac{1}{11}\right)$.

What's more, this shows that $\alpha = \frac{1}{11}$.
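As a small numerical sketch of this rule (the grid of probabilities is illustrative, and the costs are the 1-vs-10 costs stated above), we can compare the $\frac{1}{11}$-threshold prediction against a brute-force choice of the cheaper prediction:

```python
import numpy as np

def expected_cost(prediction, p_pos):
    """Expected misclassification cost when P(y=+1|x) = p_pos:
    a false positive costs 1, a false negative costs 10."""
    return 1.0 * (1.0 - p_pos) if prediction == +1 else 10.0 * p_pos

for p in np.linspace(0.01, 0.99, 99):                  # illustrative grid of P(y=+1|x)
    rule = +1 if p > 1.0 / 11.0 else -1                # f_MKT(x) = sign(P(y=+1|x) - 1/11)
    best = min((+1, -1), key=lambda g: expected_cost(g, p))
    assert expected_cost(rule, p) == expected_cost(best, p)

print("the 1/11 threshold attains the minimum expected cost on every grid point")
```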

Q.7

According to the original definition,

$$E^{(2)}_\text{out}(h) = \mathbb{E}_{\mathbf{x}\sim P(\mathbf{x}),\,y\sim P(y|\mathbf{x})}[\![h(\mathbf{x}) \neq y]\!]$$

the expectation is taken over $\mathbf{x}$ and, given $\mathbf{x}$, over $y$. Therefore, we can rewrite the equation as

$$E^{(2)}_\text{out}(h) = \mathbb{E}_{\mathbf{x}\sim P(\mathbf{x})}\left[\mathbb{E}_{y\sim P(y|\mathbf{x})}[\![h(\mathbf{x}) \neq y]\!]\right]$$

As for $\mathbb{E}_{y\sim P(y|\mathbf{x})}[\![h(\mathbf{x}) \neq y]\!]$, it is in fact $P(y\neq h(\mathbf{x})|\mathbf{x})$: for every fixed $\mathbf{x}$, it is the probability that $y \neq h(\mathbf{x})$. Hence, we now have

$$E^{(2)}_\text{out}(h) = \mathbb{E}_{\mathbf{x} \sim P(\mathbf{x})}\left[P(y\neq h(\mathbf{x})|\mathbf{x})\right]$$

Let

  • $A$ be the event "$y \neq h(\mathbf{x})$"
  • $B$ be the event "$f(\mathbf{x}) = h(\mathbf{x})$"

Then, by the Law of total probability, we get

$$P(A|\mathbf{x}) = P(A|B,\mathbf{x})P(B|\mathbf{x}) + P(A|B^c,\mathbf{x})P(B^c|\mathbf{x})$$

Note that $f(\mathbf{x}) = \arg\max_{y\in\{-1,+1\}}P(y|\mathbf{x}) = \text{sign}\left(P(y=+1|\mathbf{x}) - \frac{1}{2}\right)$; the definition implies that $f(\mathbf{x})$ always chooses the more probable label. That is to say, for any hypothesis $h$, if $h(\mathbf{x}) \neq f(\mathbf{x})$, then $h(\mathbf{x})$ chooses the less probable label, which means its probability of misclassification is at least that of $f(\mathbf{x})$. Meanwhile, if $h(\mathbf{x}) = f(\mathbf{x})$, the two probabilities are equal. Thus, $P(y\neq f(\mathbf{x})|\mathbf{x}) \leq P(y\neq h(\mathbf{x})|\mathbf{x})$ holds for any $h$ and $\mathbf{x}$ by the definition of $f(\mathbf{x})$.
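As a concrete illustration (the numbers here are chosen only for this example): if $P(y=+1|\mathbf{x}) = 0.7$, then $f(\mathbf{x}) = +1$ and

$$P(y\neq f(\mathbf{x})|\mathbf{x}) = 0.3 \leq 0.7 = P(y\neq h(\mathbf{x})|\mathbf{x}) \quad \text{for any } h \text{ with } h(\mathbf{x}) = -1.$$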

Furthermore, we learn

  • $P(A|B,\mathbf{x}) = P(y\neq f(\mathbf{x})|\mathbf{x})\quad \because$ on $B$, $h(\mathbf{x}) = f(\mathbf{x})$
  • $P(A|B^c,\mathbf{x}) \geq P(y\neq f(\mathbf{x})|\mathbf{x})$
    • For $B$ (that is, $f(\mathbf{x}) = h(\mathbf{x})$):
      • $P(A|B,\mathbf{x}) = P(y\neq f(\mathbf{x})|\mathbf{x})$
    • For $B^c$ (that is, $f(\mathbf{x}) \neq h(\mathbf{x})$):
      • $\because P(y\neq f(\mathbf{x})|\mathbf{x}) \leq P(y\neq h(\mathbf{x})|\mathbf{x})$
      • $\therefore P(A|B^c,\mathbf{x}) = P(y\neq h(\mathbf{x})\mid f(\mathbf{x})\neq h(\mathbf{x}),\mathbf{x}) \geq P(y\neq f(\mathbf{x})|\mathbf{x})$

Hence,

$$\begin{aligned} P(A | \mathbf{x}) &= P(A|B,\mathbf{x})\,P(B|\mathbf{x}) + P(A|B^c,\mathbf{x})\,P(B^c|\mathbf{x}) \\ &= P(y \neq f(\mathbf{x})|\mathbf{x})\,P(B|\mathbf{x}) + P(A|B^c,\mathbf{x})\,P(B^c|\mathbf{x}) \\ &\leq P(y \neq f(\mathbf{x})| \mathbf{x}) + P(B^c|\mathbf{x}) \end{aligned}$$

where the last step uses $P(B|\mathbf{x}) \leq 1$ and $P(A|B^c,\mathbf{x}) \leq 1$.

Note that

  • $\mathbb{E}_{\mathbf{x}\sim P(\mathbf{x})}[P(y\neq f(\mathbf{x})|\mathbf{x})] = E^{(2)}_\text{out}(f)$
  • $\mathbb{E}_{\mathbf{x}\sim P(\mathbf{x})}[P(f(\mathbf{x})\neq h(\mathbf{x})|\mathbf{x})] = E^{(1)}_\text{out}(h)$

Therefore,

$$\begin{aligned} E^{(2)}_\text{out}(h) &= \mathbb{E}_{\mathbf{x} \sim P(\mathbf{x})}[P(y\neq h(\mathbf{x})|\mathbf{x})] \\ &\leq \mathbb{E}_{\mathbf{x}\sim P(\mathbf{x})}[P(y\neq f(\mathbf{x})|\mathbf{x})] + \mathbb{E}_{\mathbf{x}\sim P(\mathbf{x})}[P(f(\mathbf{x})\neq h(\mathbf{x})|\mathbf{x})] \\ &= E^{(2)}_\text{out}(f) + E^{(1)}_\text{out}(h) \end{aligned}$$

This proves the inequality $E^{(2)}_\text{out}(h) \leq E^{(1)}_\text{out}(h) + E^{(2)}_\text{out}(f)$.

Thus we complete the proof.
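To sanity-check the inequality numerically, here is a minimal Python sketch on a small discrete input space; the domain size, the distributions, and the random hypothesis $h$ are all invented for illustration.

```python
import numpy as np

def e_out2(g, p_x, p_pos):
    """E_out^(2)(g) = E_x[ P(y != g(x) | x) ] on a finite domain."""
    return np.sum(p_x * np.where(g == 1, 1.0 - p_pos, p_pos))

rng = np.random.default_rng(0)
n_points = 8
for _ in range(1000):
    p_x = rng.dirichlet(np.ones(n_points))       # P(x) over the finite domain
    p_pos = rng.random(n_points)                 # P(y=+1|x) for each point
    f = np.where(p_pos >= 0.5, 1, -1)            # the target f(x) defined above
    h = rng.choice([-1, 1], size=n_points)       # an arbitrary hypothesis
    e1_h = np.sum(p_x * (h != f))                # E_out^(1)(h): disagreement with f
    assert e_out2(h, p_x, p_pos) <= e1_h + e_out2(f, p_x, p_pos) + 1e-12

print("E_out^(2)(h) <= E_out^(1)(h) + E_out^(2)(f) held in all random trials")
```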

Q.8

According to Lecture 9 p.10, for the original data we have $w_{LIN} = (X^TX)^{-1}X^Ty$. Now, change every $x_0$ to $1126$ instead of $1$. This amounts to multiplying the first column of $X$ by $1126$, i.e., multiplying $X$ on the right by the diagonal matrix

$$D = \begin{pmatrix} 1126 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}$$

Hence, the new data matrix is $XD$ and we have $w_{LUCKY}$:

$$w_{LUCKY} = ((XD)^T(XD))^{-1}(XD)^Ty$$

By expanding the equation, we learn that

$$w_{LUCKY} = \left(D^TX^TXD\right)^{-1}D^TX^Ty$$

Note that $D$ is a diagonal, invertible matrix and thus has the following properties:

  • $D^T = D$
  • $D^{-1}$ exists and is also diagonal

Therefore,

$$D^{-1} = \begin{pmatrix} \frac{1}{1126} & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1\end{pmatrix}$$

Hence, back to wLUCKYw_{LUCKY},

$$\begin{aligned} w_{LUCKY} &= ((XD)^T(XD))^{-1}(XD)^Ty \\ &= \left(D^T X^T X D\right)^{-1} D^T X^T y \\ &= \left(D X^T X D\right)^{-1} D X^T y \qquad (\because D^T = D) \\ &= D^{-1}\left(X^TX\right)^{-1} D^{-1} D X^T y \\ &= D^{-1}\left(X^TX\right)^{-1} X^T y \end{aligned}$$

Recalling that $w_{LIN} = (X^TX)^{-1}X^Ty$, we can rewrite $w_{LUCKY}$ as:

$$w_{LUCKY} = D^{-1}w_{LIN}$$

Then, we have

$$w_{LIN} = Dw_{LUCKY}$$

Hence, we complete the proof.
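As a quick numerical illustration of $w_{LIN} = Dw_{LUCKY}$, here is a minimal numpy sketch; the data shapes and the use of `np.linalg.pinv` (which equals $(X^TX)^{-1}X^T$ when $X$ has full column rank) are choices made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1126)
N, d = 50, 5
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, d))])  # original data, x_0 = 1
y = rng.standard_normal(N)

D = np.eye(d + 1)
D[0, 0] = 1126.0                 # the diagonal matrix D from above
X_lucky = X @ D                  # scales the first column: x_0 = 1 -> x_0 = 1126

w_lin = np.linalg.pinv(X) @ y          # w_LIN
w_lucky = np.linalg.pinv(X_lucky) @ y  # w_LUCKY

assert np.allclose(w_lin, D @ w_lucky)                 # w_LIN = D w_LUCKY
assert np.allclose(w_lucky, np.linalg.inv(D) @ w_lin)  # w_LUCKY = D^{-1} w_LIN
```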

Q.9

Note that we have the new logistic function $\tilde{\theta}(s) = \frac{1}{2}\left(\frac{s}{\sqrt{1+s^2}}+1\right)$, with $\tilde{h}(\mathbf{x}) = \tilde{\theta}(\mathbf{w}^T\mathbf{x})$. First, we need to construct a likelihood. Hence, we must clarify that:

  • $\tilde{h}(\mathbf{x}_n) = P(y_n = +1 | \mathbf{x}_n)$
  • $1 - \tilde{h}(\mathbf{x}_n) = P(y_n = -1 | \mathbf{x}_n)$

Then, for a sample $(\mathbf{x}_n, y_n)$ where $y_n \in \{-1, +1\}$, we get the likelihood of a single example as

$$P(y_n|\mathbf{x}_n) = \tilde{h}(\mathbf{x}_n)^{\frac{y_n+1}{2}}\left(1-\tilde{h}(\mathbf{x}_n)\right)^{\frac{1-y_n}{2}}$$
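As a quick check of the exponents, plugging in the two possible labels gives

$$y_n = +1:\ P(y_n|\mathbf{x}_n) = \tilde{h}(\mathbf{x}_n)^{1}\left(1-\tilde{h}(\mathbf{x}_n)\right)^{0} = \tilde{h}(\mathbf{x}_n), \qquad y_n = -1:\ P(y_n|\mathbf{x}_n) = 1-\tilde{h}(\mathbf{x}_n)$$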

Second, according to p.11 on the handout, we have

$$\ln L = \sum_{n=1}^N \left[\frac{y_n+1}{2}\ln\tilde{h}(\mathbf{x}_n) + \frac{1-y_n}{2}\ln\left(1-\tilde{h}(\mathbf{x}_n)\right)\right]$$

Third, construct $\tilde{E}_{in}(\mathbf{w})$ as:

$$\tilde{E}_{in}(\mathbf{w}) = -\frac{1}{N}\ln L = -\frac{1}{N}\sum_{n=1}^N \left[\frac{y_n+1}{2}\ln\tilde{h}(\mathbf{x}_n) + \frac{1-y_n}{2}\ln\left(1-\tilde{h}(\mathbf{x}_n)\right)\right]$$

Fourth, to obtain the gradient, we take the partial derivative with respect to $\mathbf{w}$:

$$\frac{\partial \tilde{E}_{in}}{\partial \mathbf{w}} = -\frac{1}{N}\sum_{n=1}^N \left[\frac{y_n+1}{2}\frac{1}{\tilde{h}(\mathbf{x}_n)}\frac{\partial \tilde{h}(\mathbf{x}_n)}{\partial \mathbf{w}} - \frac{1-y_n}{2}\frac{1}{1-\tilde{h}(\mathbf{x}_n)}\frac{\partial \tilde{h}(\mathbf{x}_n)}{\partial \mathbf{w}}\right]$$

$$\frac{\partial \tilde{h}(\mathbf{x}_n)}{\partial \mathbf{w}} = \frac{1}{2}\frac{\partial}{\partial \mathbf{w}}\left(\frac{\mathbf{w}^T\mathbf{x}_n}{\sqrt{1+(\mathbf{w}^T\mathbf{x}_n)^2}}+1\right) = \frac{1}{2}\frac{\mathbf{x}_n}{\left(1+(\mathbf{w}^T\mathbf{x}_n)^2\right)^{3/2}}
$$
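The last equality uses the derivative of $s/\sqrt{1+s^2}$ together with the chain rule ($s = \mathbf{w}^T\mathbf{x}_n$, $\partial s/\partial\mathbf{w} = \mathbf{x}_n$); spelled out,

$$\frac{d}{ds}\frac{s}{\sqrt{1+s^2}} = \frac{\sqrt{1+s^2} - s\cdot\frac{s}{\sqrt{1+s^2}}}{1+s^2} = \frac{(1+s^2) - s^2}{(1+s^2)^{3/2}} = \frac{1}{(1+s^2)^{3/2}}$$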

Last but not least, combining the results of the fourth step, we get

$$\nabla \tilde{E}_{in}(\mathbf{w}) = -\frac{1}{4N}\sum_{n=1}^N \left[\frac{y_n+1}{\tilde{h}(\mathbf{x}_n)} - \frac{1-y_n}{1-\tilde{h}(\mathbf{x}_n)}\right]\frac{\mathbf{x}_n}{\left(1+(\mathbf{w}^T\mathbf{x}_n)^2\right)^{3/2}}$$

Hence, we complete the proof.
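To double-check the gradient formula, here is a minimal finite-difference sketch on random data; the data, seed, and step size are illustrative, and $\tilde{\theta}(s) = \frac{1}{2}\left(s/\sqrt{1+s^2}+1\right)$ is coded exactly as used in the derivation above.

```python
import numpy as np

rng = np.random.default_rng(7)
N, d = 20, 3
X = rng.standard_normal((N, d))                 # illustrative inputs x_n
y = rng.choice([-1.0, 1.0], size=N)             # labels in {-1, +1}

def h_tilde(w):
    s = X @ w
    return 0.5 * (s / np.sqrt(1.0 + s**2) + 1.0)        # h~(x_n) = theta~(w^T x_n)

def e_in(w):
    h = h_tilde(w)
    return -np.mean((y + 1) / 2 * np.log(h) + (1 - y) / 2 * np.log(1 - h))

def grad_e_in(w):
    s = X @ w
    h = h_tilde(w)
    coef = (y + 1) / h - (1 - y) / (1 - h)
    return -(1.0 / (4 * N)) * X.T @ (coef / (1.0 + s**2) ** 1.5)

w0 = rng.standard_normal(d)
eps = 1e-6
numeric = np.array([(e_in(w0 + eps * e) - e_in(w0 - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
assert np.allclose(numeric, grad_e_in(w0), atol=1e-6)   # matches the closed form
```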