Old Examq QuestionsVariable elimination: a numerical exampleConsider the following Bayes net, modeling the behavior of a group of students and their professor. The variables are the following:
We want to compute {$P(A|K) = \frac{P(A,K)}{P(K)}$}. Therefore we need to compute {$P(A,K)$}, i.e. eliminate S, G, W and C. We have the following: {$P(K,G,W,C,S,A) = P(K)P(G)P(W|K,G)P(C|G)P(S|W,C)P(A|S,C)$} We write the marginalization sum and push the sum symbols as far inside as possible: as explained in the BN lecture, this saves computations and is the whole point of variable elimination. {$ \; $ \begin{align*} P(A,K) = P(K) & \sum_c \sum_w \sum_g P(G=g)P(C=c|G=g)P(W=w|K,G=g) \\ & \sum_s P(S=s | W=w,C=c) P(A|S=s,C=c) \end{align*} $ \; $} When eliminating S, we compute the inner-most sum: the first factor {$g1(W,C,A)$} is computed as a CPT (conditional probability table) over W, C and A:
We repeat this process for G, W and C: notice that CPT’s for {$k$} binary variables store {$2^k$} values. We could very well directly compute {$P(K,G,W,C,S,A)$} as a big CPT and sum appropriate rows to obtain desired marginals, but variable elimination saves computations, and scales to much bigger networks (see example in the BN lecture). Questions from past examsNote that we covered somewhat different material this year than previous years, so some of the material below (e.g. “halving”) will not be familiar to you and will not be on the final. Shorts (2009 final)
(:ans:) Irrelevant features can hurt with small datasets. (:ansend:)
(:ans:) {$y = x_1 \; {\rm XOR} \; x_2 $} (:ansend:)
(:ans:) If her classifier is accurate and make different errors than your current ensemble, start negotiating with her. (:ansend:) EM this HMM (2008 final)Consider a simple two-step HMM, with hidden states {$ (Z_1,Z_2) $} and observed variables {$ (O_1,O_2) $}. Let us denote the parameters of the HMM as: {$ \; $\begin{align*} \theta_{Z_1=z_1} = & P(Z_1=z_1) \\ \theta_{Z_2=z_2|Z_1=z_1} = & P(Z_2=z_2|Z_1=z_1) \\ \theta_{O_i=o_i|Z_i =z_i}= & P(O_i=o_i | Z_i = z_i), \;\; i=1,2 \end{align*}$ \; $}
(:ans:) {$\log P(z_1,z_2,o_1,o_2) = \log(\theta_{Z_1=z_1}\theta_{O_1=o_1|Z_1=z_1}\theta_{Z_2=z_2|Z_1=z_1}\theta_{O_2=o_2|Z_2=z_2})$} (:ansend:)
(:ans:) {$\sum_{j=1}^m \log\theta_{Z_1=z_1^j}\theta_{O_1=o_1^j|Z_1=z_1^j}\theta_{Z_2=z_2^j|Z_1=z_1^j}\theta_{O_2=o_2^j|Z_2=z_2^j})$} (:ansend:)
(:ans:) {$\theta_{Z_2=z_2|Z_1=z_1} = \frac{\sum_j \mathbf{1}(z_1^j=z_1,z_2^j=z_2)}{\sum_j \mathbf{1}(z_1^j=z_1)}$} (:ansend:)
(:ans:) {$\theta_{O_i=o_i|Z_i=z_i} = \frac{\sum_j \sum_i \mathbf{1}(z_i^j=z_i,o_i^j=o_i)}{\sum_j \sum_i \mathbf{1}(z_i^j=z_i)} $} (:ansend:)
(:ans:) {$Q(Z_1=z_1,Z_2=z_2|o_1^j,o_2^j) = \frac{\theta_{Z_1=z_1}\theta_{O_1=o_1^j|Z_1=z_1}\theta_{Z_2=z_2|Z_1=z_1}\theta_{O_2=o_2^j|Z_2=z_2}}{\sum_{z_1,z_2} \theta_{Z_1=z_1}\theta_{O_1=o_1^j|Z_1=z_1}\theta_{Z_2=z_2|Z_1=z_1}\theta_{O_2=o_2^j|Z_2=z_2}}$} (:ansend:)
(:ans:) {$\theta_{Z_2=z_2|Z_1=z_1} = \frac{\sum_j Q(Z_1=z_1,Z_2=z_2|o_1^j,o_2^j)}{\sum_j Q(Z_1=z_1|o_1^j,o_2^j)} \\ \\ (Q(Z_1=z_1|o_1^j,o_2^j) = \sum_{z_2} Q(Z_1=z_1,Z_2=z_2|o_1^j,o_2^j))$} (:ansend:)
(:ans:) {$\theta_{O_i=o_i|Z_i=z_i} = \frac{\sum_j \sum_i Q(Z_i=z_i | o_1^j,o_2^j)\mathbf{1}(o_i^j=o_i)}{\sum_j \sum_i Q(Z_i=z_i | o_1^j, o_2^j)} $} (:ansend:) Online Learning (2010 final)In online learning, the halving algorithm is used to make predictions when there are {$N$} experts and it is assumed that one of the experts makes no mistakes. You’ve seen that the halving algorithm will make at most {$\log_2 N$} mistakes. However, if there are no perfect experts, then this algorithm will not work: since every expert might make mistakes, they will all eventually be eliminated. Instead of throwing away the experts that make mistakes, we can use the weighted majority algorithm and weight each expert by some {$0\leq \beta \leq 1$} if they make a mistakes. The algorithm is: At each time step {$t = 1$} to {$T$}:
{$\hat{y_t} = \arg\max_y \sum_i w_{i,t}{\bf 1}[h_{i,t}=y]$}
{$ \; $ w_{i,t+1} = \begin{cases} \beta w_{i,t} & \text{if }h_{i,t}\not=y_t \\ w_{i,t} & \text{if }h_{i,t}=y_t \end{cases}$ \; $} For the following questions, assume {$y \in \{0,1\}$}.
(:ans:) {$\beta = 0$} (:ansend:)
(:ans:) If a mistake is made, then there must have been at least half the weight voting for it. The new weight is at most {$1/2 + (1/3)(1/2) = 3/6 + 1/6 = 4/6 = 2/3$}, giving {$W_{t+1} \leq 2/3W_{t}$}. (:ansend:)
(:ans:) The total weight at the beginning of the algorithm is {$n$}. Each time a mistake is made, the weight is reduced by at least 1/3. Since {$M$} mistakes were make, this happened {$M$} times, and so {$W_{t} \leq N(2/3)(2/3)\dots(2/3) = N(2/3)^M$}. (:ansend:)
{$ M \leq \frac{1}{\log_3 3/2}[m+\log_3 N].$} (:ans:) Starting with {$(1/3)^m \leq N(2/3)^M$}, we can take the log of both sides, giving {$ m\log_3(1/3) \leq \log_3 N + M\log_3(2/3).$} Negating both sides gives {$m \geq -\log_3 N + M\log_3(3/2),$} and rearranging gives the desired result: {$M \leq \frac{1}{\log_3 3/2}\left[m+\log_3 N\right].$} (:ansend:) |