FAQ /
Homework 2On this page… (hide) Plotting and LATEX FiguresWhen we request plots in the homework, it is probably easiest to make them in some program like matlab, then use a figure environment to include them in your latex. For example, in matlab you can save your plot to a .png file, say “plot.png”, then include it in the latex like so: \begin{figure} Problem 11.1Your expression for complexity should be big-O of an expression containing {$m$}s and {$n$}s. Note: don’t assume {$m \ll n$}; your big-O expression should include the term with the highest power of {$n$} and also the term with the highest power of {$m$}, even if the latter term isn’t the same as the first. 1.2This part is not meant to be tricky --- we just want you to write a summation illustrating which elements of the {$H$} matrix are multiplied with which elements of {$Y$} to produce the {$i$}-th element in the {$\hat{Y}$} vector. 1.3Take a look at the hint Ben sent to the class mailing list. Remember that {$(Xw-Z)^T(Xw-Z) = \sum_j (w^T x_j - Z_j)^2$}. 1.4The solution here should be very similar in form to that of part 1.2. Problem 22.1.1We’re looking for you to state the expressions for the objective functions here, and then answer: Is Naive Bayes generative or discriminative? What about logistic regression? This is the kind of “difference” we would like you to state. 2.1.2In class we went over how to show that the decision boundary for a restricted case of Gaussian Naive Bayes is linear just like the decision boundary for logistic regression. In this question you are being asked to show something similar, except that instead of having continuous X variables, you are working with binary X. So, you are trying to show that the decision boundary for Bernoulli Naive Bayes is linear in the {$X_i$} just like the boundary for logistic regression. What we would like in the end is an expression, in terms of {$\pi$} and {$\theta_{ij}$} and {$X_i$}, where you can identify which part is analogous to the constant offset {$w_0$} of logistic regression, and which part is analogous to the other weights {$w_i$}. In case you are confused about Bernoulli: A Bernoulli random variable models an event that can take on only one of two possible values. If we say that {$Y \sim \mathrm{Bernoulli}(\pi)$}, then we can conclude that:
So, if {$X$} is the result of a coin toss that lands on heads with probability {$p$}, then {$X \sim \mathrm{Bernoulli}(p)$}. By extension, when the problem says {$P(X_i \mid Y = j)$} is {$\mathrm{Bernoulli}(\theta_{ij})$}, this means {$P(X_i = 1 \mid Y = j) = \theta_{ij}$}, and {$P(X_i = 0 \mid Y = j) = 1 - \theta_{ij}$}. 2.2.1We want the minimum number of parameters here, so any parameter that can be inferred from the ones you are already counting shouldn’t be included in the count. 2.1.2 (b)The error rate is the probability of the classifier guessing the incorrect answer: {$E_{x,y\sim~P} \left[ \mathbf{1}\left(y \ne \arg \max_{y'} \hat{P}(y' \mid x) \right) \right].$} More concretely, for this question, if we define {$g(y, x_1, x_2) = P(Y = y)P(X_1 = x_1 | Y = y)P(X_2 = x_2 | Y = y)$}, then error is: {$\sum_{x_1, x_2, y} \mathbf{1}\left(g(y, x_1, x_2) < g(\neg y, x_1, x_2)\right) * g(y, x_1, x_2).$} The {$\mathbf{1}(\ldots)$} function here is the indicator function, which has value 1 when the expression inside is true, and value 0 otherwise. So, for example, one term in this sum would be: {$\mathbf{1}(g(T, T, T) < g(F, T, T)] * g(T, T, T))$} and since g(T, T, T) = 0.5*0.8*0.5 = 0.2 while g(F, T, T) = 0.5*0.3*0.1 = 0.015, the indicator function would have value 0 and this term in the sum would reduce to zero. You just have to work out the math for the other 7 terms in the sum, and that should give you the overall error rate. Problem 3Take a look at Schapire’s Tutorial before beginning this question. 3.1.5When the homework says show that training error gets “stuck” above 0, you should show that no matter how many iterations of boosting are run, if error for the classifier selected in iteration t is 1/2, then training error of the overall classifier will be above 0 in the end. In fact, you should find you can even show that if the t-th classifier’s error is 1/2, no further progress at all will be made in decreasing the overall training error after that iteration. |