Eigenwords


Eigenword (sometimes eigenfeature): a real-valued vector associated with a word that captures its meaning, in the sense that distributionally similar words have similar eigenwords. Eigenwords are computed as the singular vectors of the matrix of co-occurrences between words and their contexts (a small worked sketch follows the list below).

    • context-oblivious: the vector does not depend on the context, only on the word
    • context-specific: the vector does depend on the context
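
As an illustration, the sketch below computes context-oblivious eigenwords by taking the SVD of a word-context co-occurrence matrix built from a toy corpus. The corpus, the one-word window, and the embedding dimension k are illustrative assumptions for this example, not fixed by the method.

    from collections import Counter
    import numpy as np

    corpus = "the cat sat on the mat the dog sat on the rug".split()
    vocab = sorted(set(corpus))
    index = {w: i for i, w in enumerate(vocab)}

    # Count co-occurrences within a symmetric one-word window;
    # here each neighboring word serves as a "context".
    counts = Counter()
    for i, w in enumerate(corpus):
        for j in (i - 1, i + 1):
            if 0 <= j < len(corpus):
                counts[(index[w], index[corpus[j]])] += 1

    C = np.zeros((len(vocab), len(vocab)))
    for (r, c), n in counts.items():
        C[r, c] = n

    # Left singular vectors, scaled by the singular values, give
    # k-dimensional eigenwords: distributionally similar words
    # (e.g., "mat" and "rug") end up with similar vectors.
    k = 2
    U, s, _ = np.linalg.svd(C, full_matrices=False)
    eigenwords = U[:, :k] * s[:k]

    for w in vocab:
        print(w, np.round(eigenwords[index[w]], 3))
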

Project description

We use spectral methods (SVD) to build statistical language models. The resulting vector models of language are then used to predict a variety of properties of words, including their entity type (e.g., person, place, organization, ...), their part of speech, and their "meaning" (or at least their word sense). Canonical Correlation Analysis (CCA), a generalization of Principal Component Analysis (PCA), gives context-oblivious vector representations of words. More sophisticated spectral methods are used to estimate Hidden Markov Models (HMMs) and generative parsing models such as dependency parsers. These methods give context-dependent state estimates, which again improve performance on many NLP tasks.
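
To make the CCA idea concrete, the following sketch takes the SVD of the whitened cross-covariance between one-hot word indicators and context features. The synthetic data, the ridge term in the inverse square root, and the dimension choices are assumptions made only for this example; they are not the project's actual pipeline.

    import numpy as np

    rng = np.random.default_rng(0)
    n, vocab_size, ctx_dim, k = 1000, 20, 30, 5

    words = rng.integers(0, vocab_size, size=n)
    W = np.eye(vocab_size)[words]          # one-hot word indicators, n x vocab_size
    X = rng.standard_normal((n, ctx_dim))  # stand-in context features

    def inv_sqrt(M, ridge=1e-3):
        # Symmetric inverse square root, with a small ridge for stability.
        vals, vecs = np.linalg.eigh(M + ridge * np.eye(M.shape[0]))
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Cww = W.T @ W / n
    Cxx = X.T @ X / n
    Cwx = W.T @ X / n

    # SVD of the whitened cross-covariance gives the canonical directions;
    # projecting the word side yields one k-dimensional vector per word type.
    U, s, Vt = np.linalg.svd(inv_sqrt(Cww) @ Cwx @ inv_sqrt(Cxx))
    word_vectors = inv_sqrt(Cww) @ U[:, :k]
    print(word_vectors.shape)  # (20, 5)

In practice the context features would be built from neighboring words rather than random numbers; the random data here only keeps the example self-contained while showing the whiten-then-SVD structure of one-step CCA.
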


contact: Prof. Lyle Ungar (ungar@cis.upenn.edu)

co-advisor: Prof. Dean Foster (Statistics)

Background

To get a flavor for our approach, and for more detail on the methods, see Background and Method, especially the git repositories listed under "software".

For more information

See also the related work of Shay Cohen and Michael Collins, including our joint tutorial on spectral NLP.