Best Paper Award for Work in Natural Language Processing

1/23/2012 12:22:00 PM

University of Illinois computer science graduate student Gourab Kundu received the Best Student Paper Award at the 2011 Conference on Computational Natural Language Learning (CoNLL 2011) for his work on open domain natural language processing. His paper, “Adapting Text Instead of the Model: An Open Domain Approach,” introduces a new training approach that allows Natural Language Processing (NLP) researchers to adapt existing systems to new domains without retraining them on labeled data from those domains.

Researchers in NLP develop algorithms and automated systems that analyze and interpret natural language text. For example, given the sentence “Gourab wrote two excellent papers and submitted them to 2012 conferences,” a Semantic Role Labeling program would analyze the sentence at the level of “who does what to whom, when, where,” and determine, for example, that Gourab is the Writer and also the Submitter, that two papers were submitted, that the submission happened in 2012, and so on. While this task is trivial for humans, automating it is very difficult due to the inherent ambiguity and variability of natural language, so modern NLP approaches rely on machine learning and inference to accomplish it. Consequently, these systems must be “trained” — fed a large amount of sample text along with the corresponding analysis — in order to learn how to analyze new, previously unobserved text. However, while natural language systems trained on labeled data perform well on the domain in which they were trained, they adapt poorly to other domains due to changes in vocabulary, sentence structure, etc. For example, a system that was trained to parse natural language sentences from the Wall Street Journal may perform very poorly when parsing language from fiction novels.
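To make the idea concrete, a Semantic Role Labeling analysis of the example sentence can be written down as predicate–argument structures. This is only an illustrative sketch with hand-written labels (loosely following PropBank-style role names), not output from any actual system:

```python
# Illustrative sketch: an SRL analysis as predicate-argument structures.
# All labels below are hand-written for the example sentence.

sentence = "Gourab wrote two excellent papers and submitted them to 2012 conferences"

# Each predicate maps to its arguments: "who does what to whom, when".
srl_analysis = {
    "wrote": {
        "A0 (Writer)": "Gourab",
        "A1 (Thing written)": "two excellent papers",
    },
    "submitted": {
        "A0 (Submitter)": "Gourab",
        "A1 (Thing submitted)": "them",  # resolves to "two excellent papers"
        "AM-TMP (When)": "2012",
    },
}

def roles_of(predicate):
    """Return the argument structure for a predicate, if analyzed."""
    return srl_analysis.get(predicate, {})
```

Note that the same entity, Gourab, fills a role for both predicates, which is exactly the “who does what to whom” view the article describes.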

Current approaches to this problem require researchers to undertake the time-consuming task of training a new model for the new domain using data from that domain. This is especially costly because the data from the new domain must first be annotated with the correct analysis.

Kundu proposes looking at the problem from another angle. Rather than adapting the old model to new text, why not adapt the new text to the old model? The result is ADUT (ADaptation Using label-preserving Transformation), an approach that avoids the need for retraining and requires no knowledge of the new domain. ADUT applies multiple label-preserving transformations to the text it is analyzing in order to make the target text “more similar” (in terms of vocabulary and structure) to the text on which the model was initially trained. The ADUT system then applies the existing model to the transformed sentences and combines the predictions using a constrained optimization inference approach to produce the desired prediction on the target text. In this manner, ADUT is able to work on the fly to analyze new text.
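The pipeline described above — transform, predict with the unchanged model, then combine — can be sketched as follows. The transformations and the toy “model” here are invented for illustration, and a simple majority vote stands in for the paper's constrained optimization inference step:

```python
# Minimal sketch of the ADUT idea. The transformations and model are
# toy stand-ins; the real system uses SRL models and a constrained
# optimization inference step rather than the majority vote shown here.

def replace_rare_words(sentence, vocab, placeholder="thing"):
    # Label-preserving transformation: swap out-of-vocabulary words
    # for a word the model has seen, keeping the sentence structure.
    return " ".join(w if w in vocab else placeholder for w in sentence.split())

def simplify_sentence(sentence):
    # Another toy transformation: drop a leading clause before a comma.
    return sentence.split(", ", 1)[-1]

def adut_predict(model, sentence, vocab):
    # 1. Generate transformed variants of the target sentence.
    variants = [
        sentence,
        replace_rare_words(sentence, vocab),
        simplify_sentence(sentence),
    ]
    # 2. Apply the *unchanged* trained model to every variant.
    predictions = [model(v) for v in variants]
    # 3. Combine the per-variant predictions into one final answer.
    return max(set(predictions), key=predictions.count)
```

The key design point the sketch preserves is that the trained model is never modified: only its input changes, so no retraining or labeled data from the new domain is needed.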

In this way, Kundu’s ADUT approach can use a single trained model to analyze text from multiple domains, even when the notion of a domain is not well defined (e.g., streaming data from the web). The approach was shown to yield a 13% reduction in error on the task of Semantic Role Labeling when adapting from newswire text to fiction.

Kundu conducts his research in computer science professor Dan Roth’s research group; Roth was a co-author of the paper. One of Roth’s former students, Scott Wen-tau Yih, also won the Best Paper Award (non-student) at the same CoNLL conference for his paper, “Learning Discriminative Projections for Text Similarity Measures.”