8/23/2018 11:39:52 AM
The flood of online information now being produced has created a rich environment for data mining and discovery, but that near-endless flow of text is difficult to make sense of. As an Illinois Computer Science PhD student, Xiang Ren devoted his work to finding better ways to sort and categorize that jumble of text, and his dissertation was recognized this month with the 2018 SIGKDD Doctoral Dissertation Award. The award from the Association for Computing Machinery recognizes outstanding work in the areas of data science, machine learning, and data mining.
Ren (MS ’16, PhD ’18), who is now an assistant professor of computer science at USC, points out that many real-world applications rely on being able to quickly understand and analyze text data – from news, medical texts, and any number of other sources – and at volumes he says are “almost impossible for human to digest and curate.”
But different types of data about different subjects are often expressed in ways specific to that subject, or in language unique to the individual author – in other words, they’re messy.
“The majority of data we are collecting today are primarily natural-language text data, which are unstructured sequences of words that are hard for machines to process,” he said.
Ren’s challenge was to find a way to build systems that can easily and quickly switch between domains and work for different downstream applications, a problem his thesis proposed to address through what he calls an “effort-light” methodology.
“The key is leveraging existing knowledge-base facts, which are already curated by human crowds, to automatically generate labeled data at a large scale, and train noise-robust machine-learning models with such automatically labeled data,” he said.
Being able to do that depends on two key factors:
- In a large data set, information about some of the entities being referenced already exists in external knowledge bases, which Ren proposed to use to create training data.
- Recurring words and phrases within a large data set could be used to characterize the semantics of the data set. For example, if words like government, speech, and party recur in the training data, an algorithm should be able to guess that the data includes references to a politician.
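The two factors above can be sketched in code. The snippet below is a minimal, hypothetical illustration of the idea (not Ren's actual system): a small hand-curated knowledge base auto-labels sentences that mention its entities, and recurring context words are tallied per type, so that words like "government" and "party" surface as signals for the politician type. The KB and corpus here are invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical knowledge base: entity -> type, already curated by humans.
KB = {
    "Barack Obama": "politician",
    "Angela Merkel": "politician",
    "Aspirin": "drug",
}

# Tiny invented corpus of unlabeled sentences.
corpus = [
    "Barack Obama gave a speech to his party about government policy.",
    "Angela Merkel led her party in government for years.",
    "Aspirin is a common drug used to treat pain.",
]

def distant_label(corpus, kb):
    """Auto-generate labeled examples (sentence, entity, type)
    for every KB entity mentioned in the corpus -- no manual labeling."""
    labeled = []
    for sent in corpus:
        for entity, etype in kb.items():
            if entity in sent:
                labeled.append((sent, entity, etype))
    return labeled

def type_signatures(labeled):
    """Count recurring context words per type; frequent words
    characterize the semantics of that type."""
    sigs = defaultdict(Counter)
    for sent, entity, etype in labeled:
        context = sent.replace(entity, "").lower().split()
        sigs[etype].update(w.strip(".,") for w in context)
    return sigs

labeled = distant_label(corpus, KB)
sigs = type_signatures(labeled)
# "party" and "government" each appear twice under the "politician" type,
# so a model trained on these auto-labels can learn them as features.
```

In a real system the auto-labels are noisy (an entity string can match the wrong mention), which is why Ren pairs this labeling step with noise-robust model training.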
Ren says the methodology he proposed in his dissertation has led to his current work with his students at USC and with collaborators at Illinois, Stanford, and elsewhere on a much broader set of techniques for information extraction and text mining. Systems they have developed have been adopted by a number of companies and institutions; he and his collaborators, for instance, built a system that turns data from millions of medical papers into a searchable knowledge graph now used by the Stanford and UCLA medical schools.
Ren is only the most recent student from Abel Bliss Professor Jiawei Han’s Data Mining Group to receive the ACM SIGKDD Doctoral Dissertation Award. Others include Xiaoxin Yin in 2009, Yizhou Sun in 2013, and Chi Wang in 2015.
“Xiang did brilliant work during his PhD study and his work has been well cited and well recognized,” Han said.
Ren called studying under Han “an amazing experience” that prepared him for an academic career by including him in grant-proposal writing, paper reviewing, principal investigator meetings, and guest lectures.
“Even during weekend nights, (Han) spent hours remotely to give me advice on my research projects and plans, and to teach me how to do research, write scientific papers, and manage all sorts of things in my PhD career,” Ren said.