Parameswaran Investigates Crowdsourcing in the Computer Science Industry
CS Assistant Professor Aditya Parameswaran has literally written the book on crowdsourcing. He is co-author of the recently published Crowdsourced Data Management: Industry and Academic Perspectives, issued by Now Publishers as volume 6 of its Foundations and Trends in Databases series.
“There are a lot of problems that we face in data science that are notoriously hard. Anything to do with image understanding, video understanding, text understanding, text extraction—understanding and making sense of any of these data types is super challenging,” Parameswaran said.
Solving these problems requires training systems to recognize what is going on in text, databases, images, or video, and that in turn requires a large trove of training data. The best way to create training data remains having humans answer small questions about the data. Once a machine learning algorithm has been given this training data, it can go on to work on new data on its own.
The most efficient way to create the volume of training data that industry needs is crowdsourcing: getting a large pool of people to provide the needed information or input. Unfortunately, computer science researchers have studied crowdsourcing very little.
“Companies don’t talk about it,” Parameswaran explained, “either because they consider that their ‘secret sauce,’ or [because] just saying, ‘Hey, at the backend, we have a bunch of humans doing the annotation’ doesn’t make for a sexy technology story.”
And yet, crowdsourcing needs examination because machine learning (which relies on crowdsourced training data) has become a dominant part of how computing and technology affect the lives of everyday people. “In some sense, the pressing problems of the day cannot be solved without crowdsourcing,” Parameswaran said.
The book addresses this gap between what the technology industry is doing and what researchers know about the problems the industry faces. It also argues that crowdsourcing will become, if it has not already, essential for any company working with large amounts of data.
Parameswaran’s co-author on the book is Adam Marcus, an MIT PhD who, after graduating, led the data team at a crowdsourcing startup called Locu that was acquired by GoDaddy, and has since co-founded Unlimited Labs, a company dedicated to improving the future of creative and analytical expert work. Like Parameswaran, Marcus has published extensively on crowdsourcing for large-scale data processing.
“We conducted thirteen interviews of companies that use crowdsourcing, and four interviews of companies that manage the marketplaces,” Parameswaran said. “We tried to figure out what sorts of problems do the large-scale industry users of crowdsourcing face, and what sorts of problems do the marketplace owners face.” The book details the insights from these interviews, and provides statistics for how much companies spend on crowdsourcing each year, ranging from tens of thousands to tens of millions of dollars, depending on the scale of the operation.
The book includes chapters on:
- crowd-powered algorithms;
- crowd-powered systems, which use humans to power data processing systems;
- overviews of large-scale crowdsourcing in industry: statistics, typical use-cases, typical crowdsourced workflows, incentives, quality management (four chapters detailing surveys); and
- an overview of crowdsourcing marketplaces: design decisions, marketplace management, quality control, and challenges (one chapter detailing surveys).
Parameswaran has been working in the area of crowdsourcing for many years. His dissertation on the subject received the 2014 ACM SIGMOD Jim Gray Doctoral Dissertation Award and Stanford University’s 2013 Arthur Samuel Award for the best dissertation in Computer Science. He was also runner-up for the 2014 SIGKDD Dissertation Award.