NSF Grant Funds DataHub Project to Make Data Science More Collaborative
CS Assistant Professor Aditya Parameswaran is one of the leaders of a multi-university research project that recently received a $1 million grant from the National Science Foundation. Named DataHub, this project is intended to, in Parameswaran’s words, “make collaborative data science simple.”
DataHub is a collaborative platform for cleaning, storing, understanding, sharing, and publishing data sets. It is being developed by Parameswaran and his colleagues Amol Deshpande of the University of Maryland and Sam Madden of MIT, who are co-PIs on the project. The platform will be publicly accessible and will host public or private user data sets.
People who work with large data sets share a common problem. As researchers make changes to a data set, they often want to keep the new version while retaining the old one as well. The simplest solution is to make a copy to work on. However, when many collaborators are working on the same data set, these copies proliferate and quickly inflate the amount of storage the research requires.
Parameswaran and his colleagues felt there had to be an easier way. GitHub, which had been around for a while, had transformed how programmers manage source code. Why not, they thought, build a similar repository for data sets? “GitHub sort of revolutionized source code version control,” Parameswaran said, “which made collaborative coding, collaborative development, a lot easier. But there was no analog for collaborative data set management. That is what we are building.”
DataHub will include several key features: a flexible, source code control-like versioning system for data that efficiently branches, merges, and diffs data sets; new tools designed to automate the data cleaning process; the ability to search for "related" tables and to integrate them into the analysis process; and the ability to selectively share and collaborate on data sets across users and teams. Overall, DataHub is meant to significantly reduce the effort data scientists spend preparing, analyzing, sharing, and managing data.
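To make the versioning idea concrete, here is a minimal sketch in Python of what branching, diffing, and merging a data set could look like. It is not DataHub's actual design or API: the class `ToyDatasetRepo` and its methods are hypothetical, each version is assumed to be a plain set of rows, and the merge is a simple three-way merge on those sets.

```python
class ToyDatasetRepo:
    """Hypothetical toy model of Git-style versioning for a table of rows."""

    def __init__(self):
        self.versions = {}   # version id -> frozenset of rows
        self.branches = {}   # branch name -> latest version id

    def commit(self, branch, rows):
        """Record a new version of the data set on the given branch."""
        vid = len(self.versions) + 1
        self.versions[vid] = frozenset(rows)
        self.branches[branch] = vid
        return vid

    def branch(self, new_branch, from_branch):
        """Start a new branch at the current version of an existing one."""
        self.branches[new_branch] = self.branches[from_branch]

    def checkout(self, branch):
        """Return the rows of the latest version on a branch."""
        return self.versions[self.branches[branch]]

    def diff(self, old_branch, new_branch):
        """Rows added and removed going from old_branch to new_branch."""
        old, new = self.checkout(old_branch), self.checkout(new_branch)
        return {"added": new - old, "removed": old - new}

    def merge(self, target, source, base_vid):
        """Three-way merge: apply the source branch's changes since base_vid
        to the target branch and commit the result on the target."""
        base = self.versions[base_vid]
        t, s = self.checkout(target), self.checkout(source)
        merged = (t | (s - base)) - (base - s)
        return self.commit(target, merged)


repo = ToyDatasetRepo()
v1 = repo.commit("main", [("alice", 30), ("bob", 25)])
repo.branch("cleaning", "main")
repo.commit("cleaning", [("alice", 30), ("bob", 26)])                # one collaborator fixes a value
repo.commit("main", [("alice", 30), ("bob", 25), ("carol", 41)])     # another adds a row
repo.merge("main", "cleaning", base_vid=v1)
print(sorted(repo.checkout("main")))
```

In this sketch both collaborators' changes survive the merge without anyone keeping a hand-made copy of the table. A real system would also need to detect conflicting edits to the same record, which this toy version ignores.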
“It will handle the entire pipeline,” Parameswaran said. “[DataHub will] allow people to work with data sets, do versioning, and do analysis. And we will also have other tools like data visualization, data integration all built in so that users don’t need to go to other platforms for them. In a sense it is a one stop shop for data science.”
DataHub will serve as a clearinghouse for data science, freeing end users from tedious data set bookkeeping tasks and providing support for researchers. Because all three co-PIs were working on projects at MIT when they met and began the DataHub project, DataHub will be housed there. The code will be open source.
“This NSF grant is allowing us to develop the basic versioning technologies, allowing us to figure out how to store large numbers of versions of data sets efficiently,” Parameswaran said. “The second thing we are doing is developing a query language or some sort of interface that will allow you to query over this large collection of data sets and retrieve ones of interest.”
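The article does not say how DataHub stores versions, but one common way to keep many versions of a data set without a full copy of each is delta encoding: store only the rows that changed relative to a parent version. The sketch below illustrates that general technique in Python. The class `DeltaStore` and its methods are hypothetical and are not drawn from the project.

```python
class DeltaStore:
    """Hypothetical store that keeps each version as a delta against its parent."""

    def __init__(self):
        self._parent = {}  # version id -> parent version id (None for a root)
        self._delta = {}   # version id -> (rows added, rows removed) vs. parent

    def add_version(self, rows, parent=None):
        """Store a version by recording only its difference from the parent."""
        rows = frozenset(rows)
        base = self.materialize(parent) if parent is not None else frozenset()
        vid = len(self._delta) + 1
        self._parent[vid] = parent
        self._delta[vid] = (rows - base, base - rows)
        return vid

    def materialize(self, vid):
        """Rebuild the full row set by replaying deltas from the root version."""
        chain = []
        while vid is not None:
            chain.append(vid)
            vid = self._parent[vid]
        rows = set()
        for v in reversed(chain):
            added, removed = self._delta[v]
            rows -= removed
            rows |= added
        return frozenset(rows)

    def stored_rows(self):
        """Number of row entries physically kept across all deltas."""
        return sum(len(a) + len(r) for a, r in self._delta.values())


store = DeltaStore()
original = [(i, "raw") for i in range(1000)]
v1 = store.add_version(original)
edited = [(i, "raw") for i in range(1000) if i != 5] + [(5, "cleaned")]
v2 = store.add_version(edited, parent=v1)
print(store.stored_rows())  # row entries kept for both versions combined
```

Here two full copies would hold 2,000 rows, while the delta store keeps 1,002 entries. The trade-off is that reading a version means replaying its chain of deltas, so long chains make reads slower, which is one reason storing many versions efficiently is a research question rather than a solved detail.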
One company, InstaBase, has already been spun out of the DataHub research. The company was launched by an MIT PhD student who had been working on the project. Parameswaran is on the company’s advisory board.