Institute for Computer Research seminar
Petabyte-scale data warehousing at Facebook
Dr. Ning Zhang
Wednesday, June 9, 2010
Davis Centre, DC 1302, University of Waterloo
Data warehousing at Facebook faces enormous challenges due to the exponential growth the company has enjoyed over the past several years. The data generated every day exceeds tens of terabytes and is approaching a hundred terabytes. At the same time, the number of users who query the data is increasing at an even faster rate, thanks to the high-level query language and easy-to-use tools developed on top of this huge amount of data. Together, these make it one of the biggest and busiest data warehouse systems in the world. To make matters more challenging, more and more use cases require real-time data analytics, where online data acquisition and online query processing are key. In this talk, I will introduce the data infrastructure that enables data analytics on top of Facebook's data warehouse system. This includes the ETL tools, the Hadoop MapReduce clusters, the Hive query language, and recent research and development efforts. I will also highlight some of the research challenges that we face.
Ning Zhang is a software engineer at Facebook (profile: http://www.facebook.com/nzhang). He is currently working on Hive in the Data Infrastructure team. Before joining Facebook, he worked on storage and query processing for XML databases at Oracle. He received his M.Math and Ph.D. from the University of Waterloo, in the areas of spatial databases and XML databases respectively, and his B.S. from Nanjing University, China.
200 University Avenue West
Waterloo, ON N2L 3G1