Researchers at the University of Waterloo and the University of Maryland have collaborated with the Comcast Applied AI Research Lab to improve the voice query understanding capabilities of the Comcast Xfinity X1 entertainment platform.
Today, we have become accustomed to talking to intelligent agents that do our bidding — from Siri on a mobile phone to Alexa at home. Why wouldn’t we be able to do the same with TVs? Comcast’s Xfinity X1 does exactly that — the platform comes with a “voice remote” that accepts spoken queries. Your wish is its command — tell your TV to change channels, ask it about free kids’ movies, and even about the weather forecast.
Although the media and technology company had delivered more than 20 million remotes to customers by the end of 2017 and fielded billions of voice commands, the Comcast team still saw room for improvement in the results returned by one of the company’s most popular products. The device returns remarkably accurate results, thanks to an AI-powered platform and Comcast’s rich trove of entertainment metadata (information like a show’s title, actors and genre), but it would still sometimes return odd responses. This was partly because the AI used in the platform was based on pattern matching, which does not always interpret user intent correctly.
Enter Jinfeng Rao, a newly minted PhD at the University of Maryland, who, together with his advisor Professor Jimmy Lin at the University of Waterloo and mentor Ferhan Ture, a researcher at the Comcast Applied AI Research Lab, set about tackling the complex problem of understanding voice queries. Their idea was to take advantage of the latest AI technology, a technique known as hierarchical recurrent neural networks, to better model context and improve the system’s accuracy. This research will be presented at the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining from August 19 to 23 in London, United Kingdom.
How does their technique work? Explains Dr. Rao, “Say the viewer asks for ‘Chicago Fire,’ which refers to both a drama series and a soccer team — how does the system determine what you want to watch? What’s special about this approach is that we take advantage of context — such as previously watched shows and favourite channels — to personalize results, significantly increasing accuracy.”
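To make the idea of context-based disambiguation concrete, here is a minimal Python sketch. It is not the researchers’ hierarchical recurrent neural network; it swaps in a simple hand-written score that counts how many context signals agree with each interpretation. The `Candidate` structure, the `score` and `resolve` functions, and the channel labels are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One possible interpretation of a spoken query (hypothetical structure)."""
    title: str
    kind: str     # e.g. "series" or "sports_team"
    genre: str    # e.g. "drama" or "soccer"
    channel: str  # illustrative label, not a real channel lineup


def score(candidate, watched_genres, favourite_channels):
    """Count how many context signals agree with this candidate."""
    points = 0
    if candidate.genre in watched_genres:
        points += 1
    if candidate.channel in favourite_channels:
        points += 1
    return points


def resolve(candidates, watched_genres, favourite_channels):
    """Return the candidate best supported by the viewer's context.

    max() keeps the first candidate on ties, so with no usable
    context the original ordering of the list decides.
    """
    return max(
        candidates,
        key=lambda c: score(c, watched_genres, favourite_channels),
    )


# Two readings of the same spoken query, "Chicago Fire".
candidates = [
    Candidate("Chicago Fire", "series", "drama", "drama_channel"),
    Candidate("Chicago Fire", "sports_team", "soccer", "sports_channel"),
]

# A viewer whose history is mostly soccer on a sports channel.
choice = resolve(candidates, {"soccer"}, {"sports_channel"})
print(choice.kind)
```

In this toy version the weights are fixed by hand; a learned model of the kind described above would instead infer from data how much each piece of context should count.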
In January 2018, the researchers’ new neural network model was deployed in production to answer queries from real live users. “Our previous system was totally confused by a small but consistent percentage of queries,” says Dr. Ture. “The new model handles most of these very difficult queries appropriately, greatly enhancing the user experience.” Today, the model answers millions of user queries per day.
Not content with this success, the researchers have begun to develop an even richer model, which is also outlined in their paper. The intuition is that by analyzing queries from multiple perspectives, the system can better understand what the viewer is saying. This model is currently being readied for deployment.
Professor Jimmy Lin, the David R. Cheriton Chair at the David R. Cheriton School of Computer Science at the University of Waterloo, sums up this research: “This work is a great example of a successful collaboration between academia and industry that yielded significant real-world impact.
“My research group aims to build intelligent agents that can interact with humans in natural ways, and this project provides a great example of how we can deploy AI technologies to improve the user experience.”
For more information about this research, please see Jinfeng Rao, Ferhan Ture, Jimmy Lin, “Multi-task learning with neural networks for voice query understanding on an entertainment platform,” Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 19–23, 2018.