Web scraping tourism reviews

Friday, April 30, 2010
by Peter Johnson

I’m sure that many tourism business owners have spent a lot of time on review sites like Trip Advisor and Yelp, reading up on what their customers are saying. This is good business practice, and tourism operators should always keep an open ear for praise and criticism.

It is easy to look at reviews for one particular business, but what about at the regional or provincial level? What about comparing reviews across a destination? Which areas are reviewed the most and which are reviewed the least? Do users of Trip Advisor and Yelp leave a path of reviews as they travel or do only the best/worst experiences get mentioned? What is the percentage of positive vs. negative reviews? What is the overall quality of these reviews? These are just some of the questions that I’ve been thinking about recently.

Last fall I started a small research project that ‘scraped’ reviews from Trip Advisor for Nova Scotia. Web scraping is a somewhat controversial technique that uses software “agents” to harvest information from websites. In a basic sense, it is an automated version of copying specific information from a website and pasting it into a spreadsheet. The tool I used to accomplish this is Mozenda. I ended up with nearly 6,000 reviews, each including the user, date, location, star rating (out of 5), and comments. A very rich data source! I did some basic analysis by dividing the reviews into three categories (accommodation, restaurants, and attractions) and then geolocating them at one of 77 different named destinations. I presented this preliminary material at the 2009 Travel and Tourism Research Association – Canada Chapter annual meeting in Guelph, Ontario. You can take a look at the Slideshare here:

Mining the Web: How user-generated content can become a data source for tourism research

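For readers curious what "copying specific information from a website and pasting it into a spreadsheet" might look like in code, here is a minimal Python sketch. It is not how Mozenda works and not the workflow used for this project. The HTML markup, the field names, the sample reviews, and the 4-star cut-off for a "positive" review are all assumptions made up for illustration. The sketch parses a small in-memory page, writes the rows out as CSV, and then tallies reviews by category and destination.

```python
import csv
import io
from collections import Counter
from html.parser import HTMLParser

# Hypothetical markup and made-up reviews: real review pages look different.
SAMPLE_PAGE = """
<div class="review" data-category="restaurant" data-destination="Halifax" data-stars="5">
  <span class="user">traveller_a</span>
  <p class="comment">Great chowder.</p>
</div>
<div class="review" data-category="accommodation" data-destination="Halifax" data-stars="2">
  <span class="user">traveller_b</span>
  <p class="comment">Noisy room.</p>
</div>
<div class="review" data-category="attraction" data-destination="Lunenburg" data-stars="4">
  <span class="user">traveller_c</span>
  <p class="comment">Lovely waterfront.</p>
</div>
"""

FIELDS = ["user", "category", "destination", "stars", "comment"]


class ReviewParser(HTMLParser):
    """Collects one dict per <div class="review"> block."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._current = None  # review being built, if any
        self._field = None    # text field currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css_class = attrs.get("class", "")
        if tag == "div" and css_class == "review":
            self._current = {
                "user": "",
                "category": attrs.get("data-category", ""),
                "destination": attrs.get("data-destination", ""),
                "stars": int(attrs.get("data-stars", "0")),
                "comment": "",
            }
        elif self._current is not None and css_class in ("user", "comment"):
            self._field = css_class

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("span", "p"):
            self._field = None
        elif tag == "div" and self._current is not None:
            self.reviews.append(self._current)
            self._current = None


def scrape(page):
    """Return a list of review dicts found in the HTML text."""
    parser = ReviewParser()
    parser.feed(page)
    parser.close()
    return parser.reviews


def to_csv(reviews):
    """Render the reviews as spreadsheet-ready CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(reviews)
    return buffer.getvalue()


def count_by(reviews, key):
    """Count reviews per value of `key`, e.g. per destination."""
    return Counter(review[key] for review in reviews)


def positive_share(reviews, threshold=4):
    """Fraction of reviews with at least `threshold` stars (assumed cut-off)."""
    if not reviews:
        return 0.0
    return sum(r["stars"] >= threshold for r in reviews) / len(reviews)


reviews = scrape(SAMPLE_PAGE)
print(to_csv(reviews))
print(count_by(reviews, "destination"))
print(count_by(reviews, "category"))
print(f"Positive share: {positive_share(reviews):.0%}")
```

A real scraper would also need to fetch pages politely, respect each site's terms of use, and cope with markup that changes over time, which is part of why the technique is controversial.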