Open data on restaurant health ratings served as the initial motivation for this project. We wanted to explore whether restaurant health ratings could be supplemented with user-generated reviews. In this proof of concept, we demonstrated how to acquire, clean, and merge disparate data sets, including data from open data portals and Yelp’s academic dataset. Data availability proved to be a significant challenge. The Yelp Academic Dataset provided the full corpus of written reviews; however, the set is far from complete and represents only a small sample of the businesses in Toronto and Las Vegas. We tested several models: logistic regression, gradient boosted trees, gradient boosted trees with principal component analysis (PCA), and a random forest classifier. These models were neither highly accurate (accuracy between 0.75 and 0.76) nor strongly discriminative (AUC between 0.5 and 0.6, where 0.5 corresponds to chance). Even so, the results are encouraging as a proof of concept and warrant further exploration, perhaps with a keen eye towards feature selection and dimensionality reduction.
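To make the two reported metrics concrete, here is a minimal sketch of how accuracy near 0.75 can coexist with an AUC near 0.5. It assumes a hypothetical label split of 75% negative and 25% positive and a model that gives every restaurant the same score. Neither assumption comes from the project's data; the labels, scores, and function names below are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative example
    (ties count as half a win)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (an assumption, not the project's data):
# 75 negatives and 25 positives.
y_true = [0] * 75 + [1] * 25

# An uninformative model: every example gets the same score.
scores = [0.2] * len(y_true)

# Thresholding at 0.5 therefore predicts the majority class everywhere.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"accuracy={acc:.2f}, AUC={auc:.2f}")
```

Under these assumptions the accuracy simply mirrors the share of the majority class, while the AUC shows that the scores do not rank positives above negatives. Whether this explains the project's numbers depends on the actual class balance, which would need to be checked against the data.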
Another point of discussion was that restaurants serving cuisine regarded as American or Canadian are not rated as harshly as “ethnic” restaurants. Specifically, Mexican, Filipino, and Thai restaurants rank among the features most strongly associated with a positive prediction. This is consistent with reports from mainstream media outlets such as Slate, which termed this phenomenon “gastronomic bigotry”: a tendency to attribute a tummy ache to an unfamiliar cuisine (Simmons, 2014).
The paper for this project can be found here.
The repository for this project can be found here.