Yelp Dataset Analysis

Skills: SQLite, Data Modeling, Data Profiling, Data Cleaning

This is an analysis that I performed on a real-world dataset from Yelp.

Objective

The objective of this project was to gain a deeper understanding of the provided Yelp dataset and to determine, through analysis, whether there were any correlations within the data. I profiled the data on users, businesses, and reviews and looked for patterns.

I used SQL to profile and clean the data and to write the queries needed for each analysis.

About the Data

I used a dataset from Yelp, a platform that allows users to review and rate different types of businesses.

The data includes information about Yelp users, such as their name, review count, number of fans, and how long they have been on Yelp. It also includes information about the businesses on Yelp, such as the business name, address, review count, business category, hours of operation, and star rating. The data is stored in tables that are connected by primary and foreign keys.

Understanding the Data

For the first part of this project, I profiled the data to better understand the design of the Yelp dataset and how the tables relate to each other. Using SQL, I ran a series of queries to find out what kinds of users were writing reviews, whether the location of a business affected its reviews and ratings, and what types of reviews were being written.

I started by finding the total number of records in each table and checking the data for null values. Each table had 10,000 records, and there were no null values. Then, I found the minimum, maximum, and mean for several fields, such as review star rating, business star rating, and user review count. I also counted the number of reviews for each city; the cities with the most reviews were Woodmere Village, Mount Lebanon, and Charlotte. In addition, I looked at the distribution of star ratings for certain cities to see whether ratings differed by location.
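The profiling steps above can be sketched as follows. This is a minimal illustration rather than the queries used in the project: the `business` table, its columns (`city`, `stars`, `review_count`), and the three sample rows are all assumptions, loaded into an in-memory SQLite database through Python's standard `sqlite3` module.

```python
import sqlite3

# Hypothetical miniature of a business table; names and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE business (
        id TEXT PRIMARY KEY, city TEXT, stars REAL, review_count INTEGER
    );
    INSERT INTO business VALUES
        ('b1', 'Charlotte', 4.0, 30),
        ('b2', 'Charlotte', 2.5, 12),
        ('b3', 'Toronto',   3.5, 20);
""")

# Record count, plus a null check on one column
# (city IS NULL evaluates to 0 or 1, so SUM counts the nulls).
total, null_cities = conn.execute(
    "SELECT COUNT(*), SUM(city IS NULL) FROM business"
).fetchone()

# Minimum, maximum, and mean of a numeric field.
lo, hi, mean = conn.execute(
    "SELECT MIN(stars), MAX(stars), AVG(stars) FROM business"
).fetchone()

# Total reviews per city, largest first.
by_city = conn.execute("""
    SELECT city, SUM(review_count) AS reviews
    FROM business
    GROUP BY city
    ORDER BY reviews DESC
""").fetchall()

print(total, null_cities, lo, hi, round(mean, 2), by_city)
```

The same pattern repeats for each table and each column of interest; only the table and column names change.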

I then looked at user information on Yelp by finding the 3 users with the most reviews and the 10 users with the most fans. I also grouped users by the number of reviews they had posted and averaged the number of fans in each group to determine whether posting more reviews correlated with having more fans. Overall, as the number of reviews written by a user increased, the average number of fans increased as well.
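One way to express this grouping is a `CASE` expression that assigns each user to a review-count bucket, followed by an average per bucket. The sketch below is illustrative only: the `users` table, its columns, the bucket boundaries, and the sample rows are assumptions, not values from the project.

```python
import sqlite3

# Hypothetical users table; names, rows, and bucket cutoffs are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id TEXT PRIMARY KEY, name TEXT, review_count INTEGER, fans INTEGER
    );
    INSERT INTO users VALUES
        ('u1', 'Ana',    5,  0),
        ('u2', 'Ben',   40,  1),
        ('u3', 'Cal',  150,  4),
        ('u4', 'Dee',  900, 30),
        ('u5', 'Eli', 1200, 50);
""")

# Top 3 users by number of reviews.
top_reviewers = conn.execute("""
    SELECT name, review_count
    FROM users
    ORDER BY review_count DESC
    LIMIT 3
""").fetchall()

# Bucket users by review count, then average fans within each bucket.
fans_by_bucket = conn.execute("""
    SELECT CASE
               WHEN review_count < 10   THEN '0-9'
               WHEN review_count < 100  THEN '10-99'
               WHEN review_count < 1000 THEN '100-999'
               ELSE '1000+'
           END AS review_bucket,
           COUNT(*)  AS n_users,
           AVG(fans) AS avg_fans
    FROM users
    GROUP BY review_bucket
    ORDER BY MIN(review_count)
""").fetchall()

print(top_reviewers)
print(fans_by_bucket)
```

Ordering by `MIN(review_count)` keeps the buckets in numeric order; sorting on the text labels would place '1000+' before '100-999'.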

Finally, to get an idea of the types of reviews being written on Yelp, I checked the review text to see whether more reviews contained the word “love” or the word “hate”. I counted how often each word appeared in the review text and found that significantly more reviews mentioned “love”.
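A simple way to run this kind of check in SQLite is with the `LIKE` operator. The sketch below assumes a `review` table with a `text` column and uses made-up rows; it counts reviews that contain each word at least once, which may differ from how the project counted them.

```python
import sqlite3

# Hypothetical review table; the column name and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE review (id TEXT PRIMARY KEY, text TEXT);
    INSERT INTO review VALUES
        ('r1', 'I love this place'),
        ('r2', 'Love the tacos, hate the parking'),
        ('r3', 'Service was fine'),
        ('r4', 'Would love to come back');
""")

# LIKE is case-insensitive for ASCII letters in SQLite by default,
# so 'Love' and 'love' both match. Each comparison yields 0 or 1.
love_reviews, hate_reviews = conn.execute("""
    SELECT SUM(text LIKE '%love%'),
           SUM(text LIKE '%hate%')
    FROM review
""").fetchone()

print(love_reviews, hate_reviews)
```

Note that this is substring matching: a review containing "glove" or "whatever" would also be counted, so the totals are an approximation of true word counts.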

Analysis

For the second part of this project, I analyzed different aspects of the dataset to find patterns between users and businesses.

In my first analysis, I picked a business category in a specific city and grouped those businesses by star rating. I chose to look at restaurants in Toronto and created two groups: 2-3 star restaurants and 4-5 star restaurants. I compared these two groups to see whether there were any differences in the distribution of hours, number of reviews, and locations of the restaurants.

To get a better understanding of the distribution of hours, I used case statements to group the opening and closing times. I created two groups for opening times: “morning” and “afternoon”. Then, I created three groups for closing times: “afternoon”, “evening”, and “night”. This made it easier to compare the times of day when 2-3 star restaurants and 4-5 star restaurants were usually open.
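The bucketing described above might look like the following. This is a sketch under several assumptions: an `hours` table with `opens` and `closes` stored as zero-padded 24-hour `HH:MM` text, and cutoffs of 12:00, 17:00, and 21:00. The project's actual storage format and cutoffs are not stated here.

```python
import sqlite3

# Hypothetical hours table; the column layout and cutoff times are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hours (business_id TEXT, opens TEXT, closes TEXT);
    INSERT INTO hours VALUES
        ('b1', '08:00', '22:00'),
        ('b2', '11:30', '17:00'),
        ('b3', '12:00', '23:30'),
        ('b4', '09:00', '15:00');
""")

# Zero-padded HH:MM strings sort the same way as the times they represent,
# so plain text comparison is enough to assign each time to a bucket.
buckets = conn.execute("""
    SELECT business_id,
           CASE WHEN opens < '12:00' THEN 'morning'
                ELSE 'afternoon'
           END AS opens_in,
           CASE WHEN closes < '17:00' THEN 'afternoon'
                WHEN closes < '21:00' THEN 'evening'
                ELSE 'night'
           END AS closes_in
    FROM hours
    ORDER BY business_id
""").fetchall()

for row in buckets:
    print(row)
```

A closing time after midnight (for example '02:00') would fall into the "afternoon" bucket with these rules, so real data with late-night hours would need an extra `WHEN` branch.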

Looking at each of the restaurant groups, 2-3 star restaurants in Toronto tended to be open for longer hours and opened earlier in the day than 4-5 star restaurants. They also had fewer reviews, with an average of about 29 reviews per restaurant, while 4-5 star restaurants had about 41 reviews per restaurant. In addition, the 2-3 star restaurants were typically in the same areas (Downtown Core and the Entertainment District), while 4-5 star restaurants were in a variety of different neighborhoods in Toronto.

In my second analysis, I looked at open and closed businesses to identify if there were any differences between them.

I wrote a query to retrieve the total count of businesses, the average star rating, and the average review count, grouped by whether the business was still open or had closed. There were far more open businesses than closed ones: 8,480 open businesses and 1,520 closed businesses.

I also compared the reviews and star ratings between these two groups to get an idea of how these businesses might have performed. I found that the average star rating for closed businesses was 3.5 stars, while the average star rating for open businesses was 3.7 stars. The closed businesses also had fewer reviews, with an average of 23 reviews. Meanwhile, open businesses had an average of 32 reviews.
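The open-versus-closed comparison reduces to a single `GROUP BY` query. The sketch below assumes a `business` table with an `is_open` flag (1 for open, 0 for closed) and uses a few invented rows, so its output does not reproduce the figures reported above.

```python
import sqlite3

# Hypothetical business table; the is_open flag and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE business (
        id TEXT PRIMARY KEY, stars REAL, review_count INTEGER, is_open INTEGER
    );
    INSERT INTO business VALUES
        ('b1', 4.0, 40, 1),
        ('b2', 3.5, 20, 1),
        ('b3', 3.0, 10, 0),
        ('b4', 4.5, 36, 1);
""")

# One row per status: count, average stars, average review count.
summary = conn.execute("""
    SELECT is_open,
           COUNT(*)          AS n_businesses,
           AVG(stars)        AS avg_stars,
           AVG(review_count) AS avg_reviews
    FROM business
    GROUP BY is_open
    ORDER BY is_open
""").fetchall()

for is_open, n, avg_stars, avg_reviews in summary:
    label = "open" if is_open else "closed"
    print(label, n, avg_stars, avg_reviews)
```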

In my third analysis, I grouped users by the number of years they had been on Yelp to determine whether there was a correlation between a user's number of fans and the number of years the user had been on Yelp.

I retrieved the year when each user first joined and subtracted it from the current year to calculate the total number of years the user had been on Yelp. I then grouped the users by that number of years and took the average number of fans for each group.
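A sketch of this calculation follows, assuming a `users` table with a `yelping_since` column stored as `YYYY-MM-DD` text; the table, column names, and rows are illustrative. To keep the result reproducible, the query takes a fixed reference date as a parameter (`as_of`, a hypothetical name); passing `'now'` instead would use the current date.

```python
import sqlite3

# Hypothetical users table; column names and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id TEXT PRIMARY KEY, yelping_since TEXT, fans INTEGER);
    INSERT INTO users VALUES
        ('u1', '2017-03-01',  0),
        ('u2', '2017-08-15',  1),
        ('u3', '2010-05-20', 10),
        ('u4', '2005-01-10', 24);
""")

# Years on Yelp = reference year minus join year; then average fans per group.
fans_by_tenure = conn.execute("""
    SELECT CAST(strftime('%Y', :as_of) AS INTEGER)
             - CAST(strftime('%Y', yelping_since) AS INTEGER) AS years_on_yelp,
           COUNT(*)  AS n_users,
           AVG(fans) AS avg_fans
    FROM users
    GROUP BY years_on_yelp
    ORDER BY years_on_yelp
""", {"as_of": "2022-01-01"}).fetchall()

for years, n_users, avg_fans in fans_by_tenure:
    print(years, n_users, avg_fans)
```

Subtracting calendar years ignores the month and day, so the tenure is a whole-year approximation rather than an exact elapsed time.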

I found that there appeared to be a correlation between how many years users had been on Yelp and how many fans they had. Users who had been on Yelp for the shortest amount of time, about 5 years, had an average of less than 1 fan. Meanwhile, users who had been on Yelp for the longest amount of time, about 17 years, had an average of about 23 fans. The longer a user had been on Yelp, the more fans they had on average.