YumEats! is an international restaurant aggregator planning to expand its operations into India. YumEats! lets users select food from a list of restaurants and have it delivered to their location. Its revenue comes from a special rewards program, or exclusive membership program, that allows customers to subscribe to a monthly or annual membership.
YumEats! needs to drive traffic to the app by attracting and retaining new customers, thereby growing the number of daily active users. The strategies suggested included running discount campaigns, making personalized recommendations based on past behaviour, offering more variety on the app to cover all tastes and requirements, and offering discounts to users who are leaving or switching away from the app.
There is a caveat: the company’s previous marketing campaign, which offered deep discounts, led to mounting losses and put a lot of strain on the company. After discussion with the stakeholders, it was decided that, of the options above, onboarding more restaurants would be the best way forward.
Now this process has come to you, the Data Scientist. You have been given a mandate to identify suitable restaurants in a given area to be onboarded to the app. Here the stakeholder would be the Director of Sales, who is responsible for bringing more restaurants on the app.
Framing the Data Science Problem
We are now clear about the business need. The next step is to convert it into an analytics requirement, which will be the first deliverable we present to our stakeholders. In this case, the first milestone is to analyse the data of all the restaurants already available on the app.
An analytic dashboard showing a visualization of the restaurants already available on the app was created. Using a naive rule-based approach, the restaurants were grouped into good and bad separately and the features were visualized in comparison with each other.
We need to classify whether a restaurant is good or bad. Currently, the goodness of the restaurant is a subjective parameter. But to define a metric, it must be objective. So, we can reframe the problem as - identifying restaurants with high ratings to be onboarded on the app. A metric must be measurable - so in terms of a data science problem, it can be - using the ratings of a restaurant to decide if they can be onboarded on the app.
On a rating scale of 5, restaurants rated 4 or 5 can be considered good and the rest bad. The next question to analyse is which error is more harmful: identifying a good restaurant as a bad one, or identifying a bad restaurant as a good one? The stakeholders are conscious of brand image and don't want to onboard a possibly bad restaurant. Hence the goal is to reduce false positives, which means the metric to optimize is precision: of the restaurants predicted to be good, the fraction that are actually good.
Our data science problem is now to predict the rating bucket of a given restaurant (good or bad) to decide whether it can be onboarded to the app, and our metric is precision. Arriving at a well-framed data science problem is the hard part; what follows is about diving deep into the analysis and applying ML to the problem.
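To make the rating bucket and the precision metric concrete, here is a minimal sketch. The function names, the sample ratings and the predictions are made up for illustration, and applying the cut-off of 4 to fractional ratings is an assumption; the case study only says that ratings of 4 and 5 count as good.

```python
def rating_bucket(rating, threshold=4.0):
    """Label a restaurant 'good' if its rating is at or above the threshold (assumed cut-off)."""
    return "good" if rating >= threshold else "bad"


def precision(actual, predicted, positive="good"):
    """Precision = true positives / (true positives + false positives)."""
    tp = sum(1 for a, p in zip(actual, predicted) if p == positive and a == positive)
    fp = sum(1 for a, p in zip(actual, predicted) if p == positive and a != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Made-up ratings (out of 5) and made-up model predictions, for illustration only.
ratings = [4.5, 3.2, 4.0, 2.8, 4.8, 3.9]
actual = [rating_bucket(r) for r in ratings]
predicted = ["good", "good", "good", "bad", "bad", "bad"]

# Three restaurants are predicted good; two of them really are good.
print(actual)
print(round(precision(actual, predicted), 3))
```

Note that the good restaurant predicted as bad (rating 4.8) does not lower precision. That matches the stakeholders' preference: missing a good restaurant is tolerated, onboarding a bad one is not.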
Under the hood ML
We already have the details of the restaurants existing on our app. To identify new restaurants that could be onboarded, we also need detailed data on the restaurants that are not yet on our app. After scraping the web, exploring trending venues in the neighbourhood (using APIs as well as web scraping libraries), aggregating the data from all the different sources, many data operations (feature engineering, encoding, etc.) and a lot of persistence, we got the data in the required format as below. Remember that most of the data science project cycle is spent in this stage.
Target variable: Rating Bucket (Good or Bad)
Features:
- URL of the restaurant page from where data was scraped
- Location of the restaurant
- Cost for 2
- Delivery available (Yes or no)
- Dishes typically liked by people
- Conversations around the restaurant (From Twitter)
With the dataset in our hand, it is time to do exploratory data analysis and other explorations. The following points were obtained by exploring the data.
- For a particular city, a locality-wise breakup of the number of restaurants was produced.
- For each locality, a breakup of the number of restaurants by cuisine (e.g. Indian, Continental, Chinese) was produced.
- The density of restaurants in each locality, i.e. whether the restaurants are very close together or spread far apart, was visualized.
- For every locality, the average cost of restaurants in that locality was calculated.
- For every locality, the percentage of restaurants that deliver food was calculated.
- Breaking the data into good restaurants and bad restaurants and exploring further, the following were identified:
  - Average cost of good and bad restaurants
  - Locality-wise breakup
  - Are all the good restaurants present in the same locality or are they spread across different localities?
  - Cuisine-wise breakup of good and bad restaurants. Is there a particular cuisine in a city or locality that is not doing so well?
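The aggregations listed above can be sketched in a few lines. This is only an illustration: the records, field names and locality names are hypothetical, and plain Python is used to keep the example dependency-free (with a real dataset, a dataframe library with group-by support would be the more usual choice).

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical records; field names and values are made up for illustration.
restaurants = [
    {"locality": "Locality A", "cuisine": "Indian", "cost_for_two": 800, "delivers": True, "bucket": "good"},
    {"locality": "Locality A", "cuisine": "Chinese", "cost_for_two": 600, "delivers": False, "bucket": "bad"},
    {"locality": "Locality B", "cuisine": "Indian", "cost_for_two": 500, "delivers": True, "bucket": "good"},
    {"locality": "Locality B", "cuisine": "Continental", "cost_for_two": 1200, "delivers": False, "bucket": "good"},
]

# Group the records by locality.
by_locality = defaultdict(list)
for r in restaurants:
    by_locality[r["locality"]].append(r)

# Per-locality count, cuisine breakup, average cost and delivery percentage.
summary = {}
for locality, rows in by_locality.items():
    summary[locality] = {
        "count": len(rows),
        "cuisines": dict(Counter(r["cuisine"] for r in rows)),
        "avg_cost_for_two": mean(r["cost_for_two"] for r in rows),
        "pct_delivering": 100 * sum(r["delivers"] for r in rows) / len(rows),
    }

# Average cost split by rating bucket (good vs bad).
cost_by_bucket = defaultdict(list)
for r in restaurants:
    cost_by_bucket[r["bucket"]].append(r["cost_for_two"])
avg_cost_by_bucket = {bucket: mean(costs) for bucket, costs in cost_by_bucket.items()}

for locality, stats in summary.items():
    print(locality, stats)
print(avg_cost_by_bucket)
```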
Now we already have some insights to present to our stakeholders.
Next, we need the solution to be interpretable so that the results can be explained to the stakeholders properly. Since our dependent variable is binary, we can choose between a logistic regression model and a decision tree model. A constraint of logistic regression is that the features should not be strongly correlated with one another (little or no multicollinearity); for now, we assume this holds for the available features. In 10-fold cross-validation of both models, logistic regression performed better. Since its interpretability is comparable to that of a decision tree, logistic regression was used to classify the restaurants.
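A comparison of this kind could be set up roughly as follows. This is a sketch under stated assumptions rather than the actual pipeline: it uses scikit-learn, a synthetic dataset from make_classification as a stand-in for the restaurant features (label 1 plays the role of "good"), and arbitrary hyperparameters such as max_depth=4.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the restaurant data; label 1 plays the role of "good".
X, y = make_classification(n_samples=500, n_features=6, n_informative=4, random_state=0)

# The two interpretable candidates; hyperparameters are illustrative choices.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# 10-fold cross-validation, scored on precision for the positive ("good") class.
mean_precision = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="precision")
    mean_precision[name] = scores.mean()
    print(f"{name}: mean precision = {scores.mean():.3f}")
```

Scoring on precision rather than accuracy keeps model selection aligned with the metric chosen earlier. Which model scores higher on this synthetic data says nothing about the real restaurant dataset.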
Results and Impact
Some of the initial insights that could be beneficial for the business:
- Locality-wise distribution of restaurants, with the density of restaurants in each locality. Restaurants in sparser localities are suitable for onboarding, since there would be less competition and more deliveries.
- Cuisine-wise distribution of restaurants. We can recommend that the stakeholder onboard restaurants of different cuisines in the same locality so that they don’t eat into each other’s business.
- Highly rated restaurants with no delivery facility are ideal candidates for onboarding to the app.
The model helps the business identify new ‘good’ restaurants to onboard onto the app, and the feature importances (for logistic regression, the sign and size of the coefficients) show which features are crucial in identifying a good restaurant. Now let’s look at the possible impact of the data science solution:
Impact to customers - A wider and more curated list of restaurants to choose from.
Impact to stakeholder - Informed decision making on which restaurants to target for onboarding with a quicker turnaround than extensive manual search. (Good restaurant recommendations for onboarding.)
Impact to the organization - More restaurants onboarded means a larger catalogue for users and hence an eventual increase in revenue.
In future, we can also look at related problems, such as identifying the right locality for setting up a new restaurant, recommending it to restaurant partners and aiding their expansion.
Finally, here is a Case Study Template that would be useful to tackle business problems as Data Science problems:
Business Problem Statement
- Describe the company or domain for the context of the case study.
- Explain the motivation for the business problem.
- Identify the stakeholders who would be interested in the solution to the problem.
- End the section with a well-defined business problem that will be the focus of the case study.
Framing the Data Science Problem
- From the business problem, explain in detail what could be the various milestones on the journey to solve the problem.
- Break down the business problem into a solvable data science problem.
- Identify the target metric and explain how it aligns with the business metric.
- State the data science problem clearly along with the corresponding data science and business metrics.
Under the hood ML
- From here on, talk about the detailed analysis you would do to understand the problem and solve it.
- Talk about the data that is required to solve the problem.
- If there is an existing dataset that goes hand in hand with the case study, mention the dataset here. Talk in detail about the features of the dataset.
- If there is no existing dataset, then talk about how the data could be obtained to solve the given problem. Come up with detailed hypothetical features of a dataset that could solve the problem.
- Talk about the low-hanging fruit in the analysis: easily visualizable insights from EDA that can be useful for the stakeholder.
- Now talk in detail about the ML algorithm that would be used to solve the problem. Why is the ML algorithm suitable here, or why other algorithms will not be suitable for the problem? What made you choose the particular ML algorithm?
Results and Impact
- Talk about the results that are obtained from the analysis.
- Put forth (some of) the insights and recommendations you are going to give to the stakeholder.
- Talk in detail about the business impact in solving the problem.
- Future scope and further problems to solve