*07.03.2019, reading time: ~3 min*

**Executive summary**

*A grocery store wishes to sort customers entering their website
into one of two categories. They are very concerned about assigning
customers to these groups correctly. Success could lead to a boost in
sales, but incorrect assignment might cause people to leave the site
without making a purchase. The BI team at Cards & Systems explain
this use case in greater detail.*

The problem that has been described is a classification problem – the company would like to sort shoppers into one of two possible categories. It is also a supervised learning problem, which means that there is a set of data, known as the *training data*, where the classes are already known.
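In code terms, supervised training data is simply a set of feature rows, each paired with its known class label. A hypothetical sketch (the feature names here are invented for illustration and are not the project's actual attributes):

```python
# Each row: features describing a shopper, plus the known class label.
training_data = [
    ({"searched_for_ham": True, "bought_tofu": False}, "meat"),
    ({"searched_for_ham": False, "bought_tofu": True}, "vegan"),
]

# Split into the inputs the algorithm learns from and the answers it learns to predict.
features = [row for row, label in training_data]
labels = [label for row, label in training_data]
```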

In this case, logistic regression has been selected to solve the problem. This is a machine learning algorithm that solves binary classification problems such as the one in this project. While other algorithms may perform slightly better for very large data sets, logistic regression is thought to be a good choice when there is little to distinguish one category from the other, as we might expect in this project.

There are many alternatives to the logistic regression algorithm. One example is a random forest algorithm. This algorithm categorises the data based on a series of decisions. These can be thought of as questions where the only possible answers are ‘yes’ or ‘no’. For example: ‘did the shopper search for ham?’ If there were a huge difference in the number of people in each category, random forest might prove to be more accurate than logistic regression, but in this case, the *class imbalance* was not large enough to justify this choice.
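The yes/no questions described above can be pictured as tiny decision trees, and a random forest as a majority vote over many of them. A toy sketch of the idea, with hypothetical features (a real forest would grow many different trees from random subsets of the data):

```python
def tree_vote(shopper):
    # One tree: a chain of yes/no questions about the shopper.
    if shopper.get("searched_for_ham"):
        return "meat"
    if shopper.get("bought_tofu"):
        return "vegan"
    return "meat"

def forest_predict(shopper, trees):
    # A random forest classifies by majority vote over many such trees.
    votes = [tree(shopper) for tree in trees]
    return max(set(votes), key=votes.count)
```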

In this project, a post-processing step was added to explain the outcomes. The algorithm LIME (Local Interpretable Model-Agnostic Explanations) was used to find the key features contributing to the classifications resulting from the trained logistic regression algorithm. In this step, the word *local* is key – it is applied to each shopper independently, so for every case a different set of features can contribute. Nonetheless, by looking at a representative sample of shoppers, some simple business rules can be devised, which may speed up decision making. For example, in most cases, buying horseradish – even though it is a vegetable – is an indicator for shoppers sorted into the meat category.
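For each shopper, LIME produces a local explanation: a list of features with weights indicating how strongly each one pushed the prediction towards a class. Averaging those weights over a representative sample is one simple way to surface candidate business rules. A sketch of that aggregation step, with invented per-shopper explanations standing in for real LIME output:

```python
from collections import defaultdict

def aggregate_explanations(explanations):
    """Average each feature's local weight across a sample of shoppers.

    `explanations` is a list of per-shopper lists of (feature, weight) pairs,
    of the kind a local explainer such as LIME would produce for one class.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for explanation in explanations:
        for feature, weight in explanation:
            totals[feature] += weight
            counts[feature] += 1
    return {f: totals[f] / counts[f] for f in totals}

# Illustrative sample: positive weights push towards the 'meat' class.
sample = [
    [("bought_horseradish", 0.4), ("bought_tofu", -0.3)],
    [("bought_horseradish", 0.2)],
]
avg_weights = aggregate_explanations(sample)
```

A feature with a consistently large average weight, like `bought_horseradish` here, is a candidate for a simple business rule.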

Once the logistic regression algorithm has been trained, it outputs information for every shopper. This output is summarised in the following example:

Probability of ‘meat’: 31%

Probability of ‘vegan’: 69%

These probability estimates can be used as part of the solution to this problem. To do this, a threshold value is chosen. If the selected label’s probability is above the threshold, the associated promotion will be displayed. However, if it falls below the threshold, the neutral banner will stay on the screen.

As explained in the previous blog, the company wishes to be quite certain, so they set a threshold of 75%. Looking at the example above, the probability of the vegan label is lower than this value, so the promotion will not be shown. The probability is simply not high enough to risk showing an inappropriate offer.

In a second example, probability of ‘meat’ is 88% and probability of ‘vegan’ is 12%. Here, the probability of ‘meat’ is well above the threshold set by the company. They therefore decide to show this shopper the meat promotion banner, as there is only a small chance that they will make the wrong decision.
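The decision rule in these two examples can be captured in a few lines. A minimal sketch, assuming the two class probabilities and the company's 75% threshold as inputs (function and label names are illustrative):

```python
def choose_banner(p_meat, p_vegan, threshold=0.75):
    # Show a promotion only if the winning label clears the threshold;
    # otherwise keep the neutral banner on screen.
    label, p = max([("meat", p_meat), ("vegan", p_vegan)], key=lambda t: t[1])
    return label if p >= threshold else "neutral"

first = choose_banner(0.31, 0.69)   # → "neutral": 69% is below the 75% threshold
second = choose_banner(0.88, 0.12)  # → "meat": 88% clears the threshold
```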

In the next blog, we will show how Tableau can be leveraged to visualise some of the data and insights in this project, and how visualising data is key to understanding it. In part 4, our BI marketing team give their take on the project and its benefits to the client.

**Dr. Fern Watson**

**Data Scientist**