5 Steps in Using Data for Customer Acquisition
With the aim of bringing new customers to their brands, companies have been refining their customer acquisition processes and strategies. It's not easy, though! The first step of any basic customer acquisition plan is to identify quality potential customers and, guess what, data science can help us with that! This project is all about strategies for understanding customers' characteristics and, ultimately, acquiring them.
This project applies machine learning algorithms to demographic data in order to extract information about potential customers. It is the capstone project of the Udacity Machine Learning Engineer Nanodegree program, provided by Arvato Financial Solutions, a Bertelsmann subsidiary. I chose this project mainly because the data provided is real and almost no cleaning has been done to it. In addition, I really enjoy understanding business and customer needs in order to provide the best experience for both the company and the people who use its products or services.
The first part of the project consists of building a customer segmentation report based on Arvato Financial Solutions' existing customers and the general population of Germany. To do that, I used unsupervised learning techniques to identify the parts of the population that best describe the company's core customer base.
The second part consists of building a marketing campaign response predictor in order to identify the individuals most likely to convert into customers. To do that, I tried out a few supervised learning algorithms and kept the one that gave me the best ROC AUC score.
As mentioned above, the data for this project has been provided by Arvato Financial Solutions, a subsidiary of Bertelsmann in Germany. There are six data files associated with this project as follows:
Part I ) Customer Segmentation
● Udacity_AZDIAS_052018.csv : Demographics data for the general population of Germany
● Udacity_CUSTOMERS_052018.csv : Demographics data for customers of a mail-order company
● DIAS Information Levels — Attributes 2017.xlsx : Top-level list of attributes and descriptions, organized by informational category
● DIAS Attributes — Values 2017.xlsx : Data values for each variable in alphabetical order
Part II ) Marketing Campaign Response Prediction
● Udacity_MAILOUT_052018_TRAIN.csv : Demographics data for individuals who were targets of a marketing campaign
● Udacity_MAILOUT_052018_TEST.csv : Demographics data for individuals who were targets of a marketing campaign
Note: The DIAS Information Levels and DIAS Attributes xlsx files have been consolidated into the data_info.csv file, which contains information about the features and their respective possible values.
In order to feed the machine learning algorithms, a number of cleaning and preprocessing steps have to be executed. In particular, the data provided needed to pass through 8 steps of preprocessing and feature engineering. Do you know the 80–20 rule? Well, many people say that about 80% of the time of a data science project is spent on data preparation (preprocessing) and 20% on data analysis and machine learning. That was exactly what happened on this project. I built a single function that performs the whole process.
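The notebook's actual function isn't reproduced here, but a minimal sketch of what it does, assuming the thresholds described in the steps that follow (a 40% column NaN ratio and a 250-NaN row limit), could look like this:

```python
import numpy as np
import pandas as pd

def clean_data(df, col_nan_ratio=0.40, row_nan_limit=250):
    """Hypothetical sketch of the cleaning pipeline described below."""
    # Drop columns whose share of missing values exceeds the threshold
    df = df.loc[:, df.isna().mean() <= col_nan_ratio]
    # Drop rows that have more missing values than the limit
    df = df[df.isna().sum(axis=1) <= row_nan_limit]
    # Treat every feature as categorical: fill NaNs with the column mode
    df = df.fillna(df.mode().iloc[0])
    return df
```

The real function also handles the missing-value codes and the customer-only columns covered in the individual steps.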
Step 1 — Drop columns
The cleaning function drops 35 features (columns) because of their number of missing values: each of these columns has more than 40% NaN data. This lack of information could negatively affect the model, so I decided to get rid of them. Apart from that, for the customer dataset the function also drops three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', 'PRODUCT_GROUP').
Step 2 — Converting missing value codes
I created the data_info.csv file, which summarizes information from both the DIAS Information Levels — Attributes 2017.xlsx and DIAS Attributes — Values 2017.xlsx files. In addition, I included the missing-value code of each feature so that I would be able to identify all missing values per feature and convert them to NaN.
This is what the data_info data frame looks like:
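The exact schema of data_info.csv isn't shown in the post, so the column names below ('attribute', 'missing_values') are assumptions; the code-to-NaN conversion itself might be sketched as:

```python
import numpy as np
import pandas as pd

def convert_missing_codes(df, data_info):
    """Replace each feature's missing-value codes (e.g. -1, 0, 9) with NaN.

    Assumes data_info has an 'attribute' column and a 'missing_values'
    column holding a list-like string such as '[-1, 9]' (hypothetical schema).
    """
    for _, row in data_info.iterrows():
        col = row["attribute"]
        if col not in df.columns:
            continue
        # Parse the list-like string into integer codes
        codes = [int(c) for c in str(row["missing_values"]).strip("[]").split(",") if c.strip()]
        df[col] = df[col].replace(codes, np.nan)
    return df
```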
Step 3 — Drop rows
In order to analyze the number of missing values per row, I split the data frame into two sets: one with rows containing a high number of missing values (more than 250) and another with rows containing 250 or fewer. That let me investigate the distributions of a few features and compare both dataframes. The distributions differed for many features, so I decided to keep working with the data frame with fewer missing values and drop all rows that have more than 250 NaN values.
Step 4 — Data imputation
Although the majority of the features are either categorical or ordinal, I treated all variables as categorical. Knowing that, I replaced all missing values (NaNs) with the most frequent value of each feature in order to compensate for the missing data.
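A minimal version of this most-frequent-value imputation (roughly equivalent to scikit-learn's SimpleImputer with strategy='most_frequent') can be written in plain pandas:

```python
import pandas as pd

def impute_most_frequent(df):
    """Fill every NaN with its column's most frequent value (the mode)."""
    return df.fillna(df.mode().iloc[0])
```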
Step 5 — Feature encoding
The unsupervised learning algorithm used to build the customer segmentation requires numerical input. Because of that, all the data must be numerically encoded before being passed to the model.
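The post doesn't show the exact encoding used, so the sketch below is an assumption: one common choice for turning categorical features into numbers is one-hot encoding with pandas:

```python
import pandas as pd

def encode_features(df, categorical_cols):
    """One-hot encode the given categorical columns; other columns pass through."""
    return pd.get_dummies(df, columns=categorical_cols)
```

Ordinal features could instead keep their integer codes, since their order is meaningful.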