Lord of the Machine: Data Science Hackathon
Published:
Problem statement:
Email Marketing is still the most successful marketing channel and the essential element of any digital marketing strategy. Marketers spend a lot of time in writing that perfect email, labouring over each word, catchy layouts on multiple devices to get them best in-industry open rates & click rates. ‘How can I build my campaign to increase the click-through rates of email?’ - a question that is often heard when marketers are creating their email marketing plans. Can we optimize our email marketing campaigns with Data Science?
It’s time to unlock marketing potential and build some exceptional data-science products for email marketing.
competition link: https://datahack.analyticsvidhya.com/contest/lord-of-the-machines/.
My team consisting of 2 including me, ranked 8th in this competition out of 3594 competitors.
Data Overview:
campaign.csv
Contains the features related to 52 email Campaigns
Variable | Definition |
---|---|
campaign_id | Email campaign ID |
communication_type | Email agenda |
total_links | Total links inside the email |
no_of_internal_links | Total internal links inside the email (redirecting to analyticsvidhya.com) |
no_of_images | Number of images inside the email |
no_of_sections | Number of sections inside the email |
email_body | Email Text |
subject | Email Subject |
email_url | Email URL |
train.csv
Contains the click and open information for each user corresponding to given campaign id (Jul 17 - Dec 17)
Variable | Definition |
---|---|
id | Unique ID for email session |
user_id | User ID |
campaign_id | Email Campaign ID |
send_date | Time stamp for email sent |
is_open | Did not open - 0, Opened -1 |
is_click | Did not click - 0, clicked - 1 |
test.csv
Contains the user and campaigns for which is_click needs to be predicted (Jan 18 - Mar 18)
Variable | Definition |
---|---|
id | Unique ID for email session |
campaign_id | Email Campaign ID |
user_id | User ID |
send_date | Time stamp for email sent |
Evaluation metric
AUC ROC score
Feature Extraction
Prominent Features Extracted
- Date
- Time (in minutes)
- Day of Week
- Communication Type
- Total Links
- No of Internal Links, No of Images
- Subject - Count of Sentences, Letters, Punctuations and Stopwords
- Subject - Unique Word Percentage
- Subject - Punctuation Percentage
- Email - Count of Word, Punctuation and Capital Letters
- Count Click
- Count User
- Click Confidence
- Count of People Opening the Mail
- Open Confidence
- Email Similarity, Subject Similarity
- Subscription Period
- Communication Type Click Percentage
- Count User Frequency
- Sentiment of Mail
Correlation between Extracted Features and Output
Data Analysis
No of Emails per Communication Type
Distribution of Click Confidence, Open Confidence, Is Open and Is Click
Distribution of Click Confidence and Open Confidence for Is Click=0
Distribution of Click Confidence and Open Confidence and Is Click=1
Further data analysis - Link
Under-sampling using Repeated Edited Nearest Neighbour Algorithm
The train dataset was highly imbalanced and contained 1010409 samples with is_click=0 while only 12782 samples with is_click=1.
After undersampling the data using RENN, the number of samples with is_click=0 was reduced to 958301. Other algorithms such as ENN, AllKNN and SMOTE were also explored. But I found RENN to the best of all though it required significant amount of time to undersample the dataset.
Our Solution
The overall solution consists of a weighted average ensemble of two boosting algorithms -
- XGBoost
- LightGBM
Results
Sr No. | Public LB AUC | Private LB AUC |
---|---|---|
LightGBM | 0.68173 | - |
XGBoost | 0.66823 | - |
Ensemble | 0.68799 | 0.68630 |
Key Points -
- LightGBM outperformed XGBoost by a significant margin. Moreover, it required much less time to train than XGBoost.
- Extracting prominent features provided a major boost to the score. Most of these features were based on modelling user characteristics and extracting time series properties.
- Undersampling the data also provided a significant increase in the score.
- Boosting algorithms mostly ruled the competition.
My team consisting of 2 including me, ranked 8th in this competition out of 3594 competitors. Thanks and congrats to my teammate Rahul, who also helped me put up this article.
Github repository: https://github.com/soham97/Lord-of-the-Machine-Data-Science-Hackathon