Lord of the Machine: Data Science Hackathon

3 minute read


Problem statement:

Email marketing is still the most successful marketing channel and an essential element of any digital marketing strategy. Marketers spend a lot of time crafting the perfect email, labouring over each word and over layouts that render well on multiple devices, to achieve best-in-industry open and click rates. "How can I build my campaign to increase the click-through rate of my emails?" is a question often heard when marketers draw up their email marketing plans. Can we optimize our email marketing campaigns with data science?
It’s time to unlock marketing potential and build some exceptional data-science products for email marketing.
Competition link: https://datahack.analyticsvidhya.com/contest/lord-of-the-machines/.

Our two-person team ranked 8th out of 3,594 competitors in this competition.

Data Overview:

campaign.csv

Contains the features of the 52 email campaigns

| Variable | Definition |
| --- | --- |
| campaign_id | Email campaign ID |
| communication_type | Email agenda |
| total_links | Total links inside the email |
| no_of_internal_links | Total internal links inside the email (redirecting to analyticsvidhya.com) |
| no_of_images | Number of images inside the email |
| no_of_sections | Number of sections inside the email |
| email_body | Email text |
| subject | Email subject |
| email_url | Email URL |

train.csv

Contains the click and open information for each user for the given campaign IDs (Jul 2017 to Dec 2017)

| Variable | Definition |
| --- | --- |
| id | Unique ID for email session |
| user_id | User ID |
| campaign_id | Email campaign ID |
| send_date | Timestamp of when the email was sent |
| is_open | 1 if opened, 0 otherwise |
| is_click | 1 if clicked, 0 otherwise |

test.csv

Contains the users and campaigns for which is_click needs to be predicted (Jan 2018 to Mar 2018)

| Variable | Definition |
| --- | --- |
| id | Unique ID for email session |
| campaign_id | Email campaign ID |
| user_id | User ID |
| send_date | Timestamp of when the email was sent |

Evaluation metric

AUC ROC score
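ROC AUC can be read as the probability that a randomly chosen clicked email receives a higher score than a randomly chosen unclicked one. A minimal plain-Python sketch of this pairwise-comparison view (in practice one would use a library routine such as scikit-learn's `roc_auc_score`):

```python
def auc_roc(y_true, y_score):
    """AUC = probability that a random positive sample is ranked
    above a random negative one (ties count as half a win)."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))
```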

Feature Extraction

Prominent Features Extracted

  • Date
  • Time (in minutes)
  • Day of Week
  • Communication Type
  • Total Links
  • No of Internal Links, No of Images
  • Subject - Count of Sentences, Letters, Punctuations and Stopwords
  • Subject - Unique Word Percentage
  • Subject - Punctuation Percentage
  • Email - Count of Word, Punctuation and Capital Letters
  • Count Click
  • Count User
  • Click Confidence
  • Count of People Opening the Mail
  • Open Confidence
  • Email Similarity, Subject Similarity
  • Subscription Period
  • Communication Type Click Percentage
  • Count User Frequency
  • Sentiment of Mail
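Several of the subject-line features above come down to simple string handling. The sketch below is illustrative only; the stopword list and function name are assumptions, not the exact ones we used:

```python
import string

# Hypothetical stopword list for illustration; a real pipeline
# would use a full list (e.g. from NLTK).
STOPWORDS = frozenset({"the", "a", "an", "to", "of"})

def subject_features(subject):
    """Sketch of a few subject-line features: letter, punctuation
    and stopword counts, plus the unique-word percentage."""
    words = subject.lower().split()
    return {
        "n_letters": sum(c.isalpha() for c in subject),
        "n_punct": sum(c in string.punctuation for c in subject),
        "n_stopwords": sum(
            w.strip(string.punctuation) in STOPWORDS for w in words
        ),
        "unique_word_pct": len(set(words)) / len(words) if words else 0.0,
    }
```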

Correlation between Extracted Features and Output

[Figure: correlation between extracted features and the output]


Data Analysis

No of Emails per Communication Type

[Figure: number of emails per communication type]

Distribution of Click Confidence, Open Confidence, Is Open and Is Click

[Figure: distributions of click confidence, open confidence, is_open and is_click]

Distribution of Click Confidence and Open Confidence for Is Click=0

[Figure: distributions of click confidence and open confidence for is_click=0]

Distribution of Click Confidence and Open Confidence for Is Click=1

[Figure: distributions of click confidence and open confidence for is_click=1]

Further data analysis - Link


Under-sampling Using the Repeated Edited Nearest Neighbours Algorithm

The training dataset was highly imbalanced: it contained 1,010,409 samples with is_click=0 but only 12,782 samples with is_click=1.

[Figure: output class distribution]

After undersampling the data using RENN, the number of samples with is_click=0 was reduced to 958,301. Other algorithms such as ENN, AllKNN and SMOTE were also explored, but we found RENN to be the best of them, though it required a significant amount of time to undersample the dataset.
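To illustrate what RENN does, here is a plain-Python sketch of the core idea: repeatedly drop majority-class samples whose k nearest neighbours vote for a different class, until a pass removes nothing. A real run would use a library implementation such as imbalanced-learn's `RepeatedEditedNearestNeighbours`; the function name, k value and toy data here are illustrative:

```python
import math
from collections import Counter

def renn_undersample(X, y, majority_label=0, k=3, max_iter=5):
    """Repeated Edited Nearest Neighbours (RENN) sketch:
    repeatedly remove majority-class samples whose k nearest
    neighbours (by Euclidean distance) disagree with their label."""
    X, y = list(X), list(y)
    for _ in range(max_iter):
        keep, removed = [], False
        for i, (xi, yi) in enumerate(zip(X, y)):
            if yi != majority_label:
                keep.append(i)          # never drop minority samples
                continue
            neighbours = sorted(
                (math.dist(xi, X[j]), j) for j in range(len(X)) if j != i
            )[:k]
            vote = Counter(y[j] for _, j in neighbours).most_common(1)[0][0]
            if vote == yi:
                keep.append(i)
            else:
                removed = True          # neighbours disagree: drop it
        X = [X[i] for i in keep]
        y = [y[i] for i in keep]
        if not removed:                 # converged: nothing removed this pass
            break
    return X, y
```

On a toy set with a majority-class point stranded inside the minority cluster, that point is edited out while clean majority points survive.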


Our Solution

The overall solution is a weighted-average ensemble of two boosting algorithms:

  • XGBoost
  • LightGBM
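The blending step itself is a one-liner over the two models' predicted click probabilities. A minimal sketch; the weight 0.6 is a hypothetical value (in practice the weight would be tuned by validation AUC, and is not our exact submission weight):

```python
def weighted_ensemble(preds_a, preds_b, w=0.6):
    """Blend two models' predicted probabilities: w on the first
    model, (1 - w) on the second. w = 0.6 is illustrative only."""
    return [w * a + (1 - w) * b for a, b in zip(preds_a, preds_b)]
```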

Results

| Model | Public LB AUC | Private LB AUC |
| --- | --- | --- |
| LightGBM | 0.68173 | - |
| XGBoost | 0.66823 | - |
| Ensemble | 0.68799 | 0.68630 |

Key Points

  • LightGBM outperformed XGBoost by a significant margin. Moreover, it required much less time to train than XGBoost.
  • Extracting prominent features provided a major boost to the score. Most of these features were based on modelling user characteristics and extracting time series properties.
  • Undersampling the data also provided a significant increase in the score.
  • Boosting algorithms mostly ruled the competition.

Our two-person team ranked 8th out of 3,594 competitors. Thanks and congratulations to my teammate Rahul, who also helped me put together this article.

GitHub repository: https://github.com/soham97/Lord-of-the-Machine-Data-Science-Hackathon