Topic: Predict whether a client will subscribe to a term deposit offered through a banking institution's direct marketing campaign

Summary: The data relate to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').

1. Univariate and Bivariate Analysis

y (has the client subscribed to a term deposit?) (binary: 'yes', 'no')

In [52]:
df_y = df['y'].value_counts()
# Index by label instead of position so the result does not depend on sort order
print('Percentage of Y=yes: {}'.format(df_y['yes'] / float(df_y.sum()) * 100))
Percentage of Y=yes: 11.2654171118
In [53]:
sns.countplot(x='y', data=df)
plt.show()

Take Aways:

  1. The target variable (subscribed) is clearly imbalanced: only about 11% of clients answered 'yes'. This imbalance must be dealt with during model building.
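
One common way to deal with such an imbalance is to weight each class inversely to its frequency. The sketch below is an illustration only and not part of the original analysis: it rebuilds the target from the class counts of this dataset (36548 'no' and 4640 'yes', the column totals of the day_of_week crosstab in this notebook) and applies the n_samples / (n_classes * class_count) heuristic, the same rule scikit-learn uses for class_weight='balanced'. The helper name balanced_class_weights is hypothetical.

```python
import pandas as pd


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = y.value_counts()
    return len(y) / (len(counts) * counts)


# Stand-in for df['y'], rebuilt from the class counts (assumption: the real
# notebook would pass df['y'] directly).
y = pd.Series(['no'] * 36548 + ['yes'] * 4640, name='y')

weights = balanced_class_weights(y)
print(weights)
```

Many scikit-learn classifiers accept such a mapping (for example weights.to_dict()) through their class_weight parameter; resampling the minority class is an alternative that this sketch does not cover.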

Month

Last Contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

In [4]:
# Distribution of variable month
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.countplot(x='month', data=df)
plt.title('Bar Graph showing distribution of Calls made across various months')
plt.subplot(1,2,2)
sns.countplot(x="month", hue="y", data=df)
plt.title('Graph showing distribution of Calls made along with Conversion results across various months')
plt.show()
In [5]:
df_months = pd.crosstab(index=df['month'],columns=df['y'])
df_months['percentage(yes)'] = (df_months['yes'] / (df_months['yes'] + df_months['no'])) * 100
df_months.head()
Out[5]:
y        no  yes  percentage(yes)
month
apr    2093  539        20.478723
aug    5523  655        10.602137
dec      93   89        48.901099
jul    6525  649         9.046557
jun    4759  559        10.511470

Take Aways:

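  1. Clients are contacted most often in May, June, July and August.
  2. The subscription rate nevertheless differs by month: among the months shown above it ranges from about 9% (jul) to about 49% (dec), so the outcome is not independent of the month of last contact. The highest rate occurs in a month with very few calls (dec has only 182), so it should be read with caution.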
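
Because head() returns only the first five rows and crosstab sorts the categories alphabetically, the output above covers just five of the months. A minimal sketch of a reusable helper that returns every category in calendar order follows; the names conversion_by and MONTH_ORDER are hypothetical and the toy rows are made up for illustration, not taken from the bank dataset.

```python
import pandas as pd

MONTH_ORDER = ['jan', 'feb', 'mar', 'apr', 'may', 'jun',
               'jul', 'aug', 'sep', 'oct', 'nov', 'dec']


def conversion_by(frame, column, order=None):
    """Counts of y per category plus the percentage of 'yes' answers."""
    table = pd.crosstab(index=frame[column], columns=frame['y'])
    table['percentage(yes)'] = table['yes'] / (table['yes'] + table['no']) * 100
    if order is not None:
        # Keep only the categories that occur, in the requested order.
        table = table.reindex([c for c in order if c in table.index])
    return table


# Toy data (made up for illustration).
toy = pd.DataFrame({
    'month': ['may', 'may', 'may', 'may', 'apr', 'apr', 'dec', 'dec'],
    'y':     ['no',  'no',  'no',  'yes', 'no',  'yes', 'yes', 'no'],
})

result = conversion_by(toy, 'month', order=MONTH_ORDER)
print(result)
```

In the notebook the call would be conversion_by(df, 'month', order=MONTH_ORDER); the same helper works for day_of_week with a weekday order list.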

Day of week

Last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

In [6]:
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
sns.countplot(x='day_of_week', data=df)
plt.title('Distribution of Calls made across various days of the week')
plt.subplot(1,2,2)
sns.countplot(x="day_of_week", hue="y", data=df)
plt.title('Distribution of Calls made along with Conversion results across various days of the week')
plt.show()
In [7]:
df_days = pd.crosstab(index=df['day_of_week'],columns=df['y'])
df_days['percentage(yes)'] = (df_days['yes'] / (df_days['yes'] + df_days['no'])) * 100
df_days.head()
Out[7]:
y              no   yes  percentage(yes)
day_of_week
fri          6981   846        10.808739
mon          7667   847         9.948320
thu          7578  1045        12.118752
tue          7137   953        11.779975
wed          7185   949        11.667076

Take Aways:

  1. There is no notable trend across the days of the week: the number of calls is roughly uniform, from 7827 on Friday to 8623 on Thursday.
  2. The subscription rate is similar on every day of the week, between about 9.9% (mon) and 12.1% (thu).
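
Whether the day-to-day differences in subscription rate are more than noise can be checked with a Pearson chi-square test of independence. The sketch below is one possible check, not part of the original analysis; it computes the statistic by hand with NumPy from the counts in Out[7]. With 4 degrees of freedom the 5% critical value is about 9.49, and with a sample this large even small differences in rate can exceed it.

```python
import numpy as np

# Rows: fri, mon, thu, tue, wed; columns: no, yes (counts from Out[7]).
observed = np.array([
    [6981, 846],
    [7667, 847],
    [7578, 1045],
    [7137, 953],
    [7185, 949],
], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected counts if subscribing were independent of the weekday.
expected = row_totals * col_totals / total

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print('chi2 = %.2f with %d degrees of freedom' % (chi2, dof))
```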

Duration

Last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
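
In practice this means keeping two feature sets: one with duration for benchmarking and one without it for a realistic model. A minimal sketch, assuming the column names of this dataset; the helper split_features and the toy rows are hypothetical and only illustrate the idea.

```python
import pandas as pd


def split_features(frame, realistic=True):
    """Return (X, y); drop 'duration' from X when realistic is True."""
    drop_cols = ['y', 'duration'] if realistic else ['y']
    X = frame.drop(columns=drop_cols)
    y = (frame['y'] == 'yes').astype(int)  # 1 = subscribed, 0 = not
    return X, y


# Toy rows (made up; column names follow the dataset).
toy = pd.DataFrame({
    'age': [39, 59, 31],
    'duration': [0, 420, 95],
    'campaign': [4, 1, 2],
    'y': ['no', 'yes', 'no'],
})

X_benchmark, y_benchmark = split_features(toy, realistic=False)
X_realistic, y_realistic = split_features(toy, realistic=True)
print(list(X_benchmark.columns))
print(list(X_realistic.columns))
```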

In [8]:
# Distribution of variable duration
plt.figure(figsize=(15,10))
plt.subplot(1,2,1)
plt.hist(df['duration'],bins=[0,100,200,300,400,500,1000,1500,2000])
plt.title('Histogram of Call duration')
plt.xlim(0,2500)
plt.subplot(1,2,2)
sns.boxplot(x='y',y='duration',data=df)
plt.title('Distribution of Call duration(in secs) vs Subscribed')
Out[8]:
<matplotlib.text.Text at 0x7f6537149190>
In [9]:
plt.show()
In [10]:
df[df['duration'] == 0]
Out[10]:
age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
6251 39 admin. married high.school no yes no telephone may tue ... 4 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
23031 59 management married university.degree no yes no cellular aug tue ... 10 999 0 nonexistent 1.4 93.444 -36.1 4.965 5228.1 no
28063 53 blue-collar divorced high.school no yes no cellular apr fri ... 3 999 0 nonexistent -1.8 93.075 -47.1 1.479 5099.1 no
33015 31 blue-collar married basic.9y no no no cellular may mon ... 2 999 0 nonexistent -1.8 92.893 -46.2 1.299 5099.1 no

4 rows × 21 columns

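NOTE: Only 4 data points have duration = 0, and all of them have y = 'no'. In each case pdays = 999 and previous = 0, i.e. the client had not been contacted in an earlier campaign. These 4 rows should be removed before model training, since a call of zero length means no conversation actually took place.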

In [11]:
new_data = df[df['duration'] != 0]
# Check whether conversion still depends on duration after removing the duration = 0 data points
plt.figure(figsize=(10,12))
sns.boxplot(x='y',y='duration',data=new_data)
plt.show()

Take Aways:

  1. Most calls last between 0 and 200 seconds.
  2. There is a clear trend: the chance of a client subscribing increases once the call duration exceeds 300 seconds.
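
The box plot shows the trend only visually; the subscription rate per duration bin puts a number on it. The sketch below reuses the bin edges of the histogram above. The helper yes_rate_by_duration is hypothetical and the toy rows are made up, so the printed rates say nothing about the real data.

```python
import pandas as pd

# Same bin edges as the histogram in this section.
BINS = [0, 100, 200, 300, 400, 500, 1000, 1500, 2000]


def yes_rate_by_duration(frame, bins=BINS):
    """Number of calls and percentage of 'yes' per duration bin."""
    # Bins are right-closed, e.g. (0, 100]; duration 0 and values above the
    # last edge fall outside every bin and are left out of the summary.
    binned = pd.cut(frame['duration'], bins=bins)
    is_yes = (frame['y'] == 'yes').astype(float)
    summary = is_yes.groupby(binned, observed=True).agg(['count', 'mean'])
    summary.columns = ['calls', 'yes_rate']
    summary['yes_rate'] = summary['yes_rate'] * 100
    return summary


# Toy data (made up for illustration).
toy = pd.DataFrame({
    'duration': [50, 80, 150, 350, 450, 700],
    'y':        ['no', 'no', 'no', 'yes', 'no', 'yes'],
})

summary = yes_rate_by_duration(toy)
print(summary)
```

In the notebook the call would be yes_rate_by_duration(new_data), which makes it possible to see at which bin the rate starts to rise.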

Campaign

Number of contacts performed during this campaign and for this client (numeric, includes last contact)

In [12]:
# Distribution of variable campaign
plt.figure(figsize=(15,10))
plt.subplot(1,2,1)
plt.hist(df['campaign'])
plt.title('Distribution of Calls made during a campaign')
camp_less_than_20 = df[df['campaign'] < 20]
plt.subplot(1,2,2)
sns.countplot(x="campaign", hue="y", data=camp_less_than_20)
plt.title('Distribution of Calls made during a campaign vs Subscribed')
Out[12]:
<matplotlib.text.Text at 0x7f6536ce1b90>
In [13]:
plt.show()