#MachineLearning #SupervisedLearning #LinearRegression

By Billy Gustave

Fyntra Fashion/Retail

Fyntra is the largest online clothing company in the USA. It sells clothing online, but it also offers in-store style and clothing advice sessions: customers come into the store, meet with a personal stylist, and can then go home and order the clothes they want through either the mobile app or the website.

The company wants to decide whether to focus its effort on the mobile app experience or on the website. As a drastic measure, it is also evaluating shutting down the website.

In [1]:
# libraries
import pandas as pd, seaborn as sns, numpy as np, matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
In [2]:
df = pd.read_csv('FyntraCustomerData.csv')
df.head()
Out[2]:
| | Email | Address | Avatar | Avg_Session_Length | Time_on_App | Time_on_Website | Length_of_Membership | Yearly_Amount_Spent |
|---|---|---|---|---|---|---|---|---|
| 0 | mstephenson@fernandez.com | 835 Frank Tunnel\nWrightmouth, MI 82180-9605 | Violet | 34.497268 | 12.655651 | 39.577668 | 4.082621 | 587.951054 |
| 1 | hduke@hotmail.com | 4547 Archer Common\nDiazchester, CA 06566-8576 | DarkGreen | 31.926272 | 11.109461 | 37.268959 | 2.664034 | 392.204933 |
| 2 | pallen@yahoo.com | 24645 Valerie Unions Suite 582\nCobbborough, D... | Bisque | 33.000915 | 11.330278 | 37.110597 | 4.104543 | 487.547505 |
| 3 | riverarebecca@gmail.com | 1414 David Throughway\nPort Jason, OH 22070-1220 | SaddleBrown | 34.305557 | 13.717514 | 36.721283 | 3.120179 | 581.852344 |
| 4 | mstephens@davidson-herman.com | 14023 Rodriguez Passage\nPort Jacobville, PR 3... | MediumAquaMarine | 33.330673 | 12.795189 | 37.536653 | 4.446308 | 599.406092 |
In [3]:
X = df.loc[:,'Avg_Session_Length':'Length_of_Membership']
y = df.Yearly_Amount_Spent

Goal:

  • Use the observed features to predict 'Yearly_Amount_Spent'
  • Compare time spent on the Website vs the Mobile App

Data Cleaning and Exploration

In [4]:
df.shape
Out[4]:
(500, 8)
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
Email                   500 non-null object
Address                 500 non-null object
Avatar                  500 non-null object
Avg_Session_Length      500 non-null float64
Time_on_App             500 non-null float64
Time_on_Website         500 non-null float64
Length_of_Membership    500 non-null float64
Yearly_Amount_Spent     500 non-null float64
dtypes: float64(5), object(3)
memory usage: 31.4+ KB

We have 500 observations and no missing values.

Low-variance filter: we apply this approach to identify and drop (near-)constant variables from the dataset. Features with very low variance carry almost no information about the target, so they can usually be dropped safely.

In [6]:
variances = X.var()
threshold = 0.1   # the 10% cutoff mentioned above
# collect any feature whose variance falls below the threshold
low_variance = [col for col, v in variances.items() if v < threshold]
print(low_variance)
[]

No features with low variance
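
scikit-learn also ships a VarianceThreshold transformer that performs the same filtering; a minimal sketch, assuming the same X and the same 0.1 cutoff:

from sklearn.feature_selection import VarianceThreshold

# keeps only features whose variance exceeds the threshold
selector = VarianceThreshold(threshold=0.1)
selector.fit(X)
print(X.columns[selector.get_support()])   # columns that survive the filter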

In [7]:
sns.pairplot(df)
Out[7]:
<seaborn.axisgrid.PairGrid at 0x2321917e1c8>
In [8]:
df.corr()
Out[8]:
| | Avg_Session_Length | Time_on_App | Time_on_Website | Length_of_Membership | Yearly_Amount_Spent |
|---|---|---|---|---|---|
| Avg_Session_Length | 1.000000 | -0.027826 | -0.034987 | 0.060247 | 0.355088 |
| Time_on_App | -0.027826 | 1.000000 | 0.082388 | 0.029143 | 0.499328 |
| Time_on_Website | -0.034987 | 0.082388 | 1.000000 | -0.047582 | -0.002641 |
| Length_of_Membership | 0.060247 | 0.029143 | -0.047582 | 1.000000 | 0.809084 |
| Yearly_Amount_Spent | 0.355088 | 0.499328 | -0.002641 | 0.809084 | 1.000000 |

There is a strong correlation (0.81) between how long a customer has been a member and the amount spent per year, and no strong correlations among the features themselves.
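
The feature-target correlations can also be read off directly; a small sketch using the same df (newer pandas versions may need df.corr(numeric_only=True)):

# correlation of each numeric feature with the target, sorted
print(df.corr()['Yearly_Amount_Spent'].sort_values(ascending=False))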

In [9]:
# Length_of_Membership vs Yearly_Amount_Spent also fits a linear model well
sns.regplot(x='Yearly_Amount_Spent',y='Length_of_Membership',data=df)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x23219e48fc8>

We also observed some correlation between time spent on the app and spending:

In [10]:
# it seems there is a positive correlation between Time on App and Yearly Amount spent.
sns.jointplot(x='Yearly_Amount_Spent',y='Time_on_App',data=df)
Out[10]:
<seaborn.axisgrid.JointGrid at 0x2321a312548>

Train/test split

In [11]:
# random_state guarantees the same output every time the program is run
x_train, x_test, y_train, y_test = train_test_split(X,y, test_size = .2, random_state=85)

Random Forest: this is one of the most commonly used techniques for estimating the importance of each feature in a dataset. We can rank the features by importance and keep only the top ones, resulting in dimensionality reduction.

  • Random Forest is one of the most widely used algorithms for feature selection.
  • It comes with built-in feature importances, which help us select a smaller subset of features.
In [12]:
rfr = RandomForestRegressor(random_state=85)
rfr.fit(x_train, y_train)
Out[12]:
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=85, verbose=0, warm_start=False)
In [13]:
features = x_train.columns
importances = rfr.feature_importances_
indices = np.argsort(importances)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

From the above graph, all four features carry some weight.
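
The same ranking can be printed as numbers rather than read off the chart; a small sketch reusing the fitted rfr:

# feature importances as a sorted table
print(pd.Series(rfr.feature_importances_, index=x_train.columns).sort_values(ascending=False))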

Feature selection - Forward selection
This method lets us use GridSearchCV together with a Pipeline, so we can tune the number of selected features alongside the estimator's own parameters:

*Side note:
This technique is practical only for data with a low number of features.
The 'forward' and 'floating' parameters give us more options than plain forward selection, as sketched below.
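
For instance, a floating variant would be configured like this (a sketch only, not fitted here):

# floating=True gives Sequential Forward Floating Selection (SFFS):
# after each forward step, the search may drop a previously added
# feature if doing so improves the cross-validated score
sffs = SFS(estimator=LinearRegression(), k_features=3, forward=True,
           floating=True, scoring='neg_mean_squared_error', cv=10)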

In [14]:
lr = LinearRegression()
sfs1 = SFS(estimator=lr, k_features=4, forward=True, floating=False,
           scoring='neg_mean_squared_error', cv=10)
pipe = Pipeline([('sfs', sfs1),('lr', lr)])
param_grid = [{'sfs__k_features': [1, 2, 3, 4]}]
gs = GridSearchCV(estimator=pipe, param_grid=param_grid,
                  scoring='neg_mean_squared_error', n_jobs=1, cv=10, refit=False)
gs = gs.fit(x_train, y_train)
print("Best parameters via GridSearch", gs.best_params_)
Best parameters via GridSearch {'sfs__k_features': 3}

Using this information to determine the best features:

*Side note:
This step could have been folded into the previous one.

In [15]:
sfs1 = SFS(estimator=lr, k_features=3, forward=True, floating=False,
           scoring='neg_mean_squared_error', cv=10)
sfs1 = sfs1.fit(x_train, y_train)
for k,v in sfs1.subsets_.items():
    print('avg_score: ',v.get('avg_score'))
    print('feature_names: ',v.get('feature_names'))
avg_score:  -2272.9988804481713
feature_names:  ('Length_of_Membership',)
avg_score:  -760.5587309360617
feature_names:  ('Time_on_App', 'Length_of_Membership')
avg_score:  -101.78054021192959
feature_names:  ('Avg_Session_Length', 'Time_on_App', 'Length_of_Membership')

Best features:
'Avg_Session_Length', 'Time_on_App', 'Length_of_Membership'
Score:
-101.78054021192959

Prediction and MSE

In [16]:
x_train_noweb = x_train.drop('Time_on_Website', axis=1)
x_test_noweb = x_test.drop('Time_on_Website', axis=1)
model = lr.fit(x_train_noweb, y_train)
pred_y = lr.predict(x_test_noweb)
mse = metrics.mean_squared_error(y_test, pred_y)
print("mse = ", mse)
print("rmse = ", np.sqrt(mse))
print("score = ", lr.score(x_test_noweb,y_test))
mse =  93.9011519981122
rmse =  9.690260677510807
score =  0.9811959379105001
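
For completeness, the mean absolute error could be reported the same way; a one-line sketch using the same metrics module:

print("mae = ", metrics.mean_absolute_error(y_test, pred_y))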
In [17]:
# Graph of predicted vs test
plt.scatter(y_test,pred_y)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
plt.show()

Comparing Mobile App vs Website revenue:

The model's 'coef_' attribute tells us how much the predicted y changes for a one-unit increase in each feature, holding the others fixed.

In [18]:
model = lr.fit(x_train, y_train)
pred_y = lr.predict(x_test)
In [19]:
coefficients = pd.DataFrame(lr.coef_, X.columns)
coefficients.columns = ['Coefficient']
coefficients
Out[19]:
| | Coefficient |
|---|---|
| Avg_Session_Length | 25.947252 |
| Time_on_App | 39.066821 |
| Time_on_Website | 0.682530 |
| Length_of_Membership | 61.334694 |

Time on App leads to higher revenue: for every 1-unit increase in Time_on_App, yearly spending rises by about $39.07, while a 1-unit increase in Time_on_Website adds only about $0.68.
Mobile App: ~39.1 dollars per unit
Website: ~0.7 dollars per unit
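
As a quick sanity check on this reading, perturbing one feature at the test-set mean should shift the prediction by exactly that feature's coefficient in a linear model; a small sketch reusing the model refit on all four features:

# bump Time_on_App by one unit and compare predictions
baseline = x_test.mean().to_frame().T
bumped = baseline.copy()
bumped['Time_on_App'] += 1
print(lr.predict(bumped) - lr.predict(baseline))   # ~[39.07]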

Contact Me

www.linkedin.com/in/billygustave

billygustave.com