#MachineLearning #SupervisedLearning #LinearRegression
By Billy Gustave
Fyntra is the largest online clothing company in the USA. It sells clothing online, but it also offers in-store style and clothing advice sessions: customers come into the store, meet with a personal stylist, then go home and order the clothes they want on either the mobile app or the website.
The company wants to decide whether to focus its efforts on the mobile app experience or on the website. As a drastic measure, it is also evaluating shutting down the website.
# libraries
import pandas as pd, seaborn as sns, numpy as np, matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
df = pd.read_csv('FyntraCustomerData.csv')
df.head()
X = df.loc[:,'Avg_Session_Length':'Length_of_Membership']
y = df.Yearly_Amount_Spent
Goal: determine whether the mobile app or the website has the greater effect on yearly customer spending.
df.shape
df.info()
We have 500 observations and no missing values.
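A quick sanity check confirms this (a minimal sketch on the same DataFrame):
# count missing values per column; every count should be 0
print(df.isnull().sum())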
Low Variance filter: We apply this approach to identify and drop near-constant variables from the dataset. Variables with very low variance carry almost no information about the target, so they can be safely dropped.
numeric = X
var = numeric.var()
numeric = numeric.columns
variable = []
for i in range(0, len(var)):
    if var.iloc[i] >= 10:  # keep features with a variance of at least 10
        variable.append(numeric[i])
print(variable)
All four features pass the variance threshold, so none are dropped.
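For comparison, scikit-learn's VarianceThreshold performs the same filter in one step (a minimal sketch using the same threshold of 10; note it keeps features whose variance is strictly greater than the threshold):
from sklearn.feature_selection import VarianceThreshold
# fit the selector on the features and list the surviving columns
selector = VarianceThreshold(threshold=10)
selector.fit(X)
print(X.columns[selector.get_support()].tolist())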
sns.pairplot(df)
df.corr()
There is a strong correlation between how long a customer has been a member and the amount spent per year. There are no strong correlations among the features themselves.
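A heatmap makes the same correlation matrix easier to scan (a minimal sketch; numeric_only=True is assumed here to skip any non-numeric columns):
# annotated heatmap of the pairwise correlations
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()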
# the relationship between length of membership and yearly spending fits a linear model well
sns.regplot(x='Yearly_Amount_Spent',y='Length_of_Membership',data=df)
Observed some correlation between time spent on the app and spending:
# it seems there is a positive correlation between Time on App and Yearly Amount spent.
sns.jointplot(x='Yearly_Amount_Spent',y='Time_on_App',data=df)
# random_state guarantees the same train/test split every time the program is run
x_train, x_test, y_train, y_test = train_test_split(X,y, test_size = .2, random_state=85)
Random Forest: This is one of the most commonly used techniques for measuring the importance of each feature in the dataset. We can rank the features by importance and keep only the top ones, resulting in dimensionality reduction.
rfr = RandomForestRegressor(random_state=85)
rfr.fit(x_train, y_train)
features = x_train.columns
importances = rfr.feature_importances_
indices = np.argsort(importances)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
From the graph above, all four features carry some weight.
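The same importances can be viewed numerically (a small sketch reusing features and importances from the cell above):
# feature importances sorted from most to least important
print(pd.Series(importances, index=features).sort_values(ascending=False))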
Feature selection - Forward selection
This method lets us use GridSearchCV, together with a Pipeline, to fine-tune the estimator's parameters.
*Side note: this technique is practical only for data with a low number of features. The 'forward' and 'floating' parameters give us more options than just forward selection (e.g. backward elimination and floating variants).
lr = LinearRegression()
sfs1 = SFS(estimator=lr,k_features=4,forward=True,floating=False,scoring='neg_mean_squared_error',cv=10)
pipe = Pipeline([('sfs', sfs1),('lr', lr)])
param_grid = [{'sfs__k_features': [1, 2, 3, 4]}]
gs = GridSearchCV(estimator=pipe,param_grid=param_grid,scoring='neg_mean_squared_error',n_jobs=1,cv=10,refit=False)
gs = gs.fit(x_train, y_train)
print("Best parameters via GridSearch", gs.best_params_)
Using this information to determine the best features (*side note: this step could be combined with the previous one):
sfs1 = SFS(estimator=lr,k_features=3,forward=True,floating=False,scoring='neg_mean_squared_error',cv=10)
sfs1 = sfs1.fit(x_train, y_train)
for k, v in sfs1.subsets_.items():
    print('avg_score: ', v.get('avg_score'))
    print('feature_names: ', v.get('feature_names'))
Best features: 'Avg_Session_Length', 'Time_on_App', 'Length_of_Membership'
Score (neg MSE): -101.78054021192959
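Since the selector scores with negative MSE, converting the best score back to RMSE puts it in dollars (a small sketch assuming the 3-feature subset shown above):
# avg_score is the negative MSE averaged over the 10 CV folds; RMSE is roughly 10.1
best_neg_mse = sfs1.subsets_[3]['avg_score']
print('cv rmse ~', np.sqrt(-best_neg_mse))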
x_train_noweb = x_train.drop('Time_on_Website', axis=1)
x_test_noweb = x_test.drop('Time_on_Website', axis=1)
model = lr.fit(x_train_noweb, y_train)
pred_y = lr.predict(x_test_noweb)
mse = metrics.mean_squared_error(y_test, pred_y)
print("mse = ", mse)
print("rmse = ", np.sqrt(mse))
print("score = ", lr.score(x_test_noweb,y_test))
# Graph of predicted vs test
plt.scatter(y_test,pred_y)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
plt.show()
The 'coef_' attribute shows how much each X feature is multiplied by to produce y, i.e. how much yearly spending changes for a one-unit increase in that feature.
model = lr.fit(x_train, y_train)
pred_y = lr.predict(x_test)
coefficients = pd.DataFrame(lr.coef_, X.columns)
coefficients.columns = ['Coefficient']
coefficients
Time on App leads to higher revenue: for every 1-unit increase in Time on App, there is an increase of about 39.1 dollars spent per year, compared with only about 0.7 dollars per unit increase in Time on Website.
Mobile app: 39.1 (per unit)
Website: 0.7 (per unit)