#MachineLearning #SupervisedLearning #LinearRegression
By Billy Gustave
Fyntra is the largest online clothing company in the USA. It sells clothing online, but it also offers in-store style and clothing advice sessions: customers come into the store, meet with a personal stylist, then go home and order the clothes they want on either the mobile app or the website.
The company wants to decide whether to focus its efforts on the mobile app experience or on the website. As a drastic measure, it is also evaluating shutting down the website.
# libraries
import pandas as pd, seaborn as sns, numpy as np, matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
df = pd.read_csv('FyntraCustomerData.csv')
df.head()
X = df.loc[:,'Avg_Session_Length':'Length_of_Membership']
y = df.Yearly_Amount_Spent
Goal: determine whether the mobile app or the website has the greater effect on yearly customer spending.
df.shape
df.info()
We have 500 observations and no missing values.
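A quick sanity check confirms this (a minimal sketch on the same DataFrame):
# count missing values per column; every count should be 0
print(df.isnull().sum())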
Low Variance filter: We apply this approach to identify and drop near-constant variables from the dataset. Variables with very low variance carry almost no information about the target, so they can be safely dropped.
numeric = X
var = numeric.var()
numeric = numeric.columns
variable = []
for i in range(0, len(var)):
    if var.iloc[i] >= 10:  # keep features with a variance of at least 10
        variable.append(numeric[i])
print(variable)
All four features pass the variance threshold, so none are dropped.
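For comparison, scikit-learn's VarianceThreshold performs the same filter in one step (a minimal sketch using the same threshold of 10; note it keeps features whose variance is strictly greater than the threshold):
from sklearn.feature_selection import VarianceThreshold
# fit the selector on the features and list the surviving columns
selector = VarianceThreshold(threshold=10)
selector.fit(X)
print(X.columns[selector.get_support()].tolist())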
sns.pairplot(df)
df.corr()
There is a strong correlation between how long a customer has been a member and the amount spent per year. There are no strong correlations among the features themselves.
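A heatmap makes the same correlation matrix easier to scan (a minimal sketch; numeric_only=True is assumed here to skip any non-numeric columns):
# annotated heatmap of the pairwise correlations
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()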
# the relationship between length of membership and yearly spending fits a linear model well
sns.regplot(x='Yearly_Amount_Spent',y='Length_of_Membership',data=df)
Observed some correlation between time spent on the app and spending:
# it seems there is a positive correlation between Time on App and Yearly Amount spent.
sns.jointplot(x='Yearly_Amount_Spent',y='Time_on_App',data=df)
# random_state guarantees the same train/test split every time the program is run
x_train, x_test, y_train, y_test = train_test_split(X,y, test_size = .2, random_state=85)
Random Forest: This is one of the most commonly used techniques for measuring the importance of each feature in the dataset. We can rank the features by importance and keep only the top ones, resulting in dimensionality reduction.
rfr = RandomForestRegressor(random_state=85)
rfr.fit(x_train, y_train)
features = x_train.columns
importances = rfr.feature_importances_
indices = np.argsort(importances)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
From the graph above, all four features carry some weight.
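The same importances can be viewed numerically (a small sketch reusing features and importances from the cell above):
# feature importances sorted from most to least important
print(pd.Series(importances, index=features).sort_values(ascending=False))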
Feature selection - Forward selection
This method lets us use GridSearchCV, together with a Pipeline, to fine-tune the estimator's parameters.
*Side note: this technique is practical only for data with a low number of features. The 'forward' and 'floating' parameters give us more options than just forward selection (e.g. backward elimination and floating variants).
lr = LinearRegression()
sfs1 = SFS(estimator=lr,k_features=4,forward=True,floating=False,scoring='neg_mean_squared_error',cv=10)
pipe = Pipeline([('sfs', sfs1),('lr', lr)])
param_grid = [{'sfs__k_features': [1, 2, 3, 4]}]
gs = GridSearchCV(estimator=pipe,param_grid=param_grid,scoring='neg_mean_squared_error',n_jobs=1,cv=10,refit=False)
gs = gs.fit(x_train, y_train)
print("Best parameters via GridSearch", gs.best_params_)
Using this information to determine the best features (*side note: this step could be combined with the previous one):
sfs1 = SFS(estimator=lr,k_features=3,forward=True,floating=False,scoring='neg_mean_squared_error',cv=10)
sfs1 = sfs1.fit(x_train, y_train)
for k, v in sfs1.subsets_.items():
    print('avg_score: ', v.get('avg_score'))
    print('feature_names: ', v.get('feature_names'))
Best features: 'Avg_Session_Length', 'Time_on_App', 'Length_of_Membership'
Score (neg MSE): -101.78054021192959
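Since the selector scores with negative MSE, converting the best score back to RMSE puts it in dollars (a small sketch assuming the 3-feature subset shown above):
# avg_score is the negative MSE averaged over the 10 CV folds; RMSE is roughly 10.1
best_neg_mse = sfs1.subsets_[3]['avg_score']
print('cv rmse ~', np.sqrt(-best_neg_mse))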
x_train_noweb = x_train.drop('Time_on_Website', axis=1)
x_test_noweb = x_test.drop('Time_on_Website', axis=1)
model = lr.fit(x_train_noweb, y_train)
pred_y = lr.predict(x_test_noweb)
mse = metrics.mean_squared_error(y_test, pred_y)
print("mse = ", mse)
print("rmse = ", np.sqrt(mse))
print("score = ", lr.score(x_test_noweb,y_test))
# Graph of predicted vs test
plt.scatter(y_test,pred_y)
plt.xlabel('Y Test')
plt.ylabel('Predicted Y')
plt.show()
The 'coef_' attribute shows how much each X feature is multiplied by to produce y, i.e. how much yearly spending changes for a one-unit increase in that feature.
model = lr.fit(x_train, y_train)
pred_y = lr.predict(x_test)
coefficients = pd.DataFrame(lr.coef_, X.columns)
coefficients.columns = ['Coefficient']
coefficients
Time on App leads to higher revenue: for every 1-unit increase in Time on App, there is an increase of about 39.1 dollars spent per year, compared with only about 0.7 dollars per unit increase in Time on Website.
Mobile app: 39.1 (per unit)
Website: 0.7 (per unit)