#MachineLearning #RecommendationEngine

By Billy Gustave

Book Recommendation

Business challenge/requirement:

BookRent is the largest online and offline book rental chain in India. The company charges a fixed monthly fee plus a rental fee per book, so it makes more money when users rent more books.

As an ML expert, you have to build a recommendation engine so that each user gets book recommendations based on the behavior of similar users. This ensures that users rent books matching their individual taste.

The company is still unprofitable and is looking to improve both revenue and profit.

Goal :

  • Use a recommendation engine to entice users to rent more books.
  • Recommend books based on ratings.
  • Data: BX-Book-Ratings.csv, BX-Books.csv, BX-Users.csv
  • Will use a sample of the ratings (the full ~1M rows are too large for my system).

Data Exploration

In [1]:
import numpy as np, pandas as pd
In [2]:
# note: encoding='ansi' works only on Windows; encoding='latin-1' is the portable equivalent for this dataset
df_user_bx_rtng = pd.read_csv('BX-Book-Ratings.csv', encoding='ansi', nrows=10000)
print(df_user_bx_rtng.shape)
df_user_bx_rtng.head()
(10000, 3)
Out[2]:
user_id isbn rating
0 276725 034545104X 0
1 276726 155061224 5
2 276727 446520802 0
3 276729 052165615X 3
4 276729 521795028 6
In [3]:
df_bx = pd.read_csv('BX-Books.csv', encoding='ansi', low_memory=False)
print(df_bx.shape)
df_bx.head()
(271379, 5)
Out[3]:
isbn book_title book_author year_of_publication publisher
0 195153448 Classical Mythology Mark P. O. Morford 2002 Oxford University Press
1 2005018 Clara Callan Richard Bruce Wright 2001 HarperFlamingo Canada
2 60973129 Decision in Normandy Carlo D'Este 1991 HarperPerennial
3 374157065 Flu: The Story of the Great Influenza Pandemic... Gina Bari Kolata 1999 Farrar Straus Giroux
4 393045218 The Mummies of Urumchi E. J. W. Barber 1999 W. W. Norton & Company
In [4]:
df_user = pd.read_csv('BX-Users.csv', encoding='ansi', low_memory=False)
print(df_user.shape)
df_user.head()
(278859, 3)
Out[4]:
user_id Location Age
0 1 nyc, new york, usa NaN
1 2 stockton, california, usa 18.0
2 3 moscow, yukon territory, russia NaN
3 4 porto, v.n.gaia, portugal 17.0
4 5 farnborough, hants, united kingdom NaN

We won't use the BX-Users data since it doesn't contain information useful for this project.

We will keep only ratings whose ISBN matches a book in BX-Books:

In [5]:
# merge the two datasets on ISBN
df = pd.merge(df_user_bx_rtng,df_bx,on='isbn')
print(df.shape)
df.head()
(8701, 7)
Out[5]:
user_id isbn rating book_title book_author year_of_publication publisher
0 276725 034545104X 0 Flesh Tones: A Novel M. J. Rose 2002 Ballantine Books
1 276726 155061224 5 Rites of Passage Judith Rae 2001 Heinle
2 276727 446520802 0 The Notebook Nicholas Sparks 1996 Warner Books
3 278418 446520802 0 The Notebook Nicholas Sparks 1996 Warner Books
4 276729 052165615X 3 Help!: Level 1 Philip Prowse 1999 Cambridge University Press

Checking for missing values

In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8701 entries, 0 to 8700
Data columns (total 7 columns):
user_id                8701 non-null int64
isbn                   8701 non-null object
rating                 8701 non-null int64
book_title             8701 non-null object
book_author            8701 non-null object
year_of_publication    8701 non-null object
publisher              8701 non-null object
dtypes: int64(2), object(5)
memory usage: 543.8+ KB

No missing values

Data Engineering

Train-Test-Split

In [7]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, random_state=7)
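Since no test_size is passed above, train_test_split falls back to its default 25% hold-out. A minimal sketch on a toy frame (hypothetical data, not the BookRent ratings) showing the resulting proportions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the ratings data
toy = pd.DataFrame({'user_id': range(100), 'rating': [i % 10 for i in range(100)]})

# With no test_size given, train_test_split holds out 25% by default
train_part, test_part = train_test_split(toy, random_state=7)
print(len(train_part), len(test_part))  # 75 25
```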

Rating Matrix

In [8]:
# get users and books
n_users = df.user_id.unique().shape[0]
n_books = df.isbn.unique().shape[0]
In [9]:
print('Num. of Users: '+ str(n_users))
print('Num of Books: '+str(n_books))
Num. of Users: 828
Num of Books: 8051
In [10]:
# I will use a dataframe instead since isbn is alphanumeric,
# and convert to a numpy array after filling the matrix
train_to_convert = pd.DataFrame(columns=df.isbn.unique(),index=df.user_id.unique()).fillna(0)
test_to_convert = pd.DataFrame(columns=df.isbn.unique(),index=df.user_id.unique()).fillna(0)
In [11]:
# fill in our new matrix
for line in train.itertuples():
    train_to_convert.loc[line[1], line[2]] = line[3]
for line in test.itertuples():
    test_to_convert.loc[line[1], line[2]] = line[3]
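The itertuples loop above works, but pandas can build the same user-by-book matrix without a Python loop. A sketch of that alternative on a tiny hypothetical ratings frame (the column names mirror `train`; `all_users`/`all_isbns` stand in for `df.user_id.unique()` and `df.isbn.unique()`):

```python
import numpy as np
import pandas as pd

# Hypothetical mini ratings frame with the same columns as `train`
ratings = pd.DataFrame({
    'user_id': [1, 1, 2],
    'isbn': ['A', 'B', 'A'],
    'rating': [5, 3, 4],
})
all_users = [1, 2, 3]          # stand-in for df.user_id.unique()
all_isbns = ['A', 'B', 'C']    # stand-in for df.isbn.unique()

# pivot_table + reindex builds the user-by-book matrix in one vectorized pass
matrix = (ratings.pivot_table(index='user_id', columns='isbn', values='rating')
                 .reindex(index=all_users, columns=all_isbns)
                 .fillna(0)
                 .values)
print(matrix)
```

Reindexing against the full user and ISBN lists keeps train and test matrices the same shape, which the similarity step below requires.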
In [12]:
# convert to array
train_matrix = train_to_convert.values
test_matrix = test_to_convert.values
print(train_matrix)
print(test_matrix)
[[0 0 0 ... 0 0 0]
 [0 5 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Modeling

Creating Similarities

In [13]:
from sklearn.metrics import pairwise_distances, mean_squared_error
# note: pairwise_distances with metric='cosine' returns cosine *distances* (1 - similarity)
user_similarity = pairwise_distances(train_matrix, metric='cosine')
book_similarity = pairwise_distances(train_matrix.T, metric='cosine')
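To make the cosine metric concrete, here is a small sketch on hypothetical rating vectors: two users who rate in the same proportions get distance 0, while users with no overlap get distance 1:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Toy user rating vectors: users 0 and 1 point the same direction; user 2 is orthogonal
ratings = np.array([[5.0, 0.0],
                    [10.0, 0.0],
                    [0.0, 3.0]])

# cosine distance = 1 - cosine similarity
dist = pairwise_distances(ratings, metric='cosine')
print(np.round(dist, 3))
```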

Prediction

In [14]:
# user prediction
mean_user_rating = train_matrix.mean(axis=1)[:,np.newaxis]
ratings_diff = (train_matrix - mean_user_rating)
user_pred = mean_user_rating + user_similarity.dot(ratings_diff)/np.array([np.abs(user_similarity).sum(axis=1)]).T
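The prediction line above centers each user's ratings around their mean, takes a weighted sum of those deviations across users, and adds the user's mean back. A worked sketch on a tiny hypothetical 2-user, 3-book matrix (uniform weights stand in for the similarity matrix):

```python
import numpy as np

# Tiny 2-user x 3-book matrix (hypothetical ratings)
train_mat = np.array([[4.0, 0.0, 2.0],
                      [0.0, 5.0, 1.0]])
# Uniform weights standing in for the user similarity matrix
sim = np.ones((2, 2))

mean_rating = train_mat.mean(axis=1)[:, np.newaxis]   # each user's mean rating
diff = train_mat - mean_rating                        # mean-centred ratings
# weighted deviations, normalised by total weight, shifted back by the mean
pred = mean_rating + sim.dot(diff) / np.abs(sim).sum(axis=1)[:, np.newaxis]
print(pred)
```

Mean-centering compensates for users who rate systematically high or low before their neighbours' opinions are averaged in.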
In [15]:
# book prediction
book_pred = train_matrix.dot(book_similarity)/np.array([np.abs(book_similarity).sum(axis=1)])

Error test

In [16]:
# keep only the entries that hold actual ratings (note: this overwrites the earlier `test` dataframe)
test = test_matrix[test_matrix.nonzero()].flatten()
user_pred = user_pred[test_matrix.nonzero()].flatten()
book_pred = book_pred[test_matrix.nonzero()].flatten()
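The nonzero() mask restricts the evaluation to cells where the test set holds a real rating, so the RMSE is not diluted by the many zero placeholders. A minimal sketch on hypothetical 2x2 matrices:

```python
import numpy as np

# Hypothetical held-out ratings (0 = no rating) and matching predictions
test_mat = np.array([[0, 3],
                     [4, 0]])
pred_mat = np.array([[1.0, 2.5],
                     [3.5, 0.5]])

mask = test_mat.nonzero()             # indices of actual held-out ratings
actual = test_mat[mask].flatten()     # only the rated cells survive
predicted = pred_mat[mask].flatten()  # predictions at the same cells
print(actual, predicted)
```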
In [17]:
from math import sqrt
print('User rmse: ', sqrt(mean_squared_error(user_pred,test)))
print('Book rmse: ', sqrt(mean_squared_error(book_pred,test)))
User rmse:  7.80791430244133
Book rmse:  7.807178542549171

About the same results for both user-based (UBCF) and item-based (IBCF) collaborative filtering.

With a 5,000-row sample:

  • User rmse: 7.770542207001438
  • Book rmse: 7.7678620620152845

With a 1,000-row sample:

  • User rmse: 7.716320257089829
  • Book rmse: 7.697961798247359

Contact Me

www.linkedin.com/in/billygustave

billygustave.com