#MachineLearning #RecommendationEngine

By Billy Gustave

Book Recommendation

Business challenge/requirement:

BookRent is the largest online and offline book rental chain in India. The company charges a fixed monthly fee plus a rental fee per book, so it makes more money when users rent more books.

As an ML expert, you have to build a recommendation engine so that each user gets book recommendations based on the behavior of similar users. This ensures that users rent books matching their individual taste.

The company is still unprofitable and is looking to improve both revenue and profit.

Goal :

  • Use a recommendation engine to entice users to rent more books.
  • Recommend books based on ratings.
  • Data: BX-Book-Ratings.csv, BX-Books.csv, BX-Users.csv
  • Will use a sample of the ratings (the full ~1M rows are too large for my system).

Data Exploration

In [1]:
import numpy as np, pandas as pd
In [2]:
# note: encoding='ansi' works only on Windows; encoding='latin-1' is the portable equivalent for this dataset
df_user_bx_rtng = pd.read_csv('BX-Book-Ratings.csv', encoding='ansi', nrows=10000)
print(df_user_bx_rtng.shape)
df_user_bx_rtng.head()
(10000, 3)
Out[2]:
user_id isbn rating
0 276725 034545104X 0
1 276726 155061224 5
2 276727 446520802 0
3 276729 052165615X 3
4 276729 521795028 6
In [3]:
df_bx = pd.read_csv('BX-Books.csv', encoding='ansi', low_memory=False)
print(df_bx.shape)
df_bx.head()
(271379, 5)
Out[3]:
isbn book_title book_author year_of_publication publisher
0 195153448 Classical Mythology Mark P. O. Morford 2002 Oxford University Press
1 2005018 Clara Callan Richard Bruce Wright 2001 HarperFlamingo Canada
2 60973129 Decision in Normandy Carlo D'Este 1991 HarperPerennial
3 374157065 Flu: The Story of the Great Influenza Pandemic... Gina Bari Kolata 1999 Farrar Straus Giroux
4 393045218 The Mummies of Urumchi E. J. W. Barber 1999 W. W. Norton & Company
In [4]:
df_user = pd.read_csv('BX-Users.csv', encoding='ansi', low_memory=False)
print(df_user.shape)
df_user.head()
(278859, 3)
Out[4]:
user_id Location Age
0 1 nyc, new york, usa NaN
1 2 stockton, california, usa 18.0
2 3 moscow, yukon territory, russia NaN
3 4 porto, v.n.gaia, portugal 17.0
4 5 farnborough, hants, united kingdom NaN

We won't use the BX-Users data since it doesn't contain information useful for this project.

We will keep only ratings whose ISBN matches a book in BX-Books:

In [5]:
# merge the two datasets on ISBN
df = pd.merge(df_user_bx_rtng,df_bx,on='isbn')
print(df.shape)
df.head()
(8701, 7)
Out[5]:
user_id isbn rating book_title book_author year_of_publication publisher
0 276725 034545104X 0 Flesh Tones: A Novel M. J. Rose 2002 Ballantine Books
1 276726 155061224 5 Rites of Passage Judith Rae 2001 Heinle
2 276727 446520802 0 The Notebook Nicholas Sparks 1996 Warner Books
3 278418 446520802 0 The Notebook Nicholas Sparks 1996 Warner Books
4 276729 052165615X 3 Help!: Level 1 Philip Prowse 1999 Cambridge University Press

Checking for missing values

In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8701 entries, 0 to 8700
Data columns (total 7 columns):
user_id                8701 non-null int64
isbn                   8701 non-null object
rating                 8701 non-null int64
book_title             8701 non-null object
book_author            8701 non-null object
year_of_publication    8701 non-null object
publisher              8701 non-null object
dtypes: int64(2), object(5)
memory usage: 543.8+ KB

No missing values

Data Engineering

Train-Test-Split

In [7]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, random_state=7)
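Since no test_size is passed above, train_test_split falls back to its default 25% hold-out. A minimal sketch on a toy frame (hypothetical data, not the BookRent ratings) showing the resulting proportions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the ratings data
toy = pd.DataFrame({'user_id': range(100), 'rating': [i % 10 for i in range(100)]})

# With no test_size given, train_test_split holds out 25% by default
train_part, test_part = train_test_split(toy, random_state=7)
print(len(train_part), len(test_part))  # 75 25
```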

Rating Matrix

In [8]:
# get users and books
n_users = df.user_id.unique().shape[0]
n_books = df.isbn.unique().shape[0]
In [9]:
print('Num. of Users: '+ str(n_users))
print('Num of Books: '+str(n_books))
Num. of Users: 828
Num of Books: 8051
In [10]:
# I will use a dataframe instead since isbn is alphanumeric,
# and convert to a numpy array after filling the matrix
train_to_convert = pd.DataFrame(columns=df.isbn.unique(),index=df.user_id.unique()).fillna(0)
test_to_convert = pd.DataFrame(columns=df.isbn.unique(),index=df.user_id.unique()).fillna(0)
In [11]:
# fill in our new matrix
for line in train.itertuples():
    train_to_convert.loc[line[1], line[2]] = line[3]
for line in test.itertuples():
    test_to_convert.loc[line[1], line[2]] = line[3]
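The itertuples loop above works, but pandas can build the same user-by-book matrix without a Python loop. A sketch of that alternative on a tiny hypothetical ratings frame (the column names mirror `train`; `all_users`/`all_isbns` stand in for `df.user_id.unique()` and `df.isbn.unique()`):

```python
import numpy as np
import pandas as pd

# Hypothetical mini ratings frame with the same columns as `train`
ratings = pd.DataFrame({
    'user_id': [1, 1, 2],
    'isbn': ['A', 'B', 'A'],
    'rating': [5, 3, 4],
})
all_users = [1, 2, 3]          # stand-in for df.user_id.unique()
all_isbns = ['A', 'B', 'C']    # stand-in for df.isbn.unique()

# pivot_table + reindex builds the user-by-book matrix in one vectorized pass
matrix = (ratings.pivot_table(index='user_id', columns='isbn', values='rating')
                 .reindex(index=all_users, columns=all_isbns)
                 .fillna(0)
                 .values)
print(matrix)
```

Reindexing against the full user and ISBN lists keeps train and test matrices the same shape, which the similarity step below requires.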
In [12]:
# convert to array
train_matrix = train_to_convert.values
test_matrix = test_to_convert.values
print(train_matrix)
print(test_matrix)
[[0 0 0 ... 0 0 0]
 [0 5 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Modeling

Creating Similarities

In [13]:
from sklearn.metrics import pairwise_distances, mean_squared_error
# note: pairwise_distances with metric='cosine' returns cosine *distances* (1 - similarity)
user_similarity = pairwise_distances(train_matrix, metric='cosine')
book_similarity = pairwise_distances(train_matrix.T, metric='cosine')
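To make the cosine metric concrete, here is a small sketch on hypothetical rating vectors: two users who rate in the same proportions get distance 0, while users with no overlap get distance 1:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Toy user rating vectors: users 0 and 1 point the same direction; user 2 is orthogonal
ratings = np.array([[5.0, 0.0],
                    [10.0, 0.0],
                    [0.0, 3.0]])

# cosine distance = 1 - cosine similarity
dist = pairwise_distances(ratings, metric='cosine')
print(np.round(dist, 3))
```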

Prediction

In [14]:
# user prediction
mean_user_rating = train_matrix.mean(axis=1)[:,np.newaxis]
ratings_diff = (train_matrix - mean_user_rating)
user_pred = mean_user_rating + user_similarity.dot(ratings_diff)/np.array([np.abs(user_similarity).sum(axis=1)]).T
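The prediction line above centers each user's ratings around their mean, takes a weighted sum of those deviations across users, and adds the user's mean back. A worked sketch on a tiny hypothetical 2-user, 3-book matrix (uniform weights stand in for the similarity matrix):

```python
import numpy as np

# Tiny 2-user x 3-book matrix (hypothetical ratings)
train_mat = np.array([[4.0, 0.0, 2.0],
                      [0.0, 5.0, 1.0]])
# Uniform weights standing in for the user similarity matrix
sim = np.ones((2, 2))

mean_rating = train_mat.mean(axis=1)[:, np.newaxis]   # each user's mean rating
diff = train_mat - mean_rating                        # mean-centred ratings
# weighted deviations, normalised by total weight, shifted back by the mean
pred = mean_rating + sim.dot(diff) / np.abs(sim).sum(axis=1)[:, np.newaxis]
print(pred)
```

Mean-centering compensates for users who rate systematically high or low before their neighbours' opinions are averaged in.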
In [15]:
# book prediction
book_pred = train_matrix.dot(book_similarity)/np.array([np.abs(book_similarity).sum(axis=1)])

Error test

In [16]:
# keep only the entries that hold actual ratings (note: this overwrites the earlier `test` dataframe)
test = test_matrix[test_matrix.nonzero()].flatten()
user_pred = user_pred[test_matrix.nonzero()].flatten()
book_pred = book_pred[test_matrix.nonzero()].flatten()
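The nonzero() mask restricts the evaluation to cells where the test set holds a real rating, so the RMSE is not diluted by the many zero placeholders. A minimal sketch on hypothetical 2x2 matrices:

```python
import numpy as np

# Hypothetical held-out ratings (0 = no rating) and matching predictions
test_mat = np.array([[0, 3],
                     [4, 0]])
pred_mat = np.array([[1.0, 2.5],
                     [3.5, 0.5]])

mask = test_mat.nonzero()             # indices of actual held-out ratings
actual = test_mat[mask].flatten()     # only the rated cells survive
predicted = pred_mat[mask].flatten()  # predictions at the same cells
print(actual, predicted)
```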
In [17]:
from math import sqrt
print('User rmse: ', sqrt(mean_squared_error(user_pred,test)))
print('Book rmse: ', sqrt(mean_squared_error(book_pred,test)))
User rmse:  7.80791430244133
Book rmse:  7.807178542549171

About the same results for both user-based (UBCF) and item-based (IBCF) collaborative filtering.

With a 5,000-row sample:

  • User rmse: 7.770542207001438
  • Book rmse: 7.7678620620152845

With a 1,000-row sample:

  • User rmse: 7.716320257089829
  • Book rmse: 7.697961798247359

Contact Me

www.linkedin.com/in/billygustave

billygustave.com