Recommendation Engine for Robust Personalization in Finetuning Campaigns
Recommender systems are useful for curating items for users.
Recommender Systems come mostly in two flavors: Content-Based and Collaborative. There are hybrid models which use both.
Content-based filtering works on data generated by a user. This data can be explicit (e.g. likes) or implicit (e.g. click-throughs). The data generated forms a user profile containing interaction metadata.
Collaborative filtering is a social graph approach. At its simplest, it amounts to 'Users who liked X also like Y'. Users are grouped by similarity in taste. Items can be grouped as well.
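To make the 'Users who liked X also like Y' idea concrete, here is a minimal sketch based on co-occurrence counts. The user names, item names, and both function names are hypothetical, and real collaborative systems usually rely on more elaborate similarity measures or matrix factorization.

```python
from collections import Counter, defaultdict
from itertools import permutations


def also_liked(likes_by_user):
    """For each item, count how often every other item is liked by the same user."""
    co_counts = defaultdict(Counter)
    for items in likes_by_user.values():
        for a, b in permutations(set(items), 2):
            co_counts[a][b] += 1
    return co_counts


def recommend(co_counts, item, k=3):
    """Return up to k items most often liked together with `item`."""
    return [other for other, _ in co_counts[item].most_common(k)]


# Hypothetical toy data: user -> set of liked items
likes = {
    "ann": {"X", "Y"},
    "bob": {"X", "Y", "Z"},
    "cat": {"X", "Z"},
    "dan": {"X", "Y"},
}
co = also_liked(likes)
top = recommend(co, "X", k=1)  # item liked most often alongside "X"
```

In this toy data three users like both X and Y while only two like both X and Z, so Y is suggested first for someone who liked X.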
A hybrid approach is often used to balance trade-offs between content-based and collaborative models. It is typically used to address the 'Cold Start' problem.
The cold start problem is essentially that models trained on sparse data don't perform well. When recommendations are poor, it is hard to retain the users whose interactions the recommender system needs as input.
A system ideally has a robust, dense and large corpus. This allows subsetting into candidates.
Candidates are scored and ranked to produce the recommendations shown to the user. The user's feedback is then fed back into the system to improve precision on subsequent queries.
Items given explicit negative input should be removed. Items that are newer are more likely to be recommended. This helps the items become diverse, fresh and fair.
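The filtering and freshness rules above could be sketched roughly as follows. The exponential decay with a 30-day half-life, the `Candidate` fields, and the `rank` function are all assumptions made for illustration; the text does not say how freshness is actually weighted.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    item_id: str
    relevance: float  # model score, higher is better
    age_days: float   # days since the item was added


def rank(candidates, disliked, half_life_days=30.0, k=3):
    """Drop explicitly disliked items, then rank by relevance times a freshness factor."""
    kept = [c for c in candidates if c.item_id not in disliked]

    def score(c):
        # Assumed freshness model: the weight halves every `half_life_days`.
        freshness = 0.5 ** (c.age_days / half_life_days)
        return c.relevance * freshness

    return [c.item_id for c in sorted(kept, key=score, reverse=True)[:k]]


# Hypothetical candidate pool
pool = [
    Candidate("a", relevance=0.9, age_days=60),
    Candidate("b", relevance=0.8, age_days=0),
    Candidate("c", relevance=0.95, age_days=0),  # explicitly disliked below
    Candidate("d", relevance=0.5, age_days=30),
]
ranked = rank(pool, disliked={"c"})
```

Here the highly relevant but old item "a" falls below the newer item "d", and the disliked item "c" never appears.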
Polarization may be measured by the extent to which users' ratings disagree.
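One simple way to quantify that disagreement is the spread of an item's ratings. The sketch below uses the population standard deviation, which is an assumption; the text does not specify a particular measure, and the function name is hypothetical.

```python
from statistics import pstdev


def polarization(ratings):
    """Population standard deviation of one item's ratings (0.0 if fewer than two)."""
    return pstdev(ratings) if len(ratings) > 1 else 0.0


# Hypothetical ratings on a 1-5 scale
agreed = polarization([4, 4, 4, 4])  # everyone agrees
split = polarization([1, 5, 1, 5])   # users strongly disagree
```

An item on which all users agree scores zero, while an item with ratings split between the extremes scores high.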
The following main packages are used in the scraping process:
boto3, BeautifulSoup, requests, fake_useragent
The job was divided into 2 stages:
1/ Getting product URLs
This was done by looping through each category page to extract the JSON objects containing the URLs. The task was split into 4 categories, each handled separately by its own AWS Lambda function.
The URLs are stored as CSV files on AWS S3.
2/ Getting product details (i.e. categories, brands, products names, prices, SKUs, descriptions, sizes, and image URLs)
After the product URLs were obtained, requests were sent to those URLs to retrieve each product's details, also as a JSON object. Because of the maximum runtime limit imposed on AWS Lambda (15 minutes), 23 AWS Lambda functions were called, each scraping approximately 1,500 product URLs.
The extracted information is stored as 23 CSV files on AWS S3, which are then combined and reformatted by other Lambda functions.
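Splitting the URL list so that each Lambda invocation stays within the runtime limit can be sketched as below. The batch size of 1,500 comes from the description above; the function name and the example URLs are hypothetical, and how the batches were actually dispatched is not covered here.

```python
def split_into_batches(urls, batch_size=1500):
    """Split a list of URLs into consecutive batches of at most `batch_size`."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


# Hypothetical URLs; the real ones came from stage 1.
urls = [f"https://example.com/product/{n}" for n in range(4000)]
batches = split_into_batches(urls)
batch_sizes = [len(batch) for batch in batches]
```

Each batch could then be handed to one function invocation, with its results written to a separate CSV file.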
At the time of scraping, the data from the site had the following characteristics:
- Number of Products: 28,992
- Categories with most items: t-shirts
- Categories with fewest items: waistcoats, tuxedos
- Brands with most items: Gucci & Off-White
- Brands with fewest items: Gosha Rubchinskiy & Éditions M.R
- Countries of Origin: 19+
- Max Price: $9,450
- Min Price: $9
D. Feature Engineering
Some original features such as 'categories', 'brands', and 'origins' are already well classified and easy to encode using Multi-Label Binarizer. However, features containing more text, such as 'names' and 'description', are more complicated and need more analysis and engineering.
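As an illustration of the encoding step, the sketch below applies scikit-learn's MultiLabelBinarizer to a few made-up category lists. The labels are hypothetical and this is not necessarily how the original pipeline was wired.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical products, each with a list of category labels
categories = [
    ["jackets", "outerwear"],
    ["t-shirts"],
    ["jackets"],
]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(categories)  # one row per product, one column per label

# mlb.classes_ gives the column order (labels sorted alphabetically)
columns = list(mlb.classes_)
```

Each product becomes a row of 0/1 indicators, so a product with several labels simply has several columns set to 1.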
The following main packages are used in the feature engineering process:
gensim, nltk, sklearn.preprocessing, sklearn.feature_extraction.text, sklearn.decomposition
Name & Description Features
Texts from 'name' and 'description' fields are combined and processed using text preprocessing techniques like lemmatization and stemming. Then, words describing colors are extracted from these texts to create another new feature for the recommendation system later.
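The color extraction step might look something like the following sketch. The color vocabulary here is a small hypothetical set (the actual list of colors is not given), and the lemmatization and stemming steps are omitted for brevity.

```python
import re

# Hypothetical color vocabulary; the real list is not shown in this write-up.
COLORS = {"black", "white", "red", "navy", "beige"}


def extract_colors(name, description):
    """Return the sorted set of known color words found in the combined text."""
    tokens = re.findall(r"[a-z]+", f"{name} {description}".lower())
    return sorted(set(tokens) & COLORS)


found = extract_colors(
    "Black Logo T-Shirt",
    "Short sleeve cotton tee in black with white print.",
)
```

The resulting color lists can then be encoded in the same way as the other multi-label features.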
Next, these texts are vectorized using the TF-IDF (term frequency–inverse document frequency) and TF (term frequency) statistical measures in order to apply two different topic modelling techniques: Non-negative Matrix Factorization and Latent Dirichlet Allocation, each limited to 50 topics.
Non-negative Matrix Factorization (NMF) was chosen as the eventual topic modelling technique for this task, as the topics generated under NMF are generally better generalized and it is easier to guess the products they refer to.
After the feature engineering stage, the nine original features that were deemed useful for the recommendation system produce 1,069 new features that will be used to train the system:
- Sub Categories: 116
- Brands: 397
- Text Topics: 50
- Colors: 35
- Countries of Origin: 19
- Materials: 292
- Remaining Sizes: 157
E. Similar Product Recommendation System
The following main packages are used in the model building process:
In order to build a content-based recommendation engine, two methods are tested: Cosine Similarity and Euclidean Distance. I opted not to use another popular similarity measure, Jaccard Similarity, for now, as it looks for the exact intersection of features while the model currently deals with prices, one of the most important features for the engine. $150 and $151 are pretty much the same and lie close to each other in vector space, but under the Jaccard measure they are treated as entirely different values.
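The price issue with Jaccard can be seen in a small sketch. The feature strings and the `jaccard` helper below are hypothetical; they only illustrate how set-based matching treats nearby prices as unrelated.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


# Two hypothetical products that differ only by one dollar in price
p1 = {"brand:gucci", "category:t-shirts", "price:150"}
p2 = {"brand:gucci", "category:t-shirts", "price:151"}
score = jaccard(p1, p2)
```

Although the two products are nearly identical, the one-dollar difference counts as a full mismatch, which drags the score down to one half.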
Cosine similarity calculates similarity by measuring the cosine of the angle between two vectors, where each vector here contains the engineered features of a product.
Euclidean distance is similar to using a ruler to actually measure the distance between two points. There are potentially unlimited points on a plane at the same distance from an anchor point, while the angles of their vectors can be very different.
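The difference between the two measures can be illustrated with two points that are equally far from an anchor but point in different directions. This is a plain-Python sketch on made-up 2-D vectors, not the project's actual feature vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between points u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


anchor = (1.0, 0.0)
p = (2.0, 0.0)  # same direction as the anchor
q = (1.0, 1.0)  # 45 degrees away from the anchor

dist_p, dist_q = euclidean_distance(anchor, p), euclidean_distance(anchor, q)
cos_p, cos_q = cosine_similarity(anchor, p), cosine_similarity(anchor, q)
```

Both points are exactly one unit away from the anchor, yet only `p` has a cosine similarity of 1; `q` scores about 0.707.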
Most recommendations from both models are pretty consistent with each other. However, since Cosine Similarity looks at the angle between 2 vectors while Euclidean Distance looks at the actual distance between 2 points, I decided to employ a hybrid approach by taking the product of these two distances to yield better recommendations for extreme cases.
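A rough sketch of such a hybrid score follows. It assumes that the cosine "distance" is taken as 1 minus the cosine similarity, which the text does not state explicitly; the function names and the toy catalog are hypothetical. Smaller values mean more similar products.

```python
import math


def cosine_distance(u, v):
    """Assumed definition: 1 minus the cosine similarity of u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hybrid_distance(u, v):
    """Product of cosine distance and Euclidean distance."""
    return cosine_distance(u, v) * euclidean_distance(u, v)


def most_similar(anchor, catalog, k=3):
    """Return the k catalog keys with the smallest hybrid distance to `anchor`."""
    scored = sorted(catalog.items(), key=lambda kv: hybrid_distance(anchor, kv[1]))
    return [name for name, _ in scored[:k]]


# Hypothetical 2-D feature vectors
catalog = {"a": (1.1, 1.0), "b": (1.0, 0.2), "c": (3.0, 2.0)}
ranking = most_similar((1.0, 1.0), catalog)
```

In this toy case, item "b" is closer to the anchor than "c" by Euclidean distance alone, but its very different direction pushes it to the bottom of the hybrid ranking.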