Data Mining Principles TA Session 7 (February 26, 2021)

Agenda

  • Association Rules

  • Recommender Systems

Association Rules

  • Association rule mining, or affinity analysis, is designed to find general association patterns between items in large databases

  • An association rule is a case where the conditional probability of purchasing product A, given that you also purchase product B, is high: much higher than the unconditional probability of purchasing product A

Key Metrics

  • Lift

  • Support

  • Confidence

Lift in AR

  • The lift value is a measure of importance of a rule

  • The lift value of an association rule is the ratio of the confidence of the rule and the expected confidence of the rule

  • The expected confidence of a rule is defined as the product of the support values of the rule body and the rule head, divided by the support of the rule body; this simplifies to the support of the rule head

  • Lift value > 1 - positive effect

  • Lift value < 1 - negative effect (lift is a ratio of non-negative quantities, so it cannot fall below 0)

  • Lift value = 1 - no effect
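The definition above can be written out in symbols. This is a sketch in notation that is not in the original slides: A stands for the rule body, B for the rule head, and support is read as a probability over groups.

```latex
\[
\text{expected confidence}(A \Rightarrow B)
  = \frac{\mathrm{support}(A)\,\mathrm{support}(B)}{\mathrm{support}(A)}
  = \mathrm{support}(B)
\]
\[
\mathrm{lift}(A \Rightarrow B)
  = \frac{\mathrm{confidence}(A \Rightarrow B)}{\mathrm{support}(B)}
  = \frac{P(A \cap B)}{P(A)\,P(B)}
\]
```

For example, with illustrative values support(A) = 0.4, support(B) = 0.5 and 0.3 of all groups containing both A and B, the confidence is 0.3 / 0.4 = 0.75 and the lift is 0.75 / 0.5 = 1.5, a positive effect.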

Support in AR

  • The support of an association rule is the percentage of groups that contain all of the items listed in that association rule

  • The percentage value is calculated from among all the groups that were considered

  • The support of a rule is the percentage equivalent of a/b, where the values are:

  • a: The number of groups containing all the items that appear in the rule

  • b: The total number of all the groups that are considered

  • You can specify that only rules that achieve a certain minimum level of support are included in your mining model; this filters out rules that rest on only a handful of groups
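The a/b definition and the minimum-support filter can be sketched in a few lines of Python. The baskets, the function names `support` and `frequent_pairs`, and the threshold are all hypothetical and chosen only for illustration; the sketch enumerates item pairs by brute force and is not an efficient mining algorithm.

```python
from itertools import combinations

# Hypothetical toy data: each set is one "group" (for example, one shopping basket).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
    {"bread", "butter"},
]


def support(itemset, groups):
    """Return a / b: groups containing every item in itemset, over all groups."""
    itemset = set(itemset)
    a = sum(1 for g in groups if itemset <= g)
    b = len(groups)
    return a / b


def frequent_pairs(groups, min_support):
    """Return the item pairs whose support meets the minimum threshold."""
    items = sorted(set().union(*groups))
    result = {}
    for pair in combinations(items, 2):
        s = support(pair, groups)
        if s >= min_support:
            result[pair] = s
    return result


if __name__ == "__main__":
    print(support({"bread", "milk"}, transactions))
    print(frequent_pairs(transactions, min_support=0.4))
```

Raising `min_support` shrinks the result, which is the trade-off the last bullet describes: fewer rules, each backed by more groups.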

Confidence in AR

  • The confidence of an association rule is a percentage value that shows how frequently the rule head occurs among all the groups containing the rule body

  • The confidence value indicates how reliable this rule is

  • The higher the value, the more likely the head items occur in a group if it is known that all body items are contained in that group

  • Thus, the confidence of a rule is the percentage equivalent of m/n, where the values are:

  • m: The number of groups containing the joined rule head and rule body

  • n: The number of groups containing the rule body
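The m/n definition translates directly into code. The following is a minimal sketch on hypothetical baskets; the helper names `count`, `confidence` and `lift` are illustrative, and `lift` assumes the rule head occurs in at least one group.

```python
# Hypothetical toy baskets; each set is one group.
groups = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
    {"bread", "butter"},
]


def count(items, groups):
    """Number of groups that contain every item in items."""
    items = set(items)
    return sum(1 for g in groups if items <= g)


def confidence(body, head, groups):
    """Return m / n: groups with body and head together, over groups with the body."""
    m = count(set(body) | set(head), groups)
    n = count(body, groups)
    return m / n if n else 0.0


def lift(body, head, groups):
    """Confidence divided by the support of the head (the expected confidence)."""
    head_support = count(head, groups) / len(groups)
    return confidence(body, head, groups) / head_support


if __name__ == "__main__":
    # Rule {butter} => {milk}
    print(confidence({"butter"}, {"milk"}, groups))
    print(lift({"butter"}, {"milk"}, groups))
```

For the rule {butter} => {milk}, three baskets contain butter and two of those also contain milk, so the confidence is 2/3; milk appears in 3 of 5 baskets, so the lift is (2/3) / 0.6, slightly above 1.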

Limitations

  • Association rules can give you some useful insights, but they are limited to comparisons between pairs or small sets of products

  • Mining the rules also remains fairly slow by modern standards

Recommender System

  • The construction of systems that support users in their (online) decision making is the main goal of the field of recommender systems

  • In particular, the goal of recommender systems is to provide easily accessible, high-quality recommendations for a large user community

  • They are everywhere: Amazon, Netflix, Google, etc.

  • Basic idea - if users shared the same interests in the past (if they viewed or bought the same books, for instance), they will also have similar tastes in the future

  • So, if, for example, user A and user B have a purchase history that overlaps strongly and user A has recently bought a book that B has not yet seen, the basic rationale is to propose this book also to B

Key Idea

Types of Algorithms

  • Algorithms that employ usage data are called collaborative filtering

  • Algorithms that use content metadata and user profiles to calculate recommendations are called content based filtering

  • Algorithms that mix the two types are called hybrid recommenders

Collaborative Filtering

  • Collaborative filtering is a family of algorithms: there are multiple ways to find similar users or items, and multiple ways to calculate a rating from the ratings of similar users

  • Two approaches - memory-based approach and model-based approach

  • Memory-based approach - find similar users, using techniques such as cosine similarity and Pearson correlation, and take the weighted average of their ratings

  • Model-based approach - use different ML algorithms

Memory-based Approach

  • User-based CF - a subset of appropriate users is chosen based on their similarity to the active user, and a weighted aggregate of their ratings is used to generate predictions for the active user at run-time

  • Item-based - a memory-based algorithm which explores the relationship between items as a function of how users have rated them
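User-based CF as described above can be sketched as follows. The ratings, the neighbourhood size `k`, and the choice of cosine similarity restricted to co-rated items are assumptions made for illustration; the slides do not prescribe a particular similarity or weighting.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"book_a": 5, "book_b": 3, "book_c": 4},
    "bob": {"book_a": 4, "book_b": 2, "book_c": 5, "book_d": 4},
    "carol": {"book_a": 1, "book_b": 5, "book_d": 2},
}


def cosine_on_common(u, v):
    """Cosine similarity restricted to the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(active, item, ratings, k=2):
    """Similarity-weighted average of the k most similar users' ratings for item."""
    neighbours = [
        (cosine_on_common(ratings[active], r), r[item])
        for user, r in ratings.items()
        if user != active and item in r
    ]
    top = sorted(neighbours, reverse=True)[:k]
    total = sum(sim for sim, _ in top)
    if total == 0:
        return None  # nobody comparable has rated this item
    return sum(sim * rating for sim, rating in top) / total


if __name__ == "__main__":
    print(predict("alice", "book_d", ratings))
```

Here alice's tastes are closer to bob's than to carol's, so the prediction for `book_d` lands nearer bob's rating of 4 than carol's rating of 2. An item-based variant would compute similarities between item columns instead of user rows.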

Memory-based approach

User-Item Matrix

  • A user-item (U-I) matrix encodes the individual preferences of users for the items in a collection: one row per user, one column per item, and each cell holds that user's rating of (or interaction with) that item
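As a small illustration, a user-item matrix can be built from a list of (user, item, rating) observations. The data and the function name `build_user_item_matrix` are hypothetical, and using `None` for unrated cells is just one possible convention.

```python
# Hypothetical (user, item, rating) observations.
observations = [
    ("alice", "book_a", 5),
    ("alice", "book_c", 4),
    ("bob", "book_a", 4),
    ("bob", "book_b", 2),
    ("carol", "book_b", 5),
]


def build_user_item_matrix(observations):
    """Return (users, items, matrix) with None marking unrated cells."""
    users = sorted({u for u, _, _ in observations})
    items = sorted({i for _, i, _ in observations})
    row = {u: r for r, u in enumerate(users)}
    col = {i: c for c, i in enumerate(items)}
    matrix = [[None] * len(items) for _ in users]
    for u, i, rating in observations:
        matrix[row[u]][col[i]] = rating
    return users, items, matrix


users, items, matrix = build_user_item_matrix(observations)

if __name__ == "__main__":
    for user, row_values in zip(users, matrix):
        print(user, row_values)
```

Even in this tiny example 4 of the 9 cells are empty; real matrices are far sparser, which is why missing entries are kept distinct from a rating of zero.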

Content-based Approach

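  • In the content-based approach, items are described by their attributes and users by a profile of preferences; the recommendation task then consists of determining the items that best match the user’s preferences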

Content-based Approach

  • Content analyzer - when information has no structure (e.g. text), some kind of pre-processing step is needed to extract structured relevant information

  • Profile learner - this module collects data representative of the user preferences and tries to generalize this data, in order to construct the user profile

  • Filtering component - this module exploits the user profile to suggest relevant items by matching the profile representation against that of items to be recommended
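The three components can be sketched end to end on a toy catalogue. Everything here is an illustrative assumption: the item descriptions, the bag-of-words representation, the profile built by summing liked items, and cosine similarity as the matching function.

```python
from collections import Counter
from math import sqrt

# Hypothetical catalogue: item -> unstructured text description.
catalogue = {
    "book_a": "space opera with alien empires",
    "book_b": "cooking pasta and italian sauces",
    "book_c": "alien invasion thriller in space",
    "book_d": "italian travel guide",
}


def analyze(text):
    """Content analyzer: turn raw text into a bag-of-words term count."""
    return Counter(text.lower().split())


def learn_profile(liked_items, features):
    """Profile learner: sum the feature vectors of the items the user liked."""
    profile = Counter()
    for item in liked_items:
        profile.update(features[item])
    return profile


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def recommend(profile, features, seen):
    """Filtering component: rank unseen items by similarity to the profile."""
    scores = {i: cosine(profile, f) for i, f in features.items() if i not in seen}
    return sorted(scores, key=scores.get, reverse=True)


features = {item: analyze(text) for item, text in catalogue.items()}
profile = learn_profile(["book_a"], features)
ranking = recommend(profile, features, seen={"book_a"})

if __name__ == "__main__":
    print(ranking)
```

A user who liked `book_a` gets `book_c` ranked first because the two descriptions share the terms "space" and "alien". A more realistic content analyzer would at least remove stop words and weight terms, for example with TF-IDF.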

Similarity metrics

  • Cosine similarity

  • Pearson’s correlation
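The two metrics are closely related: Pearson's correlation is the cosine similarity of the mean-centred vectors. The sketch below uses hypothetical rating vectors over the same four items and assumes neither vector is all zeros or constant, since the denominator would then be zero.

```python
from math import sqrt


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))


def pearson_correlation(x, y):
    """Cosine similarity after subtracting each vector's own mean."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x], [b - mean_y for b in y])


# Hypothetical ratings by two users on the same four items:
# one rates generously, the other strictly, but their preferences agree.
generous = [5, 4, 5, 4]
strict = [2, 1, 2, 1]

if __name__ == "__main__":
    print(cosine_similarity(generous, strict))
    print(pearson_correlation(generous, strict))
```

Because centring removes each user's personal rating offset, Pearson's correlation treats these two users as perfectly aligned, while plain cosine similarity is slightly below 1.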

Model-based CF

  • Matrix factorization

  • Clustering

  • Deep learning

Model-based CF (clustering)

  • In this strategy, similar users are clustered into segments and the similarity between the target user and a user segment is calculated

  • For each segment, an aggregate profile, consisting of the average rating for each item in the segment, is computed, and predictions are made using the aggregate profile rather than individual profiles

  • To make a recommendation for a target user u and target item i, a neighbourhood of user segments that have a rating for i and whose aggregate profile is most similar to u is chosen

  • A prediction for item i is made using the k nearest segments and associated aggregate profiles, rather than the k nearest neighbors
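The prediction step can be sketched as below. To keep it short, the segment assignment is taken as given (the clustering itself, for example k-means, is assumed to have been run already); the ratings, segment names, and the use of cosine similarity over co-rated items are all hypothetical choices.

```python
from math import sqrt

# Hypothetical ratings and a precomputed assignment of users to segments.
ratings = {
    "u1": {"a": 5, "b": 4, "d": 5},
    "u2": {"a": 4, "b": 5, "d": 4},
    "u3": {"a": 1, "c": 5, "d": 2},
    "u4": {"a": 2, "c": 4, "d": 1},
}
segments = {"s1": ["u1", "u2"], "s2": ["u3", "u4"]}


def aggregate_profile(members, ratings):
    """Average rating per item across the users in one segment."""
    totals, counts = {}, {}
    for user in members:
        for item, r in ratings[user].items():
            totals[item] = totals.get(item, 0) + r
            counts[item] = counts.get(item, 0) + 1
    return {item: totals[item] / counts[item] for item in totals}


def cosine_on_common(u, v):
    """Cosine similarity restricted to items present in both profiles."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(target, item, profiles, k=1):
    """Weighted average over the k nearest segments that have a rating for item."""
    candidates = [
        (cosine_on_common(target, p), p[item])
        for p in profiles.values()
        if item in p
    ]
    top = sorted(candidates, reverse=True)[:k]
    total = sum(sim for sim, _ in top)
    return sum(sim * r for sim, r in top) / total if total else None


profiles = {s: aggregate_profile(m, ratings) for s, m in segments.items()}
target = {"a": 5, "b": 4, "c": 1}  # hypothetical target user u

if __name__ == "__main__":
    print(predict(target, "d", profiles, k=1))
    print(predict(target, "d", profiles, k=2))
```

The target user is compared with two segment profiles rather than four individual users, which is the point of the approach: similarity computations scale with the number of segments instead of the number of users.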

Model-based CF (matrix factorization)

  • One model-based approach to collaborative recommendation that has proven very successful recently is matrix factorization, based on singular value decomposition (SVD) and its variants
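A minimal sketch of one such variant follows: a latent-factor model fitted by stochastic gradient descent on the observed ratings only, rather than an exact SVD of the full matrix. The ratings, the number of factors, and the hyperparameters (`epochs`, `lr`, `reg`, `seed`) are illustrative assumptions, not tuned values.

```python
import random

# Hypothetical observed ratings: (user index, item index, rating).
observed = [
    (0, 0, 5), (0, 1, 3), (0, 3, 1),
    (1, 0, 4), (1, 3, 1),
    (2, 0, 1), (2, 1, 1), (2, 3, 5),
    (3, 0, 1), (3, 2, 4), (3, 3, 4),
]
N_USERS, N_ITEMS, N_FACTORS = 4, 4, 2


def predict(P, Q, u, i):
    """Predicted rating: dot product of user factors and item factors."""
    return sum(p * q for p, q in zip(P[u], Q[i]))


def rmse(P, Q, data):
    """Root mean squared error over the observed ratings."""
    squared = [(r - predict(P, Q, u, i)) ** 2 for u, i, r in data]
    return (sum(squared) / len(squared)) ** 0.5


def factorize(data, epochs=500, lr=0.01, reg=0.02, seed=0):
    """Fit user factors P and item factors Q by SGD with L2 regularisation."""
    rng = random.Random(seed)
    P = [[rng.uniform(0, 0.5) for _ in range(N_FACTORS)] for _ in range(N_USERS)]
    Q = [[rng.uniform(0, 0.5) for _ in range(N_FACTORS)] for _ in range(N_ITEMS)]
    start_error = rmse(P, Q, data)
    for _ in range(epochs):
        for u, i, r in data:
            err = r - predict(P, Q, u, i)
            for f in range(N_FACTORS):
                p, q = P[u][f], Q[i][f]
                P[u][f] += lr * (err * q - reg * p)
                Q[i][f] += lr * (err * p - reg * q)
    return P, Q, start_error, rmse(P, Q, data)


P, Q, start_error, final_error = factorize(observed)

if __name__ == "__main__":
    print("RMSE before and after training:", start_error, final_error)
    print("Predicted rating for user 1, item 1:", predict(P, Q, 1, 1))
```

Once fitted, `predict` can fill in cells that were never observed, such as user 1's rating of item 1, which is how the model produces recommendations.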

Challenges of CF

  • Totally new users (cold start)

  • Outliers (grey sheep)

  • Manipulation of reviews

  • Data sparsity

Sources

  • Business Data Science: Combining Machine Learning and Economics to Optimize, Automate, and Accelerate Business Decisions (Taddy)

  • Practical Recommender Systems (Falk)

  • Recommender Systems: An Introduction (Jannach et al.)

  • Recommender Systems Handbook (Ricci et al.)

  • Various Implementations of Collaborative Filtering (Grover)

  • Matrix Factorization Techniques for Recommender Systems (Koren et al.)