Clustering groups similar objects using unsupervised techniques (EMC Education Services, 2015). Association rule mining is also an unsupervised learning method; it aims to find relationships in large datasets that would otherwise be difficult to detect.
The Apriori algorithm is one of the earliest algorithms developed to discover association rules. It relies on two concepts: frequent itemsets (sets of items that often appear together) and support (the percentage of transactions that contain a given itemset) (EMC Education Services, 2015). The algorithm works bottom-up: it first finds the frequent individual items, then combines them into progressively larger candidate itemsets, keeping only those whose support meets a minimum threshold.
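To make the bottom-up idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not the R workflow described later in this post; the function name `apriori`, the toy baskets, and the 0.5 support threshold are all assumptions chosen for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: keep the single items that are frequent on their own.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: a candidate survives only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Toy baskets, made up for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]
result = apriori(transactions, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what keeps the search manageable: an itemset can only be frequent if all of its subsets are, so candidates with an infrequent subset are discarded before their support is ever counted.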
Example
In a prior role as an analyst, I worked on a project with our digital team, which was responsible for our website and all analysis of its user traffic. I was in charge of taking the transactional data that our web servers captured and stored in the databases and performing analytics on it as required by the team. The goal was to identify users and their website use patterns. From the data, I was able to use IP addresses and device characteristics (browser, OS, screen size, etc.).
I applied the Apriori algorithm to the dataset because association rules between IP addresses and device characteristics let me accurately identify a single user and their sessions. IP addresses can change, especially when a user moves to a new location, so the device characteristics were needed to string a user's usage patterns together. Below, I outline the process I used to produce histograms, scatterplots, and other graphs in the visualization software Qlik.
First, I extracted the data and cleaned it as well as I could: I removed extra characters, renamed columns, and ordered them appropriately. Next, I created the rules that would help identify a single user (IP/device characteristic rules). I used R to load the data onto our server and ran the Apriori command after implementing the rules and setting appropriate parameters for the algorithm. Finally, I exported the results to a CSV file and loaded it into Qlik to create multiple graphs.
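The encoding and export steps can be sketched roughly as follows. This is a Python stand-in rather than the R code I actually ran, and every column name, value, and output header here is hypothetical. Each log row becomes a transaction of `column=value` items, the support of every item pair is counted, and the result is written out as CSV text of the kind a tool such as Qlik could load.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Hypothetical log rows; columns and values are invented for illustration.
rows = [
    {"ip": "10.0.0.1", "browser": "Firefox", "os": "Linux", "screen": "1920x1080"},
    {"ip": "10.0.0.1", "browser": "Firefox", "os": "Linux", "screen": "1920x1080"},
    {"ip": "10.0.0.2", "browser": "Firefox", "os": "Linux", "screen": "1920x1080"},
    {"ip": "10.0.0.3", "browser": "Safari", "os": "macOS", "screen": "1440x900"},
]


def to_transaction(row):
    # Encode each field as a "column=value" item so values from
    # different columns can never be confused with one another.
    return frozenset(f"{col}={val}" for col, val in row.items())


transactions = [to_transaction(r) for r in rows]

# Count how many transactions contain each pair of items.
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

n = len(transactions)
pair_support = {pair: count / n for pair, count in pair_counts.items()}

# Write the pairs, highest support first, as CSV text.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["item_a", "item_b", "support"])
for (a, b), s in sorted(pair_support.items(), key=lambda kv: (-kv[1], kv[0])):
    writer.writerow([a, b, f"{s:.2f}"])
csv_text = buffer.getvalue()
print(csv_text)
```

In this toy data the same device fingerprint shows up under two different IP addresses, which is the kind of pattern that suggests one user whose address changed.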
Although I did not use a clustering method on the data at the time, if I were doing the project today I would apply k-modes clustering and group users by their usage patterns, such as length of use or the types of activities performed on the website. The process was challenging and iterative: it took me roughly a full week of research and implementation to get results accurate enough to show our leadership team. The toughest part was setting up the association rules to identify single users.
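For the k-modes idea mentioned above, a minimal sketch might look like the following. It is not something I ran on the original data: the feature tuples (session length, activity type, device), the choice of k = 2, and the simple "first k distinct records" seeding are all assumptions for illustration. K-modes works like k-means but suits categorical data: distance is the number of mismatched attributes, and each cluster center is the per-column mode.

```python
from collections import Counter


def hamming(a, b):
    # Number of attribute positions where two records disagree.
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    # Most common value in each column; ties go to the value seen first.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(records, k, max_iter=20):
    """Cluster tuples of categorical values; returns (labels, modes)."""
    # Seed with the first k distinct records (a simple deterministic choice).
    modes = list(dict.fromkeys(records))[:k]
    labels = None
    for _ in range(max_iter):
        # Assign each record to the nearest mode.
        new_labels = [
            min(range(len(modes)), key=lambda j: hamming(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each mode from the records assigned to it.
        for j in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:
                modes[j] = column_modes(members)
    return labels, modes


# Hypothetical user records: (session length, activity type, device).
records = [
    ("short", "browse", "mobile"),
    ("short", "browse", "desktop"),
    ("long", "purchase", "desktop"),
    ("long", "purchase", "mobile"),
    ("short", "search", "mobile"),
    ("long", "purchase", "desktop"),
]
labels, modes = k_modes(records, k=2)
print(labels)
print(modes)
```

Real implementations usually add smarter seeding and several random restarts, since the result depends on the starting modes.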
Resources
EMC Education Services (Ed.). (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data (1st ed.). Hoboken, NJ: John Wiley & Sons.