Background and Purpose
Building an N-Gram library from wiki datasets and creating a language model based on each n-gram's probability of occurrence. Using a MySQL database for querying.
Accessing the database and displaying real-time auto-completion on localhost via jQuery, PHP, and Ajax.
Using an item-based collaborative filtering algorithm to build a co-occurrence matrix from users' ratings of different movies in the Netflix Prize dataset. Implementing the matrix multiplication with MapReduce jobs to find the recommended movie(s).
The first step is parsing the input text file and passing it to the first MapReduce job. The mapper splits the text into words on the distributed nodes, and the reducer sums the count of each phrase from 2-gram up to (N-1)-gram, as sketched below.
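A minimal sketch of this counting job, assuming Hadoop's `mapreduce` API and a configurable upper bound `noGram` read from the job configuration (the class names, the `noGram` parameter, and the default value are illustrative, not the original implementation):

```java
// Sketch of the n-gram counting job (illustrative names, Hadoop mapreduce API).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NGramLibraryBuilder {

    public static class NGramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private int noGram; // upper bound on the phrase length, read from the job configuration

        @Override
        protected void setup(Context context) {
            Configuration conf = context.getConfiguration();
            noGram = conf.getInt("noGram", 5); // assumed default
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Normalize the wiki text: lower-case, strip non-letters, split on whitespace.
            String line = value.toString().trim().toLowerCase().replaceAll("[^a-z]+", " ");
            String[] words = line.split("\\s+");
            if (words.length < 2) {
                return;
            }
            // Emit every contiguous phrase of length 2..noGram with count 1.
            for (int i = 0; i < words.length - 1; i++) {
                StringBuilder phrase = new StringBuilder(words[i]);
                for (int j = 1; i + j < words.length && j < noGram; j++) {
                    phrase.append(" ").append(words[i + j]);
                    context.write(new Text(phrase.toString()), new IntWritable(1));
                }
            }
        }
    }

    public static class NGramReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the occurrences of each n-gram across all mappers.
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```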
The intermediate key-value pairs from the first MapReduce job are then copied into the local MySQL database, and a second MapReduce job ranks the top K candidate words for each phrase to drive the N-Gram auto-completion.
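A sketch of how that ranking job could look, assuming it reads the first job's `<n-gram><TAB><count>` output, splits each n-gram into a starting phrase plus its following word, and keeps the top K candidates per phrase. The value of K, the class names, and the output layout are assumptions; in the project these rows would back the MySQL table queried by the auto-complete page (e.g. exported via Hadoop's DBOutputFormat).

```java
// Sketch of the ranking job (illustrative names; MySQL export omitted for brevity).
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class LanguageModel {

    // Input line from the first job: "<n-gram>\t<count>".
    public static class LMMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] phraseAndCount = value.toString().trim().split("\t");
            if (phraseAndCount.length < 2) {
                return;
            }
            String[] words = phraseAndCount[0].split("\\s+");
            if (words.length < 2) {
                return;
            }
            // Key: starting phrase (all words but the last); value: "followingWord=count".
            String startingPhrase = String.join(" ", Arrays.copyOf(words, words.length - 1));
            String followingWord = words[words.length - 1];
            context.write(new Text(startingPhrase), new Text(followingWord + "=" + phraseAndCount[1]));
        }
    }

    public static class LMReducer extends Reducer<Text, Text, NullWritable, Text> {
        private static final int TOP_K = 5; // assumed K

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Rank the following words by count (descending) and keep the top K.
            TreeMap<Integer, List<String>> ranked = new TreeMap<>(Collections.reverseOrder());
            for (Text value : values) {
                String[] wordAndCount = value.toString().split("=");
                int count = Integer.parseInt(wordAndCount[1]);
                ranked.computeIfAbsent(count, c -> new ArrayList<>()).add(wordAndCount[0]);
            }
            int emitted = 0;
            for (Map.Entry<Integer, List<String>> entry : ranked.entrySet()) {
                for (String word : entry.getValue()) {
                    if (emitted++ >= TOP_K) {
                        return;
                    }
                    // Each emitted row: starting phrase, candidate word, count.
                    context.write(NullWritable.get(), new Text(key + "\t" + word + "\t" + entry.getKey()));
                }
            }
        }
    }
}
```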
Fig 1. Auto-complete demo
Fig 2. Table in MySQL
PageRank relies on two assumptions:
(1) More important websites are likely to receive more links from others;
(2) Websites with a higher PageRank pass on higher weight through their outgoing links (see the update formula below).
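In formula form (the notation here is mine, not the report's): let M(p_i) be the set of pages linking to page p_i, L(p_j) the number of outgoing links of p_j, N the total number of pages, and \beta the teleport probability. The standard PageRank update implied by the two assumptions is

PR(p_i) = \frac{\beta}{N} + (1 - \beta) \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

or, in matrix form, with M the column-normalized transition matrix and v the rank vector, one iteration computes

v' = (1 - \beta)\, M v + \frac{\beta}{N} \mathbf{1}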
Matrix size: 6012 × 6012;
Vector size: 6012 × 1
Two MapReduce jobs are required per iteration: the first MR job multiplies the transition-matrix cells by the corresponding PageRank cells; the second MR job sums the partial products for each page to obtain its updated rank (a sketch of the first job follows the figure).
Fig 1. Left: rank index before iteration; right: rank index after iteration
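A minimal sketch of the first job, assuming the transition matrix is stored line-by-line as `fromPage<TAB>toPage1,toPage2,...` and the rank vector as `page<TAB>rank`, with the two mappers attached through Hadoop's MultipleInputs; class names and input formats are assumptions. The second job is a plain per-page summation (the same pattern as the n-gram count reducer above), and is where the teleport term would be added when \beta > 0.

```java
// Sketch of the unit-multiplication job: join each transition-matrix cell with the
// current rank of its source page and emit the partial product for the target page.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class UnitMultiplication {

    // Transition matrix line: "fromPage\ttoPage1,toPage2,..."
    public static class TransitionMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fromTo = value.toString().trim().split("\t");
            if (fromTo.length < 2) {
                return; // dead-end page with no outgoing links
            }
            String[] targets = fromTo[1].split(",");
            for (String to : targets) {
                // Each outgoing link carries probability 1 / (#outgoing links).
                context.write(new Text(fromTo[0]), new Text(to + "=" + 1.0 / targets.length));
            }
        }
    }

    // Rank vector line: "page\tcurrentRank"
    public static class PRMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] pageAndRank = value.toString().trim().split("\t");
            context.write(new Text(pageAndRank[0]), new Text(pageAndRank[1]));
        }
    }

    // For each source page, multiply every outgoing transition cell by its rank.
    public static class MultiplicationReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> transitionCells = new ArrayList<>();
            double rank = 0.0;
            for (Text value : values) {
                String cell = value.toString();
                if (cell.contains("=")) {
                    transitionCells.add(cell);
                } else {
                    rank = Double.parseDouble(cell);
                }
            }
            for (String cell : transitionCells) {
                String[] toAndProb = cell.split("=");
                // Partial contribution of this source page to the target page's new rank;
                // the second job sums these contributions per target page.
                context.write(new Text(toAndProb[0]),
                        new DoubleWritable(Double.parseDouble(toAndProb[1]) * rank));
            }
        }
    }
}
```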
Build the co-occurrence matrix based on users' rating histories;
Build a user-specific rating vector from his/her rating history;
Perform the matrix computation to find the recommendation result.
Pre-processing generates the co-occurrence matrix with the 2-Gram method, pairing movies rated by the same user;
The matrix computation reuses the PageRank method, iterating only once and with the teleporting factor set to 0 (see the sketch after the figure).
Fig 1. Demo of the recommender system. Left: user's ratings; right: recommendation ranks
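A sketch of the co-occurrence step, assuming a pre-processing job has already merged each user's history into lines of the form `user<TAB>movie1:rating1,movie2:rating2,...`; the class names and input layout are assumptions. After row-normalizing this matrix, multiplying it by the user's rating vector follows the same unit-multiplication and summation pattern as the PageRank jobs above (one iteration, teleport factor 0), and the summed cells give the recommendation ranks shown in the demo.

```java
// Sketch of the co-occurrence matrix builder for item-based collaborative filtering.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CoOccurrenceMatrixBuilder {

    public static class MatrixMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] userAndHistory = value.toString().trim().split("\t");
            if (userAndHistory.length < 2) {
                return;
            }
            String[] movieRatings = userAndHistory[1].split(",");
            // Every pair of movies rated by the same user co-occurs once
            // (the 2-Gram-style pairing mentioned above).
            for (String a : movieRatings) {
                String movieA = a.split(":")[0];
                for (String b : movieRatings) {
                    String movieB = b.split(":")[0];
                    context.write(new Text(movieA + ":" + movieB), new IntWritable(1));
                }
            }
        }
    }

    public static class MatrixReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the co-occurrence counts of each movie pair across all users.
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```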
[1] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004.
[2] What Is Google PageRank? A Guide for Searchers & Webmasters. http://searchengineland.com/what-is-google-pagerank-a-guide-for-searchers-webmasters-11068