About project

Background and Purpose

N-Gram Autocomplete

Autocomplete for Search Engine

Building an N-Gram library from Wikipedia datasets and creating a language model based on the n-grams' probability of occurrence, with a MySQL database used for querying.
Accessing the database and displaying real-time autocompletion on localhost via jQuery, PHP, and Ajax.

View on Wiki

Google Page Rank

A Study on the Google PageRank Algorithm

Creating the transition matrix from the directed web graph, then iterating the PageRank vector from an even initial distribution until convergence.
Implementing the matrix multiplication with MapReduce jobs to compute the page ranks.

View on Wiki

Recommender System

Recommending movies you may like

Using an item-based collaborative filtering algorithm to build a co-occurrence matrix from users' ratings of different movies in the Netflix Prize dataset. Implementing the matrix multiplication with MapReduce jobs to find the recommended movie(s).

View on Wiki

N-Gram Autocomplete

N-Gram Library

Build N-Gram Library from input text file

The first step is to parse the input text file and pass it to a MapReduce job. The mapper splits the text into phrases on the distributed nodes, and the reducer sums the count of each phrase for 2-grams up to (N-1)-grams.
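The mapper/reducer logic above can be sketched with a local Python simulation (the whitespace tokenization and the sample sentence are illustrative assumptions; the real job runs as a distributed MapReduce job):

```python
from collections import Counter

def mapper(line, max_n=5):
    # Emit (n-gram, 1) pairs for every n from 2 up to max_n,
    # mimicking what each mapper does on its split of the text.
    words = line.lower().split()
    for n in range(2, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n]), 1

def reducer(pairs):
    # Sum the counts for each n-gram, mimicking the reducer.
    counts = Counter()
    for gram, one in pairs:
        counts[gram] += one
    return counts

counts = reducer(mapper("to be or not to be"))
print(counts["to be"])  # -> 2
```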

Language Model

Build Language Model on local Database

Taking the intermediate key-value pairs from the first MapReduce job and copying them into the local database (MySQL). A second MapReduce job then ranks the top K candidate words for N-Gram autocompletion.
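The top-K ranking step might look like the following local sketch (the sample counts and the split into prefix/last-word are illustrative assumptions; in the project this runs as a MapReduce job against the MySQL table):

```python
from collections import defaultdict

# Illustrative intermediate output of the first job: n-gram -> count.
ngram_counts = {
    "to be or": 5,
    "to be the": 3,
    "to be a": 8,
    "not to be": 4,
}

def top_k(counts, k=2):
    # Split each n-gram into (prefix, last word), then keep only the
    # k most frequent follower words per prefix.
    followers = defaultdict(list)
    for phrase, count in counts.items():
        prefix, _, last = phrase.rpartition(" ")
        followers[prefix].append((count, last))
    return {prefix: [word for _, word in sorted(pairs, reverse=True)[:k]]
            for prefix, pairs in followers.items()}

model = top_k(ngram_counts)
print(model["to be"])  # -> ['a', 'or']
```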


Fig 1. Autocomplete demo


Fig 2. Table in MySQL

Google PageRank

Transition Matrix & PageRank Vector

Initializing the Transition Matrix and the PageRank Vector

Two assumptions:
(1) More important websites are likely to receive more links from others;
(2) Websites with higher PageRank pass on higher weight to the pages they link to.
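A minimal sketch of this initialization, assuming a hypothetical three-page graph (the project's real matrix is 6012 × 6012):

```python
import numpy as np

# Hypothetical directed web graph: page -> pages it links to.
links = {0: [1, 2], 1: [2], 2: [0]}
n = 3

# Column-stochastic transition matrix:
# M[j, i] = 1 / outdegree(i) if page i links to page j, else 0.
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

# The PageRank vector starts evenly distributed over all pages.
pr = np.full(n, 1.0 / n)

print(M.sum(axis=0))  # each column sums to 1
```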

Parallel Matrix Multiplication

Implementing matrix multiplication with MapReduce

Matrix size: 6012 × 6012;
Vector size: 6012 × 1.
Two MapReduce jobs are required: the first multiplies each transition-matrix cell by the corresponding PageRank vector cell; the second sums the partial products for each page to obtain its rank.
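The iteration itself can be sketched in-memory with NumPy (a toy 3 × 3 matrix standing in for the 6012 × 6012 one; `M @ pr` plays the role of the first job's cell-wise products plus the second job's per-page sums):

```python
import numpy as np

# Toy column-stochastic transition matrix (pages 0 -> {1, 2}, 1 -> {2}, 2 -> {0}).
M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
pr = np.full(3, 1.0 / 3.0)  # evenly distributed start

# Repeat pr = M @ pr until the vector stops changing (convergence).
for _ in range(100):
    nxt = M @ pr
    if np.abs(nxt - pr).max() < 1e-9:
        break
    pr = nxt

print(pr)  # converges to [0.4, 0.2, 0.4]
```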


Fig 1. Left: rank index before iteration; right: rank index after iteration

Recommender System

Collaborative Filtering (Item)

Generating the relevancy matrix via item-based CF

(1) Build the co-occurrence matrix from users' rating history;
(2) Build a user-specific rating vector from each user's rating history;
(3) Multiply the matrix by the vector to obtain the recommendation result.
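These steps can be sketched in NumPy (toy ratings rather than the Netflix Prize data; the row normalization of the co-occurrence matrix is one common choice and an assumption here):

```python
import numpy as np

# Toy rating matrix: rows = users, columns = movies; 0 means "not rated".
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 4.0],
              [0.0, 2.0, 5.0]])

# Step 1: co-occurrence matrix C[i, j] = number of users who rated both i and j.
rated = (R > 0).astype(float)
C = rated.T @ rated

# Normalize each row so relevance scores are comparable across movies.
C = C / C.sum(axis=1, keepdims=True)

# Step 2: user-specific rating vector (user 0 here).
v = R[0]

# Step 3: matrix-vector multiplication gives the recommendation scores.
scores = C @ v
print(scores)  # the highest-scoring unrated movie is the recommendation
```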

Processing with MapReduce

Implementing data processing & matrix multiplication

Pre-processing generates the co-occurrence matrix using a 2-gram method;
the matrix computation reuses the PageRank method, iterating only once with the teleport factor set to 0.
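That reuse can be sketched as follows (the `pagerank_step` helper and toy matrix are illustrative assumptions; with the teleport probability set to 0, i.e. beta = 1, a single step reduces to the plain matrix-vector product):

```python
import numpy as np

def pagerank_step(M, v, beta=0.85):
    # One PageRank-style update: follow links with probability beta,
    # teleport to a random page with probability 1 - beta.
    n = len(v)
    return beta * (M @ v) + (1.0 - beta) / n

# Toy row-normalized co-occurrence matrix and a user's rating vector.
C = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
v = np.array([5.0, 3.0, 0.0])

# Recommender setting: one iteration, no teleporting (beta = 1).
scores = pagerank_step(C, v, beta=1.0)
print(scores)
```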


Fig 1. Demo of the recommender system. Left: user's ratings; right: recommendation ranks
