Web Page Ranking With Hadoop
Objective
The objective of the Web Page Ranking With Hadoop project is to rank web pages by keyword using Hadoop and MapReduce, improving the accuracy of the search results returned for a user's query.
Project Overview
The number of web pages on the internet is growing rapidly, so a large volume of internet data must be analyzed to gain useful insight and return the best search results. Ranking a web page by keywords requires large-scale data processing. The Hadoop framework is therefore well suited to storing the web pages and ranking them. Web page ranking defines the relevance of a web page to the user's query.
Finding relevant information by following links is difficult: it consumes a lot of time and does not produce accurate results. To make web page searching and retrieval more efficient, the existing system needs to be improved with an efficient keyword-based ranking algorithm. The Hadoop data processing framework is used to store and retrieve web-related data, and a page rank algorithm is used to rank the web pages.
Existing System
In traditional web page ranking, searching is based on the hyperlinks in a web page. This returns search results to the user, but not the results the user expects.
Proposed System
The proposed Web Page Ranking With Hadoop system ranks web pages by keyword strength (the number of keywords) in the web page document. MapReduce is used to rank the web pages through a Mapper and a Reducer. The web page with the highest number of keywords in the document is returned for the user's query. This process makes the search more efficient and less time-consuming.
The proposed Web Page Ranking With Hadoop system focuses on creating the best page ranking algorithm for web pages using Hadoop. The proposed system architecture is shown in the figure.
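The Mapper and Reducer roles described above can be illustrated with a minimal sketch in plain Python. It only simulates the MapReduce flow in memory and does not use the Hadoop API; the function names (`mapper`, `reducer`, `keyword_strength`), the tokenization rule, and the sample pages are assumptions made for illustration, not the project's actual code.

```python
import re
from collections import defaultdict


def mapper(page_id, text, keywords):
    """Emit (page_id, 1) for every occurrence of a query keyword in the text."""
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in keywords:
            yield page_id, 1


def reducer(pairs):
    """Sum the emitted counts for each page."""
    totals = defaultdict(int)
    for page_id, count in pairs:
        totals[page_id] += count
    return dict(totals)


def keyword_strength(pages, query):
    """Run the map step over all pages, then reduce to per-page counts."""
    keywords = set(query.lower().split())
    pairs = []
    for page_id, text in pages.items():
        pairs.extend(mapper(page_id, text, keywords))
    return reducer(pairs)


# Hypothetical sample documents standing in for stored web pages.
pages = {
    "a.txt": "Hadoop stores data. Hadoop processes data with MapReduce.",
    "b.txt": "MapReduce is a programming model.",
    "c.txt": "Nothing relevant here.",
}
strength = keyword_strength(pages, "hadoop mapreduce")
best = max(strength, key=strength.get)  # page with the most keyword hits
```

In a real Hadoop job the framework would group the mapper output by key before the reduce step; here the grouping is folded into `reducer` to keep the sketch short.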
Module 1: Data Preparation
Document data and Hadoop large data processing: Web page data is stored in text format. A large number of text files are stored and processed using the Hadoop framework.
Module 2: MapReduce
MapReduce consists of four tasks used to rank the web pages: loading, parsing, transforming, and filtering.
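The source does not define what each of the four tasks does, so the following is only one plausible reading, sketched in plain Python: loading yields raw documents, parsing splits them into tokens, transforming lower-cases the tokens, and filtering drops stop words and empty documents. The stop-word list, function names, and sample files are all assumptions.

```python
import re

# Assumed sample stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "is", "and", "of"}


def load(raw_files):
    """Loading: yield (file name, raw text) records."""
    for name, text in raw_files.items():
        yield name, text


def parse(records):
    """Parsing: split each document into word tokens."""
    for name, text in records:
        yield name, re.findall(r"[A-Za-z0-9]+", text)


def transform(records):
    """Transforming: normalise tokens to lower case."""
    for name, tokens in records:
        yield name, [t.lower() for t in tokens]


def filter_tokens(records, stop_words=STOP_WORDS):
    """Filtering: drop stop words, and skip documents left with no tokens."""
    for name, tokens in records:
        kept = [t for t in tokens if t not in stop_words]
        if kept:
            yield name, kept


# Hypothetical input files.
raw_files = {"a.txt": "The Hadoop framework", "b.txt": "the a of"}
prepared = dict(filter_tokens(transform(parse(load(raw_files)))))
```

Chaining the stages as generators keeps each task separate, so any one of them can be swapped out without touching the others.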
Module 3: Page Ranking Algorithm
This algorithm focuses on ranking the web pages based on the keyword strength.
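As a minimal sketch of ranking by keyword strength, assuming the per-page keyword counts have already been computed, the pages can be sorted by count in descending order. The function name `rank_pages`, the `top_n` parameter, and the tie-breaking rule (alphabetical by page name) are assumptions for illustration.

```python
def rank_pages(strength, top_n=10):
    """Order pages by keyword count, highest first; ties break on page name."""
    ordered = sorted(strength.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]


# Hypothetical keyword counts per page.
strength = {"b.txt": 1, "a.txt": 3, "d.txt": 3, "c.txt": 0}
top = rank_pages(strength, top_n=3)
```

A deterministic tie-break is worth having so that repeated queries return pages in the same order.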
Module 4: Results Page
The final results are displayed in the user interface, showing the top-ranked web pages for the user's query.
Web Page Ranking With Hadoop Benefits
- Fast and accurate web page results
- Less time consuming
Software Requirements
- Ubuntu OS
- MySQL
- Hadoop & MapReduce
- JDK
Hardware Requirements
- Hard Disk – 1 TB or Above
- RAM required – 8 GB or Above
- Processor – Core i3 or Above
Technology Used
- Big Data – Hadoop