Wiki Page Ranking With Hadoop
Objective
To rank the given wiki pages using Hadoop.
Project Overview
One of the biggest changes of this decade is the availability of efficient and accurate information on the web. Google is the world's largest search engine, and the heart of that search engine is PageRank, the algorithm for ranking web pages developed by Larry Page and Sergey Brin at Stanford University. The PageRank algorithm does not rank a whole website; a rank is determined for each page individually.
Page ranking is not a completely new concept in internet search, but it is becoming more important as sites compete to improve their position in search results. Wikipedia has more than 3 million articles as of now, and it is still growing every day. Every article links to many other articles, and these incoming and outgoing links are the significant factor in page ranking: the key is analyzing which pages are more important than others. PageRank does exactly this, ranking pages by their importance.
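For reference, the standard PageRank formula (not spelled out in this write-up, but the basis of the algorithm) defines the rank of a page p in terms of the pages that link to it. Here d is the damping factor (conventionally 0.85), N is the total number of pages, In(p) is the set of pages linking to p, and L(q) is the number of outgoing links on page q:

```latex
PR(p) = \frac{1 - d}{N} + d \sum_{q \in \mathrm{In}(p)} \frac{PR(q)}{L(q)}
```

Each page shares its current rank evenly among its outgoing links, and a page's new rank is the damped sum of the shares it receives.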
Proposed System
The proposed Wiki Page Ranking With Hadoop system focuses on building an effective page-ranking pipeline for Wikipedia articles using Hadoop. The proposed system architecture is shown in the figure.
All the documents in the database are indexed, and a web crawler is used to gather the information stored in the database. The search engine looks the keyword up in this index, and after finding matching pages it shows the top web pages related to the query.
MapReduce runs as an iterative series of jobs over the page data: the map() function gathers results from the input, and the reduce() function aggregates them. The wiki page ranking project using Hadoop involves three important Hadoop steps, each described and sketched below:
- Parsing
- Calculating
- Ordering
Parsing
In the parsing step, a Hadoop job parses the wiki XML dump into articles. In the map phase, the name of each article and its outgoing links are emitted. In the reduce phase, the links to other pages are collected, and each page is stored with an initial rank and its outgoing links.
- Map(): article name, outgoing links
- Reduce(): links to other pages
- Store: page, initial rank, outgoing links
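A minimal sketch of how this parsing job might look in Java MapReduce. The input format (one article per line as title<TAB>wikitext), the simplified [[link]] pattern, the initial rank of 1.0, and all class names are illustrative assumptions, not the project's actual code:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Maps one article (title \t raw wikitext) to (title, outgoing link) pairs.
public class WikiParseMapper extends Mapper<Object, Text, Text, Text> {
    // Matches [[Target]] and [[Target|label]] wiki links (simplified pattern).
    private static final Pattern LINK = Pattern.compile("\\[\\[([^\\]|#]+)");

    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);
        if (parts.length < 2) return;
        Text title = new Text(parts[0]);
        Matcher m = LINK.matcher(parts[1]);
        while (m.find()) {
            ctx.write(title, new Text(m.group(1).trim()));
        }
    }
}

// Collects each page's outgoing links and stores it with an initial rank of 1.0.
class WikiParseReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text title, Iterable<Text> links, Context ctx)
            throws IOException, InterruptedException {
        Set<String> unique = new HashSet<>();
        for (Text link : links) unique.add(link.toString());
        // Output record: page \t initial rank \t comma-separated outgoing links
        ctx.write(title, new Text("1.0\t" + String.join(",", unique)));
    }
}
```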
Calculating
In the calculating step, a Hadoop job calculates a new rank for each page. In the map phase, each page distributes its current rank over its outgoing links, passing the link structure along with it. In the reduce phase, the rank shares arriving at each page are summed to produce its new rank, and the page is stored with the new rank and its outgoing links. This job is repeated iteratively until the ranks settle.
- Map(): rank share per outgoing link
- Reduce(): new page rank
- Store: page, new rank, outgoing links
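A sketch of the rank-calculation job, assuming the record format produced by the parsing sketch above (page<TAB>rank<TAB>comma-separated links). It uses the simplified damping form PR(p) = (1 - d) + d·Σ shares rather than dividing (1 - d) by N, with d = 0.85 by convention; both choices are assumptions here:

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Input record per page: page \t rank \t link1,link2,...
public class RankMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] f = value.toString().split("\t", 3);
        if (f.length < 2) return;
        String page = f[0];
        double rank = Double.parseDouble(f[1]);
        String links = (f.length == 3) ? f[2] : "";
        // Pass the link structure through so the reducer can re-emit it.
        ctx.write(new Text(page), new Text("LINKS\t" + links));
        if (!links.isEmpty()) {
            // Distribute this page's rank evenly over its outgoing links.
            String[] targets = links.split(",");
            double share = rank / targets.length;
            for (String target : targets) {
                ctx.write(new Text(target), new Text("RANK\t" + share));
            }
        }
    }
}

// Sums the incoming rank shares for each page and applies the damping factor.
class RankReducer extends Reducer<Text, Text, Text, Text> {
    private static final double DAMPING = 0.85; // conventional value; an assumption here

    @Override
    protected void reduce(Text page, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        double sum = 0.0;
        String links = "";
        for (Text v : values) {
            String[] f = v.toString().split("\t", 2);
            if (f[0].equals("LINKS")) {
                links = (f.length == 2) ? f[1] : "";
            } else {
                sum += Double.parseDouble(f[1]);
            }
        }
        double newRank = (1.0 - DAMPING) + DAMPING * sum;
        ctx.write(page, new Text(newRank + "\t" + links));
    }
}
```

Chaining this job so that each iteration's output directory becomes the next iteration's input implements the "repeat iteratively" step; a fixed iteration count or a convergence threshold are both common stopping rules.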
Ordering
Here the job maps each page keyed by its computed rank, so the shuffle phase sorts the pages, and the rank and page are stored in order. The top n pages can then be shown in the web search.
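A sketch of the ordering job, again assuming the page<TAB>rank<TAB>links record format from the previous step; negating the rank is a small trick to get descending order out of Hadoop's ascending key sort:

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (-rank, page) so Hadoop's ascending key sort yields descending rank order.
public class OrderMapper extends Mapper<Object, Text, DoubleWritable, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] f = value.toString().split("\t");
        if (f.length < 2) return;
        ctx.write(new DoubleWritable(-Double.parseDouble(f[1])), new Text(f[0]));
    }
}
```

Running this with a single reduce task and the default identity Reducer yields one globally sorted output file, whose first n lines are the top-n pages.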
Wiki Page Ranking With Hadoop Benefits
- Fast and accurate web page results
- Less time-consuming than single-machine ranking
Software Requirements
- Linux OS
- MySQL
- Hadoop & MapReduce
Hardware Requirements
- Hard Disk – 1 TB or Above
- RAM required – 8 GB or Above
- Processor – Core i3 or Above
Technology Used
- Big Data – Hadoop