Hadoop Tutorial

Tuesday, December 27, 2011

Multi-Node Cluster Setup

Administration Lab 6: Multi-node Cluster Setup

Create Additional VM Image

Repeat steps outlined in Lab 1 to create a new Virtual Machine and install JDK and Hadoop

Configure Bridged Adapter

Power off both Virtual Machines (Machine => ACPI Shutdown)

Highlight each VM and click on Settings => Network

In the “Attached To” field, select “Bridged Adapter”

In the “Name” field, select the correct network card adapter (wired/wireless) that you use on your primary Operating System to connect to the Internet

Uncheck all devices except “Hard Disk”

Find out Master & Slave IPs

Run ifconfig on master and slave

Write down IPs for both hosts

Execute Installation from Lab1

Repeat the Lab 1 installation on the slave node, skipping the following steps:

sudo apt-get install hadoop-0.20-namenode
sudo apt-get install hadoop-0.20-jobtracker

The namenode and jobtracker packages are not needed on the slave node

Configuration For Both Nodes

/etc/hadoop-0.20/conf/mapred-site.xml

Provisioning IPs

/etc/hosts

Add entries for the master and slave nodes to ensure proper network communication

Reformat namenode and delete data from data directory

sudo su hdfs

Execute hadoop-0.20 namenode -format

Delete data from the data directories on both machines

/var/lib/hadoop-0.20/cache/hdfs/dfs/data

Start Distributed Cluster

Start master node normally

Start datanode and tasktracker on slave node only

sudo /etc/init.d/hadoop-0.20-tasktracker start

sudo /etc/init.d/hadoop-0.20-datanode start

Verification

Check the web interface to verify that the second node is connected

Run the sleep example to make sure that MapReduce is working properly

Format Conversion

Programming Lab 3: Format Conversion

This lab demonstrates how to use different Map-Reduce formats for input and output
It also shows how to convert from one format to another

Problem to Solve

We will work with one data set, OrgChart, and represent it in several different formats

Create Project and Link with Libraries

Copy provided libraries and java code from USB drives
Create project in NetBeans or Eclipse
Link with libraries
Create new class and copy provided code
Modify input and output directory
Run code and examine result
For detailed instructions on creating a project, refer to Lab 1.

Task1. Parse existing Key Value Text Input

We need to parse the existing Key Value Text input and output it as Key Value Text.
The data representation is not going to change, but the data will be sorted by its key
Set proper input format

Set proper output format

Don’t forget to change input path to the correct path on your workstation
Expected output

Task2. Generate Sequence File From Key Value Text Input

We need to parse the existing Key Value Text input and output it as a Sequence File
The data representation is not going to change, but the data items will be sorted by key and stored in compressed form
Set proper input format

Set proper output format

Enable compression

Expected output will not be readable since it is compressed

Task3. Generate Map File From Key Value Text Input

We need to parse the existing Key Value Text input and output it as a Map File
The data representation is not going to change, but the data will be sorted by its key and stored in compressed form. The Map File also creates an index for faster access to the data.
Set proper input format

Set proper output format

Compression remains enabled

Expected output will not be readable since it is compressed; the map file output will also include an index file
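
To make the role of the index more concrete, here is a minimal plain-Java sketch of the idea: a key-sorted list of records plus a sparse index that remembers the position of every Nth key. It does not use Hadoop's MapFile classes; the class name SparseIndexSketch, the interval of 2, and the sample records are assumptions made only for illustration, and the syntax is kept to Java 1.6.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of a sorted data file plus a sparse index; no Hadoop classes are used.
class SparseIndexSketch {
    // Sorted {key, value} records, standing in for the data file.
    private final List<String[]> data = new ArrayList<String[]>();
    // Indexed key -> record position, standing in for the index file.
    private final TreeMap<String, Integer> index = new TreeMap<String, Integer>();

    // Records must arrive in sorted key order; every interval-th key is indexed.
    SparseIndexSketch(List<String[]> sortedRecords, int interval) {
        for (int i = 0; i < sortedRecords.size(); i++) {
            data.add(sortedRecords.get(i));
            if (i % interval == 0) {
                index.put(sortedRecords.get(i)[0], i);
            }
        }
    }

    // Jump to the closest indexed key at or before the target, then scan forward.
    String get(String key) {
        Map.Entry<String, Integer> start = index.floorEntry(key);
        if (start == null) {
            return null; // the target sorts before the first record
        }
        for (int i = start.getValue(); i < data.size(); i++) {
            int cmp = data.get(i)[0].compareTo(key);
            if (cmp == 0) {
                return data.get(i)[1];
            }
            if (cmp > 0) {
                break; // passed the spot where the key would have been
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical sample records, already sorted by key.
        List<String[]> records = new ArrayList<String[]>();
        records.add(new String[] { "Architect", "Adam" });
        records.add(new String[] { "CEO", "Nancy" });
        records.add(new String[] { "Engineer", "Mike" });
        SparseIndexSketch file = new SparseIndexSketch(records, 2);
        System.out.println(file.get("CEO"));
    }
}
```

Because only some keys are indexed, the index stays small, and a lookup costs one index search plus a short forward scan. This only works when the data is written in sorted key order.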

Task4. Generate Key Value Text Format From Map File

We need to parse the existing Map File and output it as Key Value Text
The data representation is not going to change, but the data will be sorted by its key and has to match the input data
Keep in mind that if you keep compression enabled, the data will be compressed
Set proper input format

Set proper output format

We can choose to keep compression enabled

If compression stays enabled, the expected output will not be readable since it is compressed

Summary

This lab has demonstrated how to work with different data formats
You should be comfortable using any of them

Map Side Join

Programming Lab 4: Map Side Join

This lab demonstrates the Map-Side join technique, which is used to quickly and efficiently join data from multiple directories
In many cases this is the fastest way to process data.

Problem to Solve

We will work with two sets of data, OrgChart and Salary, and we will need to join them together using the Map-Side join technique

Create Project and Link with Libraries

Copy provided libraries and java code from USB drives
Create project in NetBeans or Eclipse
Link with libraries
Create new class and copy provided code
Modify input and output directory to SalaryMap and OrgChartMap folders
For detailed instructions on creating a project, refer to Lab 1.

Task1. Create Map Files Using Previous Lab

Using the Lab 3 code, point the input directory to lab4/input/Salary and the output directory to lab4/input/SalaryMap
Run the code and it will generate a map file (a sequence file with an index)
Repeat the same procedure with OrgChart

Task2. Point MapSide Directories to Newly Generated Map Files

Change directory to point to newly generated map files
Run code and see the outcome

Please note that, unlike a regular MR job, a map-side join guarantees the order of the data.

Walk Through

Map-Side join takes an array of the paths to join

Set the specific Map-Side join input format

Walk Through Mapper

Mapper provides access to values via TupleWritable value

value.has(0) checks if you have left value

value.get(0) gives you the left value

value.has(1) checks if you have right value

value.get(1) gives you the right value
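
The has/get pattern above can be tried out without a cluster. The following is a minimal plain-Java sketch, not the lab's code: the nested Tuple class is a simplified stand-in for TupleWritable, an outer join is assumed (so either side may be missing), and the sample records are hypothetical. A real map-side join streams through the sorted files; the sketch simply walks the union of keys in sorted order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java sketch of an outer map-side join; no Hadoop classes are used.
class MapSideJoinSketch {

    // Stand-in for TupleWritable: slot i is null when source i has no value.
    static class Tuple {
        private final String[] values;

        Tuple(String left, String right) {
            values = new String[] { left, right };
        }

        boolean has(int i) {
            return values[i] != null;
        }

        String get(int i) {
            return values[i];
        }
    }

    // Walks both key-sorted inputs in key order and "maps" every joined tuple.
    static List<String> join(SortedMap<String, String> left, SortedMap<String, String> right) {
        TreeSet<String> keys = new TreeSet<String>(left.keySet());
        keys.addAll(right.keySet());
        List<String> out = new ArrayList<String>();
        for (String key : keys) {
            Tuple value = new Tuple(left.get(key), right.get(key));
            // Mapper body: guard each side with has() before calling get().
            String l = value.has(0) ? value.get(0) : "?";
            String r = value.has(1) ? value.get(1) : "?";
            out.add(key + "\t" + l + "\t" + r);
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample records, not taken from the lab data.
        SortedMap<String, String> orgChart = new TreeMap<String, String>();
        orgChart.put("CEO", "Nancy");
        orgChart.put("Engineer", "Mike");
        SortedMap<String, String> salary = new TreeMap<String, String>();
        salary.put("Engineer", "100");
        for (String line : join(orgChart, salary)) {
            System.out.println(line);
        }
    }
}
```

The output comes out in key order because both inputs are sorted, which is why the map files had to be generated first.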

Task3. Adding Vacation Time

Generate Map File for vacation time
Extend Map-Reduce job to add another folder in join
Modify Mapper code to add vacation time into output value
Hint: to add two Text objects:

Monday, December 26, 2011

Reduce Side Join

Programming Lab 5: Reduce Side Join

This lab demonstrates reduce-side join with different reducers

Problem to Solve

We will work with two sets of data OrgChart and Salary, and we will need to join them together using Reduce Side join technique
However, in some cases the position description consists of two words, and we will compare only the first one

Create Project and Link with Libraries

Copy provided libraries and java code from USB drives
Create project in NetBeans or Eclipse
Link with libraries
Create a new class for each provided file, using the same name as the file in the directory, or copy the java files into your source directory
For detailed instructions on creating a project, please refer to Programming Lab 1

Walk Through

Please make sure to adjust the input path for your local machine

We must assign different mappers to different input folders
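
As a rough illustration of why each input folder needs its own mapper, here is a minimal plain-Java sketch of a reduce-side join. It is not the lab's code and uses no Hadoop classes: the two map methods, the ORG: and SAL: tags, and the sample records are all assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of a reduce-side join; no Hadoop classes are used.
class ReduceSideJoinSketch {

    // One "mapper" per input folder: each tags its values with their source.
    static String[] mapOrgChart(String position, String name) {
        return new String[] { position, "ORG:" + name };
    }

    static String[] mapSalary(String position, String salary) {
        return new String[] { position, "SAL:" + salary };
    }

    // "Shuffle": group tagged values by key. "Reduce": pair each name with the salary.
    static List<String> join(List<String[]> mapped) {
        Map<String, List<String>> groups = new TreeMap<String, List<String>>();
        for (String[] kv : mapped) {
            if (!groups.containsKey(kv[0])) {
                groups.put(kv[0], new ArrayList<String>());
            }
            groups.get(kv[0]).add(kv[1]);
        }
        List<String> out = new ArrayList<String>();
        for (Map.Entry<String, List<String>> group : groups.entrySet()) {
            String salary = "?"; // placeholder when no salary record exists
            for (String v : group.getValue()) {
                if (v.startsWith("SAL:")) {
                    salary = v.substring(4);
                }
            }
            for (String v : group.getValue()) {
                if (v.startsWith("ORG:")) {
                    out.add(group.getKey() + "\t" + v.substring(4) + "\t" + salary);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample records, not taken from the lab data.
        List<String[]> mapped = new ArrayList<String[]>();
        mapped.add(mapOrgChart("Engineer", "Mike"));
        mapped.add(mapSalary("Engineer", "100"));
        mapped.add(mapOrgChart("CEO", "Nancy"));
        for (String line : join(mapped)) {
            System.out.println(line);
        }
    }
}
```

The tag is what lets the reducer tell a name from a salary once values from both folders arrive under the same key.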

Engineer salary should be applicable to Engineer and Engineer 2
Position will be represented by Position class and we will compare and match only the first word
The position class will implement a comparator method
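
The first-word comparison can be sketched in plain Java as follows. This is only an illustration under assumptions: the class is named PositionSketch to keep it distinct from the lab's Position class, it implements Comparable rather than a Hadoop interface, and it assumes words are separated by a single space.

```java
// Plain-Java sketch of matching positions on their first word; no Hadoop classes are used.
class PositionSketch implements Comparable<PositionSketch> {
    private final String description;

    PositionSketch(String description) {
        this.description = description.trim();
    }

    // "Engineer 2" -> "Engineer"; a one-word description is returned unchanged.
    String firstWord() {
        int space = description.indexOf(' ');
        return space < 0 ? description : description.substring(0, space);
    }

    // Only the first word takes part in the comparison.
    @Override
    public int compareTo(PositionSketch other) {
        return firstWord().compareTo(other.firstWord());
    }

    // equals and hashCode follow the same rule so that matching keys group together.
    @Override
    public boolean equals(Object o) {
        return o instanceof PositionSketch && compareTo((PositionSketch) o) == 0;
    }

    @Override
    public int hashCode() {
        return firstWord().hashCode();
    }

    public static void main(String[] args) {
        PositionSketch a = new PositionSketch("Engineer");
        PositionSketch b = new PositionSketch("Engineer 2");
        System.out.println(a.compareTo(b) == 0);
    }
}
```

With this rule, "Engineer" and "Engineer 2" compare as equal, so a salary keyed by "Engineer" can be matched to both.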

Expected Result

The expected result is the following:

Input Format

Programming Lab 2: Input Format

We will learn how to join different sets of data on the same key using the regular Map-Reduce approach.
This lab will also show the limitations of the regular Map-Reduce approach and what developers need to do to overcome them

Problem to Solve

We have two sets of data: position information and salary information
We need to join them together!!!

Produce:

Create Project and Link with libraries

Copy provided libraries and java code from USB drives
Create project in NetBeans or Eclipse
Link with libraries
Create new class and copy provided code
Modify input and output directory
Run code and examine result
For detailed instructions on creating a project, refer to Lab 1.
Our key is the position in the company (CEO, Engineer, etc.)
Our data is separated by “=”, so we will use KeyValueTextInputFormat.
We also need to set a custom separator since it differs from the default tab
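
To show what the separator does to each line, here is a minimal plain-Java sketch of the split, written without Hadoop classes. It assumes the line is cut at the first occurrence of the separator and that a line without a separator becomes a key with an empty value, which is how KeyValueTextInputFormat is understood to behave; the class name and the sample record are hypothetical. In the old Hadoop 0.20 API the separator is believed to be configured through the key.value.separator.in.input.line property, so verify that name against your version.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Plain-Java sketch of how a key/value line is split; no Hadoop classes are used.
class KeyValueSplitSketch {

    // Splits on the first occurrence of the separator only.
    static Map.Entry<String, String> split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // No separator: the whole line becomes the key, the value is empty.
            return new SimpleEntry<String, String>(line, "");
        }
        return new SimpleEntry<String, String>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        // "Engineer=Mike" is a hypothetical record, not taken from the lab data.
        Map.Entry<String, String> kv = split("Engineer=Mike", '=');
        System.out.println(kv.getKey() + " -> " + kv.getValue());
    }
}
```

If the separator were left at the default tab, a line such as the one above would contain no separator at all, and the whole line would end up as the key.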

Expected output

Please note that order is not guaranteed. We need to do special processing to guarantee order.

Task

Additional input folder contains data for vacation time
Additional input folder can be specified by calling FileInputFormat.addInputPath
We need to add that data for every record

Word Count

Programming Lab 1: Word Count

We will learn how to set up a development environment for Hadoop projects
Run the “Word Count” application
Create a new application to count letters in text documents

Prerequisites

Java 1.6
Hadoop and log4j libraries
NetBeans or Eclipse

Create Project and Link with Libraries

Copy provided libraries and java code from USB drives
Create project in NetBeans or Eclipse (specific instructions on the next page)
Link with libraries
Create new class and copy provided code
Modify input and output directory
Run code and examine result

Create Project with NetBeans

Click on File -> New Project
Select Java Application type
Set main class to WordCount
Name your project HadoopLab1WordCount
Follow instructions on the screen
See the next page for a screenshot

Open a file from the provided material in ProgLabs/lab1/original
Copy everything except the package name
Insert code into newly created class right after the package name

Let’s link with appropriate libraries

Right click on project name and select properties
Pick Libraries option and click on the Compile tab
Click on Add JAR/Folder button
Add everything from ProgLabs/lib folder
NetBeans will re-evaluate dependencies and you should not see any errors at this point
Adjust input/output values in run method to ProgLabs/lab1/<<input|output>> accordingly

Right-click and run!

Create Project with Eclipse

Create New Project: Click on File -> New Project
Name project HadoopLab1WordCount
Select Java 1.6
Click Next and add libraries from ProgLabs/lib

Create a new class – WordCount
Right-click on the project -> select New -> Class
Copy code from Labs/ProgLabs/lab1/original into the new class
Right click and run it!

Let’s examine the output directory

You will see two files: one indicating the status of the Map Reduce job and a second one containing the output of the job

Result

Adam 1

Brandon 1

Graig 1

Kim 1

Marty 1

Mike 1

Nancy 1

Nick 1

Nishani 1

Steve 2

Tracy 2

Vidur 1

Exercise: Let’s Count Letters!

Modify word count application to count letters in the document
Create another class that implements Reducer and switch application to use it in run method
Hint:

String ch = String.valueOf(line.charAt(i));
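
The hint can be tried outside Hadoop first. Below is a minimal plain-Java sketch of the counting logic, with the emit step and the summing step collapsed into one method. It is not the lab's solution: the class name is hypothetical, and skipping non-letter characters is an assumption about what "letters" should mean here. The syntax is kept to Java 1.6 to match the prerequisites.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of letter counting; no Hadoop classes are used.
class LetterCountSketch {

    // "Map" step: visit every letter of each line. "Reduce" step: sum per letter.
    static Map<String, Integer> countLetters(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<String, Integer>(); // keeps keys sorted
        for (String line : lines) {
            for (int i = 0; i < line.length(); i++) {
                if (!Character.isLetter(line.charAt(i))) {
                    continue; // assumption: skip spaces, digits and punctuation
                }
                String ch = String.valueOf(line.charAt(i)); // the hint from the lab
                Integer seen = counts.get(ch);
                counts.put(ch, seen == null ? 1 : seen + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countLetters(Arrays.asList("Adam", "Kim")));
    }
}
```

Upper- and lower-case letters are counted separately in this sketch; lower-casing each line first would merge them.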

Wednesday, December 21, 2011

MapReduce Filesystem Recovery

Administration Lab 5: MapReduce Filesystem Recovery

Restore the Last State of VM

Open Virtual Box application

Start the last VM state

Stop Hadoop cluster

for x in /etc/init.d/hadoop-* ; do sudo $x stop; done

Procedure to Recover Namenode

Change user to hdfs

sudo su hdfs

Let’s try to import the image

hadoop-0.20 namenode -importCheckpoint

This will fail!!! (Why???)

Let’s move the data to the home directory

mkdir ~/name

mv /var/lib/hadoop-0.20/cache/hadoop/dfs/name/* ~/name

ls /var/lib/hadoop-0.20/cache/hadoop/dfs/name

Let’s try the import again

hadoop-0.20 namenode -importCheckpoint

This will work!

Ctrl-C to terminate node and exit from shell

Start the cluster

for x in /etc/init.d/hadoop-* ; do sudo $x start; done

Checking Filesystem

hadoop-0.20 fsck <file name> -files -blocks -racks
