Administration Lab 4: MapReduce Performance Tuning
Restore the Last State of VM
- Open the VirtualBox application
- Start the VM from its last saved state
- Bounce the Hadoop cluster:
for x in /etc/init.d/hadoop-* ; do sudo $x stop; done
for x in /etc/init.d/hadoop-* ; do sudo $x start; done
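To confirm that all daemons came back up, you can list the running Java processes with jps (it ships with the JDK):
sudo jps
# On this VM you should see daemons such as NameNode, DataNode, JobTracker, and TaskTracker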
- 2 map slots, 2 reduce slots
- 1 CPU, 2 cores
- To check in Windows: All Programs => Accessories => System Tools => System Information
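Inside the Linux VM you can verify the core count directly:
grep -c processor /proc/cpuinfo
# Prints the number of cores visible to the guest (2 on this VM)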
Execute Waiting Job
hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u0-*examples.jar sleep -m 4 -r 4 -mt 10000 -rt 10000
- -m number of mappers
- -r number of reducers
- -mt milliseconds to sleep at map step
- -rt milliseconds to sleep at reduce step
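While the job runs you can also follow it from the command line instead of the web UI, using the standard 0.20 job subcommands:
hadoop-0.20 job -list
# Lists running jobs with their job ids
hadoop-0.20 job -status <job-id>
# Shows map/reduce completion percentages for one job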
Map Scheduling
- Only two mappers will be initialized at the same time
- Go to port 50030, click on the running job, and open the map step
Reduce Scheduling
- Only two reducers will be initialized at the same time
- Go to port 50030, click on the running job, and open the reduce step
Overall Job Summary
- Take a note of your job running time
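One simple way to capture the wall-clock time is to wrap the command with the shell's time built-in:
time hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u0-*examples.jar sleep -m 4 -r 4 -mt 10000 -rt 10000
# The 'real' figure is the end-to-end running time; compare it with the time shown on the JobTracker page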
How can we improve that?
Reduce number of mappers and reducers
hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u0-*examples.jar sleep -m 2 -r 2 -mt 20000 -rt 20000
Increase number of mappers/reducers
- Go to /etc/hadoop-0.20/conf (use Tab for auto-completion)
- Open mapred-site.xml with sudo permissions
- Increase the number of mappers and reducers to 4
Config File Example
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
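After editing, it is worth checking that the file is still well-formed XML. Assuming xmllint is installed on the VM, a quick check is:
xmllint --noout /etc/hadoop-0.20/conf/mapred-site.xml
# Silence means the file parsed cleanly; parse errors are printed otherwise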
FAQ: Where are the Defaults?
- Defaults are located inside hadoop-core.jar
- Locate hadoop-core.jar
- Default: /usr/lib/hadoop-0.20/hadoop-core.jar
- Copy the jar to your home directory: cp /usr/lib/hadoop-0.20/hadoop-core.jar ~/
- Check content: jar tf hadoop-core.jar | grep default
- Extract content: jar xfv hadoop-core.jar
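Once extracted, you can read the slot defaults straight out of mapred-default.xml (bundled inside the jar in 0.20):
grep -A 1 tasks.maximum mapred-default.xml
# Shows mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum, each defaulting to 2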
Default files are a good source of information
<property>
  <name>hadoop.job.history.location</name>
  <value></value>
  <description>If job tracker is static the history files are stored in this single well known place. If no value is set here, by default, it is in the local file system at ${hadoop.log.dir}/history.</description>
</property>
<property>
  <name>hadoop.job.history.user.location</name>
  <value></value>
  <description>User can specify a location to store the history files of a particular job. If nothing is specified, the logs are stored in the output directory. The files are stored in "_logs/history/" in the directory. User can stop logging by giving the value "none".</description>
</property>
Bounce cluster and wait for safemode exit
for x in /etc/init.d/hadoop-0.20-*; do sudo $x stop; done;
- Shortcut: history | grep stop
- !<number of the command>
- Run the same loop with start instead of stop
hadoop-0.20 dfsadmin -safemode wait
Note: you might see an error while stopping the cluster. This is related to a currently open bug that should be fixed in the next release of Hadoop. It is safe to ignore, and you can proceed.
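If you only want to check the current state without blocking, dfsadmin also accepts a get action:
hadoop-0.20 dfsadmin -safemode get
# Prints 'Safe mode is ON' or 'Safe mode is OFF'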
New Capacity
- You should now see 4 map slots and 4 reduce slots
Let’s execute the same job
hadoop-0.20 jar
/usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u0-*examples.jar sleep -m 4 -r 4 -mt 10000 -rt 10000
All four tasks now initialize at the same time
[Screenshots: the Mappers and Reducers task lists on port 50030]
Total Summary

Case   | Requested Maps | Requested Reduces | Available Map Slots | Available Reduce Slots | Average Map Time | Average Reduce Time | Total Time
Case 1 | 4              | 4                 | 2                   | 2                      | 33 sec           | 33 sec              | 100 sec
Case 2 | 2              | 2                 | 2                   | 2                      | 26 sec           | 34 sec              | 69 sec
Case 3 | 4              | 4                 | 4                   | 4                      | 123 sec          | 47 sec              | 172 sec
Case 4 | 1              | 1                 | 4                   | 4                      | 42 sec           | 47 sec              | 94 sec
FAQ: How to kill a job?
- Retrieve the job id:
hadoop-0.20 job -list
- Kill the job:
hadoop-0.20 job -kill <job-id>
- Note: this is a hard kill; some additional cleanup might be required
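A minimal end-to-end example (the job id below is made up; substitute the one printed by -list):
hadoop-0.20 job -list
# Suppose it lists job_201104011000_0002 as the running job
hadoop-0.20 job -kill job_201104011000_0002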