Administration Lab 4: MapReduce Performance Tuning
Restore the Last State of VM
- Open the VirtualBox application
- Start the VM from its last saved state
- Bounce the Hadoop cluster:
for x in /etc/init.d/hadoop-* ; do sudo $x stop; done
for x in /etc/init.d/hadoop-* ; do sudo $x start; done
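To confirm that all daemons came back up, you can list the running Java processes with jps (it ships with the JDK):
sudo jps
# On this VM you should see daemons such as NameNode, DataNode, JobTracker, and TaskTracker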
- 2 map slots, 2 reduce slots
- 1 CPU, 2 cores
- To check in Windows: All Programs => Accessories => System Tools => System Information
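Inside the Linux VM you can verify the core count directly:
grep -c processor /proc/cpuinfo
# Prints the number of cores visible to the guest (2 on this VM)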
Execute Waiting Job
hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u0-*examples.jar sleep -m 4 -r 4 -mt 10000 -rt 10000
- -m number of mappers
- -r number of reducers
- -mt milliseconds to sleep at map step
- -rt milliseconds to sleep at reduce step
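While the job runs you can also follow it from the command line instead of the web UI, using the standard 0.20 job subcommands:
hadoop-0.20 job -list
# Lists running jobs with their job ids
hadoop-0.20 job -status <job-id>
# Shows map/reduce completion percentages for one job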
Map Scheduling
- Only two mappers will be initialized at the same time
- Go to port 50030, click on the running job, and open the map step
Reduce Scheduling
- Only two reducers will be initialized at the same time
- Go to port 50030, click on the running job, and open the reduce step
Overall Job Summary
- Take a note of your job running time
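One simple way to capture the wall-clock time is to wrap the command with the shell's time built-in:
time hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u0-*examples.jar sleep -m 4 -r 4 -mt 10000 -rt 10000
# The 'real' figure is the end-to-end running time; compare it with the time shown on the JobTracker page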
How can we improve that?
Reduce number of mappers and reducers
hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u0-*examples.jar sleep -m 2 -r 2 -mt 20000 -rt 20000
Increase number of mappers/reducers
- Go to /etc/hadoop-0.20/conf (use Tab for auto-completion)
- Open mapred-site.xml with sudo permissions
- Increase the number of mappers and reducers to 4
Config File Example
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
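After editing, it is worth checking that the file is still well-formed XML. Assuming xmllint is installed on the VM, a quick check is:
xmllint --noout /etc/hadoop-0.20/conf/mapred-site.xml
# Silence means the file parsed cleanly; parse errors are printed otherwise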
FAQ: Where are the Defaults?
- Defaults are located inside hadoop-core.jar
- Locate hadoop-core.jar
- Default: /usr/lib/hadoop-0.20/hadoop-core.jar
- Copy the jar to your home directory: cp /usr/lib/hadoop-0.20/hadoop-core.jar ~/
- Check content: jar tf hadoop-core.jar | grep default
- Extract content: jar xfv hadoop-core.jar
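Once extracted, you can read the slot defaults straight out of mapred-default.xml (bundled inside the jar in 0.20):
grep -A 1 tasks.maximum mapred-default.xml
# Shows mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum, each defaulting to 2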
Default files are a good source of information
<property>
  <name>hadoop.job.history.location</name>
  <value></value>
  <description>If job tracker is static the history files are stored in this single well known place. If no value is set here, by default, it is in the local file system at ${hadoop.log.dir}/history.</description>
</property>
<property>
  <name>hadoop.job.history.user.location</name>
  <value></value>
  <description>User can specify a location to store the history files of a particular job. If nothing is specified, the logs are stored in the output directory. The files are stored in "_logs/history/" in the directory. User can stop logging by giving the value "none".</description>
</property>
Bounce cluster and wait for safemode exit
for x in /etc/init.d/hadoop-0.20-*; do sudo $x stop; done;
- Shortcut: history | grep stop
- !<number of the command>
- Run the same loop with start instead of stop
hadoop-0.20 dfsadmin -safemode wait
Note: you might see an error while stopping the cluster. This is related to a currently open bug that should be fixed in the next release of Hadoop. It is safe to ignore, and you can proceed.
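If you only want to check the current state without blocking, dfsadmin also accepts a get action:
hadoop-0.20 dfsadmin -safemode get
# Prints 'Safe mode is ON' or 'Safe mode is OFF'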
New Capacity
- You should now see 4 map slots and 4 reduce slots
Let’s execute the same job
hadoop-0.20 jar
/usr/lib/hadoop-0.20/hadoop-0.20.2-cdh3u0-*examples.jar sleep -m 4 -r 4 -mt 10000 -rt 10000
All four tasks now initialize at the same time
[Screenshots: the Mappers and Reducers task lists on port 50030]
Total Summary

Case   | Requested Maps | Requested Reduces | Available Map Slots | Available Reduce Slots | Average Map Time | Average Reduce Time | Total Time
Case 1 | 4              | 4                 | 2                   | 2                      | 33 sec           | 33 sec              | 100 sec
Case 2 | 2              | 2                 | 2                   | 2                      | 26 sec           | 34 sec              | 69 sec
Case 3 | 4              | 4                 | 4                   | 4                      | 123 sec          | 47 sec              | 172 sec
Case 4 | 1              | 1                 | 4                   | 4                      | 42 sec           | 47 sec              | 94 sec
FAQ: How to kill a job?
- Retrieve the job id:
hadoop-0.20 job -list
- Kill the job:
hadoop-0.20 job -kill <job-id>
- Note: this is a hard kill; some additional cleanup might be required
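A minimal end-to-end example (the job id below is made up; substitute the one printed by -list):
hadoop-0.20 job -list
# Suppose it lists job_201104011000_0002 as the running job
hadoop-0.20 job -kill job_201104011000_0002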