Analyzing Performance with the Memory Analyzer Tool (MAT)

1. We took heap and thread dumps and ran MAT reports during peak times, when CPU and memory usage were very high.

Some interesting observations on the main memory-leak suspects and candidates for optimization:

a) The Jetty server’s QueuedThreadPool holding a large number of threads (20).

b) org.apache.hadoop.ipc.Client$Connection holding a huge number of strong references.

c) A huge instance of org.apache.hadoop.hdfs.util.DirectBufferPool that kept growing.

d) 50 instances of “org.apache.hadoop.conf.Configuration”.

e) One very large instance of “org.apache.hadoop.hdfs.server.blockmanagement.BlockManager”.

f) 500k instances of “org.apache.hadoop.hdfs.server.namenode.INodeFile”.

Improving Performance:

Code Enhancement Tips:

  • Avoid NumberFormat and String.split wherever possible.
  • Use Combiners wherever applicable (see the driver sketch after this list).
  • Use VIntWritable or VLongWritable wherever applicable.
  • Use IntWritable in place of Text wherever it makes sense.
  • Use SequenceFile for intermediate stages.
  • Wrap data structures with weak or soft references where we want objects reclaimed at the end of processing, or in the next GC cycle (see the soft-reference sketch after this list).
  • Force a GC at the end of a particularly time-consuming or resource-consuming job.
  • While fetching streaming or web-service data (e.g. in crawlers), consider using AsyncHttpClient, increasing the Content-Length and connection-timeout settings, and receiving the data compressed and then uncompressing it locally.
    (many more …)
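
To make the Combiner, VIntWritable and SequenceFile tips concrete, here is a minimal word-count-style driver sketch; WordMapper and SumReducer are hypothetical placeholder classes, not the actual job’s code, and the old-style Job constructor is only one way to set the job up.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.VIntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class IntermediateStageDriver {

  // Placeholder mapper: emits (token, 1) using StringTokenizer rather than String.split.
  public static class WordMapper extends Mapper<LongWritable, Text, Text, VIntWritable> {
    private final VIntWritable one = new VIntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, one);
      }
    }
  }

  // Used both as combiner (map-side pre-aggregation) and as reducer: sums counts per key.
  public static class SumReducer extends Reducer<Text, VIntWritable, Text, VIntWritable> {
    @Override
    protected void reduce(Text key, Iterable<VIntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (VIntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new VIntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "intermediate-stage");
    job.setJarByClass(IntermediateStageDriver.class);

    job.setMapperClass(WordMapper.class);
    job.setCombinerClass(SumReducer.class);   // shrinks the data shuffled to reducers
    job.setReducerClass(SumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(VIntWritable.class); // variable-length ints stay small on disk

    // SequenceFile output keeps this intermediate stage compact and splittable.
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}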
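
And a small sketch of the soft-reference tip, assuming a lookup table we are happy to rebuild if the GC reclaims it under memory pressure; the loader method here is hypothetical.

import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class SoftLookupCache {
  // The GC may clear this reference under memory pressure instead of throwing OOME.
  private SoftReference<Map<String, String>> cacheRef =
      new SoftReference<Map<String, String>>(null);

  public synchronized Map<String, String> get() {
    Map<String, String> cache = cacheRef.get();
    if (cache == null) {
      // Rebuild only when the previous copy has been reclaimed.
      cache = loadLookupTable();
      cacheRef = new SoftReference<Map<String, String>>(cache);
    }
    return cache;
  }

  // Hypothetical loader; in a real job this might read a side file from HDFS.
  private Map<String, String> loadLookupTable() {
    return new HashMap<String, String>();
  }
}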

Configuration Tips:

  • Enable heap dump on OOME so that we can find the root cause and leak suspects directly with MAT.

Add the following settings in mainHadoopFlow.sh:

-XX:HeapDumpPath=/tmp/myhadoopjob.hprof -XX:+HeapDumpOnOutOfMemoryError
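
The same flags can also be passed to the child task JVMs so that task-level OOMEs leave .hprof files behind on the task nodes. A hedged sketch: mapred.child.java.opts is the classic MR1 property, the -Xmx value and dump path are placeholders, and Hadoop substitutes @taskid@ with the actual task id.

import org.apache.hadoop.conf.Configuration;

public class ChildHeapDumpOpts {
  public static void apply(Configuration conf) {
    // Heap-dump flags for every child task JVM launched by this job.
    conf.set("mapred.child.java.opts",
        "-Xmx512m -XX:+HeapDumpOnOutOfMemoryError"
            + " -XX:HeapDumpPath=/tmp/@taskid@.hprof");
  }
}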

  • Does the current server program (the one running the Hadoop job, if it is managed by Jetty, say) support non-blocking IO, and is the server fully multi-threaded?
  • Ideally, file sizes should not be smaller than the block size. Use an archive format (such as a Hadoop Archive) for storing multiple small files instead of creating them directly in HDFS.
  • Construct a set of zip files once per reduce task, instead of writing large numbers of small files from the reduce task (see the reducer sketch after this list).
  • Refer to http://www.cloudera.com/blog/2009/02/the-small-files-problem/ and http://stackoverflow.com/questions/11368907/hadoop-block-size-and-file-size-issue
  • The NameNode handler count should be at least (number of DataNodes) * 20.
  • If the NameNode and JobTracker are on big hardware, set dfs.namenode.handler.count to 64, and do the same with mapred.job.tracker.handler.count. If you have more than 64 GB of RAM in this machine, you can double it again.
  • If there is more RAM available than is consumed by task instances, set io.sort.factor to 25 or 32 (up from 10). io.sort.mb should be 10 * io.sort.factor. Don’t forget to multiply io.sort.mb by the number of concurrent tasks to determine how much RAM you are actually allocating here, to prevent swapping. (So 10 task instances with io.sort.mb = 320 means you are actually allocating 3.2 GB of RAM for sorting, up from 1.0 GB.) An open ticket in the Hadoop bug tracker suggests making 100 the default value here. (See the job-conf sketch after this list.)
  • io.file.buffer.size – this is one of the more “magic” parameters. You can set it to 65536 and leave it there. Currently it is set too low.
  • mapred.child.ulimit should be 2–3x higher than the heap size specified in mapred.child.java.opts, and left there, to prevent runaway child-task memory consumption.
  • Setting tasktracker.http.threads higher than 40 will deprive individual tasks of RAM, and you won’t see a positive impact on shuffle performance until your cluster approaches 100 nodes or more.
  • Ours is currently 80, hence the 80 live threads we see; we can reduce it to 40.
  • Almost every Hadoop job that generates a non-negligible amount of map output will benefit from intermediate data compression with LZO. Although LZO adds a little CPU overhead, the reduced disk IO during the shuffle will usually save time overall.
  • Whenever a job needs to output a significant amount of data, LZO compression can also increase performance on the output side. Since writes are replicated 3x by default, each GB of output data you save cuts 3 GB of disk writes.
  • To enable LZO compression, check out the recent guest blog from Twitter. Be sure to set mapred.compress.map.output to true (see the compression sketch after this list).
  • Look for latency spikes and/or rising threadsBlocked counts; if they are rising, we are probably just hitting the handler limits on the NameNode and need to have them bumped up.
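
The zip-per-reduce-task idea above, as a hedged reducer sketch: instead of emitting thousands of tiny HDFS files, each reduce task streams its output as entries into a single zip. The output path and the key-to-entry mapping are illustrative assumptions.

import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ZipPackingReducer extends Reducer<Text, Text, NullWritable, NullWritable> {
  private ZipOutputStream zip;

  @Override
  protected void setup(Context context) throws IOException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    // One zip per reduce task, named after the task id (the path is an assumption).
    Path out = new Path("/output/packed/" + context.getTaskAttemptID().getTaskID() + ".zip");
    zip = new ZipOutputStream(fs.create(out));
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // Each key becomes a single zip entry holding all of its values.
    zip.putNextEntry(new ZipEntry(key.toString()));
    for (Text value : values) {
      zip.write(value.toString().getBytes("UTF-8"));
      zip.write('\n');
    }
    zip.closeEntry();
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    zip.close();
  }
}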
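
For the io.sort.* and io.file.buffer.size tips, a hedged job-conf sketch using the values suggested above; whether they fit depends on how many concurrent tasks share each node’s RAM.

import org.apache.hadoop.conf.Configuration;

public class SortTuning {
  public static void apply(Configuration conf) {
    // Remember io.sort.mb is allocated per task, so multiply by concurrent tasks.
    conf.setInt("io.sort.factor", 32);         // up from the default of 10
    conf.setInt("io.sort.mb", 320);            // roughly 10 * io.sort.factor
    conf.setInt("io.file.buffer.size", 65536); // the "magic" buffer-size tip above
  }
}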
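
And a hedged sketch for turning on LZO map-output compression; the LzoCodec class name assumes the separately installed hadoop-lzo package from the Twitter/Cloudera post, and the property names are the classic MR1 ones.

import org.apache.hadoop.conf.Configuration;

public class MapOutputCompression {
  public static void apply(Configuration conf) {
    // Compress intermediate map output to cut disk IO during the shuffle.
    conf.setBoolean("mapred.compress.map.output", true);
    conf.set("mapred.map.output.compression.codec",
        "com.hadoop.compression.lzo.LzoCodec");
  }
}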

•  File descriptor limits: a busy Hadoop daemon might need to open a lot of files. The open-fd ulimit in Linux defaults to 1024, which might be too low. You can set it to something more generous, say 16384.
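
A small hedged sketch for checking the limit the JVM actually sees; the com.sun.management bean is HotSpot/OpenJDK-specific and Unix-only.

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdLimitCheck {
  public static void main(String[] args) {
    OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
    if (os instanceof UnixOperatingSystemMXBean) {
      UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
      // Prints the open and maximum file-descriptor counts visible to this process.
      System.out.println("open fds: " + unix.getOpenFileDescriptorCount());
      System.out.println("max fds:  " + unix.getMaxFileDescriptorCount());
    }
  }
}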

•  To run jobs in parallel from the start, you can configure either the Fair Scheduler or the Capacity Scheduler, depending on your requirements.

The mapreduce.jobtracker.taskscheduler property and the scheduler-specific parameters have to be set in mapred-site.xml for this to take effect.

The Fair Scheduler allocates task slots evenly across all submitted jobs, so that short jobs complete quickly even when they are submitted after long-running jobs. Jobs can also run in parallel under FIFO scheduling if the cluster has spare capacity; with fair scheduling, jobs always run in parallel.

The best way to manage parallelism is to configure Hadoop to use multiple task slots per machine (see the sketch below for the relevant property names).
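
A hedged sketch of the property names involved. These are cluster-side settings that normally live in mapred-site.xml on the JobTracker/TaskTrackers; the Configuration calls below only document the names, the class names are the MR1 ones, and the slot counts are placeholders.

import org.apache.hadoop.conf.Configuration;

public class SchedulerAndSlots {
  public static void apply(Configuration conf) {
    // Fair Scheduler; the Capacity Scheduler would use
    // org.apache.hadoop.mapred.CapacityTaskScheduler instead.
    conf.set("mapreduce.jobtracker.taskscheduler",
        "org.apache.hadoop.mapred.FairScheduler");

    // Multiple task slots per TaskTracker machine (example values only).
    conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);
    conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);
  }
}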

•  Generate a Cloudera-recommended client configuration: https://ccp.cloudera.com/display/FREE373/Generating+Client+Configuration

•  Sometimes it helps to take multiple thread dumps, per the tip below.

Use “Poor Man’s Profiling” to see what your tasks are doing (http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/).

•  More suggestions: http://www.slideshare.net/cloudera/mr-perf

Reference: http://blog.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/

NameNode internal flow: http://blogs.wandisco.com/2012/12/03/flow-of-a-client-rpc-request-through-the-hadoop-namenode/

7 tips for improving MapReduce performance: http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
