How to improve HBase read/write performance?

[For queries like: select month(eventdate), eventname, count(1), sum(timespent) from eventlog group by month(eventdate), eventname]

The goal is to fetch 50 million records in 10 seconds.

(1) Make sure regions are evenly distributed across the cluster (four nodes minimum).

(2) The number of unique result combinations (here, month and event name pairs) should be small enough to fit into memory.

(3) Implement a coprocessor to run the map-reduce style query on the server side. This blog post explains, with code, how HTableInterface#coprocessorExec works: https://blogs.apache.org/hbase/entry/coprocessor_introduction

Perform the computation on the server side and send only the computed result to the client. This avoids shipping every row over the network and saves a lot of transfer time.

A coprocessor can also be used to build a ‘database trigger’ like feature, where the aggregate is recomputed and reflected in real time as soon as the data is updated.
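To make the idea concrete, here is a minimal sketch in plain Java that only simulates the pattern: each "region" folds its own rows into a small map of count and sum per group, and the client merges those small maps. It does not use the HBase coprocessor API; the class name RegionAggregationSketch, the Event type and the "month|eventname" key format are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of server-side partial aggregation (no HBase classes involved).
class RegionAggregationSketch {

    // One row of the eventlog table, reduced to the columns the query needs.
    static final class Event {
        final int month;
        final String eventName;
        final long timeSpent;

        Event(int month, String eventName, long timeSpent) {
            this.month = month;
            this.eventName = eventName;
            this.timeSpent = timeSpent;
        }
    }

    // Work that would run next to the data: fold rows into {count, sum} per group.
    static Map<String, long[]> aggregateRegion(List<Event> rows) {
        Map<String, long[]> partial = new HashMap<>();
        for (Event e : rows) {
            long[] acc = partial.computeIfAbsent(e.month + "|" + e.eventName, k -> new long[2]);
            acc[0] += 1;            // count(1)
            acc[1] += e.timeSpent;  // sum(timespent)
        }
        return partial;
    }

    // Work that runs on the client: merge the small per-region maps.
    static Map<String, long[]> mergePartials(List<Map<String, long[]>> partials) {
        Map<String, long[]> total = new HashMap<>();
        for (Map<String, long[]> partial : partials) {
            for (Map.Entry<String, long[]> entry : partial.entrySet()) {
                long[] acc = total.computeIfAbsent(entry.getKey(), k -> new long[2]);
                acc[0] += entry.getValue()[0];
                acc[1] += entry.getValue()[1];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Event> region1 = Arrays.asList(new Event(1, "login", 30), new Event(1, "login", 20));
        List<Event> region2 = Arrays.asList(new Event(1, "login", 10), new Event(2, "search", 5));

        List<Map<String, long[]>> partials = new ArrayList<>();
        partials.add(aggregateRegion(region1));
        partials.add(aggregateRegion(region2));

        for (Map.Entry<String, long[]> entry : mergePartials(partials).entrySet()) {
            System.out.println(entry.getKey() + " count=" + entry.getValue()[0]
                    + " sum=" + entry.getValue()[1]);
        }
    }
}
```

Only one small map per region crosses the network, which is why point (2) matters: the merged map has to fit in client memory.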

(4) Otherwise, if you perform the map-reduce query on the client side, chunk the data on the client, run the chunks in parallel, and finally merge the results on the client. Nested for loops such as for(..results..) { for(..key..) } can be replaced by TableMapReduceUtil. See http://hbase.apache.org/book/mapreduce.example.html
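The chunk, run in parallel, then merge approach can be sketched with a plain thread pool. This is an illustration only, under the assumption that the data is already in a local array; in a real client each chunk would more likely be a separate Scan over its own row-key range. The names ParallelChunkSketch, split and parallelSum are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of chunk -> parallel work -> merge on the client.
class ParallelChunkSketch {

    // Splits [0, size) into at most `chunks` contiguous [start, end) ranges.
    static List<int[]> split(int size, int chunks) {
        List<int[]> ranges = new ArrayList<>();
        int step = (size + chunks - 1) / chunks; // ceiling division
        for (int start = 0; start < size; start += step) {
            ranges.add(new int[] {start, Math.min(start + step, size)});
        }
        return ranges;
    }

    // Sums each chunk on a worker thread, then merges the partial sums on the caller.
    static long parallelSum(final long[] timeSpent, int chunks) {
        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (final int[] range : split(timeSpent.length, chunks)) {
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (int i = range[0]; i < range[1]; i++) {
                        sum += timeSpent[i];
                    }
                    return sum;
                };
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> future : futures) {
                total += future.get(); // merge step
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("chunk failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] timeSpent = new long[1000];
        for (int i = 0; i < timeSpent.length; i++) {
            timeSpent[i] = i;
        }
        System.out.println(parallelSum(timeSpent, 4)); // 0 + 1 + ... + 999 = 499500
    }
}
```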

(5) Set the batch size and caching on the Scan objects (http://hbase.apache.org/book.html#perf.reading)

“11.9.1. Scan Caching

If HBase is used as an input source for a MapReduce job, for example, make sure that the input Scan instance to the MapReduce job has setCaching set to something greater than the default (which is 1). Using the default value means that the map-task will make call back to the region-server for every record processed. Setting this value to 500, for example, will transfer 500 rows at a time to the client to be processed. There is a cost/benefit to have the cache value be large because it costs more in memory for both client and RegionServer, so bigger isn’t always better.”
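A rough back-of-the-envelope calculation shows why the caching value matters for the 50 million row goal. The sketch below assumes one round trip returns up to `caching` rows and ignores region boundaries and result-size limits, so treat the numbers as an approximation; ScanCachingMath and roundTrips are hypothetical names, not HBase API.

```java
// Hypothetical helper: estimates scanner round trips for a given caching value.
class ScanCachingMath {

    // One round trip fetches up to `caching` rows, so we need ceil(rows / caching) trips.
    static long roundTrips(long rows, int caching) {
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 50_000_000L;
        int[] cachingValues = {1, 500, 5000};
        for (int caching : cachingValues) {
            System.out.println("caching=" + caching + " -> " + roundTrips(rows, caching) + " round trips");
        }
    }
}
```

Under these assumptions, a caching value of 1 needs 50,000,000 round trips, 500 needs 100,000, and 5000 needs 10,000. The gain flattens quickly while memory use on both client and RegionServer keeps growing, which matches the "bigger isn't always better" warning in the quote.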

(6) Do not disable the block cache

“2.5.3.2. Disabling Blockcache

Do not turn off block cache (You’d do it by setting [hfile.block.cache.size] to zero). Currently we do not do well if you do this because the regionserver will spend all its time loading hfile indices over and over again. If your working set [is] such that block cache does you no good, at least size the block cache such that hfile indices will stay up in the cache (you can get a rough idea on the size you need by surveying regionserver UIs; you’ll see index block size accounted near the top of the webpage).”

http://hbase.apache.org/book/important_configurations.html

(7) Use filter-based queries.

Sample code for creating a facade (HBaseFacade):

HBaseFacade# Map<String, Map<String, Map<String, byte[]>>> readRows(String tableName, String keyPrefix, long ts, String columnFamily, String qualifier) throws HBaseException {…}

http://grepcode.com/file/repo1.maven.org/maven2/org.infinispan/infinispan-cachestore-hbase/5.2.0.ALPHA1/org/infinispan/loaders/hbase/HBaseFacade.java#HBaseFacade
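The keyPrefix parameter in the signature above suggests a row-key prefix read. HBase keeps rows sorted by key, so all rows sharing a prefix sit next to each other and can be read as one contiguous range (HBase also ships a PrefixFilter for this). The sketch below models that with a TreeMap standing in for the table; it is not the HBaseFacade implementation, the name PrefixScanSketch and the sample keys are made up, and String ordering is used in place of HBase's byte ordering.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model of a prefix read over a table sorted by row key.
class PrefixScanSketch {

    // Starts at the first key >= prefix and stops at the first key without the prefix.
    static List<String> keysWithPrefix(NavigableMap<String, String> table, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String key : table.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) {
                break; // keys are sorted, so the first miss ends the range
            }
            matches.add(key);
        }
        return matches;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        table.put("admin|2013-01", "a");
        table.put("user1|2013-01", "b");
        table.put("user1|2013-02", "c");
        table.put("user2|2013-01", "d");
        System.out.println(keysWithPrefix(table, "user1|")); // [user1|2013-01, user1|2013-02]
    }
}
```

The point of the design is that the read touches only the matching range instead of testing every row in the table.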

(8) Experiment with integrating Scala / Akka with HBase for faster parallel queries and data persistence with low GC overhead.

For example: https://github.com/GravityLabs/HPaste (a Scala API for reading data from and inserting data into HBase)

(9) Consider using the Async HBase client (a fully asynchronous, non-blocking, thread-safe, high-performance API: https://github.com/stumbleupon/asynchbase/blob/master/src/HBaseClient.java )

(10) Consider building a flexible analytics query model on top of HBase (https://github.com/dlyubimov/HBase-Lattice)

Ref:
http://hbase.apache.org/book/important_configurations.html
Changing batch and cache size of scan correctly: http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/29098
http://stackoverflow.com/questions/8932885/hbase-multithreaded-scan-is-really-slow
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setBatch%28int%29
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setCaching%28int%29
