Query / Index Optimization Tricks :

  • consider TTL for short-lived documents : curl -XPUT 'localhost:9200/crunchbase/person/1?ttl=1d'
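    • ** note : TTL only takes effect if _ttl is enabled in the type mapping first, e.g. curl -XPUT 'localhost:9200/crunchbase/_mapping/person' -d '{ "person" : { "_ttl" : { "enabled" : true } } }'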
  • if a document from a different index needs to be referenced inside another index, refer to it by an MD5 hash
  • for strong consistency , check that all shards are allocated after creating an index (using the cluster health API)
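    • ** e.g. wait for all shards of the new index to be allocated (the timeout value is just an illustration) : curl -XGET 'localhost:9200/_cluster/health/crunchbase?wait_for_status=green&timeout=30s'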
  • check if a document exists using HEAD : curl -XHEAD -i 'localhost:9200/crunchbase/person/1'
  • after heavy indexing of an initial data load , consider an explicit merge call : curl -XPOST 'localhost:9200/crunchbase/_optimize?max_num_segments=2'
  • remember how text analysis works in ES : the query is analyzed before the search runs ; stop words are dropped, all tokens are lowercased, words are reduced to their stems (jumped to jump), and synonyms are considered.
  • use analyzers to adapt search behavior to user behavior (domain terminology)
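    • ** a minimal sketch of a domain-specific analyzer defined at index creation (analyzer/filter names and the synonym list are hypothetical) : curl -XPUT 'localhost:9200/crunchbase' -d '{ "settings" : { "analysis" : { "filter" : { "domain_synonyms" : { "type" : "synonym", "synonyms" : [ "cellphone, mobile" ] } }, "analyzer" : { "domain_analyzer" : { "type" : "custom", "tokenizer" : "standard", "filter" : [ "lowercase", "stop", "domain_synonyms", "porter_stem" ] } } } } }'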
  • set dynamic to strict to prevent unmapped data from sneaking in
  • Significant Terms Aggregation takes lots of memory :
    • take the data out and run it through an N-Gram Bloom Filter (a probabilistic structure that tells with high confidence whether an item has been seen before)
    • then perform significant term matching
  • use a direct COUNT on a well-defined QUERY to avoid the internal priority queue used for sorting documents
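    • ** e.g. (query body is illustrative) curl -XGET 'localhost:9200/crunchbase/_search?search_type=count' -d '{ "query" : { "match" : { "title" : "world war" } } }'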
  • curl -XGET 'localhost:9200/_search?search_type=scan&scroll=10m&size=50' -d '{ "query" : { "match_all" : {} } }'
    • ** scan is highly optimized by Lucene
    • ** when finished scanning we should explicitly close the scroll
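    • ** the exact clear-scroll syntax varies by ES version, roughly : curl -XDELETE 'localhost:9200/_search/scroll' -d '<scroll_id>'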
  • note ES offers powerful query boosting features
    { "multi_match" : { "fields" : [ "title^5", "body" ], "query" : "world war" } }
  • remember : core filters (term, terms, range) already use a BitSet as their internal representation, so caching them is cheap. [ indices.cache.filter.size ]
    • ** one can use the Warmer API to warm up the cache (register queries as warmers on the index)
    • { "filtered" : { "query" : { "match" : { "text" : "iphone android" } }, "filter" : { "term" : { "age" : 25 } } } }
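    • ** e.g. registering the above filter as a warmer (the warmer name is hypothetical) : curl -XPUT 'localhost:9200/crunchbase/_warmer/age_warmer' -d '{ "query" : { "filtered" : { "query" : { "match_all" : {} }, "filter" : { "term" : { "age" : 25 } } } } }'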
  • suggesters are a cool feature
    • curl -XGET 'localhost:9200/stackoverflow/_suggest' -d '{ "suggest" : { "title_suggestions" : { "text" : "age of war", "term" : { "field" : "title", "suggest_mode" : "always", "size" : 2 } } } }'
  • apply function_score to calculate the boost using custom logic
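    • ** a sketch using field_value_factor, assuming a hypothetical popularity field : curl -XGET 'localhost:9200/crunchbase/_search' -d '{ "query" : { "function_score" : { "query" : { "match" : { "title" : "world war" } }, "field_value_factor" : { "field" : "popularity", "modifier" : "log1p" }, "boost_mode" : "multiply" } } }'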
  • great search optimization : search on limited shards using routing (otherwise all shards will be hit)
    • curl -XPOST 'localhost:9200/sales/_search?routing=honda,toyota' -d '{ "query" : { .. } }'
    • index only searchable fields ; for the rest just store them, don't index ( { .. "index" : "no" .. } )
    • analyze only required fields
    • save time by specifying which fields don't need to be considered while calculating the score ( "norms" : { "enabled" : false } )
    • "doc_values" : true -> will store field data on disk to avoid potential OOM (at the cost of 10%+ latency for agg queries on field data, which is OK in most cases)
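      • ** a combined mapping sketch for the last few tips, defined up front (field names are hypothetical) : curl -XPUT 'localhost:9200/crunchbase/_mapping/person' -d '{ "person" : { "properties" : { "bio" : { "type" : "string", "index" : "no" }, "name" : { "type" : "string", "norms" : { "enabled" : false } }, "age" : { "type" : "integer", "doc_values" : true } } } }'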
    • for range filters on dates, round to the coarsest useful granularity so the filter can be cached ( "gte" : "now-1d/d" )
    • to leverage caches, map numeric data as integer or short unless long values are really needed
    • use BOOL filters instead of and/or filters and move heavy filters towards the end
    • Keep cold segments warm before indexing

Cluster Stability Settings :

  • dedicated search nodes (configure a set of nodes as simple client nodes) ** this takes coordination load off the data nodes
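    • ** e.g. in elasticsearch.yml on such a client node : node.master: false , node.data: false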
  • for write-heavy operations, increase the threadpool size in elasticsearch.yml
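    • ** e.g. in elasticsearch.yml (the value is only an illustration) : threadpool.bulk.queue_size: 500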
  • assign half of RAM to ES process
  • use the CMS collector and Young Gen = min(500M * num_cores, 1/4 * heap_size)
  • ES_HEAP_SIZE = min(0.5 * max_mem, 30GB) , FILE_SYSTEM_CACHE = 0.5 * max_mem
  • bootstrap.mlockall = true (so that process doesn’t swap) , ulimit -l unlimited
  • the file descriptor limit should be raised in /etc/security/limits.conf ** monitor currently open file descriptors : curl 'localhost:9200/_nodes/stats/process,jvm'
  • for 503 status codes (rejection of requests), increase the threadpool queue size
    • ** curl 'localhost:9200/_nodes/thread_pool' , curl 'localhost:9200/_nodes/stats/thread_pool'
    • ** curl 'localhost:9200/test_index/_stats/search,indexing'
  • detect excessive socket creation ** if total_opened connections keeps increasing, then clients are not using persistent connections
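    • ** e.g. curl 'localhost:9200/_nodes/stats/http' shows the current_open / total_opened connection counts per node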
  • for write-heavy loads, disable refresh (index.refresh_interval = -1), reduce the replication factor, and increase the number of shards
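    • ** e.g. before a bulk load (restore the settings afterwards) : curl -XPUT 'localhost:9200/sales/_settings' -d '{ "index" : { "refresh_interval" : "-1", "number_of_replicas" : 0 } }'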
  • use a queue in front of ES to regulate high indexing load
  • set high file descriptors but monitor open file descriptors
  • monitor events using watches
  • zone-aware replication survives zone outages (use forced awareness during shard allocation)
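    • ** a sketch in elasticsearch.yml, assuming a node attribute named zone and two hypothetical zones : node.zone: us-east-1a , cluster.routing.allocation.awareness.attributes: zone , cluster.routing.allocation.awareness.force.zone.values: us-east-1a,us-east-1b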
  • for an N node cluster set discovery.zen.minimum_master_nodes = N/2 + 1 to avoid split brain
  • for AWS deployments, automate node discovery, tracking, scheduled backup, restore and upgrades using Raigad
  • use Sensu remediation or the monet service to restart the process when certain issues take place
  • require explicit index names for destructive operations (no wildcard deletes) via the action.destructive_requires_name setting
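    • ** e.g. in elasticsearch.yml : action.destructive_requires_name: true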
  • take backups of data : curl -XPUT 'localhost:9200/_snapshot/daily_backup/backup1?wait_for_completion=true' -d '{ "indices" : "+sample*,-sample_2" }'
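    • ** note : the snapshot repository must be registered first, e.g. (the path is illustrative) : curl -XPUT 'localhost:9200/_snapshot/daily_backup' -d '{ "type" : "fs", "settings" : { "location" : "/mnt/backups" } }'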
  • the number of shards is fixed per index , so plan capacity as max index size = num_of_shards * max_shard_size
  • rolling upgrade tips :
    • ** use ES 1.7+ so that we don't need to disable shard reallocation before upgrades.
    • take backup before upgrade
    • disable backup during the upgrade
    • dedicated master nodes (they should not perform indexing)
    • if we have N active indexes, then use N shards per index (shard size should be < 50 GB) and 1 replica (more replicas means slower indexing)
    • ES may go down at any time due to high CPU or memory usage , so be mindful of all these types of optimizations
      • discovery.zen.minimum_master_nodes = N/2 + 1 , where N is the number of master-eligible nodes
      • gateway.recover_after_time : 5m
      • gateway.expected_data_nodes : 2
        • thus don't let Elasticsearch recover all the data immediately ; rather tell it to wait for 5 minutes after the first node is up, but once there are 2 nodes in the cluster, recovery will start immediately.
        • for fast bootstrapping , throw a number of machines (shards) at the job so that 30 days' worth of data can be indexed in 1 hour (otherwise indexing may take days)

Reference :  Microsoft’s experience with ElasticSearch
