Use the direct Kafka approach. It keeps a 1:1 mapping between each Kafka partition and the corresponding RDD partition during streaming processing.
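A minimal sketch of creating the direct stream (using the Kafka 0.8 direct API from the integration guide linked below; the broker list and topic name are hypothetical, and ssc is the StreamingContext created further down):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Direct stream: no receivers; Spark tracks offsets itself and creates one RDD partition per Kafka partition
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("events")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)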

If the application restarts, a few offsets may be reprocessed; as long as the database writes are idempotent, repeating the updates for the same set of records has no side effects.
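For example, writing each record as an upsert keyed by its primary key means that replaying a handful of offsets simply rewrites the same rows (a sketch; upsertRecord is a hypothetical helper around something like INSERT ... ON DUPLICATE KEY UPDATE):

// Sketch of an idempotent sink: reprocessing the same (key, value) pairs is harmless
stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // open one DB connection per partition, then upsert each record by its key
    records.foreach { case (key, value) => upsertRecord(key, value) }   // hypothetical helper
  }
}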

http://spark.apache.org/docs/latest/streaming-kafka-integration.html

Tune the batch interval
val ssc = new StreamingContext(sc, Seconds(10))   // 10-second batches; keep batch processing time below this interval
Tune concurrency
sparkConf.set("spark.streaming.concurrentJobs", "4")   // allow up to 4 streaming jobs to run concurrently
Tune the max rate per partition
sparkConf.set("spark.streaming.kafka.maxRatePerPartition", "25")   // read at most 25 records per second from each Kafka partition
Mesos config
sparkConf.set("spark.cores.max", "20")        // cap the total cores the application uses across the cluster
sparkConf.set("spark.mesos.coarse", "true")   // coarse-grained mode: acquire Mesos resources once for the lifetime of the app
// sparkConf.set("spark.executor.instances", "4")
// sparkConf.set("spark.executor.cores", "10")
Set memory
sparkConf.set("spark.driver.memory", "8g")
sparkConf.set("spark.executor.memory", "15g")
sparkConf.set("spark.streaming.unpersist", "true")   // unpersist generated streaming RDDs as soon as they are no longer needed

Handle backpressure
sparkConf.set("spark.streaming.backpressure.enabled", "true")   // let Spark adapt the ingestion rate to scheduling delays and processing times

Additional details:

sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")   // Kryo is faster and more compact than default Java serialization
sparkConf.set("spark.driver.extraJavaOptions", "-XX:+UseG1GC")
sparkConf.set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")   // G1 GC helps keep pauses short with large heaps
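If the stream carries custom classes, they can optionally be registered with Kryo for more compact serialization (EventRecord is a hypothetical case class):

// Registering application classes avoids Kryo writing full class names with every object
case class EventRecord(id: String, payload: String)
sparkConf.registerKryoClasses(Array(classOf[EventRecord]))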

http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning

http://spark.apache.org/docs/latest/tuning.html#memory-management-overview

 
