My Apache Spark cluster is running an application that is giving me lots of executor timeouts:
10:23:30,761 ERROR ~ Lost executor 5 on slave2.cluster: Executor heartbeat timed out after 177005 ms
10:23:30,806 ERROR ~ Lost executor 1 on slave4.cluster: Executor heartbeat timed out after 176991 ms
10:23:30,812 ERROR ~ Lost executor 4 on slave6.cluster: Executor heartbeat timed out after 176981 ms
10:23:30,816 ERROR ~ Lost executor 6 on slave3.cluster: Executor heartbeat timed out after 176984 ms
10:23:30,820 ERROR ~ Lost executor 0 on slave5.cluster: Executor heartbeat timed out after 177004 ms
10:23:30,835 ERROR ~ Lost executor 3 on slave7.cluster: Executor heartbeat timed out after 176982 ms
However, I can confirm that in my configuration I successfully increased the executor heartbeat interval.
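For context, the setting I mean is spark.executor.heartbeatInterval. The snippet below is only an illustrative sketch of how I set it; the app name and the values are made up and are not my exact configuration:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only; my real configuration uses different numbers,
// but this is the mechanism I mean.
val conf = new SparkConf()
  .setAppName("heartbeat-example")
  // How often each executor sends a heartbeat back to the driver (stock default: 10s).
  .set("spark.executor.heartbeatInterval", "60s")

val sc = new SparkContext(conf)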
When I look at the logs of executors marked as EXITED (i.e. the driver removed them when it couldn't get a heartbeat), it appears that the executors killed themselves because they never received any tasks from the driver:
16/05/16 10:11:26 ERROR TransportChannelHandler: Connection to /10.0.0.4:35328 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
16/05/16 10:11:26 ERROR CoarseGrainedExecutorBackend: Cannot register with driver: spark://CoarseGrainedScheduler@10.0.0.4:35328
How can I turn off heartbeats and/or prevent the executors from timing out?
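The executor log above points at spark.network.timeout, and the 120000 ms it mentions matches that setting's stock default of 120s. Below is only a hedged sketch of how raising it might look; the app name and value are made up, and I don't know whether this is actually the right fix:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative sketch only: raise the idle-connection / heartbeat timeout the
// executor log complains about. Note that spark.executor.heartbeatInterval
// must stay well below spark.network.timeout.
val conf = new SparkConf()
  .setAppName("network-timeout-example")
  .set("spark.network.timeout", "600s")

val sc = new SparkContext(conf)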
question from:
https://stackoverflow.com/questions/37260230/spark-cluster-full-of-heartbeat-timeouts-executors-exiting-on-their-own