
scala - Intermittent Timeout Exception using Spark

I have a Spark cluster with 10 nodes, and I'm getting this exception after using the Spark context for the first time:

14/11/20 11:15:13 ERROR UserGroupInformation: PriviledgedActionException as:iuberdata (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1421)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.security.PrivilegedActionException: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    ... 4 more

Another user had a similar problem, but I've already tried his solution and it didn't work.

The same exception has also been reported elsewhere, but the problem there isn't the same as mine, since I'm using Spark 1.1.0 on both the master and the slaves as well as on the client.

I've tried increasing the timeout to 120s, but that still doesn't solve the problem.

I'm deploying the environment through scripts and using context.addJar to include my code in the classpath. The problem is intermittent, and I have no idea how to track down why it's happening. Has anybody who has faced this issue while configuring a Spark cluster figured out how to solve it?
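For reference, a minimal sketch of how the context is set up and the jar shipped. This is not my exact script: the master URL, app name, jar path, and timeout value are placeholders, and spark.akka.timeout is just one of the Akka timeout settings available in Spark 1.1:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: master URL, app name and jar path are placeholders.
    val conf = new SparkConf()
      .setAppName("MyApp")
      .setMaster("spark://master-host:7077")
      .set("spark.akka.timeout", "120") // Akka timeout in seconds
    val sc = new SparkContext(conf)

    // Ship the application jar to the executors' classpath.
    sc.addJar("/path/to/my-app.jar")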



1 Reply


We had a similar problem, which was quite hard to debug and isolate. Long story short: Spark uses Akka, which is very picky about FQDN hostnames resolving to IP addresses. Even if you specify the IP address in all places, it is not enough. The answer here helped us isolate the problem.

A useful test is to run netcat -l <port> on the master and nc -vz <host> <port> on the worker to check connectivity. Run the test with both the IP address and the FQDN. You can get the name Spark is using from the WARN message in the log snippet below; for us it was host032s4.staging.companynameremoved.info. The IP-address test passed for us, but the FQDN test failed because our DNS was not set up correctly. (A programmatic version of the same check is sketched after the log.)

INFO 2015-07-24 10:33:45 Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@10.40.246.168:35455]
INFO 2015-07-24 10:33:45 Remoting: Remoting now listens on addresses: [akka.tcp://driverPropsFetcher@10.40.246.168:35455]
INFO 2015-07-24 10:33:45 org.apache.spark.util.Utils: Successfully started service 'driverPropsFetcher' on port 35455.
WARN 2015-07-24 10:33:45 Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkDriver@host032s4.staging.companynameremoved.info:50855]. Address is now gated for 60000 ms, all messages to this address will be delivered to dead letters.
ERROR 2015-07-24 10:34:15 org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:skumar cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
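If it's easier to run the check from the JVM than from netcat, a small helper like the one below does the equivalent of nc -vz. This is a hypothetical sketch, not part of our original setup; pass it both the raw IP and the FQDN that appears in the WARN line.

    import java.net.{InetSocketAddress, Socket}

    // Hypothetical helper: the equivalent of `nc -vz <host> <port>` from Scala.
    def canConnect(host: String, port: Int, timeoutMs: Int = 5000): Boolean = {
      val socket = new Socket()
      try {
        socket.connect(new InetSocketAddress(host, port), timeoutMs)
        true
      } catch {
        case _: java.io.IOException => false
      } finally {
        socket.close()
      }
    }

    // e.g. canConnect("10.40.246.168", 35455) and
    //      canConnect("host032s4.staging.companynameremoved.info", 50855)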

Another thing we had to do was specify the spark.driver.host and spark.driver.port properties in the spark-submit script, because our machines had two IP addresses and the FQDN resolved to the wrong one.
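These can be passed as --conf options to spark-submit or set programmatically; below is a minimal sketch of the programmatic form, with placeholder values (use the address that the workers can actually reach):

    import org.apache.spark.SparkConf

    // Sketch only: equivalent of setting spark.driver.host and
    // spark.driver.port in the submit script. The values are placeholders.
    val conf = new SparkConf()
      .set("spark.driver.host", "10.40.246.168") // reachable IP or correct FQDN
      .set("spark.driver.port", "50855")         // pin the driver to a known port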

Make sure your network and DNS entries are correct!!

