Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


amazon s3 - spark-1.4.1 saveAsTextFile to S3 is very slow on emr-4.0.0

I run Spark 1.4.1 on Amazon AWS EMR 4.0.0.

For some reason, Spark's saveAsTextFile is much slower on EMR 4.0.0 than on EMR 3.8 (it used to take 5 sec, now it takes 95 sec).

In fact, saveAsTextFile reports that it finished in 4.356 sec, but for the next 90 sec I see lots of INFO messages with 404 errors from the com.amazonaws.latency logger:

spark> sc.parallelize(List.range(0, 1600000),160).map(x => x + "" + "A"*100).saveAsTextFile("s3n://foo-bar/tmp/test40_20")

2015-09-01 21:16:17,637 INFO  [dag-scheduler-event-loop] scheduler.DAGScheduler (Logging.scala:logInfo(59)) - ResultStage 5 (saveAsTextFile at <console>:22) finished in 4.356 s
2015-09-01 21:16:17,637 INFO  [task-result-getter-2] cluster.YarnScheduler (Logging.scala:logInfo(59)) - Removed TaskSet 5.0, whose tasks have all completed, from pool 
2015-09-01 21:16:17,637 INFO  [main] scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Job 5 finished: saveAsTextFile at <console>:22, took 4.547829 s
2015-09-01 21:16:17,638 INFO  [main] s3n.S3NativeFileSystem (S3NativeFileSystem.java:listStatus(896)) - listStatus s3n://foo-bar/tmp/test40_20/_temporary/0 with recursive false
2015-09-01 21:16:17,651 INFO  [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 3B2F06FD11682D22), S3 Extended Request ID: C8T3rXVSEIk3swlwkUWJJX3gWuQx3QKC3Yyfxuhs7y0HXn3sEI9+c1a0f7/QK8BZ], ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found], AWSRequestID=[3B2F06FD11682D22], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.923], HttpRequestTime=[11.388], HttpClientReceiveResponseTime=[9.544], RequestSigningTime=[0.274], HttpClientSendRequestTime=[0.129], 
2015-09-01 21:16:17,723 INFO  [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200], ServiceName=[Amazon S3], AWSRequestID=[E5D513E52B20FF17], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[71.927], HttpRequestTime=[53.517], HttpClientReceiveResponseTime=[51.81], RequestSigningTime=[0.209], ResponseProcessingTime=[17.97], HttpClientSendRequestTime=[0.089], 
2015-09-01 21:16:17,756 INFO  [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 62C6B413965447FD), S3 Extended Request ID: 4w5rKMWCt9EdeEKzKBXZgWpTcBZCfDikzuRrRrBxmtHYxkZyS4GxQVyADdLkgtZf], ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found], AWSRequestID=[62C6B413965447FD], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.044], HttpRequestTime=[10.543], HttpClientReceiveResponseTime=[8.743], RequestSigningTime=[0.271], HttpClientSendRequestTime=[0.138], 
2015-09-01 21:16:17,774 INFO  [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200], ServiceName=[Amazon S3], AWSRequestID=[F62B991825042889], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[16.724], HttpRequestTime=[16.292], HttpClientReceiveResponseTime=[14.728], RequestSigningTime=[0.148], ResponseProcessingTime=[0.155], HttpClientSendRequestTime=[0.068], 
2015-09-01 21:16:17,786 INFO  [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 4846575A1C373BB9), S3 Extended Request ID: aw/MMKxKPmuDuxTj4GKyDbp8hgpQbTjipJBzdjdTgbwPgt5NsZS4z+tRf2bk3I2E], ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found], AWSRequestID=[4846575A1C373BB9], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.531], HttpRequestTime=[11.134], HttpClientReceiveResponseTime=[9.434], RequestSigningTime=[0.206], HttpClientSendRequestTime=[0.13], 
2015-09-01 21:16:17,786 INFO  [main] s3n.S3NativeFileSystem (S3NativeFileSystem.java:listStatus(896)) - listStatus s3n://foo-bar/tmp/test40_20/_temporary/0/task_201509012116_0005_m_000000 with recursive false
2015-09-01 21:16:17,798 INFO  [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 8A91D9A08CE3C1FE), S3 Extended Request ID: u5RLzX1OvlIHBMCggSs3AGR96raYgD/Xu8RmoJuN/B+qZchoF1ZkbWIHRcqbzPNN], ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found], AWSRequestID=[8A91D9A08CE3C1FE], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.472], HttpRequestTime=[11.147], HttpClientReceiveResponseTime=[9.594], RequestSigningTime=[0.168], HttpClientSendRequestTime=[0.088], 
2015-09-01 21:16:17,817 INFO  [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200], ServiceName=[Amazon S3], AWSRequestID=[006EE9124BA77E28], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[19.185], HttpRequestTime=[16.691], HttpClientReceiveResponseTime=[15.039], RequestSigningTime=[0.17], ResponseProcessingTime=[2.141], HttpClientSendRequestTime=[0.11], 
2015-09-01 21:16:17,830 INFO  [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 62F097583E42AB48), S3 Extended Request ID: EoJ7XNxQzKAm6yanlrf7ukIJOxYrhr5m8xEROkLc1wjFpPRgjuwY/JzznCshredZ], ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found], AWSRequestID=[62F097583E42AB48], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[12.004], HttpRequestTime=[11.57], HttpClientReceiveResponseTime=[9.879], RequestSigningTime=[0.218], HttpClientSendRequestTime=[0.089], 
2015-09-01 21:16:17,844 INFO  [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: A96FDB3E0E0E13FE), S3 Extended Request ID: Y1nnEJAd/wNtW+T2pFvg8HG5fzcjs+ztuLcXwFV3I6Bda4nKU+9rSdbTkoDtNwtu], ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found], AWSRequestID=[A96FDB3E0E0E13FE], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[13.543], HttpRequestTime=[13.145], HttpClientReceiveResponseTime=[11.505], RequestSigningTime=[0.207], HttpClientSendRequestTime=[0.108], 
2015-09-01 21:16:17,911 INFO  [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200], ServiceName=[Amazon S3], AWSRequestID=[4C105174ADF12A0B], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[66.408], HttpRequestTime=[63.949], HttpClientReceiveResponseTime=[62.298], RequestSigningTime=[0.211], ResponseProcessingTime=[2.049], HttpClientSendRequestTime=[0.085], 
2015-09-01 21:16:17,912 INFO  [main] s3n.S3NativeFileSystem (S3NativeFileSystem.java:rename(1182)) - rename s3n://foo-bar/tmp/test40_20/_temporary/0/task_201509012116_0005_m_000000/part-00000 s3n://foo-bar/tmp/test40_20/part-00000
2015-09-01 21:16:17,927 INFO  [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 547162454610B1C3), S3 Extended Request ID: VgjjiHVtd/RutYxW3jPAZgos64j7JYfBmaMhkZvmyhkgD5ZuCAMSRMd/TrWQmTci], ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found], AWSRequestID=[547162454610B1C3], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[15.214], HttpRequestTime=[14.764], HttpClientReceiveResponseTime=[13.047], RequestSigningTime=[0.243], HttpClientSendRequestTime=[0.124], 
2015-09-01 21:16:18,037 INFO  [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 6F10454BF138C69F), S3 Extended Request ID: HSt8mkimmo9fK5qqTaU6OBGKXTQ1wvyctgMZSBsoIgxEFY+Yu5eq/Bn8fOCSsk3B], ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found], AWSRequestID=[6F10454BF138C69F], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[108.944], HttpRequestTime=[108.542], HttpClientReceiveResponseTime=[106.874], RequestSigningTime=[0.171], HttpClientSendRequestTime=[0.067], 
2015-09-01 21:16:18,215 INFO  [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200], ServiceName=[Amazon S3], AWSRequestID=[942D4DFF59A2B262], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[177.058], HttpRequestTime=[174.523], HttpClientReceiveResponseTime=[172.689], RequestSigningTime=[0.263], ResponseProcessingTime=[2.049], HttpClientSendRequestTime=[0.117], 
2015-09-01 21:16:18,235 INFO  [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 712A1FF2554DDD5D), S3 Extended Request ID: RZZDuIrkdE/cdhAFijZix2juyAfZHyj7Mw2xJuyrEaJR5He0HREB30LATWvMJX3A], ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found], AWSRequestID=[712A1FF2554DDD5D], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], Exception=1, HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[20.187], HttpRequestTime=[19.728], HttpClientReceiveResponseTime=[18.001], RequestSigningTime=[0.238], HttpClientSendRequestTime=[0.125], 
2015-09-01 21:16:18,248 INFO  [main] amazonaws.latency (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200], ServiceName=[Amazon S3], AWSRequestID=[B386866C749DB8E0], ServiceEndpoint=[https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.628], HttpRequestTime=[11.091], HttpClientReceiveResponseTime=[9.513], RequestSigningTime=[0.24], ResponseProcessingTime=[0.139], HttpClientSendRequestTime=[0.079], 
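
To see how much of the extra time goes into these requests, the latency lines can be tallied per status code. The following is only a minimal sketch, not part of the original post: the object name LatencySummary and the log path passed as args(0) are hypothetical, and it relies solely on the StatusCode=[...] and ClientExecuteTime=[...] fields visible in the log above (the times appear to be in milliseconds, judging by the timestamps).

```scala
// Sketch: summarize com.amazonaws.latency lines from a saved driver log.
// Only the StatusCode=[...] and ClientExecuteTime=[...] fields are parsed;
// lines without both fields (listStatus, rename, scheduler messages) are skipped.
object LatencySummary {
  private val Status  = """StatusCode=\[(\d+)\]""".r
  private val Execute = """ClientExecuteTime=\[([0-9.]+)\]""".r

  /** Returns status code -> (request count, summed ClientExecuteTime). */
  def summarize(lines: Iterator[String]): Map[String, (Int, Double)] =
    lines.foldLeft(Map.empty[String, (Int, Double)]) { (acc, line) =>
      (Status.findFirstMatchIn(line), Execute.findFirstMatchIn(line)) match {
        case (Some(s), Some(e)) =>
          val code = s.group(1)
          val (count, total) = acc.getOrElse(code, (0, 0.0))
          acc.updated(code, (count + 1, total + e.group(1).toDouble))
        case _ => acc // not a latency line
      }
    }

  def main(args: Array[String]): Unit = {
    // args(0): hypothetical path to a file containing the driver log
    val source = scala.io.Source.fromFile(args(0))
    try {
      summarize(source.getLines()).foreach { case (code, (count, total)) =>
        println(f"status=$code requests=$count total=$total%.1f")
      }
    } finally source.close()
  }
}
```

If the 404 and 200 totals add up to most of the 90 sec, the time is being spent in sequential S3 calls made from the driver's main thread after the job itself has finished, which is what the listStatus and rename lines in the log suggest.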


1 Reply


To solve the problem, I added the following settings to mapred-site.xml, as suggested by Neil Jonkers on user@spark.apache.org:

<property>
  <name>mapred.output.direct.EmrFileSystem</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.direct.NativeS3FileSystem</name>
  <value>true</value>
</property>

This can be done by adding the following to the aws command:

classification=mapred-site,properties=[mapred.output.direct.EmrFileSystem=true,mapred.output.direct.NativeS3FileSystem=true]

or by adding the following to the configuration JSON file:

  {
    "Classification": "mapred-site",
    "Properties": {
      "mapred.output.direct.EmrFileSystem": "true",
      "mapred.output.direct.NativeS3FileSystem": "true"
    }
  }
