Apache Spark does not delete temporary directories

After a Spark program completes, three temporary directories remain in the temp directory. The directory names look like this: spark-2e389487-40cc-4a82-a5c7-353c0feefbb7

The directories are empty.

When the Spark program runs on Windows, a snappy DLL file also remains in the temp directory. The file name looks like this: snappy-1.0.4.1-6e117df4-97b6-4d69-bf9d-71c4a627940c-snappyjava

They are created every time the Spark program runs, so the number of files and directories keeps growing.

How can I get them deleted?

Spark version is 1.3.1 with Hadoop 2.6.

UPDATE

I've traced the Spark source code.

The methods that create the three temp directories are as follows:

  • DiskBlockManager.createLocalDirs
  • HttpFileServer.initialize
  • SparkEnv.sparkFilesDir

They (eventually) call Utils.getOrCreateLocalRootDirs and then Utils.createDirectory, which intentionally does NOT mark the directory for automatic deletion.

The comment on the createDirectory method says: "The directory is guaranteed to be newly created, and is not marked for automatic deletion."

I don't know why they are not marked. Is this really intentional?
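
In the meantime, the only way I see to reclaim the space is to clean the temp directory out-of-band. Below is a minimal manual-cleanup sketch, assuming a Linux machine whose java.io.tmpdir is /tmp; the TMP_DIR variable and the 60-minute age cutoff are illustrative choices of mine, not anything Spark provides. The age filter is there so directories belonging to still-running jobs are left alone.

    #!/bin/sh
    # Manual cleanup sketch (assumption: Linux, java.io.tmpdir = /tmp).
    # Removes leftover Spark scratch directories and snappy-java native library
    # copies that are older than 60 minutes.
    TMP_DIR="${TMP_DIR:-/tmp}"

    find "$TMP_DIR" -maxdepth 1 -type d -name 'spark-*' -mmin +60 -exec rm -rf {} +
    find "$TMP_DIR" -maxdepth 1 -type f -name 'snappy-*-snappyjava*' -mmin +60 -delete

On Windows the same idea would apply to the leftover snappy-*-snappyjava files, just with the platform's own tooling.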


1 Reply


Three SPARK_WORKER_OPTS properties exist to support cleanup of worker application folders. They are copied here from the Spark documentation for reference (see the example after the list):

  • spark.worker.cleanup.enabled (default: false). Enables periodic cleanup of worker / application directories. Note that this only affects standalone mode, as YARN works differently. Only the directories of stopped applications are cleaned up.

  • spark.worker.cleanup.interval (default: 1800, i.e. 30 minutes). Controls the interval, in seconds, at which the worker cleans up old application work dirs on the local machine.

  • spark.worker.cleanup.appDataTtl (default: 7*24*3600, i.e. 7 days). The number of seconds to retain application work directories on each worker. This is a time-to-live and should depend on the amount of available disk space you have. Application logs and JARs are downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space, especially if you run jobs very frequently.
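
In case it helps, here is a sketch of how these properties are typically set for a standalone worker, via SPARK_WORKER_OPTS in conf/spark-env.sh; the values shown are only illustrative, not recommendations:

    # conf/spark-env.sh on each standalone worker (illustrative values)
    # Turn on periodic cleanup of work directories left behind by stopped applications.
    SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
      -Dspark.worker.cleanup.interval=1800 \
      -Dspark.worker.cleanup.appDataTtl=604800"   # 604800 = 7*24*3600 seconds (7 days)
    export SPARK_WORKER_OPTS

As the first bullet notes, this only affects standalone-mode worker/application work directories, so it may not cover the driver-side spark-* directories under java.io.tmpdir described in the question.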

