hadoop - Access files that start with underscore in apache spark

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am trying to access gz files on S3 whose names start with _ in Apache Spark. Unfortunately, Spark treats these files as invisible and returns Input path does not exist: s3n:.../_1013.gz. If I remove the underscore, it finds the file just fine.

I tried adding a custom PathFilter to the hadoopConfig:

package CustomReader

import org.apache.hadoop.fs.{Path, PathFilter}

class GFilterZip extends PathFilter {
  override def accept(path: Path): Boolean = {
    true
  }
}
// in spark settings
sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[CustomReader.GFilterZip], classOf[org.apache.hadoop.fs.PathFilter])

But I still have the same problem. Any ideas?

System: Apache Spark 1.6.0 with Hadoop 2.3

1 Reply

深蓝 · Answer 1 · 2021-10-23T21:38:08+0000

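Files whose names start with _ or . are treated as hidden files by Hadoop.

The hiddenFileFilter is always applied. It is added inside the method org.apache.hadoop.mapred.FileInputFormat.listStatus, and a custom PathFilter is combined with it rather than replacing it, so a filter that accepts everything cannot make these files visible.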

See also the related answer "Which files are ignored as input by mapper?"
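To illustrate why the accept-everything filter from the question has no effect, here is a minimal sketch in plain Scala that models the behaviour without depending on Hadoop. NameFilter, HiddenFilterDemo, combine and effective are hypothetical names invented for this example. They stand in for Hadoop's PathFilter and the way listStatus is understood to combine filters, and they are not the library's actual code.

```scala
// Hypothetical stand-in for org.apache.hadoop.fs.PathFilter, for illustration only.
trait NameFilter {
  def accept(name: String): Boolean
}

object HiddenFilterDemo {
  // Models the hidden-file rule: names starting with "_" or "." are rejected.
  val hiddenFileFilter: NameFilter = new NameFilter {
    def accept(name: String): Boolean =
      !name.startsWith("_") && !name.startsWith(".")
  }

  // Models the custom filter from the question: it accepts everything.
  val acceptAll: NameFilter = new NameFilter {
    def accept(name: String): Boolean = true
  }

  // A name passes only if every filter in the list accepts it (logical AND).
  def combine(filters: Seq[NameFilter]): NameFilter = new NameFilter {
    def accept(name: String): Boolean = filters.forall(_.accept(name))
  }

  // Assumed composition: the hidden-file filter always comes first,
  // and the user-supplied filter is appended to it.
  val effective: NameFilter = combine(Seq(hiddenFileFilter, acceptAll))

  def main(args: Array[String]): Unit = {
    Seq("1013.gz", "_1013.gz", ".part.crc").foreach { name =>
      println(s"$name -> ${effective.accept(name)}")
    }
  }
}
```

Because the filters are ANDed together, a user filter can only narrow the set of accepted files further. It cannot bring back a name that the hidden-file filter has already rejected, which matches the behaviour reported in the question.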
