Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
446 views
in Technique[技术] by (71.8m points)

mapreduce - Hadoop input split size vs block size

I am going through hadoop definitive guide, where it clearly explains about input splits. It goes like

Input splits doesn’t contain actual data, rather it has the storage locations to data on HDFS

and

Usually,Size of Input split is same as block size

1) let’s say a 64MB block is on node A and replicated among 2 other nodes(B,C), and the input split size for the map-reduce program is 64MB, will this split just have location for node A? Or will it have locations for all the three nodes A,b,C?

2) Since data is local to all the three nodes how the framework decides(picks) a maptask to run on a particular node?

3) How is it handled if the Input Split size is greater or lesser than block size?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)
  • The answer by @user1668782 is a great explanation for the question and I'll try to give a graphical depiction of it.

  • Assume we have a file of 400MB with consists of 4 records(e.g : csv file of 400MB and it has 4 rows, 100MB each)

enter image description here

  • If the HDFS Block Size is configured as 128MB, then the 4 records will not be distributed among the blocks evenly. It will look like this.

enter image description here

  • Block 1 contains the entire first record and a 28MB chunk of the second record.
  • If a mapper is to be run on Block 1, the mapper cannot process since it won't have the entire second record.
  • This is the exact problem that input splits solve. Input splits respects logical record boundaries.

  • Lets Assume the input split size is 200MB

enter image description here

  • Therefore the input split 1 should have both the record 1 and record 2. And input split 2 will not start with the record 2 since record 2 has been assigned to input split 1. Input split 2 will start with record 3.

  • This is why an input split is only a logical chunk of data. It points to start and end locations with in blocks.

Hope this helps.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...