Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
666 views
in Technique[技术] by (71.8m points)

hadoop - Hive Create Multi small files for each insert in HDFS

following is already been achieved

  1. Kafka Producer pulling data from twitter using Spark Streaming.
  2. Kafka Consumer ingesting data into Hive External table(on HDFS).

while this is working fine so far. there is only one issue I am facing, while my app insert data into Hive table, it created small file with each row data per file.

below is the code

// Define which topics to read from
  val topic = "topic_twitter"
  val groupId = "group-1"
  val consumer = KafkaConsumer(topic, groupId, "localhost:2181")

//Create SparkContext
  val sparkContext = new SparkContext("local[2]", "KafkaConsumer")

//Create HiveContext  
  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)

  hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS twitter_data (tweetId BIGINT, tweetText STRING, userName STRING, tweetTimeStamp STRING,   userLang STRING)")
  hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS demo (foo STRING)")

Hive demo table already populated with one single record. Kafka consumer loop thru the data for topic ="topic_twitter" in process each row and populate in Hive table

val hiveSql = "INSERT INTO TABLE twitter_data SELECT STACK( 1," + 
    tweetID        +","  + 
    tweetText      +"," + 
    userName       +"," +
    tweetTimeStamp +","  +
    userLang + ") FROM demo limit 1"

hiveContext.sql(hiveSql)

below are the images from my Hadoop environment. twitter_data, demo Hie Tables in HDFS

last 10 files created in HDFS enter image description here

as you can see the file size is not more than 200KB, is there a way I merge these files in one file?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

[take 2] OK, so you can't properly "stream" data into Hive. But you can add a periodic compaction post-processing job...

  • create your table with 3 partitions e.g. (role='collectA'), (role='collectB'), (role='archive')
  • point your Spark inserts to (role='activeA')
  • at some point, switch to (role='activeB')
  • then dump every record that you have collected in the "A" partition into "archive", hoping that Hive default config will do a good job of limiting fragmentation

    INSERT INTO TABLE twitter_data PARTITION (role='archive') SELECT ... FROM twitter_data WHERE role='activeA' ; TRUNCATE TABLE twitter_data PARTITION (role='activeA') ;

  • at some point, switch back to "A" etc.

One last word: if Hive still creates too many files on each compaction job, then try tweaking some parameters in your session, just before the INSERT e.g.

set hive.merge.mapfiles =true;
set hive.merge.mapredfiles =true;
set hive.merge.smallfiles.avgsize=1024000000;

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...