Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


sorting - Sorted Table in Hive (ORC file format)

I'm having some difficulty making sure I'm leveraging sorted data within a Hive table (using the ORC file format).

I understand we can affect how the data is read from a Hive table by declaring CLUSTERED BY and SORTED BY clauses in the CREATE TABLE DDL.

CREATE TABLE trades
(
    trade_id INT,
    name STRING,
    contract_type STRING,
    ts INT
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (trade_id) SORTED BY (trade_id, ts) INTO 8 BUCKETS
STORED AS ORC;

This will mean that every time I make a query to this table, the data will be distributed by trade_id among the various mappers and afterward it will be sorted.

My question is:

I do not want the data to be split into N files (buckets), because the volume is not that large and I would end up with small files.

However, I do want to leverage sorted insertion.

INSERT OVERWRITE TABLE trades
PARTITION (dt)
SELECT trade_id, name, contract_type, ts, dt
FROM raw_trades
DISTRIBUTE BY trade_id
SORT BY trade_id;

Do I really need to use CLUSTERED BY/SORTED BY in the CREATE TABLE DDL statement? Or does Hive/ORC know how to leverage the fact that the insertion process already ensured that the data is sorted?

Could it make sense to do something like:

CLUSTERED BY (trade_id) SORTED BY (trade_id, ts) INTO 1 BUCKETS

1 Reply


Bucketed tables are an outdated concept.

You do not need to write CLUSTERED BY in the table DDL.

When loading the table, distribute by the partition key to reduce pressure on the reducers. This matters especially when writing ORC, which needs intermediate buffers to build each file: if every reducer writes to many partitions, it may fail with an OOM exception.

When the table is big, you can limit the amount of data each reducer handles, and therefore the maximum file size, using hive.exec.reducers.bytes.per.reducer like this:

set hive.exec.reducers.bytes.per.reducer=67108864; -- 64 MB, or even less

If you have more data, more reducers will be started and more files will be created. This is more flexible than loading into a fixed number of buckets.

It also works better for small tables, because you are not forced to split them into many small buckets.
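
As a rough illustration of how this setting drives the number of reducers, here is a sketch. It assumes Hive estimates the reducer count as the input size divided by the bytes-per-reducer value, capped by hive.exec.reducers.max. The 1 GB input size and the cap of 64 are hypothetical, and execution engines such as Tez may adjust the count at runtime.

```sql
-- Hypothetical input: about 1 GB (1073741824 bytes) reaching the reduce stage.
-- With 64 MB per reducer: 1073741824 / 67108864 = 16 reducers (estimate only).
set hive.exec.reducers.bytes.per.reducer=67108864;

-- Optional upper bound on the number of reducers (the value 64 is an assumption).
set hive.exec.reducers.max=64;
```

Halving the bytes-per-reducer value roughly doubles the estimated reducer count for the same input, which is why the setting scales with the data without changing the table DDL.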

ORC has internal indexes (min/max statistics) and optional bloom filters. By applying SORT BY during the load you can make both more efficient, because similar values are stored together. Sorting can also improve compression, depending on the entropy of your data.
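
A minimal sketch of what the table DDL could look like without CLUSTERED BY, relying on ORC table properties instead. The column list is taken from the question; the choice of bloom filter columns and the false-positive rate are assumptions for illustration, not something the original answer specifies.

```sql
-- Sketch only: no CLUSTERED BY / SORTED BY, sorting is done at insert time.
CREATE TABLE trades
(
    trade_id INT,
    name STRING,
    contract_type STRING,
    ts INT
)
PARTITIONED BY (dt STRING)
STORED AS ORC
TBLPROPERTIES (
    'orc.create.index' = 'true',                            -- build min/max row indexes
    'orc.bloom.filter.columns' = 'trade_id,contract_type',  -- assumed filter columns
    'orc.bloom.filter.fpp' = '0.05'                         -- assumed false-positive rate
);
```

Bloom filters mainly help equality predicates on the listed columns, so it probably only makes sense to list columns you actually filter on.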

If distributing by the partition key is not enough, because the data is big and skewed, you can additionally distribute by a random value. Distributing by a column is better when its values are evenly distributed. If they are not, distribute by random to avoid the problem of a single long-running reducer.

Finally, your insert statement may look like this:
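
One way to check whether skew is a concern before choosing the DISTRIBUTE BY keys is to count rows per candidate key. This is a sketch against the raw_trades table from the question; the alias row_cnt and the limit of 20 are arbitrary choices.

```sql
-- Rows per partition key value; a few very large values indicate skew.
SELECT dt, COUNT(*) AS row_cnt
FROM raw_trades
GROUP BY dt
ORDER BY row_cnt DESC
LIMIT 20;
```

If the largest counts are many times bigger than the rest, the reducers handling those values will run much longer, and adding a random component to DISTRIBUTE BY is likely worth it.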

set hive.exec.dynamic.partition=true;              -- dynamic partitioning for PARTITION (dt);
set hive.exec.dynamic.partition.mode=nonstrict;    -- may already be set in your environment
set hive.exec.reducers.bytes.per.reducer=33554432; -- 32 MB per reducer

INSERT OVERWRITE TABLE trades PARTITION (dt)
SELECT trade_id, name, contract_type, ts, dt
FROM raw_trades
DISTRIBUTE BY dt,                    -- partition key: a must for big data
              trade_id,              -- optional: another key, if the data is too big
                                     -- and the key is evenly distributed (no skew)
              FLOOR(RAND()*100.0)%20 -- optional: random value 0..19 to split the data
                                     -- further into 20 roughly equal parts
SORT BY contract_type; -- sort by the key you filter on, so that filtering
                       -- works better with the internal indexes

Do not use CLUSTERED BY in the table DDL: with DISTRIBUTE BY and SORT BY during insert, plus ORC indexes and bloom filters, you can achieve the same result in a more flexible way.

Distribute + sort can reduce the size of ORC files dramatically, by a factor of 3 or 4 in some cases. Similar data compresses better and makes the internal indexes more efficient.

See also this answer: https://stackoverflow.com/a/55375261/2700344 and this related answer about sorting: https://stackoverflow.com/a/47416027/2700344
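
To benefit from the sorted layout at read time, predicate pushdown to the ORC reader has to be enabled. The following is a sketch, assuming the data was loaded sorted by contract_type; the partition value and the contract type literal are hypothetical.

```sql
set hive.optimize.ppd=true;           -- predicate pushdown
set hive.optimize.index.filter=true;  -- let the ORC reader use its indexes

SELECT trade_id, name, ts
FROM trades
WHERE dt = '2020-01-01'           -- hypothetical partition value
  AND contract_type = 'FUTURE';   -- hypothetical value; stripes and row groups
                                  -- whose min/max range excludes it can be skipped
```

Because the rows are sorted by contract_type within each file, each value occupies a narrow range of stripes, so the min/max statistics can rule out most of the file.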

