command line interface - How to split a CSV or JSON file for optimal Snowflake ingestion?

Snowflake recommends splitting large files before ingesting:

To optimize the number of parallel operations for a load, we recommend aiming to produce data files roughly 100-250 MB (or larger) in size compressed.
https://docs.snowflake.com/en/user-guide/data-load-considerations-prepare.html

What's the best way to split my large files, and compress them?


1 Reply


This is the best command line sequence I could come up with:

cat bigfile.json | split -C 1000000000 -d -a4 - output_prefix --filter='gzip > $FILE.gz'

Replace the first step with anything that outputs JSON or CSV to stdout, depending on the source file: if it's a plain file, cat will do; if it's a .gz, use gzcat; if it's zstd-compressed, use unzstd --long=31 -c file.zst; and so on.
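
For example, if the source were a gzipped file (assuming a name like bigfile.json.gz), only the first command changes and the rest of the pipeline stays the same:

gzcat bigfile.json.gz | split -C 1000000000 -d -a4 - output_prefix --filter='gzip > $FILE.gz'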

Then split:

  • -C 1000000000 creates files of at most ~1 GB, but splits only at line boundaries, so no row is ever cut in half.
  • -d gives each file a numeric suffix (I prefer this to the default letters)
  • -a4 makes the numeric suffix 4 digits long (instead of the default 2)
  • - tells split to read from stdin, i.e. the output of the previous cat in the pipeline
  • output_prefix is the base name for all output files
  • --filter='gzip > $FILE.gz' compresses each ~1 GB chunk on the fly with gzip, so each final file ends up around 100 MB (example file names below).
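
Put together, the command above should produce a numbered series of compressed files in the current directory, each roughly 100 MB after compression (the names follow from -d, -a4, output_prefix, and the .gz added by the filter):

output_prefix0000.gz
output_prefix0001.gz
output_prefix0002.gz
...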

Snowflake can ingest .gz files, so this final compression step will also help us move the files around the network.
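
As a rough sketch of the Snowflake side (the stage name my_stage, the table my_table, the local path, and the use of the SnowSQL client are assumptions, not part of the original answer), the split files could then be staged and loaded like this:

snowsql -q "PUT file:///tmp/output_prefix*.gz @my_stage AUTO_COMPRESS=FALSE"
snowsql -q "COPY INTO my_table FROM @my_stage FILE_FORMAT=(TYPE='JSON' COMPRESSION='GZIP')"

AUTO_COMPRESS=FALSE keeps Snowflake from re-compressing files that are already gzipped.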

