
python - Split large text file (around 50GB) into multiple files

I would like to split a large text file (around 50 GB in size) into multiple files. The data in the file looks like this [x = any digit from 0 to 9]:

xxx.xxx.xxx.xxx
xxx.xxx.xxx.xxx
xxx.xxx.xxx.xxx
xxx.xxx.xxx.xxx
...............
...............

There might be a few billion lines in the file, and I would like to write, for example, 30-40 million lines per file. I guess the steps would be:

  • I have to open the file
  • then read it line by line using readline(), writing to a new file at the same time
  • and as soon as it hits the maximum number of lines, create another file and start writing again.

I'm wondering how to put all these steps together in a memory-efficient and fast way. I've seen some examples on Stack Overflow, but none of them does exactly what I need. I would really appreciate it if anyone could help me out.
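
Something like this rough sketch is what I have in mind, streaming the input line by line so memory stays bounded (the 30-million-line limit and the output file names are just placeholders), but I don't know whether it is efficient enough:

# Rough sketch: stream the input and roll over to a new output file
# every `lines_per_file` lines. Names and the limit are placeholders.
lines_per_file = 30_000_000

def split_file(path, prefix="part_"):
    file_index = 0
    line_count = 0
    out = open(f"{prefix}{file_index:02d}.txt", "w")
    with open(path) as src:
        for line in src:
            if line_count == lines_per_file:
                # current chunk is full: close it and start the next file
                out.close()
                file_index += 1
                line_count = 0
                out = open(f"{prefix}{file_index:02d}.txt", "w")
            out.write(line)
            line_count += 1
    out.close()

split_file("t.txt")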


1 Reply


This working solution uses the split command available in the shell. Since the author has already accepted the possibility of a non-Python solution, please do not downvote.

First, I created a test file with 1000M entries (15 GB) using:

awk 'BEGIN{for (i = 0; i < 1000000000; i++) {print "123.123.123.123"} }' > t.txt

Then I used split:

split --lines=30000000 --numeric-suffixes --suffix-length=2 t.txt t

It took about 5 minutes to produce a set of 34 small files named t00-t33. The first 33 files are 458 MB each, and the last one, t33, is 153 MB.
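
If you want to keep everything inside a Python script, the same command can be driven from Python with subprocess (a minimal sketch; it assumes GNU split is on the PATH and reuses the file names from above):

import subprocess

# Run GNU split from Python: 30 million lines per output file,
# two-digit numeric suffixes, reading t.txt and writing t00, t01, ...
subprocess.run(
    ["split", "--lines=30000000", "--numeric-suffixes",
     "--suffix-length=2", "t.txt", "t"],
    check=True,  # raise CalledProcessError if split fails
)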

