Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
373 views
in Technique[技术] by (71.8m points)

splitting CSV file by columns

I have a really huge CSV files. There are about 1700 columns and 40000 rows like below:

x,y,z,x1,x2,x3,x4,x5,x6,x7,x8,x9,...(about 1700 more)...,x1700
0,0,0,a1,a2,a3,a4,a5,a6,a7,a8,a9,...(about 1700 more)...,a1700
1,1,1,b1,b2,b3,b4,b5,b6,b7,b8,b9,...(about 1700 more)...,b1700
// (about 40000 more rows below)

I need to split this CSV file into multiple files which contain a less number of columns like:

# file1.csv
x,y,z
0,0,0
1,1,1
... (about 40000 more rows below)

# file2.csv
x1,x2,x3,x4,x5,x6,x7,x8,x9,...(about 1000 more)...,x1000
a1,a2,a3,a4,a5,a6,a7,a8,a9,...(about 1000 more)...,a1000
b1,b2,b3,b4,b5,b6,b7,b8,b9,...(about 1000 more)...,b1700
// (about 40000 more rows below)

#file3.csv
x1001,x1002,x1003,x1004,x1005,...(about 700 more)...,x1700
a1001,a1002,a1003,a1004,a1005,...(about 700 more)...,a1700
b1001,b1002,b1003,b1004,b1005,...(about 700 more)...,b1700
// (about 40000 more rows below)

Is there any program or library doing this?

I've googled for it , but programs that I found only split a file by rows not by columns.

Or which language could I use to do this efficiently?

I can use R, shell script, Python, C/C++, Java

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

A one-line solution for your example data and desired output:

cut -d, -f -3 huge.csv > file1.csv
cut -d, -f 4-1004 huge.csv > file2.csv
cut -d, -f 1005- huge.csv > file3.csv

The cut program is available on most POSIX platforms and is part of GNU Core Utilities. There is also a Windows version.

update in python, since the OP asked for a program in an acceptable language:

# python 3 (or python 2, if you must)
import csv
import fileinput

output_specifications = (  # csv file name, selector function
    ('file1.csv', slice(3)),
    ('file2.csv', slice(3, 1003)),
    ('file3.csv', slice(1003, 1703)),
)
output_row_writers = [
    (
        csv.writer(open(file_name, 'wb'), quoting=csv.QUOTE_MINIMAL).writerow,
        selector,
    ) for file_name, selector in output_specifications
]

reader = csv.reader(fileinput.input())
for row in reader:
    for row_writer, selector in output_row_writers:
        row_writer(row[selector])

This works with the sample data given and can be called with the input.csv as an argument or by piping from stdin.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...