Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

sed - R data.table fread command: how to read large files with irregular separators?

I have to work with a collection of 120 files of ~2 GB each (525600 lines x 302 columns). The goal is to compute some statistics and store the results in a clean SQLite database.

Everything works fine when my script imports the files with read.table(), but it's slow. So I tried fread, from the data.table package (version 1.9.2), but it gives me this error:

Error in fread(txt, header = T, select = c("YYY", "MM", "DD",  : 
Not positioned correctly after testing format of header row. ch=' '

The first 2 rows and 7 columns of my data look like this:

 YYYY MM DD HH mm             19490             40790
 1991 10  1  1  0      1.046465E+00      1.568405E+00

So there is a leading space at the beginning of each line, then only one space between the date columns, then an arbitrary number of spaces between the other columns.

I've tried a command like this to convert the runs of spaces into commas:

DT <- fread(
            paste("sed 's/\\s\\+/,/g'", txt),  # backslashes must be doubled inside an R string
            header=T,
            select=c('HHHH','MM','DD','HH')
)

This did not work: the problem remains, and reading through the sed command seems slow.

fread doesn't seem to like an arbitrary number of spaces as the separator, or an empty column at the beginning of the line. Any ideas?

Here is a (hopefully) minimal reproducible example (there is a newline character after 40790):

txt<-print(" YYYY MM DD HH mm             19490             40790
 1991 10  1  1  0      1.046465E+00      1.568405E+00")

testDT<-fread(txt,
              header=T,
              select=c("YYY","MM","DD","HH")
)

Thanks for your help!

UPDATE: The error doesn't occur with data.table 1.8.*. With that version, the table is read as one single line, which is no better.

UPDATE 2: As mentioned in the comments, I could use sed to reformat the table first and then read the result with fread. I've posted a script in an answer that creates a sample dataset and then compares some system.time() timings.
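
As a rough illustration of the "format with sed first, then read" idea, here is a minimal sketch. It assumes the layout shown in the question (a leading space, then runs of one or more spaces between fields); the file names sample.txt and sample.csv are hypothetical, and the script does not measure whether this is faster than read.table().

```shell
# Hypothetical sample file reproducing the layout from the question.
printf '%s\n' \
  ' YYYY MM DD HH mm             19490             40790' \
  ' 1991 10  1  1  0      1.046465E+00      1.568405E+00' > sample.txt

# Three POSIX sed expressions, applied in order to every line:
#   1) remove the leading spaces (otherwise they become an empty first column),
#   2) remove any trailing spaces,
#   3) replace each remaining run of spaces with a single comma.
sed -e 's/^ *//' -e 's/ *$//' -e 's/  */,/g' sample.txt > sample.csv

cat sample.csv
```

The converted file could then be read from R with fread("sample.csv"). Stripping the leading spaces before the substitution matters: replacing every run of whitespace directly, as in the attempt above, turns the leading space into a comma and produces an empty first field.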


1 Reply

This is now fixed in the development version, v1.9.5: fread() gains a strip.white argument, which defaults to TRUE (unlike base::read.table(), where it defaults to FALSE) because that is the more desirable behaviour. The example data from the question has been added to the tests.

With this recent commit:

require(data.table) # v1.9.5, commit 0e7a835 or more recent
ans <- fread(" YYYY MM DD HH mm             19490             40790
   1991 10  1  1  0      1.046465E+00      1.568405E+00")
#      V1 V2 V3 V4 V5           V6           V7
# 1: YYYY MM DD HH mm 19490.000000 40790.000000
# 2: 1991 10  1  1  0     1.046465     1.568405
sapply(ans, class)
#          V1          V2          V3          V4          V5          V6          V7 
# "character" "character" "character" "character" "character"   "numeric"   "numeric" 
