Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
498 views
in Technique[技术] by (71.8m points)

ruby - How do I robustly parse malformed CSV?

I'm processing data from government sources (FEC, state voter databases, etc). It's inconsistently malformed, which breaks my CSV parser in all sorts of delightful ways.

It's externally sourced and authoritative. I must parse it, and I cannot have it re-input, validated on input, or the like. It is what it is; I don't control the input.

Properties:

  1. Fields contain malformed UTF-8 (e.g. Foo xAB bar)
  2. The first field of a line specifies the record type from a known set. Knowing the record type, you know how many fields there are and their respective data types, but not until you do.
  3. Any given line within a file might use quoted strings ("foo",123,"bar") or unquoted (foo,123,bar). I haven't yet encountered any where it's mixed within a given line (i.e. "foo",123,bar) but it's probably in there.
  4. Strings may include internal newline, quote, and/or comma character(s).
  5. Strings may include comma separated numbers.
  6. Data files can be very large (millions of rows), so this needs to still be reasonably fast.

I'm using Ruby FasterCSV (known as just CSV in 1.9), but the question should be language-agnostic.

My guess is that a solution will require preprocessing substitution with unambiguous record separator / quote characters (eg ASCII RS, STX). I've started a bit here but it doesn't work for everything I get.

How can I process this kind of dirty data robustly?

ETA: Here's a simplified example of what may be in single file:

"this","is",123,"a","normal","line"
"line","with "an" internal","quote"
"short line","with
an
"internal quote", 1 comma and
linebreaks"
un "quot" ed,text,with,1,2,3,numbers
"quoted","number","series","1,2,3"
"invalid xAB utf-8"
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

It is possible to subclass Ruby's File to process each line of the the CSV file before it is passed to the Ruby's CSV parser. For example, here's how I used this trick to replace non-standard backslash-escaped quotes " with standard double-quotes ""

class MyFile < File
  def gets(*args)
    line = super
    if line != nil
      line.gsub!('"','""')  # fix the " that would otherwise cause a parse error
    end
    line
  end
end

infile = MyFile.open(filename)
incsv = CSV.new(infile)

while row = incsv.shift
  # process each row here
end

You could in principle do all sorts of additional processing, e.g. UTF-8 cleanups. The nice thing about this approach is you handle the file on a line by line basis, so you don't need to load it all into memory or create an intermediate file.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...