c# - .NET Performance: Large CSV Read, Remap, Write Remapped

I've done some research and found that the most efficient way for me to read and write multi-gigabyte (5+ GB) files is to use something like the following code:

using (FileStream fs = File.Open(file, FileMode.Open, FileAccess.Read, FileShare.Read))
using (BufferedStream bs = new BufferedStream(fs, 256 * 1024))
using (StreamReader sr = new StreamReader(bs, Encoding.ASCII, false, 256 * 1024))
using (StreamWriter sw = new StreamWriter(outputFile, true, Encoding.Unicode, 256 * 1024))
{
    string line;

    while (sr.BaseStream != null && (line = sr.ReadLine()) != null)
    {
        //Try to clean csv then split (verbatim strings so the embedded quotes are legal C#)
        line = Regex.Replace(line, @"[\s\dA-Za-z][""][\s\dA-Za-z]", "");
        string[] fields = Regex.Split(line, @",(?=(?:[^""]*""[^""]*"")*[^""]*$)");
        //I know there are libraries for this that I will switch out
        //when I have time to create the classes as it seems they all
        //require a mapping class

        //Remap 90-250 properties
        object myObj = ObjectMapper(fields);

        //Write line
        bool success = ObjectWriter(myObj);
    }
}

CPU usage is averaging around 33% for each of 3 instances on an Intel Xeon 2.67 GHz. With 3 instances running, I was able to output 2 files, each just under 7 GB, in ~26 hrs, using:

Parallel.Invoke(
    () => new Worker().DoWork(args[0]),
    () => new Worker().DoWork(args[1]),
    () => new Worker().DoWork(args[2])
);

The third instance is generating a MUCH larger file; it is 34+ GB so far, and I'm coming up on day 3, ~67 hrs in.

From what I've read, I think performance may be increased slightly by tuning the buffer size down to a sweet spot.
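One way to look for that sweet spot (my own sketch, not part of the original question; sample.csv and the buffer sizes are illustrative) is to time a parse-free read pass over a sample of the data at several buffer sizes. Note that the OS file cache will skew repeated runs over the same file, so compare cold runs or use a sample larger than RAM:

using System;
using System.Diagnostics;
using System.IO;
using System.Text;

class BufferBenchmark
{
    static void Main()
    {
        string sample = "sample.csv"; // hypothetical sample file cut from the real data

        foreach (int size in new[] { 4 * 1024, 64 * 1024, 256 * 1024, 1024 * 1024 })
        {
            Stopwatch timer = Stopwatch.StartNew();
            using (FileStream fs = File.Open(sample, FileMode.Open, FileAccess.Read, FileShare.Read))
            using (StreamReader sr = new StreamReader(fs, Encoding.ASCII, false, size))
            {
                while (sr.ReadLine() != null) { } // read-only pass; no parsing, so the timing isolates I/O
            }
            timer.Stop();
            Console.WriteLine("{0,8} byte buffer: {1} ms", size, timer.ElapsedMilliseconds);
        }
    }
}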

My questions are:

  1. Based on what is stated, is this typical performance?
  2. Besides what I mentioned above, are there any other improvements you can see?
  3. Are the CSV mapping and reading libraries much faster than regex?


1 Reply


So, first of all, you should profile your code to identify bottlenecks.

Visual Studio comes with a built-in profiler for this purpose, which can clearly identify hot-spots in your code.

Given that your process is CPU bound, this is likely to prove very effective.
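If a full profiler run isn't convenient, a cruder alternative (my sketch, not from the original answer) is to accumulate a Stopwatch per stage of the loop and see where the time actually goes:

using System;
using System.Diagnostics;

class StageTimings
{
    static void Main()
    {
        Stopwatch splitTime = new Stopwatch();
        Stopwatch mapTime = new Stopwatch();
        Stopwatch writeTime = new Stopwatch();

        for (int i = 0; i < 1000000; i++) // stand-in for the ReadLine loop
        {
            splitTime.Start();
            // ... regex clean + split the line here ...
            splitTime.Stop();

            mapTime.Start();
            // ... ObjectMapper(fields) here ...
            mapTime.Stop();

            writeTime.Start();
            // ... ObjectWriter(myObj) here ...
            writeTime.Stop();
        }

        Console.WriteLine("split: {0} ms, map: {1} ms, write: {2} ms",
            splitTime.ElapsedMilliseconds, mapTime.ElapsedMilliseconds,
            writeTime.ElapsedMilliseconds);
    }
}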

However, if I had to guess at why it's slow, I would imagine it's because you are not re-using your regexes. A regex is relatively expensive to construct, so re-using it can see massive performance improvements.

var regex1 = new Regex(@"[\s\dA-Za-z][""][\s\dA-Za-z]", RegexOptions.Compiled);
var regex2 = new Regex(@",(?=(?:[^""]*""[^""]*"")*[^""]*$)", RegexOptions.Compiled);
while (sr.BaseStream != null && (line = sr.ReadLine()) != null)
{
    //Try to clean csv then split
    line = regex1.Replace(line, ""); 
    string[] fields = regex2.Split(line);
    //I know there are libraries for this that I will switch out 
    //when I have time to create the classes as it seems they all
    //require a mapping class

    //Remap 90-250 properties
    object myObj = ObjectMapper(fields);

    //Write line
    bool success = ObjectWriter(myObj);
}

However, I would strongly encourage you to use a library like Linq2Csv - it will likely be more performant, as it will have had several rounds of performance tuning, and it will handle edge-cases that your code doesn't.
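As a rough sketch of what that switch might look like (assuming the LINQtoCSV NuGet package; InputRow and input.csv are illustrative names):

using System.Collections.Generic;
using LINQtoCSV;

class InputRow
{
    // One property per CSV column; FieldIndex is 1-based
    [CsvColumn(FieldIndex = 1)] public string First { get; set; }
    [CsvColumn(FieldIndex = 2)] public string Second { get; set; }
    // ... remaining properties, one per column ...
}

class LibraryExample
{
    static void Main()
    {
        CsvFileDescription description = new CsvFileDescription
        {
            SeparatorChar = ',',
            FirstLineHasColumnNames = false
        };
        CsvContext context = new CsvContext();

        // Read<T> streams rows lazily, so a 5+ GB file is never loaded whole
        IEnumerable<InputRow> rows = context.Read<InputRow>("input.csv", description);
        foreach (InputRow row in rows)
        {
            // remap and write each row here
        }
    }
}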

