Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

c# - How can I improve the performance of retrieving values from SharedStringTable in OpenXml Excel spreadsheet tools?

I'm using DocumentFormat.OpenXml to read an Excel spreadsheet. I have a performance bottleneck with the code used to look up the cell value from the SharedStringTable object (it seems to be some sort of lookup table for cell values):

var returnValue = sharedStringTablePart.SharedStringTable.ChildElements.GetItem(parsedValue).InnerText;

I've created a dictionary to ensure I only retrieve a value once:

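// Return the cached string if this index has been seen before.
if (dictionary.TryGetValue(parsedValue, out var cachedValue))
{
    return cachedValue;
}

// Otherwise fetch it from the shared string table and cache it.
var fetchedValue = sharedStringTablePart.SharedStringTable.ChildElements.GetItem(parsedValue).InnerText;
dictionary.Add(parsedValue, fetchedValue);
return fetchedValue;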

This has cut the running time by almost 50%. However, my metrics indicate that the line fetching the value from the SharedStringTable object still takes 208 seconds to execute 123,951 times. Is there any other way of optimising this operation?


1 Reply


I would read the whole shared string table into your dictionary in one go rather than looking up each value as required. This lets you move through the table once, in order, and stash the values ready for a hashed lookup, which is more efficient than scanning the SST for each value you require.

Running something like the following at the start of your process will allow you to access each value using dictionary[parsedValue].

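private static void LoadDictionary()
{
    // Shared string indexes are zero-based and follow document order.
    int i = 0;

    // Walk the table once, caching each item's text under its index.
    foreach (var ss in sharedStringTablePart.SharedStringTable.ChildElements)
    {
        dictionary.Add(i++, ss.InnerText);
    }
}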
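
To tie this back to reading cells, the lookup could be wrapped along the following lines. This is only a sketch: CellText and Resolve are names made up for illustration, and it assumes the dictionary is a Dictionary<int, string> that has already been filled as above.

```csharp
using System.Collections.Generic;
using DocumentFormat.OpenXml.Spreadsheet;

public static class CellText
{
    // Hypothetical helper: resolves a cell's text using a dictionary that was
    // filled once from the shared string table (index -> text).
    public static string Resolve(Cell cell, IReadOnlyDictionary<int, string> sharedStrings)
    {
        if (cell == null)
        {
            return string.Empty;
        }

        // Cells without a <v> element (empty or inline-string cells) have no
        // CellValue; fall back to whatever text the cell contains.
        if (cell.CellValue == null)
        {
            return cell.InnerText;
        }

        string raw = cell.CellValue.Text;

        // Only cells typed as SharedString hold an index into the table;
        // other cells carry their value directly.
        bool isShared = cell.DataType != null
            && cell.DataType.Value == CellValues.SharedString;

        if (isShared
            && int.TryParse(raw, out int index)
            && sharedStrings.TryGetValue(index, out string text))
        {
            return text;
        }

        return raw;
    }
}
```

You would then call CellText.Resolve(cell, dictionary) for each cell as you walk the sheet. Because shared string indexes are contiguous and start at zero, a List<string> indexed by position would probably serve just as well as a dictionary here.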

If your file is very large, you might see some gains using a SAX approach to read the file rather than the DOM approach above:

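private static void LoadDictionarySax()
{
    // Stream through the part instead of loading the whole table into memory.
    using (OpenXmlReader reader = OpenXmlReader.Create(sharedStringTablePart))
    {
        int i = 0;
        while (reader.Read())
        {
            if (reader.ElementType == typeof(SharedStringItem))
            {
                // Load only the current item; it can be collected once its text is stored.
                SharedStringItem ssi = (SharedStringItem)reader.LoadCurrentElement();

                // InnerText also covers rich-text items, where ssi.Text is null,
                // and matches what the DOM version above stores.
                dictionary.Add(i++, ssi.InnerText);
            }
        }
    }
}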

On my machine, using a file with 60,000 rows and 2 columns, the LoadDictionary method above was around 300 times quicker than the GetValue method from your question. The LoadDictionarySax method gave similar performance on that file, but on a larger file (100,000 rows with 10 columns) the SAX approach was around 25% faster than the LoadDictionary method. On an even larger file (100,000 rows, 26 columns), the LoadDictionary method threw an out-of-memory exception, whereas the LoadDictionarySax method worked without issue.
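
If you want to repeat this comparison on your own files, a small timing wrapper along these lines should be enough. It is a rough sketch rather than a proper benchmark: LoadTimer and Measure are hypothetical names, and a single run takes no account of JIT warm-up or disk caching.

```csharp
using System;
using System.Diagnostics;

public static class LoadTimer
{
    // Runs the given action once and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Action action)
    {
        if (action == null)
        {
            throw new ArgumentNullException(nameof(action));
        }

        var stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

For example, LoadTimer.Measure(LoadDictionary) and LoadTimer.Measure(LoadDictionarySax) would time the two loaders. Clear the dictionary between runs, otherwise the second call to Add will throw on the duplicate keys.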

