Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - ExcelFile Vs. read_excel in pandas

I'm diving into pandas and experimenting. When it comes to reading data from an Excel file, I wonder what the difference is between using ExcelFile and read_excel. Both seem to work (albeit with slightly different syntax, as could be expected), and the documentation supports both. In both cases, the documentation describes the method the same way: "Read an Excel table into DataFrame" and "Read an Excel table into a pandas DataFrame". (documentation for read_excel, and for ExcelFile)

I'm seeing answers here on SO that use either, without addressing the difference. Also, a Google search didn't turn up anything that discusses this issue.

WRT my testing, these seem equivalent:

import pandas as pd

path = "test/dummydata.xlsx"
xl = pd.ExcelFile(path)      # open the workbook
df = xl.parse("dummydata")   # read one sheet by name

and

path = "test/dummydata.xlsx"
# sheet_name=0 reads the first sheet; the keyword was spelled sheetname before pandas 0.21
df = pd.read_excel(path, sheet_name=0)

Other than the fact that the latter saves me a line, is there a difference between the two, and is there a reason to prefer one over the other?

Thanks!

1 Reply

There's no particular difference beyond the syntax. Technically, ExcelFile is a class and read_excel is a function. In either case, the actual parsing is handled by the _parse_excel method defined within ExcelFile.

In earlier versions of pandas, read_excel consisted entirely of a single statement (other than comments):

return ExcelFile(path_or_buf,kind=kind).parse(sheetname=sheetname,
                                              kind=kind, **kwds)

And ExcelFile.parse didn't do much more than call ExcelFile._parse_excel.

In recent versions of pandas, read_excel ensures that it has an ExcelFile object (and creates one if it doesn't), and then calls the _parse_excel method directly:

if not isinstance(io, ExcelFile):
    io = ExcelFile(io, engine=engine)

return io._parse_excel(...)

and with the updated (and unified) parameter handling, ExcelFile.parse really is just the single statement:

return self._parse_excel(...)

That is why the docs for ExcelFile.parse now say

Equivalent to read_excel(ExcelFile, ...) See the read_excel docstring for more info on accepted parameters
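To make the delegation described above concrete, here is a toy model in plain Python. It is not pandas source code: Workbook and read_sheet are hypothetical stand-ins for ExcelFile and read_excel, and the "workbook" is just a dict of lists. It only illustrates the pattern (the function wraps its input if needed, then both entry points call the same private method) and why reusing the object avoids repeated opens.

```python
class Workbook:
    """Hypothetical stand-in for ExcelFile: constructing it is the costly step."""

    opened = 0  # counts how many Workbook objects have been created

    def __init__(self, sheets):
        Workbook.opened += 1
        self._sheets = dict(sheets)

    @property
    def sheet_names(self):
        return list(self._sheets)

    def _parse(self, sheet_name):
        # The single place where a sheet is actually read.
        return list(self._sheets[sheet_name])

    def parse(self, sheet_name):
        # The method is a thin wrapper around _parse.
        return self._parse(sheet_name)


def read_sheet(io, sheet_name):
    """Hypothetical stand-in for read_excel: wrap io if needed, then delegate."""
    if not isinstance(io, Workbook):
        io = Workbook(io)
    return io._parse(sheet_name)


data = {"first": [1, 2], "second": [3]}

# Passing the raw input on every call creates a new Workbook each time.
Workbook.opened = 0
for name in data:
    read_sheet(data, name)
opens_without_reuse = Workbook.opened

# Creating the Workbook once and passing it in creates it a single time.
Workbook.opened = 0
wb = Workbook(data)
results = {name: read_sheet(wb, name) for name in wb.sheet_names}
opens_with_reuse = Workbook.opened
```

In this sketch the function and the method return the same thing for the same sheet; the only cost difference comes from how often the wrapper object gets built.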

As for another answer which claims that ExcelFile.parse is faster in a loop, that really just comes down to whether you are creating the ExcelFile object from scratch every time. You could certainly create your ExcelFile once, outside the loop, and pass that to read_excel inside your loop:

xl = pd.ExcelFile(path)
for name in xl.sheet_names:
    df = pd.read_excel(xl, name)

This would be equivalent to

xl = pd.ExcelFile(path)
for name in xl.sheet_names:
    df = xl.parse(name)

If your loop involves different paths (in other words, you are reading many different workbooks, not just multiple sheets within a single workbook), then you can't get around having to create a brand-new ExcelFile instance for each path anyway, and then once again, both ExcelFile.parse and read_excel will be equivalent (and equally slow).
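If you want to check the equivalence yourself, a minimal sketch along these lines should do. It assumes a reasonably recent pandas with openpyxl installed for .xlsx support; the temporary workbook, its sheet names ("first", "second") and the column names are made up for the example.

```python
import os
import tempfile

import pandas as pd

# Build a small two-sheet workbook to read back (illustrative data only).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "dummydata.xlsx")
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_excel(
        writer, sheet_name="first", index=False
    )
    pd.DataFrame({"c": [5, 6]}).to_excel(
        writer, sheet_name="second", index=False
    )

# Open the workbook once, then read every sheet both ways.
with pd.ExcelFile(path) as xl:
    sheet_names = list(xl.sheet_names)
    via_function = {name: pd.read_excel(xl, sheet_name=name) for name in sheet_names}
    via_method = {name: xl.parse(name) for name in sheet_names}

# Compare the two results sheet by sheet.
all_equal = all(via_function[n].equals(via_method[n]) for n in sheet_names)
```

Using ExcelFile as a context manager closes the file handle when the block ends, which is handy when you loop over many workbooks.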
