Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

python - Can I access read_csv()'s dtype inference engine when creating a DataFrame from a nested list?

This follows from a discussion with piRSquared here, where I found that read_csv seems to have its own type-inference methods that appear to be broader in their ability to obtain the correct type. It also appears to be more fault-tolerant in the case of missing data, opting for NaN instead of raising ValueError as its default behaviour.

There are a lot of cases where the inferred datatypes are perfectly acceptable for my work, but this functionality doesn't seem to be exposed when instantiating a DataFrame, or anywhere else in the API that I can find, meaning that I have to deal with dtypes manually and unnecessarily. This can be tedious if you have hundreds of columns. The closest I can find is convert_objects(), but it doesn't handle the bools in this case. The alternative I could use is to dump to disk and read it back in, which is grossly inefficient.

The example below illustrates the default behaviour of read_csv vs. the default behaviour of the conventional methods for setting dtype (correct as of v0.20.3). Is there a way to access the type inference of read_csv without dumping to disk? More generally, is there a reason why read_csv behaves like this?

Example:

import numpy as np
import pandas as pd
import csv

data = [['string_boolean', 'numeric', 'numeric_missing'], 
        ['FALSE', 23, 50], 
        ['TRUE', 19, 12], 
        ['FALSE', 4.8, '']]

with open('my_csv.csv', 'w') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(data)

# Reading from CSV
df = pd.read_csv('my_csv.csv')
print(df.string_boolean.dtype) # Automatically converted to bool
print(df.numeric.dtype) # Float, as expected
print(df.numeric_missing.dtype) # Float, doesn't care about empty string

# Creating directly from list without supplying datatypes
df2 = pd.DataFrame(data[1:], columns=data[0])
df2.string_boolean = df2.string_boolean.astype(bool) # Doesn't work - every non-empty string becomes True
df2.numeric_missing = df2.numeric_missing.astype(np.float64) # Doesn't work - ValueError on the empty string

# Creating while forcing dtype doesn't work either:
# dtype accepts a single type for the whole frame, not a list per column
df3 = pd.DataFrame(data[1:], columns=data[0], 
                   dtype=[bool, np.float64, np.float64])

# The working method
df4 = pd.DataFrame(data[1:], columns=data[0])
df4.string_boolean = df4.string_boolean.map({'TRUE': True, 'FALSE': False})
df4.numeric_missing = pd.to_numeric(df4.numeric_missing)
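For the "hundreds of columns" concern mentioned above, the per-column conversions can at least be done in bulk. A minimal sketch (assuming you can list, or programmatically select, which columns should be numeric): `pd.to_numeric` with `errors='coerce'` applied across several columns turns unparseable values such as the empty string into NaN, mirroring read_csv's tolerance of missing data; booleans still need an explicit mapping.

```python
import pandas as pd

data = [['string_boolean', 'numeric', 'numeric_missing'],
        ['FALSE', 23, 50],
        ['TRUE', 19, 12],
        ['FALSE', 4.8, '']]

df = pd.DataFrame(data[1:], columns=data[0])

# Convert all numeric-looking columns in one pass; errors='coerce'
# replaces unparseable values (like the empty string) with NaN.
num_cols = ['numeric', 'numeric_missing']
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')

# Booleans still need an explicit mapping.
df['string_boolean'] = df['string_boolean'].map({'TRUE': True, 'FALSE': False})

print(df.dtypes)
```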


1 Reply


One solution is to use a StringIO object. The only difference is that it keeps all the data in memory, instead of writing it to disk and reading it back in.

Code is as follows (note: Python 3!):

import numpy as np
import pandas as pd
import csv
from io import StringIO

data = [['string_boolean', 'numeric', 'numeric_missing'],
        ['FALSE', 23, 50],
        ['TRUE', 19, 12],
        ['FALSE', 4.8, '']]

with StringIO() as fobj:
    writer = csv.writer(fobj)
    writer.writerows(data)
    fobj.seek(0)
    df = pd.read_csv(fobj)

print(df.head(3))
print(df.string_boolean.dtype) # Automatically converted to bool
print(df.numeric.dtype) # Float, as expected
print(df.numeric_missing.dtype) # Float, doesn't care about empty string

The with StringIO() as fobj isn't really necessary: fobj = StringIO() will work just as well. And since the context manager will close the StringIO() object when it goes out of scope, the df = pd.read_csv(fobj) has to be inside it.
Note also the fobj.seek(0), which is another necessity: your solution simply closes and reopens a file, which automatically sets the file pointer to the start of the file, whereas here the buffer has to be rewound explicitly.
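To illustrate the point about the context manager not being essential, here is the same round trip without `with` (a minimal sketch of the same data):

```python
import csv
import pandas as pd
from io import StringIO

data = [['string_boolean', 'numeric', 'numeric_missing'],
        ['FALSE', 23, 50],
        ['TRUE', 19, 12],
        ['FALSE', 4.8, '']]

# No context manager: create the buffer, write, rewind, read.
fobj = StringIO()
csv.writer(fobj).writerows(data)
fobj.seek(0)   # rewind; read_csv would otherwise see an empty stream
df = pd.read_csv(fobj)
fobj.close()   # close explicitly, since there is no `with` block

print(df.dtypes)
```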

A note on Python 2 vs Python 3

I actually tried to make the above code Python 2/3 compatible. That became a mess, for the following reason: Python 2 has an io module, just like Python 3, whose StringIO class makes everything unicode (in Python 2 as well; in Python 3 it is, of course, the default).
That is great, except that the csv module's writer in Python 2 is not unicode-compatible.
Thus, the alternative is to use the (older) Python 2 (c)StringIO module, for example as follows:

try:
    from cStringIO import StringIO
except ImportError:  # Python 3 (ModuleNotFoundError only exists from 3.6 on)
    from io import StringIO

and things will be plain text in Python 2, and unicode in Python 3.
Except that now, cStringIO.StringIO does not have a context manager, and the with statement will fail. As I mentioned, it is not really necessary, but I was keeping things as close as possible to your original code.
In other words, I could not find a nice way to stay close to the original code without ridiculous hacks.

I've also looked at avoiding the CSV writer completely, which leads to:

text = '\n'.join(','.join(str(item).strip("'") for item in items)
                 for items in data)

with StringIO(text) as fobj:
    df = pd.read_csv(fobj)

which is perhaps neater (though a bit less clear), and Python 2/3 compatible. (I don't expect it to work for everything that the csv module can handle, but here it works fine.)


Why can't pd.DataFrame(...) do the conversion?

Here, I can only speculate.

I would think the reasoning is that when the input is Python objects (dicts, lists), the input is known and in the hands of the programmer. Therefore, it is unlikely, perhaps even illogical, that that input would contain strings such as 'FALSE' or ''. Instead, it would normally contain the objects False and np.nan (or math.nan), since the programmer would already have taken care of the (string) translation.
Whereas for a file (CSV or other), the input can be anything: your colleague might send an Excel CSV file, or someone else sends you a Gnumeric CSV file. I don't know how standardised CSV files are, but you'd probably need some code to allow for exceptions, and overall for the conversion of the strings to Python (NumPy) format.

So in that sense, it is actually illogical to expect pd.DataFrame(...) to accept just anything: instead, it should accept something that is properly formatted.

You might argue for a convenience method that takes a list like yours, but a list is not a CSV file (which is just a bunch of characters, including newlines). Plus, I expect pd.read_csv has the option to read files in chunks (perhaps even line by line), which becomes harder if you feed it a string with newlines instead: you can't really read that line by line, since you would have to split it on newlines and keep all the lines in memory. And you already have the full string in memory somewhere, instead of on disk. But I digress.

Besides, the StringIO trick takes just a few lines to perform precisely this conversion.
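If you need it repeatedly, those few lines wrap naturally into a small helper. A sketch (the name frame_via_csv is made up for illustration; it is not part of pandas):

```python
import csv
import pandas as pd
from io import StringIO

def frame_via_csv(rows):
    """Build a DataFrame from a header row plus data rows, routing
    through read_csv so its dtype inference is applied.
    (Hypothetical helper, not a pandas API.)"""
    buf = StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)  # rewind before read_csv consumes the buffer
    return pd.read_csv(buf)

data = [['string_boolean', 'numeric', 'numeric_missing'],
        ['FALSE', 23, 50],
        ['TRUE', 19, 12],
        ['FALSE', 4.8, '']]

df = frame_via_csv(data)
print(df.dtypes)  # bool, float64, float64
```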
