My starting point was a problem with NumPy's `loadtxt` function:

```python
X = np.loadtxt(filename, delimiter=",")
```

which raised a `MemoryError` inside `np.loadtxt(...)`. I googled it and came to this question on StackOverflow, which gave the following solution:
```python
import numpy as np

def iter_loadtxt(filename, delimiter=',', skiprows=0, dtype=float):
    def iter_func():
        with open(filename, 'r') as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                line = line.rstrip().split(delimiter)
                for item in line:
                    yield dtype(item)
        # Remember the width of the last row so the flat array
        # can be reshaped into 2-D afterwards.
        iter_loadtxt.rowlength = len(line)

    data = np.fromiter(iter_func(), dtype=dtype)
    data = data.reshape((-1, iter_loadtxt.rowlength))
    return data

data = iter_loadtxt('your_file.ext')
```
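For scale, here is a rough back-of-the-envelope check (my own assumption, using float64 and the file dimensions I describe further down) of why plain `np.loadtxt` running out of memory seems plausible to me on 32-bit Python:

```python
# Rough estimate, assuming float64 and ~200,000 rows x 1,000 columns
# (the larger of my two file kinds; see the format description below).
num_rows, num_cols = 200000, 1000
bytes_needed = num_rows * num_cols * 8        # 8 bytes per float64
print(bytes_needed / (1024.0 ** 3))           # ~1.49 GiB for the array alone

# A 32-bit process can address roughly 2 GB (up to ~4 GB at best),
# and np.loadtxt also builds temporary Python objects while parsing,
# so a MemoryError before the array is even finished seems plausible.
```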
So I tried that, but then encountered the following error message:
> data = data.reshape((-1, iter_loadtxt.rowlength))
> ValueError: total size of new array must be unchanged
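If I understand the message, it means the total number of values yielded is not an exact multiple of `rowlength`, for example when rows have different lengths. A minimal sketch (hypothetical data, not my actual file) that reproduces the same error:

```python
import numpy as np

# 7 values in total, but a claimed row length of 4: 7 is not divisible
# by 4, so NumPy cannot reshape without changing the total size.
flat = np.arange(7, dtype=float)
flat.reshape((-1, 4))  # ValueError: total size of new array must be unchanged
```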
Then I tried to add the number of rows and the maximum number of columns to the code, using the fragments below, which I partly took from another question and partly wrote myself:
```python
num_rows = 0
max_cols = 0
with open(filename, 'r') as infile:
    for line in infile:
        num_rows += 1
        tmp = line.split(",")
        if len(tmp) > max_cols:
            max_cols = len(tmp)

def iter_func():
    # (body unchanged from the version above)

data = np.fromiter(iter_func(), dtype=dtype, count=num_rows)
data = data.reshape((num_rows, max_cols))
```
But this still gave the same error message, even though I thought that should have fixed it. On the other hand, I'm not sure I'm calling `data.reshape(...)` correctly.
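Rereading the `np.fromiter` documentation, I wonder whether `count` should be the total number of scalars rather than the number of rows, since the iterator yields one value per cell. A sketch of what I mean (untested on my real files):

```python
# Guess: count is the TOTAL number of items np.fromiter reads from the
# iterator, and iter_func() yields one scalar per cell, not per row.
data = np.fromiter(iter_func(), dtype=dtype, count=num_rows * max_cols)
data = data.reshape((num_rows, max_cols))
```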
I commented out the line where `data.reshape(...)` is called to see what would happen. That gave this error message:
> ValueError: need more than 1 value to unpack
This happened at the first point where something is done with `X`, the variable this problem is all about.
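For what it's worth, I can reproduce that exact message with a plain 1-D array, which makes me think `X` ended up 1-D once the reshape was commented out (a hypothetical illustration, not the actual open-source code):

```python
import numpy as np

X = np.zeros(10)       # 1-D, as np.fromiter returns before any reshape
rows, cols = X.shape   # X.shape == (10,): unpacking two values from one
                       # raises ValueError: need more than 1 value to unpack
```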
I know this code can work on the input files I have, because I've seen it used with them. But I can't figure out why it fails for me. My best guess is that because I'm using a 32-bit Python version (on a 64-bit Windows machine), something goes wrong with memory that doesn't happen on other computers. But I'm not sure. For reference: I have 8 GB of RAM and a 1.2 GB file, yet my RAM is not full according to Task Manager.
What I want is for the open-source code I'm using to read and parse the given file just like `np.loadtxt(filename, delimiter=",")` does, but within my available memory. I know the code originally worked on Mac OS X and Linux, and to be more precise: "Mac OS X 10.9.2 and Linux (version 2.6.18-194.26.1.el5 (brewbuilder@norob.fnal.gov) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) 1 SMP Tue Nov 9 12:46:16 EST 2010)."
I don't care much about running time. Each file contains roughly 200,000 lines, with 100 or 1,000 items per line (depending on the input file: one kind always has 100, the other always has 1,000). Each item is a floating-point number with 3 decimals, possibly negated, and the items are separated by a comma and a space. For example: `[..] 0.194, -0.007, 0.004, 0.243, [..]`, so 100 or 1,000 of those items (of which you see 4 here) on each of the roughly 200,000 lines.
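Since the separator is a comma plus a space, splitting on `","` alone leaves a leading space on every item after the first. As far as I can tell `float()` tolerates that, but an empty field (e.g. from a trailing comma) would not parse. A quick check on a made-up sample line (my assumption of the format, not a line from the real file):

```python
line = "0.194, -0.007, 0.004, 0.243\n"
items = line.rstrip().split(",")
print(items)                       # ['0.194', ' -0.007', ' 0.004', ' 0.243']
print([float(x) for x in items])   # float() strips the leading spaces itself

# But an empty trailing field would break the dtype(item) call:
# float("")  raises  ValueError: could not convert string to float
```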
I'm using Python 2.7 because the open source code needs that.
Does anyone have a solution for this? Thanks in advance.