Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
289 views
in Technique[技术] by (71.8m points)

python - inconsistent record marker while reading fortran unformatted file

I'm trying to read a very big Fortran unformatted binary file with python. This file contains 2^30 integers.

I find that the record markers is confusing (the first one is -2147483639), anyway I have achieved to recover the data structure ( those wanted integers are all similar, thus differ from record markers ) and write the code below ( with help of here ).

However, we can see the markers at the begin and the end of each record are not the same. Why is that?

Is it because the size of the data is too long ( 536870910 = (2^30 - 4) / 2 ) ? But ( 2^31 - 1 ) / 4 = 536870911 > 536870910.

Or just some mistakes made by the author of the data file?

Another question, what's the type of the marker at begin of a record , int or unsigned int?

fp = open(file_path, "rb")

rec_len1, = struct.unpack( '>i', fp.read(4) )
data1 = np.fromfile( fp, '>i', 536870910)
rec_end1, = struct.unpack( '>i', fp.read(4) )

rec_len2, = struct.unpack( '>i', fp.read(4) )
data2 = np.fromfile( fp, '>i', 536870910)
rec_end2, = struct.unpack( '>i', fp.read(4) )

rec_len3, = struct.unpack( '>i', fp.read(4) )
data3 = np.fromfile( fp, '>i', 4)
rec_end3, = struct.unpack( '>i', fp.read(4) )
data = np.concatenate([data1, data2, data3])

(rec_len1,rec_end1,rec_len2,rec_end2,rec_len3,rec_end3)

here's the values of record lenth readed as showed above:

(-2147483639, -2176, 2406, 589824, 1227787, -18)
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Finally, things seem to be more clear.

Here is a Intel Fortran Compiler User and Reference Guides, see the section Record Types:Variable-Length Records.

For a record length greater than 2,147,483,639 bytes, the record is divided into subrecords. The subrecord can be of any length from 1 to 2,147,483,639, inclusive.

The sign bit of the leading length field indicates whether the record is continued or not. The sign bit of the trailing length field indicates the presence of a preceding subrecord. The position of the sign bit is determined by the endian format of the file.

A subrecord that is continued has a leading length field with a sign bit value of 1. The last subrecord that makes up a record has a leading length field with a sign bit value of 0. A subrecord that has a preceding subrecord has a trailing length field with a sign bit value of 1. The first subrecord that makes up a record has a trailing length field with a sign bit value of 0. If the value of the sign bit is 1, the length of the record is stored in twos-complement notation.

After many essays, I realized that I was mislead by twos-complement notation, the record marker just change the sign according to the rules above, instead changing to its twos-complement notation when the sign bit is 1. Anyway it's also possible that my data was created with a diffrent compiler.

Below is the solution.

The data is larger than 2GB, so it's devided into several subrecords. As we see the first record start marker is -2147483639, so the lenth of the first record is 2147483639 which is exactly the maximum length of subrecord, not 2147483640 as I thought nor 2147483638 the twos-complement notation of -2147483639.

If we skip 2147483639 bytes to read the record end marker, you will get 2147483639, as it's the first subrecord whose end marker is positive.

Below is the code to check the record markers:

fp = open(file_path, "rb")
while 1:
    prefix, = struct.unpack( '>i', fp.read(4) )
    fp.seek(abs(prefix), 1)    #or read |prefix| bytes data as you want
    suffix, = struct.unpack( '>i', fp.read(4) )
    print prefix, suffix
    if abs(suffix) - abs(prefix): 
        print "suffix != prefix!"
        break
    if prefix > 0: break

And screen prints

-2147483639 2147483639
-2147483639 -2147483639
18 -18

We can see the record begin marker and end marker always are the same except the sign. Length of the three records are 2147483639, 2147483639, 18 bytes, not nessary to be multiple of 4. So the first record ends with the first 3 bytes of certain integer and the second record begins with the rest 1 byte.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...