Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Python - How to create a customized dataset for Google TensorFlow attention OCR?

I am able to create a TFRecord file according to this question, but I don't know whether I should write all images into a single TFRecord file or create multiple TFRecord files. Also, I don't quite understand the config file for datasets. What content should be in the "charset_filename" file? Should it be a collection of all possible characters in the dataset? When generating the TFRecord file, we converted characters to integer ids; should this file include the characters or their ids?


1 Reply


whether I should write all images into a single TFRecord file or create multiple TFRecord files

It depends on the size of the training data, and it affects how well the input pipeline can prefetch samples in parallel to fill its queues. I'd recommend ~1000 samples per shard, where a shard is a TFRecord file whose name ends with a num-of-total suffix, e.g. /path/to/my/dataset-00000-of-00512.
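As a rough illustration of the sharding scheme above, here is a minimal sketch that splits a list of samples into chunks of about 1000 and gives each chunk a num-of-total file name. The function name split_into_shards and the /tmp/dataset path are hypothetical, and the actual writing of each chunk with a TFRecord writer is left out so that the example stays standard-library only.

```python
import math


def split_into_shards(samples, base_path, samples_per_shard=1000):
    """Group samples into shards named <base_path>-<index>-of-<total>.

    Returns a list of (shard_file_name, samples_in_that_shard) pairs.
    Hypothetical helper for illustration; each chunk would then be
    written to its file with a TFRecord writer.
    """
    num_shards = max(1, math.ceil(len(samples) / samples_per_shard))
    shards = []
    for index in range(num_shards):
        # Zero-padded to five digits, matching dataset-00000-of-00512.
        name = f"{base_path}-{index:05d}-of-{num_shards:05d}"
        start = index * samples_per_shard
        shards.append((name, samples[start:start + samples_per_shard]))
    return shards


# 2500 samples at 1000 per shard give three shards (1000, 1000, 500).
shards = split_into_shards(list(range(2500)), "/tmp/dataset")
for name, chunk in shards:
    print(name, len(chunk))
```

The last shard is simply smaller than the others; if evenly sized shards matter for your setup, pick samples_per_shard so that it divides the dataset size.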

What content should be in "charset_filename" file?

It is a text file that defines the mapping between integer ids and the corresponding characters, one mapping per row, in the format <id><TAB><character>. One of the rows should define an id for the <nul> character, a special character that the model outputs once it has reached the end of the sequence, so that the output is padded to a fixed length.

For example, here is an excerpt from the FSNS dataset's charset file:

0    
133 <nul>
1   l
2   ’
3   é
4   t

Note that the <SPACE> character has id=0.
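To make the row format concrete, here is a minimal sketch of a parser for such a file. It is not the reader used by the attention OCR code; parse_charset is a hypothetical helper, and the input rows are the excerpt above with an assumed TAB between id and character. The one subtlety is that rows must not be stripped of whitespace, because the space character is itself a valid entry.

```python
def parse_charset(lines):
    """Parse rows of the form <id><TAB><character> into {id: character}.

    Hypothetical helper for illustration only.
    """
    charset = {}
    for line in lines:
        # Remove only the line ending; a trailing space may be the character.
        line = line.rstrip("\r\n")
        if not line:
            continue
        id_text, tab, char = line.partition("\t")
        if not tab or not id_text.isdigit() or char == "":
            raise ValueError(f"malformed charset row: {line!r}")
        charset[int(id_text)] = char
    return charset


# The FSNS excerpt above, assuming a TAB separates id and character.
rows = ["0\t \n", "133\t<nul>\n", "1\tl\n", "2\t’\n", "3\té\n", "4\tt\n"]
charset = parse_charset(rows)
print(repr(charset[0]))  # the space character
print(charset[133])      # the end-of-sequence padding entry
```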

Should it be a collection of all possible characters in the dataset?

Yes. This file should define id-to-character mappings for all characters in the dataset.

When generating the TFRecord file, we converted characters to integer ids; should this file include the characters or their ids?

Both. Each line in the file should be in the form <id><TAB><character>.
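Putting the answers together, here is a minimal sketch of how a charset could be derived from your own labels and then reused to encode text as fixed-length id sequences. All names here (build_charset, charset_rows, encode) are hypothetical, and the id assignment (sorted characters, with <nul> taking the last id) is only one possible choice. What matters is that the ids stored in the TFRecord files and the ids in the charset file come from the same mapping.

```python
NULL_TOKEN = "<nul>"


def build_charset(labels):
    """Assign an id to every distinct character found in the labels.

    Returns (char_to_id, null_id). The ordering is an arbitrary choice.
    """
    chars = sorted(set("".join(labels)))
    char_to_id = {char: index for index, char in enumerate(chars)}
    null_id = len(chars)  # one extra id reserved for padding
    return char_to_id, null_id


def charset_rows(char_to_id, null_id):
    """Render the mapping as <id><TAB><character> rows for the charset file."""
    ordered = sorted(char_to_id.items(), key=lambda item: item[1])
    rows = [f"{index}\t{char}" for char, index in ordered]
    rows.append(f"{null_id}\t{NULL_TOKEN}")
    return rows


def encode(text, char_to_id, null_id, max_len):
    """Convert text to ids and pad with null_id up to max_len."""
    if len(text) > max_len:
        raise ValueError("text is longer than max_len")
    ids = [char_to_id[char] for char in text]
    return ids + [null_id] * (max_len - len(ids))


labels = ["ab", "b a"]
char_to_id, null_id = build_charset(labels)
rows = charset_rows(char_to_id, null_id)
padded = encode("ab", char_to_id, null_id, max_len=4)
print(rows)
print(padded)
```

Writing the rows to disk with UTF-8 encoding and a newline after each row would give a file in the format described above.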



...