Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - TFRecordReader seems extremely slow, and multi-threaded reading is not working

My training process uses the TFRecord format for the train and eval datasets.

I benchmarked the reader and got only 8000 records/second, with an I/O rate (as reported by iotop) of just 400-500 KB/s.

I'm using the C++ version of protobuf described here:

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/get_started/os_setup.md#protobuf-library-related-issues

If possible, provide a minimal reproducible example (We usually don't have time to read hundreds of lines of your code)

import tensorflow as tf

# filename_queue, batch_size, thread_number, capacity and min_after_dequeue
# are defined elsewhere in the benchmark.
def read_and_decode(filename_queue):
    # Read one serialized record per call from the filename queue.
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    return serialized_example

serialized_example = read_and_decode(filename_queue)
batch_serialized_example = tf.train.shuffle_batch(
    [serialized_example],
    batch_size=batch_size,
    num_threads=thread_number,
    capacity=capacity,
    min_after_dequeue=min_after_dequeue)
# Parse the whole batch of serialized examples at once.
features = tf.parse_example(
    batch_serialized_example,
    features={
        "label": tf.FixedLenFeature([], tf.float32),
        "ids": tf.VarLenFeature(tf.int64),
        "values": tf.VarLenFeature(tf.float32),
    })

What other attempted solutions have you tried?

I tried setting num_threads in tf.train.shuffle_batch, but it did not help.

With 2 threads it runs at 8000 records/s, and it gets slower as I increase the thread count. (I removed all CPU-heavy ops, so it only reads data.)

My server has 24 CPU cores.


1 Reply


The issue here is that each session.run call has a fixed overhead, so filling the queue with many tiny examples, one per call, is slow.

In particular, each session.run takes about 100-200 usec, so you can only do about 5k-10k session.run calls per second.
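As a rough check on that figure, the per-call overhead can be turned into a ceiling on throughput. This is plain arithmetic rather than a measurement, and the function name max_calls_per_second is made up for illustration.

```python
def max_calls_per_second(overhead_usec):
    """Upper bound on sequential calls given a fixed per-call cost in usec."""
    return 1_000_000 / overhead_usec

# Assuming one record is enqueued per call, as in the question's code,
# this is also the ceiling on records per second.
fast = max_calls_per_second(100)   # 100 usec per call
slow = max_calls_per_second(200)   # 200 usec per call
print(f"{slow:.0f} to {fast:.0f} calls per second")
```

The observed 8000 records/second falls inside that range.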

This problem is obvious in a Python profile (python -m cProfile), but hard to see if you start from a timeline profile or a CPU profile.
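A minimal sketch of that kind of Python-level profiling, using only the standard library. The functions tiny_call and read_loop are hypothetical stand-ins for a loop of cheap session.run calls; they are not part of the original benchmark.

```python
import cProfile
import io
import pstats

def tiny_call():
    # Hypothetical stand-in for one cheap session.run call.
    return sum(range(10))

def read_loop(n):
    # Issue many tiny calls, one per record.
    for _ in range(n):
        tiny_call()

profiler = cProfile.Profile()
profiler.enable()
read_loop(10000)
profiler.disable()

# Sort by cumulative time and print the top entries to a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The ncalls column of the report is what makes a large number of tiny calls stand out.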

The work-around is to use enqueue_many to add things to your queue in batches. I took your benchmark from https://gist.github.com/ericyue/7705407a88e643f7ab380c6658f641e8 and modified it to enqueue many items per .run call, which gives a 10x speed-up.

The change is to the tf.train.shuffle_batch call, as follows:

if enqueue_many:
    reader = tf.TFRecordReader(options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.ZLIB))
    # Build enqueue_many_size read ops so that one .run call fetches many records.
    queue_batch = []
    for i in range(enqueue_many_size):
        _, serialized_example = reader.read(filename_queue)
        queue_batch.append(serialized_example)
    # enqueue_many=True treats the first dimension as a batch of examples.
    batch_serialized_example = tf.train.shuffle_batch(
        [queue_batch],
        batch_size=batch_size,
        num_threads=thread_number,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue,
        enqueue_many=True)

For complete source, check here: https://github.com/yaroslavvb/stuff/blob/master/ericyue-slowreader/benchmark.py
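To see why enqueuing in batches helps, here is a toy cost model, not a benchmark. It assumes each call pays a fixed overhead plus a per-record cost; the function name records_per_second and the 150 usec and 10 usec figures are assumptions chosen only for illustration.

```python
def records_per_second(records_per_call, call_overhead_usec, per_record_usec):
    """Toy cost model: each call pays a fixed overhead plus a per-record cost."""
    usec_per_call = call_overhead_usec + records_per_call * per_record_usec
    return records_per_call * 1_000_000 / usec_per_call

# Assumed numbers for illustration only: 150 usec per call, 10 usec per record.
single = records_per_second(1, 150, 10)
batched = records_per_second(100, 150, 10)
print(f"1 record per call:    {single:.0f} records/s")
print(f"100 records per call: {batched:.0f} records/s")
print(f"speed-up: {batched / single:.1f}x")
```

In this model the fixed overhead is amortized across the batch, so throughput approaches the limit set by the per-record cost as the batch grows.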

It's hard to optimize this to go much faster, since most of the time is now spent in queue operations. A stripped-down version that just adds integers to a queue reaches a similar speed, and the timeline shows that the time is spent in dequeue ops.


Each dequeue op takes about 60 usec, but on average 5 are running in parallel, so you get 12 usec per dequeue. That means you'll get <200k examples per second in the best case.
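The arithmetic behind that estimate can be sketched as follows. It assumes the parallel dequeue ops overlap perfectly, and the variable names are illustrative.

```python
dequeue_usec = 60      # duration of one dequeue op, as read from the timeline
parallel_ops = 5       # average number of dequeue ops running at once

# Effective cost per dequeue when the ops overlap perfectly.
effective_usec = dequeue_usec / parallel_ops
dequeues_per_second = 1_000_000 / effective_usec
print(f"{effective_usec:.0f} usec per dequeue, "
      f"about {dequeues_per_second:.0f} dequeues per second")
```

Under these assumptions the rate stays below the 200k bound quoted above.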

