Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
339 views
in Technique[技术] by (71.8m points)

python - Tensorflow Deep MNIST: Resource exhausted: OOM when allocating tensor with shape[10000,32,28,28]

This is the sample MNIST code I am running:

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

import tensorflow as tf
sess = tf.InteractiveSession()

x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])

W = tf.Variable(tf.zeros([784,10]))
b = tf.Variable(tf.zeros([10]))

y = tf.nn.softmax(tf.matmul(x,W) + b)

def weight_variable(shape):
  initial = tf.truncated_normal(shape, stddev=0.1)
  return tf.Variable(initial)

def bias_variable(shape):
  initial = tf.constant(0.1, shape=shape)
  return tf.Variable(initial)


def conv2d(x, W):
  return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
  return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1], padding='SAME')


W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])


x_image = tf.reshape(x, [-1,28,28,1])

h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])

h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])

h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])

y_conv=tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y_conv), reduction_indices=[1]))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

init = tf.initialize_all_variables()
config = tf.ConfigProto()
config.gpu_options.allocator_type = 'BFC'
with tf.Session(config = config) as s:
  sess.run(init)

for i in range(20000):
  batch = mnist.train.next_batch(50)
  if i%100 == 0:
    train_accuracy = accuracy.eval(feed_dict={
        x:batch[0], y_: batch[1], keep_prob: 1.0})
    print("step %d, training accuracy %g"%(i, train_accuracy))
  train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})

print("test accuracy %g"%accuracy.eval(feed_dict={
    x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

The GPU I am using is: GeForce GTX 750 Ti

Error:

...
...
...
step 19900, training accuracy 1
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (256):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (512):   Total Chunks: 1, Chunks in use: 0 768B allocated for chunks. 1.20MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (1024):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (2048):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (4096):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (8192):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (16384):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (32768):     Total Chunks: 1, Chunks in use: 0 36.8KiB allocated for chunks. 4.79MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (65536):     Total Chunks: 1, Chunks in use: 0 78.5KiB allocated for chunks. 4.79MiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (131072):    Total Chunks: 1, Chunks in use: 0 200.0KiB allocated for chunks. 153.1KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (262144):    Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (524288):    Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (1048576):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (2097152):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (4194304):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (8388608):   Total Chunks: 1, Chunks in use: 0 11.86MiB allocated for chunks. 390.6KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (16777216):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (33554432):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (67108864):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (134217728):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:639] Bin (268435456):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
I tensorflow/core/common_runtime/bfc_allocator.cc:656] Bin for 957.03MiB was 256.00MiB, Chunk State: 
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x601a40000 of size 1280
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x601a40500 of size 1280
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x601a40a00 of size 31488
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x601a48500 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x601a48600 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x601a48700 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x601a48800 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x601a48900 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x601a48a00 of size 4096
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x601a49a00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x601a49b00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x601a49c00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x601a49d00 of size 3328
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x601a4aa00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x601a4ab00 of size 204800
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x601a7cb00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x601a7cc00 of size 12845056
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6026bcc00 of size 4096
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6026bdc00 of size 40960
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6026c7c00 of size 31488
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6026cf700 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6026cf800 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6026cf900 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6026cfa00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6026cfb00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6026cfc00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6026cfd00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6026cfe00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6026cff00 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6026d0000 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6026d0100 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6026d0500 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6026d0600 of size 3328
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6026d1300 of size 40960
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6026db300 of size 80128
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x602702600 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x602734700 of size 204800
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x603342700 of size 4096
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x603343700 of size 3328
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x60334d700 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x60334d800 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x60334d900 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x60334da00 of size 3328
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x60334e700 of size 3328
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x60334f400 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x60334f500 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x60334f600 of size 204800
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x603381600 of size 204800
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6033b3600 of size 256
I tensorflow/core/common_runtime/bfc_allocator.cc:674] Chunk at 0x6033b3700 of size 256
I tensorflow/core/common_runtime/bfc_

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Here is how I solved this problem: the error means that the GPU runs out of memory during accuracy evaluation. Hence it needs a smaller sized dataset, which can be achieved by using data in batches. So, instead of running the code on the whole test dataset it needs to be run in batches as mentioned in this post: How to read data in batches when using TensorFlow

Hence, for accuracy evaluation on test dataset, instead of this loc :

print("test accuracy %g"%accuracy.eval(feed_dict={ x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0}))

this can be used :

for i in xrange(10):
    testSet = mnist.test.next_batch(50)
    print("test accuracy %g"%accuracy.eval(feed_dict={ x: testSet[0], y_: testSet[1], keep_prob: 1.0}))

When i ran 1000 epochs for training and used 10 batches of batch_size = 50 for accuracy evaluation, I got the following results:

step 0, training accuracy 0.04
step 100, training accuracy 0.88
step 200, training accuracy 0.9
step 300, training accuracy 0.88
step 400, training accuracy 0.94
step 500, training accuracy 0.96
step 600, training accuracy 0.94
step 700, training accuracy 0.96
step 800, training accuracy 0.9
step 900, training accuracy 1
test accuracy 1
test accuracy 0.92
test accuracy 1
test accuracy 1
test accuracy 0.94
test accuracy 0.96
test accuracy 0.92
test accuracy 0.96
test accuracy 0.92
test accuracy 0.94

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...