Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

windows 10 - Tensorflow crashes with CUBLAS_STATUS_ALLOC_FAILED

I'm running tensorflow-gpu on Windows 10 with a simple MNIST neural network program. When the program runs, it fails with a CUBLAS_STATUS_ALLOC_FAILED error. A Google search didn't turn up anything useful.

I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 970
major: 5 minor: 2 memoryClockRate (GHz) 1.253
pciBusID 0000:0f:00.0
Total memory: 4.00GiB
Free memory: 3.31GiB
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:906] DMA: 0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:916] 0:   Y
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970, pci bus id: 0000:0f:00.0)
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\cuda\cuda_blas.cc:372] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\stream.cc:1390] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
  File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1021, in _do_call
    return fn(*args)
  File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\client\session.py", line 1003, in _run_fn
    status, run_metadata)
  File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python35\lib\contextlib.py", line 66, in __exit__
    next(self.gen)
  File "C:\Users\Anonymous\AppData\Local\Programs\Python\Python35\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InternalError: Blas SGEMM launch failed : a.shape=(100, 784), b.shape=(784, 256), m=100, n=256, k=784
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](_recv_Placeholder_0/_7, Variable/read)]]
         [[Node: Mean/_15 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_35_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

1 Reply

With TensorFlow 2.2, none of the other answers fixed the CUBLAS_STATUS_ALLOC_FAILED error for me. I found a solution in the TensorFlow GPU guide at https://www.tensorflow.org/guide/gpu:

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

I ran this code before any other computation, and the same program that had raised the CUBLAS error then worked in the same session. The sample is written to enable memory growth on every physical GPU, but it also fixes the allocation failure on a single GPU: with memory growth enabled, TensorFlow allocates GPU memory as it is needed instead of reserving nearly all of it at startup.
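
If you would rather not touch the program's setup code, the same guide also describes an environment variable, TF_FORCE_GPU_ALLOW_GROWTH, that turns on the same memory-growth behaviour. Below is a minimal sketch of that variant, not something I have tested on the asker's machine. It assumes the variable is set before TensorFlow initializes the GPU. The count_visible_gpus helper and the ImportError fallback are my own illustrative additions and are not part of TensorFlow.

import os

# Assumption: this must be set before TensorFlow initializes the GPU,
# so it is done here, ahead of the import.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

try:
    import tensorflow as tf
except ImportError:
    tf = None  # lets the sketch run on a machine without TensorFlow


def count_visible_gpus():
    """Hypothetical helper: number of GPUs TensorFlow sees (0 if TensorFlow is absent)."""
    if tf is None:
        return 0
    return len(tf.config.experimental.list_physical_devices('GPU'))


print("Visible GPUs:", count_visible_gpus())

The variable can also be set in the shell or in the Windows environment settings before launching Python, which avoids editing the script at all. The guide notes that this setting is platform specific, so it is worth confirming that it takes effect on your setup.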

