Running TensorFlow on the GPU on Windows

December 29, 2019

Hello. In this post I'll walk through getting TensorFlow to run on a GPU under Windows.

Goal

Set up an environment on a Windows machine in which the deep learning framework TensorFlow runs on the graphics card.

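Requirements

Buy a graphics card, install it properly, and make sure it shows up under Display adapters in Device Manager, as in the screenshot below.

In the screenshot, UHD 630 is the integrated GPU and GT 730 is the discrete graphics card.

Graphics card installed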

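Installing CUDA

CUDA is NVIDIA's platform and toolkit for running computations on the GPU, and TensorFlow's GPU build depends on its runtime libraries.

If you try to run TensorFlow without CUDA installed, it prints a warning like the one below.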

(tensorflow_gpu) C:\Users\mgim>python

Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 14:00:49) [MSC v.1915 64 bit (AMD64)] on win32

Type "help", "copyright", "credits" or "license" for more information.

>>> import tensorflow

2019-12-23 14:55:23.325444: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudart64_100.dll'; dlerror: cudart64_100.dll not found

So, install CUDA.

https://developer.nvidia.com/cuda-downloads

Download CUDA from the link above.

The latest release may not be supported by TensorFlow yet, so go through Legacy Releases and download an earlier version.

At the time of writing 10.2 is the latest version, but I downloaded 10.0.

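Why CUDA 10.0

As far as I could tell, cuDNN did not yet support 10.2, and with CUDA 10.1 you run into the error below. Note the first line of the log: this TensorFlow build looks for cudart64_100.dll, which is the CUDA 10.0 runtime, so a 10.1 install does not satisfy it. Choosing 10.0 saves you the trouble.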

(tensorflow_gpu2) C:\repo\GPUBenchmark>python simple.py

2019-12-26 11:10:19.193712: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudart64_100.dll'; dlerror: cudart64_100.dll not found

WARNING:tensorflow:From C:\Users\mgim\AppData\Local\Continuum\miniconda3\envs\tensorflow_gpu2\lib\site-packages\tensorflow_core\python\compat\v2_compat.py:65: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.

Instructions for updating:

non-resource variables are not supported in the long term

2019-12-26 11:10:25.977147: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2

2019-12-26 11:10:25.985561: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll

2019-12-26 11:10:26.012401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:

name: GeForce GT 730 major: 3 minor: 5 memoryClockRate(GHz): 0.9015

pciBusID: 0000:01:00.0

2019-12-26 11:10:26.017182: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.

2019-12-26 11:10:26.021950: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0

Traceback (most recent call last):

File "simple.py", line 58, in <module>

with tf.Session(config=tf.ConfigProto(log_device_placement=log_device_placement)) as sess:

File "C:\Users\mgim\AppData\Local\Continuum\miniconda3\envs\tensorflow_gpu2\lib\site-packages\tensorflow_core\python\client\session.py", line 1585, in __init__

super(Session, self).__init__(target, graph, config=config)

File "C:\Users\mgim\AppData\Local\Continuum\miniconda3\envs\tensorflow_gpu2\lib\site-packages\tensorflow_core\python\client\session.py", line 699, in __init__

self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)

tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: cudaGetErrorString symbol not found.
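The DLL name in the first warning of the log tells you which CUDA version the TensorFlow build expects. The sketch below extracts that version from a log line. The function name `cuda_version_from_log` is hypothetical, and the code assumes the `cudart64_<major><minor>.dll` naming seen in the log, with the last digit being the minor version.

```python
import re


def cuda_version_from_log(line):
    """Return the CUDA version implied by a cudart DLL name in a log line, or None."""
    match = re.search(r"cudart64_(\d{2,})\.dll", line)
    if match is None:
        return None
    digits = match.group(1)
    # Assumption: the last digit is the minor version, the rest is the major version.
    return f"{digits[:-1]}.{digits[-1]}"


line = "Could not load dynamic library 'cudart64_100.dll'; dlerror: cudart64_100.dll not found"
print(cuda_version_from_log(line))  # 10.0
```

If the version printed here differs from the CUDA version you installed, that mismatch is the first thing to fix.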

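Adding the environment variable

After installing CUDA as above, add its location to the PATH environment variable.

Windows installers usually seem to do this automatically, but for CUDA it looks like you have to set it yourself.

In my case, the path I had to add was

C:\Program Files\NVIDIA Corporation\NvStreamSrv

This is the directory that contains cudart64_100.dll. The location can differ from machine to machine (the CUDA Toolkit installer's default is C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin), so search for the DLL and add whichever directory actually contains it.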
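To confirm that the PATH change took effect, you can check which PATH directories contain the DLL. This is a minimal sketch using only the standard library; the helper name `dirs_containing` is made up for this example.

```python
import os


def dirs_containing(filename, path_value=None):
    """Return the PATH directories that contain `filename`."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for directory in path_value.split(os.pathsep):
        directory = directory.strip('"')  # PATH entries are sometimes quoted on Windows
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


found = dirs_containing("cudart64_100.dll")
print(found if found else "cudart64_100.dll is not on PATH")
```

Run it in a newly opened command prompt, since a prompt that was already open keeps the old PATH.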

Installing the graphics driver

If you run the code again at this point, you get another error, shown below. It says CUDA_ERROR_UNKNOWN, and installing the graphics driver fixes it.

2019-12-24 15:18:55.320413: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll

2019-12-24 15:18:55.369889: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error

2019-12-24 15:18:55.382782: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: D-SEL-16506841

2019-12-24 15:18:55.387027: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: D-SEL-16506841

https://www.nvidia.co.kr/Download/index.aspx?lang=kr

On the official site above, select the options that match your system and install the driver.

Without a graphics driver, running the nvidia-smi command returns the following.

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running. This can also be happening if non-NVIDIA GPU is running as primary display, and NVIDIA GPU is in WDDM mode.
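The same check can be scripted. The sketch below calls nvidia-smi from Python and returns None when the tool is missing or exits with an error. It assumes nvidia-smi is on PATH, which is not guaranteed on every Windows install, and the function name `query_nvidia_driver` is hypothetical.

```python
import shutil
import subprocess


def query_nvidia_driver():
    """Return a list of 'name, driver_version' strings, or None if unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None  # nvidia-smi is not on PATH
    try:
        result = subprocess.run(
            [exe, "--query-gpu=name,driver_version", "--format=csv,noheader"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None  # e.g. nvidia-smi could not talk to the driver
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


info = query_nvidia_driver()
print(info if info is not None else "No working NVIDIA driver detected")
```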

Installing cuDNN

The TensorFlow code seems to run fine at first, but partway through it fails as shown below.

2019-12-26 14:24:10.581710: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure

2019-12-26 14:24:10.588119: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1

https://developer.nvidia.com/rdp/cudnn-archive

cuDNN

From the site above, download the cuDNN build that matches CUDA 10.0, then copy the downloaded files into the corresponding folders under the CUDA install path.
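The copy step can be sketched as a small script that mirrors every file from the extracted cuDNN folder into the same relative location under the CUDA folder. This assumes the extracted archive uses the same subfolder layout as the CUDA install (typically bin, include and lib). Both paths in the commented-out call are hypothetical, and writing under Program Files usually requires an administrator prompt.

```python
import os
import shutil


def copy_cudnn(cudnn_dir, cuda_dir):
    """Copy every file under cudnn_dir to the same relative path under cuda_dir."""
    copied = []
    for root, _dirs, files in os.walk(cudnn_dir):
        rel = os.path.relpath(root, cudnn_dir)
        target_dir = os.path.normpath(os.path.join(cuda_dir, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            shutil.copy2(os.path.join(root, name), os.path.join(target_dir, name))
            copied.append(os.path.normpath(os.path.join(rel, name)))
    return sorted(copied)


# Hypothetical locations; adjust both to your machine before uncommenting.
# copy_cudnn(r"C:\Users\me\Downloads\cudnn\cuda",
#            r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0")
```

The function returns the relative paths it copied, so you can review exactly which files landed in the CUDA folder.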

Memory issues

Once you have done all of the above, things usually run fine.
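A quick way to confirm that TensorFlow actually sees the GPU is to list its local devices. This is a minimal sketch: `gpu_device_names` is a name made up for this example, and the import is guarded only so that the snippet still runs on a machine without TensorFlow.

```python
try:
    from tensorflow.python.client import device_lib
except ImportError:  # TensorFlow is not installed in this environment
    device_lib = None


def gpu_device_names():
    """Return the names of the GPU devices TensorFlow can see (empty list if none)."""
    if device_lib is None:
        return []
    return [d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"]


print(gpu_device_names())
```

An empty list means TensorFlow fell back to the CPU, so one of the earlier steps (CUDA, PATH, driver, cuDNN) still needs attention.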

In some cases, though, you may hit a memory issue like the following.

2019-12-26 15:00:26.772893: I tensorflow/core/common_runtime/placer.cc:54] Placeholder_1: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0

2019-12-26 15:00:26.800123: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll

2019-12-26 15:00:30.065069: E tensorflow/stream_executor/stream.cc:332] Error recording event in stream: error recording CUDA event on stream 0x1615f1209d0: CUDA_ERROR_LAUNCH_TIMEOUT: the launch timed out and was terminated; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.

2019-12-26 15:00:30.065068: E tensorflow/stream_executor/cuda/cuda_driver.cc:892] failed to alloc 1073741824 bytes on host: CUDA_ERROR_LAUNCH_TIMEOUT: the launch timed out and was terminated

2019-12-26 15:00:30.076501: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_TIMEOUT: the launch timed out and was terminated

2019-12-26 15:00:30.083017: W .\tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 1073741824

2019-12-26 15:00:30.089549: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1

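The log says CUDA_ERROR_LAUNCH_TIMEOUT.

It looks as if CUDA was running and then stopped.

I opened Task Manager to look at the GPU status.

As the screenshot below shows, dedicated GPU memory usage climbed and then dropped.

My guess is that the GPU operation kept consuming memory until no more was available, and the job was aborted. The error name itself only says that a kernel launch timed out, so this is an inference from the memory graph and from the failed 1073741824-byte (1 GiB) host allocation in the log.

In that case you would probably need a GPU with more memory, or you would need to optimize the computation further.

Memory usage climbed and then the job stopped.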
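One thing worth trying before buying a new card is to limit how much GPU memory TensorFlow claims. The sketch below builds a session config with `allow_growth` and an optional memory fraction through the `tf.compat.v1` API. Whether this helps with the launch timeout above is untested and only an assumption, and `make_session_config` is a hypothetical helper name.

```python
try:
    import tensorflow as tf
except ImportError:  # keeps the snippet runnable without TensorFlow
    tf = None


def make_session_config(memory_fraction=None):
    """Build a session config that allocates GPU memory on demand."""
    if tf is None:
        return None
    config = tf.compat.v1.ConfigProto()
    # Grow the GPU allocation as needed instead of reserving it all up front.
    config.gpu_options.allow_growth = True
    if memory_fraction is not None:
        # Optionally cap the share of GPU memory this process may use.
        config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return config


config = make_session_config()
# Usage, mirroring the tf.Session call in the traceback earlier in the post:
# with tf.compat.v1.Session(config=config) as sess:
#     ...
```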

김띵준의 Programming Story