Monday, April 11, 2016

DIY Deep Learning Machine Nvidia Geforce GTX 750 Ti CUDA

I want to build a new Deep Learning Machine to replace the existing old Intel Xeon node.

For better power efficiency, I will build the new system around a GPU instead of relying on the CPU.
The old Intel Xeon simply draws too much power for a home lab hydro bill.

Here is the parts list, with the prices I paid in CAD.

[table id=1 /]

I picked the GeForce GTX 750 Ti for its performance per watt and performance per dollar.

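The GTX 750 Ti is sold as part of the GeForce 700 series, but its GM107 chip is first-generation Maxwell (most of the 700 series is Kepler-based).
On paper, the GM107 in the GTX 750 Ti delivers 1305.6 single-precision GFLOPS.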

I could have gone with the GeForce 900 series, but its price and power consumption were not good enough for me.
Do your own math with the tables at https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units
using [Processing Power]/[TDP].
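
As a rough sketch of that math for the GTX 750 Ti: the 640 CUDA cores match the deviceQuery output in this post, while the 1020 MHz base clock and the 60 W TDP are assumptions taken from the published specifications rather than from anything measured here. The variable names are only for illustration.

```shell
# Assumed inputs (not stated in the post): 1020 MHz base clock and 60 W TDP.
# The 640 CUDA cores match the deviceQuery output shown in this post.
cores=640
base_clock_ghz=1.020
tdp_watts=60

# Peak single precision = cores x 2 FLOPs per cycle (one fused multiply-add) x clock.
peak_gflops=$(awk -v c="$cores" -v f="$base_clock_ghz" 'BEGIN { printf "%.1f", c * 2 * f }')

# Performance per watt = [Processing Power] / [TDP].
gflops_per_watt=$(awk -v p="$peak_gflops" -v w="$tdp_watts" 'BEGIN { printf "%.2f", p / w }')

echo "peak: ${peak_gflops} GFLOPS, efficiency: ${gflops_per_watt} GFLOPS/W"
```

With these inputs the script prints a peak of 1305.6 GFLOPS and 21.76 GFLOPS/W. Swap in the numbers of any other card from the Wikipedia table to compare.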

A few more points:
I chose the standard card, not the OC (overclocked) version, for stability.
I chose CentOS 7 for stability.

This is how I installed the CUDA 7.5 package.

uname -a
yum update
reboot
uname -a
uname -m && cat /etc/*release

yum install gcc
gcc --version

wget http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-7.5-18.x86_64.rpm

rpm -i cuda-repo-rhel7-7.5-18.x86_64.rpm
yum clean all
yum update

yum install kernel-devel

wget https://dl.fedoraproject.org/pub/epel/7/x86_64/d/dkms-2.2.0.3-30.git.7c3e7c5.el7.noarch.rpm
rpm -i dkms-2.2.0.3-30.git.7c3e7c5.el7.noarch.rpm

yum install cuda

rpm -qa|grep nvidia|sort
rpm -qa|grep cuda|sort
cd /usr/local/

cd cuda
ls

cd /etc/ld.so.conf.d/
cat nvidia-lib64.conf
cat /proc/driver/nvidia/version

cd nvidia/

vi .bash_profile
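
The post does not show what was added to .bash_profile. A minimal sketch, assuming the default install prefix /usr/local/cuda-7.5 and following the usual post-install step of putting the toolkit on the search paths (CUDA_HOME is just an illustrative variable name here, not something the installer requires):

```shell
# Assumed contents for .bash_profile (the original edit is not shown in the post).
# /usr/local/cuda-7.5 is assumed to be the install prefix of the cuda packages.
CUDA_HOME=/usr/local/cuda-7.5

# Make nvcc and cuda-install-samples-7.5.sh visible on the command line.
export PATH="$CUDA_HOME/bin:$PATH"

# Let the dynamic linker find the CUDA runtime libraries.
# The ${VAR:+...} form avoids a trailing colon when LD_LIBRARY_PATH is empty.
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

After logging in again (or sourcing the file), nvcc --version should report release 7.5 if the paths are right.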


Then install the CUDA samples:

cuda-install-samples-7.5.sh /home/cuda/
cd /home/cuda/

cd NVIDIA_CUDA-7.5_Samples/

make


Check the device (the sample binaries are built into bin/x86_64/linux/release under the samples directory):
[root@itx release]# ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 750 Ti"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 2048 MBytes (2147287040 bytes)
( 5) Multiprocessors, (128) CUDA Cores/MP: 640 CUDA Cores
GPU Max Clock rate: 1084 MHz (1.08 GHz)
Memory Clock rate: 2700 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GTX 750 Ti
Result = PASS



[root@itx ~]# nvidia-smi
Mon Apr 11 15:58:37 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.79     Driver Version: 352.79         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  Off  | 0000:01:00.0     Off |                  N/A |
| 40%   23C    P0     1W /  38W |      7MiB /  2047MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

[root@itx release]# nvidia-smi -L
GPU 0: GeForce GTX 750 Ti (UUID: GPU-eac6aba8-b464-24e2-12fd-27e57bb6e42c)


Run the CUDA nbody benchmark four times:


[root@itx release]# ./nbody -benchmark -numbodies=256000 -device=0

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
gpuDeviceInit() CUDA Device [0]: "GeForce GTX 750 Ti
> Compute 5.0 CUDA device: [GeForce GTX 750 Ti]
number of bodies = 256000
256000 bodies, total time for 10 iterations: 18766.451 ms
= 34.922 billion interactions per second
= 698.438 single-precision GFLOP/s at 20 flops per interaction

gpuDeviceInit() CUDA Device [0]: "GeForce GTX 750 Ti
> Compute 5.0 CUDA device: [GeForce GTX 750 Ti]
number of bodies = 256000
256000 bodies, total time for 10 iterations: 18790.738 ms
= 34.877 billion interactions per second
= 697.535 single-precision GFLOP/s at 20 flops per interaction

> 1 Devices used for simulation
gpuDeviceInit() CUDA Device [0]: "GeForce GTX 750 Ti
> Compute 5.0 CUDA device: [GeForce GTX 750 Ti]
number of bodies = 256000
256000 bodies, total time for 10 iterations: 18768.137 ms
= 34.919 billion interactions per second
= 698.375 single-precision GFLOP/s at 20 flops per interaction

> 1 Devices used for simulation
gpuDeviceInit() CUDA Device [0]: "GeForce GTX 750 Ti
> Compute 5.0 CUDA device: [GeForce GTX 750 Ti]
number of bodies = 256000
256000 bodies, total time for 10 iterations: 18795.025 ms
= 34.869 billion interactions per second
= 697.376 single-precision GFLOP/s at 20 flops per interaction
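
To see where these figures come from, here is a small sketch that recomputes the first run from its raw numbers and compares it with the 1305.6 GFLOPS paper figure quoted earlier in this post. It assumes, as the benchmark output states, 20 flops per interaction and an all-pairs simulation (every body interacts with every body in each iteration). The variable names are only for illustration.

```shell
# Figures copied from the first benchmark run above.
bodies=256000
iterations=10
total_ms=18766.451

# All-pairs simulation: bodies^2 interactions per iteration, reported in billions per second.
billion_interactions_per_s=$(awk -v n="$bodies" -v i="$iterations" -v ms="$total_ms" \
    'BEGIN { printf "%.3f", n * n * i / (ms / 1000) / 1e9 }')

# nbody counts 20 floating-point operations per interaction.
measured_gflops=$(awk -v r="$billion_interactions_per_s" 'BEGIN { printf "%.1f", r * 20 }')

# Compare with the 1305.6 GFLOPS paper figure quoted earlier in the post.
percent_of_peak=$(awk -v m="$measured_gflops" 'BEGIN { printf "%.1f", 100 * m / 1305.6 }')

echo "${billion_interactions_per_s} billion interactions/s"
echo "${measured_gflops} GFLOP/s measured, ${percent_of_peak}% of the paper peak"
```

This reproduces the 34.922 billion interactions per second of the first run and puts the measured throughput at roughly 53.5% of the theoretical peak. Keep in mind the note in the benchmark output itself: the CUDA samples are not meant for performance measurements.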
