MILLION DOLLAR SERVER: Deep Learning

I want to build a new deep learning machine to replace my old Intel Xeon node.

For better power efficiency, I will build the new system around a GPU instead of CPUs.
The old Intel Xeon draws too much power for a home lab hydro bill.

Here is the parts list, with the prices I paid in CAD.

[table id=1 /]

I picked the GeForce GTX 750 Ti for its performance per watt and performance per dollar.

The GTX 750 Ti is sold as part of the GeForce 700 series, but it uses the first-generation Maxwell GM107 chip.
On paper, the GM107 in this card delivers 1305.6 single-precision GFLOPS.
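
The paper figure can be reproduced from the core count and the clock, since each CUDA core can retire one fused multiply-add (2 flops) per cycle. A minimal sketch, assuming the 1020 MHz base clock that the Wikipedia table lists for the GTX 750 Ti; the `peak_gflops` helper name is made up for this post:

```shell
# peak_gflops CORES CLOCK_MHZ
# Theoretical single-precision peak = cores * 2 flops (one FMA) * clock.
peak_gflops() {
    awk -v cores="$1" -v mhz="$2" 'BEGIN { printf "%.1f\n", cores * 2 * mhz / 1000 }'
}

peak_gflops 640 1020   # assumed base clock  -> 1305.6
peak_gflops 640 1084   # max clock reported by deviceQuery -> 1387.5
```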

I could have gone with the GeForce 900 series, but the price and power consumption were not good enough for me.
Do your own math with the figures at https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units
using [Processing Power]/[TDP].
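
As a worked example of the [Processing Power]/[TDP] ratio, here is a small sketch. The 60 W TDP is an assumption taken from the same Wikipedia table, and `gflops_per_watt` is a hypothetical helper name, not part of any tool:

```shell
# gflops_per_watt GFLOPS TDP_WATTS
# Divides the on-paper processing power by the TDP.
gflops_per_watt() {
    awk -v gflops="$1" -v tdp="$2" 'BEGIN { printf "%.2f\n", gflops / tdp }'
}

# GTX 750 Ti: 1305.6 GFLOPS on paper, 60 W TDP (assumed, from the Wikipedia table)
gflops_per_watt 1305.6 60   # -> 21.76
```

Plug in the numbers for any other card from the table to compare it against the 750 Ti.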

A few more points:
I chose the standard version of the card, not the OC (overclocked) version, for stability.
I chose CentOS 7 for stability.

This is how I installed the CUDA 7.5 package.

uname -a
yum update
reboot
uname -a
uname -m && cat /etc/*release

yum install gcc
gcc --version

wget http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-repo-rhel7-7.5-18.x86_64.rpm

rpm -i cuda-repo-rhel7-7.5-18.x86_64.rpm 
yum clean all
yum update

yum install kernel-devel

wget https://dl.fedoraproject.org/pub/epel/7/x86_64/d/dkms-2.2.0.3-30.git.7c3e7c5.el7.noarch.rpm
rpm -i dkms-2.2.0.3-30.git.7c3e7c5.el7.noarch.rpm 

yum install cuda

rpm -qa|grep nvidia|sort
rpm -qa|grep cuda|sort
cd /usr/local/

cd cuda
ls

cd /etc/ld.so.conf.d/
cat nvidia-lib64.conf 
cat /proc/driver/nvidia/version

cd nvidia/

vi .bash_profile
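
The post does not show what went into .bash_profile. A typical edit, sketched here as an assumption, puts the toolkit on PATH and LD_LIBRARY_PATH so that nvcc and cuda-install-samples-7.5.sh can be found. It relies on the /usr/local/cuda directory visited above:

```shell
# Assumed additions to ~/.bash_profile (not shown in the original post).
# /usr/local/cuda is the toolkit directory listed earlier in this post.
export PATH=/usr/local/cuda/bin:$PATH
# Append the old value only if it was already set, to avoid a trailing ':'.
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
```

Log in again or run `source ~/.bash_profile` for the change to take effect.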

Then install the CUDA samples:


cuda-install-samples-7.5.sh /home/cuda/
cd /home/cuda/

cd NVIDIA_CUDA-7.5_Samples/

make

Check the device:

[root@itx release]# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 750 Ti"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 2048 MBytes (2147287040 bytes)
  ( 5) Multiprocessors, (128) CUDA Cores/MP:     640 CUDA Cores
  GPU Max Clock rate:                            1084 MHz (1.08 GHz)
  Memory Clock rate:                             2700 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GTX 750 Ti
Result = PASS


[root@itx ~]# nvidia-smi
Mon Apr 11 15:58:37 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.79     Driver Version: 352.79         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  Off  | 0000:01:00.0     Off |                  N/A |
| 40%   23C    P0     1W /  38W |      7MiB /  2047MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

[root@itx release]# nvidia-smi -L
GPU 0: GeForce GTX 750 Ti (UUID: GPU-eac6aba8-b464-24e2-12fd-27e57bb6e42c)

Run the CUDA nbody benchmark four times:


[root@itx release]# ./nbody -benchmark -numbodies=256000 -device=0

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
gpuDeviceInit() CUDA Device [0]: "GeForce GTX 750 Ti
> Compute 5.0 CUDA device: [GeForce GTX 750 Ti]
number of bodies = 256000
256000 bodies, total time for 10 iterations: 18766.451 ms
= 34.922 billion interactions per second
= 698.438 single-precision GFLOP/s at 20 flops per interaction

gpuDeviceInit() CUDA Device [0]: "GeForce GTX 750 Ti
> Compute 5.0 CUDA device: [GeForce GTX 750 Ti]
number of bodies = 256000
256000 bodies, total time for 10 iterations: 18790.738 ms
= 34.877 billion interactions per second
= 697.535 single-precision GFLOP/s at 20 flops per interaction

> 1 Devices used for simulation
gpuDeviceInit() CUDA Device [0]: "GeForce GTX 750 Ti
> Compute 5.0 CUDA device: [GeForce GTX 750 Ti]
number of bodies = 256000
256000 bodies, total time for 10 iterations: 18768.137 ms
= 34.919 billion interactions per second
= 698.375 single-precision GFLOP/s at 20 flops per interaction

> 1 Devices used for simulation
gpuDeviceInit() CUDA Device [0]: "GeForce GTX 750 Ti
> Compute 5.0 CUDA device: [GeForce GTX 750 Ti]
number of bodies = 256000
256000 bodies, total time for 10 iterations: 18795.025 ms
= 34.869 billion interactions per second
= 697.376 single-precision GFLOP/s at 20 flops per interaction
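
The reported GFLOP/s can be cross-checked from the other numbers in the output: every iteration computes bodies x bodies interactions at 20 flops each. A rough sketch follows; `nbody_gflops` is a hypothetical helper, and the comparison uses the 1305.6 GFLOPS paper figure quoted earlier:

```shell
# nbody_gflops BODIES ITERATIONS TOTAL_MS
# GFLOP/s = bodies^2 * iterations * 20 flops / elapsed seconds / 1e9
nbody_gflops() {
    awk -v n="$1" -v iters="$2" -v ms="$3" \
        'BEGIN { printf "%.1f\n", n * n * iters * 20 / (ms / 1000) / 1e9 }'
}

nbody_gflops 256000 10 18766.451   # first run -> 698.4

# Mean of the four reported results and its share of the paper figure.
printf '%s\n' 698.438 697.535 698.375 697.376 |
    awk '{ sum += $1 }
         END { printf "mean %.1f GFLOP/s, %.1f%% of peak\n", sum / NR, 100 * sum / NR / 1305.6 }'
```

The four runs agree to within about 1 GFLOP/s and land at roughly half of the on-paper figure. Keep in mind the sample's own note that it is not meant for performance measurements.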

Monday, April 11, 2016

DIY Deep Learning Machine Nvidia Geforce GTX 750 Ti CUDA