CUDA Tutorial

Ⅰ. Introduction to the device available at the Colorado School of Mines (CSM)

The Colorado School of Mines (CSM) invites CSM students to apply for an account on the Mines Nvidia Tesla S1070 system.

Request an account on The CSM tesla node

Special purpose computing accelerators, such as graphics processing units (GPUs), have become an important platform for parallel high performance scientific computing. CSM recently took delivery of and installed an Nvidia Tesla S1070 system. This 1U rack mountable system contains 4 Nvidia Tesla T10 GPUs and has a peak computing performance of 4 trillion floating point operations per second, or 4 Teraflops. This new Tesla system is front-ended by one of the Mio nodes. Each of the 4 GPUs on the Tesla has 240 processing cores and 4 Gbytes of memory, for a total of 960 cores and 16 Gbytes. The individual GPUs are connected to the front-end node via a PCI connector.

Ⅱ. Basic concepts of NVIDIA GPU and CUDA programming

CUDA is a high level language: C with a few straightforward extensions. NVIDIA is committed to supporting CUDA as hardware changes, and hardware is projected to change radically in the future; in particular, processor counts may go from hundreds to tens of thousands. Program algorithms, architecture, and source code can remain largely unchanged: to use more processors, you simply increase the problem size. For example, increasing a 3D grid by a factor of 5 takes you from hundreds to tens of thousands of processors.

The extensions to the C programming language are four-fold:

  1. Function type qualifiers that specify whether a function executes on the host or on the device, and whether it is callable from the host or from the device: __global__, __device__, __host__.
  2. Variable type qualifiers that specify the memory location of a variable: __device__, __shared__.
  3. A new directive that specifies how a kernel is executed on the device from the host.
  4. Four built-in variables that specify the grid and block dimensions and the block and thread indices: gridDim, blockIdx, blockDim, threadIdx.

NVIDIA promises to support CUDA for the foreseeable future. CUDA encapsulates the hardware model, so you do not have to worry about hardware model changes, and you keep all the conveniences of C versus assembly. Learning the hardware and developing parallel algorithms is still difficult, but the infrastructure for writing, developing, debugging, and maintaining source code is straightforward and similar to conventional serial programming.

Steps in a CUDA code

Host Code (xx.cpp)

  1. Initialize/acquire device (GPU)
  2. Allocate memory on GPU
  3. Copy data from host to GPU
  4. Execute kernel on GPU
  5. Copy data from GPU to host
  6. Deallocate memory on GPU
  7. Release device
  8. Run "Gold" version on host
  9. Compare results
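The steps above can be sketched as a single CUDA host program. This is a minimal illustration, not code from the tutorial's examples; the kernel `scale` and the problem size are hypothetical:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: multiply each element by 2.
__global__ void scale(float *d_a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_a[i] *= 2.0f;
}

int main(void)
{
    const int n = 1024;
    float h_a[1024];
    for (int i = 0; i < n; i++) h_a[i] = (float)i;

    float *d_a = NULL;
    cudaMalloc((void**)&d_a, n * sizeof(float));       // 2. allocate memory on GPU
    cudaMemcpy(d_a, h_a, n * sizeof(float),
               cudaMemcpyHostToDevice);                // 3. copy data host -> GPU
    scale<<<(n + 255) / 256, 256>>>(d_a, n);           // 4. execute kernel on GPU
    cudaMemcpy(h_a, d_a, n * sizeof(float),
               cudaMemcpyDeviceToHost);                // 5. copy data GPU -> host
    cudaFree(d_a);                                     // 6. deallocate memory on GPU

    // 8./9. a "Gold" host version would recompute the result here and compare.
    printf("h_a[10] = %f\n", h_a[10]);
    return 0;
}
```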

Kernel Code (

A kernel is a function callable from the host and executed on the CUDA device -- simultaneously by many threads in parallel. Calling a kernel involves specifying its name plus an execution configuration. An execution configuration simply defines the number of parallel threads in a group (block) and the number of groups (blocks) to use when running the kernel on the CUDA device.

Nvidia CUDA Programming Basics:

CUDA Programming Model:

A function declared with the __global__ qualifier is compiled to run on the device. The resulting program is called a kernel. The batch of threads that executes a kernel is organized as a grid of thread blocks.

The Memory model:

Each CUDA device has several memories that can be used by programmers to achieve a high Computation to Global Memory Access (CGMA) ratio and thus high execution speed in their kernels. Variables that reside in registers and shared memories can be accessed at very high speed in a highly parallel manner. Registers are allocated to individual threads; each thread can only access its own registers. A kernel function typically uses registers to hold frequently accessed variables that are private to each thread. Shared memories are allocated to thread blocks; all threads in a block can access variables in the shared memory locations allocated to the block. Shared memories are an efficient means for threads to cooperate by sharing the results of their work. In the middle of the table below, we see global memory and constant memory. These are the memories that the host code can write (W) and read (R) by calling API functions. The global memory can be accessed by all the threads at any time of program execution. The constant memory allows read-only access by the device and provides faster and more parallel data access paths for CUDA kernel execution than the global memory.

Below is the table of types of CUDA memory:

Different Types of CUDA Memory

Memory     Location   Cached   Access       Who
Local      Off-chip   No       Read/write   One thread
Shared     On-chip    N/A      Read/write   All threads in a block
Global     Off-chip   No       Read/write   All threads + CPU
Constant   Off-chip   Yes      Read         All threads + CPU
Texture    Off-chip   Yes      Read         All threads + CPU

CUDA defines registers, shared memory, and constant memory that can be accessed at higher speed and in a more parallel manner than the global memory. Using these memories effectively will likely require re-design of the algorithm. It is important for CUDA programmers to be aware of the limited sizes of these special memories. Their capacities are implementation dependent. Once their capacities are exceeded, they become limiting factors for the number of threads that can be assigned to each SM.

Hardware resource rules:

CUDA API Basics:

CUDA is an Extension to the C Programming Language. It adds function type qualifiers to specify execution on host or device and variable type qualifiers to specify the memory location on the device. There are four built-in variables that specify the grid and block dimensions and the block and thread indices.

Function type qualifiers:

__device__ : Executed on the device, callable from the device only.

__global__ : Executed on the device, callable from the host only.

__host__ : Executed on the host, callable from the host only.

Variable Type Qualifiers:

__device__ : Resides in global memory space and has the lifetime of an application. It is accessible from all the threads within the grid and from the host through the runtime library.

__constant__ (optionally used together with __device__) : Resides in constant memory space and has the lifetime of an application. It is accessible from all the threads within the grid and from the host through the runtime library.

__shared__ (optionally used together with __device__) : Resides in the shared memory space of a thread block and has the lifetime of the block. It is only accessible from all the threads within the block.
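The three variable qualifiers can be illustrated with a short CUDA sketch (the variable names and sizes here are made up for illustration):

```cuda
// Sketch of the three variable type qualifiers.
__device__   float d_scratch[256];   // global memory, lifetime of the application
__constant__ float c_coeffs[16];     // constant memory, read-only on the device

__global__ void combine(void)
{
    __shared__ float s_tile[128];    // shared memory, lifetime of the block

    // Each thread reads a constant and a global value, stores into shared memory.
    s_tile[threadIdx.x] = c_coeffs[0] * d_scratch[threadIdx.x];
    __syncthreads();

    d_scratch[threadIdx.x] = s_tile[threadIdx.x];
}
```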

Execution Configuration (EC):

An EC must be specified for any call to a __global__ function. It defines the dimension of the grid and blocks, and is specified by inserting an expression between the function name and the argument list. A function declared as:

__global__ void Func(float* parameter)

must be called like this:

Func<<< Dg, Db, Ns >>>(parameter)

Where Dg, Db, Ns are :

Dg is of type dim3, dimension and size of the grid

Dg.x * Dg.y = number of blocks being launched;

Db is of type dim3, dimension and size of each block

Db.x * Db.y * Db.z = number of threads per block;

Ns is of type size_t, number of bytes in shared memory that is dynamically allocated in addition to the statically allocated memory

Ns is an optional argument which defaults to 0.

Built-in Variables

gridDim is of type dim3, dimensions of the grid.

blockIdx is of type uint3, block index within the grid.

blockDim is of type dim3, dimensions of the block.

threadIdx is of type uint3, thread index within the block.

Ⅲ. Basic Usage Instructions (environment setup)

The CSM Tesla node, cuda1, is connected to a front end node on Mio. To access the Tesla you must first ssh to Mio. For example:

[jouser@ra5 ~]$ ssh mio
jouser@mio's password:
Last login: Mon Mar 28 14:07:32 2011 from
[jouser@mio ~]$

You will want to add the following lines to your .bash_profile file on cuda1 to enable access to the CUDA compilers.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/development/gpu/3.2/cuda/lib64
export PATH=$PATH:/opt/development/gpu/3.2/cuda/bin

Note: You may need to reload your shell before you can use these variables.

source .bash_profile

To create the standard CUDA examples, run the following commands from your home directory, hitting return to provide the default answers to the questions.

First go to your home directory.

[jouser@mio ~]$ cd

Then copy the GPU SDK installation file to your home directory.

[jouser@mio ~]$ cp /opt/development/gpu/3.2/ .


[jouser@mio ~]$ wget
[jouser@mio ~]$ chmod +x ./

Run the file

[jouser@mio ~]$ ./

This will create the directory, NVIDIA_GPU_Computing_SDK. Enter this directory:

[jouser@mio ~]$ cd NVIDIA_GPU_Computing_SDK

Before compiling the SDK examples, go to the directory below to run the makefile to build some libraries:

[jouser@mio ~]$ cd NVIDIA_GPU_Computing_SDK/C/common/
[jouser@mio common]$ make	
[jouser@mio common]$ cd ~/NVIDIA_GPU_Computing_SDK/C/shared/
[jouser@mio shared]$ make	

Go to the "lib" directory to build a symbolic link to "cutil" library

[jouser@mio shared]$ cd ~/NVIDIA_GPU_Computing_SDK/C/lib/
[jouser@mio lib]$ ln -s libcutil_x86_64.a libcutil.a	

Go to ~/NVIDIA_GPU_Computing_SDK/C/common and edit the file. Modify line 278 by adding the text "-L/opt/development/gpu/3.2/libcuda" before -lcuda.

Line 278 becomes:

     LIB += -L/opt/development/gpu/3.2/libcuda  -lcuda   ${OPENGLLIB} $(PARAMGLLIB) $(RENDERCHECKGLLIB) $(CUDPPLIB) ${LIB} 

Then go up to the "C" directory

[jouser@mio lib]$ cd ~/NVIDIA_GPU_Computing_SDK/C/

Then build the examples using make

[jouser@mio C]$ make

You will see a number of warnings. Then go to the directory containing the built examples $HOME/NVIDIA_GPU_Computing_SDK/C/bin/linux/release :

[jouser@mio C]$ cd $HOME/NVIDIA_GPU_Computing_SDK/C/bin/linux/release

This directory should contain a number of example programs. The Tesla board is designed primarily for computation and does not support video out. If you try to run any of the examples that produce video output you will get an error.

The NVIDIA SDK comes with many sample projects. After typing the command "make" in the directory 'C' to build the examples, you can find all the executable files in the directory $HOME/NVIDIA_GPU_Computing_SDK/C/bin/linux/release. Playing with these examples will help you understand CUDA better. You can also reuse the structure, makefile, macros, etc. of the sample projects when writing your own code; this is fairly simple and straightforward and the easiest way to get your first code running.

How to run the programs on the GPU enabled node on Mio?

First, go to your home directory, then create a new empty directory run_device.

[jouser@mio ~]$ cd
[jouser@mio ~]$ mkdir run_device
[jouser@mio ~]$ cd run_device
[jouser@mio run_device]$

Second, create a file named run_cuda with the following content, or download it from here.

##PBS -l nodes=1:ppn=4
#PBS -l nodes=1:cuda
#PBS -l walltime=16:00:00
#PBS -N testIO
#PBS -o cuda_out.$PBS_JOBID
#PBS -e cuda_err.$PBS_JOBID
#PBS -r n
##PBS -m abe
##PBS -M

cp ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/deviceQuery .

echo running on: `cat $PBS_NODEFILE`

echo " gives a report of the devices that were found"
./deviceQuery < /dev/null

Third, use the command qsub to submit the job, and use qstat -u username to watch your job in the queue.

[jouser@mio run_device]$ qsub run_cuda
[jouser@mio run_device]$ qstat -u jouser
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
9159.mio.mines.e     jouser    batch    testIO              --      1 1:c    --  16:00 Q   --

9159 is the job ID.

When the status of the job is "C", you can find the result or error information in cuda_out.$PBS_JOBID or cuda_err.$PBS_JOBID.

Here is another example to make and then run the programs on the GPU enabled node on Mio:

We have put together a simple procedure to download, build, and run a number of GPU examples. All you need to do is download the files, run make, and then run the programs on the GPU enabled node on Mio.

You will download a tar file that contains source, a makefile, an environmental settings file (pgi) and a batch script (cudarun).

You can use the files "runfile" and "makefile" as templates for your programs.

(1) On "", create a new empty directory.

mkdir guide

(2) Go to the directory.

cd guide

(3) Download the files

wget -o wget.out

(3b optional) You might want to do an "ls" to make sure the file "cuda.tgz" actually downloaded.

(4) Untar the files

tar -xzf cuda.tgz

(4b optional) You might want to do an "ls" to make sure you have the files cudarun, makefile, more.tgz, NVIDIA.tgz, pgi, and rcs09.tgz

(5) Setting up your environment

The file pgi contains the $PATH and $MANPATH settings needed to reference the latest version of the Portland Group compilers. These are required to build one of the examples.

To enable you to build the examples you need to source this file pgi.

source pgi

(5b optional) Setting up your environment permanently

If you plan on using this new version of the Portland Group compilers in the future you should add this file to your .bashrc file. One way to do this is the following:

cp ~/.bashrc ~/.bashrc-old-pg
cat pgi >> ~/.bashrc

(6) Make the executables.

make

Submitting the job to the GPU node

(7) Submit your job using qsub.

qsub cudarun -l nodes=cuda1

Note the parameter -l nodes=cuda1 is not required.

After a few moments your job will run and put the output into the file "cuda_out.*". You will see something like the following:

[tkaiser@mio ~/guide]$ls -lt
total 732
-rw------- 1 tkaiser tkaiser   14620 Mar 18 15:47
-rw------- 1 tkaiser tkaiser      37 Mar 18 15:47
-rw-rw-r-- 1 tkaiser tkaiser   11880 Mar 18 15:47 out.dat
-rw-rw-r-- 1 tkaiser tkaiser    6085 Mar 18 15:47 deviceQuery.txt
-rw-rw-r-- 1 tkaiser tkaiser     164 Mar 18 15:47 SdkMasterLog.csv
[tkaiser@mio ~/guide]$ 
with the file cuda_out.* containing something like:
[tkaiser@ra ~/guide]$cat cuda_out.*
running on: cuda1

 gives a report of the devices that were found
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

There are 4 devices supporting CUDA

Device 0: "Tesla T10 Processor"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.20
  CUDA Capability Major/Minor version number:    1.3
  Total amount of global memory:                 4294770688 bytes
  Multiprocessors x Cores/MP = Cores:            30 (MP) x 8 (Cores/MP) = 240 (Cores)
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes

Press Enter to Quit...

 hello world in cuda.  prints a list of threads
     0        0   0        0   0   0
     1        0   0        1   0   0
     2        0   0        0   1   0
 tests to see if we can run mpi
processor 0  sent 5678
processor 1  got 5678

 test to see if a cula linear algebra routine can be called
  matrix generated   0.1620000004768372  of size  799
 beam=  94.2636414  94.7636414  95.2636337
 799  did solve   1.3629999998956919     using :cgesv     
 matrix generation time   0.1620000004768372
 matrix solve time    1.3629999998956919
 total computaion time    1.6750000007450581

 test to see if Portland Group compiler works
                     name=Tesla T10 Processor
           totalGlobalMem=               4294770688
        sharedMemPerBlock=                    16384
             regsPerBlock=        16384
                 warpSize=           32
                 memPitch=               2147483647
[tkaiser@mio cuda]$ 
[tkaiser@mio ~/guide]$ 

Start writing your own parallel code:

Ⅳ. Makefile example for CUDA program

Download the example Makefile here.

# CUDA Sample Makefile                                       #
#                                                            #
# Author      : Chong Ding:           #
# Version     : 1.0                                          #
# Date        : 09/15/2010                                   #
# Discription : generic Makefile for making CUDA programs    #

BIN               := mv 
CXXFLAGS          := -O3 -g

CUDA_INSTALL_PATH ?= /opt/development/gpu/3.2/cuda
NVCC              := $(CUDA_INSTALL_PATH)/bin/nvcc

INCD = -I"$(CUDA_SDK_PATH)/C/common/inc" -I"$(CUDA_INSTALL_PATH)/include" -I"./"
LIBS = -L"/opt/development/gpu/3.2/libcuda" -lcuda -L"$(CUDA_INSTALL_PATH)/lib64" -lcudart -lcublas -lcufft -L"$(CUDA_SDK_PATH)/C/common/lib" $(CUDA_SDK_PATH)/C/lib/libcutil$(LIBSUFFIX).a -lstdc++ -lpthread
NVCCFLAGS := --ptxas-options=-v -O3 -G -g 

# files
CPP_SOURCES       := mv_gold.cpp 
CU_SOURCES        :=
HEADERS           := $(wildcard *.h)
CPP_OBJS          := $(patsubst %.cpp, %.o, $(CPP_SOURCES))
CU_OBJS           := $(patsubst, %.cu_o, $(CU_SOURCES))

%.cu_o :
	$(NVCC) $(NVCCFLAGS) -c $(INCD) -o $@ $<

%.o: %.cpp
	$(CXX) -c $(CXXFLAGS) $(INCD) -o $@ $<


$(BIN): $(CPP_OBJS) $(CU_OBJS)
	$(CXX) -o $(BIN) $(CU_OBJS) $(CPP_OBJS) $(INCD) $(LIBS)

	rm -f $(BIN) *.o *.cu_o

NVCC can compile a program in four modes: 1. release, 2. debug, 3. emurelease, 4. emudebug. Modes 3 and 4 are emulation modes that can be used to debug your CUDA program on the host CPU instead of running the program on the GPU.

Note, however, that emulation mode is only meant for debugging: it is very slow and spawns thousands of threads on the CPU.


- release configuration by typing "make".

- debug configuration by typing "make dbg=1".

- emurelease configuration by typing "make emu=1".

- emudebug configuration by typing "make emu=1 dbg=1".

Example (Matrix-Vector Multiplication)

Matrix-Vector Multiplication

Source Code of Matrix-Vector Multiplication. Type the command unzip -d ./mv to extract.

Y = A*X, where A is an M-by-N matrix and X is an N-element vector (or N-by-1 matrix); the result Y is an M-element vector (or M-by-1 matrix). M and N can be any integers and are limited only by the global memory size of the GPU.

a. Global Memory:

First, we implement the global memory version of matrix-vector multiplication.

Kernel Code:
__global__ void MVKernel_gm(Matrix A, Matrix X, Matrix Y)
{
  int bx = blockIdx.x;
  int tx = threadIdx.x;

  // Calculate the row index of the Y element
  int Row = bx * BLOCK_SIZE + tx;

  float Pvalue = 0;
  for (unsigned int k = 0; k < A.width; k++) {
    if (Row < A.height)
      Pvalue += A.elements[Row*A.width + k] * X.elements[k];
  }

  if (Row < A.height)
    Y.elements[Row] = Pvalue;
}

b. Shared Memory

Kernel Code:
__global__ void MVKernel_shm(Matrix A, Matrix X, Matrix Y)
{
  __shared__ float Xds[BLOCK_SIZE];
  int bx = blockIdx.x;
  int tx = threadIdx.x;

  // Calculate the row index
  int Row = bx * BLOCK_SIZE + tx;

  float Pvalue = 0;
  // Loop over the BLOCK_SIZE-wide chunks of X
  for (unsigned int m = 0; m < (A.width-1)/BLOCK_SIZE + 1; ++m) {
    // Cooperatively load one chunk of X into shared memory
    if (m*BLOCK_SIZE + tx < A.width)
      Xds[tx] = X.elements[m*BLOCK_SIZE + tx];
      Xds[tx] = 0;
    __syncthreads();  // wait until the whole chunk is loaded

    for (unsigned int k = 0; k < BLOCK_SIZE; k++) {
      if (Row < A.height && m*BLOCK_SIZE + k < A.width)
        Pvalue += A.elements[Row*A.width + m*BLOCK_SIZE + k] * Xds[k];
    }
    __syncthreads();  // wait before the chunk is overwritten

  if (Row < A.height)
    Y.elements[Row] = Pvalue;
}

Compiling and running with code walk-through:

This sample code implements three different optimizations. The first one is to use global memory. The second one is to use shared memory to split X into multiple chunks. The third one is to use CUBLAS library function cublasSgemv.

Explanation of the arguments of the example code:

No arguments: The application will create a randomized matrix A and a vector X. A CPU implementation of the matrix-vector multiplication is used to generate a correct solution, which is compared with your program's output. If it matches (within a certain tolerance), it will print "Test PASSED" to the screen before exiting.

One argument: The application will use the random initialization to create the input matrices, and write the device-computed output to the file specified by the argument.

Three arguments: The application will read input matrices from provided files. The first argument should be a file containing two integers. The first and second integers will be used as A.height and A.width, respectively. The second and third function arguments will be expected to be files which have exactly enough entries to fill matrices A and X respectively. No output is written to file.

Four arguments: The application will read its inputs using the files provided by the first three arguments, and write its output to the file provided in the fourth. Note that if you wish to use the output of one run of the application as an input, you must delete the first line in the output file, which displays the accuracy of the values within the file. The value is not relevant for this application.

Compiling the code:

[jouser@mio mv]$ ls -l
total 9140
-rw-rw-r-- 1 jouser jouser    9072 Sep 21 20:22 dataA1.txt
-rw-rw-r-- 1 jouser jouser 9283560 Sep 21 20:22 dataA2.txt
-rw-rw-r-- 1 jouser jouser     285 Sep 21 20:22 dataX1.txt
-rw-rw-r-- 1 jouser jouser    8659 Sep 21 20:22 dataX2.txt
-rw-rw-r-- 1 jouser jouser    1498 Sep 21 21:30 Makefile
-rw-rw-r-- 1 jouser jouser    8869 Sep 21 21:25
-rw-rw-r-- 1 jouser jouser     985 Sep 21 20:22 mv_gold.cpp
-rw-rw-r-- 1 jouser jouser     334 Sep 21 20:47 mv.h
-rw-rw-r-- 1 jouser jouser    1810 Sep 21 20:22
-rw-rw-r-- 1 jouser jouser    1484 Sep 21 20:22 README.txt
-rw-rw-r-- 1 jouser jouser     283 Mar 28 15:49 run_cuda

.cpp - host c routine

.cu - host/gpu cuda routine

[jouser@mio mv]$ make
g++ -c -O3 -g -I"/home/jouser/NVIDIA_GPU_Computing_SDK/C/common/inc" -I"/usr/local/cuda/include" -I"./" -o mv_gold.o mv_gold.cpp
/usr/local/cuda/bin/nvcc --ptxas-options=-v -O3 -G -g -c -I"/home/jouser/NVIDIA_GPU_Computing_SDK/C/common/inc" -I"/usr/local/cuda/include" -I"./" -o mv.cu_o warning: missing return statement at end of non-void function "MVOnDevice"

ptxas info    : Compiling entry function '_Z12MVKernel_shm6MatrixS_S_' for 'sm_10'
ptxas info    : Used 13 registers, 1096+16 bytes smem, 8 bytes cmem[1]
ptxas info    : Compiling entry function '_Z11MVKernel_gm6MatrixS_S_' for 'sm_10'
ptxas info    : Used 11 registers, 72+16 bytes smem, 8 bytes cmem[1] warning: missing return statement at end of non-void function "MVOnDevice"

g++ -o mv   mv.cu_o  mv_gold.o  -I"/home/jouser/NVIDIA_GPU_Computing_SDK/C/common/inc" -I"/usr/local/cuda/include" -I"./" -L"/usr/local/cuda/lib64" -lcuda -lcudart -lcublas -L"/home/jouser/NVIDIA_GPU_Computing_SDK/C/common/lib" /home/jouser/NVIDIA_GPU_Computing_SDK/C/lib/libcutil.a -lstdc++ -lpthread
[jouser@mio mv]$ ./mv
----Please choose one method to run the matrix-vector multiplication
----Enter 1 or 2 or 3, then press ENTER button.
----(1). Global memory
----(2). shared memory
----(3). CUBLAS
----You choose number here: 2
----Iteration number is 3000:
Size of MATRIX A is 687 by 1001.
--------You choose shared memory method------------

Execution time of GPU implementation is 1.460000
Execution time of CPU implementation is 2.143333
[jouser@mio mv]$

Ⅴ. Useful CUDA Library

Very efficient CUDA libraries are supplied for linear algebra and FFTs: CUBLAS, CUFFT, and CULA. Users can call these routines from host code without writing any parallel code themselves.


The CUBLAS Library

CUBLAS is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA CUDA (compute unified device architecture) driver. It allows access to the computational resources of NVIDIA GPUs. The library is self-contained at the API level, that is, no direct interaction with the CUDA driver is necessary. CUBLAS attaches to a single GPU and does not auto-parallelize across multiple GPUs.

The basic model by which applications use the CUBLAS library is to create matrix and vector objects in GPU memory space, fill them with data, call a sequence of CUBLAS functions, and, finally, upload the results from GPU memory space back to the host. To accomplish this, CUBLAS provides helper functions for creating and destroying objects in GPU space, and for writing data to and retrieving data from these objects.

For maximum compatibility with existing Fortran environments, CUBLAS uses column-major storage and 1-based indexing. Since C and C++ use row-major storage, applications cannot use the native array semantics for two-dimensional arrays. Instead, macros or inline functions should be defined to implement matrices on top of one-dimensional arrays.

Header and library files

CUBLAS consists of the files below. The emulation library file is only for development, by emulation, in an environment without CUDA-capable hardware. Do not use it when you want real computation on the Tesla.

* C header file: cublas.h

* CUBLAS library: ( for emulation)

An example using CUBLAS is shown in the last CUDA sample, which implements matrix-vector multiplication.


CUFFT Library

This document describes CUFFT, the NVIDIA CUDA Fast Fourier Transform (FFT) library. The FFT is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets, and it is one of the most important and widely used numerical algorithms, with applications that include computational physics and general signal processing. The CUFFT library provides a simple interface for computing parallel FFTs on an NVIDIA GPU, which allows users to leverage the floating-point power and parallelism of the GPU without having to develop a custom, GPU-based FFT implementation.

FFT libraries typically vary in terms of supported transform sizes and data types. For example, some libraries only implement Radix-2 FFTs, restricting the transform size to a power of two, while other implementations support arbitrary transform sizes. This version of the CUFFT library supports the following features:

Header and library files

CUFFT consists of the files below. The emulation library file is only for development, by emulation, in an environment without CUDA-capable hardware. Do not use it when you want real computation on the Tesla.

* C header file: cufft.h

* CUFFT library: ( for emulation)


Download the archive and extract it to a local directory. Type the command unzip -d ./cufft_demo to extract.

[jouser@mio cufft_demo]$ make
/usr/local/cuda/bin/nvcc --ptxas-options=-v -O3 -G -g  -c -I"/home/jouser/NVIDIA_GPU_Computing_SDK/C/common/inc" -I"/usr/local/cuda/include" -I"./" -o cufft_demo.cu_o
ptxas info    : Compiling entry function '__cuda_dummy_entry__' for 'sm_10'
ptxas info    : Used 0 registers
g++ -o cufft_demo   cufft_demo.cu_o   -I"/home/jouser/NVIDIA_GPU_Computing_SDK/C/common/inc" -I"/usr/local/cuda/include" -I"./" -L"/usr/local/cuda/lib64" -lcuda -lcudart -lcublas -lcufft -L"/home/jouser/NVIDIA_GPU_Computing_SDK/C/common/lib" /home/jouser/NVIDIA_GPU_Computing_SDK/C/lib/libcutil.a -lstdc++ -lpthread
[jouser@mio cufft_demo]$ ./cufft_demo > cufft_demo.out&
[1] 6670
[jouser@mio cufft_demo]$

The command top can be used to display the top CPU processes. All the output is saved in the file named cufft_demo.out.

For more information about using CUBLAS and CUFFT, please refer to the tutorial "CUBLAS and CUFFT.pdf" from Prof. Dan M. Davis.


CULA Linear Algebra Library

CULA is a GPU-accelerated linear algebra library that utilizes the NVIDIA CUDA parallel computing architecture to dramatically improve the computation speed of sophisticated mathematics.

CULAPACK Functions

CULA contains a "CULAPACK" interface that is comprised of over 150 mathematical routines from the industry standard for computational linear algebra, LAPACK. Our CULA library includes many popular routines including system solvers, least squares solvers, orthogonal factorizations, eigenvalue routines, and singular value decompositions.

This link shows the complete list of routines CULA contains and their performance.

CULA Performance

CULA offers performance up to a magnitude faster than optimized CPU-based linear algebra solvers.

CULA Interfaces

CULA is available in a variety of different interfaces to integrate directly into your existing code. Programmers can easily call GPU-accelerated CULA from their C/C++, FORTRAN, MATLAB, or Python code. This can all be done with no GPU programming experience.

Cross Platform

CULA is available for every system supported by NVIDIA's CUDA. This includes 32- and 64-bit versions of Linux, Windows, and OS X.

The actual speed-up depends heavily on the algorithm, the size of your data set, and what you are benchmarking against. When benchmarking against a standalone LAPACK implementation, CULA routines typically show a 40x to 200x speed-up. In most cases, there is a 3x to 10x speed-up when comparing CULA to the Intel Math Kernel Library implementation of LAPACK running on the latest Core i7 processor. We have assembled several performance charts comparing CULA to other implementations of LAPACK.

Comparing results from linear algebra routines can be a non-trivial task. Different algorithms need to be validated in different fashions. However, when comparing various norms, residuals, and reconstructions, we typically see results that are accurate to machine precision.

CUDA1.MINES.EDU contains the Basic version of the CULA Linear Algebra Library from EM Photonics, which is free.

CULA Basic Functions

CULA Basic includes the following functions in single and single-complex precisions only.


To use the CULA package add the following lines to your .bash_profile file:

export CULA_ROOT="/opt/development/gpu/3.2/cula"
export CULA_INC_PATH="$CULA_ROOT/include"
export CULA_LIB_PATH_64="$CULA_ROOT/lib64"

We are currently using CULA version 1.1.

The example cudaonly.f90 shows how to use the complex LU factorization routine, cula_cgesv, to solve two sets of complex linear equations. The first line of the program gives the compile command. This program is based on the LAPACK cgesv example program.

The CULA Programmers Guide 1.1 and CULA Reference Manual 1.1 are also available.

cudaonly.f90 can be downloaded.


[jouser@mio ~]$ cd
[jouser@mio ~]$ mkdir cula_test
[jouser@mio ~]$ cd cula_test
[jouser@mio cula_test]$ ls
[jouser@mio cula_test]$ gfortran cudaonly.f90 -O3 -I$CULA_ROOT/include -L$CULA_ROOT/lib64 -lcula -lcula_fortran -lcublas -lcudart -o nrfcuda
[jouser@mio cula_test]$ ls
cudaonly.f90  lnumz.mod  nrfcuda  pmatrix.mod
[jouser@mio cula_test]$ ./nrfcuda
 initializing cula
 calling cula_cgesv
 shutting down cula

 details of lu factorization
 ( -4.30, -7.10) ( -6.47,  2.52) ( -6.51, -2.67) ( -5.86,  7.38)
 (  0.49,  0.47) ( 12.26, -3.57) ( -7.87, -0.49) ( -0.98,  6.71)
 (  0.25, -0.15) ( -0.60, -0.37) (-11.70, -4.64) ( -1.35,  1.38)
 ( -0.83, -0.32) (  0.05,  0.58) (  0.93, -0.50) (  2.66,  7.86)

 ( -1.09, -0.18) (  1.28,  1.21)
 (  0.97,  0.52) ( -0.22, -0.97)
 ( -0.20,  0.19) (  0.53,  1.36)
 ( -0.59,  0.92) (  2.22, -1.00)
[jouser@mio cula_test]$

Ⅵ. CUDA Debugger

CUDA-GDB is a ported version of GDB: The GNU Debugger. It has been installed on cuda1. The goal of its design is to present the user with an all-in-one debugging environment that is capable of debugging native host code as well as CUDA code. Therefore, it is an extension to the standard i386 port that is provided in the GDB release. As a result, standard debugging features are inherently supported for host code, and additional features have been provided to support debugging CUDA code.

The goal of CUDA-GDB is to provide developers a mechanism of debugging a CUDA application on actual hardware in realtime. Therefore, a user will be able to verify program correctness without variations often introduced by simulation and/or emulation environments.

CUDA-GDB supports the standard GDB debugging commands for host code (break, run, continue, next, print, delete, quit) and adds CUDA-specific extensions, such as switching the device thread of focus with the thread <<<...>>> command. All of these commands appear in the walkthrough below; please refer to the CUDA-GDB documentation for a full command reference.

GDB quick reference: gdb-refcard.pdf

The walkthrough below assumes the following source code, which performs a simple 8-bit bit reversal on a data set. Extract the downloaded archive with the command unzip -d ./bitreverse.


#include <stdio.h>
#include <stdlib.h>

// Simple 8-bit bit reversal Compute test

#define N 256

__global__ void bitreverse(unsigned int *data)
{
    unsigned int *idata = data;

    unsigned int x = idata[threadIdx.x];

    x = ((0xf0f0f0f0 & x) >> 4) | ((0x0f0f0f0f & x) << 4);
    x = ((0xcccccccc & x) >> 2) | ((0x33333333 & x) << 2);
    x = ((0xaaaaaaaa & x) >> 1) | ((0x55555555 & x) << 1);

    idata[threadIdx.x] = x;
}

int main(void)
{
    unsigned int *d = NULL; int i;
    unsigned int idata[N], odata[N];

    for (i = 0; i < N; i++)
         idata[i] = (unsigned int)i;

    cudaMalloc((void**)&d, sizeof(int)*N);
    cudaMemcpy(d, idata, sizeof(int)*N, cudaMemcpyHostToDevice);

    bitreverse<<<1, N>>>(d);

    cudaMemcpy(odata, d, sizeof(int)*N, cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (i = 0; i < N; i++)
        printf("%u -> %u\n", idata[i], odata[i]);

    return 0;
}

Sample CUDA-GDB session

Note that this example works only on the node cuda1, so make sure you log in to cuda1 before starting.

Step 1: Compile the CUDA application for debugging by entering the following command at a shell prompt:

[jouser@cuda1 bitreverse]$ make

Note: The above command assumes that no additional compiler flags are required for compilation.

Step 2: Start the CUDA debugger by entering the following command at a shell prompt:

[jouser@cuda1 bitreverse]$ cuda-gdb bitreverse

Step 3: Set breakpoints. We will set both host (main) and GPU (bitreverse) breakpoints here. Also, we will set a breakpoint at a particular line in the device function (line 18, where the reversed value is written back):

(cuda-gdb) break main
Breakpoint 1 at 0x8051e8c: file, line 23.
(cuda-gdb) break bitreverse
Breakpoint 2 at 0x805b4f6: file, line 9.
(cuda-gdb) break
Breakpoint 3 at 0x805b4fb: file, line 18.

Step 4: Run the CUDA application.

(cuda-gdb) run

The application will begin executing until it reaches the first breakpoint set in Step 3 (main).

Breakpoint 1, main() at
unsigned int *d = NULL; int i;

At this point, commands can be entered to advance execution and/or print program state. For this walkthrough, we will continue to the device kernel.

(cuda-gdb) continue
[Current CUDA Thread <<<(0,0),(0,0,0)>>>]
Breakpoint 2, bitreverse() at
unsigned int *idata = data;

CUDA-GDB has detected that we have reached a CUDA device kernel, so it prints the current CUDA thread of focus. This can be verified by the thread command:

(cuda-gdb) thread
[Current Thread 2 (Thread 1584864 (LWP 9146))]
[Current CUDA Thread <<<(0,0),(0,0,0)>>>]

The above output tells us that the host thread of focus has LWP ID of 9146 and the current CUDA Thread has block coordinates (0,0) and thread coordinates (0,0,0). This can be further identified by printing the block and thread indices:

(cuda-gdb) print blockIdx
$1 = {x = 0, y = 0}
(cuda-gdb) print threadIdx
$2 = {x = 0, y = 0, z = 0}

We can also print the grid and block dimensions:

(cuda-gdb) print gridDim
$3 = {x = 1, y = 1}
(cuda-gdb) print blockDim
$4 = {x = 256, y = 1, z = 1}

Since thread (0,0,0) will be reversing the value of 0, we will switch to a different thread to show more interesting data:

(cuda-gdb) thread <<<170>>>
Switching to <<<(0,0),(170,0,0)>>> bitreverse () at
         unsigned int *idata = data;

Now, we will advance execution to verify the data value that thread (170,0,0) should be working on:

(cuda-gdb) next
[Current CUDA Thread <<<(0,0),(170,0,0)>>>]
bitreverse () at
unsigned int x = idata[threadIdx.x];
(cuda-gdb) next
[Current CUDA Thread <<<(0,0),(170,0,0)>>>]
bitreverse () at
x = ((0xf0f0f0f0 & x) >> 4) | ((0x0f0f0f0f & x) << 4);
(cuda-gdb) print x
$5 = 170
(cuda-gdb) print/x x
$6 = 0xaa

We have verified that thread (170,0,0) is working on the correct data (170). Now we will use our last breakpoint (set at line 18) to verify that our logic correctly reverses the original data:

(cuda-gdb) continue
[Current CUDA Thread <<<(0,0),(170,0,0)>>>]
Breakpoint 3, bitreverse() at
idata[threadIdx.x] = x;
(cuda-gdb) print x
$7 = 85
(cuda-gdb) print/x x
$8 = 0x55

Now we will delete our breakpoints and continue the program to completion:

(cuda-gdb) delete b
Delete all breakpoints? (y or n) y
(cuda-gdb) continue
Program exited normally.
(cuda-gdb) quit
[jouser@cuda1 bitreverse]$

This concludes the CUDA-GDB walkthrough.

Ⅶ. Using Multiple GPUs

1. Choosing among multiple GPUs


When running a CUDA program on a machine with multiple GPUs, by default CUDA kernels will execute on whichever GPU is installed in the primary graphics card slot. On some machines, the default GPU may be significantly slower and potentially not provide advanced features like atomic instructions or double-precision floating point. If you know that you want to execute on the Nth GPU, calling cudaSetDevice(N) will make your code execute on the correct GPU. However, this is not a robust solution because the cards may be switched in the future and the Nth card on a different machine might not be the card you want to use.

For a more robust solution, include the following code somewhere at the beginning of your program:

int num_devices, device;
cudaGetDeviceCount(&num_devices);
if (num_devices > 1) {
      int max_multiprocessors = 0, max_device = 0;
      for (device = 0; device < num_devices; device++) {
              cudaDeviceProp properties;
              cudaGetDeviceProperties(&properties, device);
              if (max_multiprocessors < properties.multiProcessorCount) {
                      max_multiprocessors = properties.multiProcessorCount;
                      max_device = device;
              }
      }
      cudaSetDevice(max_device);
}
This code automatically selects the card with the largest number of multiprocessors, which should pick the "best" card in most cases. A few notes about this code:

The template SDK project which many people start working with includes a call to CUDA_DEVICE_INIT. This function (actually a macro) will automatically set the device back to the default device. The most obvious solution is to simply remove the CUDA_DEVICE_INIT call. Alternatively, adding a call to cudaFree(0) after the code above but before CUDA_DEVICE_INIT will also solve the problem. A third approach is to move the code above to after the CUDA_DEVICE_INIT call. This works because it is the last call to cudaSetDevice before the first interaction with the GPU (for example, calling cudaFree) that decides which device your code will execute on. For this reason, you must ensure that the code above runs before any CUDA-related function calls, such as cudaMalloc or cudaMemcpy.

Compiling this code with GCC has been problematic; there probably is a workaround, but the simplest approach is to include this code only in a .cu file, not a .c file (this probably will not apply to most CUDA developers).

2. Using multiple GPUs

Since NVIDIA's GPU driver allows only one CPU thread to talk to a given GPU device at a time, you'll need to use multiple CPU threads to drive multiple GPU devices in a single program. The key is that each GPU must be controlled by a separate thread on the host CPU.
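The one-thread-per-GPU pattern can be sketched with OpenMP as follows. This is a minimal sketch of the idea, not the actual SDK sample source; it simply maps OpenMP thread i to CUDA device i, as the cudaOpenMP output in Example 2 below illustrates.

```c
#include <omp.h>
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    printf("number of CUDA devices: %d\n", num_gpus);

    // spawn one CPU thread per GPU; each thread binds itself to one device
    omp_set_num_threads(num_gpus);
#pragma omp parallel
    {
        int cpu_thread_id = omp_get_thread_num();
        cudaSetDevice(cpu_thread_id);
        printf("CPU thread %d (of %d) uses CUDA device %d\n",
               cpu_thread_id, omp_get_num_threads(), cpu_thread_id);
        // any kernels launched from this thread now run on that device
    }
    return 0;
}
```

Because the device binding is per-thread, each thread may launch kernels and issue memory copies independently for the lifetime of the parallel region.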

a) Example 1

A simple example of using multiple GPUs can be found in the CUDA SDK examples.

Or download from the following links:

Download - Windows (x86)

Download - Windows (x64)

Download - Linux/Mac

b) Example 2

An example of using multiple GPU devices with OpenMP and CUDA

Compile and run the sample code interactively on cuda1

Login to cuda1 from mio.


Build CUDA program with OpenMP support

nvcc -Xcompiler -fopenmp -lgomp -o cudaOpenMP

The above command compiles with OpenMP support from GNU compiler version 4.1.2.

Run the sample program interactively


number of host CPUs:    4
number of CUDA devices: 2
   0: Tesla C1060
   1: Tesla C1060
CPU thread 1 (of 2) uses CUDA device 1
CPU thread 0 (of 2) uses CUDA device 0

For more information: an example of the evolution of the tool sets is that the latest version of CUDA (4.0) now allows multiple GPUs to communicate directly over the PCIe bus. Prior to this version, if a GPU needed to send data to another GPU, it first had to send the data to the CPU, where it was likely copied between cores before being sent on to the target GPU. This means a great deal of data had to be moved across the PCIe bus, which takes more time and makes the application run slowly. In addition, there is a memory copy involving at least two cores on the host system, which also takes time and uses memory on the host. All of this data movement and copying results in reduced performance and increased host resource usage (i.e., you have to buy more memory).

Now with CUDA 4.0 it is possible to have one GPU send data to another GPU within the same host over the PCIe bus without the involvement of any CPUs. This cuts the data movement across the PCIe bus in half and eliminates the memory copy on the host system. There are other far-reaching consequences of this capability that we are likely to see in the coming years as GPU design evolves.
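The GPU-to-GPU copy described above can be sketched with the CUDA 4.0 runtime API. This is a sketch only: d0 and d1 are assumed to be device pointers already allocated (with cudaMalloc) on devices 0 and 1, and the devices must actually support peer access, which the code checks first.

```c
#include <cuda_runtime.h>
#include <stddef.h>

// Copy 'count' floats from device 0 (d0) to device 1 (d1) without
// staging through host memory, when the hardware allows it.
void copy_gpu0_to_gpu1(float *d1, const float *d0, size_t count)
{
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 1, 0);  // can device 1 reach device 0?
    if (can_access) {
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);        // enable once per device pair
    }
    // Direct GPU-to-GPU transfer over PCIe; the runtime falls back to
    // staging through the host if peer access is not available.
    cudaMemcpyPeer(d1, 1, d0, 0, count * sizeof(float));
}
```

Note that cudaDeviceEnablePeerAccess only needs to be called once per direction per device pair; calling it again returns an error that can be ignored or checked for.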

Ⅷ. Combining MPI and CUDA

The picture below shows an overview of the MPI communication, GPU memory transfers, and the intra-GPU 1D decomposition used for overlapping.

1. Example 1

simpleMPI is a simple example demonstrating how to use MPI in combination with CUDA. This executable is not pre-built by the SDK installer.

Please download from the following links:

Download - Windows (x86)

Download - Windows (x64)

Download - Linux/Mac

2. Example 2

Combining MPI and CUDA

The simplest way forward is to use nvcc for everything. The nvcc compiler wrapper is somewhat more complex than the typical mpicc compiler wrapper, so it's easier to rename MPI source files to .cu and compile them with nvcc than the other way around. A sample makefile might resemble:

Create a new empty directory mpi-gpu:

mkdir mpi-gpu
cd mpi-gpu


  MPICC        := nvcc -Xptxas -v
  MPI_INCLUDES := /usr/mpi/intel/mvapich2-1.2p1/include
  MPI_LIBS     := /usr/mpi/intel/mvapich2-1.2p1/lib

  all : mpi_hello_gpu

  %.o : %.cu
	$(MPICC) -I$(MPI_INCLUDES) -o $@ -c $<

  mpi_hello_gpu : vecadd.o mpi_hello_gpu.o
	$(MPICC) -L$(MPI_LIBS) -lmpich -o $@ *.o

  clean :
	rm vecadd.o mpi_hello_gpu.o

You might want to change

MPI_INCLUDES    := /usr/mpi/intel/mvapich2-1.2p1/include 
MPI_LIBS   := /usr/mpi/intel/mvapich2-1.2p1/lib


MPI_INCLUDES    := /opt/lib/openmpi/1.4.2/intel/11.1/include 
MPI_LIBS   := /opt/lib/openmpi/1.4.2/intel/11.1/lib

if you try to run this code on Mio.

Source code files follow.

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>
  #define PPN 4
  #define INTARRAYLEN 65535
  #define BCASTREPS 1000

  int main(int argc, char *argv[])
  {
      int bcastme[INTARRAYLEN], ranksum;
      int rank, size, len;
      int gpudevice;
      int vecadd(int, int);
      char name[MPI_MAX_PROCESSOR_NAME];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Get_processor_name(name, &len);

      // do some MPI work, showing MPI and CUDA being run from one routine
      if (rank == 0) { bcastme[3] = 3; }
      for (int i = 0; i < BCASTREPS; i++)
          MPI_Bcast(bcastme, INTARRAYLEN, MPI_INT, 0, MPI_COMM_WORLD);

      // modulo is useful in determining unique gpu device ids if ranks
      // are packed into nodes and not assigned in round robin fashion
      gpudevice = rank % PPN;
      printf("rank %d of %d on %s received bcastme[3]=%d [gpu %d]\n",
             rank, size, name, bcastme[3], gpudevice);
      vecadd(gpudevice, rank);

      // more MPI work showing MPI is functional after CUDA
      MPI_Reduce(&rank, &ranksum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0) { printf("ranksum= %d\n", ranksum); }

      MPI_Finalize();
      return 0;
  }

Parameters for passing the MPI rank and selecting a gpu were added to vecadd.

  #include <stdio.h>
  #include <stdlib.h>
  #define SIZE 10
  #define KERNELINVOKES 5000000

  // Kernel definition
  __global__ void vecAdd(float* A, float* B, float* C)
  {
      int i = threadIdx.x;
      C[i] = A[i] + B[i];
  }

  int vecadd(int gpudevice, int rank)
  {
      int devcheck(int, int);
      devcheck(gpudevice, rank);

      float A[SIZE], B[SIZE], C[SIZE];
      for (int i = 0; i < SIZE; i++) {  // initialize host input vectors
          A[i] = (float)i;
          B[i] = (float)i;
      }

      // Kernel invocation
      float *devPtrA;
      float *devPtrB;
      float *devPtrC;
      int memsize = SIZE * sizeof(float);

      cudaMalloc((void**)&devPtrA, memsize);
      cudaMalloc((void**)&devPtrB, memsize);
      cudaMalloc((void**)&devPtrC, memsize);
      cudaMemcpy(devPtrA, A, memsize, cudaMemcpyHostToDevice);
      cudaMemcpy(devPtrB, B, memsize, cudaMemcpyHostToDevice);
      for (int i = 0; i < KERNELINVOKES; i++)
          vecAdd<<<1, gpudevice>>>(devPtrA, devPtrB, devPtrC);
      cudaMemcpy(C, devPtrC, memsize, cudaMemcpyDeviceToHost);

      // calculate only up to gpudevice to show the unique output
      // of each rank's kernel launch
      for (int i = 0; i < gpudevice; i++)
          printf("rank %d: C[%d]=%f\n", rank, i, C[i]);

      cudaFree(devPtrA);
      cudaFree(devPtrB);
      cudaFree(devPtrC);
      return 0;
  }

  int devcheck(int gpudevice, int rank)
  {
      int device_count = 0;
      int device;   // used with cudaGetDevice() to verify cudaSetDevice()

      cudaGetDeviceCount(&device_count);
      if (gpudevice >= device_count) {
          printf("gpudevice >= device_count ... exiting\n");
          exit(1);
      }

      cudaError_t cudareturn;
      cudaDeviceProp deviceProp;
      cudaGetDeviceProperties(&deviceProp, gpudevice);
      if (deviceProp.warpSize <= 1) {
          printf("rank %d: warning, CUDA Device Emulation (CPU) detected, exiting\n", rank);
          exit(1);
      }

      cudareturn = cudaSetDevice(gpudevice);
      if (cudareturn == cudaErrorInvalidDevice) {
          perror("cudaSetDevice returned cudaErrorInvalidDevice");
      } else {
          cudaGetDevice(&device);
          printf("rank %d: cudaGetDevice()=%d\n", rank, device);
      }
      return 0;
  }

Ⅸ. Tips For Efficiency and useful links

1. Tips For Efficiency:

To boost computation as much as possible when writing your own CUDA programs, follow these tips: minimize host-device memory transfers, keep enough threads in flight to hide memory latency, and use shared memory to avoid redundant global memory accesses.

2. Measuring kernel runtime:

There are a number of possible ways to measure the runtime of a CUDA kernel (or any other operation). The most portable option is to use CUDA's built-in timer functions, which will work across different operating systems. To measure runtime, you need to add two steps to your code:

a) Before calling the kernel, you need to create and start a timer:

unsigned int timer;
cutCreateTimer(&timer);
cutStartTimer(timer);

b) After calling the kernel, you need to make sure that the kernel has finished and then stop and read the timer.

cudaThreadSynchronize();
cutStopTimer(timer);
printf("Kernel: %f seconds\n", cutGetTimerValue(timer) / 1000);

The call to the cudaThreadSynchronize function is necessary because CUDA kernel calls are non-blocking, meaning that the statement immediately following the kernel invocation can be executed before the kernel has actually completed. The cudaThreadSynchronize function explicitly forces the program to wait until the kernel has completed.

Note that the value returned by the cutGetTimerValue function is the elapsed time in milliseconds, so you need to divide by 1,000 in order to get the time in seconds. Also, if you reuse a timer by calling cutStartTimer and cutStopTimer multiple times on the same variable without calling cutDeleteTimer, the cutGetTimerValue function will return the average time across the multiple runs, not the total elapsed time.
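The cutil timer functions above ship only with the SDK. An alternative sketch using CUDA events from the standard runtime API is shown below; mykernel, grid, and block are placeholders for your own kernel launch, not names from this tutorial.

```c
cudaEvent_t start, stop;
float elapsed_ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);            // queue 'start' before the kernel
mykernel<<<grid, block>>>(/* args */);
cudaEventRecord(stop, 0);             // queue 'stop' after the kernel
cudaEventSynchronize(stop);           // wait until 'stop' has been reached

cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("Kernel: %f seconds\n", elapsed_ms / 1000.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Because cudaEventSynchronize waits for the stop event itself, no separate cudaThreadSynchronize call is needed before reading the elapsed time.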

3. Useful links:

Useful links about CUDA programming skills, library usage, and combining CUDA with MPI, Matlab, etc.

Author: Chong Ding

For questions please contact Dr. Timothy Kaiser at or Chong Ding at