The CUDA Developer SDK provides examples with source code, utilities, and white papers to help you get started writing software with CUDA. The SDK includes dozens of code samples covering a wide range of applications including:
This code is released free of charge for use in derivative works, whether academic, commercial, or personal. (Full License)
The NVIDIA CUDA Toolkit is required to compile and run the code samples. Please obtain the CUDA Toolkit here
|
Sobol Quasirandom Number Generator
This sample implements the Sobol quasirandom sequence generator. |
|
or later
Browse Files
|
|
|
Niederreiter Quasirandom Sequence Generator
This sample implements the Niederreiter quasirandom sequence generator and the inverse cumulative normal distribution function for generating standard normal distributions. |
|
or later
Browse Files
|
|
|
CUDA Context Thread Management
Simple program illustrating how to use the CUDA Context Management API. CUDA contexts can be created separately and attached independently to different threads. |
|
or later
Browse Files
|
|
|
DCT8x8
This sample demonstrates how Discrete Cosine Transform (DCT) for blocks of 8 by 8 pixels can be performed using CUDA: a naive implementation by definition and a more traditional approach used in many libraries. As opposed to implementing DCT in a fragment shader, CUDA allows for an easier and more efficient implementation.
|
|
or later
Whitepaper
Browse Files
|
|
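To make the "naive implementation by definition" in the DCT8x8 entry concrete, here is a minimal sequential C++ sketch of an 8-point DCT-II applied to the rows and then the columns of an 8x8 block. It is a CPU reference only, not the sample's GPU code. The orthonormal scaling and the names `dct8`, `dct8x8`, and `Block8x8` are assumptions made for illustration.

```cpp
#include <array>
#include <cmath>

// Naive 8-point DCT-II computed directly from its definition.
// Orthonormal scaling is an assumption of this sketch.
std::array<double, 8> dct8(const std::array<double, 8>& x) {
    const double pi = std::acos(-1.0);
    std::array<double, 8> X{};
    for (int k = 0; k < 8; ++k) {
        double sum = 0.0;
        for (int n = 0; n < 8; ++n)
            sum += x[n] * std::cos(pi * (2 * n + 1) * k / 16.0);
        const double scale = (k == 0) ? std::sqrt(1.0 / 8.0) : std::sqrt(2.0 / 8.0);
        X[k] = scale * sum;
    }
    return X;
}

using Block8x8 = std::array<std::array<double, 8>, 8>;

// 2D DCT of one 8x8 block: transform every row, then every column.
Block8x8 dct8x8(const Block8x8& in) {
    Block8x8 tmp{}, out{};
    for (int r = 0; r < 8; ++r) tmp[r] = dct8(in[r]);
    for (int c = 0; c < 8; ++c) {
        std::array<double, 8> col{};
        for (int r = 0; r < 8; ++r) col[r] = tmp[r][c];
        const std::array<double, 8> t = dct8(col);
        for (int r = 0; r < 8; ++r) out[r][c] = t[r];
    }
    return out;
}
```

With this scaling, a constant block concentrates all of its energy in the single coefficient `out[0][0]`.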
|
Simple Vote Intrinsics
Simple program which demonstrates how to use the Vote (any, all) intrinsic instruction in a CUDA kernel. Requires Compute Capability 1.2 or higher. |
|
or later
Browse Files
|
|
|
Simple Atomic Intrinsics
A simple demonstration of global memory atomic instructions. Requires Compute Capability 1.1 or higher. |
|
or later
Browse Files
|
|
|
Marching Cubes Isosurfaces
This sample extracts a geometric isosurface from a volume dataset using the marching cubes algorithm. It uses the scan (prefix sum) function from the CUDPP library to perform stream compaction. |
|
or later
Browse Files
|
|
|
Monte Carlo Option Pricing with multi-GPU support
This sample evaluates the fair call price for a given set of European options using the Monte Carlo approach, taking advantage of all CUDA-capable GPUs installed in the system. This sample uses double precision hardware if a GTX 200 class GPU is present. |
|
or later
Whitepaper
Browse Files
|
|
|
FFT Ocean Simulation
This sample simulates an Ocean heightfield using CUFFT and renders the result using OpenGL. |
|
or later
Browse Files
|
|
|
256-bin Histogram
This sample demonstrates three efficient implementations of a 256-bin histogram, targeting Compute Capability 1.0, 1.1, and 1.2 class hardware respectively. |
|
or later
Whitepaper
Browse Files
|
|
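As a point of reference for the 256-bin Histogram entry, the result such a sample has to reproduce can be sketched sequentially in a few lines of C++. This is a CPU reference under the assumption of 8-bit input data. The function name `histogram256` is hypothetical and does not come from the SDK.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Sequential 256-bin histogram of byte data: bins[v] counts occurrences of value v.
std::array<std::uint32_t, 256> histogram256(const std::vector<std::uint8_t>& data) {
    std::array<std::uint32_t, 256> bins{};  // zero-initialized
    for (std::uint8_t v : data) ++bins[v];
    return bins;
}
```

A parallel version has to resolve collisions when several threads increment the same bin, which is presumably why the entry distinguishes hardware with and without atomic instructions.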
|
64-bin Histogram
This sample demonstrates an efficient implementation of a 64-bin histogram.
|
|
or later
Browse Files
|
|
|
Separable Convolution
This sample implements a separable convolution filter of a 2D signal with a Gaussian kernel. |
|
or later
Whitepaper
Browse Files
|
|
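The idea behind the Separable Convolution entry is that a 2D filter which factors into two 1D filters can be applied as a row pass followed by a column pass. The following sequential C++ sketch shows that structure. Treating out-of-image pixels as zero, requiring an odd kernel length, and the name `convolveSeparable` are assumptions of this sketch, not details taken from the sample.

```cpp
#include <vector>

// Separable convolution of a row-major width x height image with one 1D kernel
// (odd length), applied along rows and then along columns.
// Pixels outside the image are treated as zero.
std::vector<float> convolveSeparable(const std::vector<float>& img, int width, int height,
                                     const std::vector<float>& kernel) {
    const int radius = static_cast<int>(kernel.size()) / 2;
    std::vector<float> tmp(img.size(), 0.0f), out(img.size(), 0.0f);
    // Row pass: img -> tmp
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                const int xx = x + k;
                if (xx >= 0 && xx < width) sum += img[y * width + xx] * kernel[radius - k];
            }
            tmp[y * width + x] = sum;
        }
    // Column pass: tmp -> out
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                const int yy = y + k;
                if (yy >= 0 && yy < height) sum += tmp[yy * width + x] * kernel[radius - k];
            }
            out[y * width + x] = sum;
        }
    return out;
}
```

For a kernel of length K, the two passes cost about 2K multiply-adds per pixel instead of the K*K needed by the equivalent non-separable 2D filter.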
|
Texture-based Separable Convolution
Texture-based implementation of a separable 2D convolution with a Gaussian kernel. Used for performance comparison against convolutionSeparable. |
|
or later
Browse Files
|
|
|
FFT-Based 2D Convolution
This sample demonstrates how 2D convolutions with very large kernel sizes can be efficiently implemented using FFT transformations. |
|
or later
Whitepaper
Browse Files
|
|
|
MersenneTwister
This sample implements the Mersenne Twister random number generator and the Cartesian Box-Muller transformation on the GPU. |
|
or later
Whitepaper
Browse Files
|
|
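The Cartesian Box-Muller transformation named in the MersenneTwister entry turns two uniform random numbers into two independent standard normal values. A minimal C++ sketch follows. It takes the uniforms as plain arguments, so any generator can feed it, and the name `boxMuller` is hypothetical.

```cpp
#include <cmath>
#include <utility>

// Cartesian Box-Muller transform: maps uniforms u1 in (0, 1] and u2 in [0, 1)
// to a pair of independent standard normal values.
std::pair<double, double> boxMuller(double u1, double u2) {
    const double pi = std::acos(-1.0);
    const double r = std::sqrt(-2.0 * std::log(u1));  // radius from the first uniform
    const double phi = 2.0 * pi * u2;                 // angle from the second uniform
    return {r * std::cos(phi), r * std::sin(phi)};
}
```

Note that `u1` must be strictly greater than zero, since the logarithm of zero is undefined.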
|
Monte Carlo Option Pricing
This sample evaluates the fair call price for a given set of European options using the Monte Carlo approach. This sample uses double precision hardware if a GTX 200 class GPU is present. |
|
or later
Whitepaper
Browse Files
|
|
|
Black-Scholes Option Pricing
This sample evaluates fair call and put prices for a given set of European options using the Black-Scholes formula. |
|
or later
Whitepaper
Browse Files
|
|
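For the Black-Scholes Option Pricing entry, the closed-form formula for a European call and put can be sketched in C++ as below. This is a sequential reference for a single option, not the sample's kernel. The names `OptionPrices`, `normalCdf`, and `blackScholes` are hypothetical, and the cumulative normal distribution is computed with `std::erfc`, whereas a GPU version may prefer a polynomial approximation.

```cpp
#include <cmath>

struct OptionPrices {
    double call;
    double put;
};

// Cumulative distribution function of the standard normal distribution.
double normalCdf(double x) { return 0.5 * std::erfc(-x / std::sqrt(2.0)); }

// Black-Scholes prices of a European call and put.
// S: spot price, K: strike, T: years to expiry, r: risk-free rate, sigma: volatility.
OptionPrices blackScholes(double S, double K, double T, double r, double sigma) {
    const double sqrtT = std::sqrt(T);
    const double d1 = (std::log(S / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * sqrtT);
    const double d2 = d1 - sigma * sqrtT;
    const double discount = std::exp(-r * T);
    const double call = S * normalCdf(d1) - K * discount * normalCdf(d2);
    const double put = K * discount * normalCdf(-d2) - S * normalCdf(-d1);
    return {call, put};
}
```

A useful sanity check is put-call parity: `call - put` should equal `S - K * exp(-r * T)`.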
|
Binomial Option Pricing
This sample evaluates the fair call price for a given set of European options under the binomial model. This sample will also take advantage of double precision if a GTX 200 class GPU is present. |
|
or later
Whitepaper
Browse Files
|
|
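The binomial model in the Binomial Option Pricing entry prices an option by building a tree of possible asset prices and rolling the payoff back to the present. Below is a minimal C++ sketch for one European call. The Cox-Ross-Rubinstein parameterization and the name `binomialCall` are assumptions of this sketch, and the sample may parameterize its tree differently.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// European call price on a recombining binomial tree
// (Cox-Ross-Rubinstein parameterization, assumed here).
double binomialCall(double S, double K, double T, double r, double sigma, int steps) {
    const double dt = T / steps;
    const double u = std::exp(sigma * std::sqrt(dt));  // up factor
    const double d = 1.0 / u;                          // down factor
    const double growth = std::exp(r * dt);
    const double p = (growth - d) / (u - d);           // risk-neutral up probability
    const double discount = 1.0 / growth;

    // Payoffs at expiry: node i has seen i up-moves and (steps - i) down-moves.
    std::vector<double> v(steps + 1);
    for (int i = 0; i <= steps; ++i)
        v[i] = std::max(S * std::pow(u, 2 * i - steps) - K, 0.0);

    // Roll back one time step at a time, discounting the expected value.
    for (int n = steps; n > 0; --n)
        for (int i = 0; i < n; ++i)
            v[i] = discount * ((1.0 - p) * v[i] + p * v[i + 1]);
    return v[0];
}
```

As the number of steps grows, the result approaches the Black-Scholes price for the same parameters.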
|
Image denoising
This sample demonstrates two adaptive image denoising techniques, KNN and NLM, based on computation of both geometric and color distance between texels. While both techniques are implemented in the DirectX SDK using shaders, a massively sped-up variation of the latter technique, taking advantage of shared memory, is implemented in addition to the DirectX counterparts. |
|
or later
Whitepaper
Browse Files
|
|
|
DirectX Texture Compressor (DXTC)
High Quality DXT Compression using CUDA.
This example shows how to implement an existing computationally-intensive CPU compression algorithm in parallel on the GPU, and obtain an order of magnitude performance improvement. |
|
or later
Whitepaper
Browse Files
|
|
|
Post-Process in OpenGL
This sample shows how to post-process an image rendered in OpenGL using CUDA. |
|
or later
Browse Files
|
|
|
Recursive Gaussian Filter
This sample implements a Gaussian blur using Deriche's recursive method. The advantage of this method is that the execution time is independent of the filter width. |
|
or later
Browse Files
|
|
|
Box Filter
Fast image box filter using CUDA with OpenGL rendering. |
|
or later
Browse Files
|
|
|
Bitonic Sort
Bitonic sort is a very simple parallel sorting algorithm that is very
efficient when sorting a small number of elements:
http://citeseer.ist.psu.edu/blelloch98experimental.html
This implementation is based on:
http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm
|
|
or later
Browse Files
|
|
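The compare-and-swap network behind the Bitonic Sort entry can be written as three nested loops. In the sequential C++ sketch below, the innermost loop stands in for the work that independent GPU threads would do in parallel. The name `bitonicSort` is hypothetical, and the input length is assumed to be a power of two.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Bitonic sort, ascending. a.size() must be a power of two.
void bitonicSort(std::vector<int>& a) {
    const std::size_t n = a.size();
    for (std::size_t k = 2; k <= n; k <<= 1)              // size of the sequences being merged
        for (std::size_t j = k >> 1; j > 0; j >>= 1)      // distance between compared elements
            for (std::size_t i = 0; i < n; ++i) {         // one iteration per would-be thread
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    const bool ascending = (i & k) == 0;  // sort direction of this subsequence
                    if ((a[i] > a[partner]) == ascending) std::swap(a[i], a[partner]);
                }
            }
}
```

Every pass over `i` touches disjoint pairs, so the comparisons within one `(k, j)` stage are independent of each other.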
|
Matrix Transpose
Efficient matrix transpose. |
|
or later
Browse Files
|
|
|
Scalar Product
This sample calculates scalar products of a given set of input vector pairs. |
|
or later
Browse Files
|
|
|
Clock
This example shows how to use the clock function to accurately measure the performance of a kernel. |
|
or later
Browse Files
|
|
|
Simple multi-GPU
This application demonstrates how to use the CUDA API to use multiple GPUs.
|
|
or later
Browse Files
|
|
|
Aligned Types
A simple test showing the large gap in access speed between aligned and misaligned structures. |
|
or later
Browse Files
|
|
|
simpleZeroCopy
This sample illustrates how to use zero-copy memory: kernels can read and write directly to pinned system memory. This sample requires GPUs that support this feature (MCP79 and GT200). |
|
or later
Whitepaper
Browse Files
|
|
|
threadFenceReduction
This sample shows how to perform a reduction operation on an array of values using the thread fence intrinsic to produce a single value in a single kernel (as opposed to two or more kernel calls as shown in the "reduction" SDK sample). Single-pass reduction requires global atomic instructions (Compute Capability 1.1 or later) and the __threadfence() intrinsic (CUDA 2.2 or later). |
|
or later
Browse Files
|
|
|
Device Query
This sample enumerates the properties of the CUDA devices present in the system. |
|
or later
Browse Files
|
|
|
Smoke Particles
Smoke simulation with volumetric shadows using the half-angle slicing technique. Uses CUDA for procedural simulation and sorting, and OpenGL for rendering. |
|
or later
Whitepaper
Browse Files
|
|
|
Bicubic Texture Filtering
This sample demonstrates how to efficiently implement bicubic texture filtering in CUDA. |
|
or later
Browse Files
|
|
|
Volume rendering
This sample demonstrates basic volume rendering using 3D textures. |
|
or later
Browse Files
|
|
|
Simple Texture 3D
Simple example that demonstrates use of 3D textures in CUDA. |
|
or later
Browse Files
|
|
|
Line of Sight
This sample is an implementation of a simple line-of-sight algorithm: Given a height map and a ray originating at some observation point, it computes all the points along the ray that are visible from the observation point. The implementation is based on the parallel scan primitive provided by the CUDPP library (http://www.gpgpu.org/developer/cudpp/). |
|
or later
Browse Files
|
|
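The Line of Sight entry reduces visibility to a scan: a point on the ray is visible when its viewing slope is at least as large as the largest slope of every point closer to the observer. The sequential C++ sketch below keeps that running maximum in a variable, which is the serial equivalent of a max-scan. The name `lineOfSight`, the uniform sample spacing, and the layout of the height samples are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// terrain[i] is the terrain height at distance (i + 1) * spacing from the observer,
// whose eye is at height eyeHeight. Returns one visibility flag per terrain sample.
std::vector<bool> lineOfSight(const std::vector<double>& terrain, double eyeHeight,
                              double spacing) {
    std::vector<bool> visible(terrain.size());
    double maxSlope = -std::numeric_limits<double>::infinity();  // max over closer points
    for (std::size_t i = 0; i < terrain.size(); ++i) {
        const double distance = static_cast<double>(i + 1) * spacing;
        const double slope = (terrain[i] - eyeHeight) / distance;
        visible[i] = slope >= maxSlope;  // nothing closer rises above this sight line
        maxSlope = std::max(maxSlope, slope);
    }
    return visible;
}
```

In a parallel version, the slopes can be computed independently and the running maximum replaced by a max-scan over them.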
|
N-Body Simulation
This sample demonstrates an efficient all-pairs computation for a gravitational n-body simulation in CUDA. This sample accompanies the GPU Gems 3 chapter "Fast N-Body Simulation with CUDA". |
|
or later
Whitepaper
Browse Files
|
|
|
Parallel Reduction
A parallel sum reduction that computes the sum of large arrays of values. This sample demonstrates several important optimization strategies for parallel algorithms like reduction. |
|
or later
Whitepaper
Browse Files
|
|
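The tree-shaped access pattern of a parallel sum reduction, as in the Parallel Reduction entry, can be simulated sequentially. In the C++ sketch below, each iteration of the inner loop corresponds to one thread adding a pair of elements, and the stride halves after every step. This "sequential addressing" layout is one common formulation and is an assumption here, not necessarily the variant the sample settles on. The name `treeReduceSum` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Sum of all elements using a tree-shaped reduction with a halving stride.
float treeReduceSum(std::vector<float> data) {
    if (data.empty()) return 0.0f;
    // Pad with zeros (the identity for addition) up to the next power of two.
    std::size_t n = 1;
    while (n < data.size()) n <<= 1;
    data.resize(n, 0.0f);
    for (std::size_t stride = n / 2; stride > 0; stride >>= 1)
        for (std::size_t i = 0; i < stride; ++i)  // each i would be one thread
            data[i] += data[i + stride];
    return data[0];
}
```

The number of steps is logarithmic in the array length, and within a step every addition is independent of the others.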
|
asyncAPI
This sample uses CUDA streams and events to overlap execution on CPU and GPU. |
|
or later
Browse Files
|
|
|
simpleStreams
This sample uses CUDA streams to overlap kernel executions with memcopies between the device and the host. Requires Compute Capability 1.1 or higher. |
|
or later
Browse Files
|
|
|
Mandelbrot
This sample uses CUDA to compute and display the Mandelbrot or Julia sets interactively. It also illustrates the use of "double single" arithmetic to improve precision when zooming a long way into the pattern. This sample uses double precision hardware if a GT200 class GPU is present. Thanks to Mark Granger of NewTek who submitted this sample to the SDK! |
|
or later
Browse Files
|
|
|
Particles
This sample uses CUDA to simulate and visualize a large set of particles and their physical interaction. It implements a uniform grid data structure using either a fast radix sort or atomic operations. |
|
or later
Whitepaper
Browse Files
|
|
|
Fast Walsh Transform
Naturally (Hadamard)-ordered Fast Walsh Transform for batched vectors of arbitrary eligible (power of two) lengths. |
|
or later
Browse Files
|
|
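A Hadamard-ordered Fast Walsh Transform, as in the Fast Walsh Transform entry, consists of log2(n) butterfly stages of sums and differences. The following C++ sketch transforms one vector in place. The name `fastWalshTransform` is hypothetical, the result is left unnormalized, and the batching mentioned in the entry is omitted.

```cpp
#include <cstddef>
#include <vector>

// In-place, unnormalized Walsh-Hadamard transform in natural (Hadamard) order.
// a.size() must be a power of two.
void fastWalshTransform(std::vector<double>& a) {
    const std::size_t n = a.size();
    for (std::size_t len = 1; len < n; len <<= 1)            // butterfly half-width
        for (std::size_t base = 0; base < n; base += 2 * len)
            for (std::size_t i = base; i < base + len; ++i) {
                const double x = a[i];
                const double y = a[i + len];
                a[i] = x + y;
                a[i + len] = x - y;
            }
}
```

Applying the transform twice returns the original vector scaled by its length, which makes a convenient self-check.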
|
Eigenvalues
The computation of all or a subset of all eigenvalues is an important problem in linear algebra, statistics, physics, and many other fields. This sample demonstrates a parallel implementation of a bisection algorithm for the computation of all eigenvalues of a tridiagonal symmetric matrix of arbitrary size with CUDA. |
|
or later
Whitepaper
Browse Files
|
|
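The bisection approach in the Eigenvalues entry rests on one primitive: counting how many eigenvalues of a symmetric tridiagonal matrix lie below a given value, which a Sturm sequence provides. The C++ sketch below uses that count to bisect for each eigenvalue inside the Gershgorin interval. It is a sequential illustration under simplifying assumptions (fixed absolute tolerance, one independent search per eigenvalue), and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Counts eigenvalues below x for the symmetric tridiagonal matrix with
// diagonal d and off-diagonal e (e.size() == d.size() - 1), via a Sturm sequence.
int countEigenvaluesBelow(const std::vector<double>& d, const std::vector<double>& e, double x) {
    int count = 0;
    double q = 1.0;
    for (std::size_t i = 0; i < d.size(); ++i) {
        const double offSq = (i == 0) ? 0.0 : e[i - 1] * e[i - 1];
        q = d[i] - x - offSq / q;
        if (q == 0.0) q = -1e-300;  // nudge off zero so the next division is defined
        if (q < 0.0) ++count;
    }
    return count;
}

// Returns all eigenvalues in ascending order, each located by bisection.
std::vector<double> eigenvaluesByBisection(const std::vector<double>& d,
                                           const std::vector<double>& e,
                                           double tol = 1e-12) {
    const std::size_t n = d.size();
    if (n == 0) return {};
    // Gershgorin circles bound every eigenvalue.
    double lo = d[0], hi = d[0];
    for (std::size_t i = 0; i < n; ++i) {
        const double radius = (i > 0 ? std::fabs(e[i - 1]) : 0.0) +
                              (i + 1 < n ? std::fabs(e[i]) : 0.0);
        lo = std::min(lo, d[i] - radius);
        hi = std::max(hi, d[i] + radius);
    }
    std::vector<double> eig(n);
    for (std::size_t k = 0; k < n; ++k) {
        double a = lo, b = hi;
        for (int it = 0; it < 200 && b - a > tol; ++it) {
            const double mid = 0.5 * (a + b);
            // More than k eigenvalues below mid means the k-th one is left of mid.
            if (countEigenvaluesBelow(d, e, mid) > static_cast<int>(k)) b = mid;
            else a = mid;
        }
        eig[k] = 0.5 * (a + b);
    }
    return eig;
}
```

Each interval can be refined independently of the others, which is what makes the method attractive for a parallel implementation.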
|
Sobel Filter
This sample implements the Sobel edge detection filter for 8-bit monochrome images. |
|
or later
Browse Files
|
|
|
Simple Templates
This sample is a templatized version of the template project. It also shows how to correctly templatize dynamically allocated shared memory arrays. |
|
or later
Browse Files
|
|
|
Bandwidth Test
This is a simple test program to measure the memcopy bandwidth of the GPU. It currently is capable of measuring device to device copy bandwidth, host to device copy bandwidth for pageable and page-locked memory, and device to host copy bandwidth for pageable and page-locked memory. |
|
or later
Browse Files
|
|
|
Scan
This example demonstrates an efficient CUDA implementation of parallel prefix sum, also known as "scan". Given an array of numbers, scan computes a new array in which each element is the sum of all the elements before it in the input array. |
|
or later
Whitepaper
Browse Files
|
|
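The Scan entry defines scan as producing, for each position, the sum of all earlier input elements (an exclusive prefix sum). One well-known work-efficient parallel formulation is the up-sweep/down-sweep scheme usually attributed to Blelloch. The C++ sketch below simulates it sequentially. Using this particular scheme, the power-of-two length restriction, and the name `blellochScan` are assumptions here, and the sample's actual kernels may differ.

```cpp
#include <cstddef>
#include <vector>

// In-place exclusive prefix sum using up-sweep and down-sweep phases.
// a.size() must be a power of two.
void blellochScan(std::vector<int>& a) {
    const std::size_t n = a.size();
    if (n == 0) return;
    // Up-sweep: build partial sums in a balanced tree.
    for (std::size_t stride = 1; stride < n; stride <<= 1)
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride)
            a[i] += a[i - stride];
    // Clear the root, then push prefixes back down the tree.
    a[n - 1] = 0;
    for (std::size_t stride = n / 2; stride > 0; stride >>= 1)
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            const int left = a[i - stride];
            a[i - stride] = a[i];
            a[i] += left;
        }
}
```

Within each stride, the inner-loop iterations touch disjoint elements, so a GPU can assign one thread to each of them.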
|
Scan of Large Arrays
This example demonstrates an efficient CUDA implementation of parallel prefix sum (also known as "scan") for arbitrary-sized arrays. Given an array of numbers, scan computes a new array in which each element is the sum of all the elements before it in the input array. |
|
or later
Whitepaper
Browse Files
|
|
|
Simple Texture (Driver Version)
Simple example that demonstrates use of textures in CUDA using the driver API. |
|
or later
Browse Files
|
|
|
Fluids (OpenGL Version)
An example of fluid simulation using CUDA and CUFFT, with OpenGL rendering. |
|
or later
Browse Files
|
|
|
Simple Texture
Simple example that demonstrates use of textures in CUDA. |
|
or later
Browse Files
|
|
|
Matrix Multiplication (Driver Version)
This sample implements matrix multiplication using the CUDA driver API.
It has been written for clarity of exposition to illustrate various CUDA programming principles, not with the goal of providing the most performant generic kernel for matrix multiplication.
CUBLAS provides high-performance matrix multiplication. |
|
or later
Browse Files
|
|
|
Template
A trivial template project that can be used as a starting point to create new CUDA src. |
|
or later
Browse Files
|
|
|
Simple CUFFT
Example of using CUFFT. In this example, CUFFT is used to compute the 1D-convolution of some signal with some filter by transforming both into frequency domain, multiplying them together, and transforming the signal back to time domain. |
|
or later
Browse Files
|
|
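The Simple CUFFT entry describes convolution by transforming the signal and the filter to the frequency domain, multiplying them, and transforming back. The C++ sketch below shows the same three steps with a naive O(n^2) DFT standing in for an FFT library, so it makes no CUFFT calls. The names `Signal`, `dft`, and `circularConvolve` are hypothetical, and both inputs are assumed to have the same length.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Signal = std::vector<std::complex<double>>;

// Naive discrete Fourier transform; sign = -1 for forward, +1 for inverse (unscaled).
Signal dft(const Signal& x, int sign) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    Signal X(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j) {
            const double angle = sign * 2.0 * pi * static_cast<double>((j * k) % n) / n;
            X[k] += x[j] * std::polar(1.0, angle);
        }
    return X;
}

// Circular convolution of two equal-length real sequences via the frequency domain.
std::vector<double> circularConvolve(const std::vector<double>& signal,
                                     const std::vector<double>& filter) {
    const std::size_t n = signal.size();
    Signal a(signal.begin(), signal.end());
    Signal b(filter.begin(), filter.end());
    Signal A = dft(a, -1);
    const Signal B = dft(b, -1);
    for (std::size_t i = 0; i < n; ++i) A[i] *= B[i];  // pointwise product of spectra
    const Signal y = dft(A, +1);
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = y[i].real() / n;  // inverse scaling
    return out;
}
```

Multiplying spectra yields a circular convolution. To obtain a linear convolution, both inputs need to be zero-padded to at least the combined length minus one before transforming.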
|
Simple OpenGL
Simple program which demonstrates interoperability between CUDA and OpenGL. The program modifies vertex positions with CUDA and uses OpenGL to render the geometry. |
|
or later
Browse Files
|
|
|
Simple CUBLAS
Example of using CUBLAS. |
|
or later
Browse Files
|
|
|
Matrix Multiplication
This sample implements matrix multiplication and is exactly the same as Chapter 6 of the programming guide.
It has been written for clarity of exposition to illustrate various CUDA programming principles, not with the goal of providing the most performant generic kernel for matrix multiplication.
CUBLAS provides high-performance matrix multiplication. |
|
or later
Browse Files
|
|
|
C++ Integration
This example demonstrates how to integrate CUDA into an existing C++ application, i.e. the CUDA entry point on host side is only a function which is called from C++ code and only the file containing this function is compiled with nvcc. It also demonstrates that vector types can be used from cpp. |
|
or later
Browse Files
|
|
|
1D Discrete Haar Wavelet Decomposition
Discrete Haar wavelet decomposition for 1D signals with a length which is a power of 2. |
|
or later
Browse Files
|
|
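The 1D Discrete Haar Wavelet Decomposition entry can be illustrated with a short sequential C++ sketch. Each level replaces neighboring pairs with a scaled sum (approximation) and a scaled difference (detail), then repeats on the approximations. The orthonormal 1/sqrt(2) scaling, the output layout, and the name `haarDecompose` are assumptions of this sketch.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Full Haar decomposition of a signal whose length is a power of two.
// Output layout: [approximation, coarsest detail, ..., finest details].
std::vector<double> haarDecompose(std::vector<double> s) {
    const double invSqrt2 = 1.0 / std::sqrt(2.0);
    std::vector<double> tmp(s.size());
    for (std::size_t len = s.size(); len > 1; len /= 2) {
        const std::size_t half = len / 2;
        for (std::size_t i = 0; i < half; ++i) {
            tmp[i] = (s[2 * i] + s[2 * i + 1]) * invSqrt2;         // approximation
            tmp[half + i] = (s[2 * i] - s[2 * i + 1]) * invSqrt2;  // detail
        }
        for (std::size_t i = 0; i < len; ++i) s[i] = tmp[i];
    }
    return s;
}
```

With this scaling the transform preserves the sum of squares of the signal, which is a handy check when comparing against a GPU result.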