The Future of GPU Computing

Wen-mei Hwu
University of Illinois, Urbana-Champaign
NVIDIA HPC Day, Moscow State University, 2012


Agenda
· Setting the Context
· Current Victories
· Coming Battles
· Conclusion and Outlook



CPUs: Latency Oriented Design
· Large caches
  - Convert long-latency memory accesses into short-latency cache accesses
· Sophisticated control
  - Branch prediction for reduced branch latency
  - Data forwarding for reduced data latency
· Powerful ALU
  - Reduced operation latency
[Figure: CPU block diagram: a few powerful ALUs, large control logic, large cache, DRAM]


GPUs: Throughput Oriented Design
· Small caches
  - To boost memory throughput
· Simple control
  - No branch prediction
  - No data forwarding
· Energy-efficient ALUs
  - Many, long latency, but heavily pipelined for high throughput
· Require a massive number of threads to tolerate latencies
[Figure: GPU block diagram: many simple ALUs, small caches, DRAM]


Winning Applications Use Both CPU and GPU
· CPUs for sequential parts where latency matters
  - CPUs can be 10+X faster than GPUs for sequential code
· GPUs for parallel parts where throughput wins
  - GPUs can be 10+X faster than CPUs for parallel code


GPU Computing Gems: 280 submissions, 90 chapters
· Financial Analysis
· Scientific Simulation
· Engineering Simulation
· Data Intensive Analytics
· Biomedical Informatics
· Medical Imaging
· Digital Audio Processing
· Digital Video Processing
· Ray Tracing Rendering
· Computer Vision
· Electronic Design Automation
· Statistical Modeling
· Interactive Physics
· Numerical Methods

GPU computing is catching on.


A Common GPU Usage Pattern
· Use GPUs for the most time-consuming aspects of a computational problem
  - Kernels in CUDA, OpenCL, OpenACC, etc.
  - Refactor host code to better support kernels and data transfer (see the sketch below)
  - Linear solvers, PDE solvers, convolution filtering (e.g., bilateral Gaussian filters), de novo gene assembly, etc.

· Rethink the domain problem
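The refactoring step above usually reduces to the same host-side skeleton. A minimal sketch of that offload pattern, with a placeholder filter_kernel standing in for the real computation (names and sizes are illustrative, not from the talk):

    // Illustrative only: the generic "offload the hot loop" pattern in CUDA.
    __global__ void filter_kernel(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 0.5f * in[i];         // stand-in for the real computation
    }

    void run_on_gpu(const float *h_in, float *h_out, int n) {
        float *d_in, *d_out;
        size_t bytes = n * sizeof(float);
        cudaMalloc(&d_in, bytes);                 // device buffers
        cudaMalloc(&d_out, bytes);
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    // host-to-device transfer
        int block = 256, grid = (n + block - 1) / block;
        filter_kernel<<<grid, block>>>(d_in, d_out, n);           // the offloaded kernel
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // results back to host
        cudaFree(d_in); cudaFree(d_out);
    }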


CURRENT VICTORIES
HPC Systems



Blue Waters Supercomputer

· Cray system & storage cabinets: >300
· Compute nodes: >25,000
· Usable storage bandwidth: >1 TB/s
· System memory: >1.5 petabytes
· Memory per core module: 4 GB
· Gemini interconnect topology: 3D torus
· Usable storage: >25 petabytes
· Peak performance: >11.5 petaflops
· Number of AMD Interlagos processors: >49,000
· Number of AMD x86 core modules: >380,000
· Number of NVIDIA Kepler GPUs: >3,000


Cray XE6 Nodes
· Dual-socket node
  - Two AMD Interlagos chips
    · 16 core modules, 64 threads
    · 313 GF peak performance
    · 64 GB memory, 102 GB/s memory bandwidth
  - Gemini interconnect
    · Router chip & network interface
    · Injection bandwidth (peak): 9.6 GB/s per direction
Blue Waters contains 22,640 Cray XE6 compute nodes.

Cray XK7 Nodes
· Dual-socket node
  - One AMD Interlagos chip
    · 32 GB memory, 51.2 GB/s memory bandwidth
  - One NVIDIA Kepler chip
    · 1.4 TF peak performance
    · 6 GB GDDR5 memory, 190 GB/s memory bandwidth
  - Gemini interconnect
    · Same as XE6 nodes
Blue Waters contains 3,072 Cray XK7 compute nodes.


Gemini Interconnect Network
· Blue Waters 3D torus size: 23 x 24 x 24
· Service nodes are spread throughout the torus
[Figure: Gemini 3D-torus network connecting Cray XE6 compute nodes and Cray XK7 accelerator nodes with service nodes (boot, system database, login/network gateways, LNET routers); an InfiniBand network links the LNET routers to the Lustre file system, login servers, external networks, GigE/Fibre Channel, SMW, and boot RAID]


Science Area                        | Number of Teams | Codes
Climate and Weather                 | 3               | CESM, GCRM, CM1, HOMME
Plasmas/Magnetosphere               | 2               | H3D(M), OSIRIS, Magtail/UPIC
Stellar Atmospheres and Supernovae  | 2               | PPM, MAESTRO, CASTRO, SEDONA
Cosmology                           | 2               | Enzo, pGADGET
Combustion/Turbulence               | 1               | PSDNS
General Relativity                  | 2               | Cactus, Harm3D, LazEV
Molecular Dynamics                  | 4               | AMBER, Gromacs, NAMD, LAMMPS
Quantum Chemistry                   | 2               | SIAL, GAMESS, NWChem
Material Science                    | 3               | NEMO5, OMEN, GW, QMCPACK
Earthquakes/Seismology              | 2               | AWP-ODC, HERCULES, PLSQR, SPECFEM3D
Quantum Chromodynamics              | 1               | Chroma, MILC, USQCD
Social Networks                     | 1               | EPISIMDEMICS
Evolution                           | 1               | Eve
Computer Science                    | 1               | --
(The original table also marks each code against algorithm patterns: structured grids, unstructured grids, dense matrix, sparse matrix, N-body, Monte Carlo, FFT, and significant I/O.)


CURRENT VICTORIES
HPC applications



NAMD Released GPU Features and Future Plans (100,000 users)
NAMD 2.8
· CUDA features supported: full electrostatics with PME and most simulation features (not alchemical methods), NBFIX parameters
· 100M-atom capability functional on CUDA

NAMD 2.9
· Alchemical free energy perturbation
· Locally enhanced sampling
· Methods that modify nonbonded interactions for small sets of atoms
· New multilevel summation method (MSM)

Longer Term
· Specialized methods such as the Lowe-Andersen thermostat, Go potentials, and tabulated nonbonded interactions
· Various performance improvements, including reducing CPU-side performance bottlenecks by shifting more calculations to the GPU


Pushing Limits of Innovation with NAMD

Performance on one node (ns/day), GPU+CPU vs. CPU only:

Benchmark | Atoms     | GPU+CPU | CPU only
ApoA-1    | 92,224    | 2.81    | 0.68
F1-ATPase | 327,506   | 1.08    | 0.22
STMV      | 1,066,628 | 0.28    | 0.05

Test platform: 1 node, dual Tesla M2070 GPUs (6 GB), dual Intel 4-core Xeon (2.4 GHz), NAMD 2.8, CUDA 4.0, ECC on. Visit www.nvidia.com/simcluster for more information on speedup results, configuration, and test models.


GPU Scaling on NAMD
STMV (100stmv) benchmark on Tsubame 2.0, 9/7/2011, ibverbs-smp build
· NAMD run on Tsubame 2.0 from 64 to 189 nodes, using 1 or 2 Fermi GPUs per node
· Node configurations: dual-socket 6-core Intel Xeon Westmere CPUs only, plus 1 GPU/node or 2 GPUs/node (roughly equivalent to 1 Kepler GPU per node)
[Chart: days/ns vs. number of nodes (64, 128, 189); annotated speedups over the CPU-only runs range from 1.8x to 2.7x with 1 GPU/node and from 4.2x to 4.4x with 2 GPUs/node]


QCD Strong Scaling using GPUs
· General problem: as core counts increase, the ratio of communication to local computation tends to grow. For a sufficient number of cores, the problem becomes communication-bound rather than computation-bound.
· Solution: solvers that minimize communication, such as "domain-decomposition" solvers. An additive Schwarz domain-decomposed preconditioner with a Generalized Conjugate Residual solver (GCR-DD) successfully demonstrates strong scaling.
· Results: strong scaling to 256 GPUs on a 32^3 x 256 lattice in Chroma (Wilson-clover fermions).
· Results: strong scaling to 256 GPUs on a 64^3 x 256 lattice in MILC (improved staggered fermions).
[Charts: sustained strong-scaling performance in Chroma 3.41.0 using the Schwarz generalized conjugate residual solver (GCR-DD), with BiCGstab as the reference Krylov solver, and in MILC 7.6.3 using a mixed-precision conjugate gradient solver parallelized along multiple dimensions]
* Guochun Shi (NCSA), Bálint Joó (Jefferson Lab), Ron Babich (BU), Mike Clark (Harvard), Rich Brower (BU), Steve Gottlieb (Indiana), "Scaling Lattice QCD beyond 100 GPUs," SC11, ACM (Nov. 2011)


USQCD Software GPU Roadmap
· ETA September 2011: exploit GPU Direct and peer-to-peer transfers to minimize inter-GPU communication (not yet supported in QUDA 0.4.0).
· ETA Q3/Q4 2011: multi-GPU DWF fermions.
· ETA fall 2011: adaptive multigrid (MG), expected to deliver O(10)-fold speedup over current solvers.
· ETA Q4 2011 / Q1 2012: refinement of domain-decomposition algorithms. Currently simple block Jacobi; expect significant speedup from overlapping blocks and multiplicative Schwarz (e.g., block Gauss-Seidel).
· Active R&D 2012: exploitation of cache locality (e.g., more efficient use of shared memory to reduce memory traffic); better scaling of GPU cores vs. GPU memory bandwidth.
· ETA 2012: full Hybrid Monte Carlo (e.g., gauge generation) on GPUs, including support for high-order symplectic symmetric integrators, which improves the volume scaling of HMC and will substantially reduce computational cost at large volumes.
· Beta 1H 2012: complete deployment of QCD applications (e.g., Chroma) on GPUs by implementing the domain-specific language (QDP++) in CUDA. Currently pre-alpha; JLab R&D (Jie Chen), Frank Winter (Edinburgh).
· R&D 2012, deployment 2013: combine HMC and MG on GPUs.


Current Science Team GPU Plans and Results
· Nearly 1/3 of PRAC projects have active GPU efforts, including:
  - AMBER
  - LAMMPS
  - USQCD/MILC
  - GAMESS
  - NAMD
  - QMCPACK
  - PLSQR/SPECFEM3D
· Others are investigating use of GPUs (e.g., Cactus, PPM, AWP-ODC)
· Some examples follow


Current Status - NAMD
· Full run with 100stmv on Titan (CUDA implementation)
  - Fermi XK6 vs. XE6: 1.9x with 32 nodes
  - Kepler XK7 vs. XE6: projected 2.9x with 32 nodes
  - Fermi XK6 vs. XE6: 1.4x with 768 nodes
  - Kepler XK7 vs. XE6: projected 1.5x with 768 nodes
  - NAMD is limited by PME on Titan
    · Titan has less bandwidth and longer latency than Blue Waters due to its GPU node placement scheme
    · The NAMD team is working on an alternative PME implementation


Current Status - Chroma
· Lattice QCD parameters: 48^3 x 512 grid running at the physical values of the quark masses on Titan
  - Fermi XK6 vs. XE6: 8.2x with 64 nodes
  - Kepler XK7 vs. XE6: projected 13.2x with 64 nodes
  - Fermi XK6 vs. XE6: 6.1x with 768 nodes
  - Kepler XK7 vs. XE6: projected 9.2x with 768 nodes
· Chroma is not limited by the interconnect


CURRENT VICTORIES
Scalable and Numerically Stable Libraries



Solid Scalable GPU Libraries
· Dense SGEMM/DGEMM, LU, triangular solvers (CUBLAS, CULA, MAGMA)
· Sparse matrix-vector multiplication, tridiagonal solvers (CUSPARSE, QUDA, Parboil)
· FFTs, convolutions (CUFFT, Parboil)
· N-body (NAMD/VMD, FMM BU, Parboil)
· Histograms (Parboil)
· Some PDE solvers (CURRENT, Parboil)


Scalability vs. Numerical Stability: A Major Algorithm Design Challenge
· Parallelism: enough parallelism to fill growing hardware parallelism
· Complexity and data scalability: operations should grow linearly with data size
· Locality: DRAM bursts and cache space utilization
· Regularity: SIMD utilization and load balance
· Numerical stability: pivoting for linear system solvers


Example: Tridiagonal Solver
· Used in implicit finite difference methods, cubic spline interpolation, and preconditioners
· An algorithm to find a solution of Ax = d, where A is an n-by-n tridiagonal matrix and d is an n-element vector
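For reference, the classic sequential solver for such systems is the Thomas algorithm. A minimal sketch (my notation: a, b, c are the sub-, main, and super-diagonals; no pivoting, so it assumes a well-behaved, e.g. diagonally dominant, system):

    // Sequential Thomas algorithm: solves a tridiagonal system in O(n).
    // a = sub-diagonal (a[0] unused), b = main diagonal, c = super-diagonal
    // (c[n-1] unused), d = right-hand side; the solution overwrites d.
    void thomas_solve(float *a, float *b, float *c, float *d, int n) {
        for (int i = 1; i < n; ++i) {            // forward elimination
            float m = a[i] / b[i - 1];
            b[i] -= m * c[i - 1];
            d[i] -= m * d[i - 1];
        }
        d[n - 1] /= b[n - 1];                    // back substitution
        for (int i = n - 2; i >= 0; --i)
            d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
    }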



GPU Tridiagonal System Solver Case Study
· Thomas (sequential)
· Cyclic Reduction (1 step)
· PCR (1 step)
· Hybrid methods
  - PCR-Thomas (Kim 2011, Davidson 2011)
  - CR-PCR (CUSPARSE 2012)
  - etc.
· Numerically unstable
[Diagrams: one step of cyclic reduction and one step of PCR applied to a small tridiagonal system with diagonals a, b, c and right-hand side e]


Numerical Stability
· Algorithms that can always find an appropriate operation order, and thus a solution whenever one exists for the given input values, are numerically stable.
· Algorithms that fall short are numerically unstable.



Problems with Numerical Stability
· Algorithms that do not check for division by zero: eliminating with the zero pivot of
      [ 0       10^10 ]
      [ 10^10   0     ]
  divides by zero and produces NaN instead of the solution.
· Limited ability to represent precision and scale: eliminating without pivoting on
      [ 10^-10   10^10 ]
      [ 10^10    0     ]
  (true solution approximately [1, 0]) requires a multiplier of 10^20 and intermediate values around 10^30, which overflow to inf or destroy all precision.


Pivoting
· Judiciously swap rows to avoid bad cases

      [ 10^-10   10^10 ]      row swap       [ 10^10    0     ]
      [ 10^10    0     ]     ---------->     [ 10^-10   10^10 ]
  With the large element 10^10 as the pivot, the elimination multiplier is tiny (10^-20 instead of 10^20) and the computed solution stays close to the true [1, 0].
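The effect is easy to reproduce on a tiny system. The sketch below uses a 2x2 example of my own choosing (a 1e-20 pivot in double precision) rather than the exact numbers on the slide:

    // Gaussian elimination on [[eps, 1], [1, 1]] x = [1, 2], eps = 1e-20.
    // Exact solution is x ~= (1, 1).
    #include <stdio.h>

    int main(void) {
        double eps = 1e-20;
        // Without pivoting: eliminate with the tiny pivot eps.
        double m  = 1.0 / eps;                   // huge multiplier
        double b2 = 1.0 - m * 1.0;               // rounds to -1e20; the "1" is lost
        double r2 = 2.0 - m * 1.0;               // also rounds to -1e20
        double x2 = r2 / b2;                     // = 1
        double x1 = (1.0 - 1.0 * x2) / eps;      // = 0  (completely wrong)
        printf("no pivoting:   x = (%g, %g)\n", x1, x2);

        // With partial pivoting: swap rows so the pivot is 1, multiplier is eps.
        double b2p = 1.0 - eps * 1.0;            // ~= 1
        double r2p = 1.0 - eps * 2.0;            // ~= 1
        double y2  = r2p / b2p;                  // ~= 1
        double y1  = 2.0 - 1.0 * y2;             // ~= 1
        printf("with pivoting: x = (%g, %g)\n", y1, y2);
        return 0;
    }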



Partition Algorithm
· PCR
· SPIKE (Polizzi et al.): partition A and factor it as A = D S, where D is the block diagonal of the partition blocks A_i and S is the "spike" matrix:

    S X = Y          (5)
    D Y = F          (6)
    A_i Y_i = F_i    (7)

  Each partition solve (7) is independent, and the small reduced system (5) couples the partitions.


Put the stable sequential algorithm inside each GPU thread
· Each thread processes one tile by itself with a sequential, numerically stable pivoting algorithm
· Note that each thread accessing the first element of its own tile results in large, strided accesses (see the sketch below)
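A sketch of this thread-per-tile structure, assuming tiles are stored contiguously one after another (tile size and names are illustrative; the per-tile solver shown is the plain Thomas recurrence, where the stable version would add pivoting):

    #define TILE 64

    // Per-tile sequential solve (Thomas form shown; the stable version would add pivoting).
    __device__ void tile_solve(float *a, float *b, float *c, float *d, int n) {
        for (int i = 1; i < n; ++i) {
            float m = a[i] / b[i - 1];
            b[i] -= m * c[i - 1];
            d[i] -= m * d[i - 1];
        }
        d[n - 1] /= b[n - 1];
        for (int i = n - 2; i >= 0; --i)
            d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
    }

    // One thread per tile. With tiles stored contiguously, thread t starts at
    // offset t*TILE, so neighboring threads access addresses TILE apart:
    // large strided, poorly coalesced loads -- the problem the next slides address.
    __global__ void solve_tiles(float *a, float *b, float *c, float *d, int num_tiles) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= num_tiles) return;
        int base = t * TILE;
        tile_solve(a + base, b + base, c + base, d + base, TILE);
    }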



Memory Layout Issue

[Diagram: the small tridiagonal system from the previous slides (diagonals a, b, c, right-hand side e), used here to illustrate the memory layout issue]


GPU Memory Bandwidth vs. Stride
· SAXPY with stride (see the kernel sketch below):
  y[i * stride] = a * x[i * stride] + y[i * stride];
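The measured kernel is essentially the following (a sketch; parameter names are mine):

    // Strided SAXPY: with stride == 1 a warp's 32 accesses fall in a few
    // contiguous DRAM bursts; as the stride grows, each access touches a
    // different burst and effective bandwidth collapses.
    __global__ void saxpy_strided(int n, float a, const float *x, float *y, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i * stride] = a * x[i * stride] + y[i * stride];
    }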


"Efficient Sparse Matrix-VecmeiMultipli2012 on CUDA" © Wen- tor Hwu, cation Nathan Bell and Michael Garland, in, "NVIDIA Technical Report NVR-2008-004",,


Tiles Processed by Each Thread
· Each tile:

· Layout of all tiles: (similar to ELL before transposition)
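One way to read this ELL-like layout: store element k of every tile consecutively, so that when all T threads work on their k-th elements their loads are adjacent. A sketch of the indexing (names are mine):

    // Interleaved ("transposed") tile layout: element k of tile t lives at
    // data[k * T + t]. When all T threads read their k-th element together,
    // the accesses are adjacent and coalesce into few DRAM bursts.
    __device__ __forceinline__ float tile_elem(const float *data, int t, int k, int T) {
        return data[k * T + t];
    }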



Another Data Layout Alternative
[Figure: the array divided into tiles]


ASTA Data Layout



Converting AoS to SoA in place is hard
· Converting AoS to SoA is equivalent to transposing a tall and thin array
[Figure: an AoS array viewed as a tall, thin matrix; transposing it yields the SoA layout]

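For concreteness, the AoS, SoA, and ASTA layouts for a four-field record might look like the sketch below (field names and tile size are illustrative assumptions, not from the slides):

    #define N     (1 << 20)       // number of records
    #define TILE  32              // ASTA tile height (tunable)

    // AoS: fields of one record are adjacent; a warp reading field x strides by 4.
    struct AoS  { float x, y, z, w; };
    struct AoS  aos[N];

    // SoA: each field is one long array; converting AoS -> SoA in place is a
    // full (N x 4) -> (4 x N) transpose of a tall, thin matrix.
    struct SoA  { float x[N], y[N], z[N], w[N]; };

    // ASTA (Array of Structures of Tiled Arrays): SoA within small tiles,
    // AoS across tiles; conversion becomes many small, local transposes.
    struct ASTA { float x[TILE], y[TILE], z[TILE], w[TILE]; };
    struct ASTA asta[N / TILE];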


In-place Transposition: Simple Case
// data[W][H] --> data[H][W]
parallel for (j ...


In-place Transposition: First Attempt
// data[W][H] --> data[H][W]
parallel for (j ...
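One standard way to realize in-place transposition is cycle-following, sketched below as sequential code (an illustration of the technique under my own assumptions, not necessarily the kernel from the talk); the GPU versions must additionally parallelize the cycles, whose lengths vary wildly.

    // In-place transpose of data laid out as W rows of H elements
    // (data[W][H] -> data[H][W]) by following permutation cycles.
    // Element i must move to index (i*W) mod (W*H - 1); indices 0 and W*H-1 stay put.
    #include <stdlib.h>

    void transpose_in_place(float *data, int W, int H) {
        long n = (long)W * H;
        char *moved = (char *)calloc(n, 1);       // marks elements already placed
        for (long start = 1; start < n - 1; ++start) {
            if (moved[start]) continue;
            long cur = start;
            float carry = data[cur];
            do {                                  // follow one cycle
                long next = (cur * W) % (n - 1);  // destination of index cur
                float tmp = data[next];
                data[next] = carry;
                carry = tmp;
                moved[next] = 1;
                cur = next;
            } while (cur != start);
        }
        free(moved);
    }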




AoS to ASTA Transformation

AoS-to-ASTA marshaling kernel | Global memory throughput (GB/s) | Fine print
Out-of-place                  | 80                              | 2x space required
In-place (barrier sync)       | 95                              | Tile size (tunable) must fit in on-chip memory


Cost and Benefit of ASTA Layout Marshaling



Error and Stability



Speed



COMING BATTLES
Hardware trends



Important GPU Architecture Trends
· CPU/GPU fusion architectures
  - For reduced part count and data movement
  - Reduced hand-off granularity
  - Larger GPU-accessible memory
· Emphasis on energy efficiency
  - By reducing data movement and control-flow overhead
· More general forms of parallelism
  - To accommodate algorithm and locality needs
· Even higher degrees of SIMD
  - Due to the increasing width of the memory interface


COMING BATTLES
Scalable Kernels



There is a critical need for scalable kernel libraries
· Both CPUs and GPUs require scalable parallel kernel libraries
  - GPU needs are more urgent
· Only a small percentage of the Intel Math Kernel Library (MKL) functions have scalable forms
· Software lasts through many hardware generations and needs to be scalable to be economically viable


Examples of Kernel Needs
· Sparse LU factorization, Cholesky factorization, triangular solvers, and related inverse solvers
· Sparse eigensolvers and related eigenanalysis
· Graph partitioning (METIS)
· ...


Four Challenges
· Computations with no known scalable parallel algorithms
  - Shortest path, Delaunay triangulation, ...
· Data distributions that cause catastrophic load imbalance in parallel algorithms
  - Scale-free graphs, MRI compressed sensing
· Computations that have little data reuse
  - Matrix-vector multiplication, ...
· Algorithm optimizations that are hard and labor-intensive
  - Locality and regularization transformations


Example - Dynamic Data Extraction
· The data to be processed in each phase of computation must be dynamically determined and extracted from a bulk data structure
  - Harder to organize for massively parallel access
· Graph algorithms are popular examples that deal with dynamic data extraction
  - Widely used in EDA and large-scale optimization
  - Breadth-First Search (BFS) as an example


Dynamic Data Extraction using Queues
· Input data extraction is done by many threads in parallel
  - Must have a systematic way to avoid contention when assembling the extracted data
· The obvious approach is queues with privatization (see the sketch below)
  - Replicate queues to reduce contention
  - Combine queues with concatenation
  - Works only when global order does not matter (queue insertion is commutative)
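A sketch of the privatization idea in CUDA: each block assembles a b-queue in shared memory with cheap local atomics, then reserves one contiguous range of the global g-queue (capacity, predicate, and overflow handling are illustrative assumptions):

    #define B_QUEUE_CAP 2048                     // per-block queue capacity (illustrative)

    // Order across blocks is not preserved, which is fine because queue
    // insertion here is commutative.
    __global__ void extract_frontier(const int *candidates, int n,
                                     int *g_queue, int *g_count) {
        __shared__ int b_queue[B_QUEUE_CAP];
        __shared__ int b_count;
        __shared__ int g_base;

        if (threadIdx.x == 0) b_count = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && candidates[i] >= 0) {       // stand-in for the real predicate
            int pos = atomicAdd(&b_count, 1);    // private (shared-memory) atomic
            if (pos < B_QUEUE_CAP)
                b_queue[pos] = candidates[i];
            // overflow fallback (e.g. writing straight to g_queue) omitted here
        }
        __syncthreads();

        if (threadIdx.x == 0) {                  // one global atomic per block
            int cnt = min(b_count, B_QUEUE_CAP);
            g_base = atomicAdd(g_count, cnt);
            b_count = cnt;
        }
        __syncthreads();

        for (int j = threadIdx.x; j < b_count; j += blockDim.x)
            g_queue[g_base + j] = b_queue[j];    // cooperative flush to g-queue
    }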


Three-level Queue Hierarchy
[Figure: three-level queue hierarchy: w-queues and b-queues in scratchpad memory feed a g-queue in global memory]
· At the end of each kernel:
  - Threads cooperate to assemble the b-queue
  - Multiple threads collaborate to merge b-queue contents into the g-queue
  - Fast atomic operations help


Hierarchical Kernels: the Need for Dynamic Parallelism
· Customize kernels based on the size of the frontier
· Use fast barrier synchronization when the frontier is small
· One-level parallel propagation
  - Kernel 1: intra-block sync
  - Kernel 2: inter-block sync
  - Kernel 3: kernel re-launch
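A host-side sketch of that dispatch logic (thresholds, kernel names, and launch shapes are illustrative assumptions; the kernel bodies are stubs):

    // Illustrative thresholds; the real values are tuned per GPU.
    #define SMALL_FRONTIER   512                 // fits in one thread block
    #define MEDIUM_FRONTIER  (30 * 1024)         // fits in one grid-wide pass

    // Stubs standing in for the three BFS kernel variants.
    __global__ void bfs_one_block(int *frontier, int *size)   { /* __syncthreads() per level */ }
    __global__ void bfs_global_sync(int *frontier, int *size) { /* software global barrier   */ }
    __global__ void bfs_one_level(int *frontier, int *size)   { /* one level per launch      */ }

    // Host-side dispatch: choose the cheapest synchronization the frontier allows.
    void bfs_step(int frontier_size, int *d_frontier, int *d_size) {
        if (frontier_size <= SMALL_FRONTIER) {
            bfs_one_block<<<1, SMALL_FRONTIER>>>(d_frontier, d_size);      // intra-block sync
        } else if (frontier_size <= MEDIUM_FRONTIER) {
            bfs_global_sync<<<30, 1024>>>(d_frontier, d_size);             // inter-block sync
        } else {
            int block = 256, grid = (frontier_size + block - 1) / block;
            bfs_one_level<<<grid, block>>>(d_frontier, d_size);            // re-launch per level
        }
    }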


COMING BATTLES
tools



How a mathematician writes matrix multiplication:

    C = A x B,   c_ij = sum over k of (a_ik * b_kj)

How a smart CUDA programmer writes matrix multiplication:

#define TILE_N 16
#define TILE_TB_HEIGHT 8
#define TILE_M (TILE_N * TILE_TB_HEIGHT)

__global__ void mysgemmNT(const float *A, int lda, const float *B, int ldb,
                          float *C, int ldc, int k, float alpha, float beta)
{
    float c[TILE_N];
    for (int i = 0; i < TILE_N; i++) c[i] = 0.0f;
    int mid = threadIdx.y * blockDim.x + threadIdx.x;
    int m = blockIdx.x * TILE_M + mid;
    int n = blockIdx.y * TILE_N + threadIdx.x;
    __shared__ float b_s[TILE_TB_HEIGHT][TILE_N];
    for (int i = 0; i < k; i += TILE_TB_HEIGHT) {
        float a;
        b_s[threadIdx.y][threadIdx.x] = B[n + (i + threadIdx.y) * ldb];
        __syncthreads();
        for (int j = 0; j < TILE_TB_HEIGHT; j++) {
            a = A[m + (i + j) * lda];
            for (int kk = 0; kk < TILE_N; kk++)
                c[kk] += a * b_s[j][kk];
        }
        __syncthreads();
    }
    int t = ldc * blockIdx.y * TILE_N + m;
    for (int i = 0; i < TILE_N; i++) {
        C[t + i * ldc] = C[t + i * ldc] * beta + alpha * c[i];
    }
}
...
dim3 grid(m / TILE_M, n / TILE_N), threads(TILE_N, TILE_TB_HEIGHT);
mysgemmNT<<<grid, threads>>>(A, lda, B, ldb, C, ldc, k, alpha, beta);
...


Writing efficient code is complicated.

Tools can provide focused help or broad help
Planning how to execute an algorithm:
· Choose data structures
· Decompose work into tasks
· Schedule tasks to threads
Implementing the plan:
· Memory allocation, data movement, pointer operations, index arithmetic
· Kernel dimensions, thread ID arithmetic, thread coarsening, synchronization, temporary data structures
[Figure: tools such as GMAC, data-layout tools, Triolet, and MCUDA each target different subsets of these steps]


Related Kernel Development Tools
· OpenACC accelerator pragmas
  - Wider use of GPUs in large applications, but less performance in each kernel
  - Cray and others
· Portland Group CUDA Fortran compiler
· NVIDIA Thrust
· Microsoft C++ AMP


Conclusion and Outlook
· We have enjoyed some victories
  - Good initial set of applications and kernels
  - Good deployment interfaces in major languages
  - Good initial results, educated developers
· We will face more battles
  - Energy efficiency vs. ease of programming
  - Potential fragmentation of programming interfaces
  - Widening the set of applications, algorithms, and kernels
  - Better tools

"There is always hope."
    - Aragorn, on the eve of the Battle of Pelennor, Minas Tirith


THANK YOU!
