The Future of GPU Computing

Wen-mei Hwu
University of Illinois, Urbana-Champaign
NVIDIA HPC Day, Moscow State University, 2012


Agenda
· Setting the Context
· Current Victories
· Coming Battles
· Conclusion and Outlook



CPUs: Latency Oriented Design
· Large caches
  - Convert long-latency memory accesses into short-latency cache accesses
· Sophisticated control
  - Branch prediction for reduced branch latency
  - Data forwarding for reduced data latency
· Powerful ALU
  - Reduced operation latency
[Figure: CPU block diagram: a few powerful ALUs, large control logic, large cache, DRAM]


GPUs: Throughput Oriented Design
· Small caches
  - To boost memory throughput
· Simple control
  - No branch prediction
  - No data forwarding
· Energy-efficient ALUs
  - Many, long latency, but heavily pipelined for high throughput
· Require a massive number of threads to tolerate latencies
[Figure: GPU block diagram: many simple ALUs, small caches, DRAM]


Winning Applications Use Both CPU and GPU
· CPUs for sequential parts where latency matters
  - CPUs can be 10+X faster than GPUs for sequential code
· GPUs for parallel parts where throughput wins
  - GPUs can be 10+X faster than CPUs for parallel code


GPU Computing Gems: 280 submissions, 90 chapters
· Financial Analysis
· Scientific Simulation
· Engineering Simulation
· Data Intensive Analytics
· Biomedical Informatics
· Medical Imaging
· Digital Audio Processing
· Digital Video Processing
· Ray Tracing Rendering
· Computer Vision
· Electronic Design Automation
· Statistical Modeling
· Interactive Physics
· Numerical Methods

GPU computing is catching on.


A Common GPU Usage Pattern
· Use GPUs for the most time-consuming aspects of a computational problem
  - Kernels in CUDA, OpenCL, OpenACC, etc.
  - Refactor host code to better support kernels and data transfer (see the sketch below)
  - Linear solvers, PDE solvers, convolution filtering (e.g., bilateral Gaussian filters), de novo gene assembly, etc.

· Rethink the domain problem
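The refactoring step above usually reduces to the same host-side skeleton. A minimal sketch of that offload pattern, with a placeholder filter_kernel standing in for the real computation (names and sizes are illustrative, not from the talk):

    // Illustrative only: the generic "offload the hot loop" pattern in CUDA.
    __global__ void filter_kernel(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 0.5f * in[i];         // stand-in for the real computation
    }

    void run_on_gpu(const float *h_in, float *h_out, int n) {
        float *d_in, *d_out;
        size_t bytes = n * sizeof(float);
        cudaMalloc(&d_in, bytes);                 // device buffers
        cudaMalloc(&d_out, bytes);
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    // host-to-device transfer
        int block = 256, grid = (n + block - 1) / block;
        filter_kernel<<<grid, block>>>(d_in, d_out, n);           // the offloaded kernel
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // results back to host
        cudaFree(d_in); cudaFree(d_out);
    }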


CURRENT VICTORIES
HPC Systems



Blue Waters Supercomputer

· Cray system & storage cabinets: >300
· Compute nodes: >25,000
· Usable storage bandwidth: >1 TB/s
· System memory: >1.5 petabytes
· Memory per core module: 4 GB
· Gemini interconnect topology: 3D torus
· Usable storage: >25 petabytes
· Peak performance: >11.5 petaflops
· Number of AMD Interlagos processors: >49,000
· Number of AMD x86 core modules: >380,000
· Number of NVIDIA Kepler GPUs: >3,000


Cray XE6 Nodes
· Dual-socket node
  - Two AMD Interlagos chips
    · 16 core modules, 64 threads
    · 313 GF peak performance
    · 64 GB memory, 102 GB/s memory bandwidth
  - Gemini interconnect
    · Router chip & network interface
    · Injection bandwidth (peak): 9.6 GB/s per direction
Blue Waters contains 22,640 Cray XE6 compute nodes.

Cray XK7 Nodes
· Dual-socket node
  - One AMD Interlagos chip
    · 32 GB memory, 51.2 GB/s memory bandwidth
  - One NVIDIA Kepler chip
    · 1.4 TF peak performance
    · 6 GB GDDR5 memory, 190 GB/s memory bandwidth
  - Gemini interconnect
    · Same as XE6 nodes
Blue Waters contains 3,072 Cray XK7 compute nodes.


Gemini Interconnect Network
· Blue Waters 3D torus size: 23 x 24 x 24
· Service nodes are spread throughout the torus
[Figure: Gemini 3D-torus network connecting Cray XE6 compute nodes and Cray XK7 accelerator nodes with service nodes (boot, system database, login/network gateways, LNET routers); an InfiniBand network links the LNET routers to the Lustre file system, login servers, external networks, GigE/Fibre Channel, SMW, and boot RAID]


Science Area                        | Number of Teams | Codes
Climate and Weather                 | 3               | CESM, GCRM, CM1, HOMME
Plasmas/Magnetosphere               | 2               | H3D(M), OSIRIS, Magtail/UPIC
Stellar Atmospheres and Supernovae  | 2               | PPM, MAESTRO, CASTRO, SEDONA
Cosmology                           | 2               | Enzo, pGADGET
Combustion/Turbulence               | 1               | PSDNS
General Relativity                  | 2               | Cactus, Harm3D, LazEV
Molecular Dynamics                  | 4               | AMBER, Gromacs, NAMD, LAMMPS
Quantum Chemistry                   | 2               | SIAL, GAMESS, NWChem
Material Science                    | 3               | NEMO5, OMEN, GW, QMCPACK
Earthquakes/Seismology              | 2               | AWP-ODC, HERCULES, PLSQR, SPECFEM3D
Quantum Chromodynamics              | 1               | Chroma, MILC, USQCD
Social Networks                     | 1               | EPISIMDEMICS
Evolution                           | 1               | Eve
Computer Science                    | 1               | --
(The original table also marks each code against algorithm patterns: structured grids, unstructured grids, dense matrix, sparse matrix, N-body, Monte Carlo, FFT, and significant I/O.)


CURRENT VICTORIES
HPC applications



NAMD Released GPU Features and Future Plans (100,000 users)
NAMD 2.8
· CUDA features supported: full electrostatics with PME and most simulation features (not alchemical methods), NBFIX parameters
· 100M-atom capability functional on CUDA

NAMD 2.9
· Alchemical free energy perturbation
· Locally enhanced sampling
· Methods that modify nonbonded interactions for small sets of atoms
· New multilevel summation method (MSM)

Longer Term
· Specialized methods such as the Lowe-Andersen thermostat, Go potentials, and tabulated nonbonded interactions
· Various performance improvements, including reducing CPU-side performance bottlenecks by shifting more calculations to the GPU


Pushing Limits of Innovation with NAMD

Performance on one node (ns/day), GPU+CPU vs. CPU only:

Benchmark | Atoms     | GPU+CPU | CPU only
ApoA-1    | 92,224    | 2.81    | 0.68
F1-ATPase | 327,506   | 1.08    | 0.22
STMV      | 1,066,628 | 0.28    | 0.05

Test platform: 1 node, dual Tesla M2070 GPUs (6 GB), dual Intel 4-core Xeon (2.4 GHz), NAMD 2.8, CUDA 4.0, ECC on. Visit www.nvidia.com/simcluster for more information on speedup results, configuration, and test models.


GPU Scaling on NAMD
STMV (100stmv) benchmark on Tsubame 2.0, 9/7/2011, ibverbs-smp build
· NAMD run on Tsubame 2.0 from 64 to 189 nodes, using 1 or 2 Fermi GPUs per node
· Node configurations: dual-socket 6-core Intel Xeon Westmere CPUs only, plus 1 GPU/node or 2 GPUs/node (roughly equivalent to 1 Kepler GPU per node)
[Chart: days/ns vs. number of nodes (64, 128, 189); annotated speedups over the CPU-only runs range from 1.8x to 2.7x with 1 GPU/node and from 4.2x to 4.4x with 2 GPUs/node]


QCD Strong Scaling using GPUs
· General problem: as core counts increase, the ratio of communication to local computation tends to grow. For a sufficient number of cores, the problem becomes communication-bound rather than computation-bound.
· Solution: solvers that minimize communication, such as "domain-decomposition" solvers. An additive Schwarz domain-decomposed preconditioner with a Generalized Conjugate Residual solver (GCR-DD) successfully demonstrates strong scaling.
· Results: strong scaling to 256 GPUs on a 32^3 x 256 lattice in Chroma (Wilson-clover fermions).
· Results: strong scaling to 256 GPUs on a 64^3 x 256 lattice in MILC (improved staggered fermions).
[Charts: sustained strong-scaling performance in Chroma 3.41.0 using the Schwarz generalized conjugate residual solver (GCR-DD), with BiCGstab as the reference Krylov solver, and in MILC 7.6.3 using a mixed-precision conjugate gradient solver parallelized along multiple dimensions]
* Guochun Shi (NCSA), Bálint Joó (Jefferson Lab), Ron Babich (BU), Mike Clark (Harvard), Rich Brower (BU), Steve Gottlieb (Indiana), "Scaling Lattice QCD beyond 100 GPUs," SC11, ACM (Nov. 2011)


USQCD Software GPU Roadmap
· ETA September 2011: exploit GPU Direct and peer-to-peer transfers to minimize inter-GPU communication (not yet supported in QUDA 0.4.0).
· ETA Q3/Q4 2011: multi-GPU DWF fermions.
· ETA fall 2011: adaptive multigrid (MG), expected to deliver O(10)-fold speedup over current solvers.
· ETA Q4 2011 / Q1 2012: refinement of domain-decomposition algorithms. Currently simple block Jacobi; expect significant speedup from overlapping blocks and multiplicative Schwarz (e.g., block Gauss-Seidel).
· Active R&D 2012: exploitation of cache locality (e.g., more efficient use of shared memory to reduce memory traffic); better scaling of GPU cores vs. GPU memory bandwidth.
· ETA 2012: full Hybrid Monte Carlo (e.g., gauge generation) on GPUs, including support for high-order symplectic symmetric integrators, which improves the volume scaling of HMC and will substantially reduce computational cost at large volumes.
· Beta 1H 2012: complete deployment of QCD applications (e.g., Chroma) on GPUs by implementing the domain-specific language (QDP++) in CUDA. Currently pre-alpha; JLab R&D (Jie Chen), Frank Winter (Edinburgh).
· R&D 2012, deployment 2013: combine HMC and MG on GPUs.


Current Science Team GPU Plans and Results
· Nearly 1/3 of PRAC projects have active GPU efforts, including:
  - AMBER
  - LAMMPS
  - USQCD/MILC
  - GAMESS
  - NAMD
  - QMCPACK
  - PLSQR/SPECFEM3D
· Others are investigating use of GPUs (e.g., Cactus, PPM, AWP-ODC)
· Some examples follow


Current Status - NAMD
· Full run with 100stmv on Titan (CUDA implementation)
  - Fermi XK6 vs. XE6: 1.9x with 32 nodes
  - Kepler XK7 vs. XE6: projected 2.9x with 32 nodes
  - Fermi XK6 vs. XE6: 1.4x with 768 nodes
  - Kepler XK7 vs. XE6: projected 1.5x with 768 nodes
  - NAMD is limited by PME on Titan
    · Titan has less bandwidth and longer latency than Blue Waters due to its GPU node placement scheme
    · The NAMD team is working on an alternative PME implementation


Current Status - Chroma
· Lattice QCD parameters: 48^3 x 512 grid running at the physical values of the quark masses on Titan
  - Fermi XK6 vs. XE6: 8.2x with 64 nodes
  - Kepler XK7 vs. XE6: projected 13.2x with 64 nodes
  - Fermi XK6 vs. XE6: 6.1x with 768 nodes
  - Kepler XK7 vs. XE6: projected 9.2x with 768 nodes
· Chroma is not limited by the interconnect


CURRENT VICTORIES
Scalable and Numerically Stable Libraries



Solid Scalable GPU Libraries
· Dense SGEMM/DGEMM, LU, triangular solvers (CUBLAS, CULA, MAGMA)
· Sparse matrix-vector multiplication, tridiagonal solvers (CUSPARSE, QUDA, Parboil)
· FFTs, convolutions (CUFFT, Parboil)
· N-body (NAMD/VMD, FMM BU, Parboil)
· Histograms (Parboil)
· Some PDE solvers (CURRENT, Parboil)


Scalability vs. Numerical Stability: A Major Algorithm Design Challenge
· Parallelism: enough parallelism to fill growing hardware parallelism
· Complexity and data scalability: operations should grow linearly with data size
· Locality: DRAM bursts and cache space utilization
· Regularity: SIMD utilization and load balance
· Numerical stability: pivoting for linear system solvers


Example: Tridiagonal Solver
· Used in implicit finite difference methods, cubic spline interpolation, and preconditioners
· An algorithm to find a solution of Ax = d, where A is an n-by-n tridiagonal matrix and d is an n-element vector
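For reference, the classic sequential solver for such systems is the Thomas algorithm. A minimal sketch (my notation: a, b, c are the sub-, main, and super-diagonals; no pivoting, so it assumes a well-behaved, e.g. diagonally dominant, system):

    // Sequential Thomas algorithm: solves a tridiagonal system in O(n).
    // a = sub-diagonal (a[0] unused), b = main diagonal, c = super-diagonal
    // (c[n-1] unused), d = right-hand side; the solution overwrites d.
    void thomas_solve(float *a, float *b, float *c, float *d, int n) {
        for (int i = 1; i < n; ++i) {            // forward elimination
            float m = a[i] / b[i - 1];
            b[i] -= m * c[i - 1];
            d[i] -= m * d[i - 1];
        }
        d[n - 1] /= b[n - 1];                    // back substitution
        for (int i = n - 2; i >= 0; --i)
            d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
    }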



GPU Tridiagonal System Solver Case Study
· Thomas (sequential)
· Cyclic Reduction (1 step)
· PCR (1 step)
· Hybrid methods
  - PCR-Thomas (Kim 2011, Davidson 2011)
  - CR-PCR (CUSPARSE 2012)
  - etc.
· Numerically unstable
[Diagrams: one step of cyclic reduction and one step of PCR applied to a small tridiagonal system with diagonals a, b, c and right-hand side e]


Numerical Stability
· Algorithms that can always find an appropriate operation order, and thus a solution whenever one exists for the given input values, are numerically stable.
· Algorithms that fall short are numerically unstable.



Problems with Numerical Stability
· Algorithms that do not check for division by zero: eliminating with the zero pivot of
      [ 0       10^10 ]
      [ 10^10   0     ]
  divides by zero and produces NaN instead of the solution.
· Limited ability to represent precision and scale: eliminating without pivoting on
      [ 10^-10   10^10 ]
      [ 10^10    0     ]
  (true solution approximately [1, 0]) requires a multiplier of 10^20 and intermediate values around 10^30, which overflow to inf or destroy all precision.


Pivoting
· Judiciously swap rows to avoid bad cases

      [ 10^-10   10^10 ]      row swap       [ 10^10    0     ]
      [ 10^10    0     ]     ---------->     [ 10^-10   10^10 ]
  With the large element 10^10 as the pivot, the elimination multiplier is tiny (10^-20 instead of 10^20) and the computed solution stays close to the true [1, 0].
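The effect is easy to reproduce on a tiny system. The sketch below uses a 2x2 example of my own choosing (a 1e-20 pivot in double precision) rather than the exact numbers on the slide:

    // Gaussian elimination on [[eps, 1], [1, 1]] x = [1, 2], eps = 1e-20.
    // Exact solution is x ~= (1, 1).
    #include <stdio.h>

    int main(void) {
        double eps = 1e-20;
        // Without pivoting: eliminate with the tiny pivot eps.
        double m  = 1.0 / eps;                   // huge multiplier
        double b2 = 1.0 - m * 1.0;               // rounds to -1e20; the "1" is lost
        double r2 = 2.0 - m * 1.0;               // also rounds to -1e20
        double x2 = r2 / b2;                     // = 1
        double x1 = (1.0 - 1.0 * x2) / eps;      // = 0  (completely wrong)
        printf("no pivoting:   x = (%g, %g)\n", x1, x2);

        // With partial pivoting: swap rows so the pivot is 1, multiplier is eps.
        double b2p = 1.0 - eps * 1.0;            // ~= 1
        double r2p = 1.0 - eps * 2.0;            // ~= 1
        double y2  = r2p / b2p;                  // ~= 1
        double y1  = 2.0 - 1.0 * y2;             // ~= 1
        printf("with pivoting: x = (%g, %g)\n", y1, y2);
        return 0;
    }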



Partition Algorithm
· PCR
· SPIKE (Polizzi et al.): partition A and factor it as A = D S, where D is the block diagonal of the partition blocks A_i and S is the "spike" matrix:

    S X = Y          (5)
    D Y = F          (6)
    A_i Y_i = F_i    (7)

  Each partition solve (7) is independent, and the small reduced system (5) couples the partitions.


Put the stable sequential algorithm inside each GPU thread
· Each thread processes one tile by itself with a sequential, numerically stable pivoting algorithm
· Note that each thread accessing the first element of its own tile results in large, strided accesses (see the sketch below)
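A sketch of this thread-per-tile structure, assuming tiles are stored contiguously one after another (tile size and names are illustrative; the per-tile solver shown is the plain Thomas recurrence, where the stable version would add pivoting):

    #define TILE 64

    // Per-tile sequential solve (Thomas form shown; the stable version would add pivoting).
    __device__ void tile_solve(float *a, float *b, float *c, float *d, int n) {
        for (int i = 1; i < n; ++i) {
            float m = a[i] / b[i - 1];
            b[i] -= m * c[i - 1];
            d[i] -= m * d[i - 1];
        }
        d[n - 1] /= b[n - 1];
        for (int i = n - 2; i >= 0; --i)
            d[i] = (d[i] - c[i] * d[i + 1]) / b[i];
    }

    // One thread per tile. With tiles stored contiguously, thread t starts at
    // offset t*TILE, so neighboring threads access addresses TILE apart:
    // large strided, poorly coalesced loads -- the problem the next slides address.
    __global__ void solve_tiles(float *a, float *b, float *c, float *d, int num_tiles) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= num_tiles) return;
        int base = t * TILE;
        tile_solve(a + base, b + base, c + base, d + base, TILE);
    }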



Memory Layout Issue

[Diagram: the small tridiagonal system from the previous slides (diagonals a, b, c, right-hand side e), used here to illustrate the memory layout issue]


GPU Memory Bandwidth vs. Stride
· SAXPY with stride (see the kernel sketch below):
  y[i * stride] = a * x[i * stride] + y[i * stride];
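The measured kernel is essentially the following (a sketch; parameter names are mine):

    // Strided SAXPY: with stride == 1 a warp's 32 accesses fall in a few
    // contiguous DRAM bursts; as the stride grows, each access touches a
    // different burst and effective bandwidth collapses.
    __global__ void saxpy_strided(int n, float a, const float *x, float *y, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i * stride] = a * x[i * stride] + y[i * stride];
    }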


"Efficient Sparse Matrix-VecmeiMultipli2012 on CUDA" © Wen- tor Hwu, cation Nathan Bell and Michael Garland, in, "NVIDIA Technical Report NVR-2008-004",,


Tiles Processed by Each Thread
· Each tile:

· Layout of all tiles: (similar to ELL before transposition)
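One way to read this ELL-like layout: store element k of every tile consecutively, so that when all T threads work on their k-th elements their loads are adjacent. A sketch of the indexing (names are mine):

    // Interleaved ("transposed") tile layout: element k of tile t lives at
    // data[k * T + t]. When all T threads read their k-th element together,
    // the accesses are adjacent and coalesce into few DRAM bursts.
    __device__ __forceinline__ float tile_elem(const float *data, int t, int k, int T) {
        return data[k * T + t];
    }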



Another Data Layout Alternative
[Figure: the array divided into tiles]


ASTA Data Layout



Converting AoS to SoA in place is hard
· Converting AoS to SoA is equivalent to transposing a tall and thin array
[Figure: an AoS array viewed as a tall, thin matrix; transposing it yields the SoA layout]

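For concreteness, the AoS, SoA, and ASTA layouts for a four-field record might look like the sketch below (field names and tile size are illustrative assumptions, not from the slides):

    #define N     (1 << 20)       // number of records
    #define TILE  32              // ASTA tile height (tunable)

    // AoS: fields of one record are adjacent; a warp reading field x strides by 4.
    struct AoS  { float x, y, z, w; };
    struct AoS  aos[N];

    // SoA: each field is one long array; converting AoS -> SoA in place is a
    // full (N x 4) -> (4 x N) transpose of a tall, thin matrix.
    struct SoA  { float x[N], y[N], z[N], w[N]; };

    // ASTA (Array of Structures of Tiled Arrays): SoA within small tiles,
    // AoS across tiles; conversion becomes many small, local transposes.
    struct ASTA { float x[TILE], y[TILE], z[TILE], w[TILE]; };
    struct ASTA asta[N / TILE];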


In-place Transposition: Simple Case
// data[W][H] --> data[H][W]
parallel for (j ...


In-place Transposition: First Attempt
// data[W][H] --> data[H][W]
parallel for (j ...
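One standard way to realize in-place transposition is cycle-following, sketched below as sequential code (an illustration of the technique under my own assumptions, not necessarily the kernel from the talk); the GPU versions must additionally parallelize the cycles, whose lengths vary wildly.

    // In-place transpose of data laid out as W rows of H elements
    // (data[W][H] -> data[H][W]) by following permutation cycles.
    // Element i must move to index (i*W) mod (W*H - 1); indices 0 and W*H-1 stay put.
    #include <stdlib.h>

    void transpose_in_place(float *data, int W, int H) {
        long n = (long)W * H;
        char *moved = (char *)calloc(n, 1);       // marks elements already placed
        for (long start = 1; start < n - 1; ++start) {
            if (moved[start]) continue;
            long cur = start;
            float carry = data[cur];
            do {                                  // follow one cycle
                long next = (cur * W) % (n - 1);  // destination of index cur
                float tmp = data[next];
                data[next] = carry;
                carry = tmp;
                moved[next] = 1;
                cur = next;
            } while (cur != start);
        }
        free(moved);
    }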




AoS to ASTA Transformation

AoS-to-ASTA marshaling kernel | Global memory throughput (GB/s) | Fine print
Out-of-place                  | 80                              | 2x space required
In-place (barrier sync)       | 95                              | Tile size (tunable) must fit in on-chip memory


Cost and Benefit of ASTA Layout Marshaling



Error and Stability



Speed



COMING BATTLES
Hardware trends



Important GPU Architecture Trends
· CPU/GPU fusion architectures
  - For reduced part count and data movement
  - Reduced hand-off granularity
  - Larger GPU-accessible memory
· Emphasis on energy efficiency
  - By reducing data movement and control-flow overhead
· More general forms of parallelism
  - To accommodate algorithm and locality needs
· Even higher degrees of SIMD
  - Due to the increasing width of the memory interface


COMING BATTLES
Scalable Kernels



There is a critical need for scalable kernel libraries
· Both CPUs and GPUs require scalable parallel kernel libraries
  - GPU needs are more urgent
· Only a small percentage of the Intel Math Kernel Library (MKL) functions have scalable forms
· Software lasts through many hardware generations and needs to be scalable to be economically viable


Examples of Kernel Needs
· Sparse LU factorization, Cholesky factorization, triangular solvers, and related inverse solvers
· Sparse eigensolvers and related eigenanalysis
· Graph partitioning (METIS)
· ...


Four Challenges
· Computations with no known scalable parallel algorithms
  - Shortest path, Delaunay triangulation, ...
· Data distributions that cause catastrophic load imbalance in parallel algorithms
  - Scale-free graphs, MRI compressed sensing
· Computations that have little data reuse
  - Matrix-vector multiplication, ...
· Algorithm optimizations that are hard and labor-intensive
  - Locality and regularization transformations


Example - Dynamic Data Extraction
· The data to be processed in each phase of computation must be dynamically determined and extracted from a bulk data structure
  - Harder to organize for massively parallel access
· Graph algorithms are popular examples that deal with dynamic data extraction
  - Widely used in EDA and large-scale optimization
  - Breadth-First Search (BFS) as an example


Dynamic Data Extraction using Queues
· Input data extraction is done by many threads in parallel
  - Must have a systematic way to avoid contention when assembling the extracted data
· The obvious approach is queues with privatization (see the sketch below)
  - Replicate queues to reduce contention
  - Combine queues with concatenation
  - Works only when global order does not matter (queue insertion is commutative)
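A sketch of the privatization idea in CUDA: each block assembles a b-queue in shared memory with cheap local atomics, then reserves one contiguous range of the global g-queue (capacity, predicate, and overflow handling are illustrative assumptions):

    #define B_QUEUE_CAP 2048                     // per-block queue capacity (illustrative)

    // Order across blocks is not preserved, which is fine because queue
    // insertion here is commutative.
    __global__ void extract_frontier(const int *candidates, int n,
                                     int *g_queue, int *g_count) {
        __shared__ int b_queue[B_QUEUE_CAP];
        __shared__ int b_count;
        __shared__ int g_base;

        if (threadIdx.x == 0) b_count = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && candidates[i] >= 0) {       // stand-in for the real predicate
            int pos = atomicAdd(&b_count, 1);    // private (shared-memory) atomic
            if (pos < B_QUEUE_CAP)
                b_queue[pos] = candidates[i];
            // overflow fallback (e.g. writing straight to g_queue) omitted here
        }
        __syncthreads();

        if (threadIdx.x == 0) {                  // one global atomic per block
            int cnt = min(b_count, B_QUEUE_CAP);
            g_base = atomicAdd(g_count, cnt);
            b_count = cnt;
        }
        __syncthreads();

        for (int j = threadIdx.x; j < b_count; j += blockDim.x)
            g_queue[g_base + j] = b_queue[j];    // cooperative flush to g-queue
    }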


Three-level Queue Hierarchy
[Figure: three-level queue hierarchy: w-queues and b-queues in scratchpad memory feed a g-queue in global memory]
· At the end of each kernel:
  - Threads cooperate to assemble the b-queue
  - Multiple threads collaborate to merge b-queue contents into the g-queue
  - Fast atomic operations help


Hierarchical Kernels: the Need for Dynamic Parallelism
· Customize kernels based on the size of the frontier
· Use fast barrier synchronization when the frontier is small
· One-level parallel propagation
  - Kernel 1: intra-block sync
  - Kernel 2: inter-block sync
  - Kernel 3: kernel re-launch
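A host-side sketch of that dispatch logic (thresholds, kernel names, and launch shapes are illustrative assumptions; the kernel bodies are stubs):

    // Illustrative thresholds; the real values are tuned per GPU.
    #define SMALL_FRONTIER   512                 // fits in one thread block
    #define MEDIUM_FRONTIER  (30 * 1024)         // fits in one grid-wide pass

    // Stubs standing in for the three BFS kernel variants.
    __global__ void bfs_one_block(int *frontier, int *size)   { /* __syncthreads() per level */ }
    __global__ void bfs_global_sync(int *frontier, int *size) { /* software global barrier   */ }
    __global__ void bfs_one_level(int *frontier, int *size)   { /* one level per launch      */ }

    // Host-side dispatch: choose the cheapest synchronization the frontier allows.
    void bfs_step(int frontier_size, int *d_frontier, int *d_size) {
        if (frontier_size <= SMALL_FRONTIER) {
            bfs_one_block<<<1, SMALL_FRONTIER>>>(d_frontier, d_size);      // intra-block sync
        } else if (frontier_size <= MEDIUM_FRONTIER) {
            bfs_global_sync<<<30, 1024>>>(d_frontier, d_size);             // inter-block sync
        } else {
            int block = 256, grid = (frontier_size + block - 1) / block;
            bfs_one_level<<<grid, block>>>(d_frontier, d_size);            // re-launch per level
        }
    }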


COMING BATTLES
tools



How a mathematician writes matrix multiplication:

    C = A x B,   c_ij = sum over k of (a_ik * b_kj)

How a smart CUDA programmer writes matrix multiplication:

#define TILE_N 16
#define TILE_TB_HEIGHT 8
#define TILE_M (TILE_N * TILE_TB_HEIGHT)

__global__ void mysgemmNT(const float *A, int lda, const float *B, int ldb,
                          float *C, int ldc, int k, float alpha, float beta)
{
    float c[TILE_N];
    for (int i = 0; i < TILE_N; i++) c[i] = 0.0f;
    int mid = threadIdx.y * blockDim.x + threadIdx.x;
    int m = blockIdx.x * TILE_M + mid;
    int n = blockIdx.y * TILE_N + threadIdx.x;
    __shared__ float b_s[TILE_TB_HEIGHT][TILE_N];
    for (int i = 0; i < k; i += TILE_TB_HEIGHT) {
        float a;
        b_s[threadIdx.y][threadIdx.x] = B[n + (i + threadIdx.y) * ldb];
        __syncthreads();
        for (int j = 0; j < TILE_TB_HEIGHT; j++) {
            a = A[m + (i + j) * lda];
            for (int kk = 0; kk < TILE_N; kk++)
                c[kk] += a * b_s[j][kk];
        }
        __syncthreads();
    }
    int t = ldc * blockIdx.y * TILE_N + m;
    for (int i = 0; i < TILE_N; i++) {
        C[t + i * ldc] = C[t + i * ldc] * beta + alpha * c[i];
    }
}
...
dim3 grid(m / TILE_M, n / TILE_N), threads(TILE_N, TILE_TB_HEIGHT);
mysgemmNT<<<grid, threads>>>(A, lda, B, ldb, C, ldc, k, alpha, beta);
...


Writing efficient code is complicated.

Tools can provide focused help or broad help
Planning how to execute an algorithm:
· Choose data structures
· Decompose work into tasks
· Schedule tasks to threads
Implementing the plan:
· Memory allocation, data movement, pointer operations, index arithmetic
· Kernel dimensions, thread ID arithmetic, thread coarsening, synchronization, temporary data structures
[Figure: tools such as GMAC, data-layout tools, Triolet, and MCUDA each target different subsets of these steps]


Related Kernel Development Tools
· OpenACC accelerator pragmas
  - Wider use of GPUs in large applications, but less performance in each kernel
  - Cray and others
· Portland Group CUDA Fortran compiler
· NVIDIA Thrust
· Microsoft C++ AMP


Conclusion and Outlook
· We have enjoyed some victories
  - Good initial set of applications and kernels
  - Good deployment interfaces in major languages
  - Good initial results, educated developers
· We will face more battles
  - Energy efficiency vs. ease of programming
  - Potential fragmentation of programming interfaces
  - Widening the set of applications, algorithms, and kernels
  - Better tools

"There is always hope."
    - Aragorn, on the eve of the Battle of Pelennor, Minas Tirith


THANK YOU!
