Äîêóìåíò âçÿò èç êýøà ïîèñêîâîé ìàøèíû. Àäðåñ îðèãèíàëüíîãî äîêóìåíòà : http://www.mrao.cam.ac.uk/~bn204/publications/2011/boostcon2011slides.pdf
Äàòà èçìåíåíèÿ: Fri Aug 10 20:18:10 2012
Äàòà èíäåêñèðîâàíèÿ: Tue Oct 2 14:12:57 2012
Êîäèðîâêà:

Ïîèñêîâûå ñëîâà: trees
Numerical computation in C++

Introduction

Fast numerical computation in C++: Expression Templates and Beyond to Lazy Code Generation (LzCG)
B. Nikolic
Cavendish Laborator y/Kavli Institute University of Cambridge

Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

BoostCon 2011 May 2011


Overview of ideas
1. `Standard' rules of C++ lead to inefficient numerical code 2. New rules ( sub-languages) can be implemented using expression templates
2.1 Types are used confer information about expressions 2.2 Translated to `standard' C++ at compile-time

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

3. Makes high-performance numerical C++ libraries possible and successful 4. But is it enough?
4.1 Most efficient algorithm not obvious at compile-time 4.2 Convenience/flexibility of generating code in C++

5. Types retain information about expressions in signatures in object code
5.1 Can re-generate expression template implementations post-compilation-time


Outline
Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


About myself: ALMA telescope
Largest ground-based astronomy project in the world

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Currently being commissioned at altitude of 5000 m in Chile. Will have 66 telescopes separated by up to 15 kms and observed at wavelength between 7 and 0.35 mm.


About myself: Green Bank Telescope
Largest steerable telescope in the world

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Main reflector is 100x110 m in size, total height 160 m. Entire structure is accurate to 0.25 mm.


About myself: Thermal radio emission from Messier 66

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance

Colour scale is emission from dust at 0.024 mm wavelegnth Contours represent emission at 3 mm from hot electron gas Both appear to be powered by recent star formation

Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


General Interests

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance

Model optimisation and statistical inference (maximum-likelihood, Markov Chain Monte Carlo, Nested Sampling techniques) Pricing and risk-management of derivative contracts Remote sensing of Ear th's atmosphere Radiative transfer and other physical simulations All very numerically intensive applications...

Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


Aper ture synthesis radio-astronomy
Revolutionised the radio view of the universe ­ Nobel prize in 1972 Development of the technique closely tied to computers:
Lots of Fourier Transforms Large quantities of data to be binned, inspected, discarded if necessar y Instruments inherently unstable so calibration is critical

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Atacama Large Millimetre Array: eventually 66 antennas, 20 Mb/s average output data rate:
Computational issues inconvenient, reduce scientist productivity

Square Kilometre Array (SKA): 1000s antennas, wide field of view, few Gb/s average output data rate:
Computational issues limiting factor in scientific output


Risk management of `derivative' contracts in finance
Requirements in just one product line (e.g., credit derivatives)

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Typically calculations involve either: solving PDEs using finite differences; or computing FFTs; or Monte-Carlo (MC) simulations. 2000 nodes â 1 kW/node + 50% aircon cost = 3 MW 3 MW â 10 p/s â 8500 hr/yr = 2.5 â 106 GBP/yr! Additional costs number of nodes:
Installation, maintenance, software licenses (even Excel sometimes!) Floor-space (in expensive buildings) Standby backup power generation costs


Numerical performance
(Why) does it matter?

Numerical computation in C++

Introduction

Easily parallelisable
Cost Heat, power, floor space Environmental impact Time to scale-up Access to capital

Difficult to parallelise
Feasibility Latency User patience

Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Parallelisation is usually the most impor tant aspect of high-performance numerical computing
Not directly considering it in this talk although much of the material is relevant


Small problems simple solutions
Many practical scientific and industrial problems can be accelerated a simple way

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Listing 1: By-hand coding + SIMD intrinsics
void add2Vect ( const const std : : t y p e d e f double v 2 v2df dest =( v2df c o n s t s i z e t n= v 1 const v 2 d f s r c 1 = const v 2 d f s r c 2 = i f ( n %2==0) { for ( s i z e t i =0; { dest [ i ]= b } } else { for ( s i z e t i =0; { dest [ i ]= src1 [ } } } s t d : : v e c t o r & r e df attribute ( )&( r e s . b e g i n ( ) ) . size ( ) ; ( c o n s t v 2 d f )& v1 [ ( c o n s t v 2 d f )& v2 [ e> &v 1 , e> &v 2 , s) { ( mode ( V2DF ) ) ) ; ; 0]; 0];

i
i
Simple problems are common in real life but not really the subject of this talk!


Hand coding unsuitable for large systems

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Correctness Maintainability, readability, por tability Algorithms need adjustment over time Experiment with different implementations of algorithms Approximations: how much precision, what accuracy is necessar y? These can be difficult to achieve with complex hand-crafted code!


Warning!
"Don't try this at home" ­ try existing libraries first

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance

Writing numerical libraries is difficult and error prone ­ always carefully consider alternatives! Can you use standard existing libraries ("C" or "C++") Are you writing a general purpose library or an application? Can you, in advance, identify a subset of algorithm which is likely to consume most time but can present a clean, data-only, interface?

Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


Outline
Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


Requirements for good numerical performance
Maximise parallelism
Use all of the nodes/processors/cores/execution units Use Single-Instruction-Multiple-Data (SIMD)

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Minimise memor y access
Keep close data to be processed together Use algorithms that process small chunks of input data at a time Avoid temporaries

Minimise `branching'
Keep the pipeline and speculative fetches good But, need enough code at hand to execute

Minimise quantity of transcendental calculations
Includes division in this set Reducing precision or accuracy makes these faster


Optimisation Challenges I
Want: to describe the algorithm in simple, readable, re-usable way
/ / This : R=A+B+C+D+E ; / / Not t h i s : addFiveVect Double Double Double Double Double ( A , B , C, D, E , R ) ;

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Rules for transforming such description to executable code need to be complex to be efficient Simple application of rules applying to C++ objects:
Arguments are `evaluated' before being passed to functions Operators take two arguments at most Creation of temporaries Iteration is interpreted literally as ordered repetition of same segment of code

Fundamentally: One step of the algorithm at a time Definitely not suitable for fast code!


Optimisation Challenges II

Numerical computation in C++

Introduction

Programs may be compiled on one hardware setup but run on many different hardware setups Might need (or want) to adjust rules for generation of implementations after the compilation of the main program Speed of execution of par ticular implementation of algorithm can be difficult to predict
Depends on precise model of the processor : clock speed, number of floating point execution cores, hyper-threading, branch-prediction, pipeline designs, microcode implementations of complex instructions Sizes of the various levels of data and code caches, main memor y bus speed

Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


Example: row vs column matrix access

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance

Listing 2: Sum by iterating through columns first u : : m a t r i x A ( n r o w // initialise ... double r e s ; f o r ( s i z e t k = 0 ; k
Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

; ++ k ) ; ++ i ) o l ; ++ j )


Example: row vs column matrix access

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance

Listing 3: Sum by iterating through rows first u : : m a t r i x A ( n r o w , // initialise ... double r e s ; f o r ( s i z e t k = 0 ; k
Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

++ k ) ++ j ) / / n o t e swap ; ++ i ) / / n o t e swap


Example: row vs column matrix access
Rows 1000000 100000 10000 1000 100 10 Columns 10 100 1000 10000 100000 1000000 Time for col-first (seconds) 4.16 4.15 4.04 4.02 4.00 3.96 Time for row-first (seconds) 4.52 9.54 5.52 5.41 4.56 4.04

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example

Note
Compiled using gcc without optimisation Run on my laptop Illustration only!

Summary


Techniques used for advanced numerical libraries

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance

Optimising compilers Custom compilers for standard languages Code generation using custom languages/frameworks Run-time selection according to detected hardware Run-time profiling of multiple/many algorithms Run-time generation of machine code Expression templates and lazy code generation can adapt all of these to standard C++.

Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


A quick case study: FFTW
http://www.fftw.org

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Revolutionar y at the time: Building blocks ("codelets") of algorithms generated at compile-time: OCAML C machine code Run-time selection of best combination of building-blocks
Memor y-layout, cache sizes, relative speed of memor y Code selection can be saved

SIMD+ (p)threads + MPI parallelism Presents trivial C-language & For tran-language interfaces


Some subtleties
The order of floating point operations usually matters even when operations on real numbers would not: 1 + (-1 + 10
-10

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

) = (1 - 1) + 10

-10

(1)

"Standard" x86 floating point uses 80-bit internal precision! (SIMD instructions do not) Non-normal (NaN, Inf, de-normalised) floating point number badly affect performance: + 1 = (slowly!) (2)

Transcendentals can be approximated to less than full precision Ensuring "bit-equivalent" results is difficult and expensive


Outline
Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


How to identify bottlenecks

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Run a tick counting profiler: http://oprofile.sourceforge.net/
Get a stochastic measurement of where the CPU spends most of its time without modifying the code!

Run a call-graph profiler: http://valgrind.org/
Shows how the CPU-intensive par ts of the code fit into the big picture of the application

Compile to assembly only ("gcc -S") and look at the code!
Allows identifications of in-efficiencies in the produced code and gives hints for optimisations


Outline
Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


What is an expression template?
Templated Type Where the type itself encodes an operation, expression or algorithm Passed between functions as instantiated objects (usually with data as references) Implemented through (par tial) specialisations

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Expression template

=


Expression template (minimal) example

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example

Listing 4: Adding vectors / / Some v e c t o r s t o u s e i n t h e e x a m p l e s t d : : v e c t o r a ( 1 0 , 1 . 0 ) , b ( 1 0 , 2 . 0 ) ; / / A d d i t i o n t h e o l d way ( s e e n e x t s l i d e ) d o u b l e r e s = ( a+b+a+b ) [ 3 ] ; 1. A temporar y is created 2. All members of the result vector are computed 3. The temporar y is iterated three times over

Summary


Helper function
Listing 5: Simple add operator for std::vector t e m p l a t e < c l a s s T> s t d : : v e c t o r operator +( const std const std { s t d : : v e c t o r r e f o r ( s i z e t i =0; i res [ i ]= a [ i ]+ b [ i r e t u r n res ; }

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

: : v e c t o r &a , : : v e c t o r &b ) s (a . size ( ) ) ;
1. Considers two vectors at a time 2. Iterates through the whole vector and computes the entire result


Expression template (minimal) example
Listing 6: Expression template class t e m p l a t e valueop, addop are simple tag structs

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

a s s E1 , c l a s s E2 , a s s op = v a l u e o p > p &left ; E1 val ), E1 c o n s t E2 & r i g h t ;

onst p op left onst

& l e f t , c o n s t E2 & r i g h t , ): r i g h t ( r i g h t ) {}; &v a l ) ;


Expression template (minimal) example
Listing 7: Creation of compound expression t e m p l a t e o p e r a t o r + ( c o n s t E1 & l c o n s t E2 & r { r e t u r n b i n o p eft , ight ) addop >( l e f t , right , addop ( ) ) ;

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

} 1. E1, E2 are `free' template parameters ­ types of sub-expression is encoded in result 2. addop is the type of third template parameter ­ encodes addition of sub-expressions


Expression template (minimal) example

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Listing 8: Templated evaluation operation // / // / tem dou Evaluate the an e x p r e s s i o n p l a t e b i n o p &o , t i );


Expression template (minimal) example
Par tial specialisation for add operation

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Listing 9: Implementation of sum operation using par tial specialisation t e m p l a t e b i n o p &o , i) f t , i )+ eval ( o . r i g h t , i ) ;

1. Specialised on addop as third temp-par 2. Recursively evaluate and add using standard operator+


Expression template (minimal) example
Par tial specialisation for value operation

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance

Listing 10: Par tial specialisation for value types t e m p l a t e d o u b l e e v a l ( c o n s t b i n o p &o ,

Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

1. Specialised on valueop as third template-parameter 2. Simply returns the value of reference vector at i


Expression template (minimal) example

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance

Listing 11: In use s t d : : v e c t o r a ( 1 0 , 1 . 0 ) , b(10 , 2 . 0 ) ; b i n o p <> ba ( a ) , bb ( b ) ; d o u b l e r e s = e v a l ( b a + b b + b a + bb , 3 ) ; 1. Does not create a temporary vector 2. Evaluates only the third element of the result

Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


A look at the types

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux

Listing 12: Signatures as seen by nm -C
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 4 4 4 0 0 0 0 1 1 1 1 1 4 6 4 4 a 2 7 a 7 4 3 W W W W d d d d o o o o u u u u b b b b l l l l e e e e e e e e v v v v a a a a l l l l
Expression template generalities t d : : v e c t o r >, s c t o r , s t d : : v e l e> code o u b l e , s t d : : a l l o c a t o r >, gener: : v e c t what s t d ation ­ o r >, s t d : : v e is & how it w e , LzCG example Summary

td c o s


A look at the types
Listing 13: Output of nm -C wrapped properly
d o u b l e e v a l, b i n o p, a d d o p >, b i n o p >, s t d : : v e c t o r >, v a l u e o p >, a d d o p >, b i n o p >, s t d : : v e c t o r >, v a l u e o p> >( b i n o p >, s t d : : v e c t o r >, v a l u e o p >, b i n o p >, s t d : : v e c t o r >, v a l u e o p >, a d d o p >, b i n o p >, s t d : : v e c t o r >, v a l u e o p >, a d d o p >, b i n o p >, s t d : : v e c t o r >, v a l u e o p >, a d d o p> c o n s t & , unsigned l o n g ) l e> >, l e> >, l e> >, l e> >,

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


A look at the types

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


Lazy evaluation
Not the same as lazy code generation...

Numerical computation in C++

In summar y: 1. Operations (+function calls) return `expressions' not results 2. The order and implementation of operations in expression can be modified (at compile-time) 3. Results are only evaluated at a boundary, e.g., when assigning to plain-old-data 4. If the result is never required, it is never computed Doing this properly is very elegant but things get complicated ­ see the Haskel programming language Things to keep in mind: 1. Side-effects are ill-defined ­ stick to `functional' programming
Assigning to an already initialised variable is not functional programming!

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

2. Implemented with expression templates, size of types grows ver y quickly


Libraries
Eigen http://eigen.tuxfamily.org (LGPL) Array Ops, Basic + Advanced Linear Algebra, Geometr y Armadillo http://arma.sourceforge.net/ (LGPL) Array Ops, Basic + Advanced Linear Algebra, Geometr y Boost.uBLAS http://www.boost.org (Boost License) Basic Linear Algebra NT
2 http://nt2.sourceforge.net/

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

(LGPL)

Blitz++ http://www.oonumerics.org/blitz/ (GPL+ar tistic) This was one of the first libraries to use expression templates


Outline
Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


Introduction
`Standard' expression templates
1. Types convey information about algorithm that functions implement 2. These types are interpreted at compile-time and corresponding code generated

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example

Lazy code generation
1. The types are recorded in object code too, so the algorithm implemented by symbols is retained 2. Generate new implementations, post-compilation-time Introduces new flexibility and modularity in code generation process

Summary


Example ­ BLAS Reference card
Level 1 BLAS
SUBROUTINE SUBROUTINE SUBROUTINE SUBROUTINE SUBROUTINE SUBROUTINE SUBROUTINE SUBROUTINE FUNCTION FUNCTION FUNCTION FUNCTION FUNCTION FUNCTION FUNCTION xROTG ( xROTMG( xROT ( xROTM ( xSWAP ( xSCAL ( xCOPY ( xAXPY ( xDOT ( xDOTU ( xDOTC ( xxDOT ( xNRM2 ( xASUM ( IxAMAX( dim scalar vector 5-element array A, B, C, S ) D1, D2, A, B, PARAM ) X, INCX, Y, INCY, C, S ) X, INCX, Y, INCY, PARAM ) X, INCX, Y, INCY ) ALPHA, X, INCX ) X, INCX, Y, INCY ) ALPHA, X, INCX, Y, INCY ) X, INCX, Y, INCY ) X, INCX, Y, INCY ) X, INCX, Y, INCY ) X, INCX, Y, INCY ) X, INCX ) X, INCX ) X, INCX ) vector scalars

Numerical computation in C++

Introduction
N, N, N, N, N, N, N, N, N, N, N, N, N,

Level 2 BLAS
xGEMV xGBMV xHEMV xHBMV xHPMV xSYMV xSBMV xSPMV xTRMV xTBMV xTPMV xTRSV xTBSV xTPSV xGER xGERU xGERC xHER xHPR xHER2 xHPR2 xSYR xSPR xSYR2 xSPR2 ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( ( (

Generate plane rotation Generate modi ed plane rotation Apply plane rotation Apply modi ed plane rotation x$y xx yx y x+y dot xT y dot xT y dot xH y dot + xT y nrm2 jjxjj2 asum jjre(x)jj1 + jjim(x)jj1 amax 1st k 3 jre(xk )j + jim(xk )j = max(jre(xi )j + jim(xi )j)

pre xes S, D S, D S, D S, D S, D, C, Z S, D, C, Z, CS, ZD S, D, C, Z S, D, C, Z S, D, DS C, Z C, Z SDS S, D, SC, DZ S, D, SC, DZ S, D, C, Z S, S, C, C, C, S, S, S, S, S, S, S, S, S, S, C, C, C, C, C, C, S, S, S, S, S, S, C, S, C, S, C, S, S, D, D, Z Z Z D D D D, D, D, D, D, D, D Z Z Z Z Z Z D D D D D, D, Z D, Z D, Z D, D, C, Z C, Z C, Z C, Z C, Z C, Z C, Z C, Z

Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

options TRANS, TRANS, UPLO, UPLO, UPLO, UPLO, UPLO, UPLO, UPLO, TRANS, UPLO, TRANS, UPLO, TRANS, UPLO, TRANS, UPLO, TRANS, UPLO, TRANS, options

DIAG, DIAG, DIAG, DIAG, DIAG, DIAG,

UPLO, UPLO, UPLO, UPLO, UPLO, UPLO, UPLO, UPLO,

dim b-width M, N, M, N, KL, KU, N, N, K, N, N, N, K, N, N, N, K, N, N, N, K, N, dim scalar M, N, ALPHA, M, N, ALPHA, M, N, ALPHA, N, ALPHA, N, ALPHA, N, ALPHA, N, ALPHA, N, ALPHA, N, ALPHA, N, ALPHA, N, ALPHA,

scalar ALPHA, ALPHA, ALPHA, ALPHA, ALPHA, ALPHA, ALPHA, ALPHA,

vector X, INCX, X, INCX, X, INCX, X, INCX, X, INCX, X, INCX, X, INCX, X, INCX, X, INCX, X, INCX, X, INCX,

matrix vector A, LDA, X, INCX, A, LDA, X, INCX, A, LDA, X, INCX, A, LDA, X, INCX, AP, X, INCX, A, LDA, X, INCX, A, LDA, X, INCX, AP, X, INCX, A, LDA, X, INCX A, LDA, X, INCX AP, X, INCX A, LDA, X, INCX A, LDA, X, INCX AP, X, INCX vector matrix Y, INCY, A, LDA Y, INCY, A, LDA Y, INCY, A, LDA A, LDA AP ) Y, INCY, A, LDA Y, INCY, AP ) A, LDA AP ) Y, INCY, A, LDA Y, INCY, AP )

scalar BETA, BETA, BETA, BETA, BETA, BETA, BETA, BETA, ) ) ) ) ) ) ) ) ) ) ) ) )

vector Y, INCY Y, INCY Y, INCY Y, INCY Y, INCY Y, INCY Y, INCY Y, INCY

) ) ) ) ) ) ) )

y y y y y y y y x x x x x x A A A A A A A A A A A
) ) ) ) ) ) )

Level 3 BLAS
options xGEMM ( xSYMM ( xHEMM ( xSYRK ( xHERK ( xSYR2K( xHER2K( xTRMM ( xTRSM ( TRANSA, TRANSB, SIDE, UPLO, SIDE, UPLO, UPLO, UPLO, UPLO, UPLO, SIDE, UPLO, SIDE, UPLO, TRANS, TRANS, TRANS, TRANS, TRANSA, TRANSA,

Ax + y; y AT x + Ax + y; y AT x + Ax + y Ax + y Ax + y Ax + y Ax + y Ax + y Ax; x AT x; x AH x Ax; x AT x; x AH x Ax; x AT x; x AH x A 1 x; x A T x; x A A 1 x; x A T x; x A A 1 x; x A T x; x A xyT + A; A m n xyT + A; A m n xyH + A; A m n xxH + A xxH + A xyH + y( x)H + A xyH + y( x)H + A xxT + A xxT + A xyT + yxT + A xyT + yxT + A op(A)op(B) + AB + C; C AB + C; C AAT + C; C AAH + C; C ABT + BAT ABH + BAH op(A)B; B op(A 1 )B; B

y; y y; y

AH x + y ; A m n AH x + y ; A m n

Hx Hx Hx

C, C, C, C, C, C,

Z Z Z Z Z Z

dim M, N, M, N, M, N, N, N, N, N, DIAG, M, N, DIAG, M, N,

scalar matrix K, ALPHA, A, LDA, ALPHA, A, LDA, ALPHA, A, LDA, K, ALPHA, A, LDA, K, ALPHA, A, LDA, K, ALPHA, A, LDA, K, ALPHA, A, LDA, ALPHA, A, LDA, ALPHA, A, LDA,

matrix scalar matrix B, LDB, BETA, C, LDC B, LDB, BETA, C, LDC B, LDB, BETA, C, LDC BETA, C, LDC BETA, C, LDC B, LDB, BETA, C, LDC B, LDB, BETA, C, LDC B, LDB ) B, LDB )

2

C C C C C C C B B

C; op(X ) = X; X T ; X H ; C m n BA + C; C m n; A = AT BA + C; C m n; A = AH AT A + C ; C n n AH A + C; C n n + C; C AT B + BT A + C; C n n + C; C AH B + B H A + C ; C n n Bop(A); op(A) = A; AT ; AH ; B m n Bop(A 1 ); op(A) = A; AT ; AH ; B m n


BLAS ­ structure of function of names

Numerical computation in C++

Introduction

Operation: DOT scalar product AXPY vector sum MV matrix-vector product SV matrix-vector solve MM matrix-matrix product SM matrix-matrix solve Numerical type: S single real D double real C single complex Z double complex

Matrix type: GE general GB general band SY symmetric SB symmetric band SP symmetric packed HE hermitian HB hermitian band HP hermitian packed TR triangular TB triangular band TP triangular packed

Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


C++ type is encoded in the function (symbol) name
Listing 14: Back to basic expression template example
d o u b l e e v a l, b i n o p, a d d o p >, b i n o p >, s t d : : v e c t o r >, v a l u e o p >, a d d o p >, b i n o p >, s t d : : v e c t o r >, v a l u e o p> >( b i n o p >, s t d : : v e c t o r >, v a l u e o p >, b i n o p >, s t d : : v e c t o r >, v a l u e o p >, a d d o p >, b i n o p >, s t d : : v e c t o r >, v a l u e o p >, a d d o p >, b i n o p >, s t d : : v e c t o r >, v a l u e o p >, a d d o p> c o n s t & , l e> >, l e> >, l e> >, l e> >,

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


C++ type is encoded in the function (symbol) name

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Listing 15: Back to basic expression template example (raw nm)
000000000040114 a W Z4evalI5binopIS0 IS0 ISt6vectorIdSaIdEES3 \ 7 valueopES5 5addopES5 S6 ES5 EdRKS0 IT T0 S6 Em

The function/symbol name specifies exactly the algorithm that it should apply to its data This information is available post-compile-time (as simply as using nm) Implementation can be generated post-compilation and used in program simply by linking it (weak symbols make this trivial)


Alternative view

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance

What we get:
Expression templates allow expor t of a subset of the program parse tree outside the compiler environment.

Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

What would be the alternative?
Parsing the C++ source code writing new compiler


Potential advantages of LzCG
General: 1. A very simple, clean, efficient mechanism for separating specification of what needs to be from how its done 2. A mechanism introducing an embedded language in C++ that can be implemented outside traditional C++ compilation scheme Specific: 1. Can tr y multiple algorithms and select the experimentally most efficient 2. Can detect hardware configuration and generate efficient code without access to source code 3. Can use custom and third-par ty code generators (GPU compilers, cluster compute tools, or Ocaml + C-compiler!)

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


Outline
Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


Example introduction ­ Fast Fourier Transforms

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux

Optimum FFT algorithm depends on array length, cache sizes, processor architecture, etc., etc Normally selected by benchmarking at run-time Can we do better if we know array size at compile-time?

Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


Code
Listing 16: Call to compute the forward FFT b o o s t : : a r r a y i n p ; b o o s t : : a r r a y o u t ; FFTForward ( i n p , o u t ) ; 1. Array sizes are known at compile-time 2. Can select optimum algorithm as soon as we know what machine we run on
This may be after compile-time but must be before run-time

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

3. Selection before run-time:
3.1 Removes potentially lengthy run-time algorithm selection 3.2 Makes the program performance more predictable 3.3 Reduces code size 3.4 Allow selection from wider range of algorithms


Code

Numerical computation in C++

Introduction

Listing 17: Call to FFTW the old-fashioned way f f t w p l a n p= f f t w p l a n d f t 1 d (5 , ( fftw complex )(& in [ 0 ] ) , ( fftw complex )(& out [ 0 ] ) , FFTW FORWARD, FFTW ESTIMATE ) ; fftw execute (p ) ; 1. First call to create the plan can be time consuming (e.g., 1 second !) 2. Linking the entire library ­ `codelets' + the algorithm selection code

Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


Code

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Listing 18: Declaration of the FFTForward template t e m p l a t e v o i d F F T F o r w a r d ( c o n s t b o o s t : : a r r a y & i n , b o o s t : : a r r a y & o u t ) ; Listing 19: Call to compute the forward FFT b o o s t : : a r r a y i n p ; b o o s t : : a r r a y o u t ; FFTForward ( i n p , o u t ) ;


Code
Listing 20: Signature of the call to FFTForward (nm -C)
0 0 0 0 0 0 0 0 0 0 4 0 0 5 e3 W v o i d F F T F o r w a r d( b o o s t : : a r r a y c o n s t & , b o o s t : : a r r a y&)

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

This is all the information we need to select an algorithm: 1. Parse signatures to identify all instances of FFTForward 2. Machine generate new C++ code with specialisation for each FFTForward instance but with optimum algorithm pre-selected 3. Compile and link these with original code

Note:
Do not need access to the application source code ­ just the object file Can do the code generation on a different computer, using different tools, compilers & languages


Implementation

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Listing 21: Parse symbol table
def analy mre = r boost : : ar boost : : ar stabl seCal ( fnamein ) : e . c o m p i l e ( " . v o i d F F T F o r w a r d\\(\ <> r a y c o n s t & , \ r a y&\\). " ) e = s u b p r o c e s s . Popen ( [ " nm " , "-C " , f n a m e i n ] , s t d i n = s u b p r o c e s s . PIPE , s t d o u t = s u b p r o c e s s . PIPE ) . c o m m u n i c a t e ( ) [ 0 ] f o r l i n s t a b l e . s p l i t ( " \n " ) : m=mre . m a t c h ( l ) i f m: s = mkFunc ( i n t ( m . g r o u p ( " N " ) ) )

1. Simple regular expression on output of the nm -C 2. The array length is extracted from the signature


Implementation
Listing 22: Algorithm selection
def t e m p l a t e P r o g (N return " " " # i n c l u d e i n t main ( v o i d ) { c o n s t s i z e t N=% i ; f f t w c o m p l e x in , o fftw plan p; i n = ( f f t w c o m p l e x ) out = ( fftw complex p = fftw plan dft 1d fftw print plan (p ); } " " " % (N / 2 ) ):

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux ut ; fftw malloc ( sizeof ( fftw complex ) N) ; ) fftw malloc ( sizeof ( fftw complex ) N) ; ( N , i n , o u t , FFTW FORWARD, FFTW MEASURE ) ; / / This outputs the plan to stdout Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

1. Trivial C program that calls FFTW with right array size 2. Execute this at lazy-code-generation-time 3. Print the selected best algorithm 4. Hardcode this algorithm selection in the specialised function


Implementation
Listing 23: Emit a specialised function for this array size
d e f mkFunc ( N ) : mkProg ( t e m p l a t e P r o g ( N ) ) p l a n =open ( " t m p P r o g O u t " ) . r e a d ( ) s= " " " # i n c l u d e # i n c l u d e " f f t b i n d . hpp " #include " fftw3 . h" template < > v o i d F F T F o r w a r d( c o n s t b o o s t : : a r r a y & i n , b o o s t : : a r r a y & o u t ) { c o n s t c h a r m p l a n = "%s " ; f f t w i m p o r t w i s d o m f r o m s t r i n g ( mplan ) ; f f t w p l a n p= f f t w p l a n d f t 1 d (% i , ( f f t w c o m p l e x ) c o n s t c a s t (& i n [ 0 ] ) , ( f f t w c o m p l e x )(& out [ 0 ] ) , FFTW FORWARD, FFTW ESTIMATE ) ; fftw execute (p ) ; fftw destroy plan (p ); } " " "% ( N , N , N , p l a n , N / 2 ) return s

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

1. The plan is stored as string literal within each specialised function


What have we achieved?

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

1. Optimum, pre-selected FFTW transform for each array of known size at compile-time ­ efficiency 2. Could switch to multi-core/GPU/etc without access to source code ­ modularity Implementation shor tcomings (this is a quick example!): The entire FFTW is linked-in ­ not just the specific algorithms Loading plans at run-time could be significant for small transforms


Outline
Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


Simple C++ code is not numerically efficient

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works example

1. Function arguments evaluated promptly:
v e c t o r m y V e c t ( 1 0 0 0 0 0 0 0 , 3 . 0 ) ; t a k e O n e ( 1 . 0 / myVect , 5 ) ;

2. Only binar y operators

v e c t o r m y V e c t ( 1 0 0 0 0 0 0 0 , 3 . 0 ) ; v e c t o r m y V e c t 2 = m y V e c t + m y V e c t + m y V e c t ;LzCG

Summary

3. Optimum algorithm sometimes can not be selected at compile-time


Expression templates resolve many of these issues

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux

1. Expressions create objects with type that specifies the algorithm to be carried out 2. This type is inter preted at compile-time (through par tial specialisation) to generate an efficient algorithm implementation of the whole expression Blitz++, Boost::UBLAS, Armadillo, Eigen, NT2

Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary


Lazy code generation

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

1. The type of expression templates is available in object code
0 0 0 0 0 0 0 0 0 0 4 0 0 5 e3 W v o i d F F T F o r w a r d( b o o s t : : a r r a y c o n s t & , b o o s t : : a r r a y&, b o o s t : : a r r a y&)

2. It can be inter preted post-compilation-time to generate code for the implementation
Tr y different algorithms Delay algorithm selection for final hardware Increased modularisation

3. The original implementation can be empty or a fall-back


When to use Lazy code generation

Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary

1. Need to do empirical selection of optimum algorithms 2. Want to do code generation using non-C++ compiler tools:
OcamL C+intrinsics machine code GPU compiler?

3. Need to re-establish clean separation between application and library
Update implementations without access to original source code Licensing concerns, proprietar y libraries


Numerical computation in C++

Introduction Numerical algorithms, libraries and their performance

Thanks for listening!
See http://www.bnikolic.co.uk/ and http://www.mrao.cam.ac.uk/~bn204/ and for more information

Interlude: Profiling on Linux Expression template generalities Lazy code generation ­ what it is & how it works LzCG example Summary