Front cover

The POWER4 Processor Introduction and Tuning Guide

Comprehensive explanation of POWER4 performance
Includes code examples and performance measurements
How to get the most from the compiler

Steve Behling
Ron Bell
Peter Farrell
Holger Holthoff
Frank O'Connell
Will Weir

ibm.com/redbooks



International Technical Support Organization

The POWER4 Processor Introduction and Tuning Guide

November 2001

SG24-7041-00


Take Note! Before using this information and the product it supports, be sure to read the general information in "Special notices" on page 175.

First Edition (November 2001)
This edition applies to AIX 5L for POWER Version 5.1 (program number 5765-E61), XL Fortran Version 7.1.1 (5765-C10 and 5765-C11) and subsequent releases running on an IBM eServer pSeries POWER4-based server. Unless otherwise noted, all performance values mentioned in this document were measured on a 1.1 GHz machine, then normalized to 1.3 GHz.

Note: This book is based on a pre-GA version of a product and may not apply when the product becomes generally available. We recommend that you consult the product documentation or follow-on versions of this redbook for more current information.

Comments may be addressed to:
IBM Corporation, International Technical Support Organization
Dept. JN9B Building 003 Internal Zip 2834
11400 Burnet Road
Austin, Texas 78758-3493

When you send information to IBM, you grant IBM a non-exclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.
© Copyright International Business Machines Corporation 2001. All rights reserved.
Note to U.S. Government Users - Documentation related to restricted rights - Use, duplication or disclosure is subject to restrictions set forth in GSA ADP Schedule Contract with IBM Corp.


Contents
Figures
Tables
Preface
  The team that wrote this redbook
  Notice
  IBM trademarks
  Comments welcome

Chapter 1. Processor evolution
  1.1 POWER1
  1.2 POWER2
  1.3 PowerPC
  1.4 RS64
  1.5 POWER3
  1.6 POWER4

Chapter 2. The POWER4 system
  2.1 POWER4 system overview
  2.2 The POWER4 chip
  2.3 Processor overview
    2.3.1 The POWER4 processor execution pipeline
    2.3.2 Instruction fetch, group formation, and dispatch
    2.3.3 Instruction execution, speculation, rename resources
    2.3.4 Branch prediction
    2.3.5 Translation buffers (TLB, SLB, I- and D-ERAT)
    2.3.6 Load instruction processing
    2.3.7 Store instruction processing
    2.3.8 Fixed-point execution pipeline
    2.3.9 Floating-point execution pipeline
    2.3.10 Group completion
  2.4 Storage hierarchy
    2.4.1 L1 instruction cache
    2.4.2 L1 data cache
    2.4.3 L2 cache
    2.4.4 L3 cache
    2.4.5 Interconnecting chips to form larger SMPs
    2.4.6 Multiple module interconnect
    2.4.7 Memory subsystem
    2.4.8 Hardware data prefetch
    2.4.9 Memory/L3 cache command queue structure
  2.5 I/O structure
  2.6 The POWER4 Performance Monitor

Chapter 3. POWER4 system performance and tuning
  3.1 Tuning for numerically intensive applications
    3.1.1 The tuning process for numerically intensive applications
    3.1.2 Hand tuning overview for numerically intensive programs
    3.1.3 Key aspects of the POWER4 design
    3.1.4 Tuning for the memory subsystem
    3.1.5 Tuning for the FPUs
    3.1.6 Cache and memory latency measurement
    3.1.7 Selected fundamental kernel performance within on-chip cache
    3.1.8 Other tuning considerations
  3.2 Tuning non-floating point applications
    3.2.1 The load/store and integer units
    3.2.2 Memory configurations
  3.3 System tuning
    3.3.1 POWER4 virtual memory architecture overview
    3.3.2 Small and large page sizes
    3.3.3 AIX system parameters
    3.3.4 Minimizing variation in job performance

Chapter 4. Optimizing with the compilers
  4.1 POWER4-specific compiler options
    4.1.1 General performance options
    4.1.2 Options for POWER4
    4.1.3 Using XL Fortran vector-intrinsic functions
    4.1.4 Recommended options
    4.1.5 Comparing C and Fortran compiler code generation
  4.2 XL Fortran compiler directives for tuning
    4.2.1 Prefetch directives
    4.2.2 Loop-related directives
    4.2.3 Cache and other directives
  4.3 The object code listing
  4.4 Basic coding practices for performance
    4.4.1 Language-independent tips
    4.4.2 Fortran tips
    4.4.3 C and C++ tips
    4.4.4 Inlining procedure references
    4.4.5 Structuring code for optimal grouping
  4.5 Tuning for 64-bit integer performance

Chapter 5. General tuning guidelines
  5.1 Hand tuning code
    5.1.1 Local or global variables?
    5.1.2 Pointers
    5.1.3 Expressions
    5.1.4 Data type conversions
    5.1.5 Tuning loops
  5.2 Using pre-tuned code
  5.3 The performance monitor
  5.4 Tuning for I/O
  5.5 Locating hot spots (profiling)

Chapter 6. Performance libraries
  6.1 The ESSL and Parallel ESSL libraries
    6.1.1 Capabilities of ESSL and Parallel ESSL
    6.1.2 Performance examples using ESSL
  6.2 The MASS libraries
    6.2.1 Installing and using the MASS libraries
    6.2.2 Description and performance of MASS libraries
  6.3 Modular I/O (MIO) library
  6.4 Watson Sparse Matrix Package (WSMP)

Chapter 7. Parallel programming techniques and performance
  7.1 Shared memory parallelization
    7.1.1 SMP runtime behavior
    7.1.2 Shared memory parallel examples
    7.1.3 Automatic shared memory parallelization
    7.1.4 Directive-based shared memory parallelization
    7.1.5 Measured SMP performance
  7.2 MPI in an SMP environment
  7.3 Programming with threads
    7.3.1 Basic concepts
    7.3.2 Coding and performance considerations
    7.3.3 The best approach for shared memory parallelization
  7.4 Parallel programming with shared caches

Chapter 8. Application performance and throughput
  8.1 Memory to memory copy
  8.2 Memory bandwidth limited throughput
  8.3 MPI parallel on pSeries 690 and SP
  8.4 Multiple job throughput
    8.4.1 ESSL DGEMM throughput performance
    8.4.2 Multiple ABAQUS/Explicit job streams
    8.4.3 Memory stress effects on throughput
    8.4.4 Shared L2 cache and logical partitioning (LPAR)
  8.5 Genetic sequencing program
  8.6 FASTA genetic sequencing program
  8.7 BLAST genetic sequencing program

Related publications
  IBM Redbooks
    Other resources
  Referenced Web sites
  How to get IBM Redbooks
    IBM Redbooks collections

Special notices

Abbreviations and acronyms

Index


Figures
2-1 The POWER4 chip
2-2 The POWER4 processor
2-3 The execution pipeline
2-4 A logical view of the interconnection buses within an MCM
2-5 Logical view of MCM-to-MCM interconnections
2-6 Multiple MCM interconnection
2-7 Hardware data prefetch operations
2-8 I/O structure
3-1 The POWER4 L1 data cache
3-2 POWER4 data transfer rates for multiple prefetch streams
3-3 Outer loop unrolling effects on matrix-vector multiply (1.1 GHz system)
3-4 Latency in machine cycles to access N bytes of random data
3-5 32-bit environment segment register usage
3-6 POWER address translation
3-7 Translation of 64-bit effective address to 80-bit virtual address
4-1 Integer computation: B(I)=A(I)+C
6-1 ESSL DGEMM single processor GFLOPS
7-1 Shared memory parallel job flow
8-1 Memory copy performance
8-2 C library memcpy performance
8-3 System memory throughput for pSeries 690 HPC
8-4 System memory throughput on pSeries 690 Turbo
8-5 Job throughput effects on a 375 MHz POWER3 SMP High Node
8-6 Job throughput effects on an eight-way pSeries 690 HPC
8-7 Job throughput effects on a 32-way pSeries 690 Turbo


Tables
1-1 Comparative POWER3-II, RS64-III, and POWER4 processor metrics
2-1 Issue queues
2-2 Rename resources
2-3 Storage hierarchy organization and size
3-1 Performance of various fundamental loops
4-1 Vector-intrinsic function speedups
6-1 DGEMM throughput summary
6-2 MASS library functions and performance
7-1 Loop A parallel performance elapsed time
7-2 Loop B parallel performance elapsed time
7-3 Loop C parallel performance elapsed time
7-4 Advantages and disadvantages of message passing techniques
7-5 Shared memory cache results, pSeries 690 Turbo
7-6 Counter and semaphore sharing cache line
7-7 Counter and semaphore in separate cache line
7-8 Heavily used shared cache line performance
8-1 Memory copy performance relative to one CPU
8-2 MPI performance results for AWE Hydra code
8-3 Effects of running multiple copies of DGEMM
8-4 Multiple ABAQUS/Explicit job stream times
8-5 FIRE benchmark: Impact of shared versus non-shared L2 cache
8-6 FIRE benchmark: Uniprocessor, single job versus partitioning
8-7 FIRE benchmark: Throughput performance versus partitioning
8-8 Performance on different systems
8-9 Relative performance of FASTA utilities
8-10 Blastn results
8-11 Tblastn results


Preface
This redbook is designed to familiarize you with the IBM eServer pSeries POWER4 microarchitecture and to provide you with the information necessary to exploit the new high-end servers based on this architecture. The eight to 32-way symmetric multiprocessing (SMP) pSeries 690 Model 681 will be the first POWER4 system to be available. Thus, most analysis presented in this publication refers to this system.

Specifically, this publication will address the following issues:
  POWER4 features and capabilities
  Processor and memory optimization techniques, especially for Fortran programming
  AIX XL Fortran Version 7.1.1 compiler capabilities and which options to use
  Parallel processing techniques and performance
  Available libraries and programming interfaces
  Performance examples of commonly used kernels

The anticipated audience for this redbook is as follows:
  Application developers
  Technical managers responsible for equipment purchase decisions
  Managers responsible for project planning
  Researchers involved in numerical algorithm development
  End users with an interest in understanding the performance of their applications

While this publication is decidedly technical in nature, the fundamental concepts are presented from a user point of view and numerous examples are provided to reinforce these concepts.



The team that wrote this redbook
This redbook was produced by a team of specialists from around the world working at the International Technical Support Organization, Austin Center.

Stephen Behling is an Application Specialist based in Minneapolis, Minnesota, USA. He has 27 years of experience in the computer aided engineering field. He has specialized in high performance computing for the past 15 years with Cray Research Inc., Silicon Graphics, Inc., and since 1999, IBM. His areas of expertise include computational fluid dynamics, parallel programming, benchmarking, and performance tuning.

Ron Bell is an IBM IT Consultant in the UK. He has an MA in Physics and a DPhil in Nuclear Physics from the University of Oxford. He has 30 years of experience with IBM High Performance Computing. His areas of expertise include the Fortran language, performance tuning for POWER architecture, and MPI parallel coding and tuning for the RS/6000 SP. He has for many years collaborated with HKS Inc. to optimize their ABAQUS product for IBM platforms.

Peter Farrell is a Consulting IT Specialist in Australia. He has worked with computer systems for over 30 years and has extensive experience in operating systems and networking software development. He has over 15 years experience in UNIX technical support with a special interest in system performance. He has worked at IBM for two years and before that was with Sequent Computer Systems. His current areas of responsibility include benchmarking and performance tuning.

Holger Holthoff is an IBM IT Consultant in Germany. He has been involved in parallel computing on RS/6000 SP since he joined the IBM Scientific Center, Heidelberg in 1994. Currently, he is a member of the pSeries Technical Support group focusing on high-performance computing projects and CAE applications in manufacturing industries. His areas of expertise include performance tuning for the POWER architecture and message passing programming for the RS/6000 SP.

Frank O'Connell is a Senior Technical Staff Member in the Future Processor Performance Department, where he has been a member of IBM's high-performance processor development effort since 1992. For the past 15 years, he has focused on scientific and technical computing performance within IBM, including microprocessor and systems design, operating system and compiler performance, algorithm development, and application tuning, in the capacity of both product development and customer support. Mr. O'Connell received a B.S.M.E. degree from the University of Connecticut and an M.S. degree in engineering-economic systems from Stanford University.



Will Weir is an IBM IT Specialist in the UK. He has worked on scientific applications and systems on IBM RS/6000 and RS/6000 SP for 11 years. Currently he is a member of the High Performance Computing team based in IBM Bedfont. His areas of expertise include application porting and benchmarking, and RS/6000 SP systems.

The project that produced this Redbook was managed by:
Scott Vetter from IBM Austin

A special thanks to Steve White from IBM Austin, for his determination and pursuit of excellence.

Thanks to the following people for their contributions to this project:
Arthur Ban, Bill Hay, Harry Mathis, John McCalpin, Alex Mericas, William Starke, Steve Stevens, Joel Tendler from IBM Austin
Joan McComb and Frank Johnston from IBM Poughkeepsie
Richard Eickemeyer from IBM Rochester
Bob Blainey, Ian McIntosh, and Kelvin Li from IBM Toronto

Notice
This publication is intended for developers of numerically intensive code for the IBM eServer pSeries POWER4, for business partners and sales specialists wanting supporting metrics for pSeries 690 Model 681, and for technical specialists who require detailed product information to help demonstrate IBM's industry-leading technology.

See the PUBLICATIONS section of the IBM Programming Announcement for Fortran Version 7.1.1 for more information about what publications are considered to be product documentation.



IBM trademarks
The following terms are trademarks of the International Business Machines Corporation in the United States and/or other countries:

AIX®, AIX 5L™, e (logo)®, IBM®, iSeries™, LoadLeveler®, PowerPC®, PowerPC Architecture™, PowerPC 601®, PowerPC 603™, PowerPC 604™, pSeries™, Redbooks™, Redbooks Logo, RISC System/6000®, RS/6000®, Sequent®, SP™

Comments welcome
Your comments are important to us! We want our IBM Redbooks to be as helpful as possible. Send us your comments about this or other Redbooks in one of the following ways:

Use the online Contact us review redbook form found at:
ibm.com/redbooks

Send your comments in an Internet note to:
redbook@us.ibm.com

Mail your comments to the address on page ii.



Chapter 1. Processor evolution
In this section, the stages of RS/6000 and pSeries processor development are discussed, starting with the POWER1 architecture through to the latest POWER4.

1.1 POWER1
The first RS/6000 products were announced by IBM in February of 1990, and were based on a multiple chip implementation of the POWER architecture, described in IBM RISC System/6000 Technology, SA23-2619. This technology is now commonly referred to as POWER1, in the light of more recent developments. The models introduced included an 8 KB instruction cache (I-cache) and either a 32 KB or 64 KB data cache (D-cache). They had a single floating-point unit capable of issuing one compound floating-point multiply/add (FMA) operation each cycle, with a latency of only two cycles. Therefore, the peak MFLOPS rate was equal to twice the MHz rate. For example, the Model 530 was a desk-side workstation operating at 25 MHz, with a peak performance of 50 MFLOPS. Commonly occurring numerical kernels were able to achieve performance levels very close to this theoretical peak.

In January of 1992, the Model 220 was announced, based on a single chip implementation of the POWER architecture, usually referred to as RISC Single Chip (RSC). It was designed as a low-cost, entry-level desktop workstation, and contained a single 8 KB combined instruction and data cache.



The last POWER1 machine, announced in September of 1993, was the Model 580. It ran at 62.5 MHz and had a 32 KB I-cache and a 64 KB D-cache.

1.2 POWER2
Announced in September 1993, the first POWER2 machines included the 55 MHz Model 58H, the 66.5 MHz Model 590, and the 71.5 MHz 990. The most significant improvement introduced with the POWER2 architecture for scientific and technical applications was the floating-point unit (FPU) that was enhanced to contain two 64-bit execution units. Thus, two floating-point multiply/add instructions could be executed each cycle. A second fixed-point execution unit was also provided. In addition, several new hardware instructions were introduced with POWER2:

  Quad-word storage instructions. The quad-word load instruction moves two adjacent double-precision values into two adjacent floating-point registers.
  Hardware square root instruction.
  Floating-point to integer conversion instructions.

Although the Model 590 ran with only a marginally faster clock than the POWER1-based Model 580, the architectural improvements listed above, combined with a larger 256 KB D-cache size, enabled it to achieve far greater levels of performance.

In October 1996, IBM announced the RS/6000 Model 595. This was the first machine to be based on the P2SC (POWER2 Super Chip) processor. As its name suggests, this was a single chip implementation of the POWER2 architecture, enabling the clock speed to be increased further. The Model 595 ran at 135 MHz, and the fastest P2SC processors, found in the Model 397 workstation and RS/6000 SP Thin4 nodes, ran at 160 MHz, with a theoretical peak speed of 640 MFLOPS.

1.3 PowerPC
The RS/6000 Model 250 workstation, the first to be based on the PowerPC 601 processor running at 66 MHz, was introduced in September, 1993. The 601 was the first processor arising out of the partnership between IBM, Motorola, and Apple. The PowerPC Architecture includes most of the POWER instructions. However, some instructions that were executed infrequently in practice were excluded from the architecture, and some new instructions and features were added, such as support for symmetric multiprocessor (SMP) systems. In fact, the 601 did not implement the full PowerPC instruction set, and was a bridge from POWER to the full PowerPC Architecture implemented in more recent processors, such as the 603, 604, and 604e.

Currently, the fastest PowerPC-based machines from IBM for technical purposes, the four-way SMP system RS/6000 7025 Model F50 and the uniprocessor system RS/6000 43P 7043 Model 150, use the 604e processor running at 332 MHz and 375 MHz, respectively. The POWER3 and POWER4 processors are also based on the PowerPC Architecture, but are discussed in the following sections.

1.4 RS64
The first RS64 processor was introduced in September of 1997 and was the first step into 64-bit computing for RS/6000. While the POWER2 product had strong floating-point performance, this series of products emphasized strong commercial server performance. It ran at 125 MHz with a 2-way associative, 4 MB L2 cache and had a 64 KB L1 instruction cache, a 64 KB L1 data cache, one floating-point unit, one load-store unit, and one integer unit. Systems were designed to use up to 12 processors. pSeries products using the RS64 were the first pSeries products to have the same processor and memory system as iSeries products.

In September 1998, the RS64-II was introduced. It was a different design from the RS64 and increased the clock frequency to 262 MHz. The L2 cache became 4-way set associative with an increase in size to 8 MB. It had a 64 KB L1 instruction cache, a 64 KB L1 data cache, one floating-point unit, one load-store unit, two integer units, and a short in-order pipeline optimized for conditional branches.

With the introduction of the RS64-III in the fall of 1999, this design was modified to use copper technology, achieving a clock frequency of 450 MHz, with the L1 instruction and data caches increased to 128 KB each. This product also introduced hardware multithreading for use by AIX. Systems were designed to use up to 24 processors.

In the fall of 2000, this design was enhanced to use silicon on insulator (SOI) technology, enabling the clock frequency to be increased to 600 MHz. The L2 cache size was increased to 16 MB on some models. Continued development of this design provided processors running at 750 MHz. The most recent version of this microprocessor was called the RS64-IV.

During the history of this family of products, top performance publications have been made for a large variety of benchmarks, including TPC-C (online transaction processing), SAP (enterprise resource planning - ERP), Baan (ERP), PeopleSoft (ERP), SPECweb (web serving), and SPECjbb (Java).



1.5 POWER3
The POWER3 processor brought together the fundamental design of the POWER2 microarchitecture, as currently implemented in the P2SC processor, with the PowerPC Architecture. It combined the excellent floating-point performance delivered by P2SC's two floating-point execution units, while being a 64-bit, SMP-enabled processor ultimately capable of running at much higher clock speeds than current P2SC processors. Initially introduced in the fall of 1998 at a processor clock frequency of 200 MHz, most recent versions of this microprocessor incorporate copper technology and operate at 450 MHz.

1.6 POWER4
The new POWER4 processor, described in detail in Chapter 2, "The POWER4 system" on page 5, continues the evolution. The POWER4 processor chip contains two microprocessor cores, chip and system pervasive functions, core interface logic, a 1.41 MB level-2 (L2) cache and controls, the level-3 (L3) cache directory and controls, and the fabric controller that controls the flow of information and control data between the L2 and L3 and between chips. Each microprocessor contains a 64 KB level-1 instruction cache, a 32 KB level-1 data cache, two fixed-point execution units, two floating-point execution units, two load/store execution units, one branch execution unit, and one execution unit to perform logical operations on the condition register. Instructions dispatched in program order in groups are issued out of program order to the execution units, with a bias towards oldest operations first. Groups can consist of up to five instructions, and are always terminated by a branch instruction.

The processors on the first IBM POWER4-equipped servers, the IBM eServer pSeries 690 Model 681 servers, operate at either 1100 MHz or 1300 MHz.

A quick look at comparative metrics may help you put the capacity of the latest POWER-based processors in perspective, as provided in Table 1-1.
Table 1-1 Comparative POWER3-II, RS64-III, and POWER4 processor metrics

Metric         POWER3-II 450 MHz   RS64-III 450 MHz   POWER4 1300 MHz
SPECint2000    335.0               234.0              814.0
SPECfp2000     433.0               210.0              1169.0



Chapter 2. The POWER4 system
The POWER4 system is a new generation of high-performance 64-bit microprocessors and associated subsystems especially designed for server and supercomputing applications. POWER4 systems power the next generation of servers that will be the replacements for the POWER3 and RS64-series high-end RS/6000 and pSeries technical servers. This chapter provides details of the POWER4 system that are significant to application programmers concerned with understanding or improving application performance.

2.1 POWER4 system overview
The POWER4 system is a high-performance microprocessor and storage subsystem utilizing IBM's most advanced semiconductor and packaging technology. It is the building block for the next-generation pSeries and iSeries SMP servers. The POWER4 system implements the PowerPC AS Processor Architecture, which specifies the instruction set, register set, and storage model, to name a few, in other words, all functions that are visible to the programmer.

A POWER4 system logically consists of multiple POWER4 microprocessors and a POWER4 storage subsystem, interconnected together to form an SMP system. Physically, there are three key components: the POWER4 processor chip, the L3 Merged Logic DRAM (MLD) chip, and the memory controller chip.

  The POWER4 processor chip contains two 64-bit microprocessors, a microprocessor interface controller unit, a 1.41 MB (1440 KB) level-2 (L2) cache, a level-3 (L3) cache directory, a fabric controller responsible for controlling the flow of data and controls on and off the chip, and chip/system pervasive functions.
  The L3 merged logic DRAM (MLD) chip contains 32 MB of L3 cache. An eight-way POWER4 SMP module will share 128 MB of L3 cache consisting of four modules, each of which contains two 16 MB merged logic DRAM chips.
  The memory controller chip features one or two memory data ports, each 16 bytes wide, and connects to the L3 MLD chip on one side and to the Synchronous Memory Interface (SMI) chips on the other.

The pSeries 690 Model 681 is built around the POWER4 Multi-chip Module (MCM), which contains four POWER4 chips. A 32-way SMP system contains four MCMs. POWER4 MCMs are mounted on system boards along with the L3, memory cards including the memory controllers, and support chips to form the heart of the pSeries 690 Model 681.

2.2 The POWER4 chip
The main components of the POWER4 chip are shown in Figure 2-1 on page 7. The POWER4 chip has a maximum of two microprocessors, each of which is a fully functional 64-bit implementation of the PowerPC AS Architecture specification. Also on the chip is a unified second-level cache, shared by both microprocessors through a core interface unit (CIU). The L2 cache is physically divided into three equal-sized parts, each having an L2 cache controller. The CIU connects each of the three L2 controllers to each processor through separate 32-byte wide data reload and instruction reload ports. Each microprocessor also has an 8-byte wide store port to the CIU that in turn is used to store data through the appropriate L2 controller. Each processor also has an associated non-cacheable unit (NCU), shown in Figure 2-1 on page 7, responsible for handling instruction-serializing functions and performing any non-cacheable operations in the storage hierarchy. Logically, these are part of the L2 cache.

To improve performance by reducing the latency to memory, the directory for the level 3 cache (L3 cache) and its controller are also located on the POWER4 chip (while the actual L3 arrays are located on the L3 MLD module). Additionally, for I/O device communication, the GX bus controller and the two associated four-byte wide GX buses, one on chip and one off chip, are on the chip as well.



[Block diagram omitted: two processor cores connected through the CIU switch to three L2 cache controllers, with per-core non-cacheable units, the fabric controller, L3 directory and controller, memory controller, and GX controller, plus the chip-to-chip, MCM-to-MCM, GX, and L3/memory buses.]

Figure 2-1 The POWER4 chip

Each POWER4 chip contains a fabric controller that provides master control of the network of buses. These buses connect together the on-chip L2 controllers, L3, other POWER4 chips, and other POWER4 modules, and also perform snooping and coherency duties. The fabric controller directs a point-to-point network between each of the four chips on the MCM, made up of unidirectional 16-byte wide buses running at half the processor frequency. It also controls the 8-byte buses, also operating at half the processor speed, that connect each chip to a corresponding chip on a neighboring MCM, the unidirectional 16-byte wide buses (running at 3:1 in the pSeries 690 Model 681) between the POWER4 chip and the L3 cache, and the buses to the NCU and GX controller.

Although not related to performance, it is worth mentioning that the chip also includes an important set of pervasive functions. These include trace and debug facilities used for First Failure Data Capture, built-in self-test (BIST) facilities, a performance monitoring unit (PMU), an interface to the service processor (SP) used to control the overall system, power-on reset (POR) sequencing logic, and error detection and logging circuitry.



2.3 Processor overview
Figure 2-2 shows a high-level block diagram of a POWER4 microprocessor. The POWER4 microprocessor is a high-frequency, speculative superscalar machine with out-of-order instruction execution capabilities. Eight independent execution units are capable of executing instructions in parallel, providing a significant performance attribute known as superscalar execution. These include two identical floating-point execution units, each capable of completing a multiply/add instruction each cycle (for a total of four floating-point operations per cycle), two load-store execution units, two fixed-point execution units, a branch execution unit, and a condition register unit used to perform logical operations on the condition register.
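As a rough illustration (a worked figure derived from the numbers above, not a benchmark result), the peak floating-point rate follows directly: two FMA-capable units each contribute two floating-point operations per cycle, so one core can retire at most 4 flops per cycle, or about 5.2 GFLOPS at 1300 MHz (10.4 GFLOPS for the two cores on a chip). Sustained rates depend on keeping both pipelines supplied with independent work, which is the subject of the tuning discussion in Chapter 3.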

[Block diagram omitted: instruction fetch (IFAR, I-cache, branch scan and branch predict) feeding decode, crack, and group formation, the GCT, and the issue queues for the branch/CR, fixed-point/load-store, and floating-point execution units, together with the store queue and D-cache.]

Figure 2-2 The POWER4 processor

To keep these execution units supplied with work, each processor can fetch up to eight instructions per cycle and can dispatch and complete instructions at a rate of up to five per cycle. A processor is capable of tracking over 200 instructions in-flight at any point in time. Instructions may issue and execute out-of-order with respect to the initial instruction stream, but are carefully tracked so as to complete in program order. In addition, instructions may execute speculatively to improve performance when accurate predictions can be made about conditional scenarios.



2.3.1 The POWER4 processor execution pipeline
Figure 2-3 depicts the POWER4 processor execution pipeline and shows the deeply pipelined structure of the machine's design. Each small box represents a stage of the pipeline (a stage is the logic that is performed in a single processor cycle). Note that there is a common pipeline that first handles instruction fetching and group formation, and this then divides into four different pipelines corresponding to four of the five types of execution units in the machine (the CR execution unit, which is similar to the fixed-point execution unit, is not shown). All pipelines have a common termination stage, which is the group completion (CP) stage.

[Pipeline diagram omitted: instruction fetch (IF, IC, BP), instruction crack and group formation (D0-D3, Xfer, GD), mapping and issue (MP, ISS, RF), then the BR, LD/ST (EA, DC, Fmt), FX (EX), and FP (F6 stages) pipes, each ending in writeback (WB), transfer (Xfer), and the common group completion (CP) stage; branch redirects and interrupts/flushes feed back to instruction fetch.]

Figure 2-3 The execution pipeline

2.3.2 Instruction fetch, group formation, and dispatch
The instructions that make up a program are read in from storage and are executed by the processor. During each cycle, up to eight instructions may be fetched from cache according to the address in the instruction fetch address register (IFAR), and the fetched instructions are scanned for branches (corresponding to the IF, IC, and BP stages in Figure 2-3).

Since instructions may be executed out of order, it is necessary to keep track of the program order of all instructions in-flight. In the POWER4 microprocessor, instructions are tracked in groups of one to five instructions rather than as individual instructions. Groups are formed in the pipeline stages D0, D1, D2, and D3. This requires breaking some of the more complex PowerPC instructions down into two or more simpler instructions. Instructions that are broken into two internal instructions are named cracked instructions. Instructions that are broken into three or more internal instructions are named millicoded instructions.

Common instructions that are cracked are:
  All load/store-update forms (cracked into load/store + addi for GPR update)
  X-form fixed-point stores (cracked into an add + store)
  Load algebraic (cracked into load + sign extend)
  CR-logicals, except for destructive forms (RT=RB)

Common instructions that are millicoded are:
  lmw, lswi (all multiples and string load instructions)
  mtcrf (move to condition register fields, more than one target field)
  mtxer and mfxer

Groups are formed that contain up to five internal instructions, each occupying an internal instruction slot (numbered 0 through 4) of a dispatch group. After a group is assembled, it is readied for dispatch, which is the process of sending the instructions as a group to the issue queues. As part of the dispatching operation, internal group instruction dependencies are determined and internal resources such as issue queue slots, rename registers, reorder queues, and mappers are assigned (GD and MP stages). Groups are dispatched and tracked using the 20-entry global completion table (GCT) in program order at a rate of up to one per cycle. (See the discussion on completion in Section 2.3.10, "Group completion" on page 16.)

Each internal instruction slot in a group feeds separate issue queues for the floating-point units, the branch execution unit, the CR execution unit, the logical CR execution unit, the fixed-point execution units, and the load/store execution units. The fixed-point and load/store execution units share common issue queues. Table 2-1 summarizes the depth of each issue queue and the number of queues available for each type of queue. For the floating-point issue queues and the common issue queues for the fixed-point and load/store units, the issue queues fed from slots 0 and 3 of the instruction group hold instructions to be executed in one of the execution units, while the issue queues fed from slots 1 and 2 of the group feed the other execution unit. The CR execution unit draws its instructions from the CR logical issue queue, fed from instruction slots 0 and 1.
Table 2-1 Issue queues

Queue type                         Entries per queue   Number of queues
Fixed point and load-store units   9                   4
Floating point                     5                   4
Branch execution                   12                  1
CR logical                         5                   2



During the issue stage (ISS), instructions that are ready to execute are pulled out of the issue queues and enter the register file access stage, where they access their source operands from registers. If more than two instructions from a particular queue are ready to execute, the issue logic attempts to issue the oldest instruction.

2.3.3 Instruction execution, speculation, rename resources
The speculative-execution design of the POWER4 microprocessor can execute instructions before it is certain that those instructions will be required to be executed. Speculative execution can significantly enhance performance by potentially eliminating stalls associated with waiting for a condition associated with a branch to be resolved.

It is important to understand the distinction between instructions that finish execution and instructions that complete. The term completion carries an important architectural meaning: completion makes results available to a program. An instruction or group of instructions may have been speculatively executed by the hardware, but unless they complete, their results are not visible to the program. (This, however, does not hold up speculative execution of instructions dependent on these results.)

A superscalar speculative-execution design requires an orderly way to manage machine resources and to flush instructions, along with affected registers, when predictions are found to be incorrect. The POWER4 microprocessor uses physical resources called rename registers throughout its design that are critical to this capability. Rename registers are assigned to instructions during the mapping (MP) stage and are typically released when the next instruction writing to the same logical resource (for example, the same architected general-purpose register) is completed. At any point in time, a rename register may represent an architected register or a target buffer register. In the latter case, the register will be reclassified as an architected register upon successful completion of the instruction, or released for reuse if the instruction is flushed. For each type of PowerPC register group, Table 2-2 lists the number of architected registers in the PowerPC specification and the corresponding number of physical (rename) registers in the POWER4 microprocessor.
Table 2-2 Rename resources

Resource type                                         Architected (PowerPC)   Physical
General-Purpose Register (GPR)                        32                      80
Floating-Point Register (FPR)                         32                      72
Condition Register (CR)                               eight 4-bit fields      32
Link/Count Register (LCR)                             2                       16
Floating-Point Status and Control Register (FPSCR)    1                       20
Fixed-Point Exception Register (XER)                  four fields             24

2.3.4 Branch prediction
If an instruction sequence contains a conditional branch instruction, the conditional test associated with that branch directs the flow of execution, either to take the branch or to continue execution at the next sequential instruction. In the POWER4 microprocessor, all such conditional branches are predicted, and instructions are fetched and executed speculatively based upon that prediction. Instruction streams are scanned for branch instructions, and upon encountering a conditional branch, a prediction is made as to the outcome of its conditional test. This prediction is used to direct the fetching of instructions beyond the branch. If the prediction is correct, processing simply continues and the branch instruction completes normally. If, however, the prediction is incorrect, the instructions corresponding to the incorrect prediction are flushed and instruction fetching is redirected down the correct path, incurring a performance penalty of at least 12 cycles.

To make accurate predictions about the outcome of a conditional branch instruction, the POWER4 microprocessor tracks two different prediction methodologies simultaneously, and also tracks which method is predicting a particular branch more effectively, so that it may use the more successful prediction method for a given branch. The first employs a traditional branch history table, each entry of which corresponds to whether a given branch was taken or not taken. The second method attempts to predict the direction of a branch by using information about the path of execution that was taken to get to that branch. Both methods use a 16 K-entry table to hold their 1-bit prediction per branch, and a third 16 K-entry table holds the 1-bit selector indicating the preferred predictor for that branch. This combination of branch prediction methods produces very accurate predictions across a wide range of workload types. As branch instructions are executed and resolved, the branch history tables and the other predictors are updated to reflect the latest and most accurate information.

If the first branch encountered in a particular cycle is predicted as not taken and a second branch is found in the same cycle, the POWER4 processor predicts and acts on the second branch in the same cycle. In this case, the machine will register both branches as predicted, for subsequent resolution at branch execution, and will redirect the instruction fetching based on the second branch.

12

POWER4 Processor Introduction and Tuning Guide


Dynamic branch prediction can be overridden by hint bits in the branch instructions. This is useful in cases where knowledge at the application level exists that can result in better predictions than the execution-time hardware prediction methods. It is accomplished by setting two previously reserved bits in conditional branch instructions, one to indicate a software override and the other to predict the direction. When these two bits are zero, the hardware branch prediction previously described is used. Since only reserved bits are used for this purpose, 100 percent binary compatibility with earlier software is maintained.

The POWER4 processor also has target address prediction logic for predicting the target of branch-to-link and branch-to-count instructions, which often have repeating and therefore predictable targets.

2.3.5 Translation buffers (TLB, SLB, I- and D-ERAT)
The PowerPC Architecture specifies a virtual storage model for applications, in which each program's effective address (EA) space is a subset of a larger virtual address (VA) space that is managed by the operating system (see Section 3.3.1, "POWER4 virtual memory architecture overview" on page 54). Virtual addresses are, in turn, translated into real (physical) storage locations. Each POWER4 processor has three types of buffer caches to speed this process of translation: a translation look-aside buffer (TLB), a segment look-aside buffer (SLB), and an effective-to-real address table (ERAT).

The SLB is a 64-entry, fully associative buffer for caching the most recent segment table entries (STEs). The TLB is a 1024-entry, four-way set-associative buffer for caching the most recent page table entries (PTEs). These page table entries may represent either the standard 4 KB page or a 16 MB large page. The POWER4 microprocessor also has separate ERATs for instructions (I-ERAT) and for data (D-ERAT), both of which are 128-entry, two-way set-associative arrays. The ERATs hold the most recent {EA,RA} pairs to facilitate the high-frequency, high-bandwidth design of the POWER4 microprocessor. Both ERATs are indexed using the effective address and require 10 cycles to reload from the TLB, assuming that the page's EA-to-RA translation exists in the TLB. ERAT entries are always maintained on a 4 KB page basis.
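To put these sizes in perspective (a worked example based on the figures above, not a quoted hardware limit): with 4 KB pages, the 1024-entry TLB can map at most 1024 x 4 KB = 4 MB of address space at a time, and each 128-entry ERAT covers at most 512 KB, so programs that touch data spread over a wider range of addresses will incur ERAT and TLB reloads. With 16 MB large pages, the same 1024 TLB entries can map up to 16 GB, which is one motivation for the large page support discussed in Section 3.3.2, "Small and large page sizes".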

2.3.6 Load instruction processing
Load instructions execute in the LD/ST pipeline shown in Figure 2-3 on page 9. After a load instruction issues, it must generate the effective address of the operand being loaded or stored, using the contents of the general-purpose registers specified along with the instruction. The RA stage is the cycle in which the registers are accessed, and the EA cycle is the address generation stage, also called AGEN.



To keep track of hazards associated with loads and stores executing out of order with respect to each other, two 32-entry queues exist: the load reorder queue (LRQ) and the store reorder queue (SRQ). All loads and stores are allocated an entry in the respective queue at dispatch, and loads and stores are checked against the entries in these queues to ensure that program correctness is maintained.

The cycle following the AGEN cycle is the DC cycle, in which the real address from the D-ERAT is obtained and the data cache is accessed for the appropriate cache line. If the DC cycle is successful, the data is formatted and written into a register, and it is ready for use by a dependent instruction. If a D-ERAT miss occurs, the instruction is rejected, but it is kept in the issue queue. Meanwhile, a request is made to the TLB to reload the D-ERAT with the address translation information. The rejected instruction is then re-issued a minimum of 7 cycles after it was first issued. If the D-ERAT still does not contain the translation information, the instruction is again rejected. This process continues until the D-ERAT is reloaded.

In the case of loads, hits in the L1 data cache result in the requested bytes being formatted and written into the appropriate register. In the event of a cache miss, a request is initiated to the L2 cache to retrieve the line. Requests to the L2 cache are stored in the load miss queue (LMQ), which acts as a repository for all outstanding L1 cache line misses. The LMQ can hold up to eight requests to the L2 cache; hence each POWER4 microprocessor is capable of managing up to eight data cache line requests to the L2 cache (and beyond) at any given time, providing an effective mechanism for reducing the average latency of cache line reloads. If the LMQ is full, the load instruction that missed in the data cache is rejected and is re-issued again in a minimum of 7 cycles. If there is already a request to the L2 cache for the same line from another load instruction, the second request is merged into the same LMQ entry. If a third request to the same line occurs, the load instruction is rejected and processing continues as above. All reloads from the L2 cache check the LMQ to see if there is an outstanding request yet to be honored against a just-returned line. If there is, the requested bytes are forwarded to the register to complete the execution of the load instruction. After the line has been reloaded, the LMQ entry is released for reuse.

2.3.7 Store instruction processing
Store instructions are assigned an entry in the SRQ during the issue stage for tracking by the real address of the stored data. The store data queue (SDQ) has 32 double-word entries and receives the data being stored in an entry corresponding to the address entry in the SRQ. A store is removed from the SRQ and SDQ, and its data is written to the L2, once the store has completed and all older stores have been successfully sent to the L2.

Loads that are to the same address as a previous store may be forwarded directly from the SDQ to the target register of the load, provided the data for the load is completely contained within the store operand and the data has not yet been written to the cache.

The L1 data cache is a store-through design: all data stored to cache lines that exist in the L1 data cache is also sent to the L2 cache, to ensure that modifications to lines in the L1 cache are always reflected in the corresponding lines in the L2 cache. If data is stored to a cache line that is not found in the L1 data cache, the data is simply transferred straight through to the L2 cache without establishing the cache line in the L1 data cache. All data contained in the L1 data cache is guaranteed to be in the L2 cache. If the L2 needs to cast out data that is contained in the L1 data cache, that line is invalidated in the L1 data cache.

Stores can be sent to the L2 cache at a maximum rate of one store per cycle. Store data is directed to the proper L2 controller (through a hashing function) by way of the storage slice queue (SSQ) and the L2 store queue (STQ). Steady-state store performance is described in detail in Section 3.1.7, "Selected fundamental kernel performance within on-chip cache" on page 49.

2.3.8 Fixed-point execution pipeline
The pipeline for the two fixed-point execution units (FXUs) is shown as the FX pipe in Figure 2-3 on page 9. Both units are capable of basic arithmetic, logical, and shifting operations, and both units are capable of fixed-point multiplies (non-pipelined). One of the FXUs is capable of fixed-point divides, and the other can handle special-purpose register (SPR) operations.

2.3.9 Floating-point execution pipeline
The POWER4 microprocessor contains two symmetrical floating-point execution units, each of which implements a fused multiply/add pipeline with single-cycle throughput, conforming to the PowerPC microarchitecture. All floating-point instructions pass through both the multiply stage and the add stage. For floating-point multiplies, 0 is used as the add operand, and for floating-point adds, 1 is used as the multiplicand. Each floating-point execution unit supports single-cycle throughput and six-cycle data forwarding for dependent instructions. The floating-point operations square root (fsqrt and fsqrts) and divide (fdiv and fdivs) are not pipelined. Each pipeline can execute these operations with the assistance of additional logic that handles their numerical algorithms. The performance of these and other floating-point operations is highlighted in Section 3.1.7, "Selected fundamental kernel performance within on-chip cache" on page 49.

The POWER4 microprocessor also implements the optional PowerPC instructions fres (floating-point reciprocal estimate) and frsqrte (floating-point reciprocal square-root estimate), as well as fsel (floating-point select). The last instruction provides for a conditional floating-point assignment operation without branching, which eliminates the chance of incurring a performance penalty for a mispredicted branch.

2.3.10 Group completion
Results are written to registers or to cache/memory when the group completes. Completion carries an important architectural meaning: completion makes results available to a program through architected resources (such as floating-point registers). An instruction or group of instructions may have been executed speculatively by the hardware, but does not complete unless all conditions associated with its execution have been successfully resolved. A group can complete when all older groups have completed and when all instructions in the group have finished execution free of exceptions. One group can complete per cycle, which matches the rate at which groups can be dispatched.

2.4 Storage hierarchy
The POWER4 system storage hierarchy consists of three levels of cache and the memory subsystem. The L1 caches and the L2 cache are physically on the POWER4 chip. The directory for the L3 cache is also on the chip, but the L3 cache itself is on separate chips. Table 2-3 summarizes the capacities and organization of the various levels of cache.
Table 2-3 Storage hierarchy organization and size

  Component              Organization                              Capacity
  L1 instruction cache   Direct map, 128-byte line                 128 KB per chip (64 KB per processor)
  L1 data cache          Two-way, 128-byte line                    64 KB per chip (32 KB per processor)
  L2 cache               Four-way to eight-way, 128-byte line      1440 KB per chip (1.41 MB)
  L3 cache               Eight-way, 512-byte lines, managed as     128 MB per MCM
                         four 128-byte sectors

2.4.1 L1 instruction cache
Each POWER4 microprocessor has an L1 instruction cache that is a 64 KB direct-mapped cache, capable of either one 32-byte read or one 32-byte write each cycle. It is indexed by the effective address of the instruction cache line.

2.4.2 L1 data cache
Each POWER4 microprocessor contains an L1 data cache that is 32 KB in size, two-way set associative, with a first-in-first-out (FIFO) replacement policy. It is capable of two eight-byte reads and one eight-byte write per cycle (it is effectively triple ported). When the cache line containing the operand of a load instruction is not in the L1 data cache, the processor requests a cache line reload, which retrieves the line from the memory subsystem and places it in the L1 data cache across a reload interface to the CIU that is 32 bytes wide. Since the maximum the processor can demand per cycle from the register file is two doubleword loads (that is, 16 bytes/cycle), this reload rate is twice the rate at which the processor itself can demand data.

The L1 data cache implements a store-through design, which means that any updates to data in the L1 data cache are immediately stored through to the L2 cache to keep it synchronized with the L1 data cache. If the operand of a store instruction is not found in any of the cache lines currently resident in the L1 data cache (an L1 store miss), the data in the source register of the store instruction is stored through to the L2 cache, and the cache line is not established or reloaded into the L1. The data to be stored passes through various queues in the processor (the store data queue), the CIU (the slice store queue), and the L2 cache (the L2 store queue) before it actually gets stored into the L2 cache. These queues act as buffers for stored data, which allows the store instruction itself to complete and facilitates the optimization of store performance through a technique named store gathering.

2.4.3 L2 cache
Each POWER4 chip has an L2 cache that is supervised by three L2 controllers, each of which manages 480 KB, for a total L2 size of 1440 KB. Cache lines are hashed across the three controllers. Cache line replacement is implemented as a binary-tree pseudo-LRU algorithm. The L2 cache is a unified cache: it caches instructions, data, and page table entries. The L2 cache is also shared by the processors on the chip. For HPC features of the pSeries Model 690, there is only one processor per chip, and thus the L2 cache is entirely owned by that processor.

Memory coherency in the system is enforced primarily at the L2 cache level by the L2 cache controllers. Each L2 cache has associated command queues, known as coherency processors. Snoop processors within each controller observe all transactions in the system and respond accordingly, providing responses or delivering cache lines if the situation merits.

2.4.4 L3 cache
The L3 cache is eight-way set-associative, organized in 512-byte blocks, but with coherence still maintained on the system cache line size of 128 bytes. POWER4 chips are connected to memory through an L3 cache (see Figure 2-4). Generally, it caches data that comes from the memory port to which it is attached. An exception to this is when the cache line has been sent from a remote MCM, in which case an attempt is made to cache the line in an L3 cache on the requesting module. The L3 cache is designed to be combined with other L3 caches on the same processor module in pairs or quadruplets to create a larger, address-interleaved L3 cache of 64 MB or 128 MB. Combining L3 caches into groups not only increases the L3 cache size, but also increases the L3 bandwidth available to any processor. When combined into groups, L3 caches and the memory behind them are interleaved on a 512-byte granularity.

2.4.5 Interconnecting chips to form larger SMPs
The basic building block for a pSeries 690 is a multi-chip module (MCM) with four POWER4 chips forming an 8-way SMP, as shown in Figure 2-4. Multiple MCMs can then be interconnected to form 16-, 24-, and 32-way SMPs.
Figure 2-4 A logical view of the interconnection buses within an MCM

The logical interconnection of four POWER4 chips is point-to-point, with uni-directional buses connecting each pair of chips to form an 8-way SMP with an all-to-all interconnection topology. The fabric controller on each chip monitors (that is, snoops) all buses and writes to its own bus, arbitrating between the L2 cache, I/O controller, and the L3 controller for the bus. Requests for data from an L3 cache are snooped by each fabric controller to determine if it has the data being requested in its L2 cache (in a suitable state), in its L3 cache, or in the memory attached to its L3 cache. If any one of these is true, then that controller returns the requested data to the requesting chip on its bus. The fabric controller that generated the request then sees the response on that bus and accepts the data.

2.4.6 Multiple module interconnect
Figure 2-5 shows the interconnection of four MCMs to form a 32-way SMP. Up to four MCMs can be interconnected by extending each bus from each module to its neighboring module in one direction. Inter-module buses run at half the processor frequency and are 8 bytes wide. The inter-MCM topology is that of a ring in which requests and data move from one module to another in one direction. As with the single MCM configuration, each chip always sends requests, commands, and data on its own bus but snoops all buses for requests or commands from other MCMs.

Figure 2-5 Logical view of MCM-to-MCM interconnections

2.4.7 Memory subsystem
Each POWER4 chip can optionally have a memory controller attached behind the L3 cache. Memory controllers are packaged two to a memory card and support two of the four POWER4 chips on a module (as shown by the placement of memory slots in Figure 2-6). A pSeries 690 Model 681 has two memory slots associated with each module. Zero, one, or two memory cards can be installed per module. Memory controllers can each have either one or two ports to memory. The memory controller is attached to the L3 MLD chips, with each memory controller having two 16-byte buses to the L3, one in each direction. These buses operate at one-third of the processor speed. Each port to memory has four 4-byte bidirectional buses, operating effectively at 400 MHz, connecting load/store buffers in the memory controller to four System Memory Interface (SMI) chips used to read and write data from memory. When two memory ports are available, they each work on 512-byte boundaries. The memory controller has a 64-entry read command queue, a 64-entry write command queue, and a 16-entry write cache queue.

Figure 2-6 Multiple MCM interconnection

If one memory card, or two memory cards of unequal size, are attached to a module, then the L3 caches attached to the module function as two 64 MB L3 caches.


The two L3 caches that act in concert are the L3 caches that would be in front of the memory card. (Note that one memory card is attached to two chips.)

2.4.8 Hardware data prefetch
In addition to out-of-order execution and the ability to sustain multiple outstanding cache misses, POWER4 systems provide additional hardware to hide memory latency by prefetching data cache lines from memory, L3 cache, and L2 cache transparently into the L1 data cache. The POWER4 processor can prefetch streams, which are defined as a sequence of loads from storage that reference two or more contiguous data cache lines, in order, in either an ascending or a descending pattern (the loads themselves need not be monotonically increasing or decreasing). Eight such streams per processor are supported. Hardware prefetching is triggered by data cache line misses and then paced by loads to the stream. Pacing prefetches by monitoring loads provides a consumption-driven method of timely and effective prefetching. The prefetch engine typically initiates a prefetch stream after detecting misses to two consecutive cache lines.

Figure 2-7 shows the sequence of prefetch operations in the steady state after a ramp-up phase. L1 prefetches are one cache line ahead of the cache line currently being loaded from in the program. L2 prefetches, which prefetch cache lines from the L3 cache (or memory) into the L2 cache, are five cache lines ahead, which is sufficient to hide the latency between the L3 cache and the L2 cache. Finally, L3 prefetches, which prefetch data from memory into the L3 cache, are 17 to 20 lines ahead of the current cache line being loaded from in the program. L3 prefetches are usually done as logical 512-byte lines, that is, four 128-byte lines at a time. This increases the efficiency of the transactions and need only be performed every fourth line referenced.

Figure 2-7 Hardware data prefetch operations


To begin a stream, the prefetch engine either increments or decrements the real address of a cache line miss (so that it is the address of the next or the previous cache line) and places that address in the prefetch filter queue. The decision whether to increment or decrement is based upon the offset within the line corresponding to the load operand. As new cache misses occur, if the real address of a new cache miss matches one of the guessed addresses in the filter queue, a stream has been detected. If the prefetch engine has fewer than eight streams active, the new stream is installed in the prefetch request queue and the prefetching ramp-up sequence is begun. Once placed in the prefetch request queue, a stream remains active until it is aged out. Normally a stream is aged out when the stream reaches its end and other cache misses displace its entry in the filter queue.

The hardware prefetch engine issues prefetches only within a real page, since it does not carry information about the effective-to-real address mapping. Hence, page boundaries curtail prefetching and end streams. If a prefetchable storage reference pattern crosses a page boundary, a new stream is started at the beginning of the new real page according to the startup logic described above. Since this results in a performance penalty that can be significant, POWER4 systems support, in addition to the standard 4 KB page, a 16 MB page size (concurrently with 4 KB pages). Applications that place data into 16 MB pages can significantly improve prefetching performance by essentially eliminating the penalty associated with stream re-initialization at page boundaries.

2.4.9 Memory/L3 cache command queue structure
Each L3 cache controller has eight all-purpose coherency processors and eight special-purpose coherency processors. In the desired mode, in which four L3 cache arrays operate in shared mode and therefore appear as one logical, interleaved L3 cache, there are a total of 32 all-purpose coherency processors and 32 special-purpose coherency processors. A coherency processor is busy processing a request until the operation is complete. The special-purpose coherency processors handle primarily cache line writes to memory. For many workloads, the majority of requests to the L3 cache will be read requests or data prefetch requests, and hence the performance of the all-purpose coherency processors will essentially determine the overall performance of the L3 cache and memory subsystem.


2.5 I/O structure
Figure 2-8 on page 23 shows the I/O structure in POWER4 systems. The POWER4 GX bus is attached to a Remote I/O (RIO) bridge chip. This chip transmits the data across two one-byte-wide RIO buses to PCI Host Bridge (PHB) chips.
Figure 2-8 I/O structure

Two separate PCI buses attach to PCI-PCI bridge chips that further fan the data out across multiple PCI buses. When multiple nodes are interconnected to form clusters of systems, the RIO bridge chip is replaced with a chip that connects to the switch. This provides increased bandwidth and reduced latency compared to switches attached using the PCI interface.

2.6 The POWER4 Performance Monitor
The POWER4 design includes powerful performance monitoring facilities that can collect data on various system events and provide valuable performance data. The performance monitor facilities enable the counting of up to eight concurrent events, and counting can be started and stopped and the results retrieved by software. Counters can be frozen until a user-selected trigger event occurs and then incremented, or they can be incremented until a trigger event occurs and then be frozen. The facilities also enable the monitoring of classes of instructions selected by the instruction matching facility, or the random selection of instructions for detailed monitoring, as well as the counting of start/stop event pairs that exceed a selected time-out threshold.


AIX 5L contains application program interface (API) code for customer use in enabling and using the performance monitor facilities from their applications. Use of the POWER4 performance monitor API is discussed in Section 5.3, "The performance monitor" on page 101.


Chapter 3. POWER4 system performance and tuning
This chapter provides a guide for Fortran or C programmers who have a general understanding of tuning techniques to tune their programs for POWER4. The following major topics are discussed:
- Tuning for scientific and technical numerically intensive applications
- Tuning for non-numerically intensive or commercial applications
- General system-level aspects of tuning

For more information on the general aspects of tuning, see Optimization and Tuning Guide for Fortran, C, and C++, SC09-1705.

3.1 Tuning for numerically intensive applications
Before describing specific tuning techniques, this section first reviews the tuning process and discusses those aspects of the POWER4 microarchitecture that particularly influence the performance of numerically intensive scientific and technical programs.


3.1.1 The tuning process for numerically intensive applications
For an existing program, the following steps summarize the tuning process in approximate order of importance. Taking these guidelines into account when writing a new program should significantly reduce the need for tuning at a later stage.
1. If I/O is a significant part of the program, tuning for this is an important but separate activity from computational tuning. Some guidelines for efficient I/O coding are given in Chapter 4, "Optimizing with the compilers" on page 69.
2. Use the best set of compiler optimization flags (see the example below). See Section 4.1, "POWER4-specific compiler options" on page 69.
3. Locate the hot spots in the program (profiling). This step is very important. Do not waste time tuning code that is infrequently executed.
4. Use the MASS library and ESSL (and maybe other performance-optimized libraries) when possible. These libraries are discussed in Chapter 6, "Performance libraries" on page 113.
5. Make sure that the generic common sense tuning guidance given in Chapter 4, "Optimizing with the compilers" on page 69 has been followed.
6. Hand tune the code to the POWER4 design. This will be discussed in the rest of this chapter.
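As an example for step 2, a typical invocation for a numerically intensive Fortran program might look like the following (the file and program names are illustrative; the full set of POWER4-specific options is discussed in Section 4.1):

   xlf -O3 -qarch=pwr4 -qtune=pwr4 -qhot -o myprog myprog.f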

3.1.2 Hand tuning overview for numerically intensive programs
Hand tuning for cache-based RISC architecture computers such as a pSeries 690 Model 681 is divided into two parts:
1. Avoid the negative. Tune to avoid or minimize the impact of a cache and memory subsystem that is necessarily slower than the computational units. Basic techniques for doing this include:
   - Stride minimization
   - Encouragement of data prefetch streaming
   - Avoidance of cache set associativity constraints
   - Data cache blocking

2. Exploit the positive. Tune to maximize the utilization efficiency of the computational units, in particular the floating-point units. Techniques for CPU tuning include:
   - Unrolling inner loops to increase the number of independent computations in each iteration to keep the pipelines full.

   - Unrolling outer loops to increase the ratio of computation to load and store instructions so that loop performance is limited by computation rather than data movement.

It will be assumed that the reader has a basic understanding of the concepts of:
- Loading and storing (into and from floating-point registers)
- Stride and what determines it in Fortran and C loops
- Loop unrolling

3.1.3 Key aspects of the POWER4 design
This section covers those parts of the POWER4 design that are relevant to tuning the performance of floating-point intensive applications. Additional details are provided in Chapter 2, "The POWER4 system" on page 5. The components described here are:
- The L1, L2, and L3 caches
- The ERAT and TLB
- Data prefetch streaming
- The floating-point and load/store units

The level 1, 2, and 3 caches
A brief description of the caches follows.

The L1 instruction cache
The L1 instruction cache (I-cache) is 64 KB and is direct mapped. It can be of considerable importance for commercial applications such as transaction processing. For computationally intensive applications, it does not usually have a significant impact on performance, because such applications usually consist of highly active loops (DO loops in Fortran or for loops in C) that contain relatively few instructions. The amount of data handled is usually much larger than the space taken by the instruction stream. Tuning for the I-cache consists mainly of ensuring that active loops do not contain a very large number of instructions.


The L1 data cache
Each processor has a dedicated 32 KB L1 data cache. It is two-way set associative with a first-in-first-out (FIFO) replacement algorithm. These concepts and the implications for tuning are explained fully in "Structure of the L1 data cache" on page 28.

The L2 cache
Each POWER4 chip has a dedicated L2 (data and instruction combined) cache 1440 KB in size. The pSeries 690 Model 681 and pSeries 690 Turbo have two processors per chip that share the L2 cache. The pSeries 690 HPC feature has one processor per chip, which therefore has the L2 cache to itself. Cache coherence is maintained across the entire pSeries 690 Model 681 system at the L2 level.

The L3 cache
Four POWER4 chips are combined into a multi-chip module (MCM), each of which has a 128 MB Level 3 cache. For pSeries 690 Model 681 systems with more than one MCM, the L3 caches on remote MCMs are accessible with a modest performance penalty. This applies even if the system is partitioned using LPAR. The L3 cache is eight-way set associative.

General cache considerations
The high bandwidth from L2 to L1 is more than enough to feed the floating-point units. Thus, the primary difference from a performance point of view between L1 and L2 is latency. A load/store between a floating-point register and L1 has a latency of about 4 cycles; between registers and L2 it is approximately 14 cycles. The tuning recommendation for dense (as opposed to sparse) computation is therefore to block data for the L2 cache and to structure the data access (array leading dimension, for example) for the L1. This tuning advice will be explained in subsequent sections. An application whose performance is dominated by latency (such as the pointer-chasing code described in Section 3.1.6, "Cache and memory latency measurement" on page 47) may need to be blocked for L1 for best performance.

Structure of the L1 data cache
There are two concepts, cache lines and set associativity, that are key to understanding the structure of the pSeries 690 Model 681 data cache; they are discussed in the following sections.


Cache lines
Conceptually, memory is sectioned into contiguous 128-byte lines, each one starting on a cache-line boundary whose hardware address is a multiple of 128. The cache is similarly sectioned, and all data transfer between cache and memory is in units of these lines. If, for example, a particular floating-point number is required to be copied (loaded) into a floating-point register to be used in a computation, then the whole cache line containing that number is transferred from memory to cache.

Set associativity
The L1 data cache is mapped onto memory as shown in Figure 3-1. Each column in the diagram is called a congruence class, and any particular line from memory may only be loaded into a cache line in a particular congruence class, that is, into one of only two locations. The POWER4 L1 data cache is two-way set associative with 128 congruence classes. Each cache line is 128 bytes. In total, the L1 data cache can contain 32,768 bytes of data.

Figure 3-1 The POWER4 L1 data cache

When a new line is loaded into L1, it displaces the older of the two lines in the congruence class (FIFO replacement).


The set associative structure of the cache can lead to a reduction in its effective size. Suppose successive data elements are being processed that are regularly spaced in memory (that is, with a constant stride). With the POWER4 cache, the worst case is when the stride is exactly 16 KB or a multiple of 16 KB. In this case, all elements will lie in the same congruence class and the effective cache size will be only two lines. This effect happens, to a lesser extent, with any stride that is a multiple of a power of 2 less than 16 KB.

Characteristics of the L2 cache
The size of the L2 cache is 1440 KB per POWER4 chip, and this is shared between the two processors on the chip. As with the L1 data cache, the cache line size is 128 bytes. The replacement policy is pseudo-LRU (least recently used), so frequently accessed cache lines should be readily maintained in the cache. The L2 cache is a combined data and instruction cache; instruction caching aspects of the L2 cache are not considered here.

The L2 cache is divided into three equal parts, each under control of a separate L2 cache controller. The particular portion in which a line is stored is determined from the real memory address using a hashing algorithm. Sixteen consecutive double-precision Fortran array elements (128 bytes) are held in the same cache line, and therefore under control of the same cache controller. The 17th element will be in a different cache line, and the hashing algorithm guarantees it will be stored under control of a different cache controller. This has implications for the optimization of store processing when accessing arrays sequentially.

Loads are processed by loading a cache line from L2 into the L1 data cache 32 bytes at a time. This means that the L2 cache can load the equivalent of four double-precision floating-point data elements per cycle, which is double the capability of the processor to issue load instructions. Prefetched data will be loaded into the L1 at the same rate.

The L2 cache is a store-in cache, which means that stores are always written to the L2 cache whether there is a hit in the L2 cache or not. This is in contrast to the store-through L1 data cache, where a store miss will not result in the data being written into this level. Stores are passed to the L2 cache interface 8 bytes at a time. The rate at which stores can be accepted by the interface depends on whether the stores are to the same L2 section or not. See 3.1.7, "Selected fundamental kernel performance within on-chip cache" on page 49 and 3.1.8, "Other tuning considerations" on page 51 for discussions concerning store performance.


Once a store has been accepted by the interface unit, the store instruction is released by the processor, freeing up resources. Note that store data is never written into the caches until the store has been completed, that is, made visible to the program by the processor. Completion in this sense is separate from, and later than, execution. In the case where the cache line is not already present in the L2 cache (an L2 cache miss), it must be loaded from memory, another chip's L2 cache, or the L3 cache to ensure that the L2 cache contains the latest copy of the cache line. Depending on whether the line already exists in another L2 cache on another chip, some coherency processing may be required to ensure that the local chip has permission to modify the line. Once the line is updated in the L2 cache, it is marked as dirty and will eventually be written out to memory and potentially to L3 cache.

The ERAT and TLB
The instruction stream addresses data using 64-bit effective addresses (EAs). To access the data in memory, the EA is first converted to an 80-bit virtual address (VA) and then to a 64-bit real address (RA). The translation lookaside buffer (TLB) holds 1024 entries organized in a four-way set-associative structure. It contains previously translated EA-to-RA mappings and other information on a page basis, for either 4 KB or 16 MB page sizes. For 4 KB pages, the TLB addresses a total of 4 MB of (not necessarily contiguous) memory. Data that is within a page addressed by the TLB will not incur the overhead of a TLB miss when the EA is accessed. The ERAT is effectively a cache for the TLB. It is a 256-entry, two-way set-associative array. All ERAT entries are based on 4 KB pages, even if 16 MB pages are used.

The TLB addresses a greater amount of memory (at least 4 MB) than the L2 cache (1.41 MB). Therefore, any program that is tuned to take advantage of the L2 cache is unlikely to experience serious overheads due to TLB misses (this is different from POWER3, where the TLB addressed 1 MB but L2 was 4 MB or 8 MB). It is still possible on POWER4 to construct situations involving high strides that will create a TLB miss and not a cache miss, but tuning for the TLB is beyond the scope of this document. For codes in which a blocking strategy is used, empirically determining the blocking factors will also include ERAT and TLB effects.

Prefetch data streaming
The POWER4 design provides a prefetch mechanism that can identify streams as defined in Section 2.4.8, "Hardware data prefetch" on page 21. Each POWER4 microprocessor can support up to eight independent prefetch streams. In contrast, the POWER3 processor supported four independent prefetch streams. Note that there is no prefetch on store operations.


The prefetch mechanism is based on real addresses. Therefore, whenever a real address reference crosses a page boundary, the prefetch mechanism is stopped. Two consecutive cache line misses on the subsequent page are required to restart the mechanism. Large pages are supported in AIX 5L only through shared memory segments. More general large page support will be available in a future release of AIX 5L. There can be performance benefits because the prefetch mechanism can operate over much larger arrays before crossing page boundaries.

The floating-point units and maximum GFLOPS
To achieve the maximum floating-point rate possible on a single pSeries 690 Model 681 processor, the delays due to the memory subsystem have to be eliminated and the program must reside in the L1 cache. The following key facts summarize the way the FPUs perform:
- A single pSeries 690 Model 681 processor has two FPUs (sharing a single L1 cache) that can operate independently.
- The two FPUs see only floating-point registers. There are a total of 72 physical registers. An assembler program can address 32 architected registers, and these are mapped onto the physical registers through a hardware process known as renaming. The 72 physical registers serve both FPUs. They are all 64 bits wide, and floating-point computation is carried out only with data in these registers.
- All floating-point arithmetic instructions are register-to-register operations, logically using only floating-point registers as sources and targets. Data is copied into the registers from the L1 cache (loaded) and copied back to the L2/L1 cache (stored) by two load/store units.
- For data in the L1 or L2 cache, loads or stores of floating-point double-precision (REAL*8) variables can be done by each load/store unit at the rate of one per cycle, but for loads there is a latency before the FPU can use the data for computation. This latency is approximately four cycles if the data is in L1, or 14 cycles if it is in L2 but not L1. For maximum performance, it is important that loaded data is in L1, because the compiler will assume the L1 latency.
- Single-precision (REAL*4) variables use the same register set as REAL*8. Each variable occupies an entire 64-bit register (there is no ability to pack two REAL*4s into a single register).
- The basic computational floating-point instruction is a double-precision multiply/add, with variants multiply/subtract, negative multiply/add, and negative multiply/subtract. There are also single-precision variants.


- A single add, subtract, or multiply (not divide) is done using the same hardware as a multiply/add and takes the same amount of time. A multiply/add counts as two floating-point operations. For example, a program doing only additions might run at half the MFLOPS rate of one doing alternate multiplies and adds.
- The assembler acronym for the double-precision floating-point multiply/add is FMA. This term will be used extensively as shorthand for any of the variants of this basic floating-point instruction.
- The computational part of an FMA takes six cycles. The worst case would be a sequence of wholly dependent 6-cycle FMAs (where a result of one FMA is needed by the next), in which only one of the FPUs would be active. This would run at the rate of one FMA per six cycles. A sequence of independent FMAs, however, can be pipelined, and the throughput can then approach the peak rate of two FMAs per cycle (one per FPU).
- Divides are very costly and are not pipelined.
- A fundamental aspect of RISC architecture is that the functional units can run independently. Therefore, FMAs can run in parallel with load/stores and other functions.
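As a minimal illustration of the dependence point above (the array and scalar names are arbitrary, and remainder handling for n not a multiple of 4 is omitted), the first fragment below forms a chain of dependent FMAs, while the second uses four independent partial sums that can be pipelined across both floating-point units:

   ! Dependent FMAs: each iteration needs the result of the previous one,
   ! so the loop runs at roughly one FMA per six cycles.
   s = 0.0d0
   do i = 1, n
     s = s + a(i)*b(i)
   enddo

   ! Independent FMAs: four partial sums break the dependence chain.
   s0 = 0.0d0
   s1 = 0.0d0
   s2 = 0.0d0
   s3 = 0.0d0
   do i = 1, n, 4
     s0 = s0 + a(i  )*b(i  )
     s1 = s1 + a(i+1)*b(i+1)
     s2 = s2 + a(i+2)*b(i+2)
     s3 = s3 + a(i+3)*b(i+3)
   enddo
   s = s0 + s1 + s2 + s3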

Conditions for approaching peak GFLOPS
When considering a numerically intensive loop, the following applies to the instruction stream within the loop:
- It should operate efficiently within the L1 and L2 caches.
- There should be no divides (or square roots or function calls and so on).
- To achieve peak megaflops, loops must contain FMAs only, that is, floating-point adds or subtracts paired with multiplies.
- FMAs must be independent and at least 12 in number to keep two pipes of depth six busy.
- The loop should be FMA-bound. That is, the cycles needed for instructions other than FMAs (mainly load/stores) should be less than those needed for the FMAs, so that they can be overlapped with the FMAs and effectively hidden. In principle, they could be equal to the FMA cycles, but, in practice, peak performance is approached most easily if there are fewer.


The performance of floating-point intensive applications on a 1.3 GHz POWER4 is typically between two and three times faster than on a 375 MHz POWER3, which is usually somewhat less than a comparison of clock rates alone would suggest. This is because it is more difficult to approach peak performance on POWER4 than on POWER3, owing to factors such as:
- The increased FPU pipeline depth
- The reduced L1 cache size
- The higher latency (in terms of processor cycles) of the higher-level caches

3.1.4 Tuning for the memory subsystem
There are four basic tuning techniques (some of which may be done by the compilers) that will be discussed in this section, namely:
- Stride minimization
- Encouragement of data prefetch streaming
- Structuring for L1 set associativity
- Data cache blocking

Stride minimization
Sequential accessing of data is beneficial for two reasons:
- It ensures that, once a line is loaded into cache, all other operands in the same cache line will also be referenced. If the data is accessed with a large stride, less data from the cache line will be referenced. If the stride is greater than 16 for double-precision words, or 32 for single-precision words, only one number in each line will be referenced. There will then be a high probability that, when the other numbers in the line are accessed at a later stage, the line will no longer be in cache, leading to the overhead of a cache miss.
- Sequentially accessed (forwards or backwards, stride 1 or -1) data can start one of the eight hardware prefetching streams. Other low-value strides may also start a prefetching stream, provided that they generate contiguous cache line references.


The following examples illustrate this:
Correctly tuned stride 1 sequential access:

Fortran:
      do i=1,n
        do j=1,n
          a(j,i)=a(j,i)+b(j,i)*c(j,i)   ! Left subscript same as inner loop variable
        enddo
      enddo

C:
      for(i=0;i<n;i++)
        for(j=0;j<n;j++)
          a[i][j]=a[i][j]+b[i][j]*c[i][j];  /* Right subscript same as inner loop variable */
If the nesting order of the loops is changed, the arrays are then accessed with a large stride. In this simple case, the compilers will reverse the order of the loops for you. However, it is sound coding practice not to rely on the compiler and always to code loops in the correct order. It is, of course, not always possible to code so that all arrays are accessed stride 1. For example, the following is a typical matrix multiply code fragment:
      do i=1,n
        do j=1,n
          do k=1,n
            d(i,j)=d(i,j)+a(j,k)*b(k,i)
          enddo
        enddo
      enddo

No matter how the loops are coded, one or more arrays will have non-unit stride. In this case, data cache blocking may be necessary, as described in "Data cache blocking" on page 38.

Encouragement of data prefetch streaming
Data prefetching is implemented in the POWER4 processor hardware, so that prefetching is transparent to the application: it does not require any software assistance to be effective. There are, however, situations where the performance of an application can be improved with code tuning to more fully exploit the capabilities of the hardware prefetch engine. These situations arise when:
- There are too few or too many streams in a performance-critical loop
- The length of the streams in a performance-critical loop is too short


The POWER4 data prefetch design was optimized for loops with four to eight concurrent hardware streams. Figure 3-2 on page 37 shows the performance for a series of loops with one to eight streams per loop. Note that increasing the number of streams from one to eight can improve data bandwidth out of the L3 cache and memory by up to 70 percent, and that most of the improvement comes from increasing the number of streams from one to four. The number of streams in a loop can be increased by fusing adjacent loops (a capability which the XL compilers possess with the -qhot optimization) or by midpoint bisection of the loop. Fusing simply means combining two or more compatible loops into a single loop. For example:
      DO I=1,N
        S = S + B(I) * A(I)
      ENDDO
      DO I=1,N
        R(I) = C(I) + D(I)
      ENDDO

may be combined into:
      DO I=1,N
        S = S + B(I) * A(I)
        R(I) = C(I) + D(I)
      ENDDO

Midpoint bisection of a loop doubles the number of streams but halves its vector length. Consider the standard dot-product loop:
      DO I=1,N
        S = S + B(I) * A(I)
      ENDDO

This loop contains two streams corresponding to the two arrays on the right-hand side of the expression. Midpoint bisection doubles the number of streams by starting two more streams at the halfway point of each of the arrays, as shown in the following:
      NHALF = N/2
      S0=0.D0
      S1=0.D0
      DO I=1,NHALF
        S0 = S0 + A(I)*B(I)
        S1 = S1 + A(I+NHALF)*B(I+NHALF)
      ENDDO
      IF(2*NHALF.NE.N) S0 = S0 + A(N)*B(N)
      S = S0+S1


For this example, in situations where the data is being reloaded from beyond the L2 cache, the break-even vector length is approximately 220. Loops with vector lengths beyond 220 that have been midpoint bisected as shown have superior performance, by up to 20 percent.

When a loop has more than eight streams, reducing the number of streams per loop may also boost overall performance. Since only eight of the streams can be prefetched (as there are only eight prefetch request queues), streams beyond eight will be reloaded on a demand basis. It may be possible to split the loop into two or more loops, each with eight or fewer streams, as illustrated in the sketch below. This may or may not involve introducing extra temporary vectors to allow the loop to be split. In any event, profiling or loop timing should always be done within the application to check whether the tuning, either to increase or decrease the number of streams per loop, had a positive overall effect on performance.

Increasing vector length can significantly improve performance as well, simply because there is a fixed overhead resulting from loop unrolling and prefetch stream acquisition. In some cases, increasing the vector length of an application is under direct control of the programmer, such as in applications in which explicit integration of a variable permits operations on groups of entities of arbitrary size. In these situations, there is often a trade-off between cache reuse and vector length, so it is again advisable to determine the optimal vector length empirically.
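As a minimal sketch of such a split (the array names and the point of division are illustrative only, not taken from a real application), a loop that touches eleven streams can be divided into two loops of seven and six streams by introducing a temporary vector t:

   ! Original loop: ten loaded arrays plus the stored array r give
   ! eleven streams, more than the eight the prefetch engine supports.
   do i=1,n
     r(i) = a(i)*b(i) + c(i)*d(i) + e(i)*f(i) + g(i)*h(i) + p(i)*q(i)
   enddo

   ! Split version: each loop now has eight or fewer streams, at the
   ! cost of the extra temporary vector t and its store and reload.
   do i=1,n
     t(i) = a(i)*b(i) + c(i)*d(i) + e(i)*f(i)
   enddo
   do i=1,n
     r(i) = t(i) + g(i)*h(i) + p(i)*q(i)
   enddo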

Figure 3-2 POWER4 data transfer rates for multiple prefetch streams (gigabytes per second versus working set size in bytes, for 1, 2, 4, and 8 streams)


Structuring for L1 set associativity
In cases where it is not possible to access arrays sequentially, the stride is typically determined by the leading dimension of the array. For example, consider the following loop.
      real*8 a(2048,75)
      .
      .
      do i=1,75
        a(100,i)=a(100,i)*1.15
      enddo

This updates the 100th row of a Fortran array. The row is 75 elements long, so 75 cache lines will be accessed (if this were a column, only 5 cache lines would be accessed). The L1 cache has a total of 256 lines. So, if, for example, this section of the array has been recently accessed, you might hope to find these lines in the cache.

However, the leading dimension of the array determines the stride for array A to be 2048 REAL*8 numbers, or 16384 bytes. These accesses map to a single congruence class in the L1 cache, so that only two elements of A can be held in L1. At best, only the first two lines (of the 75) accessed could possibly be in the L1 cache. Changing the leading dimension to 2064 (that is, 2048 plus a single cache line of 16 REAL*8 numbers) would cause the 75 lines to map to different congruence classes, and all 75 lines would fit. With a two-way set associative cache, a leading dimension of 2056 (2048 plus half a cache line) would also work, but 2064 works for any level of set associativity, including direct mapping. The general rule is: avoid leading dimensions that are a multiple of a high power of two. Any odd number of cache lines is ideal; that is, for 128-byte cache lines, any odd multiple of 16 for REAL*8 arrays or any odd multiple of 32 for REAL*4 arrays.
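For illustration, the example above can be padded as the rule suggests (2064 is simply 2048 plus one cache line; any sufficiently large odd multiple of 16 would do):

   ! Leading dimension padded from 2048 to 2064, an odd multiple of 16
   ! REAL*8 elements (one 128-byte cache line), so the 75 lines touched
   ! by the row update map to 75 different congruence classes.
   real*8 a(2064,75)
   integer i
   do i=1,75
     a(100,i)=a(100,i)*1.15
   enddo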

Data cache blocking
The data cache blocking idea is basic: if your arrays are too big to fit into cache, then process them in blocks that do fit into cache. Generally, with POWER4, it is the 1440 KB L2 cache that needs to be large enough to contain the block. Two factors determine how effective blocking will be:
- Whether all arrays are accessed with stride 1
- Whether each data item is used in more than one arithmetic operation


The combination of these factors produces four scenarios:
- All arrays are stride 1 and there is no data reuse. There is no benefit from blocking.

      ! Summed dot products. Note each element of A and B is used just once.
      do j=1,n
        do i=1,n
          s = s + a(i,j)*b(i,j)
        enddo
      enddo

- Some arrays are not stride 1 and there is no data reuse. Blocking will be moderately beneficial.

      ! Summed dot products with transposed array.
      do j=1,n
        do i=1,n
          s = s + a(j,i)*b(i,j)
        enddo
      enddo

- All arrays are stride 1 and there is much data reuse. Blocking will be moderately beneficial.

      ! Matrix multiply transpose.
      do i=1,n
        do j=1,n
          do k=1,n
            d(i,j)=d(i,j)+a(k,j)*b(k,i)
          enddo
        enddo
      enddo

- Some arrays are not stride 1 and there is much data reuse. Blocking will be essential.

      ! Matrix multiply.
      do i=1,n
        do j=1,n
          do k=1,n
            d(i,j)=d(i,j)+a(j,k)*b(k,i)
          enddo
        enddo
      enddo


The following example shows how matrix multiply should be blocked.
      ! 3 blocking loops
      do ii=1,n,nb
        do jj=1,n,nb
          do kk=1,n,nb
!
!           In-cache loops
            do i=ii,min(n,ii+nb-1)
              do j=jj,min(n,jj+nb-1)
                do k=kk,min(n,kk+nb-1)
                  d(i,j)=d(i,j)+a(j,k)*b(k,i)
                enddo
              enddo
            enddo
!
          enddo
        enddo
      enddo

In this example, the size of the blocks of each matrix is NB x NB elements. For blocking to be effective, it must be possible for the L2 cache to hold three such blocks. On a pSeries 690 HPC, the process will have the whole L2 cache available. On a non-HPC model, it may be sharing the L2 with another process or thread, so that only half the cache is available. The relatively complicated structure of the cache may also require NB to be smaller than a simple size calculation would suggest. In practice, the right way to fix NB is to vary it and measure the performance to find the optimum value. However, if a non-HPC machine is being used, these measurements should not be run stand-alone if, in practice, the application will be run while another application or thread is competing for the L2.

Note that, although this code leads to in-cache performance, it does not lead to maximum GFLOPS. The reason for this is explained in the next section. Blocking usually needs to be done by hand rather than leaving it to the compiler.
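A minimal sketch of how NB might be determined empirically is shown below; the subroutine blocked_mm, assumed here to contain the blocked loops shown above, and the range of NB values tried are illustrative assumptions:

   ! Time the blocked kernel for a range of candidate blocking factors.
   integer nb, count0, count1, rate
   real*8 elapsed
   do nb = 16, 256, 16
     call system_clock(count0, rate)
     call blocked_mm(n, nb, a, b, d)
     call system_clock(count1)
     elapsed = dble(count1-count0)/dble(rate)
     print *, 'NB =', nb, '  seconds =', elapsed
   enddo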

3.1.5 Tuning for the FPUs
In contrast to tuning for the memory subsystem, the compiler is generally very successful at tuning for the FPUs, and often there is little extra that can be achieved by hand tuning. Some exceptions to this are highlighted in this section.


Inner loop unrolling and instruction scheduling
To keep the FPU pipelines busy, the following conditions must apply in the inner instruction loop:
- There must be enough (at least 12) independent FMAs in the compiled loop.
- Loads must precede FMAs in the instruction stream by at least four cycles to overcome the L1 cache latency.
- The total number of architected registers used must not exceed 32. If this happens, the compiler must generate spill coding that stores the register values and reloads them later.
- The number of rename registers needed must not exhaust the hardware pool available.
- The number of loads and stores must be less than or equal to the number of FMAs, otherwise the load/store time dominates.

Techniques for dealing with the last item, load/store bound loops, are discussed in "Outer loop unrolling to increase the FMA to load/store ratio" on page 41.

The basic technique for achieving multiple independent FMAs is inner loop unrolling. While this can be done by hand, it produces convoluted coding, and usually there is no point, since the compiler will do it for you efficiently and reliably. If you unroll manually, there is a danger that the compiler will unroll again. This may cause register spilling or other overheads, and it may be beneficial to use the -qnounroll compiler flag. To help the compiler avoid register spilling, you should avoid coding too many unnecessary temporary scalar variables in the loop.

Apart from the items noted, you must rely on the compiler to produce the optimum instruction stream unless assembler language is used. This is easier than might be imagined, since advantage can be taken of the -S compiler option. This will produce a file from the Fortran with a .s suffix that may be assembled with the as command and linked into the program. Identifying the inner loop of the routine and editing it to improve the instruction stream is then quite possible for the experienced programmer without the necessity of fully learning assembler language. Nevertheless, most programmers will not choose to do this, and further advice is beyond the scope of this publication.

Outer loop unrolling to increase the FMA to load/store ratio
In cases where the inner loop is load/store bound (loads + stores greater than FMAs), it may be possible to significantly improve performance by increasing the ratio of FMAs to loads and stores in the loop. This is only possible in data re-use cases, and the basic technique is usually outer loop unrolling.


This section considers two cases: one simple loop that the compiler does not handle successfully, and then blocked matrix multiply coding.

Simple loop benefitting from hand unrolling
Consider the following loop:
      do i = 1,n
        do j = 1,n
          y(i) = y(i) + x(j)*a(j,i)
        end do
      end do

This loop is already well structured in that the inner loop both has stride 1 and is a sum-reduction (y(i) is a scalar in the inner loop). This means that the number of loads and stores needed in the inner loop is minimized, because the scalar value y(i) can be held in a single register and stored just once after the inner loop is complete. Each iteration of the inner loop needs just two loads (for x(j) and a(j,i)) and zero stores. If the loop order were reversed (with the inner loop on i), there would be two loads needed (for y(i) and a(j,i)) plus one store (for y(i)). In addition, there would be a poor stride on a(j,i).

However, the loop is load/store bound because there are more load and store instructions than FMAs. Therefore, as it stands, the performance of this loop will be limited by the effective rate at which the load/store unit can operate. The compiler will successfully unroll the inner loop on j. This is necessary in order to populate the inner loop with independent FMAs rather than dependent ones. However, this does nothing to alter the FMA to load/store ratio. The solution, in this case, is to unroll the outer loop on i. With this simple loop, the compiler may optimize the code for you with the -qhot option, but, generally, it is more reliable to do outer-loop unrolling by hand. The following code shows the loops unrolled to depth 4 (tidy-up coding omitted for cases where n is not a multiple of 4).
      do i = 1,n,4
        s0 = y(i)
        s1 = y(i+1)
        s2 = y(i+2)
        s3 = y(i+3)
        do j = 1,n
          s0 = s0 + x(j)*a(j,i)
          s1 = s1 + x(j)*a(j,i+1)
          s2 = s2 + x(j)*a(j,i+2)
          s3 = s3 + x(j)*a(j,i+3)
        enddo
        y(i)   = s0
        y(i+1) = s1
        y(i+2) = s2
        y(i+3) = s3
      enddo
Note the introduction of the temporary scalar values S0, S1, S2, and S3. This is very important because usually, whenever the inner loop contains anything more complicated than a single subscripted scalar, the compiler may not recognize that these quantities are scalars and may generate unnecessary loads and stores. Generally speaking, introducing temporary scalars to make the scalar nature of array elements clear to the compiler is good coding practice. This does not contradict the previous advice to avoid the introduction of unnecessary scalar variables; in this case, it is necessary for the compiler to recognize that y(i), y(i+1), y(i+2), and y(i+3) are scalar in the inner loop.

The load/store to FMA ratio is reduced because the element x(j) is now re-used three times in the inner loop. So, now, for four of the original iterations, there are five loads rather than eight. Clearly, as the unrolling depth increases, the load/store to FMA ratio reduces asymptotically from two to one.

The actual performance depends on the compiler optimization flags and the depth of hand-unrolling. Selected results for a 1.1 GHz machine are shown in Figure 3-3 on page 44. The label depth refers to the unrolling depth of the outer loop of the hand-tuned code. The x-axis refers to the dimension n. At n=64 the data just exceeds the size of the L1 cache. Without hand-unrolling, the compiler does not take advantage of the L1 cache. The top two lines are with different compilers, but the main reason for the difference in performance is that the top line is compiled with -qarch=pwr4 rather than pwr3. Note the "L1 cache peak" for the (top three) hand-unrolled lines as the array size is increased. The untuned code (the bottom line) does not show this peak.



Figure 3-3 Outer loop unrolling effects on matrix-vector multiply (1.1 GHz system)

M x N unrolling for matrix multiply
The following is the heart of the blocked matrix multiply code. The blocking loops have been omitted for clarity.
do i=ii,min(n,ii+nb-1)
  do j=jj,min(n,jj+nb-1)
    do k=kk,min(n,kk+nb-1)
      d(i,j)=d(i,j)+a(j,k)*b(k,i)
    enddo
  enddo
enddo

As with the previous example, having the inner loop on k (rather than i or j) minimizes the number of loads and stores. The array element d(i,j) is a scalar in the inner loop, since it does not depend on the inner loop variable k, so the inner loop is a sum reduction. The scalar may be held in a register during iteration and only stored after the inner loop is complete. The inner loop requires just two loads (for a(j,k) and b(k,i)), whereas if i or j were the inner loop variable, there would be two loads plus one store.



Let us recast the loop so as to make the scalar nature of d(i,j) explicit.
do i=ii,min(n,ii+nb-1)
  do j=jj,min(n,jj+nb-1)
    s = d(i,j)
    do k=kk,min(n,kk+nb-1)
      s = s + a(j,k)*b(k,i)
    enddo
    d(i,j) = s
  enddo
enddo

As with the previous example, introduction of the variable s is sound coding practice. Although the number of load/stores in the inner loop has been minimized, the loop is nevertheless clearly load/store bound: there are two loads and only one FMA. It can be effectively transformed into an FMA-bound loop by unrolling the outer two loops. If the outer loop is unrolled to depth m and the middle loop to depth n, then the number of loads in the inner loop is m+n and the number of FMAs is m*n. Unrolling 2x2 makes the loop balanced (load/stores = FMAs); anything more makes it FMA-bound. The following code shows 5x4 unrolling. This requires 29 architected registers (20 for holding the 20 partial sums in temporary scalar variables and 9 for holding the elements of a and b). Anything higher would exceed the number of architected registers.
do i=ii,min(n,ii+nb-1),5
  do j=jj,min(n,jj+nb-1),4
    s00 = d(i+0,j+0)
    s10 = d(i+1,j+0)
    s20 = d(i+2,j+0)
    s30 = d(i+3,j+0)
    s40 = d(i+4,j+0)
    s01 = d(i+0,j+1)
    s11 = d(i+1,j+1)
    s21 = d(i+2,j+1)
    s31 = d(i+3,j+1)
    s41 = d(i+4,j+1)
    s02 = d(i+0,j+2)
    s12 = d(i+1,j+2)
    s22 = d(i+2,j+2)
    s32 = d(i+3,j+2)
    s42 = d(i+4,j+2)
    s03 = d(i+0,j+3)
    s13 = d(i+1,j+3)
    s23 = d(i+2,j+3)
    s33 = d(i+3,j+3)
    s43 = d(i+4,j+3)
    do k=kk,min(n,kk+nb-1)
      s00 = s00 + a(j+0,k)*b(k,i+0)
      s10 = s10 + a(j+0,k)*b(k,i+1)
      s20 = s20 + a(j+0,k)*b(k,i+2)
      s30 = s30 + a(j+0,k)*b(k,i+3)
      s40 = s40 + a(j+0,k)*b(k,i+4)
      s01 = s01 + a(j+1,k)*b(k,i+0)
      s11 = s11 + a(j+1,k)*b(k,i+1)
      s21 = s21 + a(j+1,k)*b(k,i+2)
      s31 = s31 + a(j+1,k)*b(k,i+3)
      s41 = s41 + a(j+1,k)*b(k,i+4)
      s02 = s02 + a(j+2,k)*b(k,i+0)
      s12 = s12 + a(j+2,k)*b(k,i+1)
      s22 = s22 + a(j+2,k)*b(k,i+2)
      s32 = s32 + a(j+2,k)*b(k,i+3)
      s42 = s42 + a(j+2,k)*b(k,i+4)
      s03 = s03 + a(j+3,k)*b(k,i+0)
      s13 = s13 + a(j+3,k)*b(k,i+1)
      s23 = s23 + a(j+3,k)*b(k,i+2)
      s33 = s33 + a(j+3,k)*b(k,i+3)
      s43 = s43 + a(j+3,k)*b(k,i+4)
    enddo
    d(i+0,j+0) = s00
    d(i+1,j+0) = s10
    d(i+2,j+0) = s20
    d(i+3,j+0) = s30
    d(i+4,j+0) = s40
    d(i+0,j+1) = s01
    d(i+1,j+1) = s11
    d(i+2,j+1) = s21
    d(i+3,j+1) = s31
    d(i+4,j+1) = s41
    d(i+0,j+2) = s02
    d(i+1,j+2) = s12
    d(i+2,j+2) = s22
    d(i+3,j+2) = s32
    d(i+4,j+2) = s42
    d(i+0,j+3) = s03
    d(i+1,j+3) = s13
    d(i+2,j+3) = s23
    d(i+3,j+3) = s33
    d(i+4,j+3) = s43
  enddo
enddo

As with all hand unrolling, extra "tidy-up" coding is necessary where the array dimensions are not multiples of (in this case) 5 and 4. The tidy-up coding is omitted for clarity, but one possible form is sketched below.
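The exact tidy-up code is not shown in the original text. The following is a minimal sketch, assuming the unrolled loops above stop at ilast (the last i value processed by the 5-way unrolled loop) and jlast (similarly for the 4-way unrolled j loop); the names ilast and jlast are illustrative.

! Tidy-up sketch (illustrative, not from the original text): the rows and
! columns left over by the 5x4 unrolled loops are handled by the untuned
! kernel.  ilast and jlast mark where the unrolled i and j loops stopped.
do i = ilast+1, min(n,ii+nb-1)        ! leftover rows, all columns
  do j = jj, min(n,jj+nb-1)
    s = d(i,j)
    do k = kk, min(n,kk+nb-1)
      s = s + a(j,k)*b(k,i)
    enddo
    d(i,j) = s
  enddo
enddo
do i = ii, ilast                       ! remaining rows, leftover columns
  do j = jlast+1, min(n,jj+nb-1)
    s = d(i,j)
    do k = kk, min(n,kk+nb-1)
      s = s + a(j,k)*b(k,i)
    enddo
    d(i,j) = s
  enddo
enddo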



Together with blocking, this technique provides the best performance for a matrix-multiply kernel. Matrix factorization structured to use the rank-n update, an operation identical to matrix-multiply except that it updates the target matrix, is also optimized using this unrolling technique. See Section 6.1.2, "Performance examples using ESSL" on page 115 for the performance of ESSL DGEMM, which uses similar optimization techniques.

3.1.6 Cache and memory latency measurement
Most of the examples so far in this chapter have involved structured data that can usually be accessed sequentially and for which data prefetch streaming gives excellent performance, even for very large amounts of data that do not fit into the cache. Some applications, however, access data in a much more random way and, for these applications, data streaming cannot be used. The key performance factor for such an application is the latency, that is, the time before the computational units can make use of a data item. The latency is very different depending on which cache holds the data or whether it is in memory. To study this, the following loop was used:
ip1=ia(1)
do i=2,n
  ip2=ia(ip1)
  ip1=ip2
enddo

The data in the INTEGER*8 array ia was a random sequencing of the integers from 1 to N, subject to the constraint that following the pointers as shown would traverse the whole array. This ensured that each iteration was dependent on the previous one and that data streaming could not operate. As usual, the loop was iterated many times so that, if the whole of the ia array fitted into a particular cache, it would be the latency of that cache that was being measured.
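The original text does not show how ia was initialized. The following is one way such a single-cycle random permutation could be built; the routine name and the use of the random_number intrinsic are illustrative assumptions, not taken from the original.

! Illustrative sketch (not from the original text): build a random
! permutation of 1..n that forms a single cycle, so that chasing
! ia(ia(...)) visits every element exactly once.
subroutine build_chain(ia, n)
  integer n
  integer*8 ia(n)
  integer, allocatable :: perm(:)
  integer i, j, tmp
  real r
  allocate(perm(n))
  do i = 1, n
     perm(i) = i
  end do
  ! Shuffle the visiting order (Fisher-Yates).
  do i = n, 2, -1
     call random_number(r)
     j = 1 + int(r*real(i))
     tmp = perm(i)
     perm(i) = perm(j)
     perm(j) = tmp
  end do
  ! Link the shuffled positions into one cycle.
  do i = 1, n-1
     ia(perm(i)) = perm(i+1)
  end do
  ia(perm(n)) = perm(1)
  deallocate(perm)
end subroutine build_chain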



The results in Figure 3-4 have been normalized to present the latency in terms of numbers of cycles. Since a 1.3 GHz pSeries 690 HPC was used, the numbers should be divided by 1.3 to get the latency in nanoseconds. In the graph, the numbers of bytes increase uniformly on a logarithmic scale.

Figure 3-4 Latency in machine cycles to access N bytes of random data

The following conclusions can be drawn from this graph:
- Latency for the L1 cache is around 4-5 cycles. The figures increase sharply when the data exceeds about 32000 bytes, the size of the L1 cache.
- Latency for the L2 cache is around 11-14 cycles, but seems to increase to over 20 cycles as the cache becomes full at around 1500000 bytes.
- When data spills out of the L2 cache, the combined L3 cache and memory subsystem cause a fairly graceful increase in latency to a value of at least 340 cycles, corresponding to memory latency.
- It is difficult to discern the L3 cache latency separately from these figures. With a large volume of random data, some will be in L3 and some will be in memory, and this blurs the effect. If the data had been structured non-randomly to ensure that data would not be in cache unless it would all fit, the L3 cache effect might have been clearer. However, the random distribution used is probably more realistic.



3.1.7 Selected fundamental kernel performance within on-chip cache
Table 3-1 shows the measured performance of a set of fundamental loops on a pSeries 690. These measurements serve as a reference for achievable performance levels on the machine; both absolute performance, in cycles per iteration of the loop, and performance relative to a 375 MHz POWER3-II processor are given. Since the POWER4 processor has two levels of on-chip cache, results are shown for loops contained within each level: the L1 data cache and the L2 cache. The vector length, which is also the inner-loop limit, is shown for each set of data. An outer repetition loop has been used to obtain accurate timings. The inner loop is often unrolled by the compiler to minimize branch instructions, break floating-point instruction dependencies, and allow more flexibility in scheduling instructions for maximum performance. All of the loops were compiled using the -qarch=pwr4 and -O3 flags with the development version of XL Fortran Version 7.1.1 available at the time of publication.
Table 3-1 Performance of various fundamental loops

                             L1 data cache contained results        L2 cache contained results
ID  Kernel                   Vector   Cycles per  Perf. relative    Vector   Cycles per  Perf. relative
                             length   iteration   to POWER3         length   iteration   to POWER3
                                                  Model 270                              Model 270
 1  x(i)=s                     2000      1.7          2.0            40000      1.8          5.2
 2  x(i)=y(i)                  1000      1.7          2.8            20000      2.1          4.6
 3  x(i)=x(i)+s*y(i)           1000      1.7          3.0            20000      2.2          2.9
 4  x(i)=x(i)+y(i)             1000      1.7          3.0            20000      2.2          2.9
 5  s=s+x(i)                   2000      0.9          2.2            40000      1.7          3.1
 6  s=s+x(i)*y(i)              1000      1.3          3.0            20000      1.9          2.8
 7  x(i)=sqrt(y(i))            1000     18.1          2.1            20000     18.1          2.1
 8  x(i)=1.0/y(i)              1000     15.1          2.2            20000     15.1          2.2
 9  x(i)=a(i)+x(i-1)           1000      6.5          1.7            20000      6.5          1.7
10  s=s+y(i)*a(ix(i))           800      2.0          3.1            16000      2.5          3.2

1. Loop 1 has only stfd (double-precision floating-point store) instructions in the inner loop. As discussed in Section 2.3.7, "Store instruction processing" on page 14, store data is placed in the SDQ and the data then proceeds to the proper SSQ and STQ until it is finally written into the L2 array. Store performance is determined by the rate at which the STQ can be drained, and since there is an STQ per L2 cache controller, it depends on how stores are distributed across the three L2 controllers. The loop measured is a straightforward stride 1 store pattern in which the compiler has unrolled the inner loop by eight and has roughly scheduled the stores within the loop from highest address to lowest (that is, in reverse order). The performance of stores is relatively flat for vector lengths up through the size of the L2 cache because of the store-through design, which always sends updates through to the L2 cache.
2. Loop 2 is the copy loop, consisting of an lfd and an stfd per iteration. The performance of this loop is still determined by the store performance. Cache lines corresponding to load instructions are always reloaded into the L1 data cache on an L1 data cache miss; cache lines to which stores are directed are not.
3. Loop 3, commonly known as DAXPY, is load/store bound like the first two loops, but adds an fmadd. Since the vector being stored has been updated with a multiple of the other vector, the line being stored into must first be reloaded, and will then reside in the L1 data cache. Still, the modified data is stored-through to the L2 cache.

4. Loop 4 is identical to DAXPY but without the multiply: the arithmetic instruction is an fadd rather than an fmadd. It therefore has the same execution performance, but the work done is half that of DAXPY.
5. Loop 5 is the sum reduction of a vector. The compiler unrolls the loop by eight, producing eight partial sums (which are accumulated in registers), and totals the partial sums at the conclusion of the loop (a sketch of this transformation follows this list). This breaks the interdependence among the fadd operations, which would otherwise determine the performance of the loop, and the resulting performance is determined by the rate at which a single stream of floating-point loads can execute.
6. Loop 6 is commonly known as DDOT, or dot product. Just as for loop 5, the sum reduction is split into eight partial sums to remove the floating-point arithmetic interdependence. The resulting performance is determined by the rate at which two streams of floating-point loads can be completed.
7. Loop 7 shows the average performance of floating-point double-precision square root. Floating-point square-root instructions may execute on either floating-point unit, but are not pipelined. Independent work can execute in the other floating-point unit concurrently, including another floating-point square-root instruction. Since both execution units can work in parallel, and a floating-point double-precision square root normally takes 36 cycles, the average time is approximately 18 cycles.
8. Loop 8 shows the average performance of floating-point double-precision divide. Floating-point divide instructions may execute on either floating-point unit but are not pipelined. Again, two divides execute in parallel, reducing the average time to 15 cycles.
9. Loop 9 exposes the six-cycle dependent operation latency in the floating-point execution unit. The loop represents a true mathematical recurrence: each operation requires the result from the previous operation. Thus, the execution time is limited by the effective pipeline depth of six. The performance ratio relative to POWER3 is simply half the ratio of the processor frequencies, since the dependent operation latency in the POWER3 is three.
10. Loop 10 is an indirect DDOT in which one of the vectors is indirectly addressed using a vector of integer indices. This is the crux of the sparse-matrix-vector multiply. Each iteration of the loop requires the index to be loaded (as an integer), that value to be shifted so as to become a byte-oriented offset rather than a doubleword index, and the shifted result to be used to load the double-precision element of vector a. This is multiplied by the stride 1 vector y and accumulated into the scalar s. The compiler breaks the dependencies on the arithmetic by using eight partial sums, just as in DDOT. The dependent chain of load-shift-(indirect) load is carefully scheduled to avoid stalls.
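The following sketch (not from the original text) shows the kind of code the compiler generates for loop 5 when it splits the reduction into eight partial sums; remainder handling for vector lengths that are not multiples of eight is omitted.

! Eight-partial-sum form of loop 5 (s=s+x(i)), an illustrative sketch of
! the transformation described above.
s1 = 0.0d0; s2 = 0.0d0; s3 = 0.0d0; s4 = 0.0d0
s5 = 0.0d0; s6 = 0.0d0; s7 = 0.0d0; s8 = 0.0d0
do i = 1, n, 8
   s1 = s1 + x(i)
   s2 = s2 + x(i+1)
   s3 = s3 + x(i+2)
   s4 = s4 + x(i+3)
   s5 = s5 + x(i+4)
   s6 = s6 + x(i+5)
   s7 = s7 + x(i+6)
   s8 = s8 + x(i+7)
enddo
s = s1 + s2 + s3 + s4 + s5 + s6 + s7 + s8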

3.1.8 Other tuning considerations
In this section we discuss tuning for L2 cache access and the branch prediction mechanism.

Tuning for L2 cache access
Blocking for an L2 cache is discussed in "Data cache blocking" on page 38.

Improving store performance to L2 cache
Store performance can be improved with some extra effort to distribute the stores across the three L2 controllers. The following loop is a simple way to accomplish this, and will perform as much as 40 percent faster than the code given in the table for vector lengths greater than around 90.

nlim=(n/48)*48
do ii = 1,nlim,48
  do i=ii,ii+15
    x(i)=c0
    x(i+16)=c0
    x(i+32)=c0
  enddo
end do
do i=nlim+1,n
  x(i)=c0
end do
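For reference, the straightforward store loop being compared against (loop 1 of Table 3-1) is restated below; this restatement is for illustration only, with the constant written as c0 to match the distributed version above.

! Simple stride 1 store loop (loop 1 in Table 3-1) used as the baseline
! for the distributed-store version shown above.
do i = 1,n
  x(i) = c0
enddo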



3.2 Tuning non-floating point applications
In the following sections we discuss aspects of tuning that are relevant to non-numeric applications. However, you should bear in mind that many of the aspects discussed in Section 3.1, "Tuning for numerically intensive applications" on page 25 are also relevant. When tuning applications, you should determine whether to tune for throughput or for response time, depending on the type of application. Different approaches may be required in either case and, rather than studying the subject in detail here, we suggest referring to some of the books written on the subject. Once the approach has been determined, we recommend the following steps:
- If the application is CPU bound, identify the critical parts of the application code using profiling (Section 5.5, "Locating hot spots (profiling)" on page 110) and determine whether the critical code can be improved.
- If the application is paging, identify how much memory is being used and what it is being used for. Consider using tools such as vmstat and svmon (refer to the AIX commands documentation). If the memory is allocated by the application, it may be possible to adjust this using configuration files. If there is not enough system memory, you could use vmtune (see Section 3.3, "System tuning" on page 54).
- If the application is disk or I/O bound, identify the hot disks or volumes (iostat, svmon, filemon). You may need to change the way I/O is performed, for example using asynchronous I/O instead of synchronous I/O, or you may simply be able to move files from hot disks to disks that are less busy.
- If the application is network bound, investigate this with tools such as netstat, netpmon, and nfsstat. Tune network parameters with the no command.
If your application is still not performing satisfactorily, start again at the top. Chapter 4, "Optimizing with the compilers" on page 69 provides a number of suggestions for tuning code.

3.2.1 The load/store and integer units
Loads, stores, and integer operations form the majority of non-floating point instructions executed. Load/store performance is documented in Section 8.1, "Memory to memory copy" on page 155. Ultimately, the load/store performance depends on the size of the units transferred, that is, bytes, 32-bit words, or 64-bit words; using larger units may positively affect performance.



There is a small penalty in load/store performance when data items cross 32-byte and 64-byte boundaries. Where possible, data structures should be organized so that they start on doubleword or word boundaries. Note that integer divide instructions are relatively slow compared to other arithmetic instructions.
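As a rough illustration of the effect of the transfer unit size, the same amount of data can be copied with one eighth of the load and store instructions by using 64-bit units rather than bytes. This example is a sketch, not taken from the original text; the array names and sizes are arbitrary.

program copy_units
  integer, parameter :: n = 100000
  integer*8 :: qsrc(n), qdst(n)
  integer*1 :: csrc(8*n), cdst(8*n)
  integer :: i
  qsrc = 0
  csrc = 0
  ! 64-bit units: n loads and n stores move 8*n bytes.
  do i = 1, n
     qdst(i) = qsrc(i)
  enddo
  ! Byte units: 8*n loads and 8*n stores move the same 8*n bytes.
  do i = 1, 8*n
     cdst(i) = csrc(i)
  enddo
  print *, qdst(n), cdst(8*n)
end program copy_units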

3.2.2 Memory configurations
pSeries 690 Model 681 systems support four memory controllers per MCM. Physically, the memory subsystem is implemented using memory books, where each book contains two memory controllers, synchronous memory interfaces (SMIs), and DIMMs. Each controller can support up to 16 DIMMs. For a detailed description, see Chapter 2, "The POWER4 system" on page 5.

Memory is interleaved across controllers. Interleaving of addresses is a function of the L3 cache controllers and the L3 cache to which the memory controllers are attached, and is implemented by the L3 cache controller on the POWER4 chip. Assuming an MCM has two equal-sized memory books attached to it, real memory is interleaved across the four memory controllers. Then physical memory on the next MCM is allocated, and so on. The operating system is responsible for mapping real memory to virtual memory.

If you have only one memory book attached to an MCM, the L3 cache configures itself as a 64 MB shared L3 connected to one memory book, plus a 64 MB shared L3 with no backing storage. Memory is interleaved across the two controllers on the book. The bus between the L3 and the book must then process twice the traffic compared to the same amount of memory spread over two books, thus reducing memory bandwidth. If an MCM has two books of different sizes installed, they operate independently, with each book being two-way interleaved.

In the current release of AIX 5L Version 5.1, pages are allocated to a process from any memory book. In a future update to AIX 5L Version 5.1, pages will be allocated from memory attached to the MCM where the process is running. This memory affinity, combined with process affinity, will provide an improvement in application performance for most classes of applications.



3.3 System tuning
In this section we discuss the aspects of the system that are relevant to application performance on the pSeries 690 Model 681. We begin with a review of the virtual memory architecture before examining large-page support. We then examine some of the system tuning parameters that can have large effects on application performance. Many of these are beyond the ability of the application programmer or user to modify directly, because they require root authority to change, but it is useful to understand their possible effects.

3.3.1 POWER4 virtual memory architecture overview
This section provides an overview of the POWER4 virtual memory architecture for those readers who may be unfamiliar with it. The architecture is an extension of the POWER3 architecture. Unlike POWER3, it provides two page sizes. The default page size is 4 KB, but the hardware and operating system provide a large page (16 MB), which can be advantageous in certain circumstances. See Section 3.3.2, "Small and large page sizes" on page 58.

Program structure
POWER programs access memory through segment-based addresses. A segment-based address is calculated using a segment register (pointing to some storage) and a segment offset. In the 32-bit environment there are 16 segment registers and each can reference a segment of up to 256 MB. Some registers are reserved to address kernel memory. Other registers can be used for several purposes. By default, segment 2 holds process data (Figure 3-5 on page 55). This includes any constants and non-stack variables, and they are allocated from the bottom of the segment upwards. Stack space is allocated from the top of the segment downwards. Programmers can prevent the stack overwriting non-stack data by limiting the size of the stack. This can be done by calling the linker (ld) with the -S option. The programmer can also use the shell ulimit command (ksh: ulimit, csh: limit) to limit the size of the stack and/or data area at run time.



(Figure shows the 32-bit segment register assignments: kernel, program code, process data, memory mapping and file mapping, shared libraries, shared library data, and kernel segments.)

Figure 3-5 32-bit environment segment register usage

Alternatively, you can call the linker specifying the -bmaxdata option. This has two effects: it specifies a maximum size for the user data area, and it causes the user data to be placed in segment 3 (and subsequent segments as required, up to a maximum of 8 segments for 32-bit programs) while the stack is placed in segment 2. 32-bit programs that need to access more than 256 MB of data can do so by using a contiguous set of segment registers. These programs need to be compiled with the -bmaxdata option. For example, using -bmaxdata:0x80000000 enables the maximum possible data space of 2 GB. In the 64-bit environment, there are effectively an unlimited number of segment registers. Note that in the 64-bit environment, -bmaxdata should not be used because it will limit the addressable data space.

Each segment consists of a number of pages (of the same size). By default, pages are 4 KB. Program virtual memory is mapped onto physical memory in units of pages. The operating system maintains a map (page table) of virtual to physical memory for each process. Entries in the map are called page table entries (PTEs).



A PTE provides information about a corresponding page frame (which can be 4 KB or 16 MB in size). Pages of both sizes can co-exist on a system, although a segment can only have pages of one size. Each PTE contains a number of status and protection bits as well as address information.

AIX Version 4.3 and AIX 5L Version 5.1 executables
Note that 32-bit executables compiled under AIX Version 4.3 will run unchanged under AIX 5L Version 5.1. Any code compiled in 64-bit mode under AIX Version 4.3 must be re-compiled before it can be used on AIX 5L Version 5.1. This means that:
- AIX Version 4.3 64-bit executables must be re-compiled from the source code to run under AIX 5L Version 5.1, not just relinked.
- AIX Version 4.3 64-bit object modules or library files cannot be linked with AIX 5L Version 5.1 object modules or library files. The AIX Version 4.3 64-bit modules must be re-compiled.
Note that you cannot link 32-bit object modules together with 64-bit object modules under either operating system release.

Address translation
Figure 3-6 gives an overview of the steps in the address translation process.

Effective Address --(lookup in SLB)--> Virtual Address --(lookup in Page Table)--> Real Address

Figure 3-6 POWER address translation



An effective address (EA) is the address of data or an instruction generated by the processor during the decode of an instruction. The EA specifies a segment register and offset information within the segment. Address translation occurs in two steps: EA to virtual address (VA), and VA to real address (RA). If the EA cannot be translated, a storage exception occurs. While there are a number of different reasons for exceptions, programmers need only concern themselves with those caused by invalid data addresses. In these cases, the operating system will send a signal to the offending process and typically terminate it.

Conversion of a 64-bit effective address to a corresponding virtual address is performed by looking up the effective segment identifier (ESID) in the Segment Lookaside Buffer (SLB). The SLB is a cache of ESIDs and corresponding virtual segment identifiers (VSIDs) maintained by the operating system and referenced by the hardware. Each SLB entry also contains a valid bit and various flags. The 80-bit virtual address is formed by concatenating the VSID with the page and byte address from the EA, as shown in Figure 3-7.

(Figure shows the 64-bit effective address split into ESID, page, and byte fields; the ESID is looked up in the Segment Lookaside Buffer (SLB) to obtain the VSID, which is concatenated with the page field to form the Virtual Page Number (VPN).)

Figure 3-7 Translation of 64-bit effective address to 80-bit virtual address



Conversion of the 80-bit virtual address to its corresponding real address is done by hardware lookup in the page table. The page table is maintained by the operating system and its base address is held in a hardware register. The virtual page number (VSID + page number) is used to construct an index into the page table. The real address for the base of the page is extracted from the page table entry.

3.3.2 Small and large page sizes
Historically, the PowerPC Architecture supported the mapping between virtual and physical memory at a granularity of 4 KB pages. POWER4 systems introduce a new PowerPC Architecture feature that provides an alternate large-page size that can be used in addition to the 4 KB base page size. The pSeries 690 Model 681 system supports a 16 MB large-page size. The implementation involves the selective use of large virtual/physical memory pages to back the process private data segment(s). A process can contain a mixture of small (4 KB) and large (16 MB) pages at a 256 MB virtual segment granularity. All pages within a 256 MB segment have the same size.

The primary benefit of large-page support is improved performance for applications that access a large amount of memory in a sequential manner or that have significant gather/scatter components (such as large, randomly accessed user data spaces). Large pages can improve performance for these applications by reducing the translation lookaside buffer (TLB) miss rate. POWER4 systems use memory data prefetching (and other techniques) to minimize memory latencies. Data prefetching starts when a new page is accessed and grows more aggressive as the page continues to be sequentially accessed. However, data prefetching must be restarted at page boundaries, so the use of large pages can improve performance by reducing the number of prefetch startups.

An update of AIX 5L Version 5.1, targeted for mid-2002, introduces a usage model that allows existing applications to use large pages without requiring source code changes or recompilation. The need for investment protection also dictates that large-page data support must not affect source or binary compatibility for existing applications, and for the kernel extensions they depend upon, if those applications do not use large pages. With the initial release of AIX 5L Version 5.1, which supports the pSeries 690 Model 681, there is already a low-level shmat/shmget interface, which will also be enhanced in future releases.



Note: At the time of writing this document, the AIX implementation of large-page support was still under development. The following description is subject to change.

Large-page data areas
Large pages will be used for the data areas of the user address space. For technical applications, these areas consist of the user heap and the main program BSS and data storage areas. These are the critical data areas for C programs, since the user heap supports malloc storage, BSS holds uninitialized program data, and data storage holds both initialized and (small) uninitialized data. These are also the critical data areas for Fortran programs, because the Fortran storage classes that require large pages reside within these areas, as follows:

Static      Static variables reside in the data storage area. Large, uninitialized static variables reside in BSS.
Common      If a common block variable is initialized, the whole block resides in the data storage area; otherwise, the whole block resides in BSS.
Controlled  This storage class is used for allocatable arrays. Controlled variables reside in the user heap.
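The following declarations illustrate these three storage classes; this sketch is not from the original text, and the routine and variable names are invented for illustration.

subroutine storage_classes
  real, save :: s(1000)        ! Static: data storage area (or BSS if large and uninitialized)
  real :: c(1000)
  common /blk/ c               ! Common: data storage area if initialized, otherwise BSS
  real, allocatable :: w(:)    ! Controlled: allocatable arrays use the user heap
  allocate(w(1000))
  w = 0.0
  s = w
  c = s
  deallocate(w)
end subroutine storage_classes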

Large pages are not required for other areas of the user address space, so 4 KB pages are used. These areas consist of the process stack, the library data storage area, mmap regions, and user text. At the cost of 16 MB of physical memory per page, large pages would provide little or no benefit to applications if used for the process stack or library data, because both typically represent small quantities of data. For a single-threaded process, the automatic and controlled automatic Fortran storage classes reside within the process stack and therefore do not use large pages; typically, the amount of data is small. Use of large pages for thread stacks within a multi-threaded process is of value and can provide benefit through larger TLB coverage. The use of large pages for thread stacks is supported through the AIX pthreads library, which places thread stacks within the user heap. It is not planned to support large pages for mmap regions. The support of large pages for user text is not relevant for technical applications.



Some technical and commercial (for example, database) applications do map and use shared memory segments within their user address spaces and can benefit from large pages. This is supported by the implementation of a large-page shmat/shmget interface. An application must be modified in order to use large pages for shared memory segments, because a special option must be specified at the time a shared memory segment is created if large pages are to be used for the segment. In a future release of AIX 5L, large pages will be allocated preferentially from physical memory that is close to the processor (or MCM) that initiated the request. This memory affinity is intended to hide the non-uniformity in latency and bandwidth (primarily the latter) of the memory subsystem. Large pages are pinned to memory (they cannot be paged out or stolen) for the entire time an application executes. The large-page memory pool is a limited system resource: a failure will occur when a large-page application tries to allocate a large page and none is available.

Large page application support
Although the 64-bit kernel is the strategic AIX kernel, large-page data support is also provided in the 32-bit kernel. A new bit flag will be maintained within the XCOFF and XCOFF64 executable file headers to record the large-page data attribute of a program. If the flag is set, the program uses large pages; otherwise, it only uses small pages. It is deemed better to fail a technical application that requests an additional large page when none is available than to have it silently execute with 4 KB pages. The ldedit command will provide the ability to set and unset the large-page flag of an executable file without the need for source code changes, recompiling, or relinking. It will also support setting maxdata and maxstack. The dump command will be modified to display the status of a program's large-page data flag. Large-page data usage is inherited over fork(); check the manual pages for details on the memory duplication scheme. Large-page data usage is not inherited over exec(). Because of different page protection requirements, the data model for 32-bit large-page data applications is slightly different from the existing 32-bit process models (default and large memory) shown in Figure 3-5 on page 55. You can inspect a program's memory layout with the help of the svmon command.



Large-page command support
An extension of the vmtune command will be provided to select, at boot time, the number of memory segments (256 MB) that will hold large pages. A system administrator has the ability to control usage of the large-page memory pool by user ID (for example, with the chuser and mkuser commands). This prevents unprivileged users from causing privileged large-page data applications to fail by running out of large pages. At this time, there is no Workload Manager (WLM) support provided to manage large-page physical memory or large-page applications. Large pages are neither pageable nor swappable; they are essentially pinned pages that are treated as unmanaged resources by the WLM. The commands ps, vmstat, and svmon have been extended to report on large-page usage.

Large-page performance observations
General large-page support for Fortran and C applications is not available in AIX 5L Version 5.1. It is expected that when large-page support is available, uniprocessor performance for memory-bound kernels such as DAXPY will increase significantly. This is primarily due to the increased efficiency of data prefetching on long vectors in large pages (see Section 2.3.8, "Fixed-point execution pipeline" on page 15).

3.3.3 AIX system parameters
System factors that can influence application performance include:

Hardware configuration
- CPU configuration
  The speed, the number of CPUs, and the particular type of pSeries 690 Model 681 processor module installed are, of course, extremely important to application performance. In the POWER4 or POWER4 Turbo modules, the L2 caches are each shared between two CPUs. In the POWER4 HPC modules, the L2 caches are not shared. However, these are factors that cannot be adjusted or tuned for a given machine configuration, but are hardware characteristics of which the programmer should be aware.



- Memory configuration
  Similarly, the memory configuration is also important. As described in Section 3.2.2, "Memory configurations" on page 53, the particular memory configuration of a machine determines the overall memory bandwidth available to the CPUs. This is not a factor that can be adjusted for a given machine configuration, but the programmer should be aware of the possible effects.
- Storage configuration
  A detailed discussion of the possible storage configurations that can be attached to the pSeries 690 Model 681 is outside the scope of this book. However, for I/O dependent applications, the underlying storage configuration can have a very significant effect on application performance. These configurations can include General Parallel File System (GPFS) or AIX Journaled File System (JFS) configurations spread across multiple physical disks, which could include SSA, various types of SCSI disk, or fiber-attached disks. For this reason, awareness of the target storage configuration and the available memory configuration may favor programming choices that trade memory use for I/O.

Software configuration
- Paging space configuration
  In the scientific and technical computing domain, it is common for the application mix on a machine to be selected and controlled so as to avoid paging. With the advent of AIX Version 4.3.2, the paging allocation algorithm only allocates space in paging space when it is necessary to free up a page in memory. This means that for a system that is under no pressure for real memory pages, the paging space utilization will be very small; the lsps -a command might show one percent utilization. For this reason, and in order to save disk space, it is becoming common to configure paging space that is somewhat smaller than real memory. Large memory systems may be running applications that consume large amounts of memory. If so, it is important to consider the effect if multiples of these jobs are ever started such that memory becomes overcommitted, possibly exhausting paging space. A control mechanism, for example a job scheduling system such as LoadLeveler or AIX Workload Manager, should be considered. Alternatively, the paging space should be made large enough to accommodate such an event.



- 32- or 64-bit kernel
  In general, if the main application or applications are 64-bit applications, then it is slightly better to use the 64-bit AIX kernel; for 32-bit applications, it is slightly better to use the 32-bit kernel. However, the overhead of running 64-bit applications on the 32-bit kernel (remapping system calls to 32-bit calls and reshaping the data structures for these calls) is handled in the kernel and is small. The overhead of running 32-bit applications on the 64-bit kernel (reshaping data structures in system calls) is likewise small. Note that 64-bit applications from AIX Version 4.3.3 must be recompiled to run under AIX 5L Version 5.1, whether they will be run on the 64-bit kernel or not.
- Kernel parameters
  Certain kernel tuning parameters can have a large effect on the performance of certain applications, depending on the application's use of memory and files. The vmtune command (provided in the AIX fileset bos.adt.samples) provides a number of parameters that can be adjusted to suit particular applications:
  - Page replacement selection of file or application pages
    As demand for memory increases, the AIX Virtual Memory Manager (VMM) must occasionally reassign pages in use by programs to maintain a minimum number of free pages. The vmtune parameters minperm and maxperm set thresholds that determine the pool from which the AIX VMM page replacement algorithm will select pages to be reassigned. For the purposes of this discussion, allocated memory pages can be considered to be one of two types: file pages are pages containing data from files mapped into memory by AIX; computational pages are pages allocated to running programs. When the percentage of real memory occupied by file pages falls below the minperm value, the page replacement algorithm steals both computational and file pages. When the percentage of real memory occupied by file pages is greater than the maxperm value, the page replacement algorithm steals only file pages.



When the percentage of real memory occupied by file pages is between the minperm and maxperm values, the page replacement algorithm can steal both computational and file pages. It will normally steal only file pages, unless the repage rate for file pages is higher than that for computational pages; if so, it will steal both types of pages. The default settings for these parameters are approximately minperm=20 and maxperm=80, that is, 20 percent and 80 percent of real memory.

Consider, for example, a program that uses a lot of memory for computation, but that also writes out large files sequentially. As the program writes out to the file, more and more file pages will be created in memory. Once the number of file pages reaches 80 percent of real memory, an application's computational pages will be largely protected from being stolen by the page replacement algorithm. Below this level, an application may find that its computational pages are being stolen to make way for file pages. If the working set size of the program is larger than 20 percent of real memory (100 - maxperm), then its performance may well suffer as its computational pages are stolen to make way for file pages. The larger the program, the greater this effect. Therefore, such an application could well benefit from setting minperm and maxperm lower than their default values, and in the case of maxperm possibly much lower.

There is a third parameter, related to minperm and maxperm: strict_maxperm. This makes the maxperm setting a hard limit rather than a threshold. For example, to set the threshold below which VMM page replacement will steal computational pages to 5 percent, and the threshold above which it will steal only file pages to 20 percent, the following command would be used:
/usr/samples/kernel/vmtune -p5 -P20

To set the maxperm threshold as a hard limit using strict_maxperm:
/usr/samples/kernel/vmtune -h 1



- Memory page replacement parameters
  The minfree, maxfree, mempools, and lrubuckets parameters may need to be adjusted together to reduce memory scanning overhead in a busy system and to maintain a large enough free list to readily satisfy demands from programs allocating memory. Recommendations for tuning these parameters can be found in the AIX Performance Management Guide (product manual, available on the Web) and the AIX 5L Performance Tools Handbook, SG24-6039. The maxfree parameter should be at least maxpgahead (see the maxpgahead kernel parameter description that follows in this section) greater than minfree. It is worth experimenting with larger values for minfree and maxfree than the defaults when trying to smooth out the peaks and troughs of mixed workloads.
  Page replacement in AIX is performed by the lrud daemon. From AIX 4.3.3 onwards, this daemon is multi-threaded, and the system memory is divided into a number of pools. The number of pools is specified by the mempools parameter. On a large-memory SMP system, this allows memory scanning to be performed more efficiently than with one large memory pool. Each memory pool can be further subdivided into a number of sections called buckets. The size of these buckets is specified by the lrubuckets parameter. These buckets are scanned individually by lrud using the VMM page replacement algorithm. This involves a two-pass process in which unreferenced pages are marked in the first pass; if a free page is not found, a second pass is made and pages still marked as unreferenced are replaced. On a large, busy system with a single bucket across all of memory, this two-pass memory scan would be too great an overhead. The subdivision of memory into buckets reduces this overhead.

- I/O pacing with min_pout and max_pout
  These parameters are important in improving the performance both of single, large applications performing sequential I/O and of multiple jobs that perform I/O. min_pout and max_pout are system attributes that control I/O pacing: max_pout sets a maximum threshold for pending I/O requests per file. Above this level, an application generating large numbers of I/O requests will be put into a sleep state until the number of pending I/O requests falls to or below the min_pout value. The default settings are zero for both values, which means no checking, but this can allow a high I/O volume application to saturate the system's capabilities and seriously affect the performance of other applications. However, enabling checking with values that are too low can reduce the performance of such a high I/O volume application. Therefore, where multiple applications must share a system, one approach to setting I/O pacing would be to set these values high (several thousand) and measure the effect on all types of workload on the system. The aim should be to get these values as high as possible for maximum throughput of the high I/O volume application while not reducing the performance of other workloads. These parameters can be set with the chdev -l sys0 command, using the appropriate attribute.
- Read ahead with minpgahead and maxpgahead
  These values control the amount by which the VMM will schedule pages in advance of the current page when reading files sequentially. When sequential access is detected, the read-ahead mechanism brings in two pages, and at each confirmation the number of pages read ahead is doubled, up to maxpgahead. For applications that perform large amounts of serial I/O, it may be advantageous to set a relatively large maxpgahead value (the default value is 8). However, the underlying I/O subsystem should be taken into account when selecting this value. For example, if the file is stored on file systems striped across multiple devices, then a higher value may be appropriate than if it is stored on a single disk device.
- max_coalesce
  This parameter is an attribute of logical disk drives that sets the maximum number of bytes to be transferred to the disk by the device driver in a single operation. When using SSA RAID arrays for sequential I/O, this value should be set to the number of disks across which the data is striped multiplied by 64 KB.
- Sequential and random write-behind
  Files mapped into memory are partitioned into 16 KB clusters (four pages when using small pages). When writing sequentially, all four pages in a cluster will be modified one after another. The parameter numclust specifies the number of such clusters before the current cluster that the VMM will allow before scheduling the writing of their modified pages. By default this is set to 1, which means that modified pages from sequential files should not accumulate in memory. For randomly written files, this mechanism does not apply. There is another parameter, maxrandwrt, which sets a maximum number of modified (also known as dirty) pages for a given file.



Once this number is exceeded, the VMM will schedule these pages for writing. It should be noted that when the syncd daemon runs, these modified pages will be written to disk anyway, but these parameters can prevent the buildup of modified pages between runs of syncd to the extent that syncd running affects system performance. These parameters can be modified using the vmtune command.
- lgpg_regions, lgpg_size
  As discussed in Section 3.3.2, "Small and large page sizes" on page 58, the use of large pages for virtual memory has the potential to significantly improve the performance of certain applications. In order for an application to use large pages, large pages must have been defined to the system at system IPL. The lgpg_regions parameter specifies the number of large pages to be made available at the next reboot. The lgpg_size parameter specifies the size of these pages; for the IBM eServer pSeries 690 Model 681 POWER4 machines this would be 16 MB, specified in bytes. An example of a sequence of commands and actions to define 8 GB of large pages and make them available might be as follows:
vmtune -g 16777216 -L 512
bosboot -a
shutdown -Fr

The exact usage of the bosboot command would depend on the particular system being configured for large pages.

3.3.4 Minimizing variation in job performance
Various factors can affect the consistency of the performance of a job from run to run. These include:

System factors
- Competing jobs
  Multiple jobs running in the system simultaneously can compete for resources such as CPU, memory, and I/O bandwidth. One approach to reducing the variability introduced by running multiple jobs on the system is to carefully select jobs that require different types of resources to be run together. Once jobs have been characterized in this way, the running of the job mix can be controlled using a job scheduling system such as LoadLeveler. In practice, many large applications have requirements for all of the above types of resource. Another approach is to use the AIX Workload Manager to guarantee resources to a particular job, perhaps sharing the remaining resources between other jobs in the system. For examples of the effects of multiple jobs running on the system, see Chapter 8, "Application performance and throughput" on page 153.



  For more information on AIX Workload Manager, see AIX 5L System Management Concepts: Operating System and Devices, and AIX 5L System Management Guide: Operating System and Devices. For more information on LoadLeveler, see Using and Administering IBM LoadLeveler for AIX, SA22-7311.
- External I/O performance
  If a job is dependent on the performance of shared storage facilities that are heavily utilized at certain times, then it may experience variation in runtime. In this situation, it may be possible to trade memory use for I/O to reduce the dependency on the external I/O performance.
- System software levels
  Occasionally, different system software levels will implement different default values for certain tuning parameters. This can cause unexpected variation in job performance. Software updates and fixes should therefore be checked and tested carefully for such changes.

Application factors
- Processor binding
  In order to guarantee the sharing of L2 cache between certain threads, or to guarantee that threads are using dedicated L2 cache, you may have bound threads or processes to specific processors. Depending on the thread scheduling scope (see Section 7.1.1, "SMP runtime behavior" on page 126) and the numbers of threads and processors, this could introduce variation in runtime behavior.
- Variation in data
  The previous factors apply to variations in runtimes for the same job running with the same data. It is also the case that variations in the data input to the job can cause variability, even though the problem to be solved is the same size, and the program may take longer to converge to a solution. With parallel jobs, variations in input data may lead to hotspots where certain processors have more work to do than others, leading to an overall increase in the runtime of the job. An approach to resolving this, which is outside the scope of this book, is to implement a dynamic load balancing design in the parallelization of the program.



Chapter 4. Optimizing with the compilers
In this chapter we describe the features of the XL Fortran and C and C++ compilers that relate to optimization for the POWER4 processors. We begin with the optimization options available, with particular emphasis on those that benefit applications running on POWER4 processors. In subsequent sections, the particular techniques and considerations for improving performance using the compilers are discussed.

4.1 POWER4-specific compiler options
In this section some useful XL Fortran compiler options that can be used to improve performance are presented. We then focus on options with specific benefits on POWER4 microarchitecture machines. Finally, we make some recommendations for initial attempts at optimization. It should be noted that, when conflicting compilation options are specified on the command line, the last option wins. For example, consider the following command:
xlf -O3 -qsource -qlist -o monte -O2 monte.f



The optimization flag -O is specified twice with two different levels. In this case, optimization level two would be used because it is specified last. This also applies to those options that are implied by another option. See the description of -O4 and -O5 in the following section.

4.1.1 General performance options
In the following sections, useful Fortran, C, and C++ compiler general performance options are discussed.

XL Fortran options
The following options are provided by the XL Fortran compiler. For more details see the XL Fortran for AIX User's Guide, SC09-2866.

-O, -O2, -O3, -O4, -O5
The -O flag is the main compiler optimization flag, and can be specified with several levels of optimization. -O and -O2 are currently equivalent. At -O2, the XL Fortran compiler's optimization is highly reliable and usually improves performance, often quite dramatically. -O2 avoids optimization techniques that could alter program semantics. -O3 provides an increased level of optimization. It can result in the reordering of associative floating-point operations or of operations that may cause runtime exceptions, which could slightly alter program semantics. This can be prevented through the use of the -qstrict option together with -O3. At this optimization level, the compiler can also replace divides with reciprocal multiplies. -O3 is often used together with -qhot, -qarch, and -qtune. -O4 provides more aggressive optimization and implies the following options:
- -qhot
- -qipa
- -O3
- -qarch=auto
- -qtune=auto
- -qcache=auto
-O5 implies the same optimizations as -O4 with the addition of -qipa=level=2. In general, increasing levels of optimization require more time (sometimes considerably more time) and larger memory during the compilation. In addition, -O4 and -O5 sometimes need additional space in /tmp (or the location specified by the TMPDIR environment variable). The recommendation is to have at least 200 MB available, and potentially up to 400 MB.



-qarch, -qtune, -qcache
These options allow the compiler to take advantage of particular hardware configurations for the purposes of optimization.
- -qarch specifies the instruction set architecture of the machine, that is, which instructions the compiler will generate. Specifying certain values for this option can generate code that will not run on other machine types. For example, -qarch=pwr2 would generate code that might not run on a POWER4 machine. The -qarch=com option generates executable code that will run on any POWER or PowerPC hardware platform. However, this option also prevents the compiler from generating any of the optional PowerPC architecture instructions. In the case of POWER4, these instructions include the two floating-point square root instructions, fsqrt and fsqrts, which are likely to be important in numerically intensive applications.
- -qtune instructs the compiler to perform optimizations for the specified processor. These can include taking into account instruction scheduling and the memory hierarchy of the specified architecture. This option only has an effect when used with an optimization level of -O (or -O2) or greater.
- The -qcache option is only effective if the -qhot option is also specified, explicitly or implicitly (with, for example, -O4). This option can be used to specify the exact cache hierarchy of the machine, which can be useful if the target machine has a different cache hierarchy from the default. -qcache is designed to describe the complete cache hierarchy of the system, including the TLB. If specifying cache configurations with -qcache, the specifications should be ordered by capacity and, to be very precise, should include the ERAT, TLB, L1, L2, and L3.

At present, the compiler uses the line size of the cache for optimization, but a future level of the compiler may use capacity and miss cost more aggressively. At that time, if compiling for machines with different cache hierarchies, the most conservative specification would be the larger line size, the smaller capacity, the smaller associativity level, and the larger cost. If the program will be compiled and run on the same machine, then -qarch=auto should be used, or -qarch should be set to the specific processor. The default setting is -qarch=com in 32-bit mode and -qarch=ppc in 64-bit mode. The compiler will then automatically select default settings for -qtune and -qcache appropriate to the processor architecture selected. If the compilation machine is different from the target machine, it can be useful to specify the target architecture for -qarch and -qtune.



For example, if compiling and running on a POWER4 pSeries 690 Model 681, use -qarch=auto -qtune=auto, or -qarch=pwr4 -qtune=pwr4. If compiling on this machine but executing on an RS/6000 SP 375 MHz POWER3 High Node, then -qarch=pwr3 -qtune=pwr3 should be used. Different combinations of these two options can be used to specify the machines on which the executable will run, but produce code that is optimized for one of the target machine types.

-qhot
The -qhot option performs high-order transformations to maximize the efficiency of loops and array language. It can optionally pad arrays for more efficient cache usage and can generate calls to vector intrinsic functions such as square root and reciprocal. As with -O3, some of the transformations can slightly alter program semantics, but this can be avoided by also using -qstrict. The -qhot option is made less effective by the -C array bounds checking option, but remains active. Note that -qhot is selected by default when the -O4, -O5, or -qsmp=auto options are specified.

-qalias
The -qalias option can be used to tell the compiler about the types of aliasing that may be found in the program, where an area of storage may be referred to by more than one name. The compiler may be able to perform additional optimization with this information, for example for programs that violate parameter aliasing rules (see the discussion of -qalias in the XL Fortran User's Guide, SC09-2866). Compiling with -O2 -qalias=nostd may give better performance than using no optimization at all.

-qalign
The -qalign option specifies the alignment of data objects in storage. There are two suboptions:
- -qalign=4k, which causes certain objects over 4 KB to be aligned on 4 KB boundaries and can be useful for optimizing I/O when using data striping.
- -qalign=struct, which can specify the alignment of derived-type objects such as structures.

-qassert
The -qassert option can be used to give the compiler information about loop dependencies and iteration counts, which may allow additional optimizations.



-qcompact
The -qcompact option reduces optimizations that increase the size of the executable. For current systems with large memories this is not commonly used. However, it can be useful in the rare cases where -O3 generates code that performs worse than code generated with -O2. In these cases, -O3 -qcompact is often better.

-qpdf, -qfdpr
The -qpdf option enables profile-directed feedback. This is a two-step process where profile information from a typical run or set of runs is used for further optimization (an example command sequence is shown after this list). -qfdpr generates object files containing the necessary information for use with the AIX Feedback Directed Program Restructuring (FDPR) command.

-qipa
The -qipa option can improve basic optimization by doing analysis across procedures. This must be specified at both compile and link stages, and there are various suboptions to give the compiler more information about the characteristics of procedures within the program, and how to handle references to procedures that have not been compiled with -qipa.

-qsmp
The -qsmp option is used for shared memory parallelization of certain loops within a program. It is possible to make the compiler use the minimum optimization necessary to achieve parallelization by using -qsmp=noopt. Shared memory parallelization is covered in more detail in Section 7.1, "Shared memory parallelization" on page 126.

-qstrict, -qstrict_induction
These options prevent the compiler (options -O3, -qhot and -qipa) from performing optimizations that could alter the semantics of the program and potentially produce results that differ from unoptimized code. -qstrict_induction applies to such optimizations on loop counter variables. Both of these options can result in reduced performance.

-qunroll
The -qunroll option allows the compiler to unroll loops within a compilation unit. By default, with optimization level 2 (-O, or -O2) the compiler performs loop unrolling if analysis indicates that it will be beneficial. If such unrolling actually reduces performance for a procedure, then -qnounroll could be used to turn it off for a particular procedure. Loops where it is beneficial to unroll within this procedure could then be marked with the UNROLL compiler directive. See Section 4.2, "XL Fortran compiler directives for tuning" on page 80.
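The profile-directed feedback process referred to above is typically run as two compilations with an instrumented run in between. A possible sequence is sketched below; the file and program names are only placeholders, and the exact -qpdf suboptions supported should be checked in the XL Fortran User's Guide for the compiler level in use:

   xlf -O3 -qpdf1 -o prog prog.f        (build an instrumented executable)
   ./prog < typical_input               (run with representative data to collect the profile)
   xlf -O3 -qpdf2 -o prog prog.f        (recompile, using the collected profile for optimization)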



-Q
The -Q option allows the compiler to inline functions and procedures. That is, the compiler can move the code from the inlined program unit into the code of the calling unit and potentially achieve further optimizations by doing so. This option can also take the names of functions or procedures to be inlined, or those to be excluded from inlining.

-qlibansi, -qlibessl, -qlibposix
These options specify that any references to functions that have the same name as a library function are references to that function.
-qlibansi     ANSI C library.
-qlibessl     ESSL library. See Section 6.1, "The ESSL and Parallel ESSL libraries" on page 114.
-qlibposix    POSIX 1003.1 library.

-qnozerosize
The -qnozerosize option tells the compiler that there are no zero-sized objects in the program, which can improve performance in some programs by removing the need to check for them.

-g
The -g option is not a performance flag. It generates symbol and line number information in the object files that can be used for debugging. However, it is important to note that compiling with -g has almost no effect on performance. It does not prevent optimizations performed by the compiler.

-p, -pg
These options are used to generate monitoring information when producing runtime profiles of a program. See Section 5.5, "Locating hot spots (profiling)" on page 110 for more details.

Visual Age C and C++ options
With the following exceptions, all the options mentioned previously are also valid when used with the C and C++ compilers:
- -qhot
- -qnozerosize
- -qlibessl, -qlibposix
- -qsmp is supported by the C compiler, but is not supported by the C++ compiler. However, C++ can declare (as extern "C") and call C functions that are coded with shared memory parallelism through OpenMP pragmas.



In addition, the following performance-related options exist for these compilers:

-qalias=ansi
The -qalias=ansi option specifies the use of type-based aliasing during optimization. This is synonymous with the obsolete -qansialias, and allows the compiler to make assumptions about the types of objects accessed via pointers.

-qfold
The -qfold option evaluates constant floating-point expressions at compile time.

-qinline
The -qinline option is equivalent to the -Q option described in the Fortran options.

-qunroll=n
The -qunroll option accepts a value n, where n is the depth to which the compiler should unroll inner loops. The default value of n is four, and the maximum value is eight. This option takes effect when an optimization level of -O2 or higher is specified.

4.1.2 Options for POWER4
This section describes specific optimization actions performed by the compiler for POWER4 microarchitecture machines. Compiler options that perform specific optimizations for POWER4 microarchitecture machines are as follows: -qarch=pwr4 -qtune=pwr4 -qcache=auto or
-qcache=level=1:type=i:size=64:line=128:assoc=0:cost=13 \ -qcache=level=1:type=d:size=32:line=128:assoc=2:cost=11 \ -qcache=level=2:type=c:size=1440:line=128:assoc=8:cost=125

Note that the cost value for the L2 cache miss above is derived from an average for data misses across the various L3 caches.



4.1.3 Using XL Fortran vector-intrinsic functions
The compiler is capable of generating calls to specially optimized vector versions of intrinsic functions. These are included in the libxlopt.a library included with XL Fortran, and the standard linkage sequences for the various invocations of the Fortran compiler (for example xlf, xlf90, xlf90_r) include this library. Calls to intrinsic functions can often make up a significant percentage of the CPU usage profile. For example, one weather modelling program spends 22 percent of its time in the intrinsic functions. These calls may be generated using the -qhot compiler option and will be satisfied from the libxlopt.a library. Certain other options will prevent the generation of these calls: -qhot=novector, or -qstrict. For example, the following code outline could generate vector-intrinsic function calls when compiled with -qhot:
      do i=1,n
        c(i)=cos(a(i))
        .
        .
      end do

Vector versions of the following functions exist, with examples of the calls provided in the following list:

Cosine
cos(a(i))

Division (not strictly speaking a function, but a vector division function exists in libxlopt.a)
a(i)/b(i)

At present, although this function exists, the compiler does not generate this function call, but uses a combination of a vrec or a vsrec function call and multiply instructions.

Exponential
exp(a(i))

Natural logarithm
log(a(i))

Reciprocal
1.0/a(i)

Reciprocal square root
1.0/sqrt(a(i))



Sine
sin(a(i))

Square root
sqrt(a(i))

Tangent
tan(a(i))

There are two versions of each function, a double-precision version and a single-precision version. The compiler will generate the appropriate call. These functions are derived from the MASS library functions (see Section 6.2, "The MASS libraries" on page 117 for more information on the MASS library). Since the XL Fortran release schedule is separate from the freely available MASS library, the libxlopt.a versions may lag behind the MASS library versions. This means that any improvements in the performance of these routines, for example by algorithm changes that take advantage of the POWER4 architecture, are likely to be available in the MASS library first. Also note that the use of these functions is subject to the same considerations as the use of the MASS library functions. Examples of the speedups that can be seen with the vector-intrinsic functions are shown in Table 4-1.
Table 4-1  Vector-intrinsic function speedups

Function      Speedup (double precision)    Speedup (single precision)
cos           3.94                          3.90
div           not generated                 not generated
exp           4.60                          4.55
log           5.80                          5.74
reciprocal    1.10                          2.17
rsqrt         2.26                          6.23
sin           4.03                          3.85
sqrt          1.09                          2.17
tan           4.27                          3.79



The double-precision numbers were generated with the following program:
      program cos_test
      integer m,n
      parameter ( n = 1000 )
      parameter ( m = 10000 )
      real*8 a(n)
      real*8 b(n)
      real*8 fns
      real*8 time1,time2,rtc
      real*8 ctime
      integer i,j
      ctime=0.0d0
      call random_seed
      call random_number(a)
      call random_number(b)
      do i=1,m
        time1=rtc()
        do j=1,n
          b(j)=cos(a(j))
        end do
        time2=rtc()
        ctime=ctime+(time2-time1)
        call dummy(b,a,n)
      end do
      fns=float(m*n)
      write(6,998)ctime,fns/(ctime*1.0e6)
 998  FORMAT('Cosine: intrinsic time (s) = ',F6.2,
     &       ' cos/s = ',F8.2)
      stop
      end

For the other vector-intrinsic functions, the call to cos in the preceding example was replaced with the appropriate function or operation. The dummy subroutine does nothing, but calling it prevents the compiler from optimizing away the loop.
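If the programmer prefers to call the vector routines explicitly rather than relying on -qhot, the MASS vector library described in Section 6.2 provides equivalent entry points. The following sketch assumes the MASS vector library is available and linked (for example with -lmassv) and uses its documented vcos interface, which computes c(i)=cos(a(i)) for i=1,n in a single call; the subroutine name is purely illustrative:

      subroutine vec_cos(a,c,n)
      integer n
      real*8 a(n),c(n)
c     one library call replaces the element-by-element cos() loop
      call vcos(c,a,n)
      end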



4.1.4 Recommended options
The recommended starting compiler options for the POWER4 microarchitecture are:
-O3 -qarch=pwr4 -qtune=pwr4

If the executable must execute on POWER3 and POWER4 machines, but the performance on the POWER4 machine is most important, then -qarch=pwr3 and -qtune=pwr4 should be specified instead. The -qhot option may give significant performance benefits, at the cost of additional compile time. It is also essential for certain other optimization flags to take effect, including -qcache. If the program makes extensive use of the intrinsic functions listed in Section 4.1.3, "Using XL Fortran vector-intrinsic functions" on page 76, and the programmer does not wish to modify the code to use the MASS library functions, then there may be considerable benefit from using the vector versions from libxlopt. The -qhot option should then be used. In this case, the benefit from the vector-intrinsic functions can be determined by comparing the effect of compiling with -qhot and compiling with -qhot=novector.

Note: As described in Section 4.1.1, "General performance options" on page 70, -O4 implies -qhot.

4.1.5 Comparing C and Fortran compiler code generation
This section compares C and Fortran compiler code generation.

Numeric intensive code
The IBM Fortran and C compilers share common technology and, in particular, a common back-end optimizer. To investigate potential variations, we examined the code generated by the C and Fortran compilers for simple loops (DDOT and DAXPY). The C code was written using arrays instead of pointers for similarity with Fortran. Using only the -O2 compiler option, the Fortran compiler generates unrolled loops while the C compiler does not. The Fortran examples run faster than the C examples. Using the recommended compiler options (-O3 -qarch=pwr4 -qtune=pwr4), the code generated by the compilers was essentially the same. The execution times for the C and Fortran loops were within 1 percent variation. This is as expected since the back-end optimizer is common to both compilers.



The essential code is shown below:
ddot.f
      do j=1,iterations
        c1=0.0d0
        do i=1,array_size
          c1 = c1 + x(i) * y(i)
        end do
        call dummy(x,c1)
      end do

ddot.c
      for (j=0;j<iterations;j++) {
        c1=0.0;
        for (i=0;i<array_size;i++)
          c1 = c1 + x[i] * y[i];
        dummy(x,c1);
      }
The example for DAXPY is similar to the code sample above:
      x(i) = x(i) + c1 * y(i)
      x[i] += c1*y[i];

Non-numeric intensive code
We made a brief investigation of the impact of compiler options on non-numeric C code by compiling applications with -O3 -qarch=com and -O3 -qarch=pwr4 -qtune=pwr4 and compared execution time. In one case, we compared the performance of the UNIX utility nroff and in another we compared a string manipulation script written in Perl (where the Perl compiler was compiled with the different options). In both the nroff and Perl cases, there was no difference in execution time. We compiled the FASTA program (see Section 8.6, "FASTA genetic sequencing program" on page 168) with both -qarch=com and -qarch=pwr3 -qtune=pwr3 and re-ran the arp_arath test. Execution times of the com, pwr3, and pwr4 versions were within 1 percent variation. Since this program performs large amounts of I/O, we consider these execution times equivalent. Note that -qarch=com will ensure that the compiler uses only standard PowerPC instructions. Optional instructions such as the hardware floating-point square root and non-PowerPC instructions such as the POWER2 loadquad instruction will not be used. This can have a significant impact on performance.

4.2 XL Fortran compiler directives for tuning
A number of XL Fortran compiler directives exist that the programmer can use to improve performance without extensive modification of source code. These directives can be activated at compile time by specifying the -qdirective option with the trigger expression that has been used. These directives are discussed in the following sections.



4.2.1 Prefetch directives
Prefetch directives are directives that generate specific machine instructions for accessing memory locations. They can be used to influence the hardware prefetch mechanism so that, for example, data that will be needed later in the execution begins to be prefetched before it is actually needed. Not all of these prefetch directives have an effect on all machine architectures.

PREFETCH_BY_LOAD
This generates a load byte and zero (lbz) instruction for a memory location. It can be used to trigger prefetching for data that may be loaded or stored later. As discussed in Section 2.3.2, "Instruction fetch, group formation, and dispatch" on page 9, load misses are entered into the prefetch filter queues, and on confirmation will automatically initiate prefetching. This is not the case for store misses. By using the PREFETCH_BY_LOAD directive and specifying a data element to be stored, it is therefore possible to precede the store miss with a load miss that will be entered into the prefetch filter queue. A second prefetch directive to the next cache line in the desired direction will initiate prefetching of the cache lines where the data will be stored. This directive (and the related technique of multiplying a data element to be stored by 0.0) was quite useful on the POWER3 architecture machines. With POWER4, it is less useful with the exception of store only or initialization operations. An example of its usage follows:
      do i=1,n
!P4_bl* PREFETCH_BY_LOAD(x(i+17))
        x(i) = s
        .
        .
      end do

In this example, where x is a double precision floating-point array, the prefetch directive generates a load instruction for a data element in the next cache line beyond x(i). Subsequent iterations will issue loads for consecutive elements of x in this cache line, with a new cache line being referenced every 16 iterations. The exact offset from i used in the prefetch directive depends on the size of the loop. This directive is not always beneficial and can be detrimental to performance. The use of this directive inserts extra load instructions into the executable code that must be scheduled and completed among the other instructions, and for which the data will be loaded into L1 cache. This will not benefit the store operation, and may replace data that would otherwise be reused from L1 cache.



PREFETCH_FOR_LOAD, PREFETCH_FOR_STORE
Each of these directives generates a cache line touch instruction (dcbt and dcbtst respectively). They will cause a cache line to be loaded into L1, but will not by themselves initiate hardware prefetching. However, they are treated like load misses and generate entries in the prefetch filter queue. Subsequent directives targeting consecutive cache lines will therefore initiate prefetching. These directives have an advantage over the PREFETCH_BY_LOAD directive in that the instructions generated do not have to wait for the cache line to be loaded for completion.
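As an illustration, a store-only initialization loop could issue touch-for-store directives ahead of the line currently being written. The sketch below is hypothetical: the trigger string must match the one given on the -qdirective option, and the offset of 16 elements (one 128-byte line of REAL*8 data) would need to be tuned to the loop, as discussed for PREFETCH_BY_LOAD:

      do i=1,n
!P4_st* PREFETCH_FOR_STORE(x(i+16))
        x(i) = 0.0d0
      end do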

4.2.2 Loop-related directives
These directives are used to specify the characteristics of do loops in Fortran and instruct the compiler to perform certain optimizations in relation to do loops. They can also be used in association with automatic parallelization using the -qsmp option. A short example using several of these directives is shown after the list.

ASSERT
This directive can be used to specify likely iteration counts and dependency information between iterations (not within an iteration) for a specific do loop.

INDEPENDENT
This directive indicates that the iterations of a do loop can be performed in any order.

UNROLL(n)
This directive indicates that the compiler may unroll the following loop to depth n. If the compiler can unroll the specified loop, then it should do so. This is most useful for unrolling a particular loop in a compilation unit while preventing other loops from being unrolled with the -qnounroll compiler flag. Another use of this directive is to specify a different depth to unroll from that which the compiler would select automatically at optimization level -O2 and above.

CNCALL
This directive indicates to the compiler that no dependencies between iterations exist for procedures called by the following loop.

PERMUTATION
This directive indicates to the compiler that one or more integer arrays have no repeated values. This would be used where the integer array was being used to index another array.
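The following fragment sketches how several of these directives might be applied to an indexed update loop. The trigger string (!DIR*) is only illustrative and must match the -qdirective option, and the exact spelling of the ASSERT suboptions should be checked in the XL Fortran Language Reference:

!DIR* PERMUTATION(idx)
!DIR* ASSERT(NODEPS)
!DIR* UNROLL(4)
      do i=1,n
        y(idx(i)) = y(idx(i)) + x(i)
      end do

Because PERMUTATION states that idx contains no repeated values, the compiler is free to reorder or unroll the updates to y.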



4.2.3 Cache and other directives
In this section, the CACHE_ZERO and LIGHT_SYNC directives are discussed.

CACHE_ZERO
This directive generates a dcbz instruction that zeros a cache line without needing to first load the cache line from memory. It could be used for efficient initialization of storage, or as a mechanism for establishing cache lines that will be overwritten without the need for them to be loaded into cache first (either by a load or store miss). However, this instruction should be used with care. It modifies the whole cache line, so the programmer should make sure that only data elements that are intended to be set to zero are in this cache line and that no other processor requires access to the cache line until this operation is complete. For example, consider:
CACHE_ZERO(x(1))

This will cause the cache line containing x(1) to be set to zeros. There is no certainty that x(1) will be at the beginning of a cache line, and it could be anywhere in the cache line. It is, therefore, essential to check the location of this element with, for example:
MOD( LOC( x(1) ), 128 )
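A minimal sketch of such a check is shown below. It assumes x is a REAL*8 array, so that 16 elements make up one 128-byte line, and it only uses the directive when x(1) really does start a cache line; the trigger string is again only illustrative:

      if ( mod( loc(x(1)), 128 ) .eq. 0 ) then
c        x(1) is line-aligned: zero the whole 128-byte line with dcbz
!P4_cz*  CACHE_ZERO(x(1))
      else
c        fall back to ordinary stores
         x(1:16) = 0.0d0
      end if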

On POWER4 microarchitecture machines, this directive is likely to be of benefit only when the 128-byte line to be zeroed is in memory and not anywhere in the cache hierarchy. If the data is in L1 or L2 caches, then using this directive is likely to result in significant degradation in performance. If the data is in L3 cache, then there is likely to be a slight degradation in performance. However, when the programmer is sure that the data is not in cache, for example in an initialization near the beginning of a program, then this directive does give a performance benefit.

LIGHT_SYNC
This directive allows synchronization between multiple processors without waiting for a confirmation from each processor. This can reduce the performance impact of synchronization between processors. It generates a lightweight sync instruction, which is a special case of the sync instruction. It can be used to guarantee the ordering of loads and stores relative to a specific processor. This has some use in the pthread programming model where, for example, one thread updates a value that is then used by a second thread. The lightweight sync can be used to ensure that the second thread does not access the value until the first thread has updated it.
      Thread 1                      Thread 2
      flag=0                        do while ( flag .ne. 1 )
      .                             .
      .                             .
      .                             end do
      x=newvalue                    y=x
      lightweight sync
      flag=1
      .

The lightweight sync between the two store operations (shown in the preceding example) in Thread 1 means that these operations must be completed in order. This means that when Thread 2 polls the flag and finds that it has been set to one, the update of x must be complete and so it is safe to use this value.

4.3 The object code listing
This section reviews the process of obtaining an object code listing from the compiler. The object code listing shows the instruction sequences generated by the compiler and can assist the programmer in a number of ways:
- Understanding the impact of the compiler options used, such as the optimization and unroll flags.
- Identifying instructions that may be platform specific and therefore cause the code to fail on other platforms.
- Identifying potential problems with the compiler.
Obtaining an object code listing from a Fortran, C, or C++ compilation is simple. Invoke the compiler with the -qlist option. This will generate a file with the same prefix as the module being compiled but with an .lst extension. For example, consider the following C program:
1   #include <stdio.h>
2   main()
3   {
4      printf("Hello world.\n");
5   }

Compiling with the -qlist option generates hello.lst. This file contains several sections depending on the language and compiler:
- The options section lists the compiler options (both default and those specified on the command line) that were used for the compilation.
- The file table section lists any included source files.
- The source section (only present if -qsource is specified) lists line-numbered source code and warnings and errors from the compiler.
- The compilation epilogue section provides a summary of the number of source lines processed, errors, warnings and other messages.



The object section lists the pseudo-assembler code generated.

Note: Compiling with the -S flag will generate assembler code that can be read by the assembler, as(1), and used to generate a .o file. The pseudo-assembler in the object section is somewhat easier to read (than that generated with -S) and the listing includes line numbers that are normally invaluable in understanding the listing.

In addition to the assembler listing, the object section also contains a map showing register usage. The following is part of the object section for the hello world program:
|  2|                               PDEF    main
|  0| 000000                        PROC
|  0| 000000 mfspr  7C0802A6   1    LFLR    gr0=lr
|  0| 000004 stw    93E1FFFC   0    ST4A    #stack(gr1,-4)=gr31
|  0| 000008 stw    90010008   2    ST4A    #stack(gr1,8)=gr0
|  0| 00000C stwu   9421FFC0   0    ST4U    gr1,#stack(gr1,-64)=gr1
|  4| 000010 lwz    83E20004   1    L4A     gr31=.+CONSTANT_AREA(gr2,0)
|  4| 000014 ori    63E30000   2    LR      gr3=gr31
|  4| 000018 bl     4BFFFFE9   0    CALL    gr3=printf,1,gr3,printf",gr1,cr[01567]",gr0",gr4"-gr12",fp0"-fp13"
|  5| 00001C ori    60000000   1
|  5| 000020 addi   38600000   0    LI      gr3=0
|   |                               CL.1:
|  5| 000024 lwz    80010048   1    L4A     gr0=#stack(gr1,72)
|  5| 000028 mtspr  7C0803A6   2    LLR     lr=gr0
|  5| 00002C addi   38210040   1    AI      gr1=gr1,64
|  5| 000030 lwz    83E1FFFC   0    L4A     gr31=#stack(gr1,-4)
|  5| 000034 bclr   4E800020   2    BA      lr

The left-hand column in the example shows the corresponding source line number. Column two contains the relative instruction address and column three contains the instruction. The right-hand column contains the instruction operands. Column five is a number indicative of the number of cycles to execute the instruction. A zero means the instruction can be overlapped with previous instructions. Note, these numbers should not be used to estimate execution time from cycle times, because they do not accurately reflect the POWER4 microarchitecture.



In the hello world example, locate the five instructions on line zero (which doesn't exist in a program). These instructions set up the stack and return address for the program. Instructions for line four set up for and then call the printf function. Instructions for line five restore the link register (program return address), collapse the stack, and exit. Optimized code frequently shows more complex behavior. The optimizer will move code sequences, unroll loops, and use a number of other techniques that make it more difficult to interpret the object code listing. However, the line numbers associated with each instruction are preserved and you can identify code for given source lines without completely understanding why the compiler has generated the specific sequences. Consider the following example:
 1  #define NX 1000000
 2
 3  main()
 4  {
 5     int j;
 6     double f1;
 7     double e[NX], q[NX];
 8
 9
10     f1 = 1.5;
11
12     for (j=0;j<NX;j++)
13     {
14        q[j] = q[j] + f1*e[j];
15     }
16  }
This example was compiled with optimization for POWER4 and no loop unrolling. In this case we specified no loop unrolling to keep the object code small. The command used to compile is as follows:
xlc -O3 -qarch=pwr4 -qtune=pwr4 -qnounroll -qlist loop.c

The corresponding segment of the object list produced from this command is as follows:
|  3|                               PDEF    main
|  0| 000000                        PROC
| 12| 000000 addis  3CE0000F   1    LIU     gr7=15
|  0| 000004 addi   38600000   1    LI      gr3=0
| 10| 000008 addis  3D80FF0C   1    LIU     gr12=-244
|  0| 00000C lwz    80A20004   1    L4A     gr5=.+CONSTANT_AREA(gr2,0)
|  0| 000010 addi   398CDBC0   1    AI      gr12=gr12,-9280
|   | 000014 addi   38074240   1    AI      gr0=gr7,16960,ca"
| 10| 000018 lfs    C0650000   1    LFS     fp3=+CONSTANT_AREA(gr5,0)
|  0| 00001C mtspr  7C0903A6   1    LCTR    ctr=gr0
|  0| 000020 stwux  7C21616E   1    ST4U    gr1,#stack(gr1,gr12)=gr1
|  0| 000024 addis  3C810001   1    CFAA    gr4=32760,gr1,1
|  0| 000028 addi   38848038   1
|  0| 00002C addi   38848000   1    AI      gr4=gr4,-32768,ca"
|  0| 000030 addis  3CC1007A   1    CFAA    gr6=8026200,gr1,1
|  0| 000034 addi   38C67898   1
|  0| 000038 addi   38C699A0   1    AI      gr6=gr6,-26208,ca"
| 14| 00003C lfd    C8260008   1    LFL     fp1=q[]0(gr6,8)
| 14| 000040 lfdu   CC040008   1    LFDU    fp0,gr4=e[]0(gr4,8)
| 14| 000044 fmadd  FC03007A   1    FMA     fp0=fp0,fp3,fp1,fcr
|  0| 000048 bc     4340001C   0    BCF     ctr=CL.26,taken=0%(0,100)
|  0| 00004C ori    60000000   2
| 12|                               CL.3:
| 14| 000050 lfdu   CC240008   1    LFDU    fp1,gr4=e[]0(gr4,8)
| 14| 000054 lfd    C8460010   1    LFL     fp2=q[]0(gr6,16)
| 14| 000058 stfdu  DC060008   1    STFDU   gr6,q[]0(gr6,8)=fp0
| 14| 00005C fmadd  FC0308BA   1    FMA     fp0=fp1,fp3,fp2,fcr
|  0| 000060 bc     4320FFF0   0    BCT     ctr=CL.3,taken=100%(100,0)
|  0|                               CL.26:
| 14| 000064 stfdu  DC060008   1    STFDU   gr6,q[]0(gr6,8)=fp0
| 16| 000068 lwz    80210000   1    L4A     gr1=#stack(gr1,0)
| 16| 00006C bclr   4E800020   0    BA      lr

To locate a loop, we can look for a BCT instruction that branches back to a label and confirm this by checking line numbers. In our example, there are two BCT instructions. The relevant one is the second one with the additional hint taken=100%. Note that some instructions associated with the loop appear to be outside the loop code. This is caused by the instruction scheduling knowledge built in to the optimizer.

Before entering the loop, the loop counter is loaded using a mtspr (move to special register) instruction at address 01C and the single precision constant f1 is loaded into fp3 and converted to double (at 018). We also set up registers pointing to arrays e and q (020 through 038).

Starting at address 03C, q[j] is loaded into fp1. The lfd instruction loads a double (8 bytes) into a floating-point register. Then the lfdu instruction loads e[j] into fp0 and also updates the register pointer to e[j]. Note that the address of e[j] is not computed from j but rather by incrementing by the size of a double. A floating-point multiply/add is initiated to generate the new value for q[j] in fp0.



The following points are related to the inner loop processing:
- The lfdu loads fp1 with the next value of e[j], updating the pointer.
- The lfd loads fp2 with the next element in the array q[j]. Note carefully the offset of 16 bytes from the pointer (gr6,16) and compare this with the previous lfd, where the offset was 8 bytes (gr6,8).
- By this time, the previous FMA has completed and put its result in fp0.
- The stfdu stores the new value of q[j] and then updates the pointer. Note the stfdu also uses (gr6,8).
- We then initiate an FMA for the e[j] and q[j] just loaded.
- The bc conditional branch tests the counter and branches back to CL.3 if appropriate.
- If we did not branch, we still need to store the result of the last FMA, hence the stfdu following CL.26.
From this basic example, we can see how the compiler can optimize code by use of registers as pointers and by appropriate scheduling of possibly overlapping instructions. In production code, optimized code can appear very complex. A complete description of the instruction set can be found on the AIX Extended Documentation CD. It can also be found on the Web site:
http://www.ibm.com/servers/aix/library/techpubs.html

4.4 Basic coding practices for performance
In this section we list coding practices that can help the compiler to generate more efficient code.

4.4.1 Language-independent tips
- Do not excessively hand-optimize your code (for example, unrolling or inlining). This often confuses the compiler (and other programmers) and makes it difficult to optimize for new machines.
- Avoid unnecessary use of globals and pointers. When using them in a loop, load them into a local before the loop and store them back after.
- Avoid breaking your program into too many small functions. If you must use small functions, seriously consider using the -qipa option.
- Use register-sized integers (long in C/C++ and INTEGER*4 or INTEGER*8 in Fortran) for scalars.



- For large arrays or aggregates of integers, consider using 1- or 2-byte integers or bit fields in C or C++.
- Use the smallest floating-point precision appropriate to your program. Use long double, REAL*16, or COMPLEX*32 only when extremely high precision is required.
- Obey all language aliasing rules (try to avoid -qalias=nostd in Fortran and -qalias=noansi in C/C++).
- Use locals wherever possible for loop index variables and bounds. In C/C++, avoid taking the address of loop indices and bounds.
- Keep array index expressions as simple as possible. Where indexing needs to be indirect, consider using the PERMUTATION directive.

4.4.2 Fortran tips
- Use the [mp]xlf90[_r] or [mp]xlf95[_r] driver invocations where possible to ensure portability. If this is not possible, consider using the -qnosave option.
- When writing new code, use module variables rather than common blocks for global storage.
- Use modules to group related subroutines and functions.
- Use INTENT to describe usage of parameters.
- Limit the use of ALLOCATABLE arrays and POINTER variables to situations that demand dynamic allocation.
- Use CONTAINS in subprograms only to share thread local storage.
- Avoid the use of -qalias=nostd by obeying Fortran alias rules.
- When using array assignment or WHERE statements, pay close attention to the generated code with -qlist or -qreport. If performance is inadequate, consider using -qhot or rewriting array language in loop form.
A small sketch illustrating some of these points follows the list.
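The fragment below illustrates module variables in place of a common block and INTENT on dummy arguments; all names are purely illustrative:

      module settings
        real*8 :: tol = 1.0d-6            ! module variable instead of a common block
      end module settings

      subroutine scale_array(x,n,factor)
        use settings
        integer, intent(in)    :: n
        real*8,  intent(in)    :: factor
        real*8,  intent(inout) :: x(n)
        integer :: i
        do i=1,n
          if (abs(x(i)) .gt. tol) x(i) = x(i) * factor
        end do
      end subroutine scale_array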

4.4.3 C and C++ tips
- Use the xlc[_r] invocation rather than cc[_r] when possible.
- Always include string.h when doing string operations and math.h when using the math library.
- Pass large class/struct parameters by address or reference and pass everything else by value where possible.
- Use unions and pointer type-casting only when necessary and try to follow ANSI type rules.



- If a class or struct contains a double, consider putting it first in the declaration. If this is not possible, consider using -qalign=natural.
- Avoid virtual functions and virtual inheritance unless required for class extensibility. These are costly in object space and function invocation performance.
- Use volatile only for truly shared variables.
- Use const for globals, parameters and functions whenever possible.
- Do limited hand-tuning of small functions by defining them as inline in a header file.

4.4.4 Inlining procedure references
Inlining involves replacing a procedure reference with a copy of the procedure's code, so that the overhead of referencing the procedure, and of returning from it, is eliminated. In certain situations inlining can enable the compiler to perform more optimization than without inlining. The general advice is to avoid inlining in areas of a program that are infrequently executed and to ensure that small functions are inlined in frequently executed areas. Do not inline large functions. Inlining does not always improve performance; therefore you should test the effects of this option on your code. A program with inlining might slow down because of larger code size resulting in more cache misses and page faults, or because there are not enough registers to hold all the local variables in some combined routines (check the compiler output for register spills). Inlining by the compiler is controlled through the -Q and -O options and the suboptions of the -qipa option (not available for C++). You must specify at least optimization level -O (equivalent to -O2) for -Q inlining to take effect. In Fortran, by default, -Q only affects a procedure if both the caller and callee are in the same source file or set of files that are connected by INCLUDE directives. To turn on inline expansion for calls to procedures in different source files, you must also use the -qipa option. The compiler decides whether to inline procedures based on their size. Other criteria might help to improve performance. For procedures that are unlikely to be referenced in a standard execution (for example, error-handling or debugging procedures), you might selectively disable inlining by using the -Q-names option. For procedures that are referenced within hot spots, specify the -Q+names option to ensure that those procedures are always inlined.



Getting the right amount of inlining for a particular program may require some trials. As a good starting point, consider identifying a few subprograms that are called most often, and inline only those subprograms. To verify whether the compiler has inlined the call of a certain procedure or not, you can check whether the call has disappeared in the object listing (-qlist). The following example shows how a call to a subroutine (named foo) may appear:
9| 000038 bl 4BFFFFC9 0 CALL gr3=foo,3,a",gr3,#1",gr4,i",gr5, fcr",foo",gr1,cr[01567]",gr0",gr4"-gr12",fp0"-fp13",mq",lr",fcr",xer",fsr",ca", ctr"

In C++ there are two methods to define a function as inline: by using the inline keyword or by defining (not just declaring) member functions within the class definition. Inline functions are not necessarily inlined by the compiler, and functions that are not defined as inline may still be inlined, depending on the optimization level and the -Q compiler flag.
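For example, a Fortran compilation that forces inlining of one frequently called routine while excluding an error handler could look like the following; the routine names are hypothetical and the exact -Q syntax should be checked in the compiler documentation:

   xlf -O3 -Q+interp -Q-report_error -qlist prog.f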

4.4.5 Structuring code for optimal grouping
The grouping of the internal microprocessor instructions is important in order to exploit the potential performance of the different hardware execution units for a specific calculation. As a general rule, it is desirable to fill out all four slots (five in case of a branch) of an instruction group. Instructions that have to be first or last in a group may prevent optimal grouping. Flushing instruction groups and refetching instructions for reordering is a worst case situation to be avoided if possible. There are no means to influence instruction grouping from the C or Fortran language level directly. The compiler has to cope with the requirements of grouping. Only when writing assembler code is it possible to arrange the order of instructions in order to optimize grouping.

Writing suitably structured high-level code might help the compiler to generate an instruction stream which can be grouped nicely. The key issues to be considered carefully are proper alignment and data dependencies, and these tuning techniques are beneficial for overall performance anyway.

4.5 Tuning for 64-bit integer performance
Given a program that uses 64-bit integer data types, you need to compile with the -q64 option in order to exploit the 64-bit integer hardware support of POWER3 and POWER4. Note that specifying the -q64 compiler option does not affect the default setting for -qintsize.



In 64-bit mode, the use of INTEGER(8) induction (loop counter) variables can improve performance. The XLF 7.1 compiler automatically converts induction variables declared as INTEGER or INTEGER(1) or (4) to INTEGER(8) unless -qstrict_induction is specified. It is no longer necessary to set the size of default INTEGER and LOGICAL data entities (-qintsize) to 8 bytes in order to achieve this goal without source code changes. In this case the usage of -qintsize=8 could increase the memory consumption and bandwidth requirements unnecessarily. Figure 4-1 shows some performance implications. A simple add operation, B(I) = A(I) + C, is selected. In the context of this example 32-bit denotes 32-bit array elements and 32-bit address space; 64-bit indicates 64-bit integer elements. Fetching data from L1 and L2 cache, the 64-bit version with hardware support (-q64) is not slower than the 32-bit version. The 32-bit version is faster when going out to L3 cache and to memory. Twice the number of elements are kept in L2 and L3 cache, so the performance degradation is delayed. Without the 64-bit integer hardware support, the performance of the operation with 64-bit operands is significantly worse. One should expect twice the number of load, store, and add instructions. But the object code listing reveals that 64-bit emulation is more complex. In addition, addic (add with carry) instructions are generated, which probably lead to inefficient instruction grouping.

Figure 4-1 Integer computation: B(I)=A(I)+C
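A minimal example of the kind of loop measured above is shown below; with INTEGER(8) data and an INTEGER(8) induction variable it would be compiled in 64-bit mode (for example, xlf -O3 -q64 -qarch=pwr4 -qtune=pwr4) so that the 64-bit integer hardware is used directly. The routine name is illustrative:

      subroutine add64(a,b,n,c)
      integer(8) :: n, i                 ! 64-bit induction variable
      integer(8) :: a(n), b(n), c
      do i=1,n
        b(i) = a(i) + c
      end do
      end subroutine add64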



5

Chapter 5 .

General tuning guidelines
This chapter covers general code tuning and application optimization techniques that are not specific to POWER4 microarchitecture. It is intended to be a repository of recommended coding practices.

5.1 Hand tuning code
Many of the following tips and advice can be found in the IBM Fortran and C compiler manuals.

5.1.1 Local or global variables?
Use local variables, preferably automatic variables, as much as possible. The compiler can accurately analyze the use of local variables, but it has to make several worst-case assumptions about global variables. These assumptions tend to hinder optimization. For example, if you write a function that uses global variables heavily, and that function also calls several external functions, the compiler assumes that every call to an external function could change the value of every global variable. If you know that none of the function calls affects the global variables that you are using, and you have to read them frequently with function calls interspersed, copy the global variables to local variables and then use these local variables. The compiler can then perform optimization that it could not otherwise perform.
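The following Fortran fragment sketches the idea: the global value held in a common block is copied to a local scalar once, before the loop, so the external call inside the loop no longer forces it to be reloaded. All names are hypothetical:

      subroutine scale(x,n)
      integer n, i
      real*8 x(n), local_f
      real*8 gf
      common /params/ gf
      local_f = gf                    ! copy the global once, outside the loop
      do i=1,n
        x(i) = x(i) * local_f
        call other_work(x(i))         ! hypothetical external routine
      end do
      end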



In C, if you must use global variables, use static variables with file scope rather than external variables wherever possible. In a file with several related functions and static variables, the optimizer can gather and use more information about how the variables are affected. To access an external variable, the compiler has to make an extra memory access to obtain the address of the variable. When the compiler removes extraneous address loads, it has to use a register to keep the address. Using many external variables simultaneously takes up many registers. Those that cannot fit into the available registers during optimization are spilled into memory. The C compiler organizes all elements of an external structure so that they use the same base address and hence base address register. Therefore, you should group external data into structures or arrays wherever it makes sense to do so. Do not group together data whose address is taken (either explicitly using an ampersand (&) or implicitly, including arrays passed as parameters and C++ class objects passed as this parameters) in the same structure with other data whose address is not taken. In C, because the compiler treats register variables the same as it does automatic variables, you do not gain performance by declaring register variables. Note that this differs from other vendors' implementations, where using the register attribute can greatly affect program performance. However, declaring a variable as register is a good hint to the compiler and means the variable cannot be dereferenced.

5.1.2 Pointers
Keeping track of pointers during optimization is difficult and in some cases impossible. Using pointers inhibits most memory optimization (such as dead store elimination and store motion). Using the C #pragma disjoint preprocessor directive to list identifiers that do not share the same physical storage can improve the runtime performance of optimized code.

5.1.3 Expressions
The Fortran compiler is good at recognizing identical expressions but not permutations of them. For example, in the code:
      x=a+b+c+d
      y=a+c+b+d



the compiler will only load the variables a, b, c, and d into registers once. However, it evaluates the expressions separately and stores the results from separate registers. Wherever possible, write identical expressions specifying variables in the same order.

5.1.4 Data type conversions
Avoid forcing the compiler to convert numbers between integer and floating-point internal representations. Do not use floats (datatype real) for loop variables. While this is not a performance issue, when comparing floating-point numbers for equality, bear in mind that values may vary in the least significant bit, depending on how the value is calculated. Where appropriate, test that the unsigned difference between the values is less than an acceptable threshold of accuracy.
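For example, rather than testing two REAL*8 values for exact equality, compare their difference against a tolerance; the threshold used here is only illustrative:

      logical function nearly_equal(x,y)
      real*8 x, y, tol
      parameter ( tol = 1.0d-12 )        ! acceptable threshold of accuracy
      nearly_equal = abs(x - y) .lt. tol
      end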

5.1.5 Tuning loops
There are a variety of techniques, basically good coding practice, that can be applied to tuning loops. These techniques are not POWER4 specific. They are described in the Fortran context but also apply to C.

Chapter 5. General tuning guidelines

95


Alternatively, replace the loop with its subroutine call for each element with a subroutine call, passing an array parameter, where the subroutine contains a loop operating on each element.

Simplify array subscripts
Avoid writing complex expressions for array subscripts if possible, particularly expressions involving loop variables. Complex expressions may cause the compiler to compute the array index even if the increment is fixed. For fixed increments, the compiler can generate array element addresses with register add instructions instead of a more complex calculation.

Use INTEGER loop variables
Integer loop variables simplify loop counter optimization, since the counter can be kept in a register and also used as an address index. Do not use INTEGER*8 loop variables unless in 64-bit mode. Use of REAL loop variables is strongly discouraged because both of the following affect performance:
- Calculation of the loop variable requires one or more floating-point operations
- Use of the loop variable as an index requires conversion
In addition, the test for loop termination may be inexact and the loop may be executed one fewer or one more time than expected. In C, declare loop variables as type long. Long variables are the natural or register size in 32-bit and 64-bit environments, that is they are 32 or 64 bits accordingly. Loop variable arithmetic using the register size has significantly better performance than using, for example, integer (32-bit) loop variables in a 64-bit environment. Note that in the 64-bit environment, the C compiler will optimize integer loop variables to long if you specify -O3 (or greater) and do not specify -qstrict_induction.

Avoid the following constructs within loops:
- Flow control statements such as GOTO, STOP, PAUSE, RETURN, computed GOTO, ASSIGN, or ASSIGNED GOTO
- EQUIVALENCE data items
These constructs impair the ability of the compiler to optimize the loop.

Avoid non-optimizable data types such as LOGICAL*1, BYTE, INTEGER*1, INTEGER*2, REAL*16, COMPLEX*32, CHARACTER, and INTEGER*8 in 32-bit mode. These data types do not correspond to the native hardware types and require additional instructions for each operation, impacting performance.



For performance-critical do loops, avoid the following:

Access data with large stride
Large strides reduce the effectiveness of the cache.

Do few iterations of the loop
For very small numbers of loop iterations, it may be preferable to unroll the loop by hand.

Include I/O statements
I/O can introduce indeterminate delays in processing. I/O function calls will also prevent automatic parallelization of loops.

Examples
These are some examples of how to correct some inefficient coding practices that have been found in real codes:

Removal of invariant IF
Untuned
-------
      DO I=1,N
        IF(D(J).LE.0.0)X(I)=0.0
        A(I)=B(I)+C(I)*D(I)
        E(I)=X(I)+F*G(I)
      ENDDO

Tuned
-----
      IF(D(J).LE.0.0)THEN
        DO I=1,N
          A(I)=B(I)+C(I)*D(I)
          X(I)=0.0
          E(I)=F*G(I)
        ENDDO
      ELSE
        DO I=1,N
          A(I)=B(I)+C(I)*D(I)
          E(I)=X(I)+F*G(I)
        ENDDO
      ENDIF

The compiler will recognize that the IF test is invariant within the loop but will not generate two versions of the loop as in the tuned example.



Boundary condition IF testing
A frequent requirement is to perform a different calculation for the first and/or last iteration of a loop. If the loop is performance-critical, then it is important to treat these special cases separately and remove the IF code from the main loop:
Untuned
-------
      DO I=1,N
        IF(I.EQ.1)THEN
          X(I)=0.0
        ELSEIF(I.EQ.N)THEN
          X(I)=1.0
        ENDIF
        A(I)=B(I)+C(I)*D(I)
        E(I)=X(I)+F*G(I)
      ENDDO

Tuned
-----
      A(1)=B(1)+C(1)*D(1)
      X(1)=0.0
      E(1)=F*G(1)
      DO I=2,N-1
        A(I)=B(I)+C(I)*D(I)
        E(I)=X(I)+F*G(I)
      ENDDO
      X(N)=1.0
      A(N)=B(N)+C(N)*D(N)
      E(N)=1.0+F*G(N)

Repeated intrinsic function calculation
In this example, the untuned code calls SIN() N*N times, whereas in the tuned code, it is called N times and saved in a separate array. In the inner loop, the call is replaced by a significantly cheaper load.
Untuned
-------
      DO I=1,N
        DO J=1,N
          A(J,I)=B(J,I)*SIN(X(J))
        ENDDO
      ENDDO

Tuned
-----
      DIMENSION SINX(N)
      .
      DO J=1,N
        SINX(J)=SIN(X(J))
      ENDDO
      DO I=1,N
        DO J=1,N
          A(J,I)=B(J,I)*SINX(J)
        ENDDO
      ENDDO



Replacing divides by reciprocal multiply
This optimization can sometimes be done automatically by the compiler by specifying at least -O3 optimization level. Since divides are costly, any loop that divides by the same value more than once can be easily optimized by taking the reciprocal of the value and then multiplying by the reciprocal, as in this example:
Untuned
-------
      DO I=1,N
        A(I)=B(I)/C(I)
        P(I)=Q(I)/C(I)
      ENDDO

Tuned
-----
      DO I=1,N
        OC=1.0/C(I)
        A(I)=B(I)*OC
        P(I)=Q(I)*OC
      ENDDO

In practice, any improvement will depend on the ratio of divides to loads and stores. For trivial loops, there is no benefit for reals but there is a benefit for integers. The following example shows a similar method that has been used when there are two (or more) different divisors:
Untuned
-------
      DO I=1,N
        A(I)=B(I)/C(I)
        P(I)=Q(I)/D(I)
      ENDDO

Tuned
-----
      DO I=1,N
        OCD=1.0/( C(I)*D(I) )
        A(I)=B(I)*D(I)*OCD
        P(I)=Q(I)*C(I)*OCD
      ENDDO

Here, two divides have been replaced by one divide and five multiplies. In the untuned case, the compiler can take advantage of the multiple FPU pipelines, whereas in the tuned case, the code is dependent on a single floating-point divide.



Array dimensions that are high powers of two
The following discussion is an extension of what was described in "Set associativity" on page 29. The following code elements illustrate a problem that can arise with array dimensions:
      integer nx,nz
      parameter (nx=2048,nz=2048)
      real p(2,nx,nz)
      ...
      ...
      do 25 ix=2,nx-1
        do 20 iz=2,nz-1
          p(it1,ix,iz)= -p(it1,ix, iz)
     &                  +s*p(it2,ix-1,iz)
     &                  +s*p(it2,ix+1,iz)
     &                  +s*p(it2,ix, iz-1)
     &                  +s*p(it2,ix, iz+1)
   20   continue
   25 continue

The second dimension of p multiplied by the first dimension (in this case two) is precisely one half of the size of the L1 cache. Array elements p(it2,ix-1,iz) and p(it2,ix+1,iz) will (normally) be found in the same cache line as p(it2,ix,iz). However, accessing p(it2,ix,iz-1) and p(it2,ix,iz+1) will displace this cache line because each of these elements maps to the same congruence class. In this example, there are five loads to the same congruence class but only two cache lines available, because L1 is two-way associative. A simple solution is to increase the dimension of the array, as in:
      integer nx,nz
      parameter (nx=2048,nz=2048)
      real p(2,2080,nz)

Note that we are simply changing the dimension of the array, not the number of elements accessed. In this example, the application performance increased by a factor of two. You can also get multiple loads to the same congruence class in loops that access a large number of arrays because there are only 128 classes. In this case, it is possible to improve performance by splitting the loop into multiple loops and relocating array accesses into these separate loops.



5.2 Using pre-tuned code
Do not spend time duplicating tuning work that has already been done. If your program performs standard functions, such as matrix multiply, equation solving, other BLAS functions, FFTs, convolution, and so on, then modify your code to call the equivalent ESSL function. ESSL is described in 6.1, "The ESSL and Parallel ESSL libraries" on page 114, and contains probably the most highly tuned code available for RS/6000 and pSeries numerically intensive functions. Other commercially and publicly available libraries, such as NAG, IMSL, LAPACK, and so on, have also been tuned for cache-based superscalar architectures.

5.3 The performance monitor
The POWER3 and POWER4 processor designs (as well as RS64) include hardware performance monitoring facilities. These facilities provide access to counters that record highly detailed information about processor behavior and instruction execution. At the lowest level, the interface consists of special-purpose registers that control the state of counters and multiplexors within the processor. These registers are only accessible at the operating system level; therefore a programming interface is provided that accesses these registers using a kernel extension. The hardware provides eight counters, each of which can count the number of occurrences of one event. Events are things that happen inside the processor, such as the completion of an instruction or a load from a cache line. Events are platform specific; therefore, certain events may exist on one processor type but not another.

The programming interface provides a set of C routines to specify which events should be counted, whether they should be counted for the kernel, the user, or at the process level. Counting can be turned on or off within a program, thereby providing a very accurate mechanism for determining processor usage in specific parts of an application. The API and documentation are provided on the AIX 5L installation media. There is also a command, pmcount, which will execute a command or script. You can specify countable events as options to pmcount. Using another set of options, pmcount will display event numbers and their definitions for the current hardware platform.



The POWER3 and RS64 implementations allow the counting of events in any counter for which the event is defined (although not all combinations may be meaningful in the sense that the set of multiplexors used to accumulate into a specified counter may not produce a meaningful result). Event counting has been refined to provide a number of groups of events for each processor type. The definition of a group is simply a set of eight events and the particular counters on which they are counted. Combinations of events in a group are meaningful. A group may be specified as an option to pmcount in place of a set of events. The increased level of complexity of the POWER4 design means that it is more difficult to guarantee meaningful results from counting events. Therefore, only counting by groups is supported. The following example illustrates some of the techniques that may be useful in programming the API:
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/time.h>
#include "pmapi.h"

#define STRIDE_MAX 4096
#define NUM_LOOPS 100

void timevalsub(struct timeval *, struct timeval *);
void timevaladd(struct timeval *, struct timeval *);
void invalidate_tlb();

main(int argc, char **argv)
{
  int i,j,
      testcount,              /* various loop variables */
      rc,                     /* return code */
      stride, group_no;       /* parameters */
  char *progname;
  float x, array[512][513];
  /* timestamps for loop start, end */
  struct timeval loop_start, loop_end, total;
  /* process monitor data structures */
  pm_info_t        myinfo;
  pm_groups_info_t mygroupinfo;
  pm_prog_t        myprog;
  pm_data_t        mydata;

  stride=1;
  total.tv_sec=0;
  total.tv_usec=0;
  /* make sure all pages in array exist to minimize later timing issues */
  for (i=0;i<512;i++) {
    for (j=0;j<513;j++) {
      array[i][j]=0.0;
    }
  }
  progname = *argv;
  if (argc == 3 ) {
    argv++;
    group_no=atoi(*argv);
    argv++;
    stride=atoi(*argv);
  } else {
    printf("usage: %s group stride\n",progname);
    exit(1);
  }
  /* initialize API. Allow all possible events. */
  if ((rc = pm_init(PM_VERIFIED|PM_UNVERIFIED|PM_CAVEAT,
                    &myinfo,&mygroupinfo)) > 0) {
    pm_error("pm_init", rc);
    exit(-1);
  }
  /* set up counting modes for call to pm_set_program_mythread() */
  myprog.mode.w = 0;
  /* count in user mode, not kernel mode */
  myprog.mode.b.user = 1;
  myprog.mode.b.kernel = 0;
  /* defer starting counting until we call pm_start_mythread */
  myprog.mode.b.count = 0;
  /* set is_group to say we're counting groups rather than events */
  myprog.mode.b.is_group = 1;
  /* since we're counting groups, put the group number into events[0].
     The API won't look at other events[] structures. */
  myprog.events[0]=group_no;
  if ((rc=pm_set_program_mythread(&myprog)) != 0 ) {
    pm_error("Calling pm_set_program_mythread",rc);
    exit(1);
  }
  testcount=NUM_LOOPS;
  while (testcount-- > 0 ) {
    invalidate_tlb();
    gettimeofday(&loop_start,NULL);
    /* Start counting. We don't want to include the overhead of the
       invalidate_tlb() and gettimeofday() calls so we start and stop
       counting accordingly */
    if ((rc=pm_start_mythread()) != 0 ) {
      pm_error("Calling pm_start_mythread",rc);
      exit(1);
    }
    for (i=0;i<512;i++) {
      for (j=0;j<512;j+=stride) {
        array[i][j]=array[i][j] * 1.5;
      }
    }
    /* Stop counting but don't reset the counters. Therefore, counting
       will simply continue on the next call to pm_start_mythread() */
    if ((rc=pm_stop_mythread()) != 0 ) {
      pm_error("Calling pm_stop_mythread",rc);
      exit(1);
    }
    gettimeofday(&loop_end,NULL);
    timevalsub(&loop_end,&loop_start);
    timevaladd(&total,&loop_end);
  }
  /* retrieve counter data */
  if ((rc=pm_get_data_mythread(&mydata)) != 0 ) {
    pm_error("pm_get_data_mythread",rc);
    exit(1);
  }
  x=(total.tv_sec*1000000)+total.tv_usec;
  printf("Time (usecs) = %8.2f\n",x/NUM_LOOPS);
  for (i=0;i<8;i++)
    printf("Counter %d = %-8lld\n", i+1,mydata.accu[i]/NUM_LOOPS);
  return(0);
}

Running this program produces the output below:
bu30b$ ./pm_example 5 8
Inside pm_set_program_mythread: prog->events[0] is 5.
prog->mode.b.is_group is 1.
Time (usecs) =   375.19
Counter 1 = 249
Counter 2 = 215
Counter 3 = 0
Counter 4 = 7966
Counter 5 = 0
Counter 6 = 0
Counter 7 = 0
Counter 8 = 0
bu30b$

The first two lines of output are generated by the pm_set_program_mythread() call, apparently as diagnostic information. The program prints the elapsed time and counter values. We previously used pmcount to identify the groups and counters. Group 5 counts information on sources of data. The definitions for the individual counters used in this example are as follows:
Counter 1  The number of times data was loaded from L3
Counter 2  The number of times data was loaded from memory
Counter 3  The number of times data was loaded from L3.5
Counter 4  The number of times data was loaded from L2
Counter 5  The number of times data was loaded from L2 partition 1 in shared mode
Counter 6  The number of times data was loaded from L2 partition 2 in shared mode
Counter 7  The number of times data was loaded from L2 partition 1
Counter 8  The number of times data was loaded from L2 partition 2
Thus, in this example, we can see the relative sources of data for the calculation. Other groups can be used to identify the efficiency of the prefetch mechanism, the floating-point unit, and so on.
The API described above is provided in C. There is no Fortran API. However, it is a reasonable task to write a suitable, simplified API.

Subroutines to initialize the performance monitor, start and stop counting, and print results are required. Here is some pseudo-code that implements these subroutines.
/* The first header name was lost in extraction; stdio.h is what the
   printf call requires. */
#include <stdio.h>
#include "pmapi.h"

int pminit(int group)
{
    pm_info_t pm_myinfo;
    pm_groups_info_t pm_mygroupinfo;
    pm_prog_t myprog;

    pm_init(PM_VERIFIED|PM_UNVERIFIED, &pm_myinfo, &pm_mygroupinfo);
    myprog.mode.b.user = 1;
    myprog.mode.b.kernel = 0;
    myprog.mode.b.count = 0;
    myprog.mode.b.is_group = 1;
    myprog.events[0] = group;
    pm_set_program_mythread(&myprog);
    return(0);
}

int pmstart()
{
    pm_start_mythread();
    return(0);
}

int pmstop()
{
    pm_stop_mythread();
    return(0);
}

int pmprint()
{
    int i;
    pm_data_t my_data;

    pm_get_data_mythread(&my_data);
    for (i=0; i<8; i++) {
        printf("Counter %d = %-8lld\n", i+1, my_data.accu[i]);
    }
    return(0);
}

You will need to compile these functions and save the object file for later use. The -c option tells the compiler that the source file is not a complete program and it should stop after the compilation stage and not attempt to link. For example:
xlc -O3 -c -o pm_subroutines.o mysourcefilename.c

Here is how you might use these subroutines to monitor a Fortran program. Note that we have passed the group to be monitored by value explicitly because Fortran, by default, passes parameters by reference:
      program pm_test
      integer pminit,pmstart,pmstop,pmprint,i,j
      integer result,group
      real*8 x,a(512,512)
      group=5
      result = pminit(%VAL(group))
      result = pmstart()
      do i=1,512
        do j=1,512
          a(i,j) = a(i,j) *1.5
        end do
      end do
      result = pmstop()
      result = pmprint()
      end program

To compile this program, we need to include the C subroutines and the performance monitor libraries:
xlf -O3 -o pm_test pm_subroutines.o -lpmapi -L/usr/pmapi/lib pm_test.f

Note that the performance monitor has not been tested in LPAR environments.

5.4 Tuning for I/O
If I/O is a significant part of the program, it may well dominate the overall run time and render CPU tuning unproductive. Some guidelines for improving I/O efficiency in Fortran and C are discussed in the following sections. However, the best advice is simply to eliminate or minimize I/O as much as possible. If I/O is your performance bottleneck, then using the best hardware and software options (high-speed storage arrays, striping over multiple devices and adaptors, and asynchronous I/O, for example) may be the best tuning options. A detailed discussion of these subjects is outside the scope of this publication. Large-memory SMP systems are capable of generating large amounts of I/O, but different I/O subsystems have different performance characteristics, so it is difficult to make specific recommendations.

Asynchronous I/O
Programs normally perform I/O synchronously. That is, execution continues only after the operating system has completed the I/O. AIX also supports asynchronous I/O. In this case, a program executes an I/O call that returns immediately. The program can then perform other useful work. The operating system will perform the I/O and inform the program when it is complete. There are a variety of techniques for the program to detect that the I/O has finished. Taking advantage of asynchronous I/O can result in reduced run time because you can overlap computation and I/O. The degree of improvement will depend on the amount of I/O the program performs. Implementing asynchronous I/O will require program changes, and the degree of difficulty will vary from program to program.
In Fortran (introduced in XL Fortran Version 5), a program can open a file with the ASYNC qualifier. Reads and writes will then be performed asynchronously. The program needs to be changed to issue a wait for each asynchronous read or write. A description of asynchronous I/O, including a discussion of error handling, can be found in the XL Fortran Language documentation.
In C, asynchronous I/O is only supported for unbuffered I/O. The program changes required for asynchronous I/O are typically more complex than those required in Fortran. Refer to "Asynchronous I/O Overview" in the AIX Version 4.3 Kernel Extensions and Device Support Programming Concepts documentation.
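As a rough sketch of what the C coding involves, the following fragment starts a write with POSIX-style asynchronous I/O calls, does other work, and then waits for completion. This is an illustration only: the legacy AIX interface passes the file descriptor as a separate argument, asynchronous I/O must be enabled on the system, and the file name and transfer size here are arbitrary, so verify the details against the Kernel Extensions documentation.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <aio.h>

int main(void)
{
    static char buf[1 << 20];            /* 1 MB transfer buffer (arbitrary) */
    struct aiocb cb;
    const struct aiocb *list[1];
    int fd;

    fd = open("data.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_write(&cb) != 0) {           /* returns immediately */
        perror("aio_write");
        return 1;
    }

    /* ... do useful computation here while the I/O proceeds ... */

    list[0] = &cb;
    aio_suspend(list, 1, NULL);          /* wait for the I/O to complete */
    if (aio_error(&cb) == 0)
        printf("wrote %ld bytes\n", (long) aio_return(&cb));

    close(fd);
    return 0;
}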

Direct I/O
Direct I/O is a form of synchronous I/O. By default, the operating system transfers data between the application program and a file using intermediate buffers. For example, for file system files, the operating system caches file data, and this typically improves I/O performance. Using direct I/O, data is transferred between the device and the application's data buffers without intermediate buffering. This can sometimes lead to degraded performance, typically with file system files.

Paging I/O
Paging is a special case of I/O. You can measure paging rates using vmstat. This command displays, among other statistics, the paging rates for a specified time interval. It displays these statistics for the whole system, which must be taken into account when evaluating the effect of a particular application.
A certain amount of paging during startup, or when the program changes from one phase to another, is to be expected. However, any measurable paging rate over a sustained period during program execution is an indication that you are over-committing memory or are on the edge of doing so. This is likely to cause serious performance problems. The only solution is to reduce the level of memory over-commitment. Either tune the program to use less memory, or run on a computer with more memory (or fewer users).
It should be noted that from AIX Version 4.3.2 on, the paging space allocation algorithm only allocates a page of paging space when it actually needs to write to that page. This means that it is common for the amount of paging space configured on a large memory system to be considerably less than the size of memory. In this situation, any significant amount of paging can have more serious effects than poor performance of an application, as the system can quickly reach a state where virtual memory is exhausted and thrashing ensues.
The topas command is also a useful real-time monitor of system I/O activity.
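For example, sampling the whole system at five-second intervals with:

vmstat 5

and watching the pi and po columns (pages read from and written to paging space) over a sustained period is a quick way to see whether memory is being over-committed.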

C unbuffered and buffered I/O
C programs can make use of two techniques for I/O: buffered I/O, also referred to as streams, and unbuffered I/O. Unbuffered I/O is implemented by calls to operating system functions and offers the greatest opportunity for performance, at a cost in coding complexity. Buffered or stream I/O is implemented by standard library functions that provide a higher-level interface. Refer to the "Input and Output Handling Programmer's Overview" in AIX Version 4.3 General Programming Concepts: Writing and Debugging Programs.
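The difference between the two interfaces is illustrated by the following sketch, which writes the same data once with the unbuffered system calls and once with the stream functions. The file names and record size are arbitrary.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define NREC  1000
#define RECSZ 16384

static char rec[RECSZ];

int main(void)
{
    int i, fd;
    FILE *fp;

    /* unbuffered I/O: each write() is a system call, so large records pay off */
    fd = open("raw.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    for (i = 0; i < NREC; i++)
        write(fd, rec, RECSZ);
    close(fd);

    /* buffered (stream) I/O: the library coalesces small requests for you */
    fp = fopen("stream.dat", "w");
    for (i = 0; i < NREC; i++)
        fwrite(rec, 1, RECSZ, fp);
    fclose(fp);

    return 0;
}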

Fortran I/O
Some guidelines for efficient I/O in Fortran follow:
Reduce the number of calls to the I/O subsystem. For example, the following three ways of writing the whole of a 2-D array to a sequential file differ very considerably in performance. As well as performing very slowly, Case 3 will create a file almost twice as large as Case 1 (if A is REAL*8) because of the extra record length indicators.
      DIMENSION A(N,N)
      .
      .
C     Case 1. Best. 1 record of N*N values.
      WRITE(1)A
C     Case 2. N records, each of N values.
      DO I=1,N
        WRITE(1)(A(J,I),J=1,N)
      ENDDO
C     Case 3. Worst. N*N records, each of one value.
      DO I=1,N
        DO J=1,N
          WRITE(1)A(J,I)
        ENDDO
      ENDDO

Use long record lengths when reading or writing files in a sequential fashion. Use at least 100 KB if possible, preferably 2 MB or more. This allows the I/O to access the underlying devices more effectively.
Prefer Fortran unformatted I/O to formatted. This reduces binary-to-decimal conversion overhead.
Prefer Fortran direct files to sequential. This avoids Fortran record length and overflow checking. A Fortran direct file in AIX is a simple sequential series of data bytes. A Fortran sequential file has record length indicators at both ends of each record.
Use asynchronous I/O to overlap computation with I/O activity.
If you write a large temporary file sequentially and need to read through it again at a later stage in processing, make it a direct access file and then try to read the end records of the file first. Ideally, read it sequentially backwards. This is because AIX will automatically use memory to buffer the file. Assuming the file is larger than memory, after the write is completed, memory is likely to contain a large number of buffers corresponding to the last part of the file. If you then read these records, AIX will supply them to the program from memory without physically reading the disk. If you read the file forwards, the incoming records from the front of the file will flush out the in-memory buffers before you reach them.

5.5 Locating hot spots (profiling)
Profiling tells you how the CPU time used by a program during execution is distributed over the code. It identifies the active subroutines and loops so that tuning effort can be applied most effectively. It is important to understand that a profile relates only to the particular run of the program for which the profile was obtained. The same program run with different data may produce a different profile. Some numerically intensive programs produce very consistent profiles with widely varying sets of input data. Others produce quite different profiles when the data is changed. From the point of view of the person tuning the code, the ideal situation is a consistent profile with very pronounced concentrations of time spent in a few routines. Tuning effort can then be concentrated on those routines.

The AIX tools available for profiling programs include:
The AIX prof and gprof commands
The AIX tprof command
The prof and gprof commands provide profiling at the procedure (subroutine and function) level. The tprof command uses the AIX trace facility to interrupt your program at each tick (10 milliseconds) of the AIX CPU clock and construct a trace table that contains the hardware instruction address register. At the end of your program execution, tprof creates a report (using the trace table) showing the number of ticks that relate to each line of your source code.
To use prof and gprof, do the following:
1. Compile your program with the -p or -pg option in addition to the normal compiler options
2. Run the program (this produces the gmon.out file)
3. Run prof or gprof by entering:
prof > filename

or
gprof > filename

The standard output, filename, of prof will contain the following information:
The percentage of the program's CPU time used by the procedure.
The time in seconds required for all references to the procedure.
The cumulative total of seconds required for all procedures in the list.
The number of times the procedure was called and the time required to perform each call.
The output of gprof contains all the information provided by prof and, in addition, the timing information of the calling tree for the procedures in the program.
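Putting the three steps together for a Fortran program, the sequence might look like the following (the file names are only illustrative):

xlf -O3 -pg -o myprog myprog.f
./myprog
gprof myprog gmon.out > myprog.profile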

To use tprof on a program myprog.f, do the following:
1. Compile your program with the -g option
2. Run tprof on the program:
tprof -p myprog -x "myprog params"

This procedure creates two output files, namely __myprog.all and __t.myprog.f. The first file shows all the processes involved in running your program and provides a count of the timer ticks associated with each process. It also lists the percentage of ticks that are associated with the complete program. The second file is only produced if you compile your program with the -g option. It is an annotated version of your source file that indicates the CPU ticks associated with each line of the source code being executed. For more details on how to use prof, gprof, and tprof, see Optimization and Tuning Guide for Fortran, C, and C++, SC09-1705.
By far the most user-friendly and powerful tool, providing graphically assisted profiling down to the Fortran or assembler statement level, is xprofiler. xprofiler is a supported IBM tool distributed as part of the IBM Parallel Environment for AIX licensed program product (5765-D93). The specific fileset component that supplies this tool is ppe.xprofiler. If you are running on a workstation where PE is not installed, your profiling option is to use prof, gprof, or tprof.
To use xprofiler, compile and link as for gprof with the -g -pg options together with -O3 or whatever other optimization you are using. It is important to use the same optimization options as you will use for production, since changing the optimization is highly likely to also change the profile. Then simply run the executable against the chosen test data. This will produce the standard gmon.out file containing the profiling data. Then run xprofiler. Graphics will appear showing the subroutine tree of the program, with each subroutine represented by a rectangle. The area of each rectangle is roughly proportional to the CPU time spent in that routine, giving an immediate visual indication of hot-spot locations. Clicking on a rectangle will produce a set of options, one of which creates a source code listing with each statement annotated with the amount of CPU time (in units of 1/100 of a second) used. This enables the active loops to be easily identified.
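For example, assuming the ppe.xprofiler fileset is installed, the sequence might be (file names are only illustrative):

xlf -O3 -g -pg -o myprog myprog.f
./myprog
xprofiler myprog gmon.out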

Chapter 6. Performance libraries
In this chapter we discuss performance-enhancing techniques that take advantage of highly tuned variants of commonly needed operations.
Scientific and technical computational problems often contain common mathematical constructs, such as matrix-vector multiply or matrix-matrix multiply, which use a large portion of an application's computational time. Many of these common constructs have been extensively researched and tuned for efficient computation. Basic linear algebra subprograms (BLAS) that compute many of these common constructs are available from various sources. For example, you can download the source or precompiled BLAS from:
http://www.netlib.org/blas

Subprograms for solving systems of linear equations, eigenvalue problems, singular value problems, and so forth can be found in the public domain LAPACK libraries:
http://www.netlib.org/lapack

LAPACK uses BLAS calls whenever possible to simplify its use and to be able to take advantage of any available optimized BLAS libraries.


The public domain BLAS or LAPACK are not highly tuned for a particular architecture, and locally compiled versions seldom approach peak GFLOPS rates. An alternative is to download the package from the automatically tuned linear algebra software (ATLAS) site:
http://math-atlas.sourceforge.net

With ATLAS, you must first generate a tuned library for a system by running an extensive testing suite on a quiet system. The adjustable parameters in the tuned library are determined by extensively testing processor speed, cache sizes and speed, and memory size and speed. Some impressive performance results can be obtained in this manner. However, the resulting library may not be well tuned if it is subsequently used on a somewhat different machine configuration.
Another alternative is to use the IBM Engineering and Scientific Subroutine Library (ESSL) and the parallel version, Parallel ESSL. These libraries are highly tuned for IBM hardware, having been tested on many different PowerPC processor configurations. ESSL and Parallel ESSL are discussed in the ESSL and Parallel ESSL libraries section that follows.
A different type of specialized performance tuning is applying faster, but slightly less accurate, versions of Fortran intrinsic functions such as SIN, LOG, and EXP. IBM has produced tuned versions of functions like these, which can be found in the MASS library. MASS can be downloaded from:

http://www.rs6000.ibm.com/resource/technology/MASS

MASS is discussed in detail in Section 6.2, "The MASS libraries" on page 117.

6.1 The ESSL and Parallel ESSL libraries
The Engineering and Scientific Subroutine Library (ESSL) family of products is a state-of-the-art collection of mathematical subroutines. Running on IBM pSeries servers and IBM RS/6000 workstations, servers, and SP systems, the ESSL family provides a wide range of high-performance mathematical functions for a variety of scientific and engineering applications. The ESSL family includes:
ESSL for AIX, which contains over 400 high-performance mathematical subroutines tuned for IBM UNIX hardware.
Parallel ESSL for AIX, which contains over 100 high-performance mathematical subroutines specifically designed to exploit the full power of RS/6000 SP hardware with scalability of up to 512 nodes.


Complete information on the ESSL and Parallel ESSL libraries, including information on obtaining them, can be found at:
http://www-1.ibm.com/servers/eserver/pseries/software/sp/essl.html

6.1.1 Capabilities of ESSL and Parallel ESSL
ESSL provides a variety of mathematical functions, such as:
Basic Linear Algebra Subprograms (BLAS)
Linear Algebraic Equations
Eigensystem Analysis
Fourier Transforms
ESSL products are compatible with public domain subroutine libraries such as Basic Linear Algebra Subprograms (BLAS), Scalable Linear Algebra Package (ScaLAPACK), and Parallel Basic Linear Algebra Subprograms (PBLAS). Thus, migrating applications to ESSL or Parallel ESSL is straightforward.
Both ESSL and Parallel ESSL have SMP-parallel capabilities. The term parallel in the Parallel ESSL product name refers specifically to the use of MPI message passing, usually across the SP switch. For SMP parallel use within a single pSeries 690 Model 681, Parallel ESSL is not required. An SMP-parallel example (DGEMM) for the pSeries 690 Model 681 is provided in Figure 6-1 on page 116.
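Because the calling sequence is the standard BLAS one, very little source change is needed to pick up the tuned routines. The following sketch multiplies two square REAL*8 matrices with DGEMM from C; it assumes the ESSL C/C++ header essl.h and that the program is linked with -lessl, so check the ESSL Guide and Reference for the exact prototype on your system.

#include <essl.h>       /* assumed ESSL C/C++ interface header */

#define N 1000

/* matrices stored column-major (Fortran order), as ESSL expects */
static double a[N * N], b[N * N], c[N * N];

int main(void)
{
    /* C = 1.0*A*B + 0.0*C, all matrices N x N */
    dgemm("N", "N", N, N, N, 1.0, a, N, b, N, 0.0, c, N);
    return 0;
}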

6.1.2 Performance examples using ESSL
ESSL V3.3 and Parallel ESSL V2.3 are available for the IBM pSeries 690 Model 681 and contain highly optimized routines tuned for the POWER4 processor. Significant optimizations have been done in ESSL to effectively use the L1 and L2 cache, maximize data reuse in the caches, and minimize memory bandwidth requirements. Any application that can be formulated with BLAS calls, especially BLAS3 calls such as SGEMM or DGEMM, will benefit greatly from the ESSL library.
Attention: Some performance numbers reported here used a pre-release version of ESSL that was the latest available at the time this document was written. Readers should perform their own studies to establish firm performance metrics.


Single processor DGEMM
The ESSL version of DGEMM was used to perform matrix-matrix multiplies on a single POWER4 processor of a pSeries 690 Turbo system. The measured GFLOPS as a function of matrix size are shown in Figure 6-1. The best performance is greater than 3.6 GFLOPS, and the performance is excellent for a wide range of matrix sizes.

Figure 6-1 ESSL DGEMM single processor GFLOPS

SMP-parallel DGEMM - optimal POWER4 GFLOPS from ESSL
Table 6-1 lists the performance of SMP-parallel DGEMM on a pSeries 690 Turbo for square REAL*8 matrices.
Table 6-1 DGEMM throughput summary (GFLOPS)

Parallelism (32-CPU pSeries 690 Turbo)   2000x2000 REAL*8   10000x10000 REAL*8
8-way                                    25.80              not measured
16-way                                   47.86              not measured
24-way                                   69.00              not measured
32-way                                   84.30              96.13


A sustained rate of over 96 GFLOPS was measured on a 1.3 GHz pSeries 690 Turbo for 32-way parallel DGEMM on 10000x10000 matrices, as provided in Table 6-1 on page 116. This is the highest performance seen for a single job on a single pSeries 690 Turbo during the preparation of this publication. Also see Section 8.4.1, "ESSL DGEMM throughput performance" on page 161 for the performance observed for multiple copies of a single processor DGEMM application.

6.2 The MASS libraries
The mathematical acceleration subsystem (MASS) library provides high-performance versions of a subset of Fortran intrinsic functions. These versions sacrifice a small amount of accuracy to allow for faster execution. Compared to the standard mathematical library, libm.a, the MASS results differ, at most, only in the last bit. Thus, MASS results are sufficiently accurate in all but the most stringent conditions.
There are two basic types of functions available for each operation:
A single instance function
A vector function
The single instance function simply replaces the libm.a call with a MASS library call. The vector function is used to produce a vector of results given a vector operand. The vector MASS functions may require coding changes while the single instance functions do not.

6.2.1 Installing and using the MASS libraries
The MASS libraries can be downloaded from:
http://www.rs6000.ibm.com/resource/technology/MASS

This site also has extensive documentation and should be referred to for more detailed explanations. The download file is a compressed tar file that can be unpacked into /usr/lpp and the resulting library files linked to /usr/lib, or the tar file may be unpacked into any other location for inclusion at link time. There are separate libraries for the single instance functions and the vector functions.
The following is an example using MASS. If libmass.a and the other libraries are installed in /home/somebody/mass, it is used as:
xlf90 -c -O3 -qarch=pwr4 -qtune=pwr4 myprogram.f xlf90 -o myjob -L/home/somebody/mass -lmass myprogram.o


All references to SIN, LOG, EXP, and other functions in myprogram.f will have been satisfied from the single instance functions in libmass.a rather than the normally chosen functions in libm.a.
Some of the functions available in the MASS library have now been included in the XL Fortran runtime environment. This means that at higher specified levels of compiler optimization, a Fortran intrinsic function or operation may be replaced with a faster version found in /usr/lib/libxlopt.a even if the MASS libraries have not been installed. If you wish to track exactly which version of an intrinsic has been used, you can produce a detailed, sorted cross-reference map using -bsxref:myxref when creating the executable.
The following is an example using the MASS library with Fortran code:
      real(8) a(*)
      ...
      do i=1,n
         a(i)=1.0d0/a(i)
      enddo

This code would be rather expensive using the hardware divide function and may be replaced using the vector MASS reciprocal approximation function vrec as:
call vrec(a,a,n)

Using the vector form, the speedup for this example is approximately 2.25 for n>~50. See Table 6-2 on page 119 for more information. The executable is linked as:
xlf90 -o myjob -bsxref:myxref -L/home/somebody/mass -lmassv myprogram.o

Examination of the file myxref shows that vrec has been loaded from libmassv.a. However, if you are using XL Fortran Version 7.1 or later, compiling and linking as:
xlf90 -c -O3 -qhot -qarch=pwr4 -qtune=pwr4 myprogram.f xlf90 -o myjob -bsxref:myxref myprogram.o

you will find that a version of vrec has been loaded from libxlopt.a. Several other functions are recognized and may be substituted by the compiler, such as exp, sin, cos, sqrt, and reciprocal square root.


6.2.2 Description and performance of MASS libraries
Table 6-2 lists the functions available in the MASS libraries and an approximate measure of performance. The performance numbers are based on POWER3 measurements. Similar speedups are expected on POWER4. While MASS functions are somewhat less accurate than the standard functions, errors are mostly less than 1 bit.
Table 6-2 MASS library functions and performance

Function                                   MASS call   Speedup   MASSV call     Speedup(a)
64-bit exponential                         exp         2.37      vexp            6.7
32-bit exponential                         exp         2.37      vsexp           9.7
64-bit natural log                         log         1.57      vlog           10.4
32-bit natural log                         log         1.57      vslog          12.3
64-bit sine or cosine                      sin,cos     2.25(b)   vsin,vcos       7.2(b)
32-bit sine or cosine                      sin,cos     2.17(b)   vssin,vscos     9.75(b)
64-bit sine and cosine                     sin,cos     2.42(b)   vsincos(c)     10.0(b)
32-bit sine and cosine                     sin,cos     2.08(b)   vssincos       13.2(b)
64-bit tangent                             tan         2.13      vtan            5.84
32-bit tangent                             tan         2.02      vstan           5.95
64-bit inverse tangent of complex number   atan2       4.75      vatan2         16.5
32-bit inverse tangent of complex number   atan2       4.70      vsatan2        16.7
Truncate to whole number                   dint        1.0       vdint           7.86
Convert to nearest whole number            dnint       2.0       vdnint          7.06
64-bit reciprocal                          n/a                   vrec            2.6
32-bit reciprocal                          n/a                   vsrec           3.8
64-bit square root                         sqrt                  vsqrt           1.2
32-bit square root                         sqrt                  vssqrt          2.3
64-bit reciprocal square root              rsqrt       1.34      vrsqrt          6.2
32-bit reciprocal square root              rsqrt       1.34      vsrsqrt        13.2
Real raised to real power                  x**y        2.35      N/A             N/A

(a) Per result for vector length 1000
(b) Speedup for data range [-1,1]
(c) See libmassv.f in the installation directory for usage


6.3 Modular I/O (MIO) library
The Modular I/O (MIO) library was developed by the Advanced Computing Technology Center (ACTC) of the Watson Research Center at IBM to address the need for an application-level method for optimizing I/O. Applications frequently have very little logic built into them to provide users the opportunity to optimize the I/O performance of the application. The absence of application-level I/O tuning leaves the end user at the mercy of the operating system to provide the tuning mechanisms for I/O performance. Typically, multiple applications are run on a given system that have conflicting needs for high-performance I/O, resulting, at best, in a set of tuning parameters that provide moderate performance for the application mix. The MIO library allows users to analyze the I/O of their application and then tune the I/O at the application level for more optimal performance under the configuration of the current operating system.
Sequential access, predominantly reads, of very large files (tens of gigabytes) is a common pattern of I/O, for example, in implicit finite element analysis codes. Applications that are characterized by this I/O pattern tend to benefit minimally from operating system buffer pools. Large operating system buffer pools are ineffective since there is very little, if any, data reuse and system buffer pools typically do not provide prefetching of user data. However, the MIO library can be used to address this issue by invoking a prefetching (pf) module that will detect the sequential access pattern and asynchronously preload the needed data into a smaller cache. The pf cache need only be large enough to contain enough pages to maintain sufficient read-ahead. The pf module can optionally use direct I/O, which will avoid an extra memory copy to the system buffer pool and also frees the system buffers from the one-time access of the I/O traffic, allowing the system buffers to be used more productively. Our early experiences with the aix module have consistently demonstrated that the use of direct I/O with the pf module is highly beneficial to system throughput.
The MIO library consists of four I/O modules that may be invoked at run time on a per-file basis. The modules currently available are:
mio    The interface to the user program
pf     A data prefetching module
trace  A statistics gathering module
aix    The MIO interface to the operating system


For each file that is opened with MIO there are a minimum of two modules invoked: the mio module, which converts the user MIO calls (MIO_open, MIO_read, MIO_write, to name a few) into the internal calling sequence of MIO, and the aix module, which converts the internal calling sequence of MIO into the appropriate system calls (open, read, write, for example). Between the mio and aix module invocations the user may specify the invocation of the other modules, pf and trace. For applications that use the POSIX standard open, read, write, lseek, and close I/O calls, the application programmer should only need to introduce #define's to direct the I/O calls to use the MIO library. MIO is controlled through four environment variables. Among other things, these variables determine which modules are to be invoked for a given file when MIO_open is called.
As an example, the output of a MIO trace invocation is shown for a simple program. It opens a file, truncating it back to zero bytes in length, and then writes 100 records of 16 KB. The file is then read forwards with 100 reads of 16 KB, and then read backwards with 100 reads of 16 KB.
MIO statistics file : Wed Feb 9 16:03:17 2000
hostname=v01n01.vendor.pok.ibm.com  program=a.out
MIO library built Feb 1 2000 12:53:59 : with aio calls
MIO_STATS   = example.mio
MIO_DEBUG   = OPEN
MIO_FILES   = *.dat [ trace/stats ]
MIO_DEFAULTS= trace/kbytes

Opening file file.dat
  modules=trace/stats
==========================================================================
Trace close : mio <-> aix : file.dat : (4800/1.80)=2659.71 kbytes/s
  demand rate=2611.47 kbytes/s=4800/(1.85-0.02))
  current size=1600 max_size=1600
  mode =0640 sector size=4096
  oflags =0x302=RDWR CREAT TRUNC
  open    1    0.03
  write   100  0.03  1600  1600  16384 16384
  read    200  1.65  3200  3200  16384 16384
  seek    101  0.00
  fcntl   1    0.00
  close   1    0.12
  size    100
==========================================================================
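The #define redirection mentioned earlier might look like the following sketch for a program that uses the POSIX calls directly. This is only an illustration: the exact macro set and the signatures of the MIO entry points should be taken from the MIO documentation.

/* A sketch only: route the POSIX calls to MIO.  Consult the MIO
   documentation for the supported set of entry points. */
#define open(path, flags, mode)   MIO_open(path, flags, mode)
#define read(fd, buf, nbytes)     MIO_read(fd, buf, nbytes)
#define write(fd, buf, nbytes)    MIO_write(fd, buf, nbytes)
/* ... and similarly for lseek and close ... */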


For more information about MIO, refer to the following Web site:
http://www.research.ibm.com/actc/Opt_Lib/mio/mio_doc.htm

The MIO library was first shipped with the AIX Version 4.3 and 5L Bonus Pack in July 2001. More information on this is found at the following Web site:
http://www.ibm.com/servers/aix/products/bonuspack

6.4 Watson Sparse Matrix Package (WSMP)
The Watson Sparse Matrix Package (WSMP) is a high-performance, robust, and easy-to-use software package for solving large sparse systems of linear equations using a direct method on pSeries servers, RS/6000 workstations, and the RS/6000 SP. It can be used as a serial package, in a shared-memory multiprocessor environment, or as a scalable parallel solver in a message-passing environment, where each node can either be a uniprocessor or a shared-memory multiprocessor.
WSMP is comprised of two parts, both of which are bundled in the same library. Part I of WSMP replaces the older software called WSSMP for the solution of symmetric sparse systems of linear equations. Part II of the WSMP library deals with the solution of general sparse systems of linear equations. Currently, WSMP does not support the solution of general/unsymmetrical sparse systems in a message-passing parallel environment. WSMP does not have out-of-core capabilities. The problems must fit in main memory for reasonable performance.
Technical papers related to the software, some example programs, and information about the latest updates can be obtained from the following Web site:
http://www.cs.umn.edu/~agupta/wsmp.html

IBM Research intends to provide a version of WSMP compiled for POWER4 when the hardware and compiler become available.
For solving symmetric systems, WSMP uses a modified version of the multifrontal algorithm for sparse Cholesky factorization and a highly scalable parallel sparse Cholesky factorization algorithm. The package also uses scalable parallel sparse triangular solvers and an improved and parallelized version of the previously released package WGPP for computing fill-reducing orderings. Sparse symmetric factorization in WSMP has been clocked at up to 3.6 GFLOPS on an RS/6000 workstation with four 375 MHz POWER3 CPUs and 90 GFLOPS on a 128-node SP with two-way SMP 200 MHz POWER3 nodes.


For solving general sparse systems, WSMP uses a modified version of the multifrontal algorithm for matrices with an unsymmetrical pattern of nonzeros. WSMP supports threshold partial pivoting for general matrices with a user-defined threshold. WSMP automatically exploits SMP parallelism on an RS/6000 workstation or SP node with multiple CPUs, and this parallelism is transparent to the user. On an RS/6000 with four 375 MHz POWER3 CPUs, WSMP has been clocked at up to 2.4 GFLOPS for factoring general sparse matrices with partial pivoting.


Chapter 7. Parallel programming techniques and performance
There are several methods available to the application programmer to achieve parallel execution of a program and more rapid job completion compared to running on a single processor. These methods include:
Directive-based shared memory parallelization (SMP)
Compiler automatically generated shared memory parallelization
Message passing based shared or distributed memory parallelization
POSIX threads (pthreads) parallelization
Low-level UNIX parallelization using fork() and exec()
Each of these techniques has been used to produce efficient parallel codes. The best technique to use is highly dependent on the application, the programmer's skills and preferences, portability requirements for the application, and the target machine's characteristics.

In this chapter we discuss shared memory parallelization, both directive-based and automatic, message passing based parallelization using the MPI standard, and pthread parallelization.


7.1 Shared memory parallelization
Shared memory parallelization describes parallelization that can take place in a computer in which all memory used by a program is locally addressable from within the program. For current IBM computers and the current AIX 5L operating system, this means running on a single node. In the future, non-uniform memory access (NUMA) may be available, in which memory on separate, remote nodes may be addressable locally from within a single program.
A detailed description of shared memory parallelization, or SMP programming, can be found in Scientific Applications in RS/6000 SP Environments, SG24-5611. A brief overview is given in this section. All discussions refer to the OpenMP standard implementation of SMP parallelism.

7.1.1 SMP runtime behavior
Shared memory parallelization is implemented by creating user threads that are scheduled to run on kernel threads by the operating system. This parallel job flow is illustrated in Figure 7-1 on page 127. A single thread is created when a program starts. Additional threads are created when the first parallel region is entered. After all parallel work for a thread is completed, it spin waits for the next parallel section for a period, but it consumes processor time while waiting. After the spin wait time has expired and if a yield wait time has been specified, the thread can yield its place on the kernel thread to another runable thread. If the yield wait time has expired and no new parallel region has been entered, the thread goes to sleep. Reactivating a thread from a sleep state is more costly than if the thread is in a yielded state.


Figure 7-1 Shared memory parallel job flow (threads are created at the first parallel loop, do useful work, and between parallel loops spin, then yield, then sleep)

There are some important environment variables that can affect parallel performance at run time. Different settings would be appropriate on a busy machine compared to a quiet machine. Some of the more important environment variables are:
AIXTHREAD_SCOPE = S or P (default = P)
The thread contention scope can be system (S) or process (P). When system contention scope is used, each user thread is directly mapped to one kernel thread. This is appropriate for typical scientific and technical applications in which there is a one-to-one ratio between threads wanted and processors wanted. Process contention scope is best when there are many more threads than processors. When process contention scope is used, user threads share a kernel thread with other (process contention scope) user threads in the process.


OMP_DYNAMIC = FALSE or TRUE (default = TRUE)
The OMP_DYNAMIC environment variable disables or enables dynamic adjustment of the number of threads available for the execution of parallel regions. If this variable is TRUE, the runtime environment can adjust the number of threads it uses for executing parallel regions so it makes the most efficient use of system resources. The dynamic checking can add a small amount of overhead, so for benchmarking, scaling tests, or if an application depends on a specific number of threads, this variable should be set to FALSE.
SPINLOOPTIME = n (default = 40)
If a user thread cannot acquire a lock (which is necessary to begin a parallel loop, for example), it will attempt to spin for up to SPINLOOPTIME times. Once the spin count has been exhausted, the thread will go to sleep waiting for a lock to become available unless YIELDLOOPTIME is set to a number greater than zero. You want to spin rather than sleep if you are waiting for a previous parallel loop to complete, provided there is not too much sequential work between the loops. If YIELDLOOPTIME is set, upon exhausting the spin count the thread issues the yield() system call, gives up the processor, but stays in a runable state rather than going to sleep. On a quiet system, yielding is preferable to sleeping since reactivating the thread after sleeping costs more time. For benchmarking or scaling tests, SPINLOOPTIME can be very large, for example 100000 or more. On a busy system, it should not be too large or much processor time that could otherwise be shared with other jobs is consumed spinning. The best value to use depends on various system characteristics such as processor frequency, and several values should be tested to achieve optimal tuning.
YIELDLOOPTIME = n (default = 0)
YIELDLOOPTIME controls the number of times that the system yields the processor when trying to acquire a busy spin lock before going to sleep. The processor is yielded to another kernel thread, assuming there is another runable one with sufficient priority. YIELDLOOPTIME is only used if SPINLOOPTIME is also set.
MALLOCMULTIHEAP (default = not set)
Multiple heaps are useful so that a threaded application can have more than one thread issuing memory allocation subroutine calls. With a single heap, all threads trying to do a malloc(), free(), or realloc() call would be serialized (that is, only one thread can do malloc/free/realloc at a time), which could have a serious impact on multi-processor machines. With multiple heaps, each thread gets its own heap, up to 32 separate heaps.


SMP stack size (default = 4 MB/thread)
For 32-bit OpenMP applications, the default limit on stack size per thread is rather small, and if it is exceeded it will result in a runtime error. Should this occur, the stack size may be increased using the XLSMPOPTS environment variable with:
export XLSMPOPTS=stack=n

where n is the stack size in bytes. However, the total stack size for all threads cannot exceed 256 MB (one memory segment). This limitation of one segment does not apply to 64-bit applications.

7.1.2 Shared memory parallel examples
Shared memory parallelization (SMP) programming can be done at a very high level, such as:
      SUBROUTINE EXAMPLE(M,N,A,B)
      REAL(8) A(N),B(N)
!$OMP PARALLEL DO PRIVATE(J), DEFAULT(SHARED)
      DO J=1,M
         CALL DOWORK(J,N,A,B)
      ENDDO
      ...

The subroutine DOWORK and all subsequent subroutine calls must be carefully checked to ensure they are, in fact, thread safe. This high level of parallelization is usually the most efficient, and is recommended when possible. It is also common to use shared memory parallelization at a low level, although scaling efficiencies are often quite limited when little work is done in a parallel region. The ease of implementation is an attractive feature of low-level parallelization.
The discussion and examples that follow demonstrate parallelism at the loop level. We have tested three loops from the solver of a computational fluid dynamics code and use them as examples. The loops are:
LOOP A
      DO J=1,NX
         Q(J)=E(J)+F2*Q(J)
      ENDDO

LOOP B
      DO J=1,NX
         I1=IL(1,J)
         I2=IL(2,J)
         I3=IL(3,J)
         I4=IL(4,J)
         I5=IL(5,J)
         I6=IL(6,J)
         E(J)=Y3(J)*Q(J)-(
     *      Q2(1,J)*Q(I1)+Q2(2,J)*Q(I2)+
     *      Q2(3,J)*Q(I3)+Q2(4,J)*Q(I4)+
     *      Q2(5,J)*Q(I5)+Q2(6,J)*Q(I6))
         F3=F3+Q(J)*E(J)
      ENDDO

LOOP C
      DO J=1,NX
         Z0(J)=Z0(J)+X2*Q(J)
         B1(J)=B1(J)-X2*E(J)
         T1=B1(J)
         E(J)=T1*DBLE(C1(J))
         F1=F1+T1*E(J)
         F4=F4+ABS(T1)
      ENDDO

The declarations are:
      REAL(8) Z0(NX),B1(NX),E(NX),Q(0:NX)
      REAL(4) Y3(NX),Q2(6,NX),C1(NX)
      INTEGER(4) IL(6,NX)

In this example, NX is typically 100000 to 10000000. Loop A is a simple multiply/add loop. Loop B is a complicated loop with 20 memory loads, a single store, and a reduction sum. Six of the memory references, such as Q(I1), are indirect address references. Loop C is a moderately complicated loop with five memory loads, three stores, and two reduction sums.

7.1.3 Automatic shared memory parallelization
Automatic shared memory parallelization is successful when the compiler can recognize parallel code constructs and safely produce efficient parallel code. The IBM XL Fortran Version 7.1 compiler has state-of-the-art capabilities for automatically parallelizing Fortran programs. A major concern with automatic parallelization is the potential that a loop with little work or few iterations is parallelized and runs more slowly than it would had it remained sequential. However, when a large Fortran code is well written and it is compiled for automatic parallelization, good speedups can be realized with very little effort.


The three example loops are automatically parallelized with:
xlf90_r -c -qsmp=auto -qnohot -qreport=smplist -O3 -qarch=pwr4 -qtune=pwr4 -qfixed sub.f

The option -qsmp=auto initiates automatic parallelization and also implies -qhot. The option -qnohot was used to be consistent with the directive-based SMP runs. The option -qreport=smplist reports the line number of each successfully parallelized loop. The resulting file, sub.lst, has additional information, including reasons why parallelization may have been unsuccessful for a loop. The xlf90_r compiler invocation should be used rather than xlf90 to ensure the resulting object code is thread safe. Performance results are shown in Section 7.1.5, "Measured SMP performance" on page 132.

7.1.4 Directive-based shared memory parallelization
Directive-based shared memory parallelization is more labor intensive than automatic parallelization, but it does allow for more control over which loops get parallelized and more options for scheduling individual loops. For the example loops, the following directives were used:
LOOP A
!$OMP PARALLEL DO PRIVATE(J),DEFAULT(SHARED),SCHEDULE(GUIDED)

LOOP B
!$OMP PARALLEL DO PRIVATE(J,I1,I2,I3,I4,I5,I6)
!$OMP* REDUCTION(+:F3)
!$OMP* DEFAULT(SHARED),SCHEDULE(GUIDED)

LOOP C
!$OMP PARALLEL DO PRIVATE(J,T1)
!$OMP* REDUCTION(+:F1,F4)
!$OMP* DEFAULT(SHARED),SCHEDULE(GUIDED)

The loops use guided scheduling, which initially divides the iteration space into one chunk equal to NX divided by N and then exponentially decreases the chunk size to a minimum size of 1. This scheduling algorithm, which allows a processor that found more of its data in L1 or L2 cache to get another chunk of data quickly while a processor requiring many L3 or memory references is still working, is often the most efficient.


Compiling for directive-based parallelization uses the following command options:
xlf90_r -c -qsmp=omp -O3 -qarch=pwr4 -qtune=pwr4 -qfixed sub.f

The option noauto is implied when -qsmp=omp is used. However, with -qsmp=omp, -qhot is not implied.

7.1.5 Measured SMP performance
The three example loops were run from within an application and timed using realistic data for 200 repetitions with NX set to 1000000. The environment settings used were:
export AIXTHREAD_SCOPE=S
export SPINLOOPTIME=100000
export YIELDLOOPTIME=40000
export OMP_DYNAMIC=false
export MALLOCMULTIHEAP=1

All results were run on a two-MCM, eight-processor pSeries 690 HPC. The results for each of the three loops are shown separately in Table 7-1, Table 7-2, and Table 7-3 on page 133.
Table 7-1 Loop A parallel performance, elapsed time

Processors   -O3     -qsmp=auto   speedup   -qsmp=omp   speedup
1            1.211   1.378        0.88      1.238       0.98
2                    0.815        1.49      0.706       1.72
4                    0.481        2.52      0.431       2.81
6                    0.364        3.33      0.335       3.61
8                    0.307        3.94      0.262       4.62

Table 7-2 Loop B parallel performance, elapsed time

Processors   -O3     -qsmp=auto   speedup   -qsmp=omp   speedup
1            4.190   4.758        0.88      4.268       0.98
2                    2.000        2.10      2.068       2.03
4                    1.143        3.67      1.146       3.66
6                    0.885        4.73      0.883       4.75
8                    0.787        5.32      0.720       5.82

Table 7-3 Loop C parallel performance, elapsed time

Processors   -O3     -qsmp=auto   speedup   -qsmp=omp   speedup
1            2.795   3.056        0.91      2.830       0.99
2                    1.430        1.95      1.541       1.81
4                    0.906        3.08      0.933       3.00
6                    0.719        3.89      0.736       3.80
8                    0.633        4.42      0.588       4.75

The data shows that, for these test loops, there is little difference between automatic and manual parallelization. Some overhead due to parallelization can be seen by comparing the single processor results. Note that the compiler may be using different optimization strategies when creating parallel code as well. The reduction sums in loops B and C require the creation of critical sections in which only one processor can update the reduced variable at a time. These critical sections can significantly reduce parallel efficiency if the amount of work in the loop is too small or too many processors are used. The conclusions from this analysis are:
SMP parallelization does result in improved run times.
SMP parallelization is easy to implement.
Overall speedups are limited for small loops, especially when there are reduction sums.

7.2 MPI in an SMP environment
This section examines how existing MPI programs, written for distributed memory systems, can make the best use of both SMP and distributed memory systems. We do not attempt to provide a detailed discussion of distributed memory parallelization or the use of MPI, and refer the reader to the IBM Parallel Environment for AIX product documentation and the IBM Redbook Scientific Applications in RS/6000 SP Environments, SG24-5611.


In the following discussion, processes executing in parallel and communicating using MPI calls are referred to as tasks. A number of different scenarios are considered:
MPI only
The MPI implementation in IBM Parallel Environment for AIX (PE) can use several protocols for communication between tasks. Internet Protocol (IP) can be used between tasks on the same node and between tasks on different nodes. This incurs relatively high latencies and IP overheads. In the IBM RS/6000 SP environment with nodes attached to one of the types of SP switch, another protocol known as user space can be used for communication between tasks. Depending on the type of switch involved and the version and release of PE, there may be restrictions on the number of user space tasks allowed per node. At the time of writing, the SP Switch2 with PE 3.1 can support up to 16 user space tasks per POWER3 node. User space significantly reduces the overhead and latency when compared to IP, but it may still be higher between processes on the same node than using shared memory.
MPI communication calls can also use shared memory for message passing between MPI tasks on the same node. The PE MPI library is capable of using shared memory automatically. In a cluster or SP configuration of POWER3 nodes, IP or user space would be used between tasks on other nodes. In this case, overall performance can still be limited by communication between the nodes. This could be reduced for group operations (such as broadcast) by having one processor per node handle all the inter-node communication. This process would use shared memory to collect and distribute data to other processes on the same node.
Since the different tasks on the same node are different processes, they have different address spaces and the shared memory MPI library will communicate through a shared memory segment. This means a double copy of the data (into and out of the shared memory segment). It would be possible for each task to keep its data in the shared memory segment and not use MPI for this communication, but this would require some degree of reprogramming. The advantage of using the PE shared memory MPI library is that no reprogramming is required.


In order to use shared memory for communication calls within a shared memory machine, one of the following two procedures should be followed:
- Use the following PE environment variable settings:
export MP_SHARED_MEMORY=yes
export MP_WAIT_MODE=poll

MP_WAIT_MODE is not essential in order to use shared memory, but setting it to poll is recommended for performance in most scenarios where MPI tasks will use shared memory.
- Use the following command line arguments, either with the parallel program or with the poe command, depending on the way the parallel program is started:
-shared_memory yes -wait_mode poll

MPI and SMP Fortran
In this scenario, also known as the hybrid or mixed-mode programming model, there are fewer MPI tasks than processors per node. Shared memory parallelization techniques such as OpenMP directives can be used to execute sections of the code between MPI calls in parallel. This means that each MPI task has multiple threads executing in parallel, and the aim would be to keep all of the processors busy all of the time. In practice, it will be difficult to achieve this during the MPI communication phases of the program. However, the benefit of this programming model is that it can be used to reduce the amount of communication traffic between nodes, especially during global communications, by reducing the total number of MPI tasks. This could be especially important for large multi-processor systems such as 32-way POWER4 systems clustered together. The overhead of shared memory parallelization is similar to that of MPI data transfers, so it is desirable to parallelize at a sufficiently coarse granularity to keep the effect of this overhead small. Some recoding may be required to achieve this hybrid parallelization. A small skeleton illustrating this model is sketched at the end of this section.
MPI and explicit large chunk threads
In this scenario, there is only one MPI process per node. The initial process (or master thread) creates threads which, instead of issuing MPI calls, use pthread techniques to transfer data between themselves and the master thread. The master thread uses MPI to transfer all data between the nodes.

Data does not have to be copied between threads since they all use the same address space. Synchronization can be achieved either with standard pthread calls or, with even less overhead, by using spin loops and the atomic fetch_and_add function (which guarantees that only one thread at a time can update a variable).
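As an illustration, the atomic update mentioned above can be done on AIX with the fetch_and_add routine from sys/atomic_op.h. The worker-count logic here is purely hypothetical and is only meant to show the idea of a spin-based barrier.

#include <sys/atomic_op.h>

/* shared between the threads; volatile so the spin loop rereads it */
static volatile int done_count = 0;

/* called by each worker thread when its chunk of work is finished */
void signal_done(void)
{
    fetch_and_add((atomic_p) &done_count, 1);   /* atomic increment */
}

/* master thread: spin until all nworkers have checked in */
void wait_for_workers(int nworkers)
{
    while (done_count < nworkers)
        ;   /* spin; on a dedicated system this is cheaper than sleeping */
    done_count = 0;
}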


The total number of messages between nodes is reduced and hence delays due to latency are reduced. Since the master thread handles all messages, it should perhaps be coded to do less work than the other threads. However, all of this may imply considerable reprogramming. The program may have used the MPI task ID to create its arrays and organize its data. The threads will have to arrange this differently, because they share the same task ID and are using the same address space.
The advantages and disadvantages of these scenarios are summarized in Table 7-4.
Table 7-4 Advantages and disadvantages of message passing techniques

Programming model             Advantages                              Disadvantages
MPI only                      No program changes. Same coding for     Double copy between processes
                              calls between all tasks, uses shared    on same node.
                              memory on same node.
Hybrid mode                   MPI exchanges reduced. Can reduce       May not be possible to fully use
                              off-node communication.                 the CPUs. Some reprogramming
                                                                      required.
MPI and large chunk threads   MPI exchanges reduced. Exchanges        Considerable reprogramming may
                              and overhead between threads reduced.   be required.

To summarize, all of the scenarios can be useful depending on the particular application requirements and the target environment. Descending the table, the efficiency of the solution increases, but the amount of reprogramming required also increases. To gain addressability to 8 GB with a 32-bit MPI, the sPPM ASCI benchmark code used the hybrid mode. More information about this can be obtained from:
http://www.llnl.gov/asci_benchmarks/asci/limited/ppm/sppm_readme.html
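A minimal sketch of the hybrid model described above, with one MPI task per node and OpenMP threads inside each task, could look like the following; the array size and work loop are only placeholders.

#include <stdio.h>
#include <mpi.h>

#define N 1000000

static double a[N];

int main(int argc, char **argv)
{
    int rank, ntasks, i;
    double local = 0.0, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* each MPI task works on its own slice of the data; the loop itself
       is shared-memory parallel across the threads of the task */
    #pragma omp parallel for reduction(+:local)
    for (i = 0; i < N; i++) {
        a[i] = a[i] * 1.5;
        local += a[i];
    }

    /* only ntasks values cross the switch, not ntasks*nthreads */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f over %d tasks\n", global, ntasks);

    MPI_Finalize();
    return 0;
}

Such a program would be compiled with the thread-safe PE compile script and OpenMP enabled, for example mpcc_r -qsmp=omp, so that both the MPI library and the OpenMP runtime are thread safe.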

7.3 Programming with threads
The thread programming paradigm is a flexible, low-level model for distributing the work of a given application into multiple streams of execution that share a single memory address space. Each thread can execute its own function and can be controlled independently. In the context of high-performance computing, threads are used to distribute a workload onto multiple processors of an SMP system, rather than to dispatch many threads onto a single processor, as is common for graphical user interfaces. There is a standardized application interface for threads called Pthreads (POSIX threads) that is part of the UNIX specification. The redbook Scientific Applications in RS/6000 SP Environments, SG24-5611, provides a compact introduction to Pthreads for multi-processor applications on the AIX platform. The corresponding AIX reference manual, General Programming Concepts: Writing and Debugging Programs (part of the AIX Programming Guides), can be found at:
http://www.ibm.com/servers/aix/library/techpubs.html

Programming explicitly with threads is not recommended for the casual user. In many cases the benefits of multiple threads can be more easily obtained by using the automatic parallelization capabilities of the compiler or OpenMP directives.

7.3.1 Basic concepts
Threads can be described as light-weight processes. Each thread has its own private program counter, stack, and registers. The memory state and file descriptors are shared. For a brief overview of the usage of Pthreads, a simple hello world program is shown in Example 7-1 on page 138. Although this program does no complicated work, it provides a useful template for thread creation. A Pthread program begins to execute as a single thread. Additional threads are created and terminated as necessary to concurrently schedule work onto the available processors. In this example, the initial thread creates three worker threads, which print hello messages and terminate. As will be familiar to message passing programmers, it is good practice for the master thread (or MPI task) to take part in the computation. This yields good load-balancing when N threads are dispatched on N processors. A threaded application should be compiled and linked with the _r-suffixed invocation of the C compiler, for example xlc_r, which defines the symbol _THREAD_SAFE and links with the Pthreads library.



Example 7-1 Pthread version of a hello world program

#include <pthread.h>
#include <stdio.h>

void * thfunc(void * arg)
{
   int id;

   id = *((int *) arg);
   printf("hello from thread %d \n", id);
   return NULL;
}

int main(void)
{
   pthread_t thread[4];
   pthread_attr_t attr;
   int arg[4] = {0,1,2,3};
   int i;

   /* setup joinable threads with system scope */
   pthread_attr_init(&attr);
   pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
   pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

   /* create N-1 worker threads */
   for (i=1; i<4; i++) {
      pthread_create(&thread[i], &attr, thfunc, (void *) &arg[i]);
   }

   /* let the master thread also take part in the computation */
   thfunc((void *) &arg[0]);

   /* wait for all other threads to finish */
   for (i=1; i<4; i++) {
      pthread_join(thread[i], NULL);
   }

   return 0;
}

Threads are created using the pthread_create function. This function has four arguments: a thread identifier, which is returned upon successful completion, a pointer to a thread-attributes object, the function that the thread will execute, and the argument of the thread function. The thread function takes a single pointer argument (of type void *) and returns a pointer (of type void *). In practice, the argument to the thread function is often a pointer to a structure, and the structure may contain many data items that are accessible to the thread function. In this example, the argument is a pointer to an integer, and the integer is used to identify the thread.

The previous simple example creates a fixed number of threads. In many applications, it is useful to have the program decide how many threads to create at run time while providing the ability to override the default behavior by setting an environment variable. For example, for OpenMP programs the default is to create as many threads as there are processors available. In AIX, you can get the number of online processors by calling the sysconf routine from libc, as shown in Example 7-2.
Example 7-2 Sample code for setting the number of threads at run time

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
...
char * penv;
int ncpus, numthreads;
...
/* get the number of online processors */
ncpus = sysconf(_SC_NPROCESSORS_ONLN);
if (ncpus < 1) ncpus = 1;

/* check the NUMTHREADS environment variable */
penv = getenv("NUMTHREADS");
if (penv == NULL)
   numthreads = ncpus;
else
   numthreads = atoi(penv);
...
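In practice, the per-thread argument is frequently a small structure. The following sketch combines this with the thread count determined in Example 7-2; the work_arg fields, the worker() routine, and the block partitioning are hypothetical and only illustrate the pattern:

#include <pthread.h>

/* hypothetical per-thread argument block */
struct work_arg {
   int id;            /* logical thread number        */
   int start, end;    /* index range for this thread  */
   double *data;      /* shared input array           */
};

void * worker(void * arg)
{
   struct work_arg *w = (struct work_arg *) arg;
   /* ... process w->data[w->start .. w->end-1] ... */
   return NULL;
}

/* creation loop, assuming numthreads was set as in Example 7-2 */
void start_workers(int numthreads, int n, double *data,
                   pthread_t *thread, struct work_arg *args)
{
   int i, chunk = n / numthreads;

   for (i = 0; i < numthreads; i++) {
      args[i].id    = i;
      args[i].start = i * chunk;
      args[i].end   = (i == numthreads-1) ? n : (i+1) * chunk;
      args[i].data  = data;
      pthread_create(&thread[i], NULL, worker, (void *) &args[i]);
   }
}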

A thread terminates implicitly when the execution of the thread function is completed. A thread can terminate itself explicitly by calling pthread_exit. It is also possible for one thread to terminate other threads by calling the pthread_cancel function. The initial thread has a special property: if the initial thread reaches the end of its execution stream and returns, the exit routine is invoked and, at that time, all threads that belong to the process will be terminated. However, the initial thread can create detached threads and then safely call pthread_exit. In this case, the remaining threads will continue execution of their thread functions and the process will remain active until the last thread exits.

In many applications, it is useful for the initial thread to create a group of threads and then wait for them to terminate before continuing or exiting. This can be achieved with threads that are joinable (see pthread_attr_setdetachstate). The AIX default is detached. The function pthread_join suspends the calling thread until the referenced thread has terminated. The system scope attribute is appropriate when N threads are supposed to run on N processors concurrently.
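The following fragment sketches the detached alternative described above: the initial thread creates detached threads and then calls pthread_exit() itself, so the process remains active until the last worker returns. This sketch is not part of the original examples; it reuses the thfunc() worker from Example 7-1.

#include <pthread.h>

extern void * thfunc(void * arg);   /* worker function as in Example 7-1 */

int main(void)
{
   pthread_t tid;
   pthread_attr_t attr;
   static int arg[3] = {0, 1, 2};
   int i;

   pthread_attr_init(&attr);
   pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
   pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

   for (i = 0; i < 3; i++) {
      pthread_create(&tid, &attr, thfunc, (void *) &arg[i]);
   }

   /* terminate only the initial thread; the detached workers keep running
      and the process exits when the last of them returns                 */
   pthread_exit(NULL);
}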

Synchronization
As with OpenMP directive-based parallelization, the distinction between thread-private and shared variables is essential for the correctness and performance of a program. Access to shared variables has to be synchronized to avoid conflicts and to assure correct results. The use of synchronization should be balanced against the performance and scalability degradation it causes. A major difficulty of parallel programming for shared memory is to find the right balance of local and global variables, since the scoping defines which variables are private or shared. Contention for global variables, as in a reduction sum, is a major source of performance problems. The introduction of temporary local variables often helps to resolve such problems.

In multi-threaded applications, the update of shared memory locations is usually protected with mutex (mutual exclusion) locks. The operating system ensures that access to the shared data is serialized. At a given time only one thread can enter the region between lock and unlock to modify the data. The usage of mutex locks is shown in Example 7-3. This example demonstrates how to construct a basic barrier synchronization function. It is left as an exercise for the reader to study a Pthread programming reference in order to understand this complex construct.
Example 7-3 Usage of mutex locks to modify shared data structures

#include <pthread.h>

int barrier_instance = 0;
int blocked_threads = 0;
pthread_mutex_t sync_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t sync_cond = PTHREAD_COND_INITIALIZER;

int syncthreads(int nth)
{
   int instance;

   /* the calling thread acquires the lock, other threads block */
   pthread_mutex_lock(&sync_lock);

   /* the thread with the lock proceeds */
   instance = barrier_instance;
   blocked_threads++;

   if (blocked_threads == nth) {
      /* notify all threads that the sync condition is met */
      blocked_threads = 0;
      barrier_instance++;
      pthread_cond_broadcast(&sync_cond);
   }

   while (instance == barrier_instance) {
      /* release the lock and wait here */
      pthread_cond_wait(&sync_cond, &sync_lock);
   }

   /* all threads call the unlock function and return */
   pthread_mutex_unlock(&sync_lock);
   return(0);
}

Frequent mistakes
The most common mistakes of thread programming shown in this section occur more frequently than we would like:

Process exits before all threads have finished
The following code is not correct because when a process exits or returns from main(), all of the process memory is deallocated and all threads belonging to the process are terminated.
#include <pthread.h>

int main(void)
{
   pthread_t tid[NUMBER_OF_THREADS];
   ...
   /* create threads */
   for (i=0; i<NUMBER_OF_THREADS; i++) {
      pthread_create(&tid[i], NULL, thfunc, NULL);
   }
   /* incorrect: returning here terminates all threads immediately */
   return 0;
}
Parent thread exits before child
A problem similar to the previous one; do not forget to call the pthread_join routine before exiting.



Dangling pointer
The following code fragment is incorrect because errorcode resides on the local stack of the thread and will be freed when the thread is destroyed, leading to a dangling pointer.
void * thfunc(void * arg)
{
   int errorcode;
   /* do something */
   /* if error condition detected, errorcode = something; */
   pthread_exit(&errorcode);
   ...
}

Using Pthreads in Fortran
On IBM systems, a Fortran version of the Pthreads interface is available in addition to the standard C Pthreads interface. This makes it relatively simple to introduce threads into numerically intensive Fortran applications. The reader should recognize that the Fortran interface is not backed by an industry-wide standard.

Pthread constructs can be used within OpenMP programs in rare instances when some direct control of thread management or data access synchronization is necessary. In such a mixed mode, the OpenMP runtime environment will create and manage all threads used for the execution of OpenMP parallel constructs. Explicit Pthread creation is the responsibility of the programmer.

The IBM Fortran version of the Pthreads API is similar to the C version, where the function names and data types from C are preceded with f_. Fortran programs that use explicit Pthread routines must have a statement to include the f_pthread module. In general, the -qnosave option is essential for correct behavior of a program. A number of Fortran routines, including f_pthread_create, have call sequences that differ from the standard C version. For example, the f_pthread_create function takes an additional parameter to specify properties of the argument to the thread function. The IBM Fortran implementation of Pthreads is described in the XL Fortran for AIX Language Reference, SC09-2867.



7.3.2 Coding and performance considerations
The following performance considerations apply to both Pthread hand-coded programs and OpenMP-based programs.

Thread creation
The time required for the creation of a thread is on the order of 100 microseconds. You should only create, awaken, or terminate threads that execute for a significantly longer time.
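A simple way to check this order of magnitude on a particular system is to time repeated create/join pairs. The following sketch is illustrative only and is not taken from the original text:

#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

void * empty(void * arg) { return NULL; }

int main(void)
{
   struct timeval t0, t1;
   pthread_t tid;
   int i, n = 1000;
   double us;

   gettimeofday(&t0, NULL);
   for (i = 0; i < n; i++) {
      pthread_create(&tid, NULL, empty, NULL);
      pthread_join(tid, NULL);
   }
   gettimeofday(&t1, NULL);

   /* average cost of one create/join pair in microseconds */
   us = ((t1.tv_sec - t0.tv_sec) * 1.0e6 + (t1.tv_usec - t0.tv_usec)) / n;
   printf("create+join: %.1f microseconds per thread\n", us);
   return 0;
}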

Lock contention
Shared data structures that are modified within an innermost loop of a thread and need to be protected against concurrent access can cause severe performance degradation. The following example shows a loop that counts the number of elements in a vector with a value equal to one. The counter is a shared variable that has to be protected by a mutex lock when it is incremented. Assuming that every second element is equal to one, this example takes more than 30 times longer to execute on four processors than an equivalent sequential loop, which does not need to call the Pthread lock routines.
int shared_count=0;
...
void * thfunc(void *id)
{
   ...
   for (i=start; i<end; i++) {
      if (a[i] == 1) {
         /* count_lock: mutex protecting shared_count */
         pthread_mutex_lock(&count_lock);
         shared_count++;
         pthread_mutex_unlock(&count_lock);
      }
   }
   ...
}
Without increasing the default values of the SMP runtime variables SPINLOOPTIME or YIELDLOOPTIME, the performance is even slower. For details on the AIX environment variables that determine the runtime behavior of a thread when waiting for a lock (spin, yield, sleep), see Section 7.1.1, "SMP runtime behavior" on page 126.



In this simple case, the performance problem can be resolved with the help of a local counter variable, which turns the dramatic slowdown into the expected parallel speedup.
int shared_count=0;
...
void * thfunc(void *id)
{
   int private_count=0;
   ...
   for (i=start; i<end; i++) {
      if (a[i] == 1) private_count++;
   }
   /* one lock/unlock per thread instead of one per element */
   pthread_mutex_lock(&count_lock);
   shared_count += private_count;
   pthread_mutex_unlock(&count_lock);
   ...
}
Avoiding locks and OpenMP critical sections
In many multi-threaded programs, a barrier synchronization routine can help to reduce extensive use of locks or OpenMP critical sections. For example, suppose that multiple threads are working to fill out different entries of a table and, once that is done, each thread needs read access to the table for the next step. A barrier synchronization point would ensure that no thread could proceed to the next step until all threads have finished filling out the table. Instead of working directly with the low-level pthread_mutex functions, a higher level thread synchronization function is very useful (a usage sketch follows the list below).

If you cannot avoid a lock or critical section:

- Reduce the amount of time a lock is held.
- Move all unnecessary code outside a critical section.
- Combine access to shared data in order to reduce the number of single lock/unlock calls.
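As an illustration, the table-filling scenario described above could use the syncthreads() barrier from Example 7-3 as follows. This is a sketch only; the fill_entry() and use_table() routines, the targ structure, and TABLE_SIZE are hypothetical names, not part of the original example.

#include <pthread.h>

#define TABLE_SIZE 1024                   /* hypothetical table size        */

extern int syncthreads(int nth);          /* barrier from Example 7-3       */
extern void fill_entry(int i);            /* hypothetical: fills table[i]   */
extern void use_table(int id);            /* hypothetical: reads the table  */

struct targ { int id; int nth; };

void * thfunc(void * arg)
{
   struct targ *t = (struct targ *) arg;
   int i;

   /* step 1: each thread fills its own entries, no lock needed */
   for (i = t->id; i < TABLE_SIZE; i += t->nth) {
      fill_entry(i);
   }

   /* wait until the whole table is complete */
   syncthreads(t->nth);

   /* step 2: every thread may now read any entry */
   use_table(t->id);
   return NULL;
}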

False sharing
False sharing of a cache line occurs when multiple threads on different processors with private caches modify independent data structures that happen to belong to the same cache line. In this situation, the cache line of a particular CPU is flushed out due to another processor's store operations and has to be transferred repeatedly from remote caches. This is likely to happen when, for example, a thread-local variable is stored in a global array indexed by the logical thread number, because this causes the data to be located close together in memory.



By using appropriate padding, false cache-line sharing can be avoided. Referring to the preceding example, the following implementation cures the lock contention problem but suffers from false sharing. On our machine, this leads to a moderate slowdown (of about 1.5) on four processors compared to the sequential program.
int shared_count=0;
int private_count[4]={0,0,0,0};
...
void * thfunc(void *id)
{
   ...
   myid = *((int *)id);
   /* per-thread counters are adjacent in memory: false sharing */
   for (i=start; i<end; i++) {
      if (a[i] == 1) private_count[myid]++;
   }
   pthread_mutex_lock(&count_lock);
   shared_count += private_count[myid];
   pthread_mutex_unlock(&count_lock);
   ...
}
By introducing appropriate padding space to fill up a cache line of 128 bytes, false sharing can be eliminated.
struct count {
   int private_counter;
   char pad[124];        /* pad each counter to a full 128-byte cache line */
} counter[4];
int shared_count=0;
...
void * count_ones(void *id)
{
   ...
   myid = *((int *)id);
   for (i=start; i<end; i++) {
      if (a[i] == 1) counter[myid].private_counter++;
   }
   pthread_mutex_lock(&count_lock);
   shared_count += counter[myid].private_counter;
   pthread_mutex_unlock(&count_lock);
   ...
}
A similar false sharing problem can occur, for example, when storing a set of mutex lock objects in a global array. The following Fortran code avoids false sharing. The type pthread_mutex_t is declared in /usr/include/sys/types.h; note that in 64-bit mode it has a different size.



      use f_pthread
      integer, parameter :: maxthreads=8

      type plock
        sequence
        type(f_pthread_mutex_t) :: lock
        integer :: pad(19)
      end type plock

      type(plock) :: locks(maxthreads)
      common /global/ locks

As a rule of thumb, in an SMP program, global data whose access is not serialized should not be placed close together in memory.

Reducing OpenMP overhead
In general, an SMP parallel program generates some computational overhead. Even when executed by just a single thread, a certain amount of overhead can be observed compared to the execution of the equivalent sequential (non-threaded) version of the program. For discussion purposes, call this the sequential overhead. Thread-safe system libraries, such as those for I/O, that are referenced by an _r-suffixed compiler invocation may also contribute to this overhead. For a fine-grain OpenMP program, the sequential overhead can be significant. If the overhead exceeds, for example, 30 percent of the elapsed time, this can be an indication of inefficient use of OpenMP directives.

Global variables that need to be scoped threadprivate often cause such problems. The following example is taken from a real application. To support the threadprivate pragma (or directive in Fortran), the compiler generates calls to the internal function _xlGetThreadValue. These calls are relatively expensive. In general, it is a good idea to reduce the number of calls by packing several threadprivate variables into a single structure. This way, only one call is encountered per dynamic path through each function where the threadprivate variables are referenced; otherwise, one call is encountered per threadprivate variable. As an example, consider the following lines of code:
static int n_nodes, num_visits;
static Node *node_array;
static int *val, *stack;
static Align_info *align_array;

#pragma omp threadprivate( \
   n_nodes, num_visits,    \
   node_array,             \
   val, stack,             \
   align_array             \
)



This code could be substituted as follows to improve performance. Further changes elsewhere in the code are necessary to make this complete.
struct nn {
   int n_nodes, num_visits;
   Node *node_array;
   int *val, *stack;
   Align_info *align_array;
} glob;
#pragma omp threadprivate(glob)

7.3.3 The best approach for shared memory parallelization
As discussed, there are many different ways to parallelize a program for shared memory architectures. The appropriate approach depends on several considerations, for example: Does a sequential code already exist, or will the parallel program be written from scratch? What are the parallel programming skills of the project team? The following are some pros and cons of the different paradigms:

Auto-parallelization by the compiler:
- Easy to implement (just a few directives)
- Enables teamwork easily
- Limited scalability because data scoping is neglected
- Compiler dependent (even on the release of a particular compiler)
- Not necessarily portable

OpenMP directives:
- Portable
- Potentially better scalability than auto-parallelization
- Uniform memory access is assumed

RYO (subset or mixture of OpenMP and Pthreads, or UNIX fork() and exec() parallelization, or platform-specific constructs):
- Might enable teamwork
- Needs a well-tested concept to assure performance and portability
- Not necessarily portable

Pthreads:
- Portable
- Potentially best scalability
- Needs experienced programmers



SMP-enabled libraries:
- Least effort
- Limited flexibility

7.4 Parallel programming with shared caches
On all models of POWER4 microarchitecture machines currently available, each processor has a dedicated L1 cache. As discussed in Section 2.4.3, "L2 cache" on page 17, the L2 cache organization is such that each L2 cache unit is shared between two processors in the pSeries 690 Turbo and pSeries 690 Model 681, but in the pSeries 690 HPC each processor has a dedicated L2 cache. In all models the L2 cache remains the level of coherency. The sharing of the L2 cache raises a number of considerations, such as:

- Processors that share an L2 cache will compete for the bandwidth from the L2 cache to L3 and memory. However, the effect of this is unlikely to differ very much from the effect seen by independent processes sharing the bandwidth from L2.

- In the configurations where two processors share an L2 cache, and each is accessing different memory addresses, then for each cache line loaded into L2 there may be conflict.

- Interference when accessing the same cache lines. In the extreme case, two threads of a shared memory application may access the same cache line. In this case, there may be a benefit to the shared L2 cache configuration, since there will be a higher percentage of cache hits in the L2 and there will be fewer cache snooping events.

An example of this would be the following, rather artificial, loop:
!$OMP PARALLEL DO PRIVATE(s,j,time1,time2), SHARED(a,b,s1,ttime), &
!$OMP& SCHEDULE(STATIC,1)
      do i=1,m
         do j=1,n
            b(i,j)=a(i,j)+a(i,j)*c1
         end do
      end do
!$OMP END PARALLEL DO

In this example, when parallelized across two threads, each thread will access alternate elements of the array.



Measurements were performed on a 32-way pSeries 690 Turbo, with the arrays dimensioned to fit in L2 but not in L1, and the results provided in Table 7-5 were obtained.
Table 7-5 Shared memory cache results, pSeries 690 Turbo

  Threads   Unshared cache time [s]   Shared cache time [s]   Shared / unshared
  1         14.76                     14.40                   0.98
  2         14.11                     8.63                    0.61
  4         7.57                      8.02                    1.06
  8         6.78                      5.69                    0.84
  16        6.60                      5.75                    0.87

The unshared cache times were obtained by binding the threads of the program to alternate processors: thread one was bound to processor one, thread two was bound to processor three, and so on. The shared cache times were obtained by binding the threads to adjacent processors. The times are the average of the times obtained in three separate runs.

Apart from the four-thread case, it seems that for this example there is a clear benefit in the shared cache. This is most noticeable with two threads. With a larger number of threads, the amount of work done by the individual threads is reduced and so the overhead of running in parallel starts to dominate the time.

Further examples where the loop appeared similar to the following were also run:
!$OMP PARALLEL DO PRIVATE(s,j), SHARED(a,b,s1), &
!$OMP& SCHEDULE(STATIC,1)
      do i=1,m
         do j=1,n
            s=s+a(i,j)*b(i,j)
         end do
         s1(i)=s
      end do
!$OMP END PARALLEL DO

These did not show any difference in speed between the shared and dedicated L2 cache configurations. We also tested two examples where we compared the performance of two processes accessing a shared cache line, where the cache line was either in a single L2 cache or moved between L2 caches.



The test programs we used forked child processes, which then bound themselves to specific processors. Each process acquired a semaphore, updated a counter, and released the semaphore. The code kernel is as follows:
for (i=0; i<n; i++) {
   msem_lock(&sem, 0);     /* acquire the semaphore */
   counter++;              /* update the counter    */
   msem_unlock(&sem, 0);   /* release the semaphore */
}
The msem_ routines are part of the AIX libsys.a library. They implement an atomic lock using the lwarx instruction. In the first example (Table 7-6), the counter and semaphore were in the same cache line.
Table 7-6 Counter and semaphore sharing a cache line

  Case                                        Time [s]
  Single process (no sharing)                 3.36
  Two processes, L2 cache shared              8.95
  Two processes, L2 caches on same MCM        13.31
  Two processes, L2 caches on separate MCMs   13.40

In the second example (Table 7-7), the counter and semaphore were in separate cache lines.
Table 7-7 Counter and semaphore in separate cache lines

  Case                                        Time [s]   Ratio to shared counter/semaphore
  Single process (no sharing)                 3.34       0.99
  Two processes, L2 cache shared              8.85       0.98
  Two processes, L2 caches on same MCM        12.75      0.96
  Two processes, L2 caches on separate MCMs   12.82      0.95

As expected, there is a significant performance benefit when two processes share data in the L2 cache. There is also a benefit in separating data structures and the semaphores that control them.



The previous code example makes use of a sleeping semaphore. In addition, the amount of work done on the cache line is relatively small. We created a second example using spin/wait instead of sleeping semaphores and increased the amount of work done on the cache line. The semaphore and the shared data structure were in separate cache lines. The relevant code segments are:
struct shared_data {
   long long n;
   int counter;
   int i;
} *p_shared;
msemaphore *p_shared_sem;
....
for (i=0; i<n; i++) {
   msem_lock(p_shared_sem, 0);
   (p_shared->counter)++;
   p_shared->n=1;
   /* now do some real work */
   for (j=0; j<200; j++) {
      p_shared->n = p_shared->n * (p_shared->n + 1);
   }
   msem_unlock(p_shared_sem, 0);
}

We observed the results provided in Table 7-8.
Table 7-8 Heavily used shared cache line performance

  Case                                           Time [s]
  Single process (no sharing)                    38.72
  Two processes, L2 cache shared                 75.91
  Two processes, L2 caches on same MCM           84.57
  Two processes, L2 caches on different MCMs     84.51
  Four processes on two chips on the same MCM    143.27
  Four processes, one on each chip on an MCM     156.03
  Four processes, each on a different MCM        156.12



The shared L2 cache enables two processes to share the workload very efficiently. Two processes run in twice the time of one process. When the cache is not shared, there is an 18 percent penalty. This is independent of whether or not the processes run on the same MCM. We also examined the case where four processes ran on two chips, that is, four processors sharing two L2 caches, and compared this with unshared L2 caches. We see a small benefit in sharing the L2 cache, but once the L2 cache is not shared, the impact of being on or off the MCM is the same as for two processes.

When testing four processes, we observed that the run times of the individual processes varied (the results above are averages for two or four processes). We saw that three processes would run in approximately the same time and one would run in approximately 60 percent to 75 percent of the others. We were not able to investigate this effect for this document. We assume it is either an artifact of the operating system scheduler or an error in the test program. Note that this effect was not observed in the two process test runs.



Chapter 8. Application performance and throughput
This chapter examines the system performance achievable from running multiple copies of a program or programs compared to a single copy of a program. On a multiple processor machine (or node), throughput issues include:

Processor utilization
Oversubscribing processors (for example, 12 concurrent jobs on an 8-processor machine) when running processor-bound jobs usually does not increase total processor time significantly as measured by user processor time and system processor time, since the operating system efficiently schedules the jobs to run. Processor-bound jobs refer to programs that are not bottlenecked by any other major system resources.

Memory bandwidth utilization
A 32-processor pSeries 690 Turbo has a very high aggregate memory bandwidth of approximately 200 GB/s. For many workloads, this is sufficient to sustain 32 concurrent processes with a performance per process close to that obtained if the processes were to run standalone. However, a standalone process is capable of driving the memory bandwidth at a far greater rate than 1/32nd of the total. As detailed in 8.4.3, "Memory stress effects on throughput" on page 162, on a 32-way pSeries 690 Turbo or a 16-way pSeries 690 HPC, a standalone application can use approximately 1/8th of the total bandwidth. This is a significant benefit of the pSeries 690 design in cases where such memory-stressing applications can be run in a mixed workload together with low memory stress applications. However, if 32 copies of a job that does use maximum bandwidth are run concurrently on a pSeries 690 Turbo, each job will necessarily take at least 4 times as long as a standalone job. For 16 processes on a 16-way pSeries 690 HPC, it would be at least twice as long. Most applications, when run standalone, use far less than the maximum bandwidth, and there are many techniques, such as blocking, available for reducing the extent of memory stress. Some of these techniques are described in 3.1.4, "Tuning for the memory subsystem" on page 34. Nevertheless, applications that, run standalone, use more than their proportionate share of the total bandwidth will necessarily run more slowly when every processor is loaded with them. For such workloads, the pSeries 690 HPC is likely to be a more appropriate configuration than a pSeries 690 Turbo.

Shared L2 cache
On a pSeries 690 Model 681 or pSeries 690 Turbo, two processors on a single chip share the 1440 KB L2 cache. When two similar jobs are running on the same chip they can effectively utilize only half of the L2 cache and the L2 to L3 cache bandwidth, and it could be anticipated that there will be some performance degradation. In a pSeries 690 HPC, there is only one processor that can access the L2 cache, which implies more predictable behavior.

I/O channels
When a program has high I/O requirements, the I/O channels and subsystems often prove to be the performance bottleneck. When multiple copies of high I/O jobs are run, performance can seriously degrade unless attention is given to separating or hiding I/O transfers. See Section 6.3, "Modular I/O (MIO) library" on page 120, which describes one useful tool that can be used to hide I/O transfers.

The rest of this chapter shows some examples of throughput testing done on POWER4 pSeries 690 Model 681 systems.



8.1 Memory to memory copy
Figure 8-1 shows the performance of a simple copy (a[i] = b[i]) for a range of array sizes and numbers of copies. Maximum throughput is achieved when the array fits into the L2 cache. The system was configured with two MCMs and four memory books. Tuning by using methods such as
a[i] = b[i] + zero*a[i];

does not make any significant difference to copy performance, whereas this technique was often beneficial on POWER3.

Figure 8-1 Memory copy performance

Figure 8-2 on page 156 shows the corresponding performance when using the C library memcpy() function. Performance is less than that achieved in the case above because the loads and stores in the copy loop are 8 bytes (Fortran REAL*8), whereas the memcpy() function loads and stores bytes up to a word boundary, then loads and stores words (4 bytes), and completes the copy with bytes if required.
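The two copy variants being compared amount to the following kernels, shown here as a C sketch (the measured code used a Fortran REAL*8 copy loop; the array length N is illustrative and was varied across the measurements):

#include <string.h>

#define N 16384                 /* array length; varied in the measurements */

static double a[N], b[N];

/* 8-byte load/store copy, equivalent to the Fortran a(i) = b(i) loop */
void copy_loop(void)
{
   int i;
   for (i = 0; i < N; i++) a[i] = b[i];
}

/* C library copy of the same data */
void copy_memcpy(void)
{
   memcpy(a, b, N * sizeof(double));
}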



Figure 8-2 C library memcpy performance

The memory subsystem provides a reasonably linear response to increasing processor load, as provided in Table 8-1.
Table 8-1 Memory copy performance relative to one CPU

  Array size   2 CPUs   4 CPUs   6 CPUs   8 CPUs
  16 KB        2.0      3.9      6.0      8.0
  32 KB        2.0      4.0      6.0      8.0
  128 KB       2.0      4.0      6.1      8.1
  1 MB         1.8      3.2      4.4      5.3
  2 MB         1.9      3.2      4.5      5.4
  4 MB         1.9      3.3      4.7      5.6
  8 MB         1.9      3.2      4.7      5.6



8.2 Memory bandwidth limited throughput
In contrast to the performance described in the previous section, this section describes a throughput test that deliberately challenges the total system memory. The program computes the dot product of two REAL*8 arrays of length N. For this throughput test, N was chosen to be 110000000 to ensure that most of the data would not be resident in L3 cache. A single copy of this program achieved a 2.3 GB/s transfer rate on a 1.3 GHz processor. When eight copies of this job were run on a two-MCM, eight-processor pSeries 690 HPC, the aggregate data transfer rate was 11.3 GB/s, a speedup of 4.9. The aggregate transfer rates for job counts of 1 through 8 are shown in Figure 8-3.
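The kernel of this test is a long vector dot product. A C sketch of what each copy of the program executes is shown below (the original test used Fortran REAL*8 arrays; the function and argument names here are illustrative only):

/* N = 110000000 in the measurements, so the arrays do not fit in L3 cache */
double dot(const double *x, const double *y, long n)
{
   double sum = 0.0;
   long i;

   /* streams both arrays from memory; performance is bandwidth bound */
   for (i = 0; i < n; i++) {
      sum += x[i] * y[i];
   }
   return sum;
}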

Figure 8-3 System memory throughput for pSeries 690 HPC



A second throughput test using the same program was run on a 32-way, four-MCM, 1.3 GHz pSeries 690 Turbo with 96 GB of real memory. This system had 512 MB of shared L3 cache, so the array sizes were increased to 310 million double-precision elements up through 16 processors to ensure most of the data was not in L3 cache. For 32 copies of the program, the array sizes were reduced to 150 million double-precision elements to prevent the programs from exceeding the real memory on the system. The aggregate system throughput rates are shown in Figure 8-4.

Figure 8-4 System memory throughput on pSeries 690 Turbo

The shared L2 cache and non-shared L2 cache throughput rates for 2 through 16 jobs were obtained using the AIX bindprocessor command and related system calls. For the shared L2 cache runs, pairs of jobs were bound to processors on the same POWER4 chip. For the non-shared L2 cache runs, at most one job was bound to any POWER4 chip. The non-shared L2 cache performance would be nearly identical to the performance of a pSeries 690 HPC system. As expected, when two jobs share the L2 cache, the system throughput decreased.

It should be noted that this program relies on hardware prefetch streams. The performance of the prefetch streams is highly dependent on the size of memory pages. At the time this document was written, only 4 KB pages were available in AIX 5L. Large page sizes, which will be available in early 2002, are expected to significantly increase the single job and multiple job throughput for these examples.



8.3 MPI parallel on pSeries 690 and SP
This section describes a hydrodynamics benchmark application called Hydra from AWE, Aldermaston in the UK, and is included as an example to illustrate the comparative performance of the pSeries 690 Model 681 and the RS/6000 SP 375 MHz POWER3 SMP High Node, both with respect to absolute performance and parallel scalability. The RS/6000 SP 375 MHz POWER3 SMP High Node is a shared memory unit that contains 16 POWER3-II processors running at a frequency of 375 MHz.

Hydra is written in Fortran with MPI message passing and scales well to at least 512 processors for large problems. It does not use OpenMP or similar paradigms. The results for two test cases, a medium one called 2mm and a large one called 1mm, are shown in Table 8-2. All MPI communication is shared memory, with the single exception of the 32-way RS/6000 SP 375 MHz POWER3 SMP High Node case where user space MPI (EUILIB=us) was used over the IBM Switch2 connecting two SP nodes.

For this application, conclusions that can be drawn include:

- Up to 32-way parallel, a 1.3 GHz pSeries 690 Model 681 system is between 2.1 and 3.1 times faster than the same number of RS/6000 SP 375 MHz POWER3 SMP High Node processors.

- Scalability characteristics of a single pSeries 690 Model 681 system are similar to those of an RS/6000 SP 375 MHz POWER3 SMP High Node.
Table 8-2 MPI performance results for AWE Hydra code

                          2mm test case                        1mm test case
Parallelism    Elapsed    Parallel   Ratio       Elapsed        Parallel speedup   Ratio
               seconds    speedup    over NH2    seconds        over 4-way run     over NH2

16-processor RS/6000 SP 375 MHz POWER3 SMP High Nodes
Serial         8776.4     1          1           not measured   N/A                1
2-way          4598.9     1.91       1           not measured   N/A                1
4-way          2331.1     3.76       1           28645          1                  1
8-way          1286.2     6.82       1           14065          2.04               1
16-way         754.6      11.63      1           8033           3.57               1
32-way         388.5      22.59      1           3913           7.32               1
64-way         229.0      38.32      1           2080           13.77              1
128-way        141.2      62.16      1           1101           26.02              1

8-processor pSeries 690 HPC, results normalized to 1.3 GHz
Serial         3297.5     1          2.66        not measured   not measured       not measured
2-way          1701.6     1.93       2.70        not measured   not measured       not measured
4-way          926.9      3.56       2.51        11349          1                  2.52
8-way          537.9      6.13       2.39        6307           1.80               2.23

32-processor pSeries 690 Model 681, 1.3 GHz
4-way          752.3      3.56*      3.10        not measured   not measured       not measured
8-way          468.9      5.71*      2.74        not measured   not measured       not measured
16-way         272.1      9.84*      2.77        not measured   not measured       not measured
32-way         160.5      16.68*     2.42        1821           N/A                2.15

* Assuming 4-way speedup is the same as pSeries 690 HPC, that is, 3.56.

8.4 Multiple job throughput
This section discusses the extent to which the total execution times for different types of jobs increase when multiple jobs are run concurrently. Examples are given of two jobs that only lightly stress the I/O and memory subsystems and hence give excellent throughput scaling. Then results from an artificial job are shown, adjusted to provide varying degrees of memory subsystem stress.

We have not been able to investigate throughput effects for I/O intensive jobs. This is a very important subject area, but only a very limited I/O configuration was available to us on the pSeries 690 Model 681 systems we tested. Results from any I/O intensive applications would, therefore, have been unrealistic.



8.4.1 ESSL DGEMM throughput performance
Multiple copies of DGEMM from ESSL (see Section 6.1, "The ESSL and Parallel ESSL libraries" on page 114) were run together on a 32-way pSeries 690 Turbo. Each job multiplied matrices of 5000x5000 REAL*8 numbers, which require 600 MB of memory for the three arrays involved. However, because ESSL blocks the code to achieve good memory locality, and because matrix multiply involves a high ratio of computation to memory access, almost no slowdown was seen when multiple copies were run. Table 8-3 lists the performance of the jobs in GFLOPS. Any slowdown due to running multiple copies would be evidenced by decreasing values for GFLOPS as the number of jobs increases. However, the multiple job slowdown is very small in all cases, being only 11 percent when 32 concurrent jobs were run on a 32-way pSeries 690 Turbo.
Table 8-3 Effects of running multiple copies of DGEMM (1.3 GHz 32-way pSeries 690 Turbo)

  Number of jobs   GFLOPS for 5000x5000 REAL*8 matrices   Slowdown ratio to single job
  1                3.417                                  1
  8                3.338                                  1.02
  16               3.253                                  1.05
  24               3.167                                  1.08
  32               3.079                                  1.11

As can be calculated from Table 8-3, 32 copies of the same program achieve a total performance rate of 98.5 GFLOPS.

8.4.2 Multiple ABAQUS/Explicit job streams
ABAQUS/Explicit is a commercially available structural analysis code from HKS Inc. of Pawtucket, Rhode Island. It uses an explicit (rather than implicit) solution technique and, therefore, does not perform heavy I/O or memory access operations. The jobs run were HKS's seven standard timing tests, t1-exp through t7-exp, and the time reported is the total elapsed time to run all seven. In addition to running a single stream of jobs, four and then eight sets of the seven timing jobs were run concurrently on an eight-processor 1.3 GHz pSeries 690 HPC (different from the 1.1 GHz machine used for most of the other measurements in this publication). The times for the three runs are shown in Table 8-4 and are the total elapsed seconds to complete all jobs.



These results show excellent throughput scaling from the pSeries 690 HPC for this application. The 8-stream run, using all processors, takes only 2 percent longer than a single-stream run.
Table 8-4 Multiple ABAQUS/Explicit job stream times

  Number of job streams   Elapsed seconds   Slowdown ratio to single stream
  1                       2439              1
  4                       2459              1.01
  8                       2488              1.02

8.4.3 Memory stress effects on throughput
Compared with the previous sections, which showed jobs with excellent total throughput, this section describes a worst case example of a job that is designed to stress memory as much as possible. Most production applications will stress memory significantly less than this. As will be explained, this study demonstrates the benefits of the pSeries 690 HPC models for high memory stress applications.

A simple program was used consisting of repeated calls to a subroutine that executed the statement A(I)=B(I)+C(I)*D(I) in a loop. This code stresses memory in much the same way as the dot-product test reported in Section 8.2, "Memory bandwidth limited throughput" on page 157. The results presented here are consistent with those in that section but are presented in a way that focuses on the total throughput obtained by running multiple copies of the job.

Figure 8-5, Figure 8-6, and Figure 8-7 show the interactions between a number of jobs plotted as a function of the total amount of memory accessed by each program. Results are shown for a 16-way RS/6000 SP 375 MHz POWER3 SMP High Node, an 8-way pSeries 690 HPC, and a 32-processor pSeries 690 Model 681. First, these figures are discussed individually and then some overall conclusions are drawn.



Figure 8-5 Job throughput effects on a 375 MHz POWER3 SMP High Node

Figure 8-6 Job throughput effects on an eight-way pSeries 690 HPC



Figure 8-7 Job throughput effects on a 32-way pSeries 690 Turbo

16-way RS/6000 SP 375 MHz POWER3 SMP High Node
Each processor has a local 8 MB L2 cache. The times for multiple jobs start to exceed the single job time when the memory accessed by each job approaches this value.

pSeries 690 HPC and pSeries 690 Model 681
On the pSeries 690 HPC, each processor has its own local L2 cache, whereas on the pSeries 690 Turbo, even/odd pairs of processors share a local L2 cache. To illustrate the effect of this, on the pSeries 690 Turbo the 8 and 16 job runs were done in two ways. The shared L2 cache runs were done with the jobs bound sequentially to processors. The non-shared L2 cache runs were done with the jobs bound only to even processors so that no two jobs were ever sharing a cache. The non-shared runs are expected to behave in the same way as a pSeries 690 HPC, and it can be seen that the 8 jobs, non-shared L2 graph in Figure 8-7 (pSeries 690 Turbo) is very similar to the 8 jobs graph in Figure 8-6 (pSeries 690 HPC).
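The binding described above can also be done programmatically with the AIX bindprocessor() system call. The following is a minimal sketch, not the code used for these measurements; the processor-numbering policy and the error handling are illustrative assumptions:

#include <stdio.h>
#include <unistd.h>
#include <sys/processor.h>

/* bind the calling process to one logical processor, here every other
   CPU so that two jobs never share an L2 cache                        */
int bind_self(int job_number)
{
   int ncpus = sysconf(_SC_NPROCESSORS_ONLN);
   int cpu   = (2 * job_number) % ncpus;     /* even processors only */

   if (bindprocessor(BINDPROCESS, getpid(), cpu) != 0) {
      perror("bindprocessor");
      return -1;
   }
   return cpu;
}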

Conclusions from the graphs
The following are the conclusions that may be developed from the graphs:

- The graphs for the pSeries 690 Model 681 are more complicated than those for the RS/6000 SP 375 MHz POWER3 SMP High Node because of the presence of the Level 3 cache.

- As a consequence of the pSeries 690 design, in which each processor has a powerful data prefetch engine plus full access to the L3 cache and memory bandwidth on its MCM, it takes relatively few jobs as stressful as this one to consume the system's resources. This design feature provides the maximum opportunity for a mixture of jobs with arbitrary resource demands to achieve the best possible system throughput.

- Once a system resource such as L3 cache or memory bandwidth is fully consumed, however, adding more jobs to the system will not improve overall system throughput. This effect is seen on the pSeries 690 Model 681 around the point where each program accesses around 100 MB of data. This happens because of the shared L3 cache (256 MB on the two-MCM pSeries 690 HPC and 512 MB on the four-MCM pSeries 690 Turbo). Multiple jobs can be slowed down by spilling out of L3 cache, necessitating additional accesses to the main memory subsystem.

- A similar shared-L2 cache effect can be seen on the three shared L2 lines for the pSeries 690 Turbo (Figure 8-7) around the point where the jobs access around 750 KB of memory and spill out of L2. The pSeries 690 HPC-like lines do not show any such effect.

- Throughput in the worst case region, where the programs are working wholly outside any cache, shows job times of approximately four times single job times for 32 jobs on a 32-way pSeries 690 Turbo, approximately two times for 16 jobs on a 16-way pSeries 690 HPC, and approximately 1.5 times for 8 jobs on an 8-way pSeries 690 HPC.

- In general, throughput is expected to improve for this example with a future update to AIX 5L in which pages are allocated from memory attached to the MCM where the process is running, thus minimizing MCM-to-MCM traffic (see 3.2.2, "Memory configurations" on page 53).

- The benefit of the pSeries 690 HPC design over the pSeries 690 Turbo for a memory-stressing job mix is clear.

8.4.4 Shared L2 cache and logical partitioning (LPAR)
FIRE is a commercially available computational fluid dynamics (CFD) analysis code from AVL List GmbH, Austria (http://www.avl.com). There are optimized versions of FIRE available for scalar, vector, and parallel (shared and distributed memory) architectures. For the following study, the SMP version V7.3, compiled with XLF 6.1 for the POWER3 platform, was used.

AVL provides several standard test cases for benchmark purposes. In the following, the test cases water (water cooling jacket; 284,000 cells) and ext3d (external flow; 711,000 cells) are investigated. A sequential job needs about 300 MB (water) or 620 MB (ext3d) of memory, respectively. FIRE is a memory bandwidth demanding application. The time for I/O is negligible.



The following machines were used, which differ not only in clock frequency but also in memory speed and microcode level:

hpc     8-CPU pSeries 690 Model 681 HPC at 1.1 GHz (memory at 328 MHz)
turbo   32-CPU pSeries 690 Model 681 Turbo at 1.3 GHz (memory at 400 MHz)
lpar    8-CPU pSeries 690 Model 681 HPC at 1.0 GHz (memory at 400 MHz, a development system, two logical partitions of four processors each)

Performance impact of shared versus non-shared L2 cache
The performance impact of a shared L2 cache can be studied by binding the two threads of a two-CPU parallel job to two processors that either belong to the same POWER4 chip or to different chips. It turns out that the shared cache has very little impact on performance with respect to FIRE. Table 8-5 contains the job execution times on a pSeries 690 Model 681 Turbo.
Table 8-5 FIRE benchmark: Impact of shared versus non-shared L2 cache

  Elapsed time [s]                           water   ext3d
  Sequential (pSeries 690 Turbo)             228.8   525.6
  Two-CPU, shared (pSeries 690 Turbo)        127.1   303.7
  Two-CPU, not shared (pSeries 690 Turbo)    125.9   297.8

Impact of partitioning on single job performance
Logical partitioning is expected to introduce a little overhead on memory access. However, it is possible to distribute a throughput workload across several LPARs in order to isolate single jobs or groups of jobs. Whether throughput performance can benefit from partitioning depends on how the physical resources are mapped onto the different LPARs. For the following benchmark, a pSeries 690 Model 681 HPC system is divided into two LPARs. Each LPAR consists of one MCM (four CPUs). Only the MCM's local memory was assigned to its LPAR (this configuration is not supported through the standard hardware management console function; a system reset after an LPAR reconfiguration was required to achieve this through trial and error).

Results for a single sequential job running in an LPAR are presented in Table 8-6. Timings normalized to 1.3 GHz are given in parentheses. Note that partitioning does not degrade single job performance. A slight benefit was even observed, which might be close to the bounds of the experimental error. The benefit is likely due to the chosen memory affinity. The ratio between clock frequency and memory frequency also biases the measurement.
Table 8-6 FIRE benchmark: Uniprocessor, single job versus partitioning

  Elapsed time [s]              water           ext3d
  No LPAR (pSeries 690 HPC)     260.1 (220.1)   627.0 (530.5)
  LPAR 1 (64-bit kernel)        274.9 (211.5)   648.8 (499.1)
  LPAR 2 (32-bit kernel)        283.0 (217.7)   659.2 (507.1)

Impact of partitioning on throughput performance
Running eight sequential FIRE jobs on an eight-way machine is the throughput scenario that puts the most stress on the memory subsystem. For the particular setup of this benchmark, it is observed that partitioning can reduce the interference between different processes of a throughput workload and therefore improve the throughput performance. The results are presented in Table 8-7. Timings normalized to 1.3 GHz are given in parentheses.
Table 8-7 FIRE benchmark: Throughput performance versus partitioning

  Elapsed time [s]                            water           ext3d
  Single job, no LPAR (pSeries 690 Turbo)     228.8           525.6
  Single job, no LPAR (pSeries 690 HPC)       260.1 (220.1)   627.0 (530.5)
  Eight jobs, no LPAR (pSeries 690 HPC)       452.8 (383.1)   1004.3 (849.8)
  Four jobs using LPAR 2, LPAR 1 idle         415.9 (319.9)   905.6 (696.6)
  Four jobs using LPAR 1                      409.5 (315.0)   919.7 (707.5)
  Four jobs using LPAR 2                      421.8 (324.5)   926.4 (712.6)



8.5 Genetic sequencing program
A genetic sequencing program was run on a number of systems, including POWER4, to determine relative performance. The program is written in C and comprises a mixture of floating-point arithmetic, character manipulation, and file I/O. Table 8-8 lists performance results on the different systems. Use of POWER4-specific optimization provides a noticeable benefit compared to -qarch=com.
Table 8-8 Performance on different systems

  System and compiler flags                            Elapsed time
  POWER3 -O3 (375 MHz)                                 22m 21s
  S80 -O3 (450 MHz)                                    45m 22s
  POWER4 -O3 -qarch=com (1.3 GHz HPC)                  11m 42s
  POWER4 -O3 -qarch=pwr4 -qtune=pwr4 (1.3 GHz HPC)     10m 42s

8.6 FASTA genetic sequencing program
The FASTA program suite provides a number of utilities for local sequence alignment of DNA or protein sequences against corresponding sequence databases. The fasta utility uses a fast, heuristic algorithm. The ssearch utility uses a Smith-Waterman algorithm. Comparison tests for two well-known sequences, arp_arath (536AA) and metr_salty (276AA), were run against the Swiss-Prot Release 39 database using both algorithms. Note that these utilities do significant amounts of I/O. The sequence database is approximately 250 MB. Table 8-9 provides a single-processor performance comparison against POWER3.
Table 8-9 Relative performance of FASTA utilities

  Sequence                POWER3   POWER4   Speedup
  arp_arath (fasta)       26.47    15.54    1.7
  arp_arath (ssearch)     453.72   300.30   1.5
  metr_salty (fasta)      20.84    12.09    1.7
  metr_salty (ssearch)    230.45   153.03   1.5



8.7 BLAST genetic sequencing program
BLAST (Basic Local Alignment Search Tool) is a suite of applications for searching DNA sequence databases. The BLAST algorithm makes pairwise comparisons of sequences, seeking regions of local similarity rather than optimal global alignment. BLAST 2.2.1 can perform gapped or ungapped alignments.

blastn    DNA sequence queries can be performed against DNA sequence databases.
tblastn   Protein sequence query performed against a DNA sequence database dynamically translated in all six reading frames.

As with the FASTA tests, the BLAST programs perform varying and typically significant amounts of I/O. Relative performance on POWER3 and POWER4 for blastn and tblastn is provided in Table 8-10 and Table 8-11, respectively.
Table 8-10 Blastn results

  Query        POWER3   POWER4   Ratio
  nt.2655203   180      81       2.2
  nt.3283410   70       31       2.3
  nt.5764416   10       4        2.5

Table 8-11 Tblastn results

  Query        POWER3   POWER4   Ratio
  nt.1177466   170      68       2.5
  nt.129295    63       23       2.7
  nt.231729    95       35       2.7





Related publications
The publications listed in this section are considered particularly suitable for a more detailed discussion of the topics covered in this redbook.

IBM Redbooks
For information on ordering these publications, see "How to get IBM Redbooks" on page 173.

- RS/6000 Scientific and Technical Computing: POWER3 Introduction and Tuning Guide, SG24-5155
- Scientific Applications in RS/6000 SP Environments, SG24-5611
- AIX 5L Performance Tools Handbook, SG24-6039

Other resources
These publications are also relevant as further information sources:

- IBM RISC System/6000 Technology, SA23-2619
- XL Fortran for AIX User's Guide, SC09-2866
- XL Fortran for AIX Language Reference, SC09-2867
- Optimization and Tuning Guide for Fortran, C, and C++, SC09-1705
- Accelerating AIX by Rudy Chukran, Addison-Wesley, 1998
- AIX Performance Tuning by Frank Waters, Prentice-Hall, 1996

You can access all of the AIX documentation through the Internet at the following URL: http://www.ibm.com/servers/aix/library

The following types of documentation are located on the documentation CD that ships with the AIX operating system:

- User guides
- System management guides
- Application programmer guides
- All commands reference volumes
- Files reference
- Technical reference volumes used by application programmers



Referenced Web sites
These Web sites are also relevant as further information sources:

AIX and RS/6000 SP manuals
http://www.ibm.com/servers/aix/library/techpubs.html

MIO library
http://www.research.ibm.com/actc/Opt_Lib/mio/mio_doc.htm

Watson Sparse Matrix Package (WSMP)
http://www.cs.umn.edu/~agupta/wsmp.html

AIX Bonus Pack
http://www.ibm.com/servers/aix/products/bonuspack

CFD (computational fluid dynamics) application FIRE
http://www.avl.com

SPPM
http://www.llnl.gov/asci_benchmarks/asci/limited/ppm/sppm_readme.html

BLAS
http://www.netlib.org/blas

LAPACK
http://www.netlib.org/lapack

ATLAS
http://math-atlas.sourceforge.net

MASS
http://www.rs6000.ibm.com/resource/technology/MASS

ESSL
http://www-1.ibm.com/servers/eserver/pseries/software/sp/essl.html

HKS Abaqus
http://www.abaqus.com

SPEC
http://www.specbench.org

TPC
http://www.tpc.org

STREAM
http://www.cs.virginia.edu/stream



NAS
http://www.nas.nasa.gov//NAS/NPB

How to get IBM Redbooks
Search for additional Redbooks or Redpieces, view, download, or order hardcopy from the Redbooks Web site:
ibm.com/redbooks

Also download additional materials (code samples or diskette/CD-ROM images) from this Redbooks site. Redpieces are Redbooks in progress; not all Redbooks become Redpieces and sometimes just a few chapters will be published this way. The intent is to get the information out much quicker than the formal publishing process allows.

IBM Redbooks collections
Redbooks are also available on CD-ROMs. Click the CD-ROMs button on the Redbooks Web site for information about all the CD-ROMs offered, as well as updates and formats.





Special notices
References in this publication to IBM products, programs or services do not imply that IBM intends to make these available in all countries in which IBM operates. Any reference to an IBM product, program, or service is not intended to state or imply that only IBM's product, program, or service may be used. Any functionally equivalent program that does not infringe any of IBM's intellectual property rights may be used instead of the IBM product, program or service.

Information in this book was developed in conjunction with use of the equipment specified, and is limited in application to those specific hardware and software products and levels.

IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to the IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785.

Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact IBM Corporation, Dept. 600A, Mail Drop 1329, Somers, NY 10589 USA. Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.

The information contained in this document has not been submitted to any formal IBM test and is distributed AS IS. The information about non-IBM ("vendor") products in this manual has been supplied by the vendor and IBM assumes no responsibility for its accuracy or completeness. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.

Any pointers in this publication to external Web sites are provided for convenience only and do not in any manner serve as an endorsement of these Web sites.



Any performance data contained in this document was determined in a controlled environment, and therefore, the results that may be obtained in other operating environments may vary significantly. Users of this document should verify the applicable data for their specific environment.

This document contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples contain the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

Reference to PTF numbers that have not been released through the normal distribution process does not imply general availability. The purpose of including these reference numbers is to alert IBM customers to specific information relative to the implementation of the PTF when it becomes available to each customer according to the normal IBM PTF distribution process.

The following terms are trademarks of other companies:

C-bus is a trademark of Corollary, Inc. in the United States and/or other countries.

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and/or other countries.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States and/or other countries.

PC Direct is a trademark of Ziff Communications Company in the United States and/or other countries and is used by IBM Corporation under license.

ActionMedia, LANDesk, MMX, Pentium and ProShare are trademarks of Intel Corporation in the United States and/or other countries.

UNIX is a registered trademark in the United States and other countries licensed exclusively through The Open Group.

SET, SET Secure Electronic Transaction, and the SET Logo are trademarks owned by SET Secure Electronic Transaction LLC.

Other company, product, and service names may be trademarks or service marks of others.



Abbreviations and acronyms
ABI  Application Binary Interface
AFPA  Adaptive Fast Path Architecture
AIX  Advanced Interactive Executive
ANSI  American National Standards Institute
APAR  Authorized Program Analysis Report
API  Application Programming Interface
ASCI  Accelerated Strategic Computing Initiative
ASCII  American National Standards Code for Information Interchange
ASR  Address Space Register
AUI  Attached Unit Interface
BCT  Branch on Count
BIST  Built-In Self-Test
BLAS  Basic Linear Algebra Subprograms
BOS  Base Operating System
CAD  Computer-Aided Design
CAE  Computer-Aided Engineering
CAM  Computer-Aided Manufacturing
CATIA  Computer-Graphics Aided Three-Dimensional Interactive Application
CDLI  Common Data Link Interface
CD-R  CD Recordable
CE  Customer Engineer
CEC  Central Electronics Complex
CFD  Computational Fluid Dynamics
CGE  Common Graphics Environment
CHRP  Common Hardware Reference Platform
CIU  Core Interface Unit
CLIO/S  Client Input/Output Sockets
CMOS  Complementary Metal-Oxide Semiconductor
CPU  Central Processing Unit
CWS  Control Workstation
D-Cache  Data Cache
DAD  Duplicate Address Detection
DAS  Dual Attach Station
DASD  Direct Access Storage Device
DFL  Divide Float
DIMM  Dual In-Line Memory Module
DIP  Direct Insertion Probe
DMA  Direct Memory Access
DOE  Department of Energy
DOI  Domain of Interpretation
DPCL  Dynamic Probe Class Library
DRAM  Dynamic Random Access Memory
DSA  Dynamic Segment Allocation
DSE  Diagnostic System Exerciser
DSU  Data Service Unit
DTE  Data Terminating Equipment
DW  Data Warehouse
EA  Effective Address
EC  Engineering Change
ECC  Error Checking and Correcting
EEPROM  Electrically Erasable Programmable Read Only Memory
EFI  Extensible Firmware Interface
EHD  Extended Hardware Drivers
EIA  Electronic Industries Association
EISA  Extended Industry Standard Architecture
ELF  Executable and Linking Format
EPOW  Environmental and Power Warning
ERAT  Effective-to-Real Address Table
ERRM  Event Response resource manager
ESID  Effective Segment ID
ESSL  Engineering and Scientific Subroutine Library
ETML  Extract, Transformation, Movement and Loading
F/C  Feature Code
F/W  Fast and Wide
FBC  Fabric Bus Controller
FDPR  Feedback Directed Program Restructuring
FIFO  First In/First Out
FLIH  First Level Interrupt Handler
FMA  Floating-point Multiply/Add operation
FPR  Floating-Point Register
FPU  Floating-Point Unit
FRCA  Fast Response Cache Architecture
FRU  Field Replaceable Unit
GAMESS  General Atomic and Molecular Electronic Structure System
GCT  Global Completion Table
GFLOPS  Billion Floating-point Operations Per Second
GPFS  General Parallel File System
GPR  General-Purpose Register
HACWS  High Availability Control Workstation
HiPS  High Performance Switch
HiPS LC-8  Low-Cost Eight-Port High Performance Switch
HPF  High Performance Fortran
HPSSDL  High Performance Supercomputer Systems Development Laboratory
Hz  Hertz
I-CACHE  Instruction Cache
I/O  Input/Output
I2C  Inter Integrated-Circuit Communications
IA  Intel Architecture
IAR  Instruction Address Register
IBM  International Business Machines
ID  Identification
IDE  Integrated Device Electronics
IDS  Intelligent Decision Server
IEEE  Institute of Electrical and Electronics Engineers
IETF  Internet Engineering Task Force
IFAR  Instruction Fetch Address Register
IHV  Independent Hardware Vendor
IM  Input Method
INRIA  Institut National de Recherche en Informatique et en Automatique
IPL  Initial Program Load
IRQ  Interrupt Request
ISA  Industry Standard Architecture, Instruction Set Architecture
ISO  International Organization for Standardization
ISV  Independent Software Vendor
ITSO  International Technical Support Organization
L1  Level 1
L2  Level 2
L3  Level 3
LAPI  Low-Level Application Programming Interface
LED  Light Emitting Diode
LFD  Load Float Double
LID  Load ID
LLNL  Lawrence Livermore National Laboratory
LMQ  Load Miss Queue
LP  Linear Programming
LP64  Long-Pointer 64
LPP  Licensed Program Product
LRQ  Load Reorder Queue
LRU  Least Recently Used
MASS  Mathematical Acceleration Subsystem
MAU  Multiple Access Unit
Mbps  Megabits Per Second
MBps  Megabytes Per Second
MCAD  Mechanical Computer-Aided Design
MCM  Multi-chip Module
MDI  Media Dependent Interface
MES  Miscellaneous Equipment Specification
MFLOPS  Million Floating-point Operations Per Second
MII  Media Independent Interface
MIP  Mixed-Integer Programming
MLD  Merged Logic DRAM
MLR1  Multi-Channel Linear Recording 1
MODS  Memory Overlay Detection Subsystem
MP  Multiprocessor
MPI  Message Passing Interface
MPP  Massively Parallel Processing
MPS  Mathematical Programming System
MST  Machine State
NCU  Non-Cacheable Unit
NTF  No Trouble Found
NUMA  Non-Uniform Memory Access
NUS  Numerical Aerodynamic Simulation
NVRAM  Non-Volatile Random Access Memory
NWP  Numerical Weather Prediction
OACK  Option Acknowledgment
OCS  Online Customer Support
ODBC  Open DataBase Connectivity
OEM  Original Equipment Manufacturer
OLAP  Online Analytical Processing
OLTP  Online Transaction Processing
OSL  Optimization Subroutine Library
OSLP  Parallel Optimization Subroutine Library
P2SC  POWER2 Single/Super Chip
PBLAS  Parallel Basic Linear Algebra Subprograms
PCI  Peripheral Component Interconnect
PDT  Paging Device Table
PDU  Power Distribution Unit
PE  Parallel Environment
PEDB  Parallel Environment Debugging
PHB  Processor Host Bridge
PHY  Physical Layer
PID  Process ID
PII  Program Integrated Information
PIOFS  Parallel Input Output File System
PMU  Performance Monitoring Unit
POE  Parallel Operating Environment
POSIX  Portable Operating Interface for Computing Environments
POST  Power-On Self-test
POWER  Performance Optimization with Enhanced RISC (Architecture)
PPC  PowerPC
PPM  Piecewise Parabolic Method
PSE  Portable Streams Environment
PSSP  Parallel System Support Program
PTF  Program Temporary Fix
PTPE  Performance Toolbox Parallel Extensions
PVC  Permanent Virtual Circuit
QP  Quadratic Programming
RAM  Random Access Memory
RAN  Remote Asynchronous Node
RAS  Reliability, Availability, and Serviceability
RFC  Request for Comments
RIO  Remote I/O
RIPL  Remote Initial Program Load
RISC  Reduced Instruction-Set Computer
ROLTP  Relative Online Transaction Processing
RPA  RS/6000 Platform Architecture
RPL  Remote Program Loader
RPM  Red Hat Package Manager
RSC  RISC Single Chip
RSCT  Reliable Scalable Cluster Technology
RSE  Register Stack Engine
RSVP  Resource Reservation Protocol
RTC  Real-Time Clock
SAR  Solutions Assurance Review
SAS  Single Attach Station
ScaLAPACK  Scalable Linear Algebra Package
SCB  Segment Control Block
SCO  Santa Cruz Operations
SDQ  Store Data Queue
SDRAM  Synchronous Dynamic Random Access Memory
SEPBU  Scalable Electrical Power Base Unit
SGI  Silicon Graphics Incorporated
SHLAP  Shared Library Assistant Process
SID  Segment ID
SIT  Simple Internet Transition
SLB  Segment Look-aside Buffer, Server Load Balancing
SLIH  Second Level Interrupt Handler
SLR1  Single-Channel Linear Recording 1
SM  Session Management
SMB  Server Message Block
SMI  System Memory Interface
SMP  Symmetric Multiprocessor
SOI  Silicon-on-Insulator
SP  Service Processor
SP  IBM RS/6000 Scalable POWER Parallel Systems
SPCN  System Power Control Network
SPEC  System Performance Evaluation Cooperative
SPM  System Performance Measurement
SPR  Special Purpose Register
SPS  SP Switch
SPS-8  Eight-Port SP Switch
SRQ  Store Reorder Queue
SRN  Service Request Number
SSA  Serial Storage Architecture
SSC  System Support Controller
SSQ  Store Slice Queue
SSL  Secure Socket Layer
STE  Segment Table Entry
STFDU  Store Float Double with Update
STP  Shielded Twisted Pair
STQ  Store Queue
SUID  Set User ID
SUP  Software Update Protocol
SVC  Switch Virtual Circuit
SVC  Supervisor or System Call
SYNC  Synchronization
TCE  Translate Control Entry
Tcl  Tool Command Language
TCQ  Tagged Command Queuing
TGT  Ticket Granting Ticket
TLB  Translation Lookaside Buffer
TOS  Type Of Service
TPC  Transaction Processing Council
TPP  Toward Peak Performance
TTL  Time To Live
UDI  Uniform Device Interface
UIL  User Interface Language
ULS  Universal Language Support
UP  Uniprocessor
USLA  User-Space Loader Assistant
UTF  UCS Transformation Format
UTM  Uniform Transfer Model
UTP  Unshielded Twisted Pair
VA  Virtual Address
VESA  Video Electronics Standards Association
VFB  Virtual Frame Buffer
VHDCI  Very High Density Cable Interconnect
VLAN  Virtual Local Area Network
VMM  Virtual Memory Manager
VP  Virtual Processor
VPD  Vital Product Data
VPN  Virtual Private Network
VSD  Virtual Shared Disk
VT  Visualization Tool
XCOFF  Extended Common Object File Format
XLF  XL Fortran




Index
Symbols
70 Fortran element order 95 subscripts 96 ASCI benchmark 136 assembler 41, 85 documentation 88 instructions 85 standard instructions 80 ASSERT 82 asynchronous I/O 108, 120 ATLAS 114 automatic parallelization 130 AVL FIRE 165 AWE 159

Numerics
32-bit large page support 60 64-bit performance, integer 91

A
ABAQUS/Explicit 161 addi instruction 9 addic instruction 92 address effective 57 real 57 translation 56 virtual 57 affinity memory 53, 60 AGEN cycle 14 AIX 5.1 58 AIXTHREAD_SCOPE 127 ALLOCATABLE Fortran 89 application FIRE 165 sPPM 136 application tuning memory 34 numerically intensive 26 applications large page 60 argument by reference, by value 89 array order in memory 34 arrays C element order 95 dimension 100

B
bandwidth 148 64-bit 92 barrier 144 PThreads 140 binding process, to a processor 149 bindprocessor command 158 BLAS 113, 115 BLAST 169 blocking 38 bosboot command 67 branch prediction 12 buffered I/O 109, 120 built-in self test 7

C
C array order in memory 34 arrays element order 95 compiler options 69 directives #pragma disjoint 94 C/C++ virtual functions, performance 90



volatile 90 cache bandwidth 148 blocking 38 considerations 28 false sharing 144 interference 148 L1 28 L2 30 L2 slices 30 latency 32 lines 29 set associativity 29 shared 148 shared, L2 165 structure 27 cache considerations 28 cache miss L2 31 cache misses avoiding 38 CACHE_ZERO 83 CFD FIRE 165 chdev command 66 Cholesky factorization 122 chuser command 61 CNCALL 82 commands bindprocessor 158 bosboot 67 chdev 66 chuser 61 dump 60 fdpr 73 filemon 52 gprof 111 iostat 52 ldedit 60 limit 54 lsps 62 mkuser 61 netpmon 52 netstat 52 nfsstat 52 prof 111 ps 61 svmon 52, 60, 61 svmon command 61

topas 109 tprof 111 ulimit 54 vmstat 52, 61, 108 vmtune 52, 61, 63, 67 xprofiler 112 communication protocol 134 compiler directive Fortran ASSERT 82 CACHE_ZERO 83 CNCALL 82 INDEPENDENT 82 LIGHT_SYNC 83 PERMUTATION 82 PREFETCH_BY_LOAD 81 PREFETCH_FOR_LOAD 82 PREFETCH_FOR_STORE 82 UNROLL 82 compiler directives Fortran 80 loop-related 82 prefetch 81 compiler option C -qalias 75 -qarch 80 -qfold 75 -qinline 75 -qlist 84 -qsmp 74 -qunroll 75 C++ -qsmp 74 Fortran 74 -g 74 -O 70 -O2 70 -O3 70 -O5 70 -p 74 -pg 74 -Q 74 -qalias 72 -qalign 72 -qarch 71 -qassert 72



-qcache 71 -qcompact 73 -qfdpr 73 -qhot 72, 76 -qipa 73 -qlibansi 74 -qlibessl 74 -qlist 84 -qnozerosize 74 -qpdf 73 -qsmp 73 -qstrict 73 -qstrict_induction 73 -qtune 71 -qunroll 73 -O 90 -Q 90 -q64 91 -qalign 90 -qarch 168 -qintsize 92 -qipa 90 -qlist 91 compiler options 69 Fortran conflicting 69 POWER4 specific 75 recommended 79 compilers comparing code generation 79 congruence class 29 CONTAINS, Fortran 89 copy performance 155 core interface unit (CIU) 6 counters 105 critical sections 133

DGEMM 161 single processor 116 SMP parallel 117 dimension arrays 100 direct I/O 108 directives C #pragma disjoint 94 distributed memory, MPI 133 dump command 60 dynamic threads 128

E
effective address 57 effective address (EA) 13 effective-to-real address table (ERAT) 13 eigenvalue 113 environment variables SMP 127 ESSL 114 events 102 executable format 60 execution unit floating point 32 expressions Fortran 94

F
fabric controller 7 false sharing 144 FASTA 168 fdiv instruction 15 fdivs instruction 15 fdpr command 73 fetch_and_add 135 filemon command 52 FIRE computational fluid dynamics 165 First Failure Data Capture 7 floating point operation 32 floating point registers 32 floating point unit 32 FMA 33 fork process 150 format

D
daemon, page replacement 65 dangling pointer 142 data sources of 105 data prefetch 31, 35 DAXPY 79 dcbz instruction 83 DDOT 79



executable 60 Fortran ALLOCATABLE 89 array order in memory 34 arrays element order 95 automatic parallelization 130 coding tips 89 compiler directive 80 ASSERT 82 CACHE_ZERO 83 CNCALL 82 INDEPENDENT 82 LIGHT_SYNC 83 PERMUTATION 82 PREFETCH_BY_LOAD 81 PREFETCH_FOR_LOAD 82 PREFETCH_FOR_STORE 82 UNROLL 82 compiler options 69 CONTAINS 89 directives prefetch 81 I/O 109 INCLUDE 90 INTENT 89 intrinsic functions vectorized 76 module 89 option precedence 69 WHERE 89 FPU 32 fres instruction 16 frsqrte instruction 16 fsel instruction 16 fsqrt instruction 15 fsqrts instruction 15

group operations MPI 134 groups 102 GX bus controller 6

H
hand tuning 26 hardware prefetch 31 prefetch, hardware 21 High Node, 375 MHz 162 hot spots locating 110 hybrid programming 135

I
I/O asynchronous 108 buffered 109 direct 108 Fortran 109 optimizing 120 paging 108 performance 120 tuning 107 unbuffered 109 I/O library, MIO 120 I/O pacing 65 IBM SP 159 switch 159 INCLUDE Fortran 90 INDEPENDENT 82 induction variable 92 inlining 90, 95 instruction 85 instruction fetch address register (IFAR) 9 instruction set documentation 88 instructions standard 80 integer performance 91 interference cache 148 interleaving 53 intrinsic functions 98, 114, 117

G
general sparse system of linear equations 122 genetic sequencing 168 global const 90 variables, thread 140 global completion table (GCT) 10 gprof command 111 group completion (GC) stage 9



Fortran vectorized 76 invariant functions 97 iostat command 52 issue queues 10 issue stage (ISS) 11

L
L1 data cache 28, 32 structuring for 38 L2 cache 30, 32, 148 miss 31 store 30, 51 L2 cache slices 30 L3 cache structure 22 LAPACK 113 large page 158 applications 60 data inheritance 60 usage control 61 vmtune 61 large page data 59 large page memory defining 67 large pages 58 pinned 60 latency MPI 136 ldedit command 60 library MIO 120 WSMP 122 library, tuned 114 libsys.a semaphore 150 LIGHT_SYNC 83 light-weight synchronization 83 limit command 54 lmw instruction 10 load instruction data load 30 load miss queue (LMQ) 14 load reorder queue (LRQ) 14 load-balancing thread programming 137 loadquad instruction 80 local variables, thread 140

lock 128 atomic 150 contention 143 mutex 140 logical partitioning 165 loops locating 87 performance 95 stride 95, 97 unrolling 86 variables 96 low level parallelization 129 LPAR 107, 165 lrubuckets 65 lrud 65 lsps -a command 62 lswi instruction 10

M
malloc 128 MALLOCMULTIHEAP 128 mapping (MP) stage 11 MASS 114, 117 math.h 89 mathematical functions 114 matrix WSMP library 122 matrix multiply 44 max_coalesce 66 max_pout 65 maxfree 65 maxperm 63 maxpgahead 65, 66 maxrandwrt 66 MCM partitioning, LPAR 166 memory affinity 53 book 53 configuration 53 controller 53 interleaving 53 memory affinity 60 memory bandwidth 153, 157 memory copy 155 mempools 65 merged logic DRAM 5



message passing 133 WSMP library 122 mfxer instruction 10 millicoded instructions 9 min_pout 65 minfree 65 minimization, stride 34 minperm 63 minpgahead 66 MIO library 120 mixed-mode programming 135 mkuser command 61 module Fortran 89 monitoring I/O 120 MP_SHARED_MEMORY 135 MP_WAIT_MODE 135 MPI 159 parallelization 133 msem_ 150 mtcrf instruction 10 mtxer instruction 10 multi-chip module (MCM) 18 multifrontal algorithm 122 multiple jobs 161 multiply-and-add instruction 8 mutex lock 140

critical section 144 false sharing 144 overhead 146 Pthreads 142 threadprivate 146 operation floating point 32 optimization see also performance see also tuning intrinsic functions 98 invariant statement 97 reciprocal multiply 99 optimizer 79 outer loop unrolling 41 overhead parallel, OpenMP 146

P
P2SC 2 page replacement daemon 65 page size 54 page table 55, 58 page table entry 55 pages large 58 paging 108 parallel overhead, OpenMP 146 Parallel Environment 134 Parallel ESSL 114, 115 parallelization automatic 130 comparison 147 directive based SMP 131 general 125 high level 129 low level 129 MPI 133 overhead 133 Pthreads 137 shared memory 126 partitioning 165 performance 166 PCI Host Bridge (PHB) 23 PE 134 Peak Megaflops 33 performance

N
netpmon command 52 netstat command 52 nfsstat command 52 non-cacheable unit (NCU) 6 nroff 80 NUMA 126 number of processors, online 139 numclust 66 numerically intensive applications 26

O
-O flags 70 O3 fortran option 70 object code 84 instructions 85 locating loops 87 OpenMP 126



see also optimization see also tuning 64-bit 91 array dimension 100 cache 150 coding tips 88 comparative 159 function arguments 89 I/O 120 inlining 90 integer arithmetic 91 intrinsic functions 98 invariant statement 97 lock contention 143 loops 95 unrolling 86 math.h 89 non numeric code 80 reciprocal multiply 99 semaphores 150 shared cache 166 string operations 89 threads 143 total system 161 variation 67 performance monitor 23, 101 events 102 groups 102 pmcount 101 performance monitoring unit (PMU) 7 PERMUTATION 82 pinned memory 60 pipeline floating point 33 POWER4 9 pmcount 101 POE 134 pointer dangling 142 pointers 94 POSIX I/O 121 power on reset 7 POWER1 1 POWER2 2 POWER3 4, 159 POWER4 block diagram 8 caches 27

chip 6 introduction 4 memory subsystem 20 overview 5 performance characteristics 27 performance monitor 23 PowerPC 601 2 PowerPC 603 3 PowerPC 604 3 PowerPC 604e 3 prefetch 31, 35 PREFETCH_BY_LOAD 81 PREFETCH_FOR_LOAD 82 PREFETCH_FOR_STORE 82 prefetching I/O 120 large pages 58 process scope 127 processor introduction 4 POWER4 details 5 processor, online 139 prof command 111 profiling 110 program counter thread programming 137 programming model, MPI 135 protein sequencing 168 ps command 61 pthread_create 138 Pthreads detached 139 Fortran 142 joinable 139 OpenMP 142 programming 137 pthread_cancel 139 pthread_exit 139 pthread_mutex_t 145 thread creation 138, 143 thread termination 139

Q
qarch fortran option 70 qcache fortran option 70 qhot fortran option 70



qipa fortran option 70 -qlibposix 74 qtune fortran option 70

R
read command queue 20 real address 57 reciprocal approximation 118 reciprocal multiply 99 Redbooks Web site 173 Contact us xiv reduction sum 133 register physical 32 rename 32 segment 54 spilling 41 thread programming 137 usage in assembler listing 85 registers 32 renaming 32 runtime variables SPINLOOPTIME, YIELDLOOPTIME 143

S
ScaLAPACK 115 scaling 160 scaling, MPI 159 scheduling, thread 131 scope, process 127 scope, system 127 scope, thread contention 127 segment addressing 54 segment lookaside buffer 57 segment look-aside buffer (SLB) 13 segment table entry (STE) 13 semaphore 150 sleep versus spin 151 sequencing genetic 168 serialization, threads 140 set associativity 29 shared cache 148 L2 cache 165 memory segment 134 memory, MPI 133 memory, Pthreads 137

parallelization, shared memory 126 shared L2 cache 158 size apparent cache size 148 SLB 57 sleep versus spin 151 slice queue (SSQ) 15 slices, L2 cache 30 small page 158 small pages 58 Smith-Waterman algorithm 168 SMP runtime variables 143 SP switch 134 sparse matrix WSMP library 122 special purpose register (SPR) 15 speculative-execution 11 spilling 41 spin versus sleep 151 spin wait 126 SPINLOOPTIME 128, 143 sPPM code 136 stack size, OpenMP 129 thread programming 137 storage slice queue (SSQ) 15 store queue (STQ) 15 store reorder queue (SRQ) 14 store-in 30 store-through 30 strict_maxperm 64 stride 34, 95, 97 string operation performance 89 superscalar execution 8 svmon command 52, 60, 61 synchronization 135 synchronization, threads 140 sysconf 139 System Memory Interface (SMI) 20 system performance 61 system scope 127 system tuning 54



T
thread programming, see Pthreads 137 scheduling 131 shared cache 148 thread contention scope 127 thread safe 129, 131 threads 126 threadsafe 146 throughput 157, 160, 161, 162 I/O 120 partitioning, LPAR 165 TLB 31 large pages and 58 topas command 109 tprof command 111 trace file I/O 121 Translation Lookaside Buffer 31 translation look-aside buffer (TLB) 13 translation lookaside buffer (TLB) 31 translation, address 56 triangular matrix solver 122 tuning see also optimization see also performance application 25 application memory 34 floating point 40 for cache 38, 51 I/O 65, 107 vmtune 66 inlining 90 page replacement 63 system 54 VMM 63, 65 type conversion 95

global, thread 140 local, thread 140 loop 96 vector intrinsics 76 virtual address (VA) 13, 57 virtual memory 54 VMM tuning 63 vmstat command 52, 61, 108 vmtune command 52, 61, 63, 65, 66, 67

W
Watson Sparse Matrix Package (WSMP) 122 WHERE Fortran 89 write cache queue 20 write command queue 20 write-behind 66 WSMP library 122

X
XLSMPOPTS 129 xprofiler 112 xprofiler command 112

Y
yield wait 126 YIELDLOOPTIME 128

U
ulimit command 54 unbuffered I/O 109 UNROLL 82 unrolling 41 user space protocol 134

V
variables 93









Back cover


The POWER4 Processor Introduction and Tuning Guide

Comprehensive explanation of POWER4 performance Includes code examples and performance measurements How to get the most from the compiler

This redbook is designed to familiarize you with the IBM ^ pSeries POWER4 microarchitecture and to provide you with the information necessary to exploit the new high-end servers based on this architecture. The eight to 32-way symmetric multiprocessing (SMP) pSeries 690 Model 681 will be the first POWER4 system to be available. Thus, most analysis presented in this publication refers to this system.

Specifically, this publication will address the following issues:

- POWER4 features and capabilities
- Processor and memory optimization techniques, especially for Fortran programming
- AIX XL Fortran Version 7.1.1 compiler capabilities and which options to use
- Parallel processing techniques and performance
- Available libraries and programming interfaces
- Performance examples of commonly used kernels

While this publication is decidedly technical in nature, the fundamental concepts are presented from a user point of view and numerous examples are provided to reinforce these concepts.

INTERNATIONAL TECHNICAL SUPPORT ORGANIZATION

BUILDING TECHNICAL INFORMATION BASED ON PRACTICAL EXPERIENCE

IBM Redbooks are developed by the IBM International Technical Support Organization. Experts from IBM, Customers and Partners from around the world create timely technical information based on realistic scenarios. Specific recommendations are provided to help you implement IT solutions more effectively in your environment.

For more information: ibm.com/redbooks

SG24-7041-00

ISBN 0738423556