Pallas MPI Benchmarks PMB, Part MPI-1

Pallas GmbH, Hermülheimer Str. 10, D-50321 Brühl, Phone: +49-(0)2232-1896-0, Fax: +49-(0)2232-1896-29, http://www.pallas.com


Contents

1     INTRODUCTION
2     INSTALLATION AND QUICK START OF PMB-MPI1
      2.1   Download
      2.2   Installation
      2.3   Running PMB-MPI1
3     OVERVIEW OF PMB-MPI1
      3.1   General
      3.2   The Benchmarks
      3.3   Version changes
            3.3.1   Version 2.1 vs. 2.0
            3.3.2   Version 2.2 vs. 2.1
      3.4   PMB-MPI1 vs. PMB1.x Definitions
            3.4.1   Changed Definitions
            3.4.2   Throughput Calculations
            3.4.3   Corrected Methodology
4     PMB-MPI1 BENCHMARK DEFINITIONS
      4.1   Benchmark Classification
            4.1.1   Single Transfer Benchmarks
            4.1.2   Parallel Transfer Benchmarks
            4.1.3   Collective Benchmarks
      4.2   Definition of Single Transfer Benchmarks
            4.2.1   PingPong
            4.2.2   PingPing
      4.3   Definition of Parallel Transfer Benchmarks
            4.3.1   Sendrecv
            4.3.2   Exchange
      4.4   Definition of Collective Benchmarks
            4.4.1   Reduce
            4.4.2   Reduce_scatter
            4.4.3   Allreduce
            4.4.4   Allgather
            4.4.5   Allgatherv
            4.4.6   Alltoall
            4.4.7   Bcast
            4.4.8   Barrier
5     BENCHMARK METHODOLOGY
      5.1   Running PMB, Command Line Control
            5.1.1   Default Case
            5.1.2   Command Line Control
      5.2   PMB Parameters and Hard Coded Settings
            5.2.1   Parameters Controlling PMB
            5.2.2   Communicators, Active Processes
            5.2.3   Message Lengths
            5.2.4   Buffer Initialization
            5.2.5   Warm-up Phase
            5.2.6   Synchronization
            5.2.7   The Actual Benchmark
6     OUTPUT
      6.1   Sample 1
      6.2   Sample 2
      6.3   Sample 3
7     FURTHER DETAILS
      7.1   Memory Requirements
      7.2   SRC Directory
      7.3   Results Checking
      7.4   Use of MPI
8     REVISION HISTORY
REFERENCES

1 Introduction

In an effort to define a standard API for message-passing programming, a forum of HPC vendors, researchers and users has developed the Message Passing Interface. MPI-1 [1] and MPI-2 [2] are now firmly established as the premier message-passing API, with implementations available for a wide range of platforms in the high-performance and general computing area, and a growing number of applications and libraries using MPI. To help compare the performance of various computing platforms and/or MPI implementations, the need for a set of well-defined MPI benchmarks arises. This document presents the Pallas MPI Benchmarks (PMB) suite. Its objectives are:
· provide a concise set of benchmarks targeted at measuring the most important MPI functions,
· set forth a precise benchmark methodology,
· not impose much of an interpretation on the measured results: report bare timings instead, and show throughput values if and only if these are well defined.

This document accompanies version 2.2 of PMB. The code is written in ANSI C plus standard MPI (about 10 000 lines of code, 100 functions in 48 source files). The PMB 2.2 package consists of 3 separate parts:
· PMB-MPI1 (the focus of this document)
· PMB-MPI2 (see [3]), subdivided into PMB-EXT (one-sided communications benchmarks) and PMB-IO (I/O benchmarks).

For each part, a separate executable can be built. Users who don't have the MPI-2 extensions available can install and use just PMB-MPI1. Only standard MPI-1 functions [1] are used; no dummy library is needed. This document is dedicated to PMB-MPI1. Section 2 is a brief installation guide; section 3 gives an overview of the suite; section 4 defines the single benchmarks in detail. PMB introduces a classification of its benchmarks into three classes: Single Transfer, Parallel Transfer, and Collective. Roughly speaking, Single Transfer benchmarks run dedicated, without obstruction from other transfers, so undisturbed results are to be expected (PingPong being the best-known example). Parallel Transfer benchmarks test the system under global load, with concurrent actions going on. Finally, Collective is a proper MPI classification; these benchmarks test the quality of the implementation of the higher-level collective functions. Section 5 defines the methodology and rules of PMB, and section 6 shows the format of the output tables. In section 7, further important details are explained, in particular a results checking mode for PMB.


2 Installation and Quick Start of PMB-MPI1

In order to run PMB-MPI1, one needs:
· cpp, ANSI C compiler, make.
· Full MPI-1 installation, including startup mechanism for parallel MPI programs.

See 7.1 for the memory requirements of PMB-MPI1.

2.1 Download

Get PMB.tar.gz at http://www.pallas.de/pages/pmbd.htm

2.2 Installation

After unpacking, the following structure is created in the current directory:

PMB2          (directory)
PMB2/SRC      (subdirectory containing sources and Makefile, see 7.2)
PMB2/RESULTS  (subdirectory with sample results from various machines)
PMB2/DOC      (subdirectory containing this document in postscript format)

The installation is performed in the SRC subdirectory to keep the structure simple. A generic Makefile can be found there; all rules and dependencies are defined in it. Only a few (precisely 7) machine-dependent variables have to be set, after which a simple make performs the installation. For defining the machine-dependent settings, the section

##### User configurable options #####
#include make_i86
#include make_solaris
#include make_dec
#include make_sp2
#include make_sr2201
#include make_vpp
#include make_t3e
#include make_sgi
#include make_sx4
### End User configurable options ###

is provided in the Makefile header. The listed make_* files are in the directory and have been used successfully on certain systems. First check whether one of these can be used for your purpose (with no or marginal changes). Usually, however, the user will have to edit an own make_mydefs file. This has to contain the 7 variable assignments used by the Makefile:


CC =
CLINKER =
OPTFLAGS =
CPPFLAGS =      /* Allowed values: -DnoCHECK, -DCHECK (see 7.3; check that
                   proper benchmarks are always run with the cpp flag
                   -DnoCHECK!) */
LIB_PATH =      /* in the form -L -L .. */
LIBS =          e.g. -lmpi -l....
MPI_INCLUDE =

Activate these flags by editing the Makefile header:

##### User configurable options #####
include make_mydefs
#include make_i86
#include make_solaris
#include make_dec
#include make_sp2
#include make_sr2201
#include make_vpp
#include make_t3e
#include make_sgi
#include make_sx4
### End User configurable options ###

The user flags will be used in the following way:

$(CC) -I$(MPI_INCLUDE) $(CPPFLAGS) $(OPTFLAGS) -c           (for compilation)
$(CLINKER) -o  [.o's]  $(LIB_PATH) $(LIBS)                  (for linking)

Of course, the user may define own auxiliary variables in make_mydefs. Now, type

make [PMB-xxx]

Compilation should be quite short, and the executable PMB-xxx will be generated.


2.3 Running PMB-MPI1

Check the right way of running parallel MPI programs on your system. Usually, a startup procedure has to be invoked, like

mpirun -np P PMB-MPI1

(P being the number of processes to load; P=1 is allowed!). This will run all of PMB on a varying number of processes (2, 4, 8, ..., largest 2^x < P, and finally P itself).
3 Overview of PMB-MPI1

3.1 General

The idea of PMB is to provide a concise set of elementary MPI benchmark kernels. With one executable, all of the supported benchmarks, or a subset specified by the command line, can be run. The rules, such as time measurement (including a repetitive call of the kernels for better clock synchronization), message lengths, and the selection of communicators to run a particular benchmark (inside the group of all started processes), are program parameters. PMB has a standard and an optional configuration. In the standard case, all parameters mentioned above are fixed and must not be changed. For certain systems, it may be interesting to extend the results tables (in particular, to run larger message sizes than provided in the standard case). For this, the user can set certain parameters at their own choice; see 5.2.1. The minimum number P_min and the maximum number P of processes can be selected by the user via the command line; the benchmarks then run on P_min, 2·P_min, 4·P_min, ..., largest 2^x·P_min < P, and finally P processes.


3.2 The Benchmarks

The current version of PMB-MPI1 contains the benchmarks
· PingPong
· PingPing
· Sendrecv
· Exchange
· Bcast
· Allgather
· Allgatherv
· Alltoall
· Reduce
· Reduce_scatter


· Allreduce
· Barrier

The exact definitions will be given in section 4; section 5 describes the benchmark methodology. PMB-MPI1 allows for running all benchmarks in more than one process group. E.g., when running PingPong on N >= 4 processes, on user request (see 5.1.2.3) N/2 disjoint groups of 2 processes each will be formed, all simultaneously running PingPong. Note that these multiple versions have to be carefully distinguished from their standard equivalents. They will be called
· Multi-PingPong
· Multi-PingPing
· Multi-Sendrecv
· Multi-Exchange
· Multi-Bcast
· Multi-Allgather
· Multi-Allgatherv
· Multi-Alltoall
· Multi-Reduce
· Multi-Reduce_scatter
· Multi-Allreduce
· Multi-Barrier

To distinguish them, we will sometimes refer to the standard (non-Multi) benchmarks as primary benchmarks. Interpreting the timings of the Multi benchmarks is easy, given a definition for the primary cases: per group, the definition is as in the standard case; finally, the maximum timing (minimum throughput) over all groups is displayed. On request, all per-group information can be reported; see 5.1.2.3 and 6.3.

3.3 Version changes

3.3.1 Version 2.1 vs. 2.0
· Alltoall added (see 4.4.6)
· Optional settings mode included (see 5.2.1)

3.3.2 Version 2.2 vs. 2.1

Default variable initializations (function set_default) were added (March 2000).

3.4 PMB-MPI1 vs. PMB1.x Definitions

Compared to older PMB1.x releases, all primary benchmark names except Sendrecv and Alltoall are the same. Cshift and Xover of PMB1.x have been removed. PMB1.x did not support the Multi versions.


Most importantly, certain definitions have changed; please check carefully. Table 2 gives an overview of the changes.

3.4.1 Changed Definitions

The main changes are in PingPing and Exchange. PingPing even uses a different pattern (elementary messages rather than Sendrecv, see 4.3.1). Moreover, the scaling of timings and throughput data has changed.
                   Timings                 Throughputs
                   PMB-MPI1   PMB1.x       PMB-MPI1   PMB1.x
PingPing           1          0.5          1          2
Exchange           1          0.25         4          4

Table 1: Scaling factors PMB-MPI1 vs. PMB1.x

Thus, the corresponding tables show different values when comparing PMB1.x and PMB-MPI1 on a particular system. The PMB1.x scaling factors for the timings give confusing answers for the startup components (small message sizes). The scaling of PingPing throughputs by 2 is reasonable (bi-directional throughput); however, PMB imposes a different interpretation, leaving the bi-directional throughput measurement to the Sendrecv benchmark. Sendrecv and Alltoall are new benchmarks. Functionally, Sendrecv exactly corresponds to the Cshift benchmark of PMB1.x, but displays timings with a different scaling. The Xover benchmark of PMB1.x has been removed, as it has shown no significant information on any tested system.

3.4.2 Throughput Calculations

Throughput results are based on real MBytes (1048576 bytes) in PMB-MPI1, in contrast to PMB1.x, which used 1 MByte = 1000000 bytes. In contrast to PMB1.x, PMB-MPI1 does not display throughput values for the global operations Bcast, Allgather and Allgatherv.

3.4.3 Corrected Methodology

PMB1.x was not cleanly defined in the case that a certain benchmark was run in a process group strictly smaller than the group of all started MPI processes. In PMB-MPI1, all non-active processes wait for the active ones in an MPI_Barrier(MPI_COMM_WORLD). In PMB1.x, non-active processes immediately went through the output-collecting phase (MPI_Gather) and then, eventually, switched to the following benchmark(s). This could induce unpredictable obstructions of the active processes. The MPI_Barrier may do so as well, but now the behavior is well defined and reasonable. See 5 for precise definitions of the methodology.


PMB-MPI1           Contained in     Compared to PMB1.x, in PMB-MPI1 there is
benchmark name     releases 1.x
---------------------------------------------------------------------------
PingPong           yes              slight change in throughput data due to
                                    re-definition of 1 MByte = 1048576 bytes
PingPing           yes              other pattern, no scaling; expectation:
                                    timings doubled, throughputs halved
Sendrecv           -                (new benchmark)
Exchange           yes              4 fold timings, equal throughputs
Bcast              yes              no output of throughput data
Allgather          yes              no output of throughput data
Allgatherv         yes              no output of throughput data
Alltoall           -                (new benchmark)
Reduce             yes              no change
Reduce_scatter     yes              no change
Allreduce          yes              no change
Barrier            yes              no change

PMB1.x benchmarks that are no longer in PMB-MPI1:
Cshift             Sendrecv benchmark is a full substitute
Xover              has shown no significant information

Table 2: PMB-MPI1 vs. PMB1.x benchmarks

4 PMB-MPI1 Benchmark Definitions

In this chapter, the individual benchmarks are described. Here we focus on the elementary patterns of the benchmarks. The methodology of measuring these patterns (message lengths, sample repetition counts, timer, synchronization, number of processes and communicator management, display of results) is defined in chapters 5 and 6.

4.1 Benchmark Classification

For a clear structuring of the set of benchmarks, PMB introduces classes of benchmarks: Single Transfer, Parallel Transfer, and Collective. This classification refers to different ways of interpreting results and to a structuring of the code itself; it does not actually influence the way PMB is used. The classification also holds for the PMB-MPI2 part [3].


PMB-MPI1 benchmark classes:

Single Transfer     Parallel Transfer     Collective
PingPong            Sendrecv              Bcast
PingPing            Exchange              Allgather
                    Multi-PingPong        Allgatherv
                    Multi-PingPing        Alltoall
                    Multi-Sendrecv        Reduce
                    Multi-Exchange        Reduce_scatter
                                          Allreduce
                                          Barrier
                                          Multi-versions of these

4.1.1 Single Transfer Benchmarks

The benchmarks in this class focus on a single message transferred between two processes. For PingPong, this is the usual way of looking at it. In the PMB interpretation, PingPing measures the same as PingPong, under the particular circumstance that the message is obstructed by an oncoming one (sent simultaneously by the same process that receives it). Single Transfer benchmarks, roughly speaking, run in local mode: the particular pattern is purely local to the participating processes, and there is no concurrency with other message passing activity, so best-case message passing results are to be expected. Note that Single Transfer benchmarks only run with 2 active processes (see 3.4.3 and 5.2.2 for the definition of active). For PingPing, in contrast to PMB1.x and other code systems containing this benchmark, pure timings are reported and the throughput is related to a single message; the expected numbers are very likely between half and full PingPong throughput. With this, PingPing determines the throughput of messages under non-optimal conditions (namely, oncoming traffic). See 4.2.1 and 4.2.2 for exact definitions.

4.1.2 Parallel Transfer Benchmarks

These benchmarks focus on global-mode patterns: the activity at a certain process runs concurrently with that of other processes, and the benchmark measures message passing efficiency under global load. For the interpretation of Sendrecv and Exchange, more than one message per sample counts. For the throughput numbers, the total turnover (the number of sent plus received bytes) at a certain process is taken into account. E.g., for the case of 2 processes, Sendrecv becomes the bi-directional test: perfectly bi-directional systems are rewarded by a double PingPong throughput here. Thus, the throughputs are scaled by certain factors; see 4.3.1 and 4.3.2 for exact definitions. As to the timings, raw results without scaling are reported. The Multi mode secondarily introduces into this class
· Multi-PingPong
· Multi-PingPing
· Multi-Sendrecv


· Multi-Exchange

4.1.3 Collective Benchmarks

This class contains all benchmarks that are collective in the proper MPI sense. Not only the message passing power of the system is relevant here, but also the quality of the implementation. For simplicity, we also include the Multi versions of these benchmarks in this class. Raw timings and no throughput are reported. Note that certain collective benchmarks (namely the reductions) play a particular role, as they are not pure message passing tests but also depend on an efficient implementation of certain numerical operations.

4.2 Definition of Single Transfer Benchmarks

This section describes the single transfer benchmarks in detail. Each benchmark is run with varying message lengths X bytes, and timings are averaged over multiple samples. See 5 for the description of the methodology. Here we describe the view of one single sample, with a fixed message length X bytes. The basic MPI datatype for all messages is MPI_BYTE. Throughput values are defined in a MBytes/sec = 2^20 bytes/sec scale (i.e. throughput = X / 2^20 * 10^6 / time = X / 1.048576 / time, when time is in µsec).
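As a small illustration of this conversion (a sketch, not part of the PMB sources; the function name is made up):

/* Sketch (not part of PMB): convert a message size X in bytes and a time in
   microseconds into the MBytes/sec value reported by PMB, with 1 MByte
   defined as 2^20 = 1048576 bytes.                                        */
static double pmb_throughput(double x_bytes, double time_usec)
{
    return x_bytes / 1.048576 / time_usec;   /* = X / 2^20 * 10^6 / time */
}

For example, X = 1048576 bytes transferred in 10000 µsec gives 100 MBytes/sec.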


4.2.1 PingPong

PingPong is the classical pattern used for measuring startup and throughput of a single message sent between two processes.

measured pattern       As symbolized in Figure 1; two active processes only
                       (Q=2, see 5.2.2)
based on               MPI_Send, MPI_Recv
MPI_Datatype           MPI_BYTE
reported timings       time = t/2 (in µsec) as indicated in Figure 1
reported throughput    X/1.048576/time

Figure 1: PingPong pattern (process 1 sends X bytes to process 2 via MPI_Send; process 2 receives them with MPI_Recv and sends X bytes back; the round-trip time t is measured on process 1 and t/2 is reported)
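To make the pattern concrete, here is a minimal, self-contained PingPong sketch. It is not the PMB source (PMB additionally repeats the pattern many times and averages, see section 5); the buffer size and naming are illustrative only.

/* Minimal PingPong sketch (not the PMB source): rank 0 sends X bytes to
   rank 1 and waits for the echo; half the measured round-trip time t/2
   and the throughput X/1.048576/time are reported.                      */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, X = 1024;                    /* message length in bytes */
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(X);

    if (rank == 0) {
        double t = MPI_Wtime();
        MPI_Send(buf, X, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf, X, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
        double usec = (MPI_Wtime() - t) / 2.0 * 1.0e6;
        printf("t/2 = %.2f usec, %.2f MBytes/sec\n",
               usec, X / 1.048576 / usec);
    } else if (rank == 1) {
        MPI_Recv(buf, X, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
        MPI_Send(buf, X, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}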


4.2.2 PingPing

Like PingPong, PingPing measures startup and throughput of single messages, with the crucial difference that messages are obstructed by oncoming messages. For this, two processes communicate with each other (MPI_Isend/MPI_Recv/MPI_Wait), with the MPI_Isend's issued simultaneously.

measured pattern       As symbolized in Figure 2; two active processes only
                       (Q=2, see 5.2.2)
based on               MPI_Isend/MPI_Wait, MPI_Recv
MPI_Datatype           MPI_BYTE
reported timings       time = t (in µsec) as indicated in Figure 2
reported throughput    X/1.048576/time

Figure 2: PingPing pattern (both processes simultaneously issue an MPI_Isend of X bytes to each other, then call MPI_Recv and MPI_Wait on their own request; the time t of this sequence is reported)
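A minimal sketch of one PingPing sample (not the PMB source; it assumes MPI is already initialized and exactly the two active ranks 0 and 1 of comm call the function):

#include <mpi.h>

/* One PingPing sample (sketch): both ranks post MPI_Isend first, so the two
   X-byte messages are oncoming; the returned time t (usec) is what PMB
   reports, and the throughput refers to a single message: X/1.048576/t.  */
double pingping_sample(void *sbuf, void *rbuf, int X, int rank, MPI_Comm comm)
{
    int peer = 1 - rank;               /* the other active process */
    MPI_Request req;
    MPI_Status  st;
    double t = MPI_Wtime();

    MPI_Isend(sbuf, X, MPI_BYTE, peer, 0, comm, &req);
    MPI_Recv (rbuf, X, MPI_BYTE, peer, 0, comm, &st);
    MPI_Wait (&req, &st);

    return (MPI_Wtime() - t) * 1.0e6;
}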


4.3 Definition of Parallel Transfer Benchmarks

This section describes the parallel transfer benchmarks in detail. Each benchmark is run with varying message lengths X bytes, and timings are averaged over multiple samples. See 5 for the description of the methodology. Here we describe the view of one single sample, with a fixed message length X bytes. The basic MPI datatype for all messages is MPI_BYTE. The throughput calculations of the benchmarks described here take into account the (per sample) multiplicity nmsg of messages outgoing from or incoming at a particular process. In the Sendrecv benchmark, a particular process sends and receives X bytes, so the turnover is 2X bytes and nmsg=2. In the Exchange case, we have 4X bytes turnover, nmsg=4. Throughput values are defined in a MBytes/sec = 2^20 bytes/sec scale (i.e. throughput = nmsg*X / 2^20 * 10^6 / time = nmsg*X / 1.048576 / time, when time is in µsec).


4.3.1 Sendrecv

Based on MPI_Sendrecv, the processes form a periodic communication chain. Each process sends to the right and receives from the left neighbor in the chain. The turnover count is 2 messages per sample (1 in, 1 out) for each process. Sendrecv is equivalent to the Cshift benchmark and, in the case of 2 processes, to the PingPing benchmark of PMB1.x. For 2 processes, it reports the bi-directional bandwidth of the system, as obtained by the (optimized) MPI_Sendrecv function.

measured pattern       As symbolized in Figure 3
based on               MPI_Sendrecv
MPI_Datatype           MPI_BYTE
reported timings       time = t (in µsec) as indicated in Figure 3
reported throughput    2X/1.048576/time

Figure 3: Sendrecv pattern (periodic chain: each process I calls MPI_Sendrecv, sending X bytes to process I+1 and receiving X bytes from process I-1; the time t per sample is reported)
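A minimal sketch of one Sendrecv sample along the periodic chain (not the PMB source; it assumes MPI is already initialized and all active ranks of comm call the function):

#include <mpi.h>

/* One Sendrecv sample (sketch): send X bytes to the right neighbor and
   receive X bytes from the left neighbor of the periodic chain. PMB would
   report the time t (usec) and the throughput 2X/1.048576/t (turnover 2X). */
double sendrecv_sample(void *sbuf, void *rbuf, int X, MPI_Comm comm)
{
    int rank, np;
    MPI_Status st;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &np);
    int right = (rank + 1) % np;       /* destination in the chain */
    int left  = (rank - 1 + np) % np;  /* source in the chain      */

    double t = MPI_Wtime();
    MPI_Sendrecv(sbuf, X, MPI_BYTE, right, 0,
                 rbuf, X, MPI_BYTE, left,  0, comm, &st);
    return (MPI_Wtime() - t) * 1.0e6;
}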


4.3.2 Exchange

Exchange is a communications pattern that often occurs in grid splitting algorithms (boundary exchanges). The group of processes is seen as a periodic chain, and each process exchanges data with both left and right neighbor in the chain. The turnover count is 4 messages per sample (2 in, 2 out) for each process.
measured pattern       As symbolized in Figure 4
based on               MPI_Isend/MPI_Waitall, MPI_Recv
MPI_Datatype           MPI_BYTE
reported timings       time = t (in µsec) as indicated in Figure 4
reported throughput    4X/1.048576/time

Figure 4: Exchange pattern (periodic chain: each process issues two MPI_Isend's, one to each neighbor, receives from both neighbors with MPI_Recv, and completes the sends with MPI_Waitall; each message carries X bytes and the time t per sample is reported)
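A minimal sketch of one Exchange sample (not the PMB source; it assumes MPI is already initialized and all active ranks of comm call the function):

#include <mpi.h>

/* One Exchange sample (sketch): two MPI_Isend's (one per neighbor of the
   periodic chain), two MPI_Recv's, then MPI_Waitall. PMB would report the
   time t (usec) and the throughput 4X/1.048576/t (turnover 4X bytes).     */
double exchange_sample(void *sbuf1, void *sbuf2, void *rbuf, int X,
                       MPI_Comm comm)
{
    int rank, np;
    MPI_Request req[2];
    MPI_Status  sts[2], st;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &np);
    int right = (rank + 1) % np;
    int left  = (rank - 1 + np) % np;

    double t = MPI_Wtime();
    MPI_Isend(sbuf1, X, MPI_BYTE, left,  0, comm, &req[0]);
    MPI_Isend(sbuf2, X, MPI_BYTE, right, 0, comm, &req[1]);
    /* receiving twice into the same buffer is acceptable for a benchmark */
    MPI_Recv (rbuf,  X, MPI_BYTE, right, 0, comm, &st);
    MPI_Recv (rbuf,  X, MPI_BYTE, left,  0, comm, &st);
    MPI_Waitall(2, req, sts);
    return (MPI_Wtime() - t) * 1.0e6;
}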

4.4 Definition of Collective Benchmarks

This section describes the Collective benchmarks in detail. Each benchmark is run with varying message lengths X bytes, and timings are averaged over multiple samples. See 5 for the description of the methodology. Here we describe the view of one single sample, with a fixed message length X bytes. The basic MPI datatype for all messages is MPI_BYTE for the pure data movement functions, and MPI_FLOAT for the reductions. For all Collective benchmarks, only bare timings and no throughput data are displayed.


4.4.1 Reduce

Benchmark of the MPI_Reduce function. Reduces a vector of length L = X/sizeof(float) float items. The MPI datatype is MPI_FLOAT, the MPI operation is MPI_SUM. The root of the operation is changed cyclically, see 5.2.7. See also the remark at the end of 4.1.3.

measured pattern       MPI_Reduce
MPI_Datatype           MPI_FLOAT
MPI_Op                 MPI_SUM
root                   changing
reported timings       bare time
reported throughput    none

4.4.2 Reduce_scatter

Benchmark of the MPI_Reduce_scatter function. Reduces a vector of length L = X/sizeof(float) float items. The MPI datatype is MPI_FLOAT, the MPI operation is MPI_SUM. In the scatter phase, the L items are split as evenly as possible. Exactly, when

np = #processes, L = r*np+s (s = L mod np),

then the process with rank i gets r+1 items when i < s, and r items when i >= s.

measured pattern       MPI_Reduce_scatter
MPI_Datatype           MPI_FLOAT
MPI_Op                 MPI_SUM
reported timings       bare time
reported throughput    none
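For illustration, the even split described above corresponds to the following recvcounts computation (a sketch, not the PMB source; the helper name is made up):

/* Sketch: split L = X/sizeof(float) items as evenly as possible over np
   processes, as described above: ranks 0..s-1 get r+1 items, the rest r. */
static void even_split(int X, int np, int *recvcounts)
{
    int L = X / (int)sizeof(float);
    int r = L / np;
    int s = L % np;
    int i;

    for (i = 0; i < np; i++)
        recvcounts[i] = (i < s) ? r + 1 : r;
    /* recvcounts[] is then used as the recvcounts argument of
       MPI_Reduce_scatter */
}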

4.4.3 Allreduce

Benchmark of the MPI_Allreduce function. Reduces a vector of length L = X/sizeof(float) float items. The MPI datatype is MPI_FLOAT, the MPI operation is MPI_SUM. See also the remark at the end of 4.1.3.

measured pattern       MPI_Allreduce
MPI_Datatype           MPI_FLOAT
MPI_Op                 MPI_SUM
reported timings       bare time
reported throughput    none


4.4.4 Allgather

Benchmark of the MPI_Allgather function. Every process inputs X bytes and receives the gathered X*(#processes) bytes.

measured pattern       MPI_Allgather
MPI_Datatype           MPI_BYTE
reported timings       bare time
reported throughput    none

4.4.5 Allgatherv

Functionally the same as Allgather, however with the MPI_Allgatherv function. Shows whether MPI produces overhead due to the more complicated situation as compared to MPI_Allgather.

measured pattern       MPI_Allgatherv
MPI_Datatype           MPI_BYTE
reported timings       bare time
reported throughput    none

4.4.6 Alltoall

Benchmark of the MPI_Alltoall function. Every process inputs X*(#processes) bytes (X for each process) and receives X*(#processes) bytes (X from each process).

measured pattern       MPI_Alltoall
MPI_Datatype           MPI_BYTE
reported timings       bare time
reported throughput    none

4.4.7 Bcast

Benchmark of MPI_Bcast. A root process broadcasts X bytes to all. The root of the operation is changed cyclically, see 5.2.7.

measured pattern       MPI_Bcast
MPI_Datatype           MPI_BYTE
root                   changing
reported timings       bare time
reported throughput    none


4.4.8 Barrier

measured pattern       MPI_Barrier
reported timings       bare time
reported throughput    none

5 Benchmark Methodology

Recall that chapter 4 only defined the underlying pattern of each benchmark. In this section, the measuring method for those patterns is explained. Some control mechanisms are hard coded (like the selection of process numbers to run the benchmarks on), some are set by preprocessor parameters in a central include file. It is important that (in contrast to the previous release 2.0) there is a standard and an optional mode to control PMB. In standard mode, all configurable sizes are predefined and should not be changed; this assures comparability of result tables obtained in standard mode. In optional mode, the user can set those parameters at their own choice. For instance, this mode can be used to extend the result tables to larger message sizes. The following diagram gives an overview of the flow of control inside PMB. All emphasized items will be explained in more detail.

For ( all_selected_benchmarks )
    For ( all_selected_process_numbers )
        Select MPI communicator MY_COMM to run the benchmark (see 5.2.2)
        For ( all_selected_message_lengths X )   (see 5.2.3)
            Initialize communication buffers (see 5.2.4)
            If ( X == first_selected_message_length )
                Warm_up (see 5.2.5)
            If ( MY_COMM != MPI_COMM_NULL )
                Synchronize processes of MY_COMM (see 5.2.6)
                Execute benchmark (message size = X) (see 5.2.7)
            MPI_Barrier (MPI_COMM_WORLD)
            Output results (see 6)

Figure 5: Control flow in PMB

The necessary control parameters are either command line arguments (see 5.1) or parameter selections inside the central PMB include file settings.h (see 5.2).


5.1 Running PMB, Command Line Control

After installation (see 2.2), an executable PMB-MPI1 should exist. Given P, the (normally user selected) number of MPI processes to run PMB-MPI1, a startup procedure has to load the parallel PMB-MPI1. Let's assume, for the sake of simplicity, that this is done by

mpirun -np P PMB-MPI1 [arguments]

P=1 is allowed; it will be ignored only by the Single Transfer benchmarks. Control arguments (in addition to P) can be passed to PMB-MPI1 via (argc,argv), which will be read by PMB-MPI1 process 0 (in MPI_COMM_WORLD ranking) and then distributed to all processes.

5.1.1 Default Case

Just invoke

mpirun -np P PMB-MPI1

All primary (non-Multi) benchmarks will run on

Q = 2, 4, 8, ..., largest 2^x < P, P

(e.g. for P=11, the process numbers 2, 4, 8, 11 will be selected). The Q <= P processes running the benchmark are called active processes. A communicator is formed out of a group of Q processes (see 5.2.2), which is used as the communicator argument to the MPI functions crucial for the benchmark.
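As an illustration of this selection rule (a sketch, not the PMB source):

/* Sketch: print the default sequence of active process numbers
   Q = 2, 4, 8, ..., largest 2^x < P, P for a given P.            */
#include <stdio.h>

int main(void)
{
    int P = 11, Q;                     /* e.g. mpirun -np 11 PMB-MPI1 */

    for (Q = 2; Q < P; Q *= 2)
        printf("%d ", Q);              /* powers of two below P */
    printf("%d\n", P);                 /* finally P itself; prints: 2 4 8 11 */
    return 0;
}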

5.1.2 Command Line Control

The general syntax is

mpirun -np P PMB-MPI1 [Benchmark1 [Benchmark2 [ ... ] ] ]
                      [-npmin P_min] [-multi Outflag] [-input ]

(where the 4 major [ ] may appear in any order). Examples:

mpirun -np 8  PMB-MPI1
mpirun -np 10 PMB-MPI1 PingPing Reduce
mpirun -np 11 PMB-MPI1 -npmin 5
mpirun -np 4  PMB-MPI1 -npmin 4 -input PMB_SELECT_MPI1
mpirun -np 14 PMB-MPI1 -multi 0 PingPong Barrier -npmin 7

5.1.2.1 Benchmark Selection Arguments
A set of blank-separated strings, each being the name of one primary (non Multi) PMB benchmark (in exact spelling, case insensitive). Default (no benchmark selection): select all primary benchmark names. Given a name selection, either


· the -multi flag is missing, in which case all selected primary benchmarks are run, or
· the -multi flag is selected, and then the Multi versions of all selected benchmarks are executed.

5.1.2.2 -npmin Selection

The argument after -npmin has to be an integer P_min, specifying the minimum number of processes to run all selected benchmarks.
· P_min may be 1
· P_min > P is handled as P_min = P