Heterogeneous networks and the ch_p4 device
A heterogeneous network of workstations is one in which the machines
connected by the network have different architectures and/or operating
systems. For example, a network may contain 3 Sun SPARC (sun4)
workstations and 3 SGI IRIX workstations, all of which communicate via
the TCP/IP protocol. The mpirun command may be told to use all of these with
    mpirun -arch sun4 -np 3 -arch IRIX -np 3 program.%a

While the ch_p4 device supports communication between workstations in heterogeneous TCP/IP networks, it does not allow the coupling of multiple multicomputers. To support such a configuration, you should use the ch_nexus device. See the following section for details.
The special program name program.%a allows you to specify the different
executables for the program, since a Sun executable won't run on an SGI
workstation and vice versa. The %a is replaced with the architecture
name; in this example, program.sun4 runs on the Suns and
program.IRIX runs on the SGI IRIX workstations. You can also put the
programs into different directories; for example,
    mpirun -arch sun4 -np 3 -arch IRIX -np 3 /tmp/%a/program

For even more control over how jobs get started, we need to look at how mpirun starts a parallel program on a workstation cluster. Each time mpirun runs, it constructs and uses a new file of machine names for just that run, using the machines file as input. (The new file is called PIyyyy, where yyyy is the process identifier.) If you specify -keep_pg on your mpirun invocation, you can use this information to see where mpirun ran your last few jobs. You can construct this file yourself and specify it as an argument to mpirun. To do this for ch_p4, use
    mpirun -p4pg pgfile myprog

where pgfile is the name of the file. The file format is defined below.
Writing this file yourself is necessary when you want closer control over the hosts you run on, or when mpirun cannot construct it automatically. Such is the case when
- You want to run on a different set of machines than those listed in the machines file.
- You want to run different executables on different hosts (your program is not SPMD).
- You want to run on a heterogeneous network, which requires different executables.
- You want to run all the processes on the same workstation, simulating parallelism by time-sharing one machine.
- You want to run on a network of shared-memory multiprocessors and need to specify the number of processes that will share memory on each machine. This is only a benefit with the ch_p4 device. Nexus is currently developing a shared memory module that should be available in its next release.
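For the heterogeneous case in the list above, it can help to see which executable each host would need. The following is a minimal sketch, assuming that the %a substitution behaves like plain string replacement, as the description of program.%a suggests. The helper names and the host-to-architecture table (including the host name sgi1) are hypothetical and are not part of mpich.

```python
# Sketch only: mimics the %a substitution described for mpirun.
# expand_arch, programs_for_hosts and HOST_ARCH are hypothetical names.

def expand_arch(template, arch):
    """Replace every %a in the template with the architecture name."""
    return template.replace("%a", arch)


# Assumed example network: which architecture each host has.
HOST_ARCH = {"sun1": "sun4", "sun2": "sun4", "sgi1": "IRIX"}


def programs_for_hosts(template, host_arch):
    """Map each host to the program path its architecture requires."""
    return {host: expand_arch(template, arch)
            for host, arch in host_arch.items()}


if __name__ == "__main__":
    for host, prog in programs_for_hosts("/tmp/%a/program", HOST_ARCH).items():
        print(host, prog)
```

The resulting paths are the ones you would list, host by host, when writing the file by hand.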
The format of a ch_p4 procgroup file is a set of lines of the form
    <hostname> <#procs> <progname> [<login>]

An example of such a file, where the command is being issued from host sun1, might be
    sun1 0 /users/jones/myprog
    sun2 1 /users/jones/myprog
    sun3 1 /users/jones/myprog
    hp1 1 /home/mbj/myprog mbj

The above file specifies four processes, one on each of three Suns and one on another workstation where the user's account name is different. Note the 0 in the first line. It indicates that no processes are to be started on host sun1 other than the one started directly by the user's command.
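If you generate such files from a script, the line format above is easy to produce. The following is a minimal sketch and not an mpich utility. The Entry type and format_procgroup function are hypothetical names, and the sketch assumes fields are separated by single spaces.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    """One procgroup line: <hostname> <#procs> <progname> [<login>]."""
    host: str
    nprocs: int
    program: str
    login: Optional[str] = None  # only needed when the account name differs


def format_procgroup(entries: List[Entry]) -> str:
    """Render the entries as procgroup text, one line per entry."""
    lines = []
    for e in entries:
        fields = [e.host, str(e.nprocs), e.program]
        if e.login:
            fields.append(e.login)
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"


# The example from the text: issued from sun1, so its count is 0.
EXAMPLE = [
    Entry("sun1", 0, "/users/jones/myprog"),
    Entry("sun2", 1, "/users/jones/myprog"),
    Entry("sun3", 1, "/users/jones/myprog"),
    Entry("hp1", 1, "/home/mbj/myprog", "mbj"),
]

if __name__ == "__main__":
    print(format_procgroup(EXAMPLE), end="")
```

Writing the returned text to a file gives something that can be passed to mpirun with -p4pg.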
You might want to run all the processes on your own machine, as a test.
You can do this by repeating its name in the file:
    sun1 0 /users/jones/myprog
    sun1 1 /users/jones/myprog
    sun1 1 /users/jones/myprog

This will run three processes on sun1, communicating via sockets.
To run on a shared-memory multiprocessor, with 10 processes, you would use
a file like:
    sgimp 9 /u/me/prog

Note that this is for 10 processes, one of them started by the user directly, and the other nine specified in this file. This requires that mpich was configured with the option -comm=shared; see the installation manual for more information.
If you are logged into host gyrfalcon and want to start a job with
one process on gyrfalcon and three processes on alaska, where
the alaska processes communicate through shared memory, you would use
    local 0 /home/jbg/main
    alaska 3 /afs/u/graphics
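The process counts in these examples follow one rule: the total is the process the user starts directly plus the sum of the counts in the file. The following minimal sketch checks that arithmetic, assuming each count is the number of additional processes to start on that line's host, as the examples above indicate. The function name total_processes is hypothetical.

```python
def total_processes(procgroup_text):
    """Return the number of processes a procgroup file describes.

    Assumes the second field of each line is the number of processes
    to start in addition to the one the user launches directly.
    """
    total = 1  # the process started by the user's own command
    for line in procgroup_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        total += int(fields[1])
    return total


if __name__ == "__main__":
    print(total_processes("sgimp 9 /u/me/prog\n"))
    print(total_processes("local 0 /home/jbg/main\nalaska 3 /afs/u/graphics\n"))
```

For the shared-memory file this gives 10, and for the gyrfalcon and alaska file it gives 4, matching the descriptions in the text.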
Using special switches
In some installations, certain
hosts can be connected in multiple ways. For example, the "normal" Ethernet
may be supplemented by a high-speed FDDI ring. Usually, alternate host names
are used to identify the high-speed connection. All you need to do is put
these alternate names in your machines/machines.xxxx file.
In this case, it is important not to use the form local 0 but to use
the name of the local host. For example, if hosts host1 and
host2 have ATM connections named host1-atm and host2-atm
respectively, the correct ch_p4 procgroup file to connect them
(running the program /home/me/a.out) is
    host1-atm 0 /home/me/a.out
    host2-atm 1 /home/me/a.out
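An existing procgroup file can be converted to the alternate names mechanically. The following is a minimal sketch, assuming you supply the mapping from ordinary host names to alternate names yourself. The function use_alternate_names and its parameters are hypothetical and are not part of mpich.

```python
def use_alternate_names(procgroup_text, alt_names, local_host):
    """Rewrite the host field of each procgroup line.

    'local' is first replaced by local_host, since the form 'local 0'
    should not be used here; the host is then mapped through alt_names.
    Hosts without an entry in alt_names are left unchanged.
    """
    out = []
    for line in procgroup_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        host = local_host if fields[0] == "local" else fields[0]
        fields[0] = alt_names.get(host, host)
        out.append(" ".join(fields))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    original = "local 0 /home/me/a.out\nhost2 1 /home/me/a.out\n"
    atm = {"host1": "host1-atm", "host2": "host2-atm"}
    print(use_alternate_names(original, atm, "host1"), end="")
```

Run on host1 with the mapping shown, this reproduces the two-line file above.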