Workstation clusters

Most massively parallel processors (MPPs) provide a way to start a program on a requested number of processors; mpirun makes use of the appropriate command whenever possible. In contrast, workstation clusters require that each process in a parallel job be started individually, though programs to help start these processes exist (see Using the Secure Server below). Because workstation clusters are not already organized as an MPP, additional information is required to make use of them. MPICH should be installed with a list of participating workstations in the file machines.<arch> in the directory /usr/local/mpich/util/machines. mpirun uses this file to choose the processors to run on. (Using heterogeneous clusters is discussed below.) The rest of this section discusses some of the details of this process and how you can check for problems. These instructions apply only to the ch_p4 device.
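
As an illustration of working with the machines file, the sketch below assumes that the file holds one host name per line and that text after a # is a comment; both are assumptions about the file layout, which this section does not spell out. The helper name list_hosts and the sample file are hypothetical and stand in for machines.<arch>.

```shell
# list_hosts FILE
# Print the host entries of a machines file, one per line.
# Assumption: one host per line; '#' starts a comment; blank lines are ignored.
list_hosts() {
    sed -e 's/#.*$//' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d' "$1"
}

# Example with a throwaway file standing in for machines.<arch>:
sample=${TMPDIR:-/tmp}/machines.sample.$$
cat > "$sample" <<'EOF'
# hosts available for ch_p4 jobs
host1.uoffoo.edu
host2.uoffoo.edu
EOF
list_hosts "$sample"
rm -f "$sample"
```

A listing like this is a quick way to confirm which hosts mpirun has to choose from before running the checks described in the next section.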

Checking your machines list

Use the script tstmachines in /usr/local/mpich/build/<arch>/<device>/bin to ensure that you can use all of the machines that you have listed. This script performs an rsh and a short directory listing; this tests both that you have access to the node and that a program in the current directory is visible on the remote node. Any problems are listed and must be fixed before proceeding.

The only argument to tstmachines is the name of the architecture; this is the same name as the extension on the machines file. For example,

    /usr/local/mpich/build/sun4/ch_p4/bin/tstmachines sun4

tests that a program in the current directory can be executed by all of the machines in the sun4 machines list. This program is silent if all is well; if you want to see what it is doing, use the -v (for verbose) argument:

    /usr/local/mpich/build/sun4/ch_p4/bin/tstmachines -v sun4

The output from this command might look like

    Trying true on host1.uoffoo.edu ...
    Trying true on host2.uoffoo.edu ...
    Trying ls on host1.uoffoo.edu ...
    Trying ls on host2.uoffoo.edu ...
    Trying user program on host1.uoffoo.edu ...
    Trying user program on host2.uoffoo.edu ...
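
The first two checks can be approximated by hand. The following is a minimal sketch, not the tstmachines script itself: it assumes a machines file with one host per line, uses a hypothetical helper named check_hosts, and takes the remote-shell command from a variable RSH (an assumption made here so that another command can be substituted), defaulting to rsh. It does not attempt the "user program" step and lists the remote default directory rather than the current one.

```shell
# check_hosts FILE
# For each host in FILE, try a trivial remote command and a directory listing.
# Returns non-zero if any check fails. RSH names the remote-shell command.
check_hosts() {
    rsh_cmd=${RSH:-rsh}
    failed=0
    while read -r host; do
        # Skip blank lines and comment lines (assumed file layout).
        case $host in ''|'#'*) continue ;; esac
        for remote in true ls; do
            echo "Trying $remote on $host ..."
            # stdin is redirected so the remote shell cannot consume the host list.
            if ! $rsh_cmd "$host" "$remote" </dev/null >/dev/null 2>&1; then
                echo "  FAILED: $remote on $host"
                failed=1
            fi
        done
    done < "$1"
    return $failed
}
```

It could be invoked as check_hosts /usr/local/mpich/util/machines/machines.sun4; a non-zero exit status means at least one host needs attention.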

Using the Secure Shell

The Installation Guide explains how to set up your environment so that the ch_p4 device on networks will use the secure shell ssh instead of rsh. This is useful on networks where for security reasons the use of rsh is discouraged.

Using the Secure Server

Because each workstation in a cluster (usually) requires that a new user log into it, and because this process can be very time-consuming, MPICH provides a program that may be used to speed this process. This is the secure server, serv_p4, located in the directory /usr/local/mpich/build/<arch>/<device>/bin. The script chp4_servs in the same directory may be used to start serv_p4 on those workstations on which you can run programs with rsh. You can also start the server by hand and allow it to run in the background; this is appropriate on machines that do not accept rsh connections but on which you have accounts.

Before you start this server, check to see if the secure server has been installed for general use; if so, the same server can be used by everyone. In this mode, root access is required to install the server. If the server has not been installed, then you can install it for your own use without needing any special privileges with

    chp4_servs -port=1234

This starts the secure server on all of the machines listed in the file /usr/local/mpich/util/machines/machines.<arch>.

The port number, provided with the option -port=, must be different from any other port in use on the workstations.

To make use of the secure server for the ch_p4 device, add the following definitions to your environment:

    setenv MPI_USEP4SSPORT yes 
    setenv MPI_P4SSPORT 1234

The value of MPI_P4SSPORT must be the port with which you started the secure server. When these environment variables are set, mpirun attempts to use the secure server to start programs that use the ch_p4 device. (There are command line arguments to mpirun that can be used instead of these environment variables; mpirun -help will give you more information.)
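
The setenv commands above are for csh-style shells. For Bourne-style shells (sh, bash, ksh), equivalent settings would look like the following; the port 1234 is simply the example value used above.

```shell
# Bourne-shell equivalent of the csh setenv commands above.
MPI_USEP4SSPORT=yes
MPI_P4SSPORT=1234   # must match the -port= value used to start the secure server
export MPI_USEP4SSPORT MPI_P4SSPORT
```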

Heterogeneous networks and the ch_p4 device

A heterogeneous network of workstations is one in which the machines connected by the network have different architectures and/or operating systems. For example, a network may contain 3 Sun SPARC (sun4) workstations and 3 SGI IRIX workstations, all of which communicate via the TCP/IP protocol. The mpirun command may be told to use all of these with

    mpirun -arch sun4 -np 3 -arch IRIX -np 3 program.%a

While the ch_p4 device supports communication between workstations in heterogeneous TCP/IP networks, it does not allow the coupling of multiple multicomputers. To support such a configuration, you should use the globus device. See the section Computational grids: the globus device for details.

The special program name program.%a allows you to specify the different executables for the program, since a Sun executable won't run on an SGI workstation and vice versa. The %a is replaced with the architecture name; in this example, program.sun4 runs on the Suns and program.IRIX runs on the SGI IRIX workstations. You can also put the programs into different directories; for example,

    mpirun -arch sun4 -np 3 -arch IRIX -np 3 /tmp/%a/program
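
To make the %a substitution concrete, the sketch below mimics it with a hypothetical shell helper, expand_arch. It only illustrates the naming rule described above; it is not how mpirun itself is implemented, and it assumes architecture names contain no characters special to sed.

```shell
# expand_arch TEMPLATE ARCH
# Replace every %a in TEMPLATE with ARCH, following the naming rule
# described for mpirun program names.
expand_arch() {
    printf '%s\n' "$1" | sed "s|%a|$2|g"
}

expand_arch 'program.%a' sun4        # prints program.sun4
expand_arch '/tmp/%a/program' IRIX   # prints /tmp/IRIX/program
```

A helper like this can be handy for checking that every executable the mpirun command line implies actually exists before the job is started.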

For even more control over how jobs get started, we need to look at how mpirun starts a parallel program on a workstation cluster. Each time mpirun runs, it constructs and uses a new file of machine names for just that run, using the machines file as input. (The new file is called PIyyyy, where yyyy is the process identifier.) If you specify -keep_pg on your mpirun invocation, you can use this information to see where mpirun ran your last few jobs. You can construct this file yourself and specify it as an argument to mpirun. To do this for ch_p4, use

    mpirun -p4pg pgfile myprog

where pgfile is the name of the file. The file format is defined below.

Constructing the file yourself is necessary when you want closer control over the hosts you run on, or when mpirun cannot construct it automatically. Such is the case when:

You want to run on a different set of machines than those listed in the machines file.
You want to run different executables on different hosts (your program is not SPMD).
You want to run on a heterogeneous network, which requires different executables.
You want to run all the processes on the same workstation, simulating parallelism by time-sharing one machine.
You want to run on a network of shared-memory multiprocessors and need to specify the number of processes that will share memory on each machine.

The format of a ch_p4 procgroup file is a set of lines of the form

   <hostname>  <#procs>  <progname>  [<login>]

An example of such a file, where the command is being issued from host sun1, might be

    sun1   0  /users/jones/myprog 
    sun2   1  /users/jones/myprog 
    sun3   1  /users/jones/myprog 
    hp1    1  /home/mbj/myprog    mbj

The above file specifies four processes, one on each of three Suns and one on another workstation where the user's account name is different. Note the 0 in the first line. It indicates that no processes are to be started on host sun1 other than the one the user starts directly from the command line.
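
A procgroup file of this shape can be generated rather than typed. The following sketch assumes one process per remote host and the same program path on every host; make_pgfile is a hypothetical helper, not part of MPICH, and it does not emit the optional login field.

```shell
# make_pgfile PROGRAM LOCALHOST [REMOTEHOST ...]
# Write ch_p4 procgroup lines to standard output.
# The local host gets 0 additional processes; each remote host gets 1.
make_pgfile() {
    prog=$1
    localhost=$2
    shift 2
    printf '%s 0 %s\n' "$localhost" "$prog"
    for host in "$@"; do
        printf '%s 1 %s\n' "$host" "$prog"
    done
}

# Prints three lines: sun1 with 0, then sun2 and sun3 with 1 each.
make_pgfile /users/jones/myprog sun1 sun2 sun3
```

Redirecting the output to a file gives something that can be passed to mpirun with the -p4pg option described above.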

You might want to run all the processes on your own machine, as a test. You can do this by repeating its name in the file:

    sun1 0 /users/jones/myprog 
    sun1 1 /users/jones/myprog 
    sun1 1 /users/jones/myprog

This will run three processes on sun1, communicating via sockets.

To run on a shared-memory multiprocessor, with 10 processes, you would use a file like:

    sgimp  9  /u/me/prog

Note that this is for 10 processes, one of them started by the user directly, and the other nine specified in this file. This requires that MPICH was configured with the option -comm=shared; see the installation manual for more information.
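
The process count implied by a procgroup file follows from the rule just described: the sum of the <#procs> column plus the one process started directly by the user. A small sketch, using a hypothetical helper count_procs and assuming whitespace-separated fields with no comment lines:

```shell
# count_procs FILE
# Total processes = sum of the second column + 1 (the user-started process).
count_procs() {
    awk 'NF >= 3 { total += $2 } END { print total + 1 }' "$1"
}

# Example using the one-line shared-memory file shown above:
pg=${TMPDIR:-/tmp}/pgfile.sample.$$
printf 'sgimp  9  /u/me/prog\n' > "$pg"
count_procs "$pg"    # prints 10
rm -f "$pg"
```

Applied to the earlier four-line example (0, 1, 1, 1), the same rule gives four processes.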

If you are logged into host gyrfalcon and want to start a job with one process on gyrfalcon and three processes on alaska, where the alaska processes communicate through shared memory, you would use

    local    0  /home/jbg/main 
    alaska   3  /afs/u/graphics

Using special switches

In some installations, certain hosts can be connected in multiple ways. For example, the "normal" Ethernet may be supplemented by a high-speed FDDI ring. Usually, alternate host names are used to identify the high-speed connection. All you need to do is put these alternate names in your machines/machines.xxxx file. In this case, it is important not to use the form local 0 but to use the name of the local host. For example, if hosts host1 and host2 have ATM connections named host1-atm and host2-atm respectively, the correct ch_p4 procgroup file to connect them (running the program /home/me/a.out) is

    host1-atm 0 /home/me/a.out 
    host2-atm 1 /home/me/a.out
