How to run jobs on Ravana
Here is what you need to do to run a parallel (MPI) job on Ravana.
To compile:
This is the same as on any machine, although you do need to know the name of
the compiler that handles parallel code. (You also need to include the MPI
header file "mpi.h" in your code, but the compiler finds this automatically,
and nothing additional is required on the link line.)
The C++ compiler is called "mpicxx". If you use Fortran, you'll have to ask
someone what the compiler is called.
An example of a compile/link is:
mpicxx -O3 -c main.cc
mpicxx -O3 -c qobjects.cc
mpicxx -o exec main.o qobjects.o -lm
An example of the corresponding makefile is:
# special variables
CC = mpicxx
OFLAGS = -O3
LFLAGS =
LIBRARIES = -lm
# other variables
MAINFILE = main-mpi
OTHERS = qobjects
OBJS = $(MAINFILE:%=%.o) $(OTHERS:%=%.o)
# pattern rules to define implicit rule
%.o: %.cc
	mpicxx -c $(OFLAGS) $<
%.o: %.cpp
	mpicxx -c $(OFLAGS) $<
%.o: %.f
	mpif77 -c $<
# default rule(s)
all: $(MAINFILE)
$(MAINFILE): $(OBJS)
	$(LINK.c) $(LFLAGS) -o exec $^ $(LIBRARIES)
# overridden implicit rules (ie. extra dependencies)
main-mpi.o: main-mpi.cc qobjects.h Makefile
qobjects.o: qobjects.cc qobjects.h Makefile
Submitting the job:
To submit the job you use the command
qsub subscript
where "subscript" is a text file that tells "qsub" what to run, and how many
processors to use. An example submission script file is:
#!/bin/sh
#PBS -N myjob
#PBS -e myjob.err
#PBS -o myjob.log
#PBS -r n
#PBS -q high
#PBS -l nodes=8:ppn=4
# diagnostics
echo
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo This job runs on the following processors:
echo `cat $PBS_NODEFILE`
# calculate number of processors
NPROCS=`wc -l < $PBS_NODEFILE`
echo This job has allocated $NPROCS processors
echo
# Run the parallel MPI executable
mpirun -machinefile $PBS_NODEFILE -np $NPROCS ./exec
This requires some explanation. The last line of the above file is the line
that actually runs the code.
To run parallel code you use the command "mpirun", and the executable being
run in this example is called "exec".
This submit script requests 32 processors (nodes=8, processors per node = 4).
Ravana has 31 compute nodes, with 4 effective processors per node, for a
total of 124 processors.
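
The processor arithmetic above can be checked with a few lines of shell. This
is only a sketch: the REQUEST string mirrors the "#PBS -l" line of the example
script, the cluster size (31 nodes, 4 processors each) is taken from the text,
and the variable names are made up for illustration.

```shell
#!/bin/sh
# Resource request, copied from the "#PBS -l" line of the example script
REQUEST="nodes=8:ppn=4"

# Pull the two numbers out of the request string
NODES=$(echo "$REQUEST" | sed 's/.*nodes=\([0-9]*\).*/\1/')
PPN=$(echo "$REQUEST" | sed 's/.*ppn=\([0-9]*\).*/\1/')

# Processors requested, and processors available on Ravana (31 nodes x 4)
TOTAL=$((NODES * PPN))
MAX=$((31 * 4))

echo "requested $TOTAL of $MAX processors"
if [ "$TOTAL" -gt "$MAX" ]; then
  echo "request exceeds the size of the cluster"
fi
```

For the example request this reports 32 of 124 processors.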
This script runs the job on the queue called "high". Each queue has different
restrictions on the time the job can be run for, and may have a restriction
on the maximum number of processors. To see all the queues, and their names,
type "qstat -q".
The batch system selects 8 nodes, writes one entry per allocated processor
(32 entries in all) into a "nodefile", and the script then passes this
nodefile to the mpirun command (the name of the nodefile is stored in
"$PBS_NODEFILE").
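
To see what the script does with the nodefile, the following sketch builds a
stand-in nodefile by hand and counts its lines the same way the submission
script does. The node names and the temporary file are assumptions for
illustration; under PBS the real file is the one named by $PBS_NODEFILE.

```shell
#!/bin/sh
# Stand-in for $PBS_NODEFILE: one line per processor, 4 lines per node.
# node01 and node02 are hypothetical names.
NODEFILE=$(mktemp)
for node in node01 node02; do
  for slot in 1 2 3 4; do
    echo "$node"
  done
done > "$NODEFILE"

# Same count as in the submission script: lines = processors
NPROCS=$(wc -l < "$NODEFILE" | tr -d ' ')
# Distinct node names = nodes actually in use
NNODES=$(sort -u "$NODEFILE" | wc -l | tr -d ' ')

echo "processors: $NPROCS"
echo "nodes: $NNODES"
rm -f "$NODEFILE"
```

The line count is what ends up in the "-np" argument of mpirun, which is why
it is the number of processors rather than the number of nodes.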
To check that your job (or jobs) is running correctly after you submit it,
type "qstat -a". This will return something like:
master.cl.umb.edu:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
305.master.cl.umb.ed kaj      high     myjob          --    32   1     -- 24:00 R  0:00
The "R" indicates the job is running, rather than merely waiting in the queue.
To kill this job one would type "qdel 305".
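
If you want to pick the job state out of this listing in a script, a sketch
along these lines may help. It works on a copy of the sample row above rather
than calling qstat, and it assumes the state is the tenth whitespace-separated
field, as in that row. The variable names are illustrative.

```shell
#!/bin/sh
# Sample row from the "qstat -a" listing, with the columns collapsed
LINE="305.master.cl.umb.ed kaj high myjob -- 32 1 -- 24:00 R 0:00"

# Field 10 is the state column (S) in this layout
STATE=$(echo "$LINE" | awk '{print $10}')
# The job number is everything before the first dot of the job ID
JOBNUM=${LINE%%.*}

case "$STATE" in
  R) echo "job $JOBNUM is running" ;;
  Q) echo "job $JOBNUM is waiting in the queue" ;;
  *) echo "job $JOBNUM is in state $STATE" ;;
esac
```

The number in JOBNUM is the one you would pass to qdel.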
Things you may need to change by hand:
Sometimes a node might go down, and qsub won't know, resulting in your job
failing because it requests that node. Instead of requesting a reboot of the
machine (if it's the weekend, for example), you can get around the problem by
specifying your own "machine file".
A machine file is just a list of the nodes you want to run on, with one entry
for each of the processors you want to run on. An example machine file is:
node25
node25
node25
node25
node10
node10
node10
node10
A code that uses this file will run on two nodes (25 and 10), with 4
processors on each node.
If you call the machine file "mymachfile", then to use it you just alter the
last line of the submission script:
mpirun -machinefile mymachfile -np 8 ./exec
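
Typing out one line per processor gets tedious for larger runs. As a sketch,
the helper below (make_machfile is a made-up name, not part of MPI or PBS)
writes a machine file with a given number of entries per node, then prints
the matching mpirun command instead of running it.

```shell
#!/bin/sh
# make_machfile PPN NODE... : print each node name PPN times, one per line
# (hypothetical helper, for illustration only)
make_machfile() {
  ppn=$1
  shift
  for node in "$@"; do
    i=0
    while [ "$i" -lt "$ppn" ]; do
      echo "$node"
      i=$((i + 1))
    done
  done
}

# Two nodes with 4 processors each, as in the example machine file
make_machfile 4 node25 node10 > mymachfile

# The number of lines in the file is the value to give to -np
NP=$(wc -l < mymachfile | tr -d ' ')
echo "mpirun -machinefile mymachfile -np $NP ./exec"
```

Counting the lines of the file keeps the "-np" value in step with the machine
file if you later add or drop a node.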