Condor Job Example: Monte Carlo Calculation of π
By: Igor Senderovich
See also: Fortran version of this problem using a random number generator from IMSL.
The Problem and Code
Consider a simple job that is well suited to a cluster: comparing independent Monte Carlo calculations of π.
The following C program randomly samples points within a square bounding a circle.
(The probability of landing inside the circle can be shown to be π/4.)
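The π/4 figure follows from a ratio of areas. As a short worked derivation (the symbols R for the circle radius, N_in for the number of points landing inside the circle, and N for the total number of points are introduced here for illustration):

```latex
P(\text{inside}) = \frac{\pi R^2}{(2R)^2} = \frac{\pi}{4},
\qquad
\hat{\pi} = 4\,\frac{N_{\text{in}}}{N},
\qquad
\sigma_{\hat{\pi}} = 4\sqrt{\frac{p(1-p)}{N}} \approx \frac{1.64}{\sqrt{N}},
\quad p = \frac{\pi}{4}.
```

Sampling only the quadrant with x, y ≥ 0 gives the same ratio (a quarter circle of area πR²/4 inside a square of area R²). The last expression treats each sample as an independent Bernoulli trial, so the statistical error of a single run shrinks as 1/√N.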
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char *argv[])
{
    int i, N, incirc = 0;
    double x, y, circrad2;

    // get iteration count from the first argument
    if (argc < 2 || sscanf(argv[1], "%d", &N) != 1 || N <= 0) {
        fprintf(stderr, "usage: %s N\n", argv[0]);
        return 1;
    }
    srand(time(NULL));                          // seed random number generator
    circrad2 = 1.0 * RAND_MAX;
    circrad2 *= circrad2;                       // define radius squared
    for (i = 0; i < N; i++) {
        x = 1.0 * rand();                       // get a random point and
        y = 1.0 * rand();
        incirc += (x * x + y * y) < circrad2;   // check if it is inside the circle
    }
    printf("pi=%.12f\n", 4.0 * incirc / N);     // display the estimate of pi
    return 0;
}
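One caveat of seeding with time(NULL) alone is that instances starting within the same second would share a seed and so produce identical estimates. The following is a minimal sketch of one way around that, not part of the original program: it assumes a per-job index is available (for example, passed as an extra argument), and the names make_seed, estimate_pi and job_index are hypothetical.

```c
#include <stdlib.h>
#include <time.h>

/* Combine the wall-clock time with a per-job index so that instances
   started within the same second still get different seeds.
   job_index is a hypothetical extra input, not used by the original calcpi.c. */
unsigned int make_seed(time_t now, int job_index)
{
    /* Multiplying by an odd constant maps distinct indices to distinct values. */
    return (unsigned int)now ^ ((unsigned int)job_index * 2654435761u);
}

/* Same sampling loop as in the program above, wrapped in a function. */
double estimate_pi(long n, unsigned int seed)
{
    long i, incirc = 0;
    double x, y, circrad2 = 1.0 * RAND_MAX;

    circrad2 *= circrad2;                       /* radius squared */
    srand(seed);
    for (i = 0; i < n; i++) {
        x = 1.0 * rand();
        y = 1.0 * rand();
        incirc += (x * x + y * y) < circrad2;   /* count points inside the circle */
    }
    return 4.0 * incirc / n;
}
```

Inside main, this could be called as estimate_pi(N, make_seed(time(NULL), job_index)), with job_index read from a second command-line argument.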
Compiling this program (saved, for example, as calcpi.c) with
gcc calcpi.c -o calcpi
yields an executable, calcpi, that is ready for submission.
Preparation for Job Submission
To prepare the job execution space and inform Condor of the appropriate run environment, create a job description file (e.g. calcpi.condor):
Executable = calcpi
Requirements = ParallelSchedulingGroup == "stats group"
Universe = vanilla
output = calcpi$(Process).out
error = calcpi$(Process).err
Log = calcpi$(Process).log
Arguments = 100000000
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Queue 50
The last line specifies that 50 instances should be scheduled on the cluster.
The description file specifies the executable and the arguments passed to it during execution. (In this case we are requesting that every instance iterate 10^8 times, i.e. 100,000,000 times, in the program's sampling loop.)
Output and error files are the targets for the standard output and standard error streams, respectively.
The log file is used by Condor to record, in real time, the progress of job processing. Note that this setup labels the files by process number to prevent one job instance from overwriting files belonging to another.
The current values imply that all files are to be found in the same directory as the description file.
The universe variable specifies the Condor runtime environment.
For these independent jobs, the simplest "vanilla" universe suffices.
More complicated tasks, such as jobs that rely on checkpointing and migration or parallel jobs that make MPI calls, employ more advanced runtime environments, which often require specialized linking of the binaries.
The lines specifying transfer settings are important to avoid any assumptions about accessibility over NFS. They should be included whether or not any output files (aside from standard output and error) are needed.
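The $(Process) macro used above to label the output files can also appear in the argument list. As a hedged illustration only, assuming the program were modified to read a second argument (the unmodified calcpi would simply ignore it), each instance could be handed its own index like this:

```
Arguments = 100000000 $(Process)
```

With Queue 50, the instances would then receive the indices 0 through 49, which a modified program could use, for example, to vary its random seed.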
Job Submission and Management
The job is submitted with:
condor_submit calcpi.condor
The cluster can be queried before or after submission to check its availability. Two versatile commands exist for this purpose: condor_status and condor_q. The former returns the status of the nodes (broken down by virtual machines, each of which can handle one job instance). The latter shows the job queue, including the individual instances of every job and their status (e.g. idle, running, held).
Using condor_q a few seconds after submission shows:
-- Submitter: stat31.phys.uconn.edu : <192.168.1.41:44831> : stat31.phys.uconn.edu
 ID     OWNER  SUBMITTED     RUN_TIME ST PRI SIZE CMD
 33.3   prod   1/30 15:37  0+00:00:02 R  0   9.8  calcpi 100000000
 33.4   prod   1/30 15:37  0+00:00:00 R  0   9.8  calcpi 100000000
 33.5   prod   1/30 15:37  0+00:00:00 R  0   9.8  calcpi 100000000
 33.6   prod   1/30 15:37  0+00:00:00 R  0   9.8  calcpi 100000000
 33.7   prod   1/30 15:37  0+00:00:00 R  0   9.8  calcpi 100000000
 33.8   prod   1/30 15:37  0+00:00:00 R  0   9.8  calcpi 100000000
6 jobs; 0 idle, 6 running, 0 held
By this time, only 6 job instances are left in the queue, all with status 'R' (running). Various statistics are given for each instance, including a job ID number. This handle is useful if intervention is required, such as the manual removal of frozen job instances from the cluster.
Comparing the results (e.g. with the command cat calcpi*.out) shows:
...
pi=3.141215440000
pi=3.141447360000
pi=3.141418120000
pi=3.141797520000
...
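To go one step beyond comparing the numbers by eye, the independent estimates can be combined into a mean and a sample variance. The sketch below uses hypothetical helper functions (mean_of, sample_variance) and only the four values visible in the listing above, so it illustrates the arithmetic rather than the result of the full 50-instance run.

```c
#include <stddef.h>

/* Arithmetic mean of n values. */
double mean_of(const double *v, size_t n)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += v[i];
    return sum / n;
}

/* Unbiased sample variance (divides by n - 1); requires n >= 2. */
double sample_variance(const double *v, size_t n)
{
    double m = mean_of(v, n), ss = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        ss += (v[i] - m) * (v[i] - m);
    return ss / (n - 1);
}

/* The four estimates shown in the listing above. */
static const double shown[] = {
    3.141215440000, 3.141447360000, 3.141418120000, 3.141797520000
};
```

Calling mean_of(shown, 4) and sample_variance(shown, 4) gives the combined estimate and the spread of these four runs; the square root of the variance is the standard deviation of a single run, and dividing that by the square root of the number of runs gives the standard error of the mean.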
See Also