Condor - simple guide

1. Introduction

This document is intended to provide a quick introduction to Condor, to introduce the most important concepts, and to provide examples to allow users to start using Condor as quickly as possible. Although this document is quite large it should be less intimidating than the Condor Reference Manual!

For any more advanced use, however, you will have to refer to the vast Condor Reference Manual.

You may also wish to look at the Condor Project Homepage.

1.1. What is Condor

Condor is a specialized batch system for managing compute-intensive jobs. Users submit their compute jobs to Condor, Condor puts the jobs in a queue, runs them, and then informs the user as to the result. The collection of inter-networked machines running Condor and controlled by a particular manager is known as a pool. Like most batch systems, Condor provides a queuing mechanism, scheduling policy, priority scheme, and resource classifications.

In slightly more detail: a user submits a job to Condor from one of a number of Submit machines. Condor finds an available Execute machine in the pool and begins running the job there. Condor can detect that a machine running a Condor job is no longer available (perhaps because the owner of the machine came back from lunch and started typing on the keyboard). It may then be able to checkpoint the job and move (migrate) it to a different machine which would otherwise be idle; if so, Condor continues the job on the new machine from precisely where it left off.

Condor does not require an account (login) on machines where it runs a job. Condor can do this because it uses remote system calls which trap library calls for such operations as reading or writing from disk files. The calls are transmitted over the network to be performed on the machine where the job was submitted.

Every machine in a Condor pool can serve a variety of roles, and most machines serve more than one role simultaneously, although certain roles can only be performed by a single machine in a pool. The main roles are Central Manager, Submit, Execute and Checkpoint Server.

1.2. Condor on the PLG Cluster

To see Condor's view of the machines in the pool, use condor_status. To run jobs under Condor you only need to be able to log in to pleiades.plg.inf.uc3m.es.

We have chosen to support only the Vanilla Universe on our Condor pool.

We have decided not to allow preemption on the local pool (this is not the default behaviour - see 4.1.2. User Priority for details).

1.2.1 Etiquette

Because we have turned off job preemption it is possible for a single user to occupy the entire pool for long periods, preventing other people from getting any jobs run. Since Condor does not have sophisticated scheduling mechanisms, there is not much we can do about this!

We have decided to adopt the policy that jobs should aim to finish within a reasonable amount of time - anyone requiring the execution of very long jobs should contact sys-admin.

Users should add the line "nice_user = True" to their jobs as a matter of course. A niced job will only be considered for starting if there are no non-nice jobs waiting to run. This means that if a user has submitted a large batch of niced jobs, other users can still get a small number of non-nice jobs through - but this only works if most waiting jobs are niced. Note that even niced jobs, once running, stop other jobs from starting, so please try to ensure that most jobs complete within a reasonable amount of time.

2. Using Condor

  1. Code Preparation.
    A job run under Condor must be able to run as a background batch job. Condor runs the program unattended and in the background. A program that runs in the background will not be able to do interactive input and output. Condor can redirect console output (stdout and stderr) and keyboard input (stdin) to and from files for you. Create any needed files that contain the proper keystrokes needed for program input. Make certain the program/script will run correctly with these files on the submit machine.
  2. Submit description file.
    A submit description file controls the details of job submission. The file will contain information about the job such as what executable to run, the files to use for keyboard and screen data, the platform type required to run the program, and where to send e-mail when the job completes. You can also tell Condor how many times to run a program; it is simple to run the same program multiple times with multiple data sets. Write a submit description file to go with the job, using the syntax description and some illustrative examples here.
  3. Submit the Job.
    Login to a submit machine (see above) and submit the program to Condor with the condor_submit command.
Once the job is submitted, Condor does the rest. You can monitor its progress with the condor_q and condor_status commands. Note that without options condor_q will only report on jobs submitted from the machine on which you run it - if you have submitted jobs from another machine you should use condor_q -global. You may modify the order in which Condor will run your jobs with condor_prio.

When your program completes, Condor will tell you the exit status of your program and various statistics about its performance, including time used and I/O performed. If you are using a log file for the job (which is recommended) the exit status will be recorded in the log file. Alternatively you can view the history file for the job by typing condor_history, which will show something like:

 ID      OWNER            SUBMITTED    CPU_USAGE ST PRI SIZE CMD               
   1.0   condor          6/13 10:58   0+00:00:00 C  0   0.9  job_blah         
Notice that the status ("ST") is now C, for completed.

You can remove a job from the queue prematurely with condor_rm.

3. Submit File syntax

A submit description file controls the details of job submission. The syntax is simple; a list of the most important entries, grouped by concept, follows. This is by no means a full list - for that, see the condor_submit man page. This selection is intended mainly to make it easier to understand the examples given in other sections.

Blank lines and lines beginning with a pound sign (#) character are ignored by the submit description file parser, and so may be used for comments.

3.1. Basic entries

          executable = <pathname>
                  The program to run. Give a full path unless the
                  executable is in the directory you submit from.
          universe = vanilla
                  The Condor runtime environment to use; only the
                  vanilla universe is supported on the local pool.
          arguments = <argument list>
                  Command line arguments passed to the program.
          nice_user = <True|False>
                  Run as a "nice" job (see 1.2.1 Etiquette).
          queue [number]
                  Submit the job, optionally several copies at once.

3.2. Job Ordering and location

          priority = <integer>
                  Priority relative to your own other jobs; higher
                  numbers run first (see 4.1.1 Job Priority).
          requirements = <expression>
                  Attributes a machine must have in order to run the job.
          rank = <expression>
                  Preference among the machines that satisfy the
                  requirements (see 4.3 Ranking).
          initialdir = <directory>
                  Working directory for the job's input and output files.

3.3. File Handling

          input = <filename>
                  File supplying the job's standard input.
          output = <filename>
                  File receiving the job's standard output.
          error = <filename>
                  File receiving the job's standard error.
          log = <filename>
                  File in which Condor logs the job's progress
                  (recommended).
          transfer_input_files = <file1,file2,...>
                  Files to transfer to the execute machine before the
                  job starts.
          transfer_output_files = <file1,file2,...>
                  Files to transfer back when the job completes.

3.4. Job Information

          notification = <Always|Complete|Error|Never>
                  When Condor should send e-mail about the job.
          notify_user = <email address>
                  Where that e-mail should be sent.

3.5. Environment

          getenv = <True|False>
                  If True, copy the environment variables of the
                  submitting shell into the job's environment.
          environment = <parameter list>
                  Explicit environment variables to set for the job.

3.6. Macros

Parameterless macros in the form of $(macro_name) may be inserted anywhere in Condor submit description files. Macros can be defined by lines in the form of
        <macro_name> = <string>
Two pre-defined macros are supplied by the submit description file parser. The $(Cluster) macro supplies the number of the job cluster, and the $(Process) macro supplies the number of the job. These macros are intended to aid in the specification of input/output files, arguments, etc., for clusters with lots of jobs, and/or could be used to supply a Condor process with its own cluster and process numbers on the command line. For an example see 5.2. Multiple Submission - Different Inputs.
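For instance, a hypothetical submit file fragment giving each of ten queued jobs its own output and error files might read:

```
# Hypothetical fragment: per-job output files for a ten-job cluster
output = run.$(Cluster).$(Process).out
error  = run.$(Cluster).$(Process).err
queue 10
```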

In addition to the normal macro, there is also a special kind of macro called a substitution macro that allows you to substitute expressions defined on the resource machine itself (obtained after a match to the machine has been performed) into specific expressions in your submit description file. The special substitution macro is of the form $$(attribute). It may only be used in three expressions in the submit description file: executable, environment, and arguments. Example:

          executable = myprog.$$(opsys).$$(arch)
The opsys and arch attributes will be substituted at match time for any given resource. This will allow Condor to automatically choose the correct executable for the matched machine.

The environment macro, $ENV, allows the evaluation of an environment variable to be used in setting a submit description file command. The syntax used is

          $ENV(variable)
For example:
          log = $ENV(HOME)/jobs/logfile

4. Job scheduling - Priority, Requirements and Rank

The scheduling arrangements adopted by Condor control when and on which machine your jobs are run. Priority (both per-job and per-user) determines when a job will run; ranking (which uses requirements and machine attributes) may be used to determine where it runs.

All machines in a Condor pool advertise their attributes, such as available RAM, CPU type and speed, virtual memory size, and current load average, along with other static and dynamic properties. This machine information also includes under what conditions a machine is willing to run a Condor job and what type of job it would prefer.

Likewise, when submitting a job, you can specify your requirements and preferences, for example, the type of machine you wish to use. You can also specify an attribute, for example, floating point performance, and have Condor automatically rank the available machines according to their values for this attribute. Condor plays the role of a matchmaker by continuously reading all the job requirements and all the machine information, matching and ranking jobs with machines.

4.1 Priority

4.1.1 Job Priority

Job priorities allow the assignment of a priority level to each submitted Condor job in order to control order of execution - note that these are priorities between jobs of the same user only. To set a job priority, use the condor_prio command, or use the priority command in your submit description file. Job priorities do not impact user priorities in any fashion.

4.1.2 User Priority

The default behaviour for Condor is to allocate machines to users based upon a user's priority - which changes according to the number of resources the individual is using. It is possible to submit a job as a "nice" job. Setting nice_user in your submit description file tells Condor not to use your regular user priority, but that this job should have the least priority among all users and all jobs.

4.2 Machine Attributes

The attributes advertised by a machine can be seen with condor_status -l machine_name. Some of the listed attributes are used by Condor for scheduling. Other attributes are for information purposes. An important point is that any of the attributes in a machine can be utilized at job submission time as part of a request or preference on which machine to use. Additional attributes can be easily added.

For example, this is the output of condor_status -l for one processor of the machine pb001:

MyType = "Machine"
TargetType = "Job"
Name = "pb001@plg.inf.uc3m.es"
Machine = "pb001.plg.inf.uc3m.es"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "pb001.plg.inf.uc3m.es"
CondorVersion = "$CondorVersion: 6.6.7 Oct 11 2004 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
VirtualMachineID = 1
VirtualMemory = 0
Disk = 467172
CondorLoadAvg = 0.000000
LoadAvg = 0.000000
KeyboardIdle = 25539262
ConsoleIdle = 25539262
Memory = 29994
Cpus = 1
StartdIpAddr = "<128.232.4.1:33071>"
Arch = "x86_64"
OpSys = "LINUX"
UidDomain = "plg.inf.uc3m.es"
FileSystemDomain = "plg.inf.uc3m.es"
Subnet = "128.232.4"
HasIOProxy = TRUE
TotalVirtualMemory = 0
TotalDisk = 934344
KFlops = 951601
Mips = 3370
LastBenchmark = 1103098732
TotalLoadAvg = 0.000000
TotalCondorLoadAvg = 0.000000
ClockMin = 678
ClockDay = 3
TotalVirtualMachines = 2
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
JavaVendor = "Sun Microsystems Inc."
JavaVersion = "1.4.1_01"
JavaMFlops = 295.152039
HasJava = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,
HasJava,HasPVM,HasRemoteSyscalls,HasCheckpointing"
CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Unclaimed"
EnteredCurrentState = 1103041577
Activity = "Idle"
EnteredCurrentActivity = 1103098732
Start = TRUE
Requirements = START
CurrentRank = 0.000000
DaemonStartTime = 1103041099
UpdateSequenceNumber = 230
MyAddress = "<128.232.4.1:33071>"
LastHeardFrom = 1103109536
UpdatesTotal = 231
UpdatesSequenced = 230
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"
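Any of these attributes can be used in a job's submit description file. For example, a hypothetical requirements expression restricting a job to machines like this one, with plenty of scratch disk and Java available, might be:

```
requirements = (OpSys == "LINUX") && (Disk >= 450000) && (HasJava == TRUE)
```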

4.3 Ranking

When considering the match between a job and a machine, rank is used to choose a match from among all machines that satisfy the job's requirements and are available to the user, after accounting for the user's priority and the machine's rank of the job. The rank expressions, simple or complex, define a numerical value that expresses preferences.

The job's rank expression evaluates to one of three values. It can be UNDEFINED, ERROR, or a floating point value. If rank evaluates to a floating point value, the best match will be the one with the largest, positive value. If no rank is given in the submit description file, then Condor substitutes a default value of 0.0 when considering machines to match. If the job's rank of a given machine evaluates to UNDEFINED or ERROR, this same value of 0.0 is used. Therefore, the machine is still considered for a match, but has no rank above any other.

A boolean expression evaluates to the numerical value of 1.0 if true, and 0.0 if false.

Example 1: For a job that desires the machine with the most available memory:

          Rank = memory
Example 2: For a job that prefers to run on Saturdays and Sundays:
          Rank = ( (clockday == 0) || (clockday == 6) )
It is wise when writing a rank expression to check if the expression's evaluation will lead to the expected resulting ranking of machines. This can be accomplished using the condor_status command with the -constraint argument. This allows the user to see a list of machines that fit a constraint.

Example 1: To see which machines in the pool have kflops defined, use:

          condor_status -constraint kflops
Example 2: If this is typed on a Wednesday it will show all of the machines in the pool; on any other day it will show none:
          condor_status -constraint "(clockday == 3)"

5. Examples

5.1. A very simple job

Using the C program called hello.c:
          #include <stdio.h>

          int main(void)
          {
            printf("hello, Condor\n");
            return 0;
          }
The submit file, submit.hello, is:
            ########################
            # Submit description file for hello program
            ########################
            Executable     = hello
            nice_user      = True
            Universe       = vanilla
            Output         = hello.out
            Log            = hello.log 
            Queue 
The submit instruction is:
          condor_submit submit.hello
and the output will look something like this:
            Submitting job(s)
            .
            Logging submit event(s).
            1 job(s) submitted to cluster 57.
condor_q will say:
            $ condor_q

            -- Submitter: pb000.plg.inf.uc3m.es : <127.0.0.1:59865> : pb000.plg.inf.uc3m.es
            ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
            57.0   ckh11           2/1  11:23   0+00:00:00 R  0   9.8  hello             

            1 jobs; 0 idle, 1 running, 0 held
The log file, hello.log, will show (something similar to):
000 (057.000.000) 02/01 11:23:57 Job submitted from host: <127.0.0.1:59865>
...
001 (057.000.000) 02/01 11:24:31 Job executing on host: <127.0.0.1:34755>
...
005 (057.000.000) 02/01 11:24:31 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        816  -  Run Bytes Sent By Job
        1702035  -  Run Bytes Received By Job
        816  -  Total Bytes Sent By Job
        1702035  -  Total Bytes Received By Job
...
The output file, hello.out, will contain:
          hello, Condor

5.2. Multiple Submission - Different Inputs

A common situation has one executable that is executed many times, each time with a different input set. Such a set of jobs is called a job cluster. Each cluster has a "cluster ID", and within each cluster each job has a "process ID". If the program expects its input in a file with a fixed name, the usual solution is to run each queued job in its own directory.

This particular example outputs the number of characters in an input file named mult_job_input. There are 5 different input files, so we need 5 jobs. Because the program uses a fixed name for its input file we do not need to specify an input in the submit description file. The 5 different but identically named input files are prestaged in 5 directories before submitting the job. The directories are named job.0, job.1, job.2, job.3 and job.4. In addition to the input file, each directory will receive its own output in a file called mult_job_output, its own error messages will go into mult_job_error, and Condor will log each job's progress in the file called mult_job_log.

The submit file, submit.mult_job, is:

            ####################                    
            # Multiple jobs queued, each in its own directory
            ####################                                                    

            nice_user = True
            universe = vanilla
            executable = mult_job
            output = mult_job_output
            error = mult_job_error
            log = mult_job_log
            initialdir = job.$(Process)
            queue 5
Note the initialdir line: it uses the $(Process) macro to give each queued job its own directory.

The program source, mult_job.c, is:

          #include <stdio.h>
          #include <stdlib.h>

          int main(void)
          {
            FILE *in;
            char filename[80];
            int ch, i = 0;   /* ch must be an int so that EOF can be detected */

            sprintf(filename, "mult_job_input");
            if ((in = fopen(filename, "r")) == NULL) {
              printf("Can't open %s\n", filename);
              exit(1);
            }

            while ((ch = getc(in)) != EOF) { i++; }

            printf("i is %d\n", i);
            return 0;
          }
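The five directories and their input files can be prestaged with a few shell commands; the file contents here are placeholders - the real mult_job_input files would hold your data:

```shell
#!/bin/bash
# Create one directory per queued job and put a (placeholder) input
# file with the fixed name mult_job_input into each of them.
for i in 0 1 2 3 4; do
    mkdir -p "job.$i"
    printf 'sample input %d\n' "$i" > "job.$i/mult_job_input"
done
```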
Having set up the directories and input files, the submit instruction and output is:
$ condor_submit submit.mult_job
Submitting job(s)
.
Logging submit event(s).....
5 job(s) submitted to cluster 60.

5.3. Multiple Submission - Different Arguments

This example queues three jobs for execution by Condor. The first will be given command line arguments of 15 and 20, and it will write its standard output to msda.out1. The second will be given command line arguments of 30 and 20, and it will write its standard output to msda.out2. Similarly the third will have arguments of 45 and 60, and it will use msda.out3 for its standard output.

The submit file, submit.msda, is:

          ####################
          #
          # Different command line arguments and output files.
          #                                                                      
          ####################                                                   
                                                                         
          executable     = msda                                                   
          nice_user      = True
          universe       = vanilla
                                                                         
          arguments      = 15 20                                               
          output  = msda.out1                                                     
          error   = msda.err1
          queue                                                                  
                                                                         
          arguments      = 30 20                                               
          output  = msda.out2                                                     
          error   = msda.err2
          queue                                                                  
                                                                         
          arguments      = 45 60                                               
          output  = msda.out3                                                     
          error   = msda.err3
          queue
The source for msda is not given as it is trivial - it adds its two arguments and prints the sum to stdout.
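Purely for illustration (the original msda is a compiled program, not shown in this guide), a shell stand-in with the same behaviour could be:

```shell
#!/bin/bash
# Illustrative stand-in for msda: add the two command-line arguments
# and print the sum on stdout.
msda() {
    echo "$(($1 + $2))"
}
msda "${1:-15}" "${2:-20}"
```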

The submit instruction and output is:

          condor_submit submit.msda
          Submitting job(s)...
          3 job(s) submitted to cluster 61.
Note that this time the output does not mention logging, as we did not specify a log file.

5.4. Simple shell script

Any program can be run as a vanilla job, including shell scripts. The script "doloop" stays in a loop and prints out a number, then sleeps for a second. At the end, doloop.out should contain the values from 0 to 10 and the message "Normal End-of-Job".

The script, "doloop" is:

          #!/bin/bash
          x=0;     # initialize x to 0
          while [ "$x" -le 10 ]; do
              echo "$x"
              # increment the value of x:
              x=$(expr $x + 1)
              sleep 1
          done
          echo "Normal End-of-Job"
The submit file, "submit.doloop", is
          ####################
          ##
          ## Vanilla script test
          ##
          ####################

          nice_user       = True
          universe        = vanilla
          executable      = doloop
          output          = doloop.out
          error           = doloop.err
          log             = doloop.log
          arguments       = 10
          queue

5.5. Matlab

The following example shows Matlab running a simple script file (also often known as an M-file). A Matlab script file is an external file containing a sequence of Matlab statements; interactively, it can be executed simply by typing its name (without the extension) at the Matlab prompt. Under Condor, however, Matlab cannot be run interactively, so the script file must be executed from the command line using Matlab's -r option. The -nosplash, -nojvm and -nodesktop options are also needed to prevent unwanted windows from appearing.

Even so, Matlab will still try to open a display connection - normally this would not be a problem, but because we run the Condor daemons as user "condor" instead of root there can be authentication issues. An option such as -display yourhostname:0 or -nodisplay is therefore also needed (the latter will produce some warning messages about broken X connections in your error file, which can be ignored). Running the daemons as user "condor" can also cause file ownership problems in this particular example (see 1.2. Condor on the PLG Cluster): because the job writes a file that will be owned by user "condor", the working directory must be made world-writeable.

The script file "matscripttest.m" in this example is:

          load a.dat;
          load b.dat;
          matrR = a * b;
          save matrR.dat;
          exit;
Note the final exit - without it the script will never finish and condor will hang. The files a.dat and b.dat must exist beforehand; the file matrR.dat will be created.

The submit file, "submit.matlab" will be

           #
           # Submit a matlab job
           #
           executable = /usr/bin/cl-matlab
           arguments = -nosplash -nojvm  -nodesktop -nodisplay -r matscripttest
           nice_user = True
           universe = vanilla
           getenv   = True          # MATLAB needs local environment
           log = mat.log
           output = mat.out
           error = mat.err
           queue 1 
Note the getenv = True - without it matlab will core dump!
Note also that the executable is given as a full path name. Even if matlab is on your PATH you must give the full pathname, or condor will assume the executable is in the current working directory and condor_submit will report an error when it cannot find it.

5.6. More matlab - a slight variant

A slight variant on the procedure in 5.5. Matlab is to create a small shell script, eg "matscripttest.sh", as a wrapper to matlab:
 
          #! /bin/sh  
          cl-matlab  -nosplash -nojvm  -nodesktop -nodisplay -r "matscripttest"

The submit file would be similar to the above, but the executable line would then be

           executable = matscripttest.sh
and the arguments line would not be needed.

5.7. IPC Software

This example shows how to use the International Planning Competition (IPC) software, submitting one job per domain of the last IPC for a given planner configuration.

The script, "script.sh" is:

          #!/bin/bash

          # $1 - planner name passed to invokeplanner.py
          # $2 - index (0-13) selecting the IPC domain to run
          domains=("barman" "elevators" "floortile" "nomystery" "openstacks" "parking" "parcprinter" "pegsol" "scanalyzer" "sokoban" "transport" "tidybot" "visitall" "woodworking")
          domain=${domains["$2"]}
          exec_folder=exec_"$2"

          mkdir "$exec_folder"
          cd scripts/IPCData || exit 1
          ./invokeplanner.py --ini ../../ipc.ini -e myaddress@mydomain.com -t seq -s opt --timeout 1800 --memory 6 --directory ../../"$exec_folder" -l ./logfile -D "$domain" --planner "$1"
The submit file, "exec.condor", is
          
          nice_user = True
          universe = vanilla
          notify_user = myaddress@mydomain.com
          notification = Always
          getenv = TRUE

          Executable = script.sh
          Output = test.$(Cluster).$(Process).out
          log = test.$(Cluster).$(Process).log
          error = test.$(Cluster).$(Process).err
          request_memory = 7000

          transfer_input_files = ipc.ini, scripts
          transfer_output_files = exec_$(PROCESS)

          arguments = mutex $(PROCESS)
          queue 14

5.8. Urgent job

To submit an urgent job, set nice_user to "False" (or simply omit the line, since False is the default).
          ####################
          ##
          ## Urgent job
          ##
          ####################

          nice_user       = False
          universe        = vanilla
          executable      = doloop
          output          = doloop.out
          error           = doloop.err
          log             = doloop.log
          arguments       = 10
          queue

6. Summary of Useful Condor Commands