Diagram of how a job gets run on Lonestar
Start at the bottom - that's what you want: one of Lonestar's 1,888 compute nodes running your specific program (bowtie mapping in this case).
To get there, you must go through a "Queue Manager" program running on a different computer - the login (or "head") node. This program keeps track of what's running on those 1,888 nodes and what's in line to run next. It's very good at doing this.
You tell the Queue Manager what you want done via "job.sge", your job submission script. It specifies how many nodes you need, what allocation to use, the maximum run time of the job, etc. The Queue Manager doesn't really care what you're running, just how you want it run. It still needs to pass the information about what you're running on to the compute nodes, and you provide that with the line
setenv CONTROL_FILE commands
The Queue Manager sends the commands in the file commands to the compute nodes, so commands is really the first thing to start with.
The launcher_creator.py script just helps you by creating job.sge easily, which saves you some time editing a file (and potentially messing it up).
The main point of using Lonestar is that it is a massive computer cluster. We have been running all of our commands in "interactive mode", where we type a command and then sit around and wait for it to complete. We can only really do one command at a time this way. Furthermore, we've been tying up a "head" or "login" node on TACC when we do this. When we do serious computations that are going to take more than a few minutes or use a lot of RAM, we need to submit them to the other 1,888 compute nodes and 22,656 cores on Lonestar.
In this section we are going to learn how to submit a job to the Lonestar cluster.
In the examples we tend to say that a job can be "interactive" or should be "submitted to the TACC queue". The first means that you can type it and run it directly. It should be short enough that it does not tie up the TACC head node. The second means that you should go through the launcher submission process described here.
If you do try to run a long job in interactive mode, it will be killed after 10-15 minutes and you may see a message like this:
Message from firstname.lastname@example.org on pts/127 at 09:16 ... Please do not run scripts or programs that require more than a few minutes of CPU time on the login nodes. Your current running process below has been killed and must be submitted to the queues, for usage policy see http://www.tacc.utexas.edu/user-services/usage-policies/ If you have any questions regarding this, please submit a consulting ticket.
A launcher file tells Lonestar which executables to run with your desired options and for how long. It requests a certain amount of resources (cores and time) so that Lonestar's scheduling program can figure out where to fit your job in.
First, let's make a very simple job to run. All we need to do is create a text file. Each line in this text file, which we will call simply commands, is a command exactly as you would type it into the terminal yourself to have it run.
date > date.out
ls > ls.out
- The minimum number of processors that you can request on Lonestar is 12, so you might as well add up to 10 more lines to this file that are different shell commands that will give some sort of output. Each will be run on a different core in parallel.
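As a sketch of what a full 12-line commands file might look like, the snippet below writes one with a here-document. The particular commands and output file names are arbitrary choices for illustration; any short commands that write to distinct files would do.

```shell
# Write an example 12-line commands file (one command per line, one line per core).
# The specific commands are illustrative placeholders, not required ones.
cat > commands <<'EOF'
date > date.out
ls > ls.out
pwd > pwd.out
hostname > hostname.out
whoami > whoami.out
uname -a > uname.out
env > env.out
df -h > df.out
uptime > uptime.out
id > id.out
echo "hello from Lonestar" > hello.out
cal > cal.out
EOF

# Confirm the file has one line per requested core.
wc -l < commands
```

Note that this overwrites any existing file named commands in the current directory.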
TACC has supplied a sample launcher script which we will modify to queue and execute our job. First, type
module load launcher
Now let's copy the example launcher file.
cp $TACC_LAUNCHER_DIR/launcher.sge ./
There are a few things we should change inside this file. Open the file using nano like so:
nano launcher.sge
First, let's change the name of the job.
The -N line specifies the name of the job. Let's change it to (what).
The -o line specifies the names of the output files that Lonestar makes. Let's change them to the name of this job.
The -l line specifies the length of time given to the job. The more time we request, the longer our job is likely to wait in the queue before it runs. When the time is up, Lonestar will terminate our job whether or not it has finished. So it's best to give our job slightly more time than we expect it to take.
We can also add a few lines to have Lonestar send an email to your email address when the job starts and finishes.
Under -V, add 2 new lines like so:
#$ -M email@example.com
#$ -m be
Lastly, we need to specify the job file.
Change the line that says "setenv CONTROL_FILE" to say:
setenv CONTROL_FILE commands
Now let's save our changes and quit.
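For reference, here is a sketch of how the edited option lines near the top of the launcher might look once these changes are made. The job name simple_test and the ten-minute time limit are hypothetical values, and the exact set and order of lines in TACC's sample file may differ; the h_rt resource name and the $JOB_ID placeholder are standard SGE conventions assumed here.

```shell
# Job name (hypothetical; choose your own).
#$ -N simple_test
# Output file name, based on the job name plus the job-ID.
#$ -o simple_test.o$JOB_ID
# Maximum run time, in hh:mm:ss.
#$ -l h_rt=00:10:00
# Export your current environment to the job.
#$ -V
# Address for email notifications.
#$ -M email@example.com
# Send mail at the beginning (b) and end (e) of the job.
#$ -m be
```

Lines starting with #$ are read by the queue manager as options; the shell itself treats them as comments.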
The Launcher Queue
Now that we have our job file and our launcher, we need to queue the launcher. Type:
qsub launcher.sge
Lonestar will make sure that everything you've specified is correct and if it is, your job will be queued.
You can check the status of your job like so:
qstat
This will tell you its job priority and what state it is in.
A state of "qw" means "queued."
A state of "r" means "running."
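To make the state codes concrete, here is a small sketch that translates them into words. It assumes the common SGE qstat layout (two header lines, then job-ID in column 1, job name in column 3 and state in column 5); the sample text is made up for illustration and is not real Lonestar output.

```shell
# Summarize job states from qstat-style text read on standard input.
# Assumes two header lines, job-ID in column 1, name in column 3, state in column 5.
summarize_states() {
    awk 'NR > 2 {
        if ($5 == "qw") { label = "queued" } else if ($5 == "r") { label = "running" } else { label = "other (" $5 ")" }
        print $1, $3, label
    }'
}

# Made-up sample output for illustration only.
sample='job-ID  prior   name       user  state submit/start at     queue  slots
----------------------------------------------------------------------
 101 0.00000 simple_job jdoe  qw    06/01/2012 09:16:00        12
 100 0.50000 bowtie_map jdoe  r     06/01/2012 09:10:12 normal 12'

printf '%s\n' "$sample" | summarize_states
```

On Lonestar you would feed in the real output instead, for example with qstat | summarize_states.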
If you notice that your job is going to run incorrectly, you can delete it like so:
qdel job-ID
You can obtain the job-ID by typing "qstat."
If you are nosy and want to see all of the jobs queued and running on Lonestar, then use this command:
You can also see just your jobs in this format:
You can make a job dependent on another job, so that it only starts after the first job has completed, using this command:
qsub -hold_jid job-ID launcher.sge
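When chaining jobs in a script, you need the first job's ID without reading it off the screen. The sketch below pulls it out of the confirmation message, assuming the usual SGE wording (Your job ID ("name") has been submitted). The message text and the file name second_step.sge are illustrative only.

```shell
# Pull the numeric job-ID out of a qsub confirmation message.
# Assumes the usual SGE wording: Your job <ID> ("<name>") has been submitted
extract_job_id() {
    printf '%s\n' "$1" | awk '$1 == "Your" && $2 == "job" { print $3 }'
}

# Illustrative message only; the real text comes from running qsub.
msg='Your job 123456 ("first_step") has been submitted'
first_id=$(extract_job_id "$msg")

# Print the command that would hold the second launcher until the first job finishes.
# second_step.sge is a hypothetical file name.
echo "qsub -hold_jid $first_id second_step.sge"
```

In a real pipeline you would capture the output of the first qsub call in msg and then run the printed command rather than echo it.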
TACC Output Files
While your job is running, TACC creates 3 different files with names based on the -o field in the launcher. These files are named like so:
(job_name).e(job-ID)
(job_name).pe(job-ID)
(job_name).o(job-ID)
These files have the output of your job that would have been sent to standard output or standard error and messages from TACC about your job. These files will be useful if your job fails.
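As a quick illustration of the naming pattern, this sketch prints the three file names you would expect for a given job. The job name and job-ID are hypothetical values; substitute your own -N name and the ID reported by qstat.

```shell
# Build the three expected output file names from a job name and job-ID.
# Both values are hypothetical examples.
job_name="simple_test"
job_id="123456"

for suffix in e pe o; do
    echo "${job_name}.${suffix}${job_id}"
done
```

If a job fails, the .e file (standard error) is usually the first place to look.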
We have created a Python script called launcher_creator.py that makes creating a launcher.sge script a breeze. You will probably want to use this for the rest of the course.
Now run the script with the -h option to show the help message:
module load python
launcher_creator.py -h
The name of the job.
The allocation you want to charge the run to.
The queue to submit to, like 'normal' or 'largemem', etc.
Optional. The number of jobs in a job list you want to give to each node. (Default is 12 for Lonestar, 16 for Stampede.)
number of nodes
Optional. Specifies a certain number of nodes to use. You probably don't need this option, as the launcher calculates how many nodes you need based on the job list (or Bash command string) you submit. It sometimes comes in handy when writing pipelines.
Time allotment for job, format must be hh:mm:ss.
Optional. Your email address if you want to receive an email from Lonestar when your job starts and ends.
Optional. Filename of the launcher. (Default is
Optional. String of module management commands.
Optional. String of Bash commands to execute.
Optional. Filename of list of commands to be distributed to nodes.
Optional. Setting this flag outputs the name of the launcher to stdout.
We should mention that launcher_creator.py does some under-the-hood magic for you: it automatically calculates how many cores to request on Lonestar, assuming you want one core per process. You may not know it, but you should be grateful, because this saves you from ever having to think about a confusing calculation.
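For the curious, the arithmetic is probably something like the sketch below: round the number of commands up to whole nodes, then multiply by the 12 cores each Lonestar node has (22,656 cores / 1,888 nodes). This is a guess at the calculation under those assumptions, not launcher_creator.py's actual code, and the function name is made up.

```shell
# Estimate nodes and cores for a job list, assuming one core per command.
# A guess at the arithmetic, not launcher_creator.py's actual code.
cores_needed() {
    n_commands=$1
    per_node=${2:-12}      # commands per node; 12 is the Lonestar default
    cores_per_node=12      # 22,656 cores / 1,888 nodes
    # Round up to whole nodes, since nodes are allocated in full.
    nodes=$(( (n_commands + per_node - 1) / per_node ))
    echo "$nodes $(( nodes * cores_per_node ))"
}

# 30 commands at 12 per node need 3 nodes, so 36 cores are requested.
cores_needed 30
```

Even a two-line commands file rounds up to one full node, which is why the minimum request is 12 cores.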
- Take it for a test drive: use launcher_creator.py to create a launcher.sge script for your previous commands file and run it again.
Now let's go back to the course outline