
Table of Contents

Overview:

This portion of the class is devoted to making sure we are all starting from the same point on lonestar. This tutorial was adapted from a previous version written for the now decommissioned lonestar4, and combines material from several earlier tutorials. Collective thanks to all those who contributed to the works that now appear in this single version. Anyone wishing to use this tutorial is welcome.

Objectives:

  1. Log into lonestar5.
  2. Change your lonestar profile to the course specific format.
  3. Refresh understanding of basic linux commands with some course organization.
  4. Review use of the nano text editor program, and become familiar with several other text editor programs.


Tutorial:

  • Logging into lonestar5

Start a new terminal window. On a Mac this is done by clicking on the magnifying glass on the right hand side of the toolbar at the top of the screen and typing "terminal". On Windows this should be done by connecting through Cygwin. Log into lonestar using your account information. 
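
The login command will look something like the following (a minimal sketch; replace <username> with your own TACC user name):

Code Block
languagebash
titleLogging into lonestar5 with ssh
ssh <username>@ls5.tacc.utexas.edu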

...

Warning
titleLogging into remote computers

As a matter of internet safety, the terminal window knows you are entering a password and does not want your neighbor to see what it is. For this reason, even as you type your password, nothing will be displayed on the screen. Backspace will work if you know you made a mistake, but we often find it better to just hit enter and try again.

If you have never logged into lonestar from the computer you are currently using, you will be issued a security warning. The same will be true if you log into any of the other TACC resources, or any other remote computer. If you ever see a security warning when logging into somewhere you use regularly, you should answer no and try to figure out why you were warned. Otherwise type "yes" to bypass the security check.


  • Setting up your lonestar profile and other variables

There are many flavors of Linux/Unix shells. The default for TACC's Linux (and most other Linuxes) is bash (bourne again shell), which we will use throughout.

...

Code Block
titleCreating a shortcut to the main Lonestar working directories
cdh
ln -s $SCRATCH scratch
ln -s $WORK work
ln -s $BI BioITeam


  • Understanding what your .bashrc file actually does.

Expand
titleWhile interesting and useful information to have, understanding it is not critical to variant analysis. We encourage you to look through this information on your own time.

Let's look at what your .bashrc profile actually does. Use the cat command to print contents of the .bashrc file to the screen.

Code Block
languagebash
titlePrint the contents of the .bashrc file to the screen
collapsetrue
cat .bashrc

This will print several lines of text to the terminal window. Let's look at what some of these lines do with a little more information:

  • lines that start with #

    • Any line that begins with a # symbol is "commented out": anything after a # symbol will not be executed by any program. Programmers commonly make use of this behavior to leave notes for others, or even for themselves at a later date, explaining what particular lines of a script are actually doing.
  • Section 1 has multiple lines involving "module load <NAME>"

    • This loads different modules by default. We have included ones that we will use throughout the course and that you will commonly make use of. After we review the use of the nano text editor we'll go into more depth with TACC modules. But for now, trust us when we say that not having to load a bunch of modules every time you log into TACC is a good thing.

  • Section 2 has multiple lines starting with "export"

    • The export lines define shell variables, for example BI and PATH. You've already seen how using $BI can come in handy accessing our shared course directory. As for PATH, that is a well-known environment variable that defines a set of directories where the shell will look when you type in a program's name. Our shared profile adds the common course directories that we copied at the start of this tutorial and your local ~/local/bin directory (which does not exist yet) to the location list. You can see the entire list of locations by doing this:

      Code Block
      titleHow to see where the bash shell looks for programs
      echo $PATH

      As you can see, there are a lot of locations on the path. That's because when you load modules at TACC (see above), that mechanism makes the programs available to you by putting their installation directories on your $PATH.
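
      For reference, export lines in a .bashrc look roughly like this (a sketch only; the paths shown here are placeholders, not the actual course values):

      Code Block
      languagebash
      titleWhat export lines look like (placeholder values)
      export BI=/path/to/shared/course/directory     # placeholder path; the real value is set in the shared .bashrc
      export PATH=$PATH:$HOME/local/bin              # adds your personal bin directory to the list of searched locations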

  • umask 002

    • The umask command sets the default permissions of newly created files and directories, limiting the need to use the chmod command. umask functions as the inverse of chmod, meaning that it subtracts its value from the default permissions. In this case the command umask 002 is the equivalent of chmod 775 for directories and chmod 664 for files. In summary, having this command in your .bashrc gives you and your group read and write access to all new files you create, while giving read-only access to everyone else. A quick way to see this in action is shown in the code block below.
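
      A minimal way to check this, assuming umask 002 is in effect from the shared .bashrc:

      Code Block
      languagebash
      titleSeeing what umask 002 does to new files and directories
      touch umask_testfile      # new file: should show rw-rw-r-- (664)
      mkdir umask_testdir       # new directory: should show rwxrwxr-x (775)
      ls -l                     # check the permission columns for both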
  • PS1='tacc:\w$ '

    • The PS1='tacc:\w$ ' line is a special setting that tells the shell to display the current directory as part of its prompt. It saves you typing pwd all the time to see where you are in the directory hierarchy. Try using the mkdir command to make a new directory called tmp and change into that directory to see what it does to your prompt.

      Code Block
      languagebash
      titleSee how your prompt reflects your current directory
      collapsetrue
      mkdir tmp
      cd tmp
    • Your prompt should have changed from "tacc:~$" to "tacc:~/tmp$". Your prompt now tells you that you are in the tmp subdirectory of your home directory (~). See if you can figure out how to return to your home directory without expanding the code block, then expand the following code block to see the different ways of returning to your home directory.

      Code Block
      languagebash
      titleHow to return to your home directory
      collapsetrue
      cd
      cdh
      cd $HOME
      cd ~


  • Editing files

There are a number of options for editing files at TACC. These fall into three categories:

...

Warning

Be careful with long lines – sometimes nano will split long lines into more than one line, which can cause problems in our commands files, as you will see.

 

  • How should we name files and folders?

In general you will want to adopt a consistent pattern of naming, and it should be your own and something that makes sense to you. The most important thing to get used to is the convention of using . or _ in names rather than spaces, and limiting your use of any other punctuation. Spaces are great for Mac and Windows folder names when you are using visual interfaces, but on the command line a space is a signal to start doing something different. Imagine that instead of a BioITeam folder you wanted to make it a little easier to read and call it "Bio I Team". Certainly everyone would agree it's easier to read that way, but because of the spaces, bash will think you want to create 3 folders: one named Bio, another named I, and a third named Team. This is behavior you can use to your advantage when appropriate, but generally speaking spaces will not be your friend. Early on in my computational learning I was told "A computer will always do exactly what you told it to do. The trick is telling it to do what you want it to do". 
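
You can see this behavior for yourself with mkdir (a simple demonstration of standard bash word splitting; try it in a scratch directory):

Code Block
languagebash
titleWhy spaces in names cause trouble
mkdir Bio I Team     # bash sees three separate arguments and makes three directories: Bio, I, and Team
mkdir Bio_I_Team     # one argument, so one directory
ls                   # compare the results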

Expand
titleCan you title new directories that have spaces in them?

This is hidden away to keep you from accidentally thinking that this is a good idea. If for some reason you encounter spaces in the file names or directories that you are working with (presumably because a colleague sent you some data, and not because you thought it was a good idea personally), spaces can be "escaped" like many other special characters. Imagine someone sent you a directory named "This is really annoying to use, but I don't know it yet". To change into that directory you would have to type:

Code Block
languagebash
cd This\ is\ really\ annoying\ to\ use,\ but\ I\ don\'t\ know\ it\ yet

Notice that the apostrophe also had to be escaped, which should help convince you not to use other punctuation either.


  • Stringing commands together and controlling their output

In a linux shell, it is often useful to take the output of one command and save it to a new file rather than having it print to the screen. Linux uses a familiar metaphor for this: "pipes". The linux operating system expects some "standard input pipe" and gives output back through a "standard output pipe". These are called "stdin" and "stdout" in linux. There's also a special "stderr" for errors; we'll ignore that for now. Usually, your shell is filling the operating system's stdin with stuff you type - the commands with options. The shell passes responses back from those commands to stdout, which the shell usually dumps to your screen. The ability to switch stdin and stdout around is one of the key reasons linux has existed for decades and beat out many other operating systems. Let's start making use of this. Change to the scratch directory, make a new folder called "piping", and put a list of the full contents of the $BI folder into a new file called whatsHere.

...

Again, you should see your answer only showing up after the cat command. Note that by using a single > you are overwriting the existing contents, and that there is no warning that this is happening. Beware of this in the future, as linux doesn't have an "undo" feature. We will make use of the redirect output (stdout) character (>) and the "pass output along as input" character (|) throughout the course. Not all shells are equal - the bash shell lets you redirect stdout with either > or 1>; stderr can be redirected with 2>; you can redirect both stdout and stderr using &>. If these don't work, use google to try to figure it out. The web site stackoverflow is a usually trustworthy and well annotated site for OS and shell help.
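
As a quick reference, here is a minimal sketch of those redirection operators in action (the file names used here are just examples):

Code Block
languagebash
titleExamples of redirecting and piping output
ls -l $BI > whatsHere         # > overwrites (or creates) the file whatsHere
ls -l $SCRATCH >> whatsHere   # >> appends to the same file instead of overwriting
ls no_such_file 2> errors.txt # 2> captures stderr rather than stdout
ls -l $BI | wc -l             # | passes the stdout of ls to wc as its stdin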


  • Understanding TACC

Now that we've been using lonestar for a little bit, and have it behaving in a way that is a little more useful to us, let's get more of a functional understanding of what exactly it is and how it works.

Diagram of Lonestar5 directories: What connects to what, how fast, and for how long.

Lonestar is a collection of 1,252 computers, each with 24 cores, connected to three file servers, each with unique characteristics. You need to understand the file servers to know how to use them effectively.

...

Code Block
languagebash
titleExample command for copying data from a $WORK directory to $SCRATCH
 cp $WORK/my_fastq_data/*fastq $SCRATCH/my_project/

Understanding "jobs" and compute nodes.


When you log into lonestar using ssh you are connected to what is known as the login node or "the head node". There are several different head nodes, but they are shared by everyone that is logged into lonestar (not just in this class, or from campus, or even from Texas, but everywhere in the world). Anything you type onto the command line has to be executed by the head node, and the longer something takes to complete, the more it will slow down you and everybody else. Get enough people running large jobs on the head node all at once (say a classroom full of Big Data in Biology summer school students) and lonestar can actually crash, leaving nobody able to execute commands or even log in for minutes -> hours -> perhaps even days if something goes really wrong. To try to avoid crashes, TACC monitors things and proactively stops processes before they get too out of hand. If you guess wrong about whether something should be run on the head node, you may eventually see a message like the one pasted below. If you do, it's not the end of the world, but repeated messages can lead to revoked TACC access and emails where you have to explain to TACC and your PI what you are doing, how you are going to fix it, and how you will avoid it in the future.  

...

So you may be asking yourself what the point of using lonestar is at all if it is wrought with so many issues. The answer comes in the form of compute nodes. There are 1,252 compute nodes that can only be accessed by a single person for a specified amount of time. These compute nodes are divided into different queues called: normal, development, largemem, etc. Access to nodes (regardless of what queue they are in) is controlled by a "Queue Manager" program. You can personify the Queue Manager program as: Heimdall in Thor, a more polite version of Gandalf in Lord of the Rings when dealing with the balrog, the troll from the Billy Goats Gruff tale, or any other "gatekeeper" type. Regardless of how nerdy your personification choice is, the Queue Manager has an interesting caveat: you can only interact with it using the sbatch command. "sbatch <filename.slurm>" tells the queue manager to run a set job based on information in filename.slurm (i.e. how many nodes you need, how long you need them for, how to charge your allocation, etc). The Queue Manager doesn't care WHAT you are running, only HOW to find what you are running (which is specified by a CONTROL_FILE line in your filename.slurm file). The WHAT is then handled by the file "commands", which contains what you would normally type into the command line to make things happen.

Further sbatch reading

To make things easier on all of us, there is a script called launcher_creator.py that you can use to automatically generate a .slurm file. This can all be summarized in the following figure:

Using launcher_creator.py

The BioITeam created a Python script called launcher_creator.py that makes creating a .slurm file a breeze. Before learning to work with interactive compute nodes during the class, we will show you how you will most often do your analysis. Run the launcher_creator.py script with the -h option to show the help message so we can see what other options the script takes:

...

We should mention that launcher_creator.py does some under-the-hood magic for you and automatically calculates how many cores to request on lonestar, assuming you want one core per process. You may not realize it yet, but you should be grateful that this saves you from ever having to think about a confusing calculation.
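
To give you a mental picture of what launcher_creator.py is writing for you, a generated .slurm file contains directives along these lines (a rough sketch only; the real file launcher_creator.py produces will differ in its details):

Code Block
languagebash
titleRough sketch of a generated .slurm file
#!/bin/bash
#SBATCH -J my_first_job       # job name (the -n option you gave launcher_creator.py)
#SBATCH -p development        # queue to submit to
#SBATCH -N 1                  # number of nodes requested
#SBATCH -n 24                 # total tasks; one core per command in your commands file
#SBATCH -t 00:02:00           # maximum run time (the -t option)
#SBATCH -A UT-2015-05-18      # allocation to charge (the -a option)

# the HOW: tell the launcher where to find the WHAT (your commands file);
# the exact syntax in the generated file may differ from this sketch
export CONTROL_FILE=commands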

Running a job

Now that we have an understanding of what the different parts of running a job are, let's actually run a job. Move to your scratch directory, make a new folder called "my_first_job" (remember not to use spaces in file/folder names), make a new file called "commands" inside of that directory using nano, and put 4-12 lines with 1 command on each line in that file, being sure to remember to pipe the output to 1 or more files. 

Code Block
languagebash
titlehow to make a sample commands file
linenumberstrue
# remember that things after the # sign are ignored by bash 
# lines in blocks like this often will scroll to the right
cds  # move to your scratch directory
mkdir my_first_job  # make a new folder called "my_first_job"
cd my_first_job  # move into the new folder to make it easier to create a file there
nano commands  
 
# the following lines should be typed into the nano editor so they will be saved to the new file "commands"
cat commands > commands.out  # this will print the contents of the file you are currently editing to a new file called commands.out
date > date.out  # this will create a file with today's date in it
pwd > current_directory.out  # this will create a file with the current directory in it
echo "my name is <YOURNAME>" >> name.out  # Note that this time we used the append symbol >> not the write symbol > as we plan to put multiple things into the same file. be sure to replace the <> signs with your name
echo "This is the final result of my first script. It worked how I thought it would, or hopefully have the resources to figure out why it didn't" >> name.out  # this will add another line of text to the name.out file.
# feel free to add up to 7 more lines to your commands file here using the cat/ls/pwd/mkdir/other commands that you know.
# beware using cd commands here as it will change your directory as if you were doing it on an interactive node and may cause you to reference files that don't exist
# write and exit nano now ctrl-o ctrl-x
launcher_creator.py -n "my_first_job" -t 00:02:00 -a "UT-2015-05-18" # this will create a my_first_job.slurm file that will run for 2 minutes
sbatch my_first_job.slurm  # this will actually submit the job to the Queue Manager and if everything has gone right, it will be added to the development queue.

Interrogating the launcher queue

Here are some of the common commands that you can run and what they will do or tell you:

...

If the queue is moving very quickly you may not see much output, but don't worry, there will be plenty of opportunity once you are working on your own data.
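
As a reminder, a few of the queue commands you are likely to use most often (showq is TACC's queue summary utility; squeue and scancel are standard SLURM commands):

Code Block
languagebash
titleCommon queue commands
showq -u             # summarize just your own jobs in the queue
squeue -u $USER      # the standard SLURM view of your jobs and their states
scancel <job-ID>     # remove a pending job from the queue, or kill it if it is already running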


Evaluating your first job submission

Based on our example you may have expected 4 new files to have been created during the job submission, but instead you will find 3 extra files as follows: <job_name>.e(job-ID), <job_name>.pe(job-ID), and <job_name>.o(job-ID). When things have worked well, these files are typically ignored. When your job fails, these files offer insight into why, so you can fix things and resubmit. 

...

Code Block
languagebash
titlemake a single final file using the cat command and copy to a useful work directory
linenumberstrue
# remember that things after the # sign are ignored by bash 
cat *.out > first_job_submission.final.output  # Remember that the * wildcard will take things in alpha order, if you want you can list each file separately to control what order they go into the new file.
mkdir $WORK/BDIB_GVA_2017
mkdir $WORK/BDIB_GVA_2017/Day1
mkdir $WORK/BDIB_GVA_2017/Day1/first_tacc_job  # each directory must be made in order to avoid getting a no such file or directory error
cp first_job_submission.final.output $WORK/BDIB_GVA_2017/Day1/first_tacc_job
cp *.slurm $WORK/BDIB_GVA_2017/Day1/first_tacc_job
cp *<job-ID> $WORK/BDIB_GVA_2017/Day1/first_tacc_job  #your job-id is the string of numbers following the .o and .e filenames


Moving beyond the preinstalled commands on TACC

If (or when) you looked at what our edits to the .bashrc file did, you would have seen that the last lines were a series of "module load XXXX" commands, and a promise to talk more about them later. I'm sure you will be thrilled to learn that now is that time... As a "classically trained wet-lab biologist", one of the most difficult things I have experienced in computational analysis has been installing new programs to improve my analysis. Programs and their installation instructions tend (or appear) to be written by computational biologists in what at times feels like a foreign language, particularly when things start going wrong. Luckily TACC (and the BioITeam) help get around a large number of these problems by preinstalling many programs, if you know where to look.

TACC modules

Modules are programs or sets of programs that have been set up to run on TACC. They make managing your computational environment very easy. All you have to do is load the modules that you need and a lot of the advanced wizardry needed to set up the linux environment has already been done for you. New commands just appear.

...

You will notice when you type module list that you have several different modules installed already. These come from both TACC defaults (TACC, linux, etc) and several that are used so commonly, both in this class and by biologists in general, that it becomes cumbersome to type "module load python" all the time; we therefore just have them turned on by default by putting them in our profile to load on startup. As you advance in your own data analysis you may start to find yourself constantly loading modules as well. When you become tired of doing this (or see jobs fail to run because the modules that load on the compute nodes are based on your .bashrc file plus commands given to each node), you may want to add additional modules to your .bashrc file. This can be done using the "nano .bashrc" command from your home directory.
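
As a quick reference, these are the module commands you will use most often (bwa is used here only as an example module name; use module avail or module spider to see what is actually installed):

Code Block
languagebash
titleCommon module commands
module list           # show the modules currently loaded in your session
module avail          # list all modules available to load
module spider bwa     # search for a module by name (bwa is just an example)
module load bwa       # load it; its commands now appear on your PATH
module unload bwa     # remove it from your environment again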


Transferring files to and from lonestar with a Mac/Linux machine

Lonestar is tremendously powerful and capable of doing many things, but as most of you have probably noticed with some frustration, it doesn't have much in the way of a GUI (graphical user interface), and does not have the same scrolling capabilities we are used to on our own computers, let alone the ability to actually visualize graphs and more meaningful representations of our data. In order to do these types of things, we have to get our data off of lonestar and onto our own computers. On our diagram of lonestar we showed a boundary of what could be copied and moved within TACC and listed the scp command as a way of moving files to other computers outside of TACC. scp works the same way as the cp command; it just includes more detailed information on the path of where the file is, or where the file is going. Here we will transfer our recently created "first_job_submission.final.output" file from lonestar to the computer you are sitting at as an example. First navigate to your work directory to find your final output file, and determine what the full path to that location is.

...

Files can be moved to lonestar in the same way, just by adding the "<username>@ls5.tacc.utexas.edu:" location information to the destination portion of the command.
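
Putting both directions together, the commands look roughly like this (run them from a terminal on your own computer, not on lonestar; the paths shown are placeholders for wherever your file actually lives):

Code Block
languagebash
titleCopying files to and from lonestar with scp
# copy FROM lonestar to the current directory on your own machine (note the trailing .)
scp <username>@ls5.tacc.utexas.edu:<full/path/to>/first_job_submission.final.output .

# copy TO lonestar by putting the remote location in the destination instead
scp my_local_file.txt <username>@ls5.tacc.utexas.edu:<destination/path>/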

Transferring files to and from lonestar with Windows

Expand
titleIf you are forced to use a windows machine, this may be of use to you.

SSH Secure File Transfer (Windows) is available as part of the SSH Secure Shell client which can be downloaded from Bevoware.

Instructions for using SSH Secure File Transfer are available at http://www.utexas.edu/learn/upload/ssh_client.html

...