Remember that PODs are shared resources, and it is important to be aware of how your work can affect others trying to use POD resources. Here are some tips for using POD resources wisely.

Memory usage considerations

Using too much RAM can quickly make a compute server unusable. When a system's main random access memory (RAM) is filled and additional memory requests are made, "pages" of main memory are written out to "swap" space on disk, then read back in when needed again. Since disk I/O is on the order of 1,000 times slower than RAM access, swapping can slow a system down considerably.

In a pathological (but unfortunately not uncommon) pattern, a program (or programs) that needs more memory than is available can cause "thrashing", where pages are swapped in and out of RAM continuously. This will bring a computer to its knees, making it virtually impossible to do anything on it (logins are slow or time out; simple commands "hang" for a long time or never return). We monitor system usage and will intervene when we see this happen, by terminating the offending process(es) if possible, or by rebooting the compute server if not.

You can avoid causing a problem like this by following this advice:

Tips:

  • Know the memory configuration of the compute server you're using
    • free -g will show you total RAM and swap in Gigabytes
  • Before starting a memory intensive job, check the system's current memory status
    • free -g also shows used and available for both main memory and swap
  • Know the memory requirements of your program.
    • Monitor its memory usage while it is running using top (see https://www.booleanworld.com/guide-linux-top-command/)
    • This is particularly important if you plan to run multiple instances of a program, since it will guide you in knowing how many such instances you should run.
  • Run memory intensive processes when system load is otherwise light (e.g. overnight)
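The checks above can also be scripted. The following is a minimal sketch, assuming a Linux server where /proc/meminfo reports MemAvailable; the function names and the 64 GiB figure in the usage comment are illustrative, not part of any BRCF tooling.

```shell
# Print the currently available memory in whole GiB, read from a
# meminfo-style file (defaults to /proc/meminfo on Linux).
available_gib() {
    local meminfo="${1:-/proc/meminfo}"
    awk '/^MemAvailable:/ { print int($2 / 1024 / 1024) }' "$meminfo"
}

# Succeed (exit status 0) only if at least NEEDED_GIB are available.
enough_memory() {
    local needed_gib="$1" meminfo="${2:-/proc/meminfo}" avail
    avail="$(available_gib "$meminfo")"
    [ -n "$avail" ] && [ "$avail" -ge "$needed_gib" ]
}

# Usage (hypothetical program name and memory requirement):
#   enough_memory 64 && ./my_memory_hungry_program
```

A guard like this only reflects memory use at the moment the job starts, so it complements rather than replaces monitoring the job with top while it runs.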

Computational considerations

Running processes unattended

While POD compute servers do not have a batch system, you can still run multiple tasks simultaneously in several different ways. 

For example, you can use terminal multiplexer tools like screen or tmux to create virtual terminal sessions that won't go away when you log off. Then, inside a screen  or  tmux  session you can create multiple sub-shells where you can run different commands.

You can also use the command line utility nohup to start processes in the background, again allowing you to log off and still have the process running.
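As a rough illustration of both approaches, the sketch below wraps nohup in a small helper and lists the basic screen and tmux commands as comments. The helper name run_detached, the session name myjob and the log file name are made up for this example.

```shell
# Start a command so that it keeps running after you log off.
# All output goes to the given log file; the PID is printed so the
# process can be found again later (e.g. with ps or renice).
run_detached() {
    local log="$1"
    shift
    nohup "$@" > "$log" 2>&1 < /dev/null &
    echo "$!"
}

# Usage (hypothetical script name):
#   run_detached align.log ./run_alignment.sh
#
# The terminal multiplexer equivalents:
#   tmux new -s myjob        # start a named session; detach with Ctrl-b d
#   tmux attach -t myjob     # re-attach after logging back in
#   screen -S myjob          # start a named session; detach with Ctrl-a d
#   screen -r myjob          # re-attach after logging back in
```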

 Here are some links on how to use these tools:

Do not run too many processes

Having described how to run multiple processes, it is important that you do not run too many processes at a time, because you are just using one compute server, and you're not the only one using the machine!

How many is "too many"? That really depends on what kind of job it is, what compute/input-output mix it has, and how much RAM it needs. As a general rule, don't run more simultaneous jobs on a POD compute server than you would run on a single TACC compute node.

Before running multiple jobs, you should check RAM usage (free -g will show usage in GB) and see what is already running using the top program (press the 1 key to see per-hyperthread load), or using the who command, or with a command like this:

```bash
ps -ef | grep -v root | grep -v bash | grep -v sshd | grep -v screen | grep -v tmux | grep -v 'www-data'
```

Here is a good article on all the aspects of the top command: https://www.booleanworld.com/guide-linux-top-command/

Finally, be sure to lower the priority of your processes using renice as described below (e.g. renice -n 15 -u `whoami`).
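For a more compact view of who is using the machine, per-process figures from ps can be totaled by user. This is only a sketch: summarize_by_user is a made-up helper name, and it assumes input lines of the form "user %cpu %mem".

```shell
# Read "user pcpu pmem" lines on stdin and print, for each user:
# user name, number of processes, total %CPU, total %MEM.
summarize_by_user() {
    awk '{ n[$1]++; cpu[$1] += $2; mem[$1] += $3 }
         END { for (u in n) printf "%s %d %.1f %.1f\n", u, n[u], cpu[u], mem[u] }' | sort
}

# Usage on a live system (the trailing "=" suppresses each column header):
#   ps -e -o user= -o pcpu= -o pmem= | summarize_by_user
```

Note that the %CPU value reported by ps is averaged over each process's lifetime, so top remains the better tool for seeing the load right now.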

Lower priority for large, long-running jobs

If you have one or more jobs that use multiple threads or do significant I/O, their execution can affect system responsiveness for other users.

To help avoid this, please use the renice tool to manipulate the priority of your tasks (a priority of 15 is a good choice). It's easy to do, and here's a quick tutorial: http://www.thegeekstuff.com/2013/08/nice-renice-command-examples/?utm_source=tuicool

For example, before you start any tasks, you can set the default priority to nice 15 as shown here. Anything you start from then on (from this shell) should inherit the nice 15 value.

```bash
renice +15 $$
```

Once you have tasks running, their priority can be changed for all of them by specifying your user name:

```bash
renice +15 -u `whoami`
```

or for a particular process id (PID):

```bash
renice +15 -p <some PID number>
```
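An alternative to adjusting priority after the fact is to start a command at low priority with the standard nice utility. The wrapper below is a small sketch; run_low_priority is a made-up name, and 15 is simply the value suggested above.

```shell
# Run a command with its nice value raised by 15 (i.e. at lower priority).
# Child processes started by the command inherit the same nice value.
run_low_priority() {
    nice -n 15 "$@"
}

# Usage (hypothetical script name):
#   run_low_priority ./run_alignment.sh
#
# Check the nice value (NI column) of your running processes with:
#   ps -o pid,ni,comm -u "$(whoami)"
```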

Multi-processing: cores vs hyperthreads

Many programs offer an option to divide their work among multiple processes, which can reduce the total clock time the program will run. The option may refer to "processes", "cores" or "threads", but in each case it actually targets the available computing units on a server. Examples include the samtools sort --threads option, the bowtie2 -p/--threads option, and in R, library(doParallel); registerDoParallel(cores = NN).

One thing to keep in mind here is the difference between cores and hyperthreads. Cores are physical computing units, while hyperthreads are virtual computing units -- the hardware presents each core to the operating system as two logical CPUs, so that the single compute unit can be used by two processes.

The AvailablePODs table describes the compute servers that are associated with each BRCF pod, along with their available cores and (hyper)threads. (Note that most servers are dual-CPU, meaning that total core count is double the per-CPU core count, so a dual 4-core CPU machine would have 8 cores.) You can also see the hyperthread and core counts on any server via:

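```bash
grep -c 'core id' /proc/cpuinfo              # actually the total number of hyperthreads!
grep 'siblings' /proc/cpuinfo | head -1      # hyperthreads per physical CPU (socket)
grep 'cpu cores' /proc/cpuinfo | head -1     # physical cores per physical CPU (socket)
```

(Yes, the fact that counting 'core id' lines gives hyperthreads rather than cores is confusing. Note also that 'siblings' and 'cpu cores' are per-CPU values: on a dual-CPU server with two hyperthreads per core, the 'siblings' value happens to equal the total number of physical cores. But what do you expect -- this is Unix (smile))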

Since hyperthreads look like available computing units, parallel processing options that detect "cores" usually really detect hyperthreads. Why does this matter? 

The bottom line:

  • Virtual hyperthreads are useful if the work a process is doing periodically "yields", typically to perform input/output operations, since waiting for I/O allows the core to be used by other work. Many NGS tools fall into this category since they read/write sequencing files.
  • Physical cores are best used when a program's work is compute bound. When processing is compute bound -- as is typical of matrix-intensive machine learning algorithms -- hyperthreads actually degrade performance, because two compute-bound hyperthreads are competing for the same physical core, and there is OS-level overhead involved in process switching between the two.

So before you select a process/core/thread count for your program, consider whether it will perform significant I/O. If so, you can specify a higher count. If it is compute bound (e.g. machine learning), be sure to specify a count low enough to leave free hyperthreads for others to use.
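One way to turn this advice into a number is sketched below. The policy it encodes (at most half the physical cores for compute-bound work, at most half the hyperthreads for I/O-bound work) is only an illustrative assumption, not a BRCF rule, and suggest_threads is a made-up helper name.

```shell
# suggest_threads LOGICAL_CPUS THREADS_PER_CORE WORKLOAD
# WORKLOAD is "compute" or "io". Prints a suggested thread count.
suggest_threads() {
    local logical="$1" per_core="$2" workload="$3" physical n
    physical=$((logical / per_core))
    case "$workload" in
        compute) n=$((physical / 2)) ;;   # leave cores free for other users
        io)      n=$((logical / 2)) ;;    # I/O waits let hyperthreads share cores
        *)       echo "workload must be 'compute' or 'io'" >&2; return 1 ;;
    esac
    if [ "$n" -lt 1 ]; then n=1; fi
    echo "$n"
}

# Usage: nproc --all reports the number of hyperthreads (logical CPUs);
# lscpu reports "Thread(s) per core", which is 2 with hyperthreading on.
#   samtools sort --threads "$(suggest_threads "$(nproc --all)" 2 io)" ...
```

Whatever rule you choose, check top first: a count that is reasonable on an idle server may be too high when others are already running jobs.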

Note that this issue with machine learning (ML) workflows being incredibly compute bound is the main reason ML processing is best run on GPU-enabled servers. While none of our current PODs have GPUs, GPU-enabled servers are available at TACC. Additionally, Austin's Advanced Micro Devices, who are trying to compete with NVIDIA in the GPU market, will soon be offering a "GPU cloud" that will be available to UT researchers. We're working with them on this initiative and will provide access information when it is available.

Input/Output considerations

Avoid heavy I/O load

Please be aware of the potential effects of the input/output (I/O) operations in your workflows.

Many common bioinformatics workflows are largely I/O bound; in other words, they do enough input/output that it is essentially the gating factor in execution time. This is in contrast to simulation or modeling type applications, which are essentially compute bound.

It is underappreciated that I/O is much more difficult to parallelize than compute. To add more compute power, one can generally just increase the number of processors or their speed, or optimize their CPU-to-memory architecture, all of which greatly benefit compute-bound tasks.

I/O capacity, on the other hand, cannot be scaled up so simply. Large compute clusters such as TACC expose large single file system namespaces to users (e.g. Work, Scratch), but these are implemented using multiple redundant storage systems managed by a sophisticated parallel file system (Lustre, at TACC) to appear as one. Even so, file system outages at TACC caused by heavy I/O are not uncommon.

In the POD architecture, all compute servers share a common storage server, whose file system is accessed over a high-bandwidth local network (NFS over 10 Gbit ethernet). This means that heavy I/O to shared storage initiated from any compute server can negatively affect users on all compute servers.

For example, as few as three simultaneous invocations of gzip or samtools sort on large files can degrade system responsiveness for other users. If you notice that doing an ls or command completion on the command line seems to be taking forever, this can be a sign of an excessive I/O load (although very high compute loads can occasionally cause similar issues).
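When many files need the same I/O-heavy treatment, one option is to cap how many run at once rather than launching them all together. The sketch below uses the -P option of GNU xargs to do this for gzip; gzip_limited is a made-up helper name, and the limit of 2 in the usage comment is just an example.

```shell
# gzip the given files, running at most MAX_JOBS compressions at a time.
gzip_limited() {
    local max_jobs="$1"
    shift
    [ "$#" -gt 0 ] || return 0
    # NUL-separated names keep file names with spaces intact.
    printf '%s\0' "$@" | xargs -0 -n 1 -P "$max_jobs" gzip
}

# Usage (hypothetical file names):
#   gzip_limited 2 sample1.fastq sample2.fastq sample3.fastq sample4.fastq
```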

Transfer large files directly to the storage server

BRCF storage servers are just Linux servers, but ones you access from compute servers over a high-speed internal network. While they are not available for interactive shell (ssh) access, they provide direct file transfer capability via scp or rsync.

Using the storage server as a file transfer target is useful when you have many files and/or large files, as it provides direct access to the shared storage. Going through a compute server is also possible, but involves an extra step in the path – from the compute-server to its network-attached storage-server.

The solution is to target your POD's storage server directly using scp or rsync. When you do this, you are going directly to where the data is physically located, so you avoid extra network hops and do not burden heavily-used compute servers.

Tip

Note that direct storage server file transfer access is only available from UT network addresses, from TACC, or using the UT VPN service.

Please see this FAQ for more information: I'm having trouble transferring files to/from TACC.
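A direct transfer might look like the sketch below. The host name storage-server.example.edu and the user name are placeholders, since this page does not give your POD's storage server address, and push_to_storage is a made-up helper. Setting DRY_RUN=1 prints the command instead of running it, which is a convenient way to check it first.

```shell
# Copy SRC to REMOTE (user@host:/path) with rsync.
# -a preserves permissions and times, -v is verbose, -P shows progress
# and keeps partially transferred files so a transfer can be resumed.
push_to_storage() {
    local src="$1" remote="$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo rsync -avP "$src" "$remote"
    else
        rsync -avP "$src" "$remote"
    fi
}

# Usage (placeholder user and host names):
#   push_to_storage ./fastq/ myuser@storage-server.example.edu:/stor/work/MyGroup/fastq/
```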

Storage management considerations

Manage storage areas by project activity

Shared POD storage servers are high capacity (~50 to ~250 TB), but space is not infinite! The same goes for backup storage, since the BRCF must have capacity to back up all POD Home and Work areas. The following guidelines will help you and your colleagues stay within storage limits.

The three categories of data activity determine where the data should reside:

  1. Data that is active, such as project directories where new files are added and ongoing analysis is taking place.
    • This data belongs in your Work area where it is regularly backed up.
  2. Data that is no longer active, but needs to be accessible for reference, such as projects that are complete but that you refer to from time to time.
    • This data belongs in your Scratch area so that it does not consume backup space.
    • Please contact us at rctf-support@utexas.edu to request that a long-term archive of the data be made to tape.
      • We can also efficiently move the data from Work to Scratch for you since we can access the storage server directly.
  3. Data that is no longer active and does not need to be referenced.
    • This data can be removed entirely so that it does not consume either storage server or backup server space.
    • If a copy should be preserved, please contact us at rctf-support@utexas.edu to request that a long-term tape archive be made before the data is deleted.
      • We can also efficiently remove the data for you since we can access the storage server directly.

Also keep in mind the other types of data that belong in Scratch and do not need backing up or archiving, such as data and references from public databases and downloaded software – these can be re-downloaded if necessary.
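To find candidates for the second and third categories, you can look for project directories in which no file has been modified recently. The following sketch uses GNU find; inactive_dirs is a made-up helper name, and the 180-day threshold in the usage comment is an arbitrary example.

```shell
# List the immediate sub-directories of ROOT that contain no regular
# file modified within the last DAYS days.
inactive_dirs() {
    local root="$1" days="$2" d
    for d in "$root"/*/; do
        [ -d "$d" ] || continue
        # -quit stops at the first recently modified file found.
        if [ -z "$(find "$d" -type f -mtime "-$days" -print -quit)" ]; then
            echo "${d%/}"
        fi
    done
}

# Usage:
#   inactive_dirs /stor/work/MyGroup 180
```

Modification time is only a hint, since a finished project that is still consulted regularly will also be listed, so treat the output as a starting point for discussion with your group.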

Avoid having too many small files

While the ZFS file system we use is quite robust, we can experience issues in the weekly backup and periodic archiving process when there are too many small files in a directory tree.

What is too many? Ten million or more.

If the files are small, they don't take up much storage space. But the fact that there are so many causes the backup or archiving to run for a really long time. For weekly backups, this can mean that the previous week's backup is not done by the time the next one starts. For archiving, it means it can take weeks on end to archive a single directory that has many millions of small files.

Backing up gets even worse when a directory with many files is just moved or renamed. In this case the files need to be deleted from the old location and added to the new one – and both of these operations can be extremely long-running.

To see how many files (termed "inodes" in Unix) there are under a directory tree, use the df -i command. For example:

```bash
df -i /stor/work/MyGroup/my_dir
```

The results might look something like this:

```
Filesystem               Inodes     IUsed        IFree IUse% Mounted on
stor/work/MyGroup  103335902213  28864562 103307037651    1% /stor/work/MyGroup
```

The IUsed column (here 28864562) is the number of inodes (files plus directories) in the directory tree listed under Filesystem (here /stor/work/MyGroup). Note that the reported Filesystem may be different from the one you queried, depending on the structure of the ZFS file systems.
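Because df -i reports on a whole file system, a count for one specific directory tree needs a different tool. The sketch below simply counts what find lists; count_entries is a made-up helper name.

```shell
# Count the files and directories under DIR, including DIR itself.
# -xdev keeps find from crossing into other mounted file systems.
# (A file name containing a newline would be counted more than once.)
count_entries() {
    find "$1" -xdev | wc -l | tr -d ' '
}

# Usage:
#   count_entries /stor/work/MyGroup/my_dir
```

Walking a tree with millions of entries is itself a significant I/O load on the shared storage server, so run this sparingly and preferably when the system is quiet.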

There are several work-arounds for this issue.

1) Move the files to a temporary directory.
The backup process excludes any sub-directory anywhere in the file system directory tree named tmp, temp, or backups. So if there are files you don't care about, just rename the directory to, for example, tmp. There will be a one-time deletion of the directory under its previous name, but that would be it. 

2) Move the directories to a Scratch area.
Scratch areas are not backed up, so will not cause an issue. The directory can be accessed from your Work area via a symbolic link. Please Contact Us if you would like us to help move large directories of yours to Scratch (we can do it more efficiently with our direct access to the storage server).

3) Zip or Tar the directory
If these are important files you need to have backed up, zipping or tarring the directory is the way to go. This converts a directory and all its contents into a single, larger file that can be backed up or archived efficiently. Please Contact Us if you would like us to help with this, since with our direct access to the storage server we can perform zip and tar operations much more efficiently than you can from a compute server.
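For a directory of modest size that you want to handle yourself, the third work-around can be sketched as follows. archive_dir is a made-up helper name; it creates a compressed tar file next to the directory and checks that the archive can be listed, but deliberately leaves removal of the original directory to you.

```shell
# Create DIR.tar.gz alongside DIR, then verify the archive is readable.
archive_dir() {
    local dir="${1%/}" parent name
    parent="$(dirname "$dir")"
    name="$(basename "$dir")"
    # -C makes the paths stored in the archive relative to DIR's parent.
    tar -czf "$dir.tar.gz" -C "$parent" "$name" &&
        tar -tzf "$dir.tar.gz" > /dev/null
}

# Usage:
#   archive_dir /stor/work/MyGroup/my_dir
#   # ...then, once satisfied with my_dir.tar.gz, remove the original:
#   # rm -rf /stor/work/MyGroup/my_dir
```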

If your analysis pipeline creates many small files as a matter of course, you should consider modifying the processing to create the small files in a tmp directory, then zipping or tarring them as a final step.
