Two BRCF research pods also have AMD GPU servers available: the Hopefog and Livestrong PODs. Their use is restricted to the groups that own those pods. See Livestrong and Hopefog pod AMD servers for specific information.

GPU-enabled software

AlphaFold

The AlphaFold protein structure prediction software is available on all AMD GPU servers. The /stor/scratch/AlphaFold directory contains the large required database under the data.3 sub-directory. There is also an AMD example script, /stor/scratch/AlphaFold/alphafold_example_amd.sh, and an alphafold_example_nvidia.sh script if the POD also has NVIDIA GPUs (e.g. the Hopefog pod). Interestingly, our timing tests indicate that AlphaFold performance is quite similar on all the AMD and NVIDIA GPU servers.
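
As a rough sketch of a typical session (the script path is from this page; the input FASTA name and edited copy are hypothetical placeholders):

Code Block
# Copy the AMD example script (path from this page) into your own space
cp /stor/scratch/AlphaFold/alphafold_example_amd.sh ~/my_alphafold_run.sh

# Edit the copy to point at your own input; my_protein.fasta is a
# hypothetical placeholder, not a file that exists on the POD
nano ~/my_alphafold_run.sh

# Time the run so performance can be compared across servers
time bash ~/my_alphafold_run.sh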


PyTorch and TensorFlow examples

Two Python scripts in /stor/scratch/GPU_info can be used to verify that you have access to the server's GPUs from TensorFlow or PyTorch. Run them from the command line under time to compare their run times.

...

If GPUs are available and accessible, the output will indicate that they are being used.
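
For example (a sketch only: gpu_test_tensorflow.py is a hypothetical placeholder; use the actual script names found in /stor/scratch/GPU_info):

Code Block
# See which test scripts are present (directory from this page)
ls /stor/scratch/GPU_info

# Time one of them; replace the hypothetical name below with a real one
time python3 /stor/scratch/GPU_info/gpu_test_tensorflow.py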


ROCm environment

ROCm is AMD's equivalent of NVIDIA's CUDA framework; unlike CUDA, which is proprietary, ROCm is open source.

...

Code Block
export LD_LIBRARY_PATH="/opt/rocm-5.1.3/hip/lib:$LD_LIBRARY_PATH"
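
Before pointing LD_LIBRARY_PATH at a release, it can be worth confirming which ROCm version is actually installed; a quick check, assuming the standard /opt/rocm-<version> install layout:

Code Block
# List installed ROCm releases under /opt
ls -d /opt/rocm-*

# List installed ROCm packages (same command as in the diagnostics section)
dpkg -l | grep rocm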

TensorFlow

The AMD-GPU-specific version of TensorFlow, tensorflow-rocm 2.9.1, is installed on all AMD GPU servers. This version works with ROCm 5.1.3+. If you need to install your own version with pip, specify this version:

Code Block
pip install tensorflow-rocm==2.9.1

You may also need to adjust your LD_LIBRARY_PATH as follows:

Code Block
export LD_LIBRARY_PATH="/opt/rocm-5.1.3/hip/lib:$LD_LIBRARY_PATH"
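
To verify that this build actually sees the GPUs, a quick one-liner (tf.config.list_physical_devices is standard TensorFlow API; an empty list means no GPU was found):

Code Block
# Enumerate the GPUs visible to tensorflow-rocm
python3 -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'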

PyTorch

PyTorch is available as a Docker image on hfogcomp02 and livecomp02. To run it, first define this alias:

Code Block
alias drun='docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx'

Then to shell into the Docker image:

Code Block
export ROCM_HOME=/opt/rocm-5.1.3
drun rocm/pytorch
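
Once inside the container, PyTorch's GPU visibility can be checked with the standard torch.cuda API, which ROCm builds of PyTorch use to report HIP devices:

Code Block
# Inside the rocm/pytorch container: True and a non-zero count mean
# the AMD GPUs are visible to PyTorch
python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'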

...

Command-line diagnostics

  • GPU usage: rocm-smi (see the monitoring sketch after this list)
  • CPU and GPU details: rocminfo
  • What ROCm modules are installed: dpkg -l | grep rocm
  • GPU ↔ GPU/CPU communication bandwidth test
    • between GPU2 and CPU: rocm-bandwidth-test -b2,0
    • between GPU3 and GPU4: rocm-bandwidth-test -b3,4
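
For continuous rather than one-shot monitoring, rocm-smi can be combined with the standard watch utility:

Code Block
# Refresh the GPU usage summary every 5 seconds; exit with Ctrl-C
watch -n 5 rocm-smi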

Sharing resources

Since there is no batch system on BRCF POD compute servers, users must monitor their own resource usage, and that of other users, in order to share resources appropriately. A combined status check is sketched after the list below.

  • Use top to monitor running tasks (or top -i to exclude idle processes)
    • Commands available while top is running include:
      • M - sort task list by memory usage
      • P - sort task list by processor usage
      • N - sort task list by process ID (PID)
      • T - sort task list by run time
      • 1 - show usage of each individual hyperthread
        • they're called "CPUs" but are really hyperthreads
        • this list can be long; non-interactive mpstat may be preferred
  • Use mpstat to monitor overall CPU usage
    • mpstat -P ALL to see usage for all hyperthreads
    • mpstat -P 0 to see usage for a specific hyperthread (here, hyperthread 0)
  • Use free -g to monitor overall RAM and swap space usage (in GB)
  • Use rocm-smi to see GPU usage
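
As noted above, a minimal wrapper script can take a one-shot snapshot of everything at once. Every command below is from this page; only the wrapper itself is new:

Code Block
#!/bin/bash
# quick_status.sh - snapshot of CPU, RAM/swap, and GPU usage
echo "=== CPU usage, all hyperthreads (one 1-second sample) ==="
mpstat -P ALL 1 1

echo "=== RAM and swap usage (GB) ==="
free -g

echo "=== GPU usage ==="
rocm-smi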

AMD GPU and ROCm resources

ROCm GPU-enabling framework

...