...
Two BRCF research pods also have AMD GPU servers available: the Hopefog and Livestrong PODs. Their use is restricted to the groups that own those pods. See Livestrong and Hopefog pod AMD servers for specific information.
GPU-enabled software
AlphaFold
...
The AlphaFold protein structure prediction software is available on all AMD GPU servers. The /stor/scratch/AlphaFold directory contains the large required database under the data.3 sub-directory. There is also an AMD example script, /stor/scratch/AlphaFold/alphafold_example_amd.sh, and an alphafold_example_nvidia.sh script if the POD also has NVIDIA GPUs (e.g., the Hopefog pod). Interestingly, our timing tests indicate that AlphaFold performance is quite similar across the AMD and NVIDIA GPU servers.
PyTorch and TensorFlow examples
Two Python scripts are located in /stor/scratch/GPU_info that can be used to verify you have access to the server's GPUs from TensorFlow or PyTorch. Run them from the command line using time to compare their run times.
...
If GPUs are available and accessible, the output generated will indicate they are being used.
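As a minimal illustration (a sketch, not the site-provided scripts, whose names are not listed here), a check along the following lines reports whether each framework can see a GPU, degrading gracefully when a framework is not installed:

```python
# Minimal GPU-visibility check (a sketch, not the site-provided scripts).
# Returns the number of visible GPUs per framework, or None if the
# framework is not installed in the current environment.
def gpu_report():
    report = {}
    try:
        import torch
        # ROCm builds of PyTorch expose AMD GPUs through the torch.cuda API
        report["torch"] = torch.cuda.device_count() if torch.cuda.is_available() else 0
    except ImportError:
        report["torch"] = None
    try:
        import tensorflow as tf
        report["tensorflow"] = len(tf.config.list_physical_devices("GPU"))
    except ImportError:
        report["tensorflow"] = None
    return report

print(gpu_report())
```

A non-zero count for either framework indicates the GPUs are visible; a zero with the framework installed suggests a driver or ROCm configuration problem.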
...
ROCm environment
ROCm is AMD's equivalent of NVIDIA's CUDA framework; unlike CUDA, which is proprietary, ROCm is open source.
...
```
export LD_LIBRARY_PATH="/opt/rocm-5.1.3/hip/lib:$LD_LIBRARY_PATH"
```
TensorFlow
The AMD-GPU-specific version of TensorFlow, tensorflow-rocm 2.9.1, is installed on all AMD GPU servers. This version works with ROCm 5.1.3+. If you need to install your own copy with pip, specify this version:
```
pip install tensorflow-rocm==2.9.1
```
You may also need to adjust your LD_LIBRARY_PATH as follows:
```
export LD_LIBRARY_PATH="/opt/rocm-5.1.3/hip/lib:$LD_LIBRARY_PATH"
```
PyTorch
PyTorch is available as a Docker image on hfogcomp02 and livecomp02. To run it, first define this alias:
```
alias drun='docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx'
```
Then to shell into the Docker image:
```
export ROCM_HOME=/opt/rocm-5.1.3
drun rocm/pytorch
```
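Once inside the container, a quick check like the following (a sketch) confirms whether PyTorch can see the GPUs; ROCm builds of PyTorch report AMD GPUs through the torch.cuda API:

```python
# Sketch: run inside the rocm/pytorch container to confirm GPU visibility.
def visible_gpus():
    try:
        import torch
    except ImportError:
        return None  # torch not installed (e.g., when run outside the container)
    # ROCm builds of PyTorch expose AMD GPUs through the torch.cuda API
    return torch.cuda.device_count() if torch.cuda.is_available() else 0

print("visible GPUs:", visible_gpus())
```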
...
Command-line diagnostics
- GPU usage: rocm-smi
- CPU and GPU details: rocminfo
- Installed ROCm packages: dpkg -l | grep rocm
- GPU ↔ GPU/CPU communication bandwidth tests:
  - between GPU2 and the CPU: rocm-bandwidth-test -b2,0
  - between GPU3 and GPU4: rocm-bandwidth-test -b3,4
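Before scripting around these diagnostics, it can help to confirm the tools are actually on your PATH; a small sketch using only the Python standard library:

```python
# Sketch: confirm the ROCm diagnostic tools are on PATH before calling them.
import shutil

def rocm_tool_paths():
    # Tool names as listed in the diagnostics above
    tools = ("rocm-smi", "rocminfo", "rocm-bandwidth-test")
    return {tool: shutil.which(tool) for tool in tools}

for tool, path in rocm_tool_paths().items():
    print(f"{tool}: {path or 'not found on PATH'}")
```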
Sharing resources
Since there's no batch system on BRCF POD compute servers, it is important for users to monitor their resource usage and that of other users in order to share resources appropriately.
- Use top to monitor running tasks (or top -i to exclude idle processes)
  - commands while top is running include:
    - M - sort task list by memory usage
    - P - sort task list by processor usage
    - N - sort task list by process ID (PID)
    - T - sort task list by run time
    - 1 - show usage of each individual hyperthread
      - they're called "CPUs" but are really hyperthreads
      - this list can be long; non-interactive mpstat may be preferred
- Use mpstat to monitor overall CPU usage
  - mpstat -P ALL to see usage for all hyperthreads
  - mpstat -P 0 to see usage for a specific hyperthread (here, hyperthread 0)
- Use free -g to monitor overall RAM and swap space usage (in GB)
- Use rocm-smi to see GPU usage
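For scripted monitoring, some of the same numbers are available from the Python standard library; a small sketch (Unix-only, since os.getloadavg() is not available on Windows):

```python
# Sketch: the hyperthread count and load that top/mpstat report, from Python.
import os

def host_snapshot():
    return {
        "hyperthreads": os.cpu_count(),     # logical CPUs (hyperthreads)
        "loadavg_1_5_15": os.getloadavg(),  # 1-, 5-, 15-minute load averages (Unix)
    }

print(host_snapshot())
```

A sustained 1-minute load well above the hyperthread count is a sign the server is oversubscribed and work should be deferred or scaled back.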
AMD GPU and ROCm resources
ROCm GPU-enabling framework
...