
Overview

The Hopefog and Livestrong PODs each have two AMD GPU servers, which enable powerful Machine Learning (ML) workflows.

Hardware info

Architecture specs


Resources

ROCm GPU-enabling framework

Best starting places:

Training Guides

  1. Introduction_to_AMD_7002_processor.pdf
  2. Radeon_Instinct_HPC_Training_2020.pdf
  3. Radeon_Instinct_ML_Training_2020.pdf

Command-line diagnostics

  • GPU usage: rocm-smi
  • CPU and GPU details: rocminfo
  • Installed ROCm packages: dpkg -l | grep rocm
  • GPU ↔ GPU/CPU communication bandwidth tests (the CPU is device 0):
    • between GPU 2 and the CPU: rocm-bandwidth-test -b2,0
    • between GPU 3 and GPU 4: rocm-bandwidth-test -b3,4
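
The diagnostics above can be wrapped in a short availability check so a script degrades gracefully on a machine where the ROCm stack is not installed. This is a minimal sketch, not BRCF-provided tooling:

```shell
# Sketch: run the ROCm diagnostics listed above, but only if the tools exist.
missing=""
for cmd in rocm-smi rocminfo rocm-bandwidth-test; do
    command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
done

if [ -z "$missing" ]; then
    rocm-smi              # GPU usage summary
    rocminfo | head -n 40 # CPU/GPU agent details (full output is long)
else
    echo "Missing ROCm tools:$missing"
fi
```

Running this on a login or compute server prints either the GPU reports or a list of the tools it could not find.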

Sharing Resources

Since there is no batch system on BRCF POD compute servers, users must monitor their own resource usage, and that of other users, in order to share resources appropriately.

  • Use top to monitor running tasks (or top -i to exclude idle processes)
    • interactive keys while top is running:
      • M - sort task list by memory usage
      • P - sort task list by processor usage
      • N - sort task list by process ID (PID)
      • T - sort task list by run time
      • 1 - show usage of each individual hyperthread
        • top calls them "CPUs", but they are really hyperthreads
        • this list can be long; the non-interactive mpstat may be preferred
  • Use mpstat to monitor overall CPU usage
    • mpstat -P ALL to see usage for all hyperthreads
    • mpstat -P 0 to see usage for a specific hyperthread (here, hyperthread 0)
  • Use free -g to monitor overall RAM and swap space usage (in GB)
  • Use rocm-smi to see GPU usage
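
Before launching a large job, the checks above can be combined into a one-shot snapshot. The sketch below uses only standard Linux tools; the awk column position assumes the usual free output format with an "available" column:

```shell
# Quick resource snapshot before starting a heavy job (sketch; assumes the
# standard Linux "free -g" layout where field 7 of the Mem: line is
# "available" memory).
avail_gb=$(free -g | awk '/^Mem:/ {print $7}')   # available RAM, in GB
threads=$(nproc)                                 # number of hyperthreads
echo "Available RAM: ${avail_gb} GB; hyperthreads: ${threads}"
```

If the numbers look low, check top and rocm-smi to see who is using the machine before starting your workload.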
