...

  1. Introduction_to_AMD_7002_processor.pdf
  2. Radeon_Instinct_HPC_Training_2020.pdf
  3. Radeon_Instinct_ML_Training_2020.pdf

Command-line diagnostics

  • GPU usage: rocm-smi
  • CPU and GPU details: rocminfo
  • What ROCm modules are installed: dpkg -l | grep rocm
  • GPU ↔ GPU/CPU communication bandwidth test
    • between GPU2 and CPU: rocm-bandwidth-test -b2,0
    • between GPU3 and GPU4: rocm-bandwidth-test -b3,4
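The read-only diagnostics above can be collected into one script. The following is a minimal sketch, not an official BRCF tool: the function names run_if_present and rocm_diagnostics are hypothetical, and it assumes only the commands already listed above. The bandwidth test is left out on purpose, since it puts load on the GPUs and is better run by hand.

```shell
#!/usr/bin/env bash
# Sketch: run each read-only ROCm diagnostic, skipping tools that are not installed.
# run_if_present and rocm_diagnostics are illustrative names, not part of ROCm.

run_if_present() {
    # $1 is the command name; all arguments are passed through unchanged.
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $* =="
        "$@"
    else
        echo "skipped: $1 not found"
    fi
}

rocm_diagnostics() {
    run_if_present rocm-smi    # GPU usage
    run_if_present rocminfo    # CPU and GPU details
    if command -v dpkg >/dev/null 2>&1; then
        echo "== installed ROCm packages =="
        dpkg -l | grep rocm || echo "no packages matching 'rocm'"
    else
        echo "skipped: dpkg not found"
    fi
}
```

After sourcing the file, call rocm_diagnostics to print all three reports in sequence.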

Sharing Resources

Since there's no batch system on BRCF POD compute servers, it is important for users to monitor their resource usage and that of other users in order to share resources appropriately.

  • Use top to monitor running tasks (or top -i to exclude idle processes)
    • interactive commands while top is running include (keys are case-sensitive):
      • M - sort task list by memory usage
      • P - sort task list by processor usage
      • N - sort task list by process ID (PID)
      • T - sort task list by run time
      • 1 - show usage of each individual hyperthread
        • they're called "CPUs" but are really hyperthreads
        • this list can be long; non-interactive mpstat may be preferred
  • Use mpstat to monitor overall CPU usage
    • mpstat -P ALL to see usage for all hyperthreads
    • mpstat -P 0 to see specific hyperthread usage
  • Use free -g to monitor overall RAM and swap space usage (in GiB)
  • Use rocm-smi to see GPU usage
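To see how load is split among users, the per-process figures from ps can be totalled per user. This is a rough sketch under stated assumptions: summarise_by_user and usage_by_user are hypothetical helper names, and the ps options assume the procps version of ps found on typical Linux servers. Note that ps reports %CPU averaged over each process's lifetime, so top remains the better view of instantaneous load.

```shell
#!/usr/bin/env bash
# Sketch: total CPU and memory share per user (helper names are illustrative).

summarise_by_user() {
    # Reads lines of "user %cpu %mem" on stdin.
    # Prints "user total_cpu total_mem", highest CPU total first.
    awk '{ cpu[$1] += $2; mem[$1] += $3 }
         END { for (u in cpu) printf "%s %.1f %.1f\n", u, cpu[u], mem[u] }' |
        sort -k2,2nr
}

usage_by_user() {
    # Assumes procps ps: user:32 widens the user column, --no-headers drops the header row.
    ps -eo user:32,pcpu,pmem --no-headers | summarise_by_user
}
```

After sourcing the file, run usage_by_user for a one-shot snapshot; the CPU column is a sum of per-process percentages, so it can exceed 100 on a multi-core server.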