Ascent GPU Notes

Number of Ranks Per Node

On each node, it is important to launch only one MPI rank per GPU. If your cluster contains multiple GPUs per node, consecutive ranks should be assigned to the same node, which is the default behavior of most job schedulers. By default, Ascent queries each CUDA-capable GPU device and assigns a rank to a device based on the rank number modulo the number of devices. Collisions (multiple ranks sharing a device) could result in a run-time failure or significant delays. This default behavior can be overridden; please see the Ascent options for more details.
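As a minimal sketch (the application name is hypothetical), a two-node allocation with 6 GPUs per node would be launched with exactly 6 ranks per node, so the rank-modulo-device-count assignment gives every rank its own GPU:

# 12 ranks over 2 nodes, with 6 ranks and 6 GPUs per node: rank % 6 selects a unique device on each node
srun -N 2 -n 12 --ntasks-per-node 6 --gpus-per-node 6 ./my_ascent_app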

Using ROCm-SMI on AMD GPUs

ROCm-SMI stands for ROCm System Management Interface and provides an interface for applications and users to monitor and control GPU applications.

ROCm-SMI can be used to verify GPU usage. This is done by launching your application and then using ROCm-SMI on the same node to monitor the running job.

The following call will monitor the usage of GPUs every tenth of a second:

watch -n 0.1 rocm-smi
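One way to run this alongside a batch job (a sketch; it assumes a Slurm version that supports the --overlap flag) is to start a second job step inside the same allocation and run the watch command there while your application is running:

# attach an extra step to the running job and watch GPU activity on one of its nodes
srun --jobid=<your job id> --overlap -N 1 -n 1 --pty watch -n 0.1 rocm-smi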

Using RocProf on AMD GPUs

RocProf is a command line tool provided by ROCm that is implemented on top of the rocProfiler and rocTracer APIs. RocProf allows for application tracing and counter collection on AMD GPUs, and can be a useful tool for confirming your application is running on the GPU as expected. Below you will find a basic guide to using rocProf with Ascent.

Enabling RocProf in Your Build

In order for rocProf to profile your application, you must build your application with the following modules and compiler/linker flags:

module load PrgEnv-cray
module load craype-accel-amd-gfx90a
module load rocm/5.2.0

export CXX="$(which CC) -x hip"
export CXXFLAGS="-ggdb -O3 -std=c++17 -Wall"
export LD="$(which CC)"
export LDFLAGS="${CXXFLAGS} -L${ROCM_PATH}/lib"
export LIBS="-lamdhip64"

Many of the modules and flags required for rocProf are also needed simply to execute on the GPU. Below are the modules and flags required by Ascent and its dependencies to enable HIP on Crusher:

module load PrgEnv-cray
module load craype-accel-amd-gfx90a
module load cmake/3.22.2
module load rocm/5.2.0
module load cray-mpich

# we want conduit to use this
module load cray-hdf5-parallel/1.12.1.1

# GPU-aware MPI
export MPICH_GPU_SUPPORT_ENABLED=1

export AMREX_AMD_ARCH=gfx90a

export CC=$(which cc)
export CXX=$(which CC)
export FC=$(which ftn)
export CFLAGS="-I${ROCM_PATH}/include"
export CXXFLAGS="-I${ROCM_PATH}/include -Wno-pass-failed"
export LDFLAGS="-L${ROCM_PATH}/lib -lamdhip64"

When combined, the modules and flags needed to enable both HIP and rocProf on Crusher become the following:

module load PrgEnv-cray
module load craype-accel-amd-gfx90a
module load cmake/3.22.2
module load rocm/5.2.0
module load cray-mpich

# we want conduit to use this
module load cray-hdf5-parallel/1.12.1.1

# GPU-aware MPI
export MPICH_GPU_SUPPORT_ENABLED=1

export AMREX_AMD_ARCH=gfx90a

export CC=$(which cc)
export CXX="$(which CC) -x hip"
export FC=$(which ftn)
export CFLAGS="-I${ROCM_PATH}/include"
export CXXFLAGS="-I${ROCM_PATH}/include -ggdb -O3 -std=c++17 -Wall -Wno-pass-failed"
export LD="$(which CC)"
export LDFLAGS="${CXXFLAGS} -L${ROCM_PATH}/lib -lamdhip64"
export LIBS="-lamdhip64"
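As an optional sanity check (a sketch, not part of Ascent's build; the file name check_hip.cpp and the use of hipGetDeviceCount are illustrative), you can confirm that the HIP-enabled compiler wrapper and the flags above work before building Ascent and its dependencies:

# compile a trivial HIP translation unit using the wrapper and flags exported above
cat > check_hip.cpp << 'EOF'
#include <hip/hip_runtime.h>
#include <cstdio>

int main()
{
    int count = 0;
    // query the number of visible AMD GPUs
    if (hipGetDeviceCount(&count) != hipSuccess)
    {
        std::printf("hipGetDeviceCount failed\n");
        return 1;
    }
    std::printf("visible GPUs: %d\n", count);
    return 0;
}
EOF

${CXX} ${CXXFLAGS} check_hip.cpp -o check_hip -L${ROCM_PATH}/lib ${LIBS}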

Running on the GPU

This section demonstrates submitting a job on Crusher via Slurm and a batch script. The `job.sh` batch script defines submission options, loads the necessary modules, and launches the parallel job. Below is an example `job.sh` that launches the Kripke simulation, one of the example integrations provided by Ascent.

#!/bin/bash
#SBATCH -A csc340
#SBATCH -t 00:10:00
#SBATCH -N 1
#SBATCH -J kripke_gpu
#SBATCH -o kripke.output
#SBATCH -e kripke.error


module load PrgEnv-cray
module load craype-accel-amd-gfx90a
module load rocm/5.2.0
module load cmake/3.22.2
module load cray-mpich
module load cray-hdf5-parallel/1.12.1.1

srun -n 6 --ntasks-per-node 6 -G 6 --gpus-per-node 6 ./kripke_par --procs 1,1,6  --zones 60,60,60 --niter 3 --dir 1:2 --grp 1:1 --legendre 4 --quad 4:4

In the above example, the #SBATCH options are:

Option    Description
-A        Account to charge submitted job
-t        Requested walltime for submitted job
-N        Number of nodes
-J        Job name
-o        stdout file name
-e        stderr file name

The parallel job is launched using srun with the following options:

Option               Description
-n                   Number of MPI ranks
--ntasks-per-node    Number of MPI ranks per node
-G                   Number of GPUs
--gpus-per-node      Number of GPUs per node
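To submit the batch script and check on the job, one minimal workflow is the following (the file names match the example script above):

sbatch job.sh            # submit the batch script
squeue -u $USER          # check the job's state in the queue
tail -f kripke.output    # follow stdout once the job starts; stderr goes to kripke.error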

For more information on running jobs on Crusher, see the Crusher user guide: https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html#running-jobs. For more information about running Kripke, see the Kripke documentation.

Profiling with RocProf

You can profile your job using rocProf by adding it to the `job.sh` batch script, placing the rocprof command (and its options) after the srun launch command and its options, and before the executable. One way to confirm your job is executing on the GPU is to run rocProf with the --hip-trace flag; this generates a file showing the HIP function calls and kernels. Conversely, an easy way to tell your job is NOT running on the GPU is that rocProf produces no output files.

Below is an example of using rocProf that generates an output file for each MPI rank. Note: this could result in several GBs of data, depending on the size of your job, since there are typically 6 files generated per MPI rank. Known issue: currently, rocprof on Crusher fails to produce output files for non-root MPI ranks.

#!/bin/bash
#SBATCH -A csc340
#SBATCH -t 00:10:00
#SBATCH -N 1
#SBATCH -J kripke_gpu
#SBATCH -o kripke.output
#SBATCH -e kripke.error


module load PrgEnv-cray
module load craype-accel-amd-gfx90a
module load rocm/5.2.0
module load cmake/3.22.2
module load cray-mpich
module load cray-hdf5-parallel/1.12.1.1

srun -n 6 --ntasks-per-node 6 -G 6 --gpus-per-node 6 rocprof -o ${SLURM_JOBID}-${SLURM_PROCID}.csv --hip-trace ./kripke_par --procs 1,1,6 --zones 60,60,60 --niter 3 --dir 1:2 --grp 1:1 --legendre 4 --quad 4:4

Helpful RocProf Flags

Help Flag:

Option                         Description
${ROCM_PATH}/bin/rocprof -h    Display command line options

Housekeeping Flags:

Option                      Description
--timestamp <on|off>        Turn on/off GPU kernel dispatch timestamps (default: off)
--basenames <on|off>        Turn on/off truncating GPU kernel names, removing template parameters and argument types (default: off)
-o <output csv file>        Save direct counter data to a specified file name (default: <input file base>.csv)
-d <directory>              Send profiling data to a specified directory (default: /tmp)
-t <temporary directory>    Change the directory where temporary files are created, allowing you to save these files (default: /tmp)

Profiling Flags:

Option                 Description
-i input<.txt|.xml>    Specify an input file (note: all output files will be named input.*)
--stats                Generate a file of kernel execution stats called <output file>.stats.csv
--hsa-trace            Trace GPU kernels, host HSA events, and HIP memory copies; generates a JSON file compatible with chrome-tracing
--hip-trace            Trace HIP API calls; generates a JSON file compatible with chrome-tracing
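For example, a single-rank run that collects both kernel statistics and a HIP trace might look like the following (a sketch: the output base name results and the reduced Kripke problem size are arbitrary, and the exact set of generated files can vary with the ROCm version):

rocprof --stats --hip-trace -o results.csv ./kripke_par --procs 1,1,1 --zones 32,32,32 --niter 1
# results.stats.csv : kernel execution statistics (from --stats)
# results.json      : chrome-tracing compatible timeline of HIP calls and kernels (from --hip-trace)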