.. ############################################################################ .. # Copyright (c) Lawrence Livermore National Security, LLC and other Ascent .. # Project developers. See top-level LICENSE AND COPYRIGHT files for dates and .. # other details. No copyright assignment is required to contribute to Ascent. .. ############################################################################ Ascent GPU Notes ================== Number of Ranks Per Node ------------------------ On each node, it is important to only launch one MPI rank per GPU. If your cluster contains multiple GPUs per node, it is important that consecutive ranks be assigned to each node, which is the default behavior of most job schedulers. By default, each CUDA capable GPU device is queried, and a rank is assigned a device based on the rank number modulo the number of devices. Collisions could result in a run-time failure or significant delays. This default behavior can be overridden. Please see the Ascent options for more details. Using ROCm-SMI on AMD GPUs -------------------------- ROCm-SMI stands for ROCm System Management Interface and provides an interfaces for applications and users to monitor and control GPU applications. ROCm-SMI can be used to verify GPU usage. This is done by launching your application and then using ROCm-SMI on the same node to monitor the running job. The following call will monitor the usage of GPUs every tenth of a second: .. code-block:: sh watch -n 0.1 rocm-smi Using RocProf on AMD GPUs ------------------------- RocProf is a command line tool provided by ROCm that is implemented on top of the rocProfiler and rocTracer APIs. RocProf allows for application tracing and counter collection on AMD GPUs, and can be a useful tool for confirming your application is running on the GPU as expected. Below you will find a basic guide to using rocProf with Ascent. Enabling RocProf in Your Build ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ In order for rocProf to profile your application, you must compile your build with the following modules and library flags: .. code-block:: sh module load PrgEnv-cray module load craype-accel-amd-gfx90a module load rocm/5.2.0 export CXX="$(which CC) -x hip" export CXXFLAGS="-ggdb -03 -std=c++17 -Wall" export LD="$(which CC)" export LDFLAGS="${CXXFLAGS} -L${ROCM_PATH}/lib" export LIBS="-lamdhip64" Many of the modules and flags required for rocProf are required in order to execute on the GPU. Below are the modules and flags that are required by Ascent and its dependencies to enable HIP on Crusher: .. code-block:: sh module load PrgEnv-cray module load craype-accel-amd-gfx90a module load cmake/3.22.2 module load rocm/5.2.0 module load cray-mpich # we want conduit to use this module load cray-hdf5-parallel/1.12.1.1 # GPU-aware MPI export MPICH_GPU_SUPPORT_ENABLED=1 export AMREX_AMD_ARCH=gfx90a export CC=$(which cc) export CXX=$(which CC) export FC=$(which ftn) export CFLAGS="-I${ROCM_PATH}/include" export CXXFLAGS="-I${ROCM_PATH}/include -Wno-pass-failed" export LDFLAGS="-L${ROCM_PATH}/lib -lamdhip64" When combined, the modules and flags needed to enable both HIP and rocProf on Crusher become the following: .. code-block:: sh module load PrgEnv-cray module load craype-accel-amd-gfx90a module load cmake/3.22.2 module load rocm/5.2.0 module load cray-mpich # we want conduit to use this module load cray-hdf5-parallel/1.12.1.1 # GPU-aware MPI export MPICH_GPU_SUPPORT_ENABLED=1 export AMREX_AMD_ARCH=gfx90a export CC=$(which cc) export CXX="$(which CC) -x hip" export FC=$(which ftn) export CFLAGS="-I${ROCM_PATH}/include" export CXXFLAGS="-I${ROCM_PATH}/include -ggdb -O3 -std=c++17 -Wall -Wno-pass-failed" export LD=${CC} export LDFLAGS="${CXXFLAGS} -L${ROCM_PATH}/lib -lamdhip64" export LIBS="-lamdhip64" Running on the GPU ^^^^^^^^^^^^^^^^^^ This section will demonstrate submitting a job on Crusher via slurm and a batch script. The ```job.sh``` batch script will define submissions options, load necessary modules, and launch the parallel job. Below is an example ```job.sh``` that launches the Kripke simulation, one of the example integrations provided by Ascent. .. code-block:: sh #!/bin/bash #SBATCH -A csc340 #SBATCH -t 00:10:00 #SBATCH -N 1 #SBATCH -J kripke_gpu #SBATCH -o kripke.output #SBATCH -e kripke.error module load PrgEnv-cray module load craype-accel-amd-gfx90a module load rocm/5.2.0 module load cmake/3.22.2 module load cray-mpich module load cray-hdf5-parallel/1.12.1.1 srun -n 6 --ntasks-per-node 6 -G 6 --gpus-per-node 6 ./kripke_par --procs 1,1,6 --zones 60,60,60 --niter 3 --dir 1:2 --grp 1:1 --legendre 4 --quad 4:4 In the above example, the ``#SBATCH`` options are: =========================== ============================================================================================== Option Description =========================== ============================================================================================== -A Account to charge submitted job -t Requested walltime for submitted job -N Number of nodes -J Job Name -o stdout file name -e stderr file name =========================== ============================================================================================== The parallel job is launched using ``srun`` with the follow options: =========================== ============================================================================================== Option Description =========================== ============================================================================================== -n Number of MPI ranks --ntasks-per-node Number of MPI ranks per node -G Number of GPUs --gpus-per-node Number of GPUs per node =========================== ============================================================================================== For more information on running jobs on Crusher, see their user guide: `https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html#running-jobs`. And for more information about running Kripke, see :ref:`kripke_integration`. Profiling with RocProf ^^^^^^^^^^^^^^^^^^^^^^ You can profile your job using rocProf by adding the necessary options to the ``job.sh`` batch script, following the ``srun`` launch command (and respective options). One way to confirm your job is executing on the GPU, is to use rocProf with the --hip-trace flag, this will generate a file that will show the HIP function calls and kernels. But an easy way to verify your job is NOT running on the GPU is if rocProf produces no output files. Below is an example of using rocProf that will generate an output file for each MPI rank. Note: this could result in several GBs of data depending on the size of your job since there are typically 6 files generated per MPI rank. **Known Issue:** Currently, rocprof on Crusher is failing to produce output files for non-root MPI ranks. .. code-block:: sh #!/bin/bash #SBATCH -A csc340 #SBATCH -t 00:10:00 #SBATCH -N 1 #SBATCH -J kripke_gpu #SBATCH -o kripke.output #SBATCH -e kripke.error module load PrgEnv-cray module load craype-accel-amd-gfx90a module load rocm/5.2.0 module load cmake/3.22.2 module load cray-mpich module load cray-hdf5-parallel/1.12.1.1 srun -n 6 --ntasks-per-node 6 -G 6 --gpus-per-node 6 rocprof -o ${SLURM_JOBID}-${SLURM_PROCIDd}.csv --hip-trace ./kripke_par --procs 1,1,6 --zones 60,60,60 --niter 3 --dir 1:2 --grp 1:1 --legendre 4 --quad 4:4 Helpful RocProf Flags ^^^^^^^^^^^^^^^^^^^^^ **Help Flag:** =========================== ============================================================================================== Option Description =========================== ============================================================================================== /rocm/bin/rocprof -h Display command line options =========================== ============================================================================================== **Housekeeping Flags:** =========================== ==================================================== ================================================ Option Description Default =========================== ==================================================== ================================================ --timestamp Turn on/off gpu kernel dispatch timestamps **OFF** --basenames Turn on/off truncating gpu kernel names such **OFF** as template parameters and arguments types -o Save direct counter data to a specified file name .csv -d Send profiling data to a specified directory /tmp -t Change the directory where tmp files are created /tmp allowing you to save these files =========================== ==================================================== ================================================ **Profiling Flags:** =========================== ============================================================================================== Option Description =========================== ============================================================================================== -i input<.txt|.html> Specify an input file (note: all output files will be named input.\*) --stats Generate a file of kernel execution stats called .stats.csv --hsa-trace Trace GPU kernels, host HSA events, and HIP memory copies. Generates JSON file that is compatible with chrome-tracing --hip-trace Trace HIP API calls. Generates JSON file that is compatible with chrome-tracing =========================== ==============================================================================================