Profiling
Introduction
In software engineering, profiling is a form of dynamic program analysis that measures, for example, the space (memory) or time complexity of a program, the usage of particular instructions, or the frequency and duration of function calls. Most commonly, profiling information serves to aid program optimization.
Profiling is achieved by instrumenting either the program's source code or its binary executable with a tool called a profiler (or code profiler). Profilers may use a number of different techniques, such as event-based, statistical, instrumentation-based, and simulation methods.
Profilers use a wide variety of techniques to collect data, including hardware interrupts, code instrumentation, instruction set simulation, operating system hooks, and performance counters. Profilers are commonly used in the performance engineering process.
Timers in MPTRAC
A set of timers has been directly implemented in the trac
tool. In
addition to the runtime for different parts of the code, estimates of
memory usage and problem size are also reported.
An example of output reported by the trac
tool:
SIZE_NP = 10000
SIZE_MPI_TASKS = 1
SIZE_OMP_THREADS = 4
MEMORY_ATM = 1449.58 MByte
MEMORY_CACHE = 648.499 MByte
MEMORY_CLIM = 15.3182 MByte
MEMORY_METEO = 7408.74 MByte
TIMER_ALLOC = 0.000 s (min= 6.9767e-05 s, mean= 6.9767e-05 s, max= 6.9767e-05 s, n= 1)
TIMER_READ_CTL = 0.007 s (min= 0.00717609 s, mean= 0.00717609 s, max= 0.00717609 s, n= 1)
TIMER_READ_CLIM = 0.025 s (min= 0.0254529 s, mean= 0.0254529 s, max= 0.0254529 s, n= 1)
TIMER_READ_ATM = 0.028 s (min= 0.0275215 s, mean= 0.0275215 s, max= 0.0275215 s, n= 1)
TIMER_MODULE_TIMESTEPS_INIT = 0.002 s (min= 0.002085 s, mean= 0.002085 s, max= 0.002085 s, n= 1)
TIMER_GET_MET = 0.026 s (min= 1.546e-06 s, mean= 1.76945e-05 s, max= 0.00997372 s, n= 1444)
TIMER_READ_MET_GRID = 0.003 s (min= 0.000771007 s, mean= 0.000873196 s, max= 0.000986955 s, n= 4)
TIMER_READ_MET_SURFACE = 0.219 s (min= 0.0519605 s, mean= 0.0547709 s, max= 0.0580699 s, n= 4)
TIMER_READ_MET_LEVELS = 5.490 s (min= 1.30604 s, mean= 1.37255 s, max= 1.39654 s, n= 4)
TIMER_READ_MET_EXTRAPOLATE = 0.253 s (min= 0.0239204 s, mean= 0.06324 s, max= 0.103068 s, n= 4)
TIMER_READ_MET_POLAR_WINDS = 0.006 s (min= 0.00125896 s, mean= 0.00153137 s, max= 0.00172609 s, n= 4)
TIMER_READ_MET_PERIODIC = 0.002 s (min= 0.000341735 s, mean= 0.000458583 s, max= 0.000582423 s, n= 4)
TIMER_READ_MET_GEOPOT = 1.594 s (min= 0.394668 s, mean= 0.398482 s, max= 0.402873 s, n= 4)
TIMER_READ_MET_PV = 0.238 s (min= 0.0565131 s, mean= 0.0594941 s, max= 0.0644101 s, n= 4)
TIMER_READ_MET_PBL = 0.122 s (min= 0.0269698 s, mean= 0.0306072 s, max= 0.0331155 s, n= 4)
TIMER_READ_MET_TROPO = 0.737 s (min= 0.183587 s, mean= 0.184353 s, max= 0.18602 s, n= 4)
TIMER_READ_MET_CLOUD = 0.085 s (min= 0.0204375 s, mean= 0.0213186 s, max= 0.0221769 s, n= 4)
TIMER_READ_MET_CAPE = 2.116 s (min= 0.517294 s, mean= 0.528923 s, max= 0.549389 s, n= 4)
TIMER_READ_MET_OZONE = 0.122 s (min= 0.020632 s, mean= 0.0305517 s, max= 0.041007 s, n= 4)
TIMER_MODULE_CHEM_INIT = 0.000 s (min= 0.000247314 s, mean= 0.000247314 s, max= 0.000247314 s, n= 1)
TIMER_MODULE_TIMESTEPS = 0.302 s (min= 2.6811e-05 s, mean= 0.000209307 s, max= 0.000455049 s, n= 1441)
TIMER_MODULE_POSITION = 0.629 s (min= 5.9016e-05 s, mean= 0.000218254 s, max= 0.000583772 s, n= 2882)
TIMER_MODULE_ADVECT = 1.685 s (min= 0.000162452 s, mean= 0.00116961 s, max= 0.00311676 s, n= 1441)
TIMER_MODULE_DIFF_TURB = 1.742 s (min= 0.000707676 s, mean= 0.00120864 s, max= 0.00243357 s, n= 1441)
TIMER_MODULE_DIFF_MESO = 1.544 s (min= 0.000488041 s, mean= 0.00107166 s, max= 0.00232684 s, n= 1441)
TIMER_MODULE_CONVECTION = 0.962 s (min= 0.000295857 s, mean= 0.000667748 s, max= 0.00142307 s, n= 1441)
TIMER_MODULE_METEO = 0.014 s (min= 0.00228027 s, mean= 0.0035394 s, max= 0.00568389 s, n= 4)
TIMER_MODULE_BOUND_COND = 0.916 s (min= 4.1641e-05 s, mean= 0.000317876 s, max= 0.000974128 s, n= 2882)
TIMER_MODULE_DECAY = 0.427 s (min= 0.000157237 s, mean= 0.000296458 s, max= 0.000618609 s, n= 1441)
TIMER_MODULE_MIXING = 2.408 s (min= 0.0298906 s, mean= 0.0329851 s, max= 0.0553491 s, n= 73)
TIMER_MODULE_CHEM_GRID = 1.744 s (min= 0.000420581 s, mean= 0.00121059 s, max= 0.00318095 s, n= 1441)
TIMER_MODULE_OH_CHEM = 1.873 s (min= 0.000114869 s, mean= 0.00129959 s, max= 0.00288309 s, n= 1441)
TIMER_MODULE_H2O2_CHEM = 0.776 s (min= 5.8315e-05 s, mean= 0.00053884 s, max= 0.00215969 s, n= 1441)
TIMER_MODULE_TRACER_CHEM = 2.582 s (min= 1.9956e-05 s, mean= 0.00179167 s, max= 0.0040358 s, n= 1441)
TIMER_MODULE_WET_DEPO = 0.532 s (min= 1.9342e-05 s, mean= 0.000369044 s, max= 0.000838083 s, n= 1441)
TIMER_MODULE_DRY_DEPO = 0.453 s (min= 2.7147e-05 s, mean= 0.000314659 s, max= 0.000882621 s, n= 1441)
TIMER_WRITE_ATM = 0.192 s (min= 0.0456917 s, mean= 0.0480724 s, max= 0.049666 s, n= 4)
TIMER_WRITE_GRID = 0.328 s (min= 0.0776646 s, mean= 0.0819251 s, max= 0.0879495 s, n= 4)
TIMER_WRITE_CSI = 1.025 s (min= 0.000456732 s, mean= 0.000711579 s, max= 0.00215784 s, n= 1441)
TIMER_WRITE_ENS = 0.003 s (min= 0.000652666 s, mean= 0.000704509 s, max= 0.000817902 s, n= 4)
TIMER_WRITE_PROF = 5.827 s (min= 0.000380689 s, mean= 0.0040439 s, max= 0.0294752 s, n= 1441)
TIMER_WRITE_SAMPLE = 0.024 s (min= 9.81003e-07 s, mean= 1.65024e-05 s, max= 0.0191332 s, n= 1441)
TIMER_WRITE_STATION = 0.620 s (min= 0.000347288 s, mean= 0.000430191 s, max= 0.00285026 s, n= 1441)
TIMER_WRITE_VTK = 0.152 s (min= 0.0374231 s, mean= 0.0378852 s, max= 0.0385151 s, n= 4)
TIMER_FREE = 0.098 s (min= 0.0981576 s, mean= 0.0981576 s, max= 0.0981576 s, n= 1)
TIMER_GROUP_MEMORY = 0.098 s
TIMER_GROUP_INPUT = 5.798 s
TIMER_GROUP_PHYSICS = 18.593 s
TIMER_GROUP_METPROC = 5.276 s
TIMER_GROUP_OUTPUT = 8.171 s
TIMER_TOTAL = 37.936 s
SIZE_NP
refers to the number of particles used in the
simulation. SIZE_MPI_TASKS
refers to the number of MPI
tasks. SIZE_OMP_THREADS
refers to the number of OpenMP threads per
MPI task. SIZE_ACC_DEVICES
refers to the number of GPU devices per
MPI task.
The memory requirements have been derived by analyzing the data structures used in the code and are not based on measurements.
Some timers are zero or not shown because the corresponding modules have not been used in this run.
A simple optimization of memory requirements (and runtime!) is to
adjust the constants EX
, EY
, and EP
, defined in mptrac.h
, to
match the grid dimensions of the meteorological data set and the
constant NP
to match the number of particles used in the simulation.
Profiling of CPU runs
CPU profiling of MPTRAC is enabled using gprof
. Set PROF = 1
in
the Makefile, recompile the code, and run a test case.
Runtime information will be collected in the file gmon.out
in the
working directory. Use gprof [binary]
to see the runtime profile.
Example output for the trac
tool:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
27.09 2.85 2.85 21071464 0.00 0.00 intpol_met_space_3d
15.49 4.48 1.63 96 16.98 17.61 write_output
8.27 5.35 0.87 7770 0.11 0.68 frame_dummy
7.13 6.10 0.75 11636847 0.00 0.00 intpol_met_time_3d
5.04 6.63 0.53 gomp_team_barrier_wait_end
4.56 7.11 0.48 cspline_eval
3.33 7.46 0.35 2357631 0.00 0.00 clim_tropo
3.33 7.81 0.35 __cos_avx
3.04 8.13 0.32 __printf_fp_l
2.57 8.40 0.27 gsl_ran_gaussian_ziggurat
2.28 8.64 0.24 3977062 0.00 0.00 intpol_met_space_2d
2.28 8.88 0.24 3 80.00 80.00 spline
2.09 9.10 0.22 __mpn_divrem
1.71 9.28 0.18 gomp_barrier_wait_end
1.62 9.45 0.17 7475552 0.00 0.00 locate_irr
1.05 9.56 0.11 2658144 0.00 0.00 intpol_met_time_2d
...
Profiling of GPU runs
GPU profiling is enabled by means of NVIDIA Nsight Systems. To see
NVTX markers in the timeline, set NVTX = 1
in the Makefile before
compilation.
Here is an example of how to enable Nsight Systems profiling in a job script:
#### Nsight Systems notes
# For MPI profiling add "mpi" to the trace list:
# --trace=nvtx,osrt,openacc,mpi
# to start profiling after x sec add "--delay x":
# nsys profile -o wodiff --delay 5 ... starts after 5 sec
# To limit the profiling to x sec add "--duration x":
# nsys profile -o wodiff --delay 7 ... limits the profiling to 7 sec
# Calculate trajectories...
nsys profile -f true -o wodiff --trace=nvtx,osrt,openacc --stats=true \
$trac/trac data/dirlist trac.ctl atm_split.tab meteo/ei ATM_BASENAME atm | tee data/log_trac.txt
After the execution, you can visualize .qdrep
-files in
nsys-ui
. Refer to the Nsight Systems manual for other useful
options.