

4.5.1 OpenMP settings

The OpenMP specifications and many related links can be found on the OpenMP.org homepage. Overviews are given in ``Shared Memory Programming With OpenMP'' (M. D. Jones 2013) and ``Intel Parallel Studio XE. Facing the Multicore-Challenge II'' (Hans Pabst 2011). A complete online tutorial is ``Parallel Computing and OpenMP Tutorial'' (Shao-Ching Huang 2013).

To activate OpenMP in the CO5BOLD executable, set the corresponding environment variable (see Sect. 4.3.1.5) before calling the configure script, e.g.
export F90_PARALLEL=openmp
./configure
make
This inserts the corresponding compiler switch (e.g. -omp, -mp, ...; see the following sections) into the compiler calls in the Makefile (see Sect. 4.3). The calls to the timing routines that would be executed in parallel are removed by (not) setting the appropriate compiler macros (see Sect. 4.4). In addition, the switch rhd_shortrad_dir_l02 (see Sect. 4.4.7.16) might be set, based on experience with its effect on performance.

The user has to find optimum values for the parameters n_hydcellsperchunk (for the Roe and the HLLMHD solver modules, see Sect. 7.1.8.9) and n_viscellsperchunk (for the tensor-viscosity module, see Sect. 7.1.11.17) to optimize the size of the chunk given to one thread at a time. For the Roe solver of the hydrodynamics module, there are also the optional parameters n_hydcellsperchunk2 (see Sect. 7.1.9.1) and n_hydcellsperchunk3 (see Sect. 7.1.9.2).

For several modules, the environment variable OMP_SCHEDULE can be set (before running CO5BOLD) to control the OpenMP scheduling behavior. Important parallel loops in the SHORTrad module have a SCHEDULE(RUNTIME) modifier that allows this external control. The old default is obtained by leaving the variable undefined or by setting
export OMP_SCHEDULE="STATIC,1"
On some machines (e.g. an older Intel Xeon with Linux and the PGI compiler), dynamic scheduling activated with
export OMP_SCHEDULE="DYNAMIC,1"

is advantageous. The chunk size may be set to values larger than 1 (the value used in the examples above). The optimal value has to be found empirically. A good starting point is number_of_grid_points_in_1D/number_of_threads, which gives for a model with $171^3$ grid points on a 4-processor machine (171/4 = 42.75, rounded to 43)

export OMP_NUM_THREADS=4
export OMP_SCHEDULE="STATIC,43"
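
The starting value above can also be computed in the shell rather than by hand. The following is a minimal sketch, not part of the CO5BOLD distribution: the variable names n_grid, n_threads, and chunk are hypothetical, and rounding up (ceiling division) is an assumption chosen here so that the chunks cover all grid points in one dimension.

```shell
# Hypothetical helper variables (not CO5BOLD parameters)
n_grid=171      # grid points in one dimension
n_threads=4     # number of threads to use

# Ceiling division with integer arithmetic: (171 + 4 - 1) / 4 = 43
chunk=$(( (n_grid + n_threads - 1) / n_threads ))

export OMP_NUM_THREADS=$n_threads
export OMP_SCHEDULE="STATIC,$chunk"
echo "OMP_SCHEDULE=$OMP_SCHEDULE"
```

For the values shown, this reproduces the setting STATIC,43 given above.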

However, usually the general default

export OMP_SCHEDULE="STATIC"

is a good choice. The number of threads should equal the number of available processors and has to be set at run-time with the environment variable OMP_NUM_THREADS, e.g. with

export OMP_NUM_THREADS=16
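
Instead of hard-coding the thread count, it can be taken from the number of processors the system reports. This is only a sketch under assumptions: it relies on getconf _NPROCESSORS_ONLN, which is available on Linux and macOS but is not guaranteed everywhere, and the variable name n_cpu is hypothetical. The reported number counts logical processors, so on machines with hyper-threading it may exceed the number of physical cores.

```shell
# Ask the system for the number of online (logical) processors;
# fall back to 1 if getconf does not know this variable.
n_cpu=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

export OMP_NUM_THREADS=$n_cpu
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```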

The size of the stack per thread can be set with OMP_STACKSIZE, e.g.

export OMP_STACKSIZE=300M

Usually, the default value is too small. On machines with many cores, experimenting with KMP_AFFINITY (a variable of the Intel OpenMP runtime) may improve performance, e.g.

export KMP_AFFINITY=verbose,granularity=core,compact
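
The empirical search for a good OMP_SCHEDULE value mentioned above can be sketched as a simple timing loop. This is an illustration under assumptions, not a script from the CO5BOLD distribution: RUN_CMD and run_cmd are hypothetical placeholders for the command that starts a short test run (the default `true` does nothing), the list of schedules is just the set of examples from this section, and date +%s (seconds since the epoch) is assumed to be supported by the local date command.

```shell
# Hypothetical placeholder: set RUN_CMD to the command that starts a short test run.
run_cmd=${RUN_CMD:-true}

# Try each candidate schedule and report the wall-clock time of the run.
for sched in "STATIC" "STATIC,1" "STATIC,43" "DYNAMIC,1"; do
    export OMP_SCHEDULE="$sched"
    start=$(date +%s)
    $run_cmd
    end=$(date +%s)
    echo "OMP_SCHEDULE=$sched: $(( end - start )) s"
done
```

Timings with one-second resolution are only meaningful for runs that last considerably longer than a second, and repeated runs help to separate real differences from noise.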

