

3.7.11 Linux: Intel Compiler

CO5BOLD could be compiled with versions 7.0 and 7.1 of the Intel compiler (with tricks, see below). Version 8.0 still caused trouble. With version 9.1 and later, everything compiles smoothly.

The native format on Intel machines is little_endian. With

export F_UFMTENDIAN=big
(to be set at runtime, i.e., after compilation and before running CO5BOLD) the default can be changed to big_endian. The preprocessor switches that control the modern (single) version of uio_mac_module.F90 are listed in Sect. 3.6. The compiler is invoked as ifort (ifc for older compiler versions).
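As a minimal sketch of the runtime setting described above, the following lines report the byte order of the host and then export F_UFMTENDIAN. The byte-order check with printf and od is only an illustration and is not part of CO5BOLD.

```shell
# Write the two bytes 0x01 0x00 and read them back as one 16-bit integer:
# the result is 1 on a little-endian host and 256 on a big-endian host.
host_order=$(printf '\001\000' | od -An -tu2 | tr -d ' \n')
if [ "$host_order" = "1" ]; then
    echo "host is little-endian"
else
    echo "host is big-endian"
fi

# Ask the Intel Fortran runtime to read and write unformatted files
# as big-endian in every program started from this shell.
export F_UFMTENDIAN=big
echo "F_UFMTENDIAN=$F_UFMTENDIAN"
```

To restrict the setting to one run, the assignment can instead be prefixed to a single command line.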

Important switches are:

On Macintosh machines the typical optimization flags are -O3 -no-prec-div -fno-alias -ip. A big problem is the tiny stack size on those machines: large arrays taken from the stack should be avoided. For the SHORTrad module, this can be achieved by setting -Drhd_arrays_l01=2 during compilation. In addition, relatively small chunk sizes should be specified in rhd.par, see Sect. 5.4.7 and Sect. 5.4.8.
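Before a run it may help to check how small the stack limit of the current shell actually is. The sketch below uses only the standard ulimit builtin. Whether raising the soft limit is sufficient for a given model is an assumption that has to be tested, and the limit applies to the main thread only.

```shell
# Query the stack limits of the current shell (in kB, or "unlimited").
soft_limit=$(ulimit -S -s)
hard_limit=$(ulimit -H -s)
echo "soft stack limit: $soft_limit"
echo "hard stack limit: $hard_limit"

# Raise the soft limit to the hard limit for this shell and its children.
# This needs no special privileges, but it cannot exceed the hard limit.
if ulimit -S -s "$hard_limit" 2>/dev/null; then
    echo "soft stack limit now: $(ulimit -S -s)"
else
    echo "could not raise the soft stack limit"
fi
```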

With Intel compiler versions before 9.1, the UIO modules caused problems when OpenMP was activated. This is odd, because the UIO modules do not contain any OpenMP directives. However, it also means that OpenMP can safely be deactivated for these modules. A suggested compile sequence (with all modules activated) is:

export F90_LHDRAD=1
export F90_MSRAD=1
export F90_SHORTRAD=1
export F90_DUST=1
export F90_MHD=1

export F90_PARALLEL=scalar
./configure
make UIO

export F90_PARALLEL=openmp
./configure
make
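The sequence above could be wrapped in a shell function that stops at the first failing step, so that a failed scalar pass is not silently followed by the OpenMP pass. This is only a sketch: the function name build_two_pass is hypothetical, and the function has to be called from the directory that contains configure.

```shell
# Hypothetical helper: run the two-pass build and stop at the first failure.
build_two_pass() {
    # Activate all optional modules, as in the sequence above.
    export F90_LHDRAD=1 F90_MSRAD=1 F90_SHORTRAD=1 F90_DUST=1 F90_MHD=1

    # Pass 1: build only the UIO modules, without OpenMP.
    export F90_PARALLEL=scalar
    ./configure || return 1
    make UIO || return 1

    # Pass 2: reconfigure with OpenMP and build everything else.
    export F90_PARALLEL=openmp
    ./configure || return 1
    make || return 1
}
```

Calling build_two_pass returns a non-zero status as soon as configure or make fails.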

For OpenMP (see Sect. 3.7.1), the number of threads can be set for instance with

export OMP_NUM_THREADS=16

for a machine with 16 threads (e.g., 2 processors with 4 cores each and 2 threads per core). Experimenting with the scheduling, e.g., with

export OMP_SCHEDULE=DYNAMIC,1

or

export OMP_SCHEDULE=GUIDED,2

might improve the performance (see Sect. 3.7.1). The last two OpenMP variables (OMP_NUM_THREADS and OMP_SCHEDULE) are recognized by several compilers. However, there are also Intel-specific ones:

In some cases it might be helpful to set

export LD_ASSUME_KERNEL=2.4.19

when encountering problems with OpenMP. However, that seems not to be necessary with recent compiler versions. Still, the stack memory per thread is often too small. It can be increased, e.g., with

export KMP_STACKSIZE=300000000
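To get a feeling for the size of this setting, the value can be converted with shell arithmetic. The sketch assumes that a value without a unit suffix is interpreted as bytes. The thread count of 16 is taken from the OMP_NUM_THREADS example in this section.

```shell
# Per-thread stack size, assumed to be given in bytes.
stack_bytes=300000000

# Integer conversion to MiB (1 MiB = 1024 * 1024 bytes).
stack_mib=$((stack_bytes / 1024 / 1024))
echo "per-thread stack: ${stack_mib} MiB"

# Upper bound on the stack space reserved for all threads together.
threads=16
total_mib=$((stack_mib * threads))
echo "reserved for $threads threads: up to ${total_mib} MiB"

export KMP_STACKSIZE=$stack_bytes
```

The total is an address-space reservation rather than memory that is necessarily used, but it should still fit comfortably into the memory of the machine.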

To optimize the performance, particularly on many-core systems, the thread affinity (see "Intel Thread Affinity Interface") can be specified, e.g., with

export KMP_AFFINITY=verbose,granularity=core,compact
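Which affinity type works best depends on the machine, so it can be worth comparing compact with the alternative type scatter. The helper below is a hypothetical sketch that sets KMP_AFFINITY for a single command only. The sh -c calls are placeholders that just print the setting and would be replaced by the actual CO5BOLD run.

```shell
# Hypothetical helper: run one command with a given affinity type
# (for example compact or scatter) without changing the calling shell.
run_with_affinity() {
    affinity_type=$1
    shift
    KMP_AFFINITY="verbose,granularity=core,$affinity_type" "$@"
}

# Placeholder commands that only show what the child process sees.
run_with_affinity compact sh -c 'echo "KMP_AFFINITY=$KMP_AFFINITY"'
run_with_affinity scatter sh -c 'echo "KMP_AFFINITY=$KMP_AFFINITY"'
```

With verbose in the setting, the Intel OpenMP runtime reports the resulting thread binding at startup, which makes it easy to confirm what was applied.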

