Next: 4.6.9 Linux: PathScale compiler Up: 4.6 Specific machines & Previous: 4.6.7 Linux: PGI compiler   Contents   Index


4.6.8 Linux: Intel compiler

The compiler is called with ifort (ifc on older compiler versions).
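
Since the compiler name depends on the version, a build script may want to check which of the two is installed. A minimal sketch, assuming a POSIX shell; the helper name pick_compiler is hypothetical and not part of CO5BOLD, and whether ifc is an accepted value for F90_COMPILER is not confirmed here:

```shell
# Hypothetical helper (not part of CO5BOLD): print the first command
# from the argument list that is found on the PATH.
pick_compiler() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    echo "no compiler found among: $*" >&2
    return 1
}
```

Calling pick_compiler ifort ifc prefers the modern name and falls back to the old one.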

With versions 7.0 and 7.1 of the Intel compiler (version 15.0 here) CO5BOLD compiled (with tricks, see below). Version 8.0 still caused trouble. With version 9.1 and later everything compiles, apart from occasional glitches with certain versions.

The native binary format on Intel machines is little_endian. With

export F_UFMTENDIAN=big
(to be set at runtime, i.e. after compilation but before running CO5BOLD) the default can be changed to big_endian. Sect. 4.4.2 lists the preprocessor switches that control the modern, single version of uio_mac_module.F90. Important switches are:

On Macintosh machines the typical optimization flags are -O3 -no-prec-div -fno-alias -ip. A big problem is the tiny stack size on those machines, so large arrays taken from the stack should be avoided. For the SHORTrad module, this can be achieved by setting -Drhd_arrays_l01=2 during compilation. In addition, relatively small chunk sizes should be specified in rhd.par, see Sect. 7.1.8.9 and Sect. 7.1.11.17.
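
The stack limit itself can be inspected from the shell before a run. A minimal sketch using the standard ulimit builtin; the values are reported in kB (or as "unlimited"), and the soft limit can only be raised up to the hard limit set by the system:

```shell
# Print the current soft and hard stack limits (kB, or "unlimited").
echo "soft stack limit: $(ulimit -S -s)"
echo "hard stack limit: $(ulimit -H -s)"

# Try to raise the soft limit to the hard limit for this shell and
# the programs started from it; report if the system refuses.
if ! ulimit -S -s "$(ulimit -H -s)" 2>/dev/null; then
    echo "could not raise the stack limit" >&2
fi
```

If the hard limit is still small, raising the soft limit does not help, and the compile-time switch and chunk sizes mentioned above remain the way to go.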

With Intel compiler versions before 9.1, there was a problem with the UIO modules when OpenMP was activated. This was odd, because the UIO modules do not contain any OpenMP directives. However, it also means that OpenMP can be safely deactivated for these modules. A proposed compiling sequence was:

export F90_COMPILER=ifort
export F90_MSRAD=1

export F90_PARALLEL=scalar
./configure
make UIO

export F90_PARALLEL=openmp
./configure
make

With more recent compiler versions, this is much simpler. A realistic example (with several modules activated and an explicit choice of optimization flags) might look like:

export F90_COMPILER=ifort
export F90_MSRAD=1
export F90_SHORTRAD=1
export F90_MHD=1
export F90_PARALLEL=openmp
export F90_OPTIMIZE="-ipo -O3 -xHost -static-intel -W0 -Drhd_box_arrays01=1"
export F90_POSTFLAGS="-Drhd_hyd_gravcorr_p01=6 -Dtiming_c_range=15 -Dtiming_r_type=7"

./configure -c
make
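
Because all settings are passed through exported variables, a value left over from an earlier build (for instance F90_PARALLEL=scalar from the older two-step sequence) is easy to overlook. A small sketch to review the settings before running ./configure; the function name show_f90_settings is hypothetical:

```shell
# Hypothetical helper: list all exported F90_* variables, sorted by
# name, so that stale settings are visible before ./configure is run.
show_f90_settings() {
    env | grep '^F90_' | sort
}
```

Leftover variables can then be removed with unset before configuring.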

For OpenMP (see Sect. 4.5.1), the number of threads can be set for instance with

export OMP_NUM_THREADS=16

for a machine with 16 threads (e.g., 2 processors, 4 cores per processor, 2 threads per core). With

export OMP_NUM_THREADS=$(grep -c "^processor" /proc/cpuinfo)

the number of OpenMP threads is determined from the number of logical processors. Experimenting with the scheduling, e.g. by setting

export OMP_SCHEDULE=DYNAMIC,1

or

export OMP_SCHEDULE=GUIDED,2

or (most often just)

export OMP_SCHEDULE=STATIC

might improve the performance (see Sect. 4.5.1). The two OpenMP variables above, OMP_NUM_THREADS and OMP_SCHEDULE, are recognized by several compilers. However, there are also Intel-specific ones.
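
Such experiments are easier to compare when the same short run is timed once per schedule. A minimal sketch, assuming a POSIX shell; the function name time_schedules is hypothetical, and the command to be timed (a short CO5BOLD test run) has to be supplied by the user. Note that OMP_SCHEDULE only affects loops declared with schedule(runtime).

```shell
# Hypothetical helper: run the given command once per schedule and
# print the wall-clock time (in whole seconds) of each run.
time_schedules() {
    for sched in STATIC DYNAMIC,1 GUIDED,2; do
        start=$(date +%s)
        # Set OMP_SCHEDULE for this run only, leaving the shell untouched.
        env OMP_SCHEDULE="$sched" "$@"
        end=$(date +%s)
        echo "OMP_SCHEDULE=$sched: $((end - start)) s"
    done
}
```

The helper is called with the command line of the test run as its arguments. With a resolution of one second, the run should last at least a few minutes for the differences to be meaningful.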

In some cases it was helpful to set

export LD_ASSUME_KERNEL=2.4.19

when encountering problems with OpenMP. However, that no longer seems to be necessary with recent compiler versions. Still, the stack memory per thread is often too small; it can be increased e.g. with

export KMP_STACKSIZE=300000000

or

export OMP_STACKSIZE=300M
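
The per-thread value multiplies with the thread count, so it is worth estimating the total before choosing it. A rough sketch with plain shell arithmetic; the function name total_stack_mb is hypothetical, and the result is an upper bound on the stack space reserved for the threads, not the memory actually used:

```shell
# Hypothetical helper: upper bound (in MB) of the stack space reserved
# for OpenMP threads, given a thread count and a per-thread size in MB.
total_stack_mb() {
    echo $(( $1 * $2 ))
}

# 16 threads with 300 MB each:
total_stack_mb 16 300    # prints 4800
```

These variables apply to the threads created by the OpenMP runtime; the stack of the initial thread is governed by the shell limit (ulimit -s) instead.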

To optimize the performance, particularly on many-core systems, the thread affinity (see "Intel Thread Affinity Interface") can be specified at runtime (i.e., after compilation but before running the code) e.g. with

export KMP_AFFINITY=verbose,granularity=core,compact
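
KMP_AFFINITY is specific to the Intel runtime. As a hedged sketch, a roughly comparable setting can be expressed with the standard OpenMP 4.0 variables; this is an approximation, not an exact equivalent of the line above, and it has not been tested with CO5BOLD:

```shell
# One place per physical core, threads packed onto neighbouring places.
export OMP_PLACES=cores
export OMP_PROC_BIND=close
# Let the OpenMP runtime print its settings at program start.
export OMP_DISPLAY_ENV=true
```

Whether the standard or the Intel-specific variable takes precedence when both are set depends on the runtime, so it is safer to set only one of the two.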

