4.6.8 Linux: Intel compiler


The compiler is called with ifort (ifc on older compiler versions).

With versions 7.0 and 7.1 of the Intel compiler (version 15.0 here), CO5BOLD compiled (with tricks, see below). Version 8.0 still caused trouble. With version 9.1 and later everything compiles, with occasional glitches for certain versions.

The native binary format on Intel machines is little_endian. With

export F_UFMTENDIAN=big

(to be set at runtime, i.e. after compilation and before running CO5BOLD) the default can be changed to big_endian. The preprocessor switches that control the modern, single version of uio_mac_module.F90 are listed in 4.4.2. Important switches are:

-fast: Optimization:
choose a general optimization that is (close to) optimal for the local machine. It actually activates a number of other optimization flags, which might change with the compiler version. This option is currently not recommended for CO5BOLD because the overly aggressive optimization causes runtime errors (the offending sub-flag appears to be -no-prec-div). Instead, the working sub-flags are specified individually.
-ip: Optimization:
activate interprocedural optimization within each source file. This enables some inlining.
-ipo: Optimization:
enable interprocedural optimization between files. The compiler makes a first pass with a pre-compilation and syntax check for each source file and then finishes the compilation and optimization of all source files together in a second, time-consuming step. It allows some global optimizations and appears to have no adverse effects, at least in all recent compiler versions.
-O3: Optimization:
activate a generally high level of optimization.
-xHost: Optimization for hardware:
optimize for the architecture of the compiling host. This flag is recommended, although there are a few issues. On AMD machines it might work less than perfectly, because the features of the processor might not be properly recognized: check it! Compiling with this option on a new machine with a recent set of vector instructions will likely generate code that will not run on older machines. In the rather uncommon case that the CO5BOLD executable is distributed, a combination of other flags has to be used to ensure that the code runs in an optimal way on a range of machines, see below and see the manual.
-tpp6 -xK: Optimization for hardware (old):
optimize especially for Pentium III (and Athlon, includes SSE vector commands).
-tpp7 -xW: Optimization for hardware (old):
optimize especially for Pentium IV (includes SSE2 vector commands).
-xP: Optimization for hardware (old):
optimize especially for Core 2 Duo and similar architectures.
-xSSE4.2: Optimization for hardware:
Optimize for machines with SSE 4.2 vector instructions.
-xAVX: Optimization for hardware:
Optimize for machines with AVX vector instructions. By now, there are several more advanced versions of this instruction set, each with its own flag.
-Drhd_box_arrays01=1: Optimization:
activate a faster way of handling arrays in structures (see Sect.4.4.1.4). That works e.g. in version 12.1 of the compiler but not in more recent ones.
-DMSrad_raytas=2: Optimization:
choose non-default version of loop in SUBROUTINE raytas in file MSrad3D.F90. See Sect.4.4.7.20.
-Drhd_shortrad_dir1_l01=1: Optimization:
Transpose arrays and use routine rhd_shortrad_dir3 for rays in x1 direction. See Sect.4.4.7.15.
-openmp: Parallelization:
activate OpenMP directives. Note that for compiler versions before 9.0 the UIO routines should be compiled without OpenMP support (even though they do not contain any OpenMP directives themselves).
-static: Linking:
prevent linking with shared libraries. This is usually part of -fast and does not seem to have any positive effect on the performance.
-static-intel: Linking:
link Intel-provided libraries statically. This option is recommended: the executable can then be safely transferred from the compile node, where the Intel libraries are available, to any of the compute nodes, where the Intel libraries should be available but are not.
-assume byterecl: I/O:
specify that the length of a record is measured in bytes (and not in words). That is necessary for the UIO routines.
-Duio_switch_system_l01=1 -Duio_switch_native_l01=1 -Duio_switch_ieeebe_l01=1 -Duio_switch_ieeele_l01=1 -Duio_switch_ieee_l01=1 -Duio_switch_open_l01=1: I/O:
activate the appropriate set of UIO switches.
-Dtiming_c_range=15 -Dtiming_r_type=7: General:
configure the timing routines.
-r8 -fpconstant: General:
force compilation in double precision (see 4.5.5). In general, single precision is sufficient. However, e.g. for very cool objects with small-Mach-number flows, the general activation of double-precision arithmetic is necessary.
-fpp: General:
activate the preprocessor.
-W0: General:
suppress warning messages.
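
For the distributed-executable case mentioned under -xHost, one possible combination is a fixed baseline instruction set plus an additional code path generated with the Intel -ax option. The following is only a sketch, assuming a compiler version that accepts -xSSE4.2 and -axAVX. The choice of baseline and alternate target is an illustration, not a tested CO5BOLD recommendation:

```shell
# Assumed example (not a tested CO5BOLD setting):
# SSE4.2 baseline that must be supported by every target machine,
# plus an additional AVX code path that is selected at runtime where available.
export F90_OPTIMIZE="-ipo -O3 -xSSE4.2 -axAVX -static-intel -W0"
```

The baseline given with -x has to be supported by the oldest machine on which the executable will run. Otherwise the same problem as with -xHost appears.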

On Macintosh machines the typical optimization flags are -O3 -no-prec-div -fno-alias -ip. A big problem is the tiny stack size on those machines: large arrays taken from the stack should be avoided. For the SHORTrad module, this can be achieved by setting -Drhd_arrays_l01=2 during compilation. In addition, relatively small chunk sizes should be specified in rhd.par, see Sect.7.1.8.9 and Sect.7.1.11.17.

With the Intel compiler before version 9.1 there was a problem with the UIO modules when OpenMP was activated. This was a bit weird because the UIO modules do not contain any OpenMP directives. However, it also means that OpenMP can be safely deactivated for these modules. A proposed compiling sequence was:

export F90_COMPILER=ifort
export F90_MSRAD=1
export F90_PARALLEL=scalar
./configure
make UIO
export F90_PARALLEL=openmp
./configure
make

With more recent compiler versions, this is much simpler. A realistic example (with several modules activated and with an explicit choice of optimization flags) might look like:

export F90_COMPILER=ifort
export F90_MSRAD=1
export F90_SHORTRAD=1
export F90_MHD=1
export F90_PARALLEL=openmp
export F90_OPTIMIZE="-ipo -O3 -xHost -static-intel -W0 -Drhd_box_arrays01=1"
export F90_POSTFLAGS="-Drhd_hyd_gravcorr_p01=6 -Dtiming_c_range=15 -Dtiming_r_type=7"
./configure -c
make

For OpenMP (see Sect.4.5.1), the number of threads can be set for instance with

export OMP_NUM_THREADS=16

for a machine with 16 threads (e.g., 2 processors, 4 cores per processor, 2 threads per core). With

export OMP_NUM_THREADS=`cat /proc/cpuinfo | grep "processor.*:" | wc -l`

the number of OpenMP threads is determined from the number of - logical - processors. Experimenting with the scheduling, e.g. by setting

export OMP_SCHEDULE=DYNAMIC,1

export OMP_SCHEDULE=GUIDED,2

or (most often just)

export OMP_SCHEDULE=STATIC

might improve the performance (see Sect.4.5.1). The last two OpenMP variables are recognized by several compilers. However, there are also Intel-specific ones.
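
The thread count can also be determined with a single grep call instead of a pipeline. The following is a minimal sketch, assuming a Linux-style /proc/cpuinfo in which every logical processor has one line starting with "processor". The function name count_logical_cpus is hypothetical and not part of CO5BOLD:

```shell
# Hypothetical helper (not part of CO5BOLD): count the logical processors
# listed in a cpuinfo-style file; the file defaults to /proc/cpuinfo.
count_logical_cpus() {
    cpuinfo_file=${1:-/proc/cpuinfo}
    # Anchoring the pattern at the start of the line avoids counting other
    # lines that merely contain the word "processor".
    grep -c '^processor[[:space:]]*:' "$cpuinfo_file"
}

# Set the thread count only if the file exists and lists at least one processor.
if [ -r /proc/cpuinfo ]; then
    n_logical=$(count_logical_cpus)
    if [ "$n_logical" -gt 0 ]; then
        export OMP_NUM_THREADS="$n_logical"
    fi
fi

# Static scheduling, the choice named above as the most common one.
export OMP_SCHEDULE=STATIC
```

On machines with several threads per core, the count includes all hardware threads. Whether using all of them pays off has to be tested for the model at hand.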

In some cases it was helpful to set

export LD_ASSUME_KERNEL=2.4.19

when encountering problems with OpenMP. However, that seems not to be necessary with recent compiler versions. Still, the stack memory per thread is often too small. It can be increased e.g. with

export KMP_STACKSIZE=300000000

export OMP_STACKSIZE=300M
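
Both variables can be set consistently from one value. The following is a small sketch, assuming that the suffix M in OMP_STACKSIZE denotes 1024*1024 bytes and that a plain number in KMP_STACKSIZE is read as bytes. The function name set_thread_stack_mb is hypothetical:

```shell
# Hypothetical helper (not part of CO5BOLD): set the per-thread stack size,
# given in megabytes, for both the standard and the Intel-specific variable.
set_thread_stack_mb() {
    stack_mb=$1
    # Standard OpenMP variable, with an explicit unit suffix.
    export OMP_STACKSIZE="${stack_mb}M"
    # Intel-specific variable, given here as a plain number of bytes
    # (assumption: 1 M = 1024*1024 bytes).
    export KMP_STACKSIZE=$((stack_mb * 1024 * 1024))
}

set_thread_stack_mb 300
```

Giving the unit explicitly in OMP_STACKSIZE avoids any ambiguity about how a bare number is interpreted.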

To optimize the performance, particularly on many-core systems, the thread affinity (see ``Intel Thread Affinity Interface'') can be specified at runtime (i.e., after compilation but before running the code) e.g. with

export KMP_AFFINITY=verbose,granularity=core,compact
