4.6.8 Linux: Intel compiler
The compiler is called with ifort (ifc on older compiler versions).
With versions 7.0 and 7.1 of the Intel compiler (version 15.0 here), CO5BOLD compiled (with tricks, see below).
Version 8.0 still caused trouble.
With version 9.1 (and up) everything compiles, with occasional glitches for certain versions.
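Because the driver name depends on the compiler version, a build script can probe for it. The following is only a minimal sketch: the helper name pick_intel_compiler is hypothetical, and it is an assumption that the configure script accepts the value ifc for F90_COMPILER in the same way as ifort.

```shell
# Hypothetical helper: print the first Intel Fortran driver found in PATH.
pick_intel_compiler() {
    for candidate in ifort ifc; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Export the result for the configure step, or warn if nothing was found.
if F90_COMPILER=$(pick_intel_compiler); then
    export F90_COMPILER
    echo "using Intel compiler driver: $F90_COMPILER"
else
    echo "no Intel Fortran driver (ifort or ifc) found in PATH" >&2
fi
```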
The native binary format on Intel machines is little_endian.
With
export F_UFMTENDIAN=big
(to be set at runtime, i.e. after compilation and before running CO5BOLD)
the default can be changed to big_endian.
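To see which byte order the host itself uses before overriding the default, a small probe can be run in the shell. This is a sketch that relies only on the standard tools printf, od and tr; the variable names probe and host_endian are chosen here for illustration.

```shell
# The two bytes 01 00, read as one unsigned 16-bit integer, give 1 on a
# little-endian host and 256 on a big-endian host.
probe=$(printf '\001\000' | od -An -tu2 | tr -d ' \n')
case "$probe" in
    1)   host_endian=little ;;
    256) host_endian=big ;;
    *)   host_endian=unknown ;;
esac
echo "host byte order: $host_endian"

# Runtime setting from the text above: let the Intel runtime library
# treat unformatted files as big-endian.
export F_UFMTENDIAN=big
```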
Sect. 4.4.2 lists the preprocessor switches that control the modern, single version of uio_mac_module.F90.
Important switches are:
-fast
: Optimization:
choose a general optimization that is (close to) optimal for the local machine.
It actually activates a number of other optimization flags,
which might change with the compiler version.
This option is currently not recommended for CO5BOLD
because the overly aggressive optimization causes runtime errors
(the offending sub-flag appears to be -no-prec-div).
Instead, the working sub-flags are specified individually.
-ip
: Optimization:
activate interprocedural optimization within each source file.
This enables some inlining.
-ipo
: Optimization:
enable interprocedural optimization between files.
The compiler first performs a pre-compilation and syntax check
for each source file
and then finishes the compilation and optimization of all source files
together in a second, time-consuming step.
It allows some global optimizations and appears to have no adverse
effects in any recent compiler version.
-O3
: Optimization:
activate a generally high level of optimization.
-xHost
: Optimization for hardware:
optimize for the architecture of the compiling host.
This flag is recommended, although there are a few issues:
On AMD machines this might work less than perfectly,
because the features of the processors might not be properly recognized: check it!
Compiling with this option on a new machine with a recent set of vector instructions
will likely generate code that will not run on older machines.
In the rather uncommon case that the CO5BOLD executable is distributed,
a combination of other flags has to be used that ensures that
the code runs in an optimal way on a range of machines, see below
and see the manual.
-tpp6 -xK
: Optimization for hardware (old):
optimize especially for Pentium III (and Athlon, includes SSE vector commands).
-tpp7 -xW
: Optimization for hardware (old):
optimize especially for Pentium IV (includes SSE2 vector commands).
-xP
: Optimization for hardware (old):
optimize especially for Core 2 Duo and similar architectures.
-xSSE4.2
: Optimization for hardware:
Optimize for machines with SSE 4.2 vector instructions.
-xAVX
: Optimization for hardware:
Optimize for machines with AVX vector instructions.
By now, there are several - more advanced - versions of this.
-Drhd_box_arrays01=1
: Optimization:
activate a faster way of handling arrays in structures
(see Sect.4.4.1.4).
That works e.g. in version 12.1 of the compiler but not in more recent ones.
-DMSrad_raytas=2
: Optimization:
choose the non-default version
of the loop in SUBROUTINE raytas
in file MSrad3D.F90.
See Sect.4.4.7.20.
-Drhd_shortrad_dir1_l01=1
: Optimization:
Transpose arrays and use routine rhd_shortrad_dir3
for rays in x1 direction.
See Sect.4.4.7.15.
-openmp
: Parallelization:
activate OpenMP directives.
Note that for compiler versions before 9.0
the UIO routines should be compiled without OpenMP support (even if they do not contain
any OpenMP directives themselves).
-static
: Linking:
prevent linking with shared libraries.
This is usually part of -fast
and does not seem to have any positive effect
on the performance.
-static-intel
: Linking:
link Intel-provided libraries statically.
This option is recommended.
This has the nice effect that the executable now can be safely transferred
from the compile node - where the Intel libraries are available -
to any of the compute nodes - where the Intel libraries should be available, but aren't.
-assume byterecl
: I/O:
specify that the length of a record is measured in bytes (and not in words).
That is necessary for the UIO routines.
-Duio_switch_system_l01=1 -Duio_switch_native_l01=1 -Duio_switch_ieeebe_l01=1
-Duio_switch_ieeele_l01=1 -Duio_switch_ieee_l01=1 -Duio_switch_open_l01=1
: I/O:
activate the appropriate set of UIO switches.
-Dtiming_c_range=15 -Dtiming_r_type=7
: General:
configure the timing routines.
-r8 -fpconstant
: General:
force compilation in double precision (see 4.5.5).
In general, single precision is sufficient.
However, e.g. for very cool objects with small-Mach-number flows, the general activation
of double-precision arithmetic is necessary.
-fpp
: General:
activate the preprocessor.
-W0
: General:
suppress warning messages.
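When an executable built on one machine has to run on another, it helps to know which of the hardware options listed above the target processor supports. The sketch below maps the feature list from the "flags" line of /proc/cpuinfo to one of the -x options from this section. The helper name pick_simd_flag is hypothetical, the mapping covers only -xAVX and -xSSE4.2, and it makes no statement about how well the resulting code runs on non-Intel processors.

```shell
# Hypothetical helper: map a space-separated CPU feature list to one of the
# -x options described in this section (empty output if neither applies).
pick_simd_flag() {
    cpu_flags=" $1 "          # pad so that every feature is space-delimited
    case "$cpu_flags" in
        *" avx "*)    echo "-xAVX" ;;
        *" sse4_2 "*) echo "-xSSE4.2" ;;
        *)            echo "" ;;
    esac
}

# Feature list of the first logical processor of this machine (Linux only;
# the result is empty if /proc/cpuinfo is not available).
host_flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)
echo "suggested hardware flag: $(pick_simd_flag "$host_flags")"
```

Running the probe on the oldest machine that has to execute the binary gives a conservative choice for all of them.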
On Macintosh machines the typical optimization flags are
-O3 -no-prec-div -fno-alias -ip.
A big problem is the tiny stack size on those machines:
large arrays taken from the stack should be avoided.
For the SHORTrad module, this can be achieved by setting
-Drhd_arrays_l01=2
during compilation.
In addition, relatively small chunk sizes should be specified in rhd.par,
see Sect.7.1.8.9 and Sect.7.1.11.17.
With the Intel compiler (before version 9.1)
there was a problem with the UIO modules when OpenMP was activated.
This was a bit weird because the UIO modules do not contain any OpenMP directives.
However, it also means that OpenMP can be safely deactivated for these modules.
A proposed compiling sequence was:
export F90_COMPILER=ifort
export F90_MSRAD=1
export F90_PARALLEL=scalar
./configure
make UIO
export F90_PARALLEL=openmp
./configure
make
With more recent compiler versions, this is much simpler.
A realistic example
(with several modules activated and an explicit choice of optimization flags)
might look like:
export F90_COMPILER=ifort
export F90_MSRAD=1
export F90_SHORTRAD=1
export F90_MHD=1
export F90_PARALLEL=openmp
export F90_OPTIMIZE="-ipo -O3 -xHost -static-intel -W0 -Drhd_box_arrays01=1"
export F90_POSTFLAGS="-Drhd_hyd_gravcorr_p01=6 -Dtiming_c_range=15 -Dtiming_r_type=7"
./configure -c
make
For OpenMP (see Sect.4.5.1),
the number of threads can be set for instance with
export OMP_NUM_THREADS=16
for a machine with 16 threads
(e.g., 2 processors, 4 cores per processor, 2 threads per core).
With
export OMP_NUM_THREADS=`cat /proc/cpuinfo | grep "processor.*:" | wc -l`
the number of OpenMP threads is determined from the number of - logical - processors.
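A slightly more defensive variant of this thread count can be sketched as follows. The helper name count_logical_cpus is hypothetical. It anchors the pattern at the start of the line, so that the word "processor" inside other fields is not counted, and it falls back to getconf on systems without /proc/cpuinfo.

```shell
# Hypothetical helper: number of logical processors visible to the system.
count_logical_cpus() {
    n=0
    if [ -r /proc/cpuinfo ]; then
        # One "processor : N" line per logical processor on Linux.
        n=$(grep -c '^processor[[:space:]]*:' /proc/cpuinfo)
    fi
    if [ "${n:-0}" -lt 1 ]; then
        # Fallback for systems where the count above is unavailable.
        n=$(getconf _NPROCESSORS_ONLN)
    fi
    echo "$n"
}

export OMP_NUM_THREADS=$(count_logical_cpus)
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```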
Experimenting with the scheduling,
e.g. by setting
export OMP_SCHEDULE=DYNAMIC,1
or
export OMP_SCHEDULE=GUIDED,2
or (most often just)
export OMP_SCHEDULE=STATIC
might improve the performance (see Sect.4.5.1).
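Such an experiment can be scripted. The sketch below runs a given command once for each of the three schedules named above and prints the elapsed wall-clock time in whole seconds. The function name time_schedules and the executable name in the usage comment are hypothetical. Note that, by the OpenMP specification, OMP_SCHEDULE only affects loops declared with schedule(runtime).

```shell
# Hypothetical helper: run the given command once per schedule and report
# the elapsed wall-clock time. The program output is discarded here.
time_schedules() {
    for sched in STATIC DYNAMIC,1 GUIDED,2; do
        start=$(date +%s)
        OMP_SCHEDULE=$sched "$@" >/dev/null 2>&1
        end=$(date +%s)
        echo "OMP_SCHEDULE=$sched: $((end - start)) s"
    done
}

# Usage (the executable name is only a placeholder):
#   time_schedules ./my_co5bold_executable
```

For meaningful numbers the test run should last considerably longer than the one-second resolution of the timer.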
The OpenMP variables OMP_NUM_THREADS and OMP_SCHEDULE are recognized by several compilers. However, there
are also Intel-specific ones.
In some cases it was helpful to set
export LD_ASSUME_KERNEL=2.4.19
when encountering problems with OpenMP. However, that seems not to be necessary
with recent compiler versions.
Still, often the stack memory per thread is too small,
which can be increased e.g. with
export KMP_STACKSIZE=300000000
or
export OMP_STACKSIZE=300M
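These variables set the stack size of the threads created by the OpenMP runtime. The stack of the initial (main) thread is governed by the shell limit instead, so it can be worth checking both. The following sketch raises the soft shell limit to the hard limit and then sets the per-thread value from above. Whether the shell limit needs raising at all depends on the machine and is an assumption here.

```shell
# Stack of the initial (main) thread: raise the soft limit to the hard limit.
hard_limit=$(ulimit -H -s)
ulimit -S -s "$hard_limit" || echo "could not raise the soft stack limit" >&2
echo "main-thread stack limit: $(ulimit -S -s) (units of 1024 bytes, or unlimited)"

# Stack of each thread created by the OpenMP runtime (value from the text).
export OMP_STACKSIZE=300M
echo "OMP_STACKSIZE=$OMP_STACKSIZE"
```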
To optimize the performance, particularly on many-core systems,
the thread affinity
(see ``Intel Thread Affinity Interface'')
can be specified
at runtime (i.e., after compilation but before running the code)
e.g. with
export KMP_AFFINITY=verbose,granularity=core,compact
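KMP_AFFINITY is specific to the Intel runtime. The OpenMP standard (version 4.0 and later) defines OMP_PLACES and OMP_PROC_BIND, which express a roughly comparable intent. The sketch below offers both as alternatives. The function name set_thread_affinity is hypothetical, and the "portable" branch is an approximation of the Intel setting, not an exact equivalent.

```shell
# Hypothetical helper: choose one of two ways to pin OpenMP threads.
set_thread_affinity() {
    case "$1" in
        intel)
            # Intel-specific setting from the text above.
            export KMP_AFFINITY=verbose,granularity=core,compact
            ;;
        portable)
            # Standard OpenMP variables: one place per physical core,
            # threads bound close to each other.
            export OMP_PLACES=cores
            export OMP_PROC_BIND=close
            ;;
        *)
            echo "usage: set_thread_affinity intel|portable" >&2
            return 1
            ;;
    esac
}

set_thread_affinity intel
```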