3.7.8 Hitachi SR8000



Some information about the Hitachi compiler is here.

Sect. 3.6 lists the preprocessor switches that control the modern, single version of uio_mac_module.F90.

-conti199: Up to 199 continuation lines can be interpreted (otherwise not more than 39 continuation lines are accepted).
-limit: Limits the amount of time and memory for compilation.
-opt=ss: Uses the highest possible optimization level.
-nopredicate: Switches off a sub-option that is activated by -opt=ss. The -predicate option has to be disabled because the code crashes otherwise (segmentation violation). The switch must appear after -opt=ss.
-pvfunc=2: References the pseudo-vectorizing mathematical function and applies the temporary array to reference the pseudo-vectorizing mathematical function.
-omp -parallel=1: Parallelizes based on OpenMP directives only.
-procnum=8: Generates code for 8 processors on one node.
-orphaned=1: Checks if the regions sequentially executed contain orphaned directives during run-time when PROCNUM=8 is specified. If a sequentially executed region contains an orphaned directive, the system outputs a message and terminates the program.
-nestcheck=1: Checks for nesting errors in parallel regions. If a parallel region is nested, the system returns an error and terminates the program. Without this option, the code aborts with an error message, indicating illegal nesting. Compiler bug?
-pmpar: Collects the performance monitor information for each parallelization unit.
-pmfunc: Collects the performance monitor information for each procedure.
-Drhd_hyd_roe1d_l01=1: Optimization: Choose non-standard set of routines for Roe solver. See Sect. 5.4.17.
-DMSrad_raytas=0: Optimization: choose default version of loop in SUBROUTINE raytas in file MSrad3D.F90. See Sect. 3.6.
Important note: The UIO routines additionally need the compiler option -subchk (array bound checking). Without this checking option, some UIO routines do not work properly (compiler bug?).
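
The ordering constraint between -opt=ss and -nopredicate is easy to break when flags are assembled by hand. The following is a minimal sketch of a check for it. The names SR8000_FLAGS and nopredicate_follows_opt are hypothetical and not part of the CO5BOLD build system; the list holds only the general compiler switches from above, without the performance monitor options, the -D switches, and -subchk.

```python
# Hypothetical helper for illustration; not part of the CO5BOLD build system.
# General compiler switches from the list above, in the order they are passed.
SR8000_FLAGS = [
    "-conti199",
    "-limit",
    "-opt=ss",
    "-nopredicate",  # must come after -opt=ss
    "-pvfunc=2",
    "-omp",
    "-parallel=1",
    "-procnum=8",
    "-orphaned=1",
    "-nestcheck=1",
]


def nopredicate_follows_opt(flags):
    """Return True if both switches are present and -nopredicate comes after -opt=ss."""
    if "-opt=ss" not in flags or "-nopredicate" not in flags:
        return False
    return flags.index("-nopredicate") > flags.index("-opt=ss")


# Flag string as it could be handed to the compiler.
flag_string = " ".join(SR8000_FLAGS)

if __name__ == "__main__":
    print(flag_string)
    print("order ok:", nopredicate_follows_opt(SR8000_FLAGS))
```

The check only looks at the relative position of the two switches; it does not validate any other option.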

A proposed compilation sequence (with only the default modules activated) is:

export F90_PREFLAGS="-subchk"
./configure
make UIO
export F90_PREFLAGS=
./configure
make

Performance tests on hwwsr8k

**Figure 2:** Performance tests on Hitachi SR8000 at HLR Stuttgart. For models with 128x128x192 and 252x252x188 grid cells different values for the hydrodynamics and viscosity chunk size parameters were used. See text for more details.

Some tests have been performed on the machine hwwsr8k at HLR Stuttgart in order to determine the optimum chunk sizes, which are set by the parameters n_hydcellsperchunk and n_viscellsperchunk (see Sect. 5.4.7 and Sect. 5.4.8). Two different models have been used, one consisting of 128x128x192 grid cells, the other of 252x252x188. Grey radiative transfer has been performed with the MSrad module. Different values for the chunk sizes have been tried, with the hydrodynamics and the viscosity parameter set equal. In all cases three time steps have been computed.

The results are shown in Fig. 2. The number of resulting chunks for step HYD1 (the values for HYD2, HYD3, and VIS are very similar), total memory, performance, and the wall clock duration of the hydrodynamics and the viscosity routines are shown as functions of the chunk size parameters. As expected, the number of chunks decreases towards larger chunk sizes whereas the required memory increases, in particular for very large chunk size values.

Moreover, performance and CPU time can be optimized by choosing the right parameter values. Interestingly, the optimum chunk size is different for hydrodynamics and viscosity. Based on these tests, a larger value seems to be preferable for the viscosity (n_viscellsperchunk). In the case of the smaller model, 50000 seems to be fine for the hydrodynamics whereas the optimum viscosity chunk size is 200000. This difference explains the double-peaked structure of performance and CPU time. Note that the optimum values depend not only on the architecture used but also on the dimensions of the model. We recommend testing several chunk size values, since this may yield higher performance.
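
To get a feeling for how the chunk count scales with the chunk size parameter, the following sketch gives a rough lower-bound estimate for the two test models. It assumes that every chunk is filled up to the given number of cells, which the text above does not confirm; the actual partitioning in the code may produce more chunks. The function name estimated_chunks is hypothetical.

```python
# Rough estimate only: assumes chunks are filled to capacity, which may not
# match how the code actually partitions the grid.

def estimated_chunks(nx, ny, nz, cells_per_chunk):
    """Return ceil(nx*ny*nz / cells_per_chunk) using integer arithmetic."""
    total_cells = nx * ny * nz
    return (total_cells + cells_per_chunk - 1) // cells_per_chunk


SMALL_MODEL = (128, 128, 192)  # grid dimensions of the smaller test model
LARGE_MODEL = (252, 252, 188)  # grid dimensions of the larger test model

if __name__ == "__main__":
    for cells_per_chunk in (50000, 200000):
        n_small = estimated_chunks(*SMALL_MODEL, cells_per_chunk)
        n_large = estimated_chunks(*LARGE_MODEL, cells_per_chunk)
        print(cells_per_chunk, n_small, n_large)
```

Under this assumption the smaller model (3145728 cells) needs at least 63 chunks at a chunk size of 50000 and 16 at 200000, while the larger model (11938752 cells) needs at least 239 and 60, respectively.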
