Some information about the Hitachi compiler is here.
The appropriate machine dependent UIO module is
The appropriate machine dependent RHD module is
-conti199: Up to 199 continuation lines can be interpreted (otherwise not more than 39 continuation lines are accepted).
-limit: Limits the amount of time and memory for compilation.
-opt=ss: Use the highest possible optimization level.
-nopredicate: Switches off the -predicate sub-option that -opt=ss activates. Disabling it is necessary because the code otherwise crashes (segmentation violation). The switch must appear after setting
-pvfunc=2: References the pseudo-vectorizing mathematical function and applies the temporary array to reference the pseudo-vectorizing mathematical function.
-omp -parallel=1: Parallelize based on OpenMP directives only.
-procnum=8: Generate code for 8 processors on one node.
-orphaned=1: Checks at run time whether sequentially executed regions contain orphaned directives when PROCNUM=8 is specified. If a sequentially executed region contains an orphaned directive, the system outputs a message and terminates the program.
-nestcheck=1: Checks for nesting errors in parallel regions. If a parallel region is nested, the system returns an error and terminates the program. Without this option, the code aborts with an error message, indicating illegal nesting. Compiler bug?
-pmpar: Collects the performance monitor information for each parallelization unit.
-pmfunc: Collects the performance monitor information for each procedure.
-Drhd_hyd_roe1d_l01=1: Optimization: Choose non-standard set of routines for Roe solver. See Sect. 3.6.
-DMSrad_raytas=0: Optimization: Choose default version of loop in SUBROUTINE raytas in file MSrad3D.F90. See Sect. 3.6.
-subchk: Array bound checking. Without this checking option, some UIO routines do not work properly (compiler bug?).
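The ordering constraint among the options above can be checked mechanically before they are passed to the compiler. The following is a minimal sketch in Python, not part of the original build setup. It assumes that -nopredicate has to follow -opt=ss (the sentence stating this is incomplete in the text above), and the names HITACHI_OPTIONS, check_order, and option_string are hypothetical. The compiler command itself is deliberately left out.

```python
# Subset of the options listed above, in an order chosen for illustration.
HITACHI_OPTIONS = [
    "-conti199",
    "-limit",
    "-opt=ss",
    "-nopredicate",  # ASSUMPTION: must come after -opt=ss
    "-pvfunc=2",
    "-omp",
    "-parallel=1",
    "-procnum=8",
    "-orphaned=1",
    "-nestcheck=1",
    "-Drhd_hyd_roe1d_l01=1",
    "-DMSrad_raytas=0",
    "-subchk",
]


def check_order(options):
    """Return True unless -nopredicate appears before -opt=ss."""
    if "-nopredicate" not in options or "-opt=ss" not in options:
        return True  # nothing to compare
    return options.index("-nopredicate") > options.index("-opt=ss")


def option_string(options):
    """Join the options into one string, refusing a wrong order."""
    if not check_order(options):
        raise ValueError("-nopredicate must appear after -opt=ss")
    return " ".join(options)


if __name__ == "__main__":
    print(option_string(HITACHI_OPTIONS))
```

The check only covers the one ordering rule mentioned in the option list; whether other options interact in a similar way is not stated there.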
A proposed compiling sequence is (only default modules activated):
Performance tests on hwwsr8k
n_viscellsperchunk (see Sect. 5.3.7 and Sect. 5.3.8). Two different models have been used, one consisting of 128x128x192 grid cells and the other of 252x252x188. Grey radiative transfer has been performed with the MSrad module. Different values for the chunk size(s) have been tried, with the hydrodynamics and the viscosity parameters set equal. In all cases three time steps have been computed. The results are shown in Fig. 2: the number of resulting chunks for step HYD1 (the values for HYD2, HYD3, and VIS are very similar), total memory, performance, and the wall clock duration of the hydrodynamics and the viscosity routines are plotted as functions of the chunk size parameter(s). Clearly, the number of chunks decreases towards larger chunk sizes whereas the required memory increases, in particular for very large chunk size values. Moreover, performance and CPU time can be optimized by choosing the right parameter values. Interestingly, the optimum chunk size is different for hydrodynamics and viscosity. Based on these tests, a larger value seems to be preferable for the viscosity (n_viscellsperchunk). In the case of the smaller model, 50000 seems to be fine for the hydrodynamics whereas the optimum viscosity chunk size is 200000. This difference explains the double-peaked structure of performance and CPU time. Note that the optimum values depend not only on the architecture used but also on the dimensions of the model. We recommend testing several chunk size values, since this may lead to higher performance.
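To get a feeling for how the chunk size relates to the number of chunks for the two test models, a rough estimate can be computed as below. This is only a sketch: it assumes that the cells are split evenly so that the chunk count is the total cell count divided by the chunk size, rounded up. The actual partitioning in the code may differ, so the numbers need not match those in Fig. 2. The names MODELS, total_cells, and naive_chunk_count are hypothetical.

```python
# Grid dimensions of the two test models described above.
MODELS = {
    "small": (128, 128, 192),
    "large": (252, 252, 188),
}


def total_cells(dims):
    """Total number of grid cells for a 3D model."""
    nx, ny, nz = dims
    return nx * ny * nz


def naive_chunk_count(dims, cells_per_chunk):
    """ASSUMPTION: even split, i.e. ceil(total cells / chunk size)."""
    if cells_per_chunk <= 0:
        raise ValueError("cells_per_chunk must be positive")
    n = total_cells(dims)
    return -(-n // cells_per_chunk)  # integer ceiling division


if __name__ == "__main__":
    for name, dims in MODELS.items():
        for chunk in (50000, 200000):
            print(name, total_cells(dims), chunk, naive_chunk_count(dims, chunk))
```

Under this assumption the smaller model (3145728 cells) would give 63 chunks at a chunk size of 50000 and 16 at 200000. A scan over several chunk sizes on the target machine remains the only reliable way to find the optimum.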