background image

 

Code Optimization with IBM XL Compilers 

June 2004 

Introduction 

The IBM XL compiler family offers C, C++, and Fortran compilers built on an industry 
wide reputation for robustness, versatility and standards compliance. However, the XL 
compilers are about more than providing reliable, feature-rich functionality. The true 
strength of the compilers is in optimization, the ability to improve code generation, and 
produce fast-executing code on PowerPCâ„¢-based systems.  Optimized code executes 
faster, using less machine resources, and makes you and your PowerPC systems more 
productive. 
 
For example, consider the advantages of a compiler that properly exploits the inherent 
opportunities in the PowerPC architecture. A set of complex calculations that could take 
hours using unoptimized code, may be reduced to mere minutes when heavily optimized 
by an XL compiler.  
 
Built with flexibility in mind, the XL compiler optimization suites give you the 
reassurance of powerful, no-hassle optimization, coupled with the ability to tailor 
optimization levels and settings to meet the needs of your application and your 
development environment. 
 
This document does not contain an exhaustive list of the many optimization capabilities 
of the XL compiler family.  It introduces the most important capabilities and describes 
the compiler options, source constructs, and techniques that you can use to maximize the 
performance of your application. 
 

XL Compiler Family 

IBM’s XL compiler family encompasses three core languages for three major operating 
systems on IBM PowerPC hardware.   
 

Operating System and 

Machine Architecture 

XL C/C++ Products 

XL Fortran Product 

 

Mac OS X on Apple G4 
and G5 systems 

XL C/C++ Advanced Edition XL Fortran Advanced 

Edition 

AIXâ„¢ on pSeries 

C for AIX, 
VisualAge C++ for AIX 

XL Fortran for AIX 

Linux on pSeries and 

VisualAge C/C++ for Linux  XL Fortran for Linux 

background image

iSeries 

 
This document describes the optimizations available in the V8.1 XL Fortran and the V6.0 
XL C and C++ products. 
 
You can find more information about the XL compiler family, including downloadable 
trial versions, on the Web: 
 

http://www.ibm.com/software/awdtools/fortran/

 

http://www.ibm.com/software/awdtools/ccompilers/  

 
Those web sites also contain information on contacting IBM if you have questions about 
the XL compiler family or this document.  You may also contact the compiler team by 
sending e-mail to 

compinfo@ca.ibm.com

 

XL Compiler History 

The XL family of compilers is based on work at IBM that began in the mid-1980s with 
development of the IBM’s earliest POWER-based AIX systems.  Since then, the 
compilers have been under continuous development, with special attention to producing 
highly optimized applications that fully exploit the IBM POWER and PowerPC range of 
systems. 
 
The compiler design team at IBM works closely with the hardware design and operating 
system development teams.  This ensures that the compilers take advantage of all the 
latest hardware and OS capabilities, and also that the compiler team can influence 
hardware and OS design to create performance-enhancing capabilities. 
 
The XL family of compilers build performance-critical customer code every day and are 
key to the success of many of IBM’s most performance-sensitive products such as AIX 
and DB2.  IBM also relies on the XL family of compilers for producing industry-leading 
performance benchmarks results, such as SPEC. 
 
The XL compiler family has evolved into a set of highly standards-compliant compilers.  
All core language standards for all three languages are supported, with some additional 
draft standard features also available.  The XL compilers also support many important 
industry, IBM, and gcc/g++ language extensions such as the OpenMP Symmetric Multi-
Processing (SMP) standard. 
 

Optimization Technology Overview 

Optimization techniques for the XL compiler family are built on a foundation of common 
components and techniques that are then customized for the C, C++, and Fortran 
languages. All three language parser components emit an intermediate language that is 

background image

processed by the Interprocedural Analysis (IPA) optimizer and the optimizing code 
generator. The languages also share a set of language-independent high-performance run-
time libraries to support capabilities such as SMP and high-performance complex 
mathematical calculations. 
 
Sharing common components ensures that performance-enhancing optimization 
techniques available to you in one language are available to you in all of them.  At the 
highest optimization levels, the IPA optimizer can combine and optimize code from all 
three languages simultaneously when linking an application. 
 
Below are some of the optimization highlights that the XL family of compilers offers: 
 

• 

Five distinct optimization levels as well as additional options that allow you to tailor 
the optimization process for your application 

• 

Code generation and tuning for specific hardware chipsets 

• 

Interprocedural optimization and inlining via IPA 

• 

Profile-Directed Feedback (PDF) optimization 

• 

User-directed optimization via directives and source-level intrinsic functions that give 
you direct access to PowerPC instructions 

• 

Optimization of OpenMP programs and auto-parallelization capabilities to exploit 
SMP systems 

 

Optimization Levels 

The optimizer includes five base optimization levels: 
 

• 

-O0

, almost no optimization, best for debugging 

• 

-O2

, strong low-level optimization that benefits most programs 

• 

-O3

, intense low-level optimization analysis 

• 

-O4

, all of 

-O3

 plus detailed loop analysis and good whole-program analysis at link-

time 

• 

-O5

, all of 

-O4

 plus detailed whole-program analysis at link time 

 
Note that the 

-O1

 level is not supported.  Each of the optimization levels will be 

discussed in detail in subsequent sections of this document. 

Optimization Progression 

While increasing the level at which you optimize your application can provide an 
increase in performance, other compiler options can be as important to the optimization 
process as the 

-O

 level you choose. 

 
The 

Optimization Levels and Options

 table below offers some options as additional 

optimization techniques.  The table details compiler options implied by using a base 
optimization level; options you should use with each level; and some additional options 
that could be useful with each optimization level. 

background image

 

Optimization Levels and Options 

Base Optimization 

Level 

Additional Options 

Implied by Base 

Optimization Level

Additional 

Recommended 

Options 

Additional Options 

to Try with Base 

Optimization Level

-O0 

None 

-qarch -g 

-O2 -qmaxmem=2048 

-qarch 
-qtune 

-qmaxmem=-1 
-qhot 
-qnostrict 

-O3 -qnostrict 

-qmaxmem=-1

 

-qarch 
-qtune 

-qhot 
-qpdf 

-O4 

All of 

-O3

 plus: 

-qhot 
-qipa 
-qarch=auto 
-qtune=auto 
-qcache=auto

 

-qarch 
-qtune 
-qcache 

-qpdf 

-O5 

All of 

-O4

 plus: 

-qipa=level=2 

-qarch 
-qtune 
-qcache 

-qpdf 

 
While the above table provides a list of the most common compiler options for 
optimization, the XL compiler family offers optimization facilities for almost any 
application. For example 

-qsmp=auto

 can additionally be used with the base 

optimization levels if you desire automatic parallelization of your application. 
 
Compilation time can also be a consideration when developing an application.  In 
general, higher optimization levels take additional compilation time and resources.  
Specifying additional optimization options beyond the base optimization levels can also 
increase compilation time. For a large application with a long build time, compilation 
time can be an issue when choosing the appropriate optimization level and options for 
your build. 
 
It is important to test your application at lower optimization levels before moving on to 
higher levels, or adding optimization options.  If an application does not execute correctly 
when built with 

-O0

, it is unlikely to execute correctly when built with 

-O2

.  The 

optimizer is sensitive to language standard rules.  Even subtly non-conforming code can 
cause the optimizer to perform incorrect transformations, especially at the highest 
optimization levels.  All of the XL compilers have options to limit optimizations for 
nonconforming code, but it is best practice to correct the code and not limit optimization 
opportunities. 

Optimization Level 0 

At 

-O0

, the XL compilers limit optimization transformations to your applications. Some 

limited optimization occurs at 

-O0

 even if you specify no other optimization levels or 

background image

options.  However, limited optimization analysis generally results in the quicker compile 
times than other optimization levels. 
 
This is the best level to use when debugging your code with a symbolic debugger. 
Specify 

-g

 on the command line to instruct the compiler to emit debugging information. 

Whenever possible, ensure that your application executes correctly after compiling with  

-O0

 before increasing your optimization level.  When debugging SMP code, you can 

specify 

-qsmp=noopt

 to perform only the minimal transformations necessary to 

parallelize your code and preserve maximum debug capability. 

Optimization Level 2 

-O2

 performs a set of comprehensive low-level transformations generally limited to 

subprogram or compilation unit scopes with the possibility of some inlining.  
Optimization level 2 is the default setting of the 

-O

 compiler option, using 

-O

 implies  

-O2

.  This optimization level attempts to limit compilation time and resource 

requirements while performing as much low-level optimization as possible under those 
time and resource constraints.  You can increase the memory resources available to some 
of the level 2 optimizations by providing a larger value to the 

-qmaxmem

 option than the 

default of 2048. The value of -1 makes unlimited memory available. 
 
Optimization level 2 performs many transformations.  These are some of the most 
important, and generally most applicable to application development: 
 

• 

Global assignment of user variables to registers, also known as 

graph coloring 

register allocation

.  

• 

Strength reduction and effective use of addressing modes.  

• 

Elimination of redundant instructions, also known as 

common subexpression 

elimination 

• 

Elimination of instructions whose results are unused or that cannot be reached by a 
specified control flow, also known as 

dead code elimination

.  

• 

Value numbering (algebraic simplification).  

• 

Movement of invariant code out of loops.  

• 

Compile-time evaluation of constant expressions, also known as 

constant 

propagation

• 

Control flow simplification.  

• 

Instruction scheduling (reordering) for the target machine.  

• 

Loop unrolling and software pipelining 

• 

Elimination of unnecessary variable assignments, also known as 

dead store 

elimination

 
Optimization level 2 preserves some limited debug information when you specify 

-g

, and 

is the best choice for debugging optimized code.  Optimization levels higher than 

-O2

 

usually result in less accurate debug information being available to a debugger.  
However, along with 

-g

, additional compiler options like 

-qkeepparm

 or directives 

(pragmas in C/C++) such as 

SNAPSHOT

 can assist in debugging optimized code at any 

optimization level. 

background image

The -qarch and –qtune options at -O2 and higher 

The 

-qarch

 and 

-qtune

 options can be important to 

-O2

 and higher optimization levels.  

Choosing a hardware architecture target or family of targets allows the optimizer to make 
the best use of the hardware facilities available.  If you choose a family of hardware 
targets, the 

-qtune

 option can direct the compiler to emit code consistent with the 

architecture choice, but will execute optimally on the chosen tuning hardware target.  
This allows you to compile for a general set of targets but have the code run best on a 
particular target.  Details on the 

-qarch

 and 

-qtune

 options can be found in subsequent 

sections of this document. 

Optimization Level 3 

The 

-O3

 optimization level is an intensified version of 

-O2

.  The compiler performs 

additional low-level transformations and removes limits on the optimization level 2 
transformations, as 

-qmaxmem

 defaults to the -1 (unlimited) value.  Optimizations 

encompass larger program regions and deepen to attempt more analysis.  This level also 
performs transformations that are not always beneficial to all programs and attempts 
several optimizations that can be both memory- and time-intensive.  However, most 
applications benefit from this extra optimization.  Some general differences with 

-O2

 are: 

 

• 

Better loop scheduling.  

• 

Increased optimization scope, typically to encompass a whole procedure.  

• 

Specialized optimizations (those that might not help all programs).  

• 

Optimizations that require large amounts of compile time or space.  

• 

Implicit memory usage limits are eliminated. 

• 

Implies 

-qnostrict

, which allows some reordering of floating-point computations and 

potential exceptions. 

 
Because 

-O3

 implies the 

-qnostrict

 option, certain floating-point semantics of your 

application can be altered to gain execution speed.  These typically involve precision 
trade-offs such as the following: 
 

• 

Reordering of floating-point computations. 

• 

Reordering or elimination of possible exceptions (for example, division by zero or 
overflow). 

 
You can still gain most of the benefits of 

-O3

 and preserve precise floating-point 

semantics by specifying 

-qstrict

.  This is only necessary if absolutely precise floating-

point computational accuracy (as compared with 

-O0

 or 

-O2

 results) is important. You 

may also need 

-qstrict

 if your application is sensitive to floating-point exceptions or the 

order and manner in which floating-point arithmetic is evaluated is important.  Generally, 
without 

-qstrict

, the difference in computed values on any one source-level operation is 

very small compared to lower optimization levels.  However, the difference can 
compound if the operation involved is in a loop structure, and the difference becomes 
additive. 

background image

Optimization Level 4 

Optimization level 4 is an easy way of specifying 

-O3

 along with several other 

complementary optimization options.  The most important of the additional options is  

-qipa

.  IPA optimization extends program analysis beyond individual files and 

compilation units to the entire application.  IPA analysis can propagate values between 
compilation units and inline code from one compilation unit to another.  Global data 
structures can be reorganized or eliminated, and many other transformations become 
possible when the entire application is visible to the IPA optimizer.  The 

-O4

 

optimization level invokes IPA at the default IPA optimization level, level 1. 
 
To make full use of IPA optimizations, you must specify 

-O4

 on the compilation and the 

link steps of your application build.  At compilation time, important optimizations occur 
at the compilation-unit level, as well as preparation for link-stage optimization.  IPA 
information is written into the object files produced.  At the link step, the IPA 
information is read from the object files and the entire application is analyzed.  The 
analysis results in a restructured and rewritten application, which subsequently has the 
lower-level 

-O3

 style optimizations applied to it before linking.  Object files containing 

IPA information can also be used safely by the system linker without using IPA on the 
link step. 
 

-O4

 implies other optimization options beyond IPA: 

 

• 

-qhot

 enables a set of High-Order Transformation optimizations that are most 

effective when optimizing loop constructs.   

• 

-qarch=auto

 and 

-qtune=auto

 are enabled and assume that the machine on which 

you are compiling is the machine on which you will execute your application.  If the 
architecture of your build machine is incompatible with the machine that will execute 
the application, you will need to specify a different 

-qarch

 option 

after

 the 

-O4

 

option to override 

-qarch=auto

• 

-qcache=auto

 assumes that the cache configuration of the machine on which you are 

compiling is the machine on which you will execute your application.  If you are 
executing your application on a different machine, specify correct cache values or use 

-qnocache

 to disable the 

auto

 suboption. 

Optimization Level 5 

-O5

 is the highest base optimization level and includes all 

-O4

 optimizations as well as 

running the IPA optimizer at level 2 instead of the default level 1.  That change, like the 
difference between 

-O2

 and 

-O3

, broadens and deepens IPA optimization analysis and 

performs even more intense whole-program analysis. 

-O5

 can consume the most compile 

time and machine resource of any optimization level.  You should only use it once your 
application is debugged and known to work at lower optimization levels. 
 

-O5

, as with 

-O4

, implies 

-qarch=auto

-qtune=auto

, and 

-qcache=auto

.  You may 

need to alter those defaults if the machine you are compiling on is not where you will 
execute the application. 
 

background image

PowerPC Optimization Capabilities 

The XL compiler family can fully exploit the potential of the PowerPC and related 
POWER series processors.  Using an XL compiler, you can target any processor 
supported by the operating system on which you can use an XL compiler.  
 
The AIX compilers target the full range of processors that AIX supports and the Linux 
compilers support range of PowerPC processors supported by the Linux pSeries and 
iSeries distributions.  The Mac OS X compilers target the two latest processors available 
to Apple customers, the G4 and G5. 

-qarch Option 

Using the correct 

-qarch

 suboption is the most important step in influencing chip-level 

optimization.  The compiler uses that option to make both high- and low-level 
optimization decisions and trade-offs.  It allows the compiler access to the full range of 
processor hardware instructions and capabilities when making code generation decisions.  
Even at low optimization levels such as 

-O0

 or 

-O2

, specifying the correct target 

architecture can have a positive impact on performance.  It allows the compiler to select 
more efficient machine instructions and generate instruction sequences that will perform 
best for a particular machine. 
 
You must use 

-qarch

 correctly because it instructs the compiler to structure your 

application to execution on a particular set of machines that support the specified 
instruction set.  The 

-qarch

 option features suboptions that specify individual processors 

and suboptions that specify a family of processors with common instruction sets or 
subsets.  The choice of processor gives you the flexibility of compiling your application 
to execute optimally on a particular machine, or to be able to execute on a wider variety 
of machines, but still have as much architecture-specific optimization applied as possible.  
For example, an AIX or Linux compiler that is building an application that will only run 
on POWER4-based machines should use the 

-qarch=pwr4

 option.  However, a Linux 

compiler that is building applications that will only run on 64-bit mode capable hardware 
but does not know which machines in particular, should use the 

-qarch=ppc64

 option to 

select the entire 64-bit PowerPC family of processors. 
 
The default setting of 

-qarch

 at all optimization levels except 

-O4

 or 

-O5

, selects the 

smallest subset of capabilities that all the processors have in common for that target 
operating system and compilation mode.  For example, on Mac OS X, the default 
architecture setting is 

ppcv 

which selects a generic PowerPC implementation that 

includes the Apple Velocity Engine (AltiVec) instruction set.  This setting generates code 
that correctly executes on G4 or G5 processors but will not exploit specific characteristics 
of the G5 chip.  At optimization levels 

-O4

 or 

-O5

 or if 

-qarch=auto

 is specified, the 

compiler will detect the type of machine on which you are compiling and assume your 
application will execute on that machine architecture. 

background image

-qtune Option 

The 

-qtune

 option directs the optimizer to bias optimization decisions for executing the 

application on a particular architecture, but does not prevent the application from running 
on other architectures.  The compiler generates machine instructions consistent with your 

-qarch

 architecture choice.  Using 

-qtune

 additionally allows the optimizer to perform 

transformations, such as instruction scheduling, so that the resulting code executes most 
efficiently on your chosen 

-qtune

 architecture.  Since the 

-qtune

 option tunes code to run 

on one particular processor architecture, it supports specific processors as suboptions and 
does not support suboptions representing families of processors. 
 
You should use 

-qtune

 to specify the most common or important processor where your 

application will execute.  For example, if your AIX or Linux application will usually 
execute on a POWER4-based systems but will sometimes execute on POWER3-based 
systems, specify 

-qtune=pwr4

.  The code generated will execute more efficiently on 

POWER4-based systems will but will still run correctly on Power3-based systems. 
 
The default 

-qtune

 setting depends on the setting of the 

-qarch

 option.  If the 

-qarch

 

option selects a particular machine architecture, the range of 

-qtune

 suboptions that are 

supported is limited, and the default tune setting will be compatible with the selected 
target processor.  If instead, the 

-qarch

 option selects a family of processors, the range of 

values accepted for 

-qtune

 is expanded across that family, and the default is chosen from 

a commonly used machine in that family.  With the 

-qtune=auto

 suboption, which is the 

default for optimization levels 

-O4

 and 

-O5

, the compiler detects the machine 

characteristics on which you are compiling, and assumes you want to tune for that type of 
machine. 

-qcache Option 

This option allows you to describe to the optimizer the memory cache layout for the 
machine where your application will execute.  There are several suboptions you can 
specify to describe cache characteristics such as types of cache available, their sizes, and 
cache-miss penalties.  If you do not specify 

-qcache

, the compiler will make cache 

assumptions based on your 

-qarch

 and 

-qtune

 option settings.  If you know some, but 

not all of the cache characteristics of your target machine, specify the characteristics that 
you know.  With the 

-qcache=auto

 suboption, the default at optimization levels 

-O4

 and 

-O5

, the compiler detects the cache characteristics of the machine on which you are 

compiling and assumes you want to tune cache optimizations for that cache layout. 

Source-Level PowerPC Optimizations 

The XL compiler family exposes hardware-level capabilities directly to you through 
source-level intrinsic functions, procedures, directives, and pragmas.  XL compilers offer 
simple interfaces that you can use to access PowerPC instructions that control low-level 
instruction functionality such as: 
 

• 

Hardware cache prefetching/clearing/sync 

• 

Access to FPSCR register (read and write) 

background image

• 

Arithmetic (e.g. FMA, converts, rotates, reciprocal SQRT) 

• 

Compare-and-trap 

 
The compiler inserts the requested instructions or instruction sequences for you, but is 
also able to perform optimizations using and modelling the instructions’ behavior.  For 
the XL C and C++ compilers, you can additionally insert arbitrary PowerPC instruction 
sequences through the 

mc_func

 pragma. 

VMX Vector Instruction Programming Interface 

Under the 

-qaltivec

 option, the Mac OS X C and C++ compilers additionally support the 

AltiVec programming interfaces available in the Mac OS X gcc compiler.  This allows 
source-level recognition of the vector data type and the more than 100 intrinsic functions 
defined to manipulate vector data.  These interfaces allow you to program source-level 
operations that manipulate vector data using the VMX facilities of the Apple PowerPC 
processors.  AltiVec vector capabilities allow your program to calculate arithmetic results 
on up to sixteen data items simultaneously. 
 

High-Order Transformation (HOT) Loop Optimization 

The HOT optimizer in the XL compilers is a specialized loop transformation optimizer.  
HOT optimizations are activated by default at optimization levels 

-O4

 and 

-O5

, but can 

be specified with levels 

-O2

 or 

-O3

 using the 

-qhot

 option.  Loops typically account for 

the majority of the execution time of most applications and the HOT optimizer performs 
in-depth analysis of loops to minimize their execution time.  Loop optimization 
techniques include:  interchange, fusion, unrolling of loop nests, and reducing the use of 
temporary arrays.  The goals of these optimizations include: 
 

• 

Reducing the costs of memory access through the effective use of caches and 
translation look-aside buffers (TLBs).  Increasing memory locality reduces 
cache/TLB misses. 

• 

Overlapping computation and memory access through effective utilization of the data 
prefetching capabilities provided by the hardware. 

• 

Improving the utilization of processor resources through reordering and balancing the 
usage of instructions with complementary resource requirements.  Loop computation 
balance typically involves load/store operations balanced against floating-point 
computations. 

 
HOT is especially adept at handling Fortran 90-style array language constructs and 
performs optimizations such as elimination of intermediate temporary variables and 
fusion of statements.  While HOT works well with Fortran constructs, it also recognizes 
opportunities in C and C++ code compiled with the XL C and C++ compilers. 
 
In all three languages, you can use pragmas and directives to assist the HOT optimizer in 
loop analysis.  Assertive directives such as 

INDEPENDENT

 or 

CNCALL

 allow you to 

describe important loop characteristics or behaviors that the HOT optimizer can exploit.  
Prescriptive directives such as 

UNROLL

 or 

PREFETCH 

allow you to direct the HOT 

background image

optimizer’s behavior on a loop-by-loop basis.  You can additionally use the 

-qreport

 

compiler option to generate information about loop transformations.  The report can 
assist you in deciding where pragmas or directives can be applied to improve 
performance. 
 
In addition to general loop transformation, the 

-qhot

 option supports suboptions that you 

can specify to enable particular transformations.  These optimization techniques are 
discussed below. 

HOT Vectorization 

By default, if you specify 

-qhot

 with no suboptions (or an option like 

-O4

 that implies  

-qhot

), the 

-qhot=vector

 suboption is enabled.  The 

vector

 suboption can optimize loops 

in source code for operations on array data by ensuring that operations run in parallel 
where applicable.  The compiler uses standard machine registers for these 
transformations and does not restrict vector data size, supporting both single- and double-
precision floating-point vectorization.  Often, HOT vectorization involves 
transformations of loop calculations into calls to specialized mathematical routines 
supplied with the compiler.  These mathematical routines use algorithms that calculate 
the results more efficiently than executing the original loop code.  HOT vectorization 
does not use AltiVec/VMX PowerPC instructions and so it can even be used all types of 
systems. 

HOT Array Size Adjustment 

An array dimension that is a power of two can lead to a decrease in cache utilization. The 

-qhot=arraypad

 suboption allows the compiler to increase the dimensions of arrays 

where doing so could improve the efficiency of array-processing loops. Using this 
suboption can reduce cache misses and page faults that slow your array processing 
programs.  Not all arrays will necessarily be padded, and the compiler can pad different 
arrays by different amounts, making the correct decisions to increase your application’s 
performance.  You can specify a padding factor to apply to all arrays which would 
typically be a multiple of the largest array element size.  Array padding should be done 
with discretion.  The HOT optimizer does not check for situations where array data is 
overlaid, as with Fortran 

EQUIVALENCE or

 C 

union

 constructs, or with Fortran array 

reshaping operations. 
 

Interprocedural Analysis (IPA) Optimization 

The IPA optimizer’s primary purpose is whole-program analysis and optimization.  IPA 
analyzes the entire program rather than a program on a file-by-file basis while running 
during the link step of an application build.  On the link step, the entire program, 
including linked-in libraries, is visible to the IPA optimizer.  IPA can perform 
transformations that are not possible when only one file or compilation unit is visible at 
compilation time. 
 
IPA link-time transformations restructure the application, performing optimizations such 
as inlining between compilation units.  Complex data flow analyses are performed across 

background image

subprogram calls to eliminate parameters or propagate constants directly into called 
subprograms.  IPA can recognize system library calls because it acts as a pseudo-linker 
resolving external subprogram calls to system libraries.  This allows IPA to improve 
parameter usage analysis or even eliminate the call completely and replace it with more 
efficient inline code. 
 
In order to maximize IPA link-time optimization, the IPA optimizer must be used on both 
the compile step and the link step.  IPA can only perform a limited program analysis at 
link time on objects that were not compiled with IPA and must work with greatly reduced 
information.  When IPA is active on the compile step, program information is stored in 
the resulting object file, which IPA reads on the link step when the object file is analyzed.  
The program information is invisible to the system linker, and the object file can be used 
as a normal object and be linked without invoking IPA.  IPA uses the hidden information 
to reconstruct the original compilation and is then able to completely reanalyze the 
subprograms in the object in the context of their actual usage in the application. 
 
The IPA optimizer performs many transformations even if IPA is not used on the link 
step.  Using IPA on the compile step initiates optimizations that can improve 
performance for each individual object file even if the object files are not linked using the 
IPA optimizer.  IPA’s primary focus is link-step optimization but using the IPA optimizer 
only on the compile-step can still be very beneficial to your application. 
 
Because the XL compiler family shares optimization technology, object files created 
using IPA on the compile step with the XL C, C++, and Fortran compilers can all be 
analyzed by IPA at link time.  Where program analysis shows that objects were built with 
compatible options, such as 

-qnostrict

, IPA can perform transformations such as inlining 

C functions into Fortran code, or propagating C++ constant data into C function calls. 
 
IPA’s link-time analysis facilitates a restructuring of the application and a partitioning of 
it into distinct units of compatible code.  After IPA optimizations are completed, each 
unit is further optimized by the lower-level compilation-unit optimizer normally invoked 
with the 

-O2

 or 

-O3

 options.  Each unit is compiled into one or more object files, which 

are linked with the required libraries by the system linker, and an executable program is 
created. 
 
It is important that you specify a set of compilation options as consistent as possible 
when compiling and linking your application.  This advice applies to all compiler 
options, not just 

-qipa

 suboptions.  The ideal situation is to specify identical options on 

all compilations and then to repeat the same options on the IPA link step.  Incompatible 
or conflicting options used to create object files or link-time options in conflict with 
compile-time options can reduce the effectiveness of IPA optimizations.  For example, it 
is unsafe to inline a subprogram into another subprogram if they were compiled with 
conflicting options. 

background image

IPA Suboptions 

The IPA optimizer has many behaviors which you can control using the 

-qipa

 option and 

suboptions.  The most important part of the IPA optimization process is the level at which 
IPA optimization occurs. By default, the IPA optimizer is not invoked.  If you specify  

-qipa

 without a level, or you specify 

-O4

, IPA is run at level one.  If you specify 

-O5

IPA is run at level two.  There is also a level zero which can lower compilation time, but 
performs a more limited analysis.  Below are some of the important IPA transformations 
performed at each level. 

-qipa=level=0  

• 

Performs automatic recognition of standard library functions (e.g. ANSI C, Fortran 
run-time) 

• 

Localization of statically bound variables and procedures 

• 

Partitioning and layout of code according to call affinity - expands the scope of the  

-O2

/

-O3

 low-level compilation unit optimizer 

• 

This level can be beneficial to an application but cost less compile time than higher 
levels 

-qipa=level=1 

• 

Level 0 optimizations 

• 

Procedure inlining 

• 

Partitioning and layout of static data according to reference affinity 

-qipa=level=2 

• 

Level 0 and level 1 optimizations 

• 

Whole program alias analysis 

Disambiguation of pointer references and calls 

Refinement of call side effect information 

• 

Aggressive intraprocedural optimizations 

Value numbering, code propagation and simplification, code motion (into 
conditions, out of loops), redundancy elimination 

• 

Interprocedural constant propagation, dead code elimination, pointer analysis 

• 

Procedure specialization (cloning) 

 
In addition to selecting a level, the 

-qipa

 option has many other suboptions available for 

fine-tuning the optimizations applied to your program.  There are several suboptions to 
control inlining including: how much to do, threshold levels, functions to always or never 
inline, and other related actions.  Other suboptions allow you to describe the 
characteristics of given subprograms to IPA - pure, safe, exits, isolated, low execution 
frequency, and others.  The 

-qipa=list

 suboption can show you brief or detailed 

information concerning IPA analysis such as program partitioning and object reference 
maps. 
 
An important suboption that can speed compilation time is 

-qipa=noobject

.  You can 

lower compilation time if you intend to use IPA on the link step and do not need to link 

background image

the object files from the compilation step without using IPA.  Specify the  

-qipa=noobject

 option on the compile step and this will create object files on the 

compile step that only the IPA link-time optimizer can use.   The object files will be 
created more quickly because the low-level compilation unit optimizer is not invoked on 
the compile step when the 

noobject

 suboption is specified. 

 
On AIX and Linux systems, you may also be able to reduce IPA optimization time by 
using the 

-qipa=threads

 suboption.  The threads suboption allows the IPA optimizer to 

run portions of the optimization process in parallel threads which can speed up the 
compilation process on multi-processor systems. 
 

Profile-Directed Feedback (PDF) Optimization 

PDF is an optimization that is applied to your application in two stages.  The first stage 
collects information about your program as you run it with typical input data.  The second 
stage applies transformations to your application based on the information collected.  
PDF is used by the XL compiler optimizer to discover information about your application 
such as the locations of heavily used or infrequently used blocks of code.  Knowing the 
relative execution frequency of code provides opportunities to the optimizer to bias 
execution paths in favor of heavily used code.  Program restructuring can also be 
performed in order to ensure that infrequently-executed blocks of code are less likely to 
affect program path length or participate in instruction cache fetching. 
 
It is important that the data sets used to collect the PDF information be characteristic of 
the data your application will typically see.  Using atypical data or insufficient data can 
lead to a false picture of the program and suboptimal program transformation.  If you do 
not have sufficient data, you should not use PDF optimization. 
 
The first step in PDF optimization is to compile your application with the 

-qpdf1

 option.  

Doing so will instrument your code with calls to a PDF run-time that will be linked with 
your program.  You then need to execute your application with typical input data.  You 
can do so as many times as you wish with as many data sets as you have.  Each run will 
record information in data files.  XL compilers supply the 

cleanpdf

 and 

resetpdf

 tools that 

assist you in managing PDF data. 
 
After sufficient PDF data is collected, recompile your application with the 

-qpdf2

 option.  

The compiler will read the PDF data files and make the information available to all levels 
of optimization that are active.  PDF optimization can be combined with other 
optimization techniques in the XL compilers such as the standard 

-O2

 or 

-O3

 compilation 

unit optimizations, or the 

-qhot

, or 

-qipa

 optimizations active at higher optimization 

levels. 
 
The PDF optimization is most effective when applied to applications that contain blocks 
of code that are infrequently and conditionally executed.  Typical examples of this coding 
style include blocks of error-handling code and code that has been instrumented to 
conditionally collect debugging or statistical information. 

background image

 

Symmetric Multiprocessing (SMP) Optimizations 

Note:  Version 8 of the XL Fortran compiler on Mac OS X and Version 6 of the XL 
C/C++ compiler on Mac OS X provide SMP programming and optimization as a 
technology preview only.  SMP is fully supported on other XL compiler platforms. 

 
The XL compilers support the OpenMP standard applicable to each language.  Using 
OpenMP allows you to write portable code that is compliant to the OpenMP parallel-
programming standard and enables your application to run in parallel threads on SMP 
systems.  OpenMP consists of a defined set of source-level pragmas/directives, and run-
time function interfaces you can use in your program to parallelize your application.  The 
compilers include threadsafe libraries in support of OpenMP or for use with other SMP 
programming models. 
 
The optimizer in the XL compilers includes threadsafe optimizations specific to SMP 
programming and particular performance enhancements to the OpenMP standard. The  

-qsmp

 compiler option has many suboptions that you can use to guide the optimizer 

when analyzing SMP code.  You can set additional SMP-specific environment variables 
to tune the run-time behavior of your application to maximize the SMP capabilities of 
your hardware.  For basic SMP functionality, the 

-qsmp=noopt

 suboption allows you to 

transform an application into an SMP application, but performs only the minimal 
transformations required in order to preserve maximum source-level debug information. 
 
The 

-qsmp=auto

 suboption is a powerful tool that enables automatic transformation of 

normal sequentially-executing code into parallel-executing code.  Using this suboption 
allows the optimizer to automatically exploit the SMP and shared memory parallelism 
available in IBM processors.  By default, the compiler will attempt to parallelize 
explicitly coded loops as well as those that are generated by the compiler for Fortran 
array language.  If you do not specify 

-qsmp=auto

 or you specify 

-qsmp=noopt

automatic parallelization is turned off, and only constructs that are marked with 
prescriptive directives/pragmas are parallelized. 
 
The 

-qsmp

 compiler option additionally supports suboptions that allow you to guide the 

compiler’s SMP transformations. These include suboptions that control transformations 
of nested parallel regions, use of recursive locks, and what task scheduling models to 
apply.   
 
When using 

-qsmp

 or any other parallel-based programming model, you must invoke the 

compiler with one of the “_r†(re-entrant) variations of the compiler name.  For example, 
rather than use 

xlc

 or 

xlf90

 you must use 

xlc_r

 or 

xlf90_r

.  The different name signals to 

the compiler that you need to use an alternate set of compiler option defaults (e.g.  

-qthreaded

 is specified for you) and the correct set of threadsafe libraries will be used 

for linking the application.  You can use the “_r†versions of the compiler even when you 
are not generating code that executes in parallel.  However, especially for XL Fortran, 

background image

code and libraries will be used in place of the sequential forms that are not always as 
efficient in a single-threaded execution mode. 
 

Aliasing 

The precision of compiler analyses is constrained by the apparent effects of direct or 
indirect memory access.  Memory can be referenced directly through a variable, or 
indirectly through a pointer, function call or reference parameter.  Many apparent 
references to memory are false, and constitute barriers to compiler analysis.  The 
compiler does analysis of possible aliases at all optimization levels but analysis of these 
apparent references is best when using the 

-qipa

 option.  Options such as 

-qalias

 and 

directives or pragmas such as 

CNCALL

 and 

INDEPENDENT

 can have a pervasive 

effect since they fundamentally improve the precision of compiler analysis. 
 
Each language has its own rules as to what constitutes a valid or an invalid alias.  Fortran 
rules are well defined for what can and cannot be done with arguments passed to 
subprograms.  Failure to follow language rules that affect aliasing will often mislead the 
optimizer into performing unsafe or incorrect transformations.  The higher the 
optimization level and the more optional optimizations are applied, and the more likely 
the optimizer will be misled. 
 
The XL compilers supply options that you can use to optimize programs with non-
standard aliasing constructs.  Using these options can result in poor-quality aliasing 
information being made available to the optimizer, and less than optimal code 
performance.  It is recommended that you alter your source code where possible to 
conform to language rules.  Doing so will make your code both more portable and more 
optimizable. 
 
The 

-qalias

 option can be used for all three XL compiler languages to assert whether 

your application follows aliasing rules.  For Fortran and C++, standards-conformant 
program aliasing is assumed by default.  For the C compiler, the invocations that begin 
with “xlc†assume conformance and the invocations that begin with “cc†do not.  The  

-qalias

 option has suboptions that vary by language.  Suboptions exist for both the 

purpose of specifying that your program has non-standard aliasing, and for asserting to 
the compiler that your program exceeds the aliasing requirements of the language 
standard.  The latter set of suboptions can remove barriers to optimization that the 
compiler must assume due to language rules.  For example, in the XL C/C++ compiler, 
specifying -

qalias=typeptr

 allows the optimizer to assume that pointers to different types 

never point to the same or overlapping storage.  This exposes additional optimization 
opportunities because the optimizer performs more precise aliasing analysis in code 
involving pointers. 
 

background image

Additional Performance Options 

In addition to the compiler options already introduced, the XL compilers have many other 
options that you can use to direct the optimizer.  Some of these are specific to individual 
XL compilers rather than the entire XL compiler family and this is noted in the text. 

Optimizer Guidance Options 

 

• 

-qcompact

: (all) 

Default is 

-qnocompact

 

Prefers final code size reduction over execution time performance when a 
choice is necessary. 

Can be useful as a way to constrain the 

-O3

 and higher optimization levels. 

• 

-ma

: (C, C++)

 

The compiler will generate inline code for calls to the 

alloca

 

function. 

• 

-qprefetch

: (all) 

Instructs the compiler to insert prefetch instructions automatically where there 
are opportunities to improve code performance. 

• 

-qro,-qroconst

: (C, C++)

 

Directs the compiler to place string literals (

-qro

) or constant values in read-

only storage (

-qroconst

). 

• 

-qsmallstack

: (all)

 

Default is 

-qnosmallstack 

Instructs the compiler to minimize the use stack (automatic) storage where 
possible; doing so may increase heap usage. 

• 

-qnostrict

: (all) 

Default is 

-qstrict

 with 

-O0

 and 

-O2

-qnostrict

 with 

-O3

-O4

, and 

-O5

 

-qnostrict

 

allows the compiler to reorder floating-point calculations and 

instructions that can potentially cause exceptions. 

Do not specify 

-qstrict

 unless your application relies on absolutely precise 

IEEE floating-point compliant results or the order of floating-point 
computations. 

• 

-qunroll

: (all) 

Default is 

-qnounroll

 (unless optimizing) 

Independently controls loop unrolling.  It is turned on implicitly with any 
optimization level higher than 

-O0

You can specify suboptions that determine the aggressiveness of automatic 
loop unrolling. 

Program Behavior Options 

 

• 

-qaggrcopy

={

ovrlap|noovrlap

}:

 

(C, C++) 

Specify whether aggregate assignments may have overlapping source and 
target locations.

 

Default is 

ovrlap

 with 

cc

 compiler invocations, 

noovrlap

 with 

xlc

 and 

xlC

 

compiler invocations.

 

background image

• 

-qassert

: (all)

 

Default is 

-qassert=deps:itercnt=10

 

deps

 indicates that at least one loop has a memory dependence or conflict 

from iteration to iteration. 

 

For improved performance, try 

-qassert=nodeps

 when no loops in the 

compilation unit carry a dependence around loop iterations. 

itercnt

 modifies the default assumptions about the expected iteration count of 

loops; normally the optimizer will assume ten iterations for a typical loop. 

• 

-qnoeh

 

(C++)  

Default is 

-qeh

 

Asserts that no throw is reachable from the code being compiled in this 
compilation unit. 

Using it can improve execution speed and reduce code footprint where the 
code has no C++ exception handling. 

• 

-qignerrno

: (C, C++) 

 

Default is 

-qnoignerrno

, for 

-O3

 or higher, the default is 

-qignerrno 

Indicates that the value of errno is not needed by the program. 

Can help in optimization of math functions that potentially set errno, such as 
sqrt. 

• 

-qlibansi

: (all)

 

Default is 

-qnolibansi 

Specifies that calls to ANSI standard function names will be bound with 
conforming implementations. 

Allows the compiler to replace the calls with more efficient inline code or at 
least do better call-site analysis. 

Not needed when linking with 

-qipa

• 

-qproto

: (C)

 

Asserts that procedure call points agree with their declarations even if the 
procedure has not been prototyped. 

Useful for well behaved K&R C code. 

• 

-qnounwind

: (all)

 

Default is 

-qunwind

.

 

Asserts that the stack will not be unwound in such a way that register values 
must be accurately restored at call points. 

Most Fortran applications can use 

-qnounwind

 thus allowing the compiler to 

be more aggressive in eliminating register saves/restores at call points. 

Floating-Point Computation Options 

 
The 

-qfloat

 option provides precise control over the handling of floating-point 

calculations.  XL compiler default options result in code that is 

almost 

IEEE 754 

compliant.  Where non-compliant code is generated, it is to allow the compiler the 
freedom to exploit certain optimizations such as floating-point constant folding or to use 
efficient PowerPC instructions that combine operations.  You can use the 

-qfloat

 option 

to prohibit these optimizations.  Specify 

-qfloat=

subopt

 

where 

subopt

 

is one of: 

 

background image

• 

[no]fold

:  

Enables compile time evaluation of floating-point calculations.  You may 
need to disable folding if your application must handle certain floating-point 
exceptions such as overflow or inexact. 

• 

[no]maf

:  

Enables generation of combined multiple-add instructions.  You may need to 
disable maf instructions to produce results identical to those produced by non-
PowerPC systems.  Disabling maf instructions can result in significantly 
slower code. 

• 

[no]rrm

:  

Specifies that rounding mode may not be round-to-nearest (default is 

norrm

or may change across calls. 

• 

[no]rsqrt

:  

Allows computation of a divide by square root to be replaced by a multiply of 
the reciprocal square root. 

 
Note that AIX XL compilers which target older POWER and POWER2â„¢ chipsets 
support additional 

-qfloat

 options. 

Diagnostic Options 

 
The following options can assist you in analyzing the results of compiler optimization.  
You can examine this information to see if expected transformations have occurred. 

• 

-qlist 

Generates an object listing that includes hex and pseudo-assembly 
representations of the generated code and text constants. 

• 

-qreport

Specified as 

-qreport=[hotlist | smplist].

 

Instructs the HOT or IPA optimizer to emit a report including pseudo-code 
along with annotations describing what transformations, such as loop 
unrolling or automatic parallelization, were performed. 

Includes data dependence and other information such as program constructs 
that inhibit optimization. 

• 

-S

Invokes the disassembly tool supplied with the compiler on the object file(s) 
produce by the compiler. 

This produces an assembler file that is compatible with the system assembler. 

The 

-S

 option is not supported with the Mac OS X XL compilers. 

 

User-Directed Source-Level Optimizations 

The XL compilers support many source-level directives and pragmas that you can specify 
to influence the optimizer.  Several have been mentioned in the previous sections.  
Following is an important subset of those that the XL compiler supports along with a 

background image

brief description of the purpose of each.  Fortran uses directives to express these and 
C/C++ uses pragmas. 

Fortran Directives 

 

• 

ASSERT ( ITERCNT(

n

) | [NO]DEPS ) 

Same behavior as the 

-qassert

 option but applicable to a single loop.  This is 

much more useful because the characteristics of each loop can be described to 
the optimizer independently of the other loops in the program. 

• 

CACHE_ZERO 

Inserts a dcbz (data cache block zero) instruction at the given address. 

Useful when storing to contiguous storage as it avoids the level 2 cache store 
miss entirely. 

• 

CNCALL 

Asserts that the calls in the following loop do not cause loop-carried 
dependences. 

• 

INDEPENDENT 

Asserts that the following loop has 

no 

loop-carried dependences. 

Enables locality and parallel transformations. 

• 

PERMUTATION ( 

names 

Asserts that elements of the named arrays take on distinct values on each 
iteration of the following loop - useful with sparse data. 

• 

PREFETCH 

PREFETCH_BY_LOAD (

variable_list

):

 issue dummy

 

loads to cause the 

given variables to be prefetched into cache - useful to activate hardware 
prefetch. 

PREFETCH_FOR_LOAD

 (

variable_list

): issue a 

dcbt 

instruction for each 

of the given variables. 

PREFETCH_FOR_STORE (

variable_list

)

: issue a 

dcbtst 

instruction for 

each of the given variables. 

• 

UNROLL

 

Specified as 

[NO]UNROLL [(

n

)]

 

Used to activate/deactivate compiler unrolling for the following loop. 

Can be used to give a specific unroll factor. 

C/C++ Pragmas 

 

• 

disjoint (variable_list)  

Asserts that none of the named variables (or pointer dereferences) share 
overlapping areas of storage. 

• 

execution_frequency (very_low) 

Asserts that the control path containing the pragma will be infrequently 
executed. 

• 

independent_calls

  

Equivalent to 

CNCALL

 in Fortran. 

background image

• 

independent_loop

 

 Equivalent to 

INDEPENDENT

 in Fortran. 

• 

isolated_call (function_list)  

Asserts that calls to the named functions do not have side effects. 

• 

iterations

 

Equivalent to 

ASSERT(ITERCNT)

 in Fortran. 

• 

leaves (function_list) 

Asserts that calls to the named functions will not return (e.g. exit). 

• 

permutation

 

Equivalent to 

PERMUTATION 

in Fortran. 

• 

sequential_loop

 

Directs the compiler to execute the following loop in a single thread, even if 
the 

-qsmp=auto

 option is specified. 

 

Optimization-Friendly Programming 

There are many ways that you can assist the XL compiler optimizer in maximizing the 
performance of your application that do not involve the use of compiler options or 
source-level directives.  The following are some useful tips and techniques. 

General Programming Tips 

 

• 

Use register-sized integers for scalars. In 32-bit mode, use 4-byte integers.  In 64-bit 
mode, use 8-byte integers.  However, for large arrays of integers, consider using 1- or 
2-byte integers. 

• 

Use the smallest floating-point precision appropriate to your computation. Use 16-
byte floating-point or 32-byte complex data types only when you require extremely 
high precision. 

• 

Obey all language aliasing rules.  Try to avoid 

-qassert=nostd

 in Fortran and  

-qalias=noansi

 in C/C++. 

• 

Use local variables wherever possible for loop index variables and bounds. In C/C++, 
avoid taking the address of loop indices and bounds. 

• 

Keep array index expressions as simple as possible. Where indexing needs to be 
indirect, consider using the 

PERMUTATION

 directive/pragma. 

Fortran Programming Tips 

 

• 

Use the 

xlf90

 or 

xlf95

 driver invocations where possible to ensure portability and 

maximize the use of automatic storage. If this is not possible, consider using the  

-qnosave 

option. 

• 

When writing new code, use module variables rather than common blocks for global 
storage. 

• 

Use modules to group related subroutines and functions. 

• 

Use 

INTENT

 to describe usage of parameters. 

background image

• 

Limit the use of 

ALLOCATABLE

 variables and 

POINTER

 variables to situations 

which demand dynamic allocation. 

• 

Use 

CONTAINS

 only to share thread local storage. 

• 

Avoid the use of 

-qalias=nostd

 by obeying Fortran aliasing rules. 

• 

When using array assignment or 

WHERE

 statements, pay close attention to the 

generated code with 

-qlist

 or 

-qreport

. If performance is inadequate, consider using  

-qhot

 or rewriting array language in loop form. 

C/C++ Programming Tips 

 

• 

For C, use the 

xlc

 or 

xlc_r

 invocations rather than 

cc

 or 

cc_r

 when possible. 

• 

Always include 

string.h 

when doing string operations and 

math.h 

when using the 

math library. 

• 

Pass large class and structure parameters by address or reference, pass everything else 
by value where possible. 

• 

Use unions and pointer type-casting only when necessary and try to follow ANSI type 
rules. 

• 

If a class or structure contains a 

double

 variable, consider putting it first in the 

declaration.  If this is not possible, consider using 

-qalign=natural

• 

In C++, avoid virtual functions and virtual inheritance unless required for class 
extensibility. These are costly in object space and function invocation performance. 

• 

Use 

volatile

 only for truly shared variables. 

• 

Use 

const

 for globals, parameters and functions whenever possible. 

• 

Do limited hand-tuning of small functions by defining them as 

inline

 in a header file. 

 

Summary 

The IBM XL compiler family offers premier optimization capabilities on the AIX, Linux, 
and Mac OS X PowerPC platforms.  You can control the type and depth of optimization 
analysis through compiler options which allow you choose the levels and kinds of 
optimization best suited to your application.  IBM’s long history of compiler 
development gives you control of mature industry-leading optimization technology like 
Interprocedural Analysis (IPA), High-Order loop Transformations (HOT), Profile-
Directed Feedback (PDF), Symmetric-Multiprocessing (SMP) optimizations, as well as a 
unique set of PowerPC optimizations that fully exploit the hardware architecture’s 
capabilities.  This optimization strength combines with the XL compiler family’s 
robustness, capability, and standards conformance on multiple hardware and operating 
system platforms to produce a product set unmatched in the industry. 

Trial Versions and Purchasing 

You can download trial versions of the XL compilers for Mac OS X, Linux, and AIX at 
these IBM web sites: 

http://www.ibm.com/software/awdtools/fortran/xlfortran 
http://www.ibm.com/software/awdtools/ccompilers 

 

background image

Information on how to buy these compilers is also available at the above web sites.  You 
can also obtain the Mac OS X versions of the XL compilers from Absoft© Corporation, 
an IBM OEM partner, at 

http://www.absoft.com

.  

 

Contacting IBM 

IBM welcomes your comments. You can send them to 

compinfo@ca.ibm.com

 or mail 

them to this address: 

XL Compiler Development 
Department 257 
Application Development Technology Centre 
Software Division Toronto Laboratory IBM Canada Ltd. 
8200 Warden Avenue 
Markham, Ontario 
Canada – L6G 1C7 

Copyright Notice 

© Copyright IBM Corp. 2004. All Rights Reserved. 
 
IBM is trademark or registered trademark of International Business Machines 
Corporation in the U.S., other countries or both.