analyze88(1)

NAME

analyze88 − performance analysis and optimization of 88000 object files.

SYNOPSIS

analyze88 [−A] [−D flag] [−H] [−N] [−O file] [−P file] [−S section] [−X routine] [−a routine] [−d file] [−g file] [−i] [−r file] [−s routine] [−v] [−Z keyword] file

OPTIONS

−A Include all the routines in the analysis. This is the default mode of operation.

−D flag Turn on the specified debug flag. You will not be interested in using this unless you know a lot about the inner details of analyze88.

−H Print a summary of the command usage.

−N Set the list of routines to be analyzed to the empty set. This overrides the default setting (which corresponds to −A above).

−O file Generate a new program file in file which has been optimized by replacing many of the two-instruction sequences (which are required to reference global memory locations) with single instructions which use the reserved linker registers (r26 through r29) as base registers. This allows faster access to the four most commonly referenced 64k data blocks. Any references to the first 64k of the text section are also optimized to use r0 as a base register. Certain library routines that are known to access the linker registers (setjmp and longjmp) are automatically excluded from the optimization process. The −X option may be used to specifically exclude others. (Normally any reference to a linker register will cause an error.)

−P file Generate a new program file in file which has been patched to gather profiling statistics on each basic block and dump them to file.prof on exit. The report88 program can be used to generate various reports from this information. The −X option may be useful with this option.

−S section Normally only the text section of the object file is analyzed. If you wish to analyze some other section, specify the name with this option.

−X routine Declare routine to be the name of a subroutine which causes the program to exit. When the −P option is used, this routine, when called, will dump the accumulated statistics to the .prof file. After writing the statistics data set to the .prof file, the statistics are reset to zero. When the −O option is used, the −X option will exclude the named routine from the optimization.

−a routine Analyze the specific named routine. The leading underscore is required. This can be used after −N to add a routine to the list. If used without −N, it assumes you meant to specify −N, and supplies one for you.

−d file Generate a detailed disassembly listing of each routine included in the analysis. The listing is done on a per basic block basis. By default this only generates the assembler listing, the clock cycle each instruction executes at (relative to the beginning of each basic block), and the reason any instruction is delayed. Use the −v option for more detail.

−g file Generate global program statistics to file.

−i Print various informative bits of information about the object file.

−r file Print summary statistics for each routine to file.

−s routine Remove the named routine from the list to be analyzed. It pairs with the −A option much as −a pairs with −N, only inverted: start with every routine, then subtract the ones you do not want.

−v If this option is used, the disassembly listing is annotated with the details of which instructions are using which machine resources at each cycle.

−Z keyword The −Z option is used to pass some keyword options to analyze88, which are obscure and seldom used. The keywords recognized are:

old-time: This tells analyze88 to compute max time using the old algorithm rather than the new one. The old algorithm added all pending cycles to the end of all blocks to get max time. The new algorithm does not add the cycles to the end of blocks that end in call instructions.

retain: The Harris link editor currently adds extra relocation information to the object file so analyze88 can optimize things like assigned gotos correctly. Normally this information is stripped after optimization. If you are going to want to profile or disassemble the program file, this option will retain the extra relocation information so the additional processing can be more accurate.

oldline: This forces analyze88 to use the old (pre OCS) format for COFF file line number information.

break=<name>: Tell analyze88 the name of the global variable used to contain the break address. This variable is used by the brk() and sbrk() routines to track the next available heap address. When using the −P option, the initial value of this variable must be patched. The default name is curbrk.

DESCRIPTION

The analyze88 tool is available only on Series 4000 and Series 5000 systems. It reads COFF files for the Motorola 88000 architecture. It interprets the instructions in the file to find the routines and basic blocks within each routine, then does a local timing analysis for each basic block to determine information about time spent in the block, places where execution is delayed due to pipe constraints, etc.

The lowest level of detailed output is generated with the −d option, which generates a disassembly listing, and the −v option which annotates that listing with detailed information on the resources being used.

Because there is so much information, it had to be compressed into a fairly cryptic form:

t=# This indicates the relative clock time. Everything on the same line happens at the same time.

u#r An entry that starts with the letter u indicates a resource is now being used. The number following the u is the sequence number of the instruction within the basic block that is using the resource. The resource name appears immediately following the number (resource names are things like registers or pipeline stages).

f#r An entry that starts with f indicates the instruction at the given sequence has now freed the resource.

b#r[#] The b entries describe an instruction being blocked because it needs a resource. The number at the end enclosed in brackets is the sequence number of the instruction which currently has the resource and is the cause of the block.
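The legend above can be turned into a small parser. The following is only a sketch: the source does not show a sample listing, so it assumes that the entries on a line are separated by whitespace and that a resource name begins with a letter. The function name parse_annotations and the dictionary keys are hypothetical.

```python
import re

# Assumed token shapes, derived only from the legend (t=#, u#r, f#r, b#r[#]).
# The real listing layout may differ.
TOKEN = re.compile(
    r"^(?:t=(?P<time>\d+)"
    r"|(?P<kind>[ufb])(?P<seq>\d+)(?P<res>[A-Za-z_]\w*)(?:\[(?P<holder>\d+)\])?)$"
)

KINDS = {"u": "use", "f": "free", "b": "blocked"}


def parse_annotations(line):
    """Split one annotation line into (clock_time, list_of_events)."""
    time = None
    events = []
    for tok in line.split():
        m = TOKEN.match(tok)
        if m is None:
            continue  # skip anything that does not look like a legend entry
        if m.group("time") is not None:
            time = int(m.group("time"))
            continue
        event = {
            "kind": KINDS[m.group("kind")],
            "seq": int(m.group("seq")),        # instruction sequence number
            "resource": m.group("res"),        # register or pipeline stage
        }
        if m.group("holder") is not None:
            # For b entries: the instruction currently holding the resource.
            event["holder"] = int(m.group("holder"))
        events.append(event)
    return time, events
```

With these assumptions, a line such as "t=3 u2r5 b3r5[2]" would read as: at cycle 3, instruction 2 starts using r5, and instruction 3 is blocked on r5 by instruction 2.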

When analyze88 prints an instruction out, it puts it on a line by itself with the clock time it started execution on the end. The fields on the line represent the source line number (blank if no debug information is available in the file), the sequence number within the block, the absolute address of the instruction in the file, the four-byte hex for the instruction itself, then the symbolic disassembly of the instruction.

Currently, max time is defined as the total number of cycles required for all instructions in the block to completely make it through all pipe stages. It therefore represents an extreme worst case upper bound.

Note that all times are strictly local: a block containing a subroutine call includes only the time for the call instruction itself. No information is computed about the time actually spent in the subroutine, and no information is known about the state of the pipelines when the subroutine returns. The max time for a block ending in a subroutine call does not count any cycles remaining in the pipe at the time the call is made, because most of these cycles never cause any delay (the subroutine is usually still in the prolog when the pipe drains).
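The difference between the default max time rule and the −Z old-time rule can be sketched as follows. This is an illustration, not the tool's code: splitting a block's time into issue_cycles (cycles until the last instruction starts) and pending_cycles (cycles still needed to drain the pipes) is an assumption made for the example, and all names are hypothetical.

```python
def max_time(issue_cycles, pending_cycles, ends_in_call, old_time=False):
    """Return the max time of one basic block under the stated assumptions.

    issue_cycles   -- hypothetical: cycles elapsed when the last instruction issues
    pending_cycles -- hypothetical: cycles left for the pipes to drain afterwards
    ends_in_call   -- True if the block ends in a subroutine call
    old_time       -- True to mimic the behaviour described for -Z old-time
    """
    if ends_in_call and not old_time:
        # New rule: pending cycles are not charged to blocks ending in a call.
        return issue_cycles
    # Old rule (and all non-call blocks): pending cycles are always added.
    return issue_cycles + pending_cycles
```

For a block with 10 issue cycles and 4 pending cycles that ends in a call, this sketch gives 10 by default and 14 when old_time is set.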

analyze88 may be invoked at link time by using the Harris link editor’s -O option.

STATISTICS

The analyze88 tool computes several statistics, some of which are more meaningful than others, but all are designed to help someone analyze the quality of generated code for the 88000 architecture.

BURT Burt stands for Bogus Uniform Routine Time, and (as its name indicates) is a fairly bogus statistic which may have some value as a guide. It is computed by multiplying the max time for each basic block by a weighting factor that increases rapidly as the loop nesting level goes up. The accumulated time for all the blocks is the BURT number.

ERNIE Ernie is External Routine Necessary Interface Executions, and is a statistic designed to help you decide if BURT numbers are different because subroutines have been inlined (or vice-versa), or if they are different simply because of different code quality. ERNIE is computed by simply adding up all the nesting level factors for any block that contains a subroutine call.

The above statistics all depend on accurately computing loop nesting levels. If the flow graph is irreducible, then it is difficult to decide just what a loop is, so a warning is generated for routines with irreducible flow graphs. Often when code finally gets generated, a single basic block will be the header of several back edges. Each back edge is counted as a separate loop, so the nesting level for the header may get very high.
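The BURT and ERNIE definitions above can be sketched in a few lines. The actual weighting factor is not given in this page, only that it "increases rapidly" with nesting level, so the power-of-ten weight below is purely an assumption. The Block type and the function names are hypothetical as well.

```python
from dataclasses import dataclass


@dataclass
class Block:
    max_time: int      # max time of the basic block, in cycles
    nesting: int       # loop nesting level (0 = not inside any loop)
    has_call: bool     # True if the block contains a subroutine call


def nesting_weight(level, base=10):
    """Assumed weighting factor; the real one used by analyze88 is not documented here."""
    return base ** level


def burt(blocks):
    """Sum of max time times nesting weight over all blocks."""
    return sum(b.max_time * nesting_weight(b.nesting) for b in blocks)


def ernie(blocks):
    """Sum of nesting weights over the blocks that contain a call."""
    return sum(nesting_weight(b.nesting) for b in blocks if b.has_call)


routine = [
    Block(max_time=5, nesting=0, has_call=False),
    Block(max_time=8, nesting=1, has_call=True),
    Block(max_time=3, nesting=2, has_call=False),
]
print(burt(routine), ernie(routine))
```

Under this assumed weight, inlining a call inside a loop would raise BURT (more cycles in the loop body) while lowering ERNIE, which is the kind of difference the two numbers are meant to separate.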

OPTIMIZATION

The −O option invokes the post-linker optimization code. This attempts to create what amounts to program-wide common sub-expressions which are kept in the linker reserved registers r26 through r29.

Every attempt has been made to make this optimization work correctly. Extensive flow analysis of register contents is done to ensure that only correct substitutions will be done, but there is no guarantee that this optimization will always work. Programs should be tested extensively after running them through this optimization.

Fortran programs with assigned goto statements will break unless the program being optimized was produced by the Harris link editor with the additional relocation information in the vendor section, so anyone attempting to optimize Fortran should be very cautious. The −X option can be used to exclude any routines known to contain assigned gotos from the list.

If analyze88 detects any routines that already reference any of the linker registers prior to optimization, it will generate a fatal error and refuse to optimize the program. Sometimes certain assembly routines can reference these registers in a harmless fashion. The setjmp and longjmp routines, along with some signal handling code in _sigtramp are known routines that are automatically excluded from optimization. If you have any other routines that reference these registers, you can still optimize by naming them with the −X option, which will cause analyze88 to ignore them.

PROFILING

The −P option patches the input program, generating a new program which will accumulate cycle count statistics at the basic block level and dump them to an output file on exit. The statistics are always dumped to a file with the same name as the executable given as the argument to −P, with the .prof suffix added. For example, if you specified −Pfred then when you run the generated program fred, the file fred.prof will be generated with the profiling statistics.

Currently, the statistics are only as accurate as the timing information shown in the disassembly listing. Both min and max times are accumulated, so the report can only print upper and lower bounds on the cycle count. A future version may attempt to add code that will correct the cycle count with additional information gathered about pipe conflicts that will occur depending on the arc followed to reach each basic block.

It is often difficult to profile some programs, especially those generated by non-Harris compilers. The following guidelines are given as an aid to people attempting to profile foreign code:

The analyze88 tool relies on the symbol table to find subroutine entry points. A stripped program cannot be profiled. Even if a symbol table exists, analyze88 can only identify subroutine entry points if they have associated tdesc information, if they have symbolic debug information identifying them as subroutine entry points, or if they are explicitly named using the −a option.

analyze88 records its profile statistics by writing them into the .bss section. The header of the COFF file is modified to reserve space in .bss, but the runtime environment also needs to be informed that the space is being used. analyze88 does this by first attempting to patch the initial value of the global variable (curbrk) used by the library routines to record the break address. If this variable is not found in the symbol table it then attempts to patch a call to brk() into the main entry point. If it cannot find the brk() entry point in the symbol table, then it cannot successfully patch the program. It may be necessary to re-link the program, forcing the brk() routine to be included by linking in an additional object file that references it, or use the −Zbreak=<name> option to specify a different name for the break variable.
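The fallback order described above amounts to a short decision procedure. The sketch below only restates that order; it is not the tool's implementation. It assumes the symbol table is available as a plain set of names, and the exact spelling of the symbols (for example, whether they carry a leading underscore) is an assumption, which is why both names are parameters.

```python
def choose_break_patch(symbols, break_name="curbrk", brk_name="brk"):
    """Pick how the reserved .bss space would be made known to the runtime.

    symbols    -- hypothetical: set of names found in the program's symbol table
    break_name -- name of the break variable (cf. -Zbreak=<name>)
    brk_name   -- assumed symbol name of the brk() entry point
    """
    if break_name in symbols:
        # First choice: patch the initial value of the break variable.
        return "patch-break-variable"
    if brk_name in symbols:
        # Second choice: patch a call to brk() into the main entry point.
        return "patch-brk-call"
    # Neither symbol is present: the program cannot be patched as-is.
    return None
```

A result of None corresponds to the case where the program must be re-linked so that brk() is included, or where −Zbreak=<name> must name a different break variable.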

Finally, analyze88 writes the statistics out by patching in a call to the write routine when the __exit routine is called (note the two leading underscores). If the low level exit routine is not called __exit, or if the program exits in a different way (possibly by calling exec()), then you will need to use the −X option to name the routines that should dump statistics.

After dumping the statistics at an exit point, all the basic block counts are set to zero. This feature allows you to divide your program into separate sections which will be profiled independently, each generating a separate data set in the .prof file. All you need to do is call a dummy routine once between each section of the program, then use the −X option to declare these dummy routines as exit points.

If any basic block begins with a trap instruction of some kind, analyze88 will generate a warning. Normally it relies on the flow of control resuming right after the patched instruction, but it is uncertain where control will resume after the kernel gets control. Unless you know what the routine does, it might be wise to exclude it from the list of routines to be profiled.

BUGS

The timing information is not totally accurate. The model that analyze88 uses to simulate the progress of instructions through the various pipes cannot currently represent feed-forward, or the way feed-forward interacts with obtaining a writeback slot. Due to this limitation, the model simply assumes that writeback slots will always be available, and ignores the issue.

The worst case timing information should really be generated by propagating live on entry resource utilizations backwards through the flow graph to see how they interact with live on exit resource utilizations from the predecessor blocks, but this is complex and would require a great deal more code to do the analysis.

Finally, the timing information assumes there will never be any cache misses or memory wait states since a static analysis cannot know if a memory reference will be in the cache or not.

Read the section on optimization above for warnings about bugs in the program optimizer.
