1. Introduction

1.1 Overview of the Problem

  • The Jimple representation has no notion of a stack.
  • When transforming bytecode into a Jimple representation, stack variables are created to hold the values of the computations occurring on the Java stack.
  • When going from a Jimple representation back to bytecode, many redundant store and load instructions are produced, resulting in inefficient, verbose code. These redundant stores and loads usually reference stack variables.
  • To remedy this problem, the Baf representation was created, as well as the Baf optimizations presented here.

1.2 Current Scope of Baf Optimizations

The current set of Baf optimizations, as implemented by the class LoadStoreOptimizer, catches only the following types of redundant store/load patterns.

  1. Degenerate Store: a store instruction whose stored value is never used.
    Action taken: the store is replaced by a pop instruction.
  2. Store/Load (sl pair): a store instruction followed by a load instruction in the same block, both referring to the same local variable.
    Pre-Condition: The value stored is only used by that single load.
    Action taken: Annihilate both the store and the load instructions.
  3. Store/Load/Load (sll triple): a store instruction followed by 2 load instructions in the same block, all referring to the same local variable.
    Pre-Conditions: The value stored is only used by those two loads.
    Action taken: Replace the sll triple with a flavor of the dup instruction.

1.3 Examples: Simple Cases

Case0: degenerate store


      public static  void main(java.lang.String[])
      {
	  r0 := @parameter0;
	  load.r r0;
	  staticinvoke runBenchmark;
	  store.l $l0;
	  return;
      }


      reduces to:

      public static  void main(java.lang.String[])
      {
	  r0 := @parameter0;
	  load.r r0;
	  staticinvoke runBenchmark;
	  pop;
	  return;
      }
    

Case1: simple sl pair


    public static long getCachingtime()
    {

	staticinvoke    
        store.l $l0;
        load.l $l0;
        return.l;
    }
    
reduces to:

    public static long getCachingtime()
    {
        staticget ;
        return.l;
    }
    

Case2: simple sll triple


           ...
        store.i $i7;
        load.i $i7;
        load.i $i7;
        fieldget
           ... 
reduces to:

           ...
        dup1
        fieldget
           ...
    

Remarks:

  • The resulting code in case0 can be further simplified by applying peephole patterns.
  • Case 1 is the pattern most often encountered in the generated Baf code.
  • Case 2 in the above form hardly ever (if ever) occurs in the generated Baf code.
  • Often the store and load instructions of cases 1 and 2 are separated by sequences of bytecodes and thus cannot be so easily eliminated. The effect of these interleaving bytecode sequences on the stack must first be determined.

1.4 Interleaved sl pairs and Interleaved sll triples

As hinted in the previous slide, most sl pairs and sll triples have their store and load instructions interleaved with sequences of instructions, which makes it harder to determine whether they can be reduced.

Example:
    new spec.io.FileInputStream;
    store.r $r3;
    load.r $r3;
    load.r r1;
    specialinvoke ;
    load.r $r3;
    store.r r2;
    load.r r2;
    virtualinvoke getContentLength;
    store.i i3;
    load.i i3;
    newarray;
    store.r $r4;
    load.r r0;
    load.r $r4;
    fieldput spec.benchmarks._201_compress.Harness.orig_text_buffer;
    load.i i3;
    newarray;
    store.r $r5;
    load.r r0;
    load.r $r5;
    fieldput spec.benchmarks._201_compress.Harness.comp_text_buffer;

To identify the possible reductions in the above code, the effect on the stack of the interleaving sequences must be determined. Also many of these sequences will affect the stack in such a way that no reductions will be possible without some reordering of instructions.

2. Canonical Examples and Solutions

2.1 Interleaving Sequences

The pathological reduction cases illustrated previously occur when the store and load(s) are interleaved with a sequence of instructions.

Two pieces of information characterise interleaving sequences:

  • The net effect on the stack height after executing such a sequence. This shall be referred to as the net stack height variation, or nshv.
  • The minimum stack height variation attained while executing such a sequence. This shall be referred to as the minimum stack height variation, or mshv.

A sequence of instructions having both nshv == 0 and mshv == 0 will be referred to as a level sequence.
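
As a concrete illustration, here is a minimal sketch of how nshv and mshv can be computed, assuming the per-instruction stack height deltas (values pushed minus values popped) are already known. The class and names are illustrative and do not correspond to the actual Soot code.

    // Illustrative sketch only: the per-instruction stack height deltas
    // (values pushed minus values popped) are assumed to be given.
    final class StackHeightVariation {
        final int nshv;  // net stack height variation of the whole sequence
        final int mshv;  // lowest height reached, relative to the start (always <= 0)

        StackHeightVariation(int[] deltas) {
            int height = 0;
            int min = 0;
            for (int delta : deltas) {
                height += delta;
                if (height < min) {
                    min = height;
                }
            }
            this.nshv = height;
            this.mshv = min;
        }

        // A level sequence neither grows nor shrinks the stack and never
        // dips below its starting height.
        boolean isLevel() {
            return nshv == 0 && mshv == 0;
        }
    }

For instance, a sequence with deltas {+1, -1, +1} has nshv == 1 and mshv == 0, so it is not level.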

2.2 Interleaving Sequences Having Side Effects on the Stack


Consider:

        load.i i0;
        load.i i2;
        add.i;
        store.i $i3;
	push 0;
	store.i i5;
        load.r r1;
        load.i $i3;
        arrayread.b;
        store.i i4;

Here the interleaving sequence contains

load.r r1;

which leaves an item on the stack. Thus the store/load pair cannot simply be eliminated: the stored value would end up buried under the reference pushed by load.r r1, and arrayread.b would find its operands in the wrong order.

Incorrect:
        load.i i0;
        load.i i2;
        add.i;
	push 0;
	store.i i5;
        load.r r1;    
        arrayread.b;  <== reference on top of integer.
        store.i i4;
    

Remarks:

  • Of course a swap instruction could be used here, but that would add an instruction, would apply only to category 1 computational types, and would only be valid for nshv == 1.
  • A more general heuristic for dealing with such situations is simply to try to relocate an instruction of the sequence that has a positive stack height side effect to an earlier location in the block, before the store instruction.

Relocating instructions


Consider:
        load.i i0;
        load.i i2;
        add.i;
        store.i $i3;
	push 0;
	store.i i5;
        load.r r1;
        load.i $i3;
        arrayread.b;
        store.i i4;

Issues involved in relocating a candidate:

  • The following instructions, up to the last load of the sl pair or sll triple, must be independent of the candidate. For example, the first candidate to move out of the interleaving sequence would be push 0. However, the following store.i i5 consumes the pushed value, so this candidate is not valid (see isRequiredByFollowingUnits).
  • A valid new location must be found to relocate the bytecode. Sometimes there are no such locations. For example,
    
    	[start of block]
    	store.i $i3;
    	load.i i5;
    	load.i $i3;
    

    Here load.i i5; cannot be moved to an earlier location in the block: the only position before it is ahead of store.i $i3, where its value would be consumed by the store.

  • Independent of the stack, some instructions with side effects cannot be moved past other instructions safely. For example:
    
    	staticinvoke xx
    	store.i $i3;
    	staticget xx
    	load.i $i3;
    	

    Here staticget cannot be moved before staticinvoke (see canMoveUnitOver).

Going back to our previous example,

        load.i i0;
        load.i i2;
        add.i;
        store.i $i3;
	push 0;
	store.i i5;
        load.r r1;
        load.i $i3;
        arrayread.b;
        store.i i4;
    

  • The second candidate to relocate in the sequence is load.r r1;.
  • We need to find an insertion slot in the block such that the sequence of instructions between this slot and the candidate to move forms a level sequence (a sketch of this slot search is given after the final result below).
  • The first such slot in this example is just after store.i $i3;.
The block can thus be rearranged as follows:

        load.i i0;
        load.i i2;
        add.i;
        store.i $i3;
	load.r r1;
	push 0;
	store.i i5;
        load.i $i3;
        arrayread.b;
        store.i i4;
    

Of course this new ordering does not solve our original problem, as the relocation slot was chosen inside the store/load sequence. We need to find a slot occurring before the store.
The second, and only, valid slot occurs before the first instruction of the block; we get:

        load.r r1;
        load.i i0;
        load.i i2;
        add.i;
        store.i $i3;
	push 0;
	store.i i5;
        load.i $i3;
        arrayread.b;
        store.i i4;
    

Now the interleaving sequence is level, so the store/load reduction can occur according to the criteria previously presented.

Final result:
        load.r r1;
        load.i i0;
        load.i i2;
        add.i;
	push 0;
	store.i i5;
        arrayread.b;
        store.i i4;
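
The slot search used in this example can be sketched as follows. This is a minimal sketch over an array of per-instruction stack deltas, with illustrative names; the actual implementation walks the Baf unit chain and must additionally check, via isRequiredByFollowingUnits and canMoveUnitOver, that the move is legal.

    // Find an insertion slot for a candidate instruction such that the
    // instructions between the slot and the candidate form a level sequence
    // and the slot lies at or before the store of the sl pair. A level gap
    // guarantees that the value pushed by the relocated candidate is still
    // on top of the stack when execution reaches the candidate's old position.
    final class SlotFinder {
        // deltas[i] is the stack height effect of the i-th instruction of the
        // block; returns the slot index, or -1 if no valid slot exists.
        static int findInsertionSlot(int[] deltas, int candidateIndex, int storeIndex) {
            int net = 0; // nshv of the instructions between the slot and the candidate
            int min = 0; // mshv of those instructions, relative to the slot
            for (int slot = candidateIndex; slot >= 0; slot--) {
                if (net == 0 && min == 0 && slot <= storeIndex) {
                    return slot;
                }
                if (slot > 0) {
                    int d = deltas[slot - 1]; // absorb one more instruction into the gap
                    min = Math.min(0, d + min);
                    net = d + net;
                }
            }
            return -1;
        }
    }

On the block above, with load.r r1; as the candidate and store.i $i3; as the store, the only slot returned is the very start of the block, matching the reordering shown in the final result.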
    

Moving the store next to the load

When the interleaving sequence has mshv < 0, the previous techniques will not work, as we cannot leave the stored value on the stack (it would be consumed by the sequence).

Consider:
    add.i;
    staticput spec.io.FileInputStream.num_cached_files;
    staticinvoke currentTimeMillis;
    staticget spec.io.FileInputStream.Cachingtime;
    store.l $l8;
    load.l l6;
    sub.l;
    store.l $l9;
    load.l $l8;
    load.l $l9;

In this situation we can try to move the smallest level block, if any, ending with the store, in order to put it just before the load (see the sketch after this example). In the above example the level block:

	staticget spec.io.FileInputStream.Cachingtime;
	store.l $l8;
    

can be moved to just before load.l $l8;, yielding:

add.i;
staticput spec.io.FileInputStream.num_cached_files;
staticinvoke currentTimeMillis;
load.l l6;
sub.l;
store.l $l9;
staticget spec.io.FileInputStream.Cachingtime;
store.l $l8;
load.l $l8;
load.l $l9;

And now, of course, the sl pair can be reduced:

add.i;
staticput spec.io.FileInputStream.num_cached_files;
staticinvoke currentTimeMillis;
load.l l6;
sub.l;
store.l $l9;
staticget spec.io.FileInputStream.Cachingtime;
load.l $l9;
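
The search for the smallest level block ending with the store can be sketched in the same style, again over an array of per-instruction stack deltas and with illustrative names. The actual pushStoreToLoad must also verify that the moved units can legally be reordered past the intervening instructions.

    // Find the start index of the smallest level block ending with the store,
    // i.e. the shortest suffix deltas[first..storeIndex] whose net and minimum
    // stack height variations are both zero. Such a group has no net effect on
    // the stack and never reaches below its starting height, so it can be
    // moved as a whole to just before the load.
    final class LevelBlockFinder {
        static int findLevelBlockStart(int[] deltas, int storeIndex) {
            int net = 0;
            int min = 0;
            for (int first = storeIndex; first >= 0; first--) {
                int d = deltas[first];
                min = Math.min(0, d + min);
                net = d + net;
                if (net == 0 && min == 0) {
                    return first; // deltas[first..storeIndex] form a level block
                }
            }
            return -1; // no level block ends with this store: give up
        }
    }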

More complicated sll triple optimizations


Consider:
    load.r r0;
    dup1.r;
    fieldget spec.benchmarks._201_compress.Comp_Base.free_ent;
    store.i $i7;
    load.i $i7;
    push 1;
    add.i;
    fieldput spec.benchmarks._201_compress.Comp_Base.free_ent;
    load.i $i7;
    virtualinvoke set;

  • Here the interleaving sequence has a negative minimum stack height variation.
  • Observe that nshv == -2 and mshv == -2.
  • We can therefore use a dup1_x1 instruction here.

Result:
    load.r r0;
    dup1.r;
    fieldget spec.benchmarks._201_compress.Comp_Base.free_ent;
    dup1_x1.i_r;
    push 1;
    add.i;
    fieldput spec.benchmarks._201_compress.Comp_Base.free_ent;
    virtualinvoke set;

3. The Reduction Algorithm

3.1 Design Goals

  • Genericity: strive to catch all the possible reductions, while making the fewest assumptions about the structure of the Baf code (as opposed to peephole patterns).
  • Simplicity: break the optimization down into small steps that are easy to understand, prove correct and extend.
  • Speed: the optimization must run in a reasonable amount of time.

3.2 Algorithm Overview

  • The unit over which the algorithm applies is a Baf method.
  • Given a Baf method, the algorithm performs a fixed point iteration until neither reductions nor reorderings of the method's instructions occur.
  • For a given method, a list of its store instructions is computed and the target cases are identified.
  • The target cases are analysed, and reductions and/or code reorderings are performed for each of them where possible.

3.2.1 Identifying the Target Cases

The following relevant cases must be identified:
  1. Degenerate stores
  2. sl pairs
  3. sll triples
This is done by building a list of all the store instructions in the target method and then checking, for each store, whether it belongs to one of the above 3 categories. If the store belongs to the first category, it is simply replaced by a pop.

If we find an sl pair or an sll triple then more consideration must be given.

3.2.2 The reduction of sl pairs and sll triples

The pathological reduction cases, as illustrated previously, occur when the store and load(s) are interleaved with a sequence of instructions.

Once an sl pair is identified, the algorithm used to reduce it is:

While reordering occurred Do
    If the interleaving sequence is a level sequence then 
        eliminate the pair.
    Else if mshv < 0 then 
        try to identify and move a level sequence ending with the store
        and insert it just before the load instruction.
    Else if mshv == -1 && nshv == -1 then
        if the stack value consumed comes from a dup then eliminate.
    Else if nshv > 0 and mshv >= 0 then 
        try to move an instruction belonging to the interleaving 
        sequence to somewhere before the store instruction in the block.
Done
Once an sll triple is identified, the algorithm used to reduce it is:

Consider the store and the first load as an sl pair and apply a variation of the sl pair algorithm, where the load is moved next to the store in the cases where the sl pair would be eliminated.

If the algorithm succeeds (i.e. the first load now occurs immediately after the store), then:

Consider the store and the second load. (Note that the first load will now be in the interleaving sequence).


While reordering occurred do
    If mshv == 0 && nshv == 0 then 
        replace the store by a dup1 instruction and remove the 2 loads.
    Else If mshv == -1 && nshv == -1 then  
        replace the store by a dup1 instruction and remove the 2 loads.
    Else If mshv == -2 && nshv == -2 then 
        replace the store by a dup1_x1 instruction and remove the 2 loads.
    Else If mshv < -2 then 
        give up
    Else 
        try to move an instruction belonging to the interleaving sequence to 
        somewhere before the store instruction in the block.
Done
Note:
In the above algorithms, "reordering occurred" means that an instruction was successfully moved by the final else clause of each algorithm.
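
The decision logic of the two loops can be summarized by the following sketch. The names are illustrative and do not correspond to the actual Soot source; note that the dup-sourced case of the sl algorithm is tested before the general mshv < 0 case so that it is reachable.

    // Illustrative dispatch mirroring the two pseudocode loops above;
    // nshv and mshv describe the interleaving sequence.
    final class ReductionDispatch {
        enum Action { ELIMINATE, PUSH_STORE_TO_LOAD, MAKE_DUP, MAKE_DUP_X1, TRY_TO_MOVE_UNIT, GIVE_UP }

        // Decision for an sl pair; consumedValueComesFromDup is a hypothetical
        // flag standing in for the "comes from a dup" test of the pseudocode.
        static Action slAction(int nshv, int mshv, boolean consumedValueComesFromDup) {
            if (nshv == 0 && mshv == 0) return Action.ELIMINATE;             // level sequence
            if (nshv == -1 && mshv == -1 && consumedValueComesFromDup) return Action.ELIMINATE;
            if (mshv < 0) return Action.PUSH_STORE_TO_LOAD;                  // move a level block ending with the store
            return Action.TRY_TO_MOVE_UNIT;                                  // nshv > 0 and mshv >= 0
        }

        // Decision for an sll triple, once the first load sits right after the store.
        static Action sllAction(int nshv, int mshv) {
            if (nshv == 0 && mshv == 0)   return Action.MAKE_DUP;            // store becomes dup1, both loads removed
            if (nshv == -1 && mshv == -1) return Action.MAKE_DUP;
            if (nshv == -2 && mshv == -2) return Action.MAKE_DUP_X1;         // store becomes dup1_x1
            if (mshv < -2)                return Action.GIVE_UP;
            return Action.TRY_TO_MOVE_UNIT;
        }
    }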

3.3 Code Specific Details

The algorithm is implemented by the methods optimizeStoreLoads and stackIndependent, which make use of a few helper methods. Here is a prosaic description of the steps in its implementation:

optimizeStoreLoads does the following as a fixed point iteration:

  1. A list of all the stores in the method body is computed and the method then iterates over this list.
  2. For each store, the uses of the stored value are found.
  3. If the store has no uses then the degenerate store action is taken.
  4. If the store has 1 or 2 uses, and these uses are loads in the same basic block, then the sl action or the sll action is taken, respectively.
  5. If an action succeeded or something changed, the fixed point iteration repeats.
For the sl and sll actions:
  1. The optimizeStoreLoads method implements the logic of these algorithms, while stackIndependent implements the simulation of the interleaving sequences and computes the values for mshv and nshv.
  2. More precisely, optimizeStoreLoads calls stackIndependent for a given store/load pair and specifies its context.
  3. If the store/load pair is an sl pair then the context specified is STORE_LOAD_ELIMINATION. Otherwise the store/load pair comes from an sll triple and the context specified is STORE_LOAD_LOAD_ELIMINATION.
  4. The method stackIndependent behaves according to the specified context, in effect implementing either the sl algorithm or the sll algorithm.
  5. For the implementation of the final Else clause in the sl and sll algorithms, stackIndependent calls tryToMoveUnit.
  6. In implementing the Else if mshv < 0 clause of the sl algorithm, stackIndependent calls pushStoreToLoad.
  7. The outcome returned by stackIndependent to optimizeStoreLoads is one of the following:
    • FAILURE when store load elimination is not possible in any form.
    • SUCCESS when the store and load of a store/load pair are now next to each other (no interleaving sequence).
    • MAKE_DUP when an sll triple can be replaced by a dup1 instruction.
    • MAKE_DUP_X1 when an sll triple can be replaced by a dup1_x1 instruction.
    • HAS_CHANGED when store load elimination is not possible in any form, but some unit reordering has occurred.
  8. optimizeStoreLoads then performs the action described by the above return codes.

4. Inter Block Optimizations

  • All the optimizations introduced up to now required as a pre-condition that the store and load(s) involved be contained in the same block.
  • The transformation into Baf produces many stack variables that span across blocks.
  • The vast majority of these can be caught and reduced using 2 simple inter-block optimizations.

Consider:
     block1:	
        load.r r0
        store.r $r3;
        load.r r0;
        fieldget ;
        ifeq label2;
     block2:
        push 257;
        store.i $i8;
        goto label3;

     block3:
        push 256;
        store.i $i8;

     block4:
        load.r $r3;
        load.i $i8;
        fieldput ;

  • The stack variable $r3 is found in blocks 1 and 4.
  • The stack variable $i8 is found in blocks 2, 3 and 4.
  • This code derives from a simple if/else type construct, and as such occurs fairly often.

4.1 First Optimization

Description: Given 2 distinct blocks x and y, where x contains the store and y the load: if the set X of all successor blocks of x is the same as the set Y of all predecessor blocks of y (allowing for y to be in Y if necessary), then an attempt is made to relocate the store into block y.

At present this is done only if the sequence from the store to the block tail in x is level, and all the blocks contained in X are level blocks as well.

If this is the case then the store is moved to the head of block y.
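
A sketch of the control-flow side of this test is given below. The Block type here is a minimal stand-in with successor and predecessor lists, not Soot's Block class, and the level-block conditions on x's tail and on the blocks of X are assumed to be checked separately.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Minimal stand-in types for illustration only.
    final class InterBlockStoreSink {
        static final class Block {
            final List<Block> succs = new ArrayList<>(); // successor blocks
            final List<Block> preds = new ArrayList<>(); // predecessor blocks
        }

        // True when the successors of the store's block x match the
        // predecessors of the load's block y, ignoring y itself on either
        // side (the "allowing for y to be in Y if necessary" tolerance).
        static boolean storeCanSinkToLoadBlock(Block x, Block y) {
            if (x == y) {
                return false; // the store and the load must be in distinct blocks
            }
            Set<Block> succsOfX = new HashSet<>(x.succs);
            Set<Block> predsOfY = new HashSet<>(y.preds);
            succsOfX.remove(y);
            predsOfY.remove(y);
            return succsOfX.equals(predsOfY);
        }
    }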

This is the case in the previous example, where block1 corresponds to x and block4 corresponds to y. The reordering yields:

     block1:	
        load.r r0
        load.r r0;
        fieldget ;
        ifeq label2;
     block2:
        push 257;
        store.i $i8;
        goto label3;

     block3:
        push 256;
        store.i $i8;

     block4:
        store.r $r3;
        load.r $r3;
        load.i $i8;
        fieldput ;

4.2 Second Optimization

Description: Given a block y containing a load, if this load is defined by 2 stores in distinct blocks v and w, and if both blocks v and w have y as their unique successor, and y has v and w as its only predecessors, then an attempt is made to unite and relocate the stores into block y.

At present this is done only if the sequences from the unit following each store in blocks v and w to their respective block tails are level.

If this is the case then both stores are collapsed into one store inserted at the head of block y.

This is the case in the previous example, where y corresponds to block4, v to block2 and w to block3. The reordering yields:

     block1:	
        load.r r0
        load.r r0;
        fieldget ;
        ifeq label2;
     block2:
        push 257;
        goto label3;

     block3:
        push 256;

     block4:
        store.i $i8;
        store.r $r3;
        load.r $r3;
        load.i $i8;
        fieldput ;

Now that the store/load pairs are all in the same basic block, they will be caught and annihilated by the standard optimizations, and the final outcome will be:

     block1:	
        load.r r0
        load.r r0;
        fieldget ;
        ifeq label2;
     block2:
        push 257;
        goto label3;

     block3:
        push 256;

     block4:
        fieldput ;

5. Wrap-up

  • The optimizations presented here are all implemented by the class LoadStoreOptimizer.
  • The class is to be used as a singleton; the static method v is provided and returns an instance of the class.
  • The only public method is optimize; it takes a Body as an argument and performs all the optimizations.
  • The method optimize performs a fixed point iteration calling in turn the methods optimizeStoreLoads and doInterBlockOptimizations.

Many new optimizations could be added to the ones presented here. A few have even been implemented, but still need to be tested to measure their effect on run time. Here are a few pointers:

  • Often a store/load pair is followed by many more loads. These extra loads correspond to dups in code produced by javac. Thus an optimization could convert them to dups and dup2s. In some cases this can dramatically reduce the local variable storage needs.
  • Many single pops could be aggregated as pop2s.
  • Optimizations using the swap instruction could be introduced. A few cases have been identified that have withstood the present optimizations, but could easily be reduced using a swap based optimization.
  • The current interblock optimizations should be generalized and extended.
  • propagateLoadsBackwards
    An easy way to convert multiple loads into dups is to propagate backwards as far as possible all the loads in a block. Doing this aggregates many loads in a block that were previously separated by interleaving sequences of instructions. Once this is achieved, converting them into dups is trivial.
  • propagateLoadsForward
    After calling propagateLoadsBackwards (and possibly converting the loads to dups), the block's stack height can soar enormously. A call to propagateLoadsForward will undo the effect of propagateLoadsBackwards on the stack height by propagating loads and dups as far forward as possible in the block.
  • superDuper1
    This converts a series of loads into dups.
In particular, making use of the following sequence of method calls,
    propagateLoadsBackwards
    optimizeStoreLoads
    propagateLoadsForward
can be useful to reduce certain sl and sll patterns that have resisted the current optimizations. Propagating the loads backwards can expose certain opportunities otherwise difficult to capture in a generic way.
Patrice POMINVILLE
Last modified: Wed Aug 25 13:14:20 EDT 1999