There are recurring questions about optimizing for the Apollo Core. The most obvious choice is AMMX, available with Gold2 (see viewtopic.php?f=10&t=25).
But of course, the Apollo Core has a number of additional tricks up its sleeve. I'm going to outline some of these in subsequent posts. In general, the Apollo Core will be quite happy with the optimizations we used to apply to the 68060. In fact, it continues the in-order superscalar execution concept of the 68060.
The rule regarding one memory access per cycle still applies. With this important aspect in mind, the Apollo Core can run most of the 68k arithmetic operations on the second pipe. Some examples (2nd pipe code is indented by one additional space):
move.l (a0)+,d0 ;1 cycle
move.l (a1)+,d1 ;next cycle, no superscalar
;
move.l (a0)+,d0 ;superscalar, same cycle
 add.l d1,d2 ;(like 68060)
;
mulu.w d3,d4 ;mul is first pipe only (!)
 move.l (a0)+,d0 ;but we can do a mem access on 2nd pipe
;
PMUL88.w #$CAFE,E0,E1 ;Gold2 AMMX is first pipe only
 addq.l #3,d0 ;but the second pipe is still there
|
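To put the pairing rule above to work, here is a minimal sketch of a summing loop. It is not from the original post: the register roles and the even element count are assumptions for illustration. Each pair puts the single memory access first and a register-only add second, with the second-pipe candidate indented by one space. Whether the core pairs every line exactly as annotated is not verified here.

```
;a0 = source, d7.w = (number of longwords / 2) - 1, count assumed even
;result: d0 = sum of all longwords
        moveq   #0,d0           ;accumulator
        moveq   #0,d2           ;nothing loaded yet
.loop   move.l  (a0)+,d1        ;memory access
         add.l  d2,d0           ;register-only add of the value loaded one pair earlier
        move.l  (a0)+,d2        ;memory access
         add.l  d1,d0           ;register-only add of the value loaded one pair earlier
        dbf     d7,.loop
        add.l   d2,d0           ;add the last element loaded
```

Each add consumes the value loaded one pair earlier, so no pair has a register dependency between its two instructions.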
But of course, there's more. The truly outstanding feature of the Apollo Core, at least in the 68k world, is instruction fusing. A number of typical instruction combinations can be executed jointly by the Apollo Core (see also http://www.apollo-accelerators.com/wiki ... d=cpu_fuse). The first rule is that the instructions must appear consecutively in the code. At first, this seems counter-intuitive with respect to superscalar optimization, but it helps quite a bit. The next example is an abs(int8) calculation for two variables (d3,d4):
move.l d3,d5 ;F - first example of fusing capabilities
asr.b #8,d5 ;F (1/2 cycle)
move.l d4,d6 ;F same as above, so 4 instructions,
asr.b #8,d6 ;F 1 cycle
eor.l d5,d3 ;
eor.l d6,d4 ;
sub.b d5,d3 ;so, 3 cycles total for two branchless
sub.b d6,d4 ;abs.b (preload shift value for .w,.l)
;
tst.b d3 ;tst just to ensure the flags
bge.s .noneg ;F - yes, this combination is rewritten
neg.b d3 ;F think "movec" as another use
.noneg moveq #0,d0 ;-> 0.5 cycle for abs.b,abs.w,abs.l
;
abs.b d3 ;Apollo exclusive: abs.b/w/l
min.b d3,d1 ;Apollo exclusive: d1=min(d1,d3) (.b/w/l)
max.w d4,d5 ;Apollo exclusive: d5=max(d4,d5) (.b/w/l)
|
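The comment about preloading the shift value for .w and .l can be sketched as follows. This is an illustration, not code from the original post: d7 as the shift-count register is an assumption, and whether the move/shift pair still fuses when the count comes from a register is not confirmed here. The sequence uses an arithmetic shift so that the mask becomes all ones for negative values.

```
        moveq   #15,d7          ;shift count for .w (31 with asr.l/eor.l/sub.l for .l)
        move.l  d3,d5           ;copy the value
        asr.w   d7,d5           ;d5.w = $FFFF if d3.w is negative, else 0
        eor.w   d5,d3           ;flip all bits of a negative value
        sub.w   d5,d3           ;subtract -1 (i.e. add 1) to finish the negation
```

As with any two's complement abs, the most negative value ($8000 for .w) comes back unchanged.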
The examples above show some features of the Apollo Core, including Fusing (continued below), Conditional Rewrite (=single cycle instruction following a bcc) and new instructions.
The type of fusing demonstrated in the previous example also applies to AND/OR/SUB/ADD as the operation following a move. Unsigned extension of memory data to 32 bits is quite cheap on Apollo (see below for notes on sign extension).
moveq #0,D0 ;F unsigned extension in 1/2 cycle
move.b (a0),D0 ;F
move.l D1,D2 ;F who needs 3-op-logic after all?
add.l D3,D2 ;F (except for code density...)
|
Please note that the fusing examples use a .l construct in the first instruction. The Apollo Core can read two registers per cycle per ALU. To keep code execution fast, it is not advisable to use partial move operations (a rule already known from the 68060).
Some limitations to the second pipe apply:
- no PC-relative addressing
- max. 8 Byte instruction length
- only instructions executing in one cycle (no mul/div)
- no instructions referencing the flags (no BCC,SCC)
(exception: dbf works in 2nd pipe)
|
Also, the second pipe can update address registers (An)+ or -(An), but only if the first pipe doesn't update an address register in the same cycle:
addq.l #1,a0 ;two cycles
move.l (a2)+,D0 ;
;
move.l (a2)+,D0 ;one cycle
addq.l #1,a0 ;
|
Now for some remarks regarding indexed addressing. We know from the 68060 that the fast-forward from one instruction to another doesn't apply to indexed addressing. Code like
addq.l #1,d4 ;
move.b (a0,d4.l*2),d5 ;don't do this _EVER_
|
leads to a slow bubble (2 cycles). The EA calculation is done early in the pipe, so there needs to be some distance between the calculation of the index register and the indexed move that uses it. Apollo can help here if there is no easy way to move these calculations and moves apart. Since Gold2, the Apollo Core updates An in the EA unit when the source is a register.
addq.l #1,A1 ;A1 update in EA unit
move.l (A0,A1),D5 ;no bubble
add.l D1,A1 ;A1 update in EA unit
move.l (A0,A1),A2 ;still executes fast
;(A2 updated in ALU, don't use immediately)
|
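Where the index has to stay in a data register, the usual remedy is to schedule independent work between the index update and the indexed access. A minimal sketch follows. The filler instructions and registers are made up for illustration, and since the post does not state how much distance is needed, the number of filler instructions is an assumption.

```
        addq.l  #1,d4           ;index update
        move.l  (a1)+,d0        ;independent work, does not touch d4
        add.l   d0,d2           ;
        move.l  (a2)+,d1        ;
        add.l   d1,d3           ;
        move.b  (a0,d4.l*2),d5  ;indexed access, now well behind the update
```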
Here is a quick list of commonly used fused code constructs:
move.l (Am)+,(An)+ ;F to move.q (Am)+,(An)+
move.l (Am)+,(An)+ ;F
;
move.b (d16,An),Dn ;F signed extension in 1/2 cycle
extb.l Dn ;F
;
move.w (d16,An),Dn ;F
ext.l Dn ;F
;
move.l Dm,Dn ;F
not.X Dn ;F (X=b/w/l)
;
move.l Dm,Dn ;F
neg.X Dn ;F (X=b/w/l)
;
move.l Dm,Dn ;F (same for subq)
addq.X #Val,Dn ;F (X=b/w/l)
;
move.l Dm,Dn ;F (same for sub)
add.X Do,Dn ;F (X=b/w/l)
;
move.l Dm,Dn ;F
andi.w #num,Dn ;F
;
move.l Dm,Dn ;F same applies for or.X
and.X Do,Dn ;F
;
moveq #val,Dn ;F same applies for or.X
and.X Dm,Dn ;F
|
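As a usage sketch for the first entry in the list, a plain copy loop can be built from that move pair. This is illustrative only: the register assignment and the assumption that the length is a multiple of 8 bytes are mine, and no cycle counts are claimed.

```
;a0 = source, a1 = destination, d7.w = (bytes / 8) - 1
.copy   move.l  (a0)+,(a1)+     ;F
        move.l  (a0)+,(a1)+     ;F
        dbf     d7,.copy        ;dbf counts on the low word of d7 only
```

Because dbf uses a 16-bit counter, one pass of this loop covers at most 65536 iterations, i.e. 512 KB.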
Another feature is Bonding. Explanations and examples will follow at a later time.
Some other notes on performance tuning:
- Apollo Core executes many Bitfield operations in a single cycle. This is very useful for VLC decoding (audio/video).
- Code alignment is not necessary. The Apollo ICache can fetch 16 bytes in a row from any even address.
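For the bitfield remark, a VLC-style read of a fixed-width field might look like the sketch below. It uses only the standard 68020+ bfextu instruction. The register assignment and the 5-bit field width are assumptions for illustration, and the post does not say which bitfield forms run in a single cycle.

```
;a0 = start of bitstream, d1 = current bit offset from a0
        bfextu  (a0){d1:5},d0   ;d0 = next 5 bits, zero-extended
        addq.l  #5,d1           ;advance the bit position
```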