The rule of one memory access per cycle still applies. With this in mind, the Apollo Core can run most 68k arithmetic operations on the second pipe. Some examples (2nd pipe code with additional indentation by 1):

```
move.l (a0)+,d0 ;1 cycle
move.l (a1)+,d1 ;next cycle, no superscalar
;
move.l (a0)+,d0 ;superscalar, same cycle
 add.l d1,d2 ;(like 68060)
;
mulu.w d3,d4 ;mul is first pipe only (!)
 move.l (a0)+,d0 ;but we can do a mem access on 2nd pipe
;
PMUL88.w #$CAFE,E0,E1 ;Gold2 AMMX is first pipe only
 addq.l #3,d0 ;but the second pipe is still there
```

```
move.l d3,d5 ;F - first example of fusing capabilities
asr.b #7,d5 ;F (1/2 cycle), d5.b = 0 or -1 (sign mask)
move.l d4,d6 ;F same as above, so 4 instructions,
asr.b #7,d6 ;F 1 cycle
eor.l d5,d3 ;
eor.l d6,d4 ;
sub.b d5,d3 ;so, 3 cycles total for two branchless
sub.b d6,d4 ;abs.b (preload shift value for .w,.l)
;
tst.b d3 ;tst just to ensure the flags
bge.s .noneg ;F - yes, this combination is rewritten
neg.b d3 ;F think "movec" as another use
.noneg moveq #0,d0 ;-> 0.5 cycle for abs.b,abs.w,abs.l
;
abs.b d3 ;Apollo exclusive: abs.b/w/l
min.b d3,d1 ;Apollo exclusive: d1=min(d1,d3) (.b/w/l)
max.w d4,d5 ;Apollo exclusive: d5=max(d4,d5) (.b/w/l)
```
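
For the .w and .l variants, the comment above suggests preloading the shift value, since an immediate shift count on the 68k is limited to 1..8. A minimal sketch for a branchless abs.w follows. The use of d7 as a scratch register is an assumption, and whether the move/shift pair still fuses with a register shift count is not confirmed here.

```
moveq #15,d7 ;assumption: d7 is free, holds the shift count
move.l d3,d5 ;copy the value
asr.w d7,d5 ;d5.w = $0000 if d3.w >= 0, $FFFF if negative
eor.w d5,d3 ;flip all bits when negative
sub.w d5,d3 ;subtracting -1 adds 1, completing the negation
```

As with any two's-complement abs, the most negative value ($8000) comes back unchanged.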

The type of fusing demonstrated in the previous example also applies to AND/OR/SUB/ADD as the operation following a move. Unsigned extension of memory data to 32 bits is quite cheap on Apollo (see below for notes on sign extension).

```
moveq #0,D0 ;F unsigned extension in 1/2 cycle
move.b (a0),D0 ;F
move.l D1,D2 ;F who needs 3-op-logic after all?
add.l D3,D2 ;F (except for code density...)
```

Some limitations to the second pipe apply:

```
- no PC-relative addressing
- max. 8 Byte instruction length
- only instructions executing in one cycle (no mul/div)
- no instructions referencing the flags (no BCC,SCC)
(exception: dbf works in 2nd pipe)
```

```
addq.l #1,a0 ;two cycles
move.l (a2)+,D0 ;
;
move.l (a2)+,D0 ;one cycle
addq.l #1,a0 ;
```

```
addq.l #1,d4 ;
move.b (a0,d4.l*2),d5 ;don't do this _EVER_
```

```
addq.l #1,A1 ;A1 update in EA unit
move.l (A0,A1),D5 ;no bubble
add.l D1,A1 ;A1 update in EA unit
move.l (A0,A1),A2 ;still executes fast
;(A2 updated in ALU, don't use immediately)
```
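
Taken together, the two examples above suggest that the problem case is an index register updated in the ALU right before it is used for address generation, while address register updates are handled in the EA unit. Under that reading, a sketch of the byte lookup with the index moved to an address register could look like this. Using a1 as the index, and the idea that the scaled form behaves like the unscaled (A0,A1) case, are both assumptions.

```
addq.l #1,a1 ;index kept in an address register
move.b (a0,a1.l*2),d5 ;same lookup as before, index now taken from a1
```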

```
move.l (Am)+,(An)+ ;F to move.q (Am)+,(An)+
move.l (Am)+,(An)+ ;F
;
move.b (d16,An),Dn ;F signed extension in 1/2 cycle
extb.l Dn ;F
;
move.w (d16,An),Dn ;F
ext.l Dn ;F
;
move.l Dm,Dn ;F
not.X Dn ;F (X=b/w/l)
;
move.l Dm,Dn ;F
neg.X Dn ;F (X=b/w/l)
;
move.l Dm,Dn ;F (same for subq)
addq.X #Val,Dn ;F (X=b/w/l)
;
move.l Dm,Dn ;F (same for sub)
add.X Do,Dn ;F (X=b/w/l)
;
move.l Dm,Dn ;F
andi.w #num,Dn ;F
;
move.l Dm,Dn ;F same applies for or.X
and.X Do,Dn ;F
;
moveq #val,Dn ;F same applies for or.X
and.X Dm,Dn ;F
```
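
As an illustration of how the first pair in the list might be used, here is a sketch of a small copy loop. The register assignments and the label are hypothetical, d0 is assumed to hold the number of 8-byte blocks minus one, and no cycle count is claimed for the loop as a whole.

```
.copy move.l (a0)+,(a1)+ ;F candidate pair, 8 bytes per iteration
move.l (a0)+,(a1)+ ;F
dbf d0,.copy ;dbf counts the low word of d0 down to -1
```

Since dbf only uses the low 16 bits of the counter, a single loop like this covers at most 65536 blocks.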

Some other notes on performance tuning:

- Apollo Core executes many bitfield operations in a single cycle. This is very useful for VLC decoding (audio/video).

- Code alignment is not necessary. The Apollo ICache can fetch 16 bytes in a row from any even address.
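
To illustrate the bitfield note above, here is a sketch of one VLC read step using the standard 68020 bitfield extract. The register roles (a0 = bitstream base, d1 = current bit offset, d2 = code length in bits) are assumptions for illustration, and which bitfield forms actually execute in a single cycle is not specified here.

```
bfextu (a0){d1:d2},d0 ;d0 = d2 bits starting d1 bits past (a0), zero-extended
add.l d2,d1 ;advance the bit position by the code length
```

A width taken from a data register is interpreted modulo 32, with 0 meaning 32 bits, so d2 should hold a value in the range 1..32.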