AC68080 Optimization Notes

Post by bax » Tue 31 Jan 2017 15:31

There are recurring questions about optimization for the Apollo Core. The most obvious choice is AMMX, available with Gold2 (viewtopic.php?f=10&t=25). But of course, Apollo Core has a number of additional tricks up its sleeve. I'm going to outline some of these in subsequent posts. In general, Apollo Core will be quite happy with the optimizations we used to apply to the 68060. In fact, Apollo Core continues the in-order superscalar execution concept of the 68060.

The rule of one memory access per cycle still applies. With this important restriction in mind, Apollo Core can run most of the 68k arithmetic operations on the second pipe. Some examples (second-pipe instructions are indented by one extra space):

Code:

	move.l (a0)+,d0	;1 cycle
	move.l (a1)+,d1	;next cycle, no superscalar
	;
	move.l (a0)+,d0	;superscalar, same cycle
	 add.l  d1,d2		;(like 68060)
	;
	mulu.w d3,d4		;mul is first pipe only (!)
	 move.l (a0)+,d0	;but we can do a mem access on 2nd pipe
	;
	PMUL88.w #$CAFE,E0,E1   ;Gold2 AMMX is first pipe only
	 addq.l  #3,d0          ;but the second pipe is still there

But of course, there's more. The truly outstanding feature of Apollo Core (in the "68k world", at least) is instruction fusing. The Apollo Core can execute a number of typical instruction combinations jointly (see also http://www.apollo-accelerators.com/wiki ... d=cpu_fuse). The first rule is that the instructions to be fused must appear consecutively in the code. At first, this seems counter-intuitive with respect to superscalar optimization, but it helps quite a bit. The next example is an abs() calculation on two signed 8-bit variables (d3,d4):

Code:

	move.l  d3,d5	;F - first example of fusing capabilities
	asr.b   #8,d5	;F   (1/2 cycle), d5.b = $00 or $FF (sign mask)
	 move.l d4,d6	;F   same as above, so 4 instructions,
	 asr.b  #8,d6	;F   1 cycle
	eor.l   d5,d3	;
	 eor.l  d6,d4	;
	sub.b   d5,d3	;so, 3 cycles total for two branchless 
	 sub.b  d6,d4	;abs.b (preload shift value for .w,.l)
	;
	tst.b	d3			;tst just to ensure the flags
	bge.s	.noneg	;F - yes, this combination is rewritten 
	neg.b	d3		;F   think "movec" as another use
.noneg moveq	#0,d0	;-> 0.5 cycle for abs.b,abs.w,abs.l
	;
	abs.b    d3	;Apollo exclusive: abs.b/w/l
	min.b   d3,d1	;Apollo exclusive: d1=min(d1,d3) (.b/w/l)
	max.w  d4,d5	;Apollo exclusive: d5=max(d4,d5) (.b/w/l)

The examples above show some features of the Apollo Core, including Fusing (continued below), Conditional Rewrite (a single-cycle instruction following a Bcc) and new instructions.
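
As a cross-check of the branchless variant: the eor/sub pair relies on the identity abs(x) = (x XOR m) - m, where m is all ones for a negative x and zero otherwise. The following Python sketch models the register arithmetic of one lane (d3/d5). It assumes the byte shift fills d5.b with copies of the sign bit; `abs_b_sequence` is an illustrative name, not part of any Apollo tool.

```python
MASK32 = 0xFFFFFFFF

def abs_b_sequence(d3):
    """Model move.l / byte shift / eor.l / sub.b on a 32-bit register value."""
    d3 &= MASK32
    # Assumption: after the shift, d5.b is $FF for a negative byte, else $00.
    sign_mask = 0xFF if d3 & 0x80 else 0x00
    # move.l d3,d5 copies all 32 bits; the byte shift only touches d5.b.
    d5 = (d3 & 0xFFFFFF00) | sign_mask
    # eor.l d5,d3: the upper 24 bits cancel out, the low byte is x XOR m.
    d3 ^= d5
    # sub.b d5,d3: only the low byte is updated.
    low = ((d3 & 0xFF) - sign_mask) & 0xFF
    return (d3 & 0xFFFFFF00) | low

# Check every byte value, with and without garbage in the upper 24 bits.
for upper in (0x00000000, 0x12345600, 0xFFFFFF00):
    for v in range(256):
        expected = v if v < 128 else 256 - v
        assert abs_b_sequence(upper | v) == expected
```

In this model the long-sized eor also clears the upper 24 bits, so the result comes out zero-extended. Note that -128 maps to 128 ($80), which is only correct when the result is read as unsigned.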

The type of fusing demonstrated in the previous example also applies to AND/OR/SUB/ADD as the operation following a move. Unsigned extension of memory data to 32 bits is quite cheap on Apollo (see below for notes on signed extension).

Code:

	moveq  #0,D0	;F unsigned extension in 1/2 cycle
	move.b (a0),D0	;F
	 move.l D1,D2	;F who needs 3-op-logic after all?
	 add.l  D3,D2	;F (except for code density...)

Please note that the fusing examples use a .l operation in the first instruction. The Apollo Core can read two registers per cycle per ALU. To keep code execution fast, avoid partial-register move operations (a rule already known from the 68060).

Some limitations to the second pipe apply:

Code:

- no PC-relative addressing
- max. 8 Byte instruction length
- only instructions executing in one cycle (no mul/div)
- no instructions referencing the flags (no BCC,SCC)
  (exception: dbf works in 2nd pipe)

Also, the second pipe can update address registers via (An)+ or -(An), but only if the first pipe doesn't update an address register in the same cycle:

Code:

	addq.l #1,a0	;two cycles
	move.l (a2)+,D0	;
	;
	move.l (a2)+,D0	;one cycle
	 addq.l #1,a0	;
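
The pairing rules collected so far (one memory access per cycle, first-pipe-only instructions, and the second-pipe limitations) can be put into a toy issue model. This is only a sketch under simplifying assumptions: every instruction occupies one issue slot, and fusing, instruction latency, the address register rule and EA bubbles are not modelled. The class and function names are hypothetical.

```python
class Instr:
    """One instruction with the properties the pairing rules look at."""
    def __init__(self, text, mem_access=False, first_pipe_only=False,
                 uses_flags=False, pc_relative=False, length=2):
        self.text = text
        self.mem_access = mem_access            # reads or writes memory
        self.first_pipe_only = first_pipe_only  # e.g. mul/div, Gold2 AMMX
        self.uses_flags = uses_flags            # Bcc/Scc (dbf is the exception)
        self.pc_relative = pc_relative
        self.length = length                    # encoded length in bytes

def can_use_second_pipe(first, second):
    """True if `second` may issue in the second pipe next to `first`."""
    if second.first_pipe_only or second.uses_flags or second.pc_relative:
        return False
    if second.length > 8:
        return False
    # Only one memory access per cycle across both pipes.
    return not (first.mem_access and second.mem_access)

def count_issue_cycles(code):
    """Greedy in-order pairing: try to issue two instructions per cycle."""
    cycles = 0
    i = 0
    while i < len(code):
        if i + 1 < len(code) and can_use_second_pipe(code[i], code[i + 1]):
            i += 2
        else:
            i += 1
        cycles += 1
    return cycles

load_a0 = Instr("move.l (a0)+,d0", mem_access=True)
load_a1 = Instr("move.l (a1)+,d1", mem_access=True)
add = Instr("add.l d1,d2")
mulu = Instr("mulu.w d3,d4", first_pipe_only=True)

print(count_issue_cycles([load_a0, load_a1]))  # 2: two memory accesses
print(count_issue_cycles([load_a0, add]))      # 1: superscalar
print(count_issue_cycles([mulu, load_a0]))     # 1: mul in pipe 1, load in pipe 2
```

Such a model is no replacement for measuring on real hardware, but it makes it easy to see why swapping two neighbouring instructions can save an issue slot.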

Now for some remarks regarding indexed addressing. We know from the 68060 that the fast-forward from one instruction to the next doesn't apply to indexed addressing. Code like

Code:

	addq.l #1,d4		;
	move.b (a0,d4.l*2),d5	;don't do this _EVER_

leads to a slow 2-cycle bubble. The EA calculation is done early in the pipe and needs some distance between the index register calculation and the actual indexed move. If there is no easy way to move these calculations/moves apart, Apollo can help: since Gold2, the Apollo Core updates An in the EA unit when the source is a register.

Code:

	addq.l #1,A1		;A1 update in EA unit
	move.l (A0,A1),D5	;no bubble
	add.l  D1,A1		;A1 update in EA unit
	move.l (A0,A1),A2	;still executes fast 
	                	;(A2 updated in ALU, don't use immediately)

Here is a quick list of commonly used fused code constructs:

Code:

	move.l (Am)+,(An)+	;F to move.q (Am)+,(An)+
	move.l (Am)+,(An)+	;F
	;
	move.b (d16,An),Dn	;F signed extension in 1/2 cycle
	extb.l Dn		;F
	;
	move.w (d16,An),Dn	;F signed extension (word to long)
	ext.l  Dn		;F
	;
	move.l Dm,Dn		;F
	not.X  Dn		;F (X=b/w/l)
	;
	move.l Dm,Dn		;F
	neg.X  Dn		;F (X=b/w/l)
	;
	move.l Dm,Dn		;F (same for subq)
	addq.X #Val,Dn		;F (X=b/w/l)
	;
	move.l Dm,Dn		;F (same for sub)
	add.X  Do,Dn		;F (X=b/w/l)
	;
	move.l Dm,Dn		;F
	andi.w #num,Dn		;F
	;
	move.l Dm,Dn		;F same applies for or.X
	and.X  Do,Dn		;F
	;
	moveq  #val,Dn		;F same applies for or.X
	and.X  Dm,Dn		;F
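
Since fusing only happens for instructions that sit next to each other, it can help to scan a source file for candidate pairs. The sketch below is a hypothetical lint helper that recognises only the register-to-register patterns from the list above (`move.l Dm,Dn` or `moveq #val,Dn` followed by an operation on the same Dn). It is an assumption that these are safe to match textually; the real core may fuse more or fewer cases, and lines starting with a label are simply skipped.

```python
import re

# Operations the list above shows fusing after "move.l Dm,Dn" (same Dn).
AFTER_MOVE_L = {"not", "neg", "addq", "subq", "add", "sub", "and", "or", "andi"}
# Operations shown fusing after "moveq #val,Dn" (same Dn).
AFTER_MOVEQ = {"and", "or"}

LINE = re.compile(r"^\s*([a-z]+)(?:\.([bwl]))?\s+([^;]*?)\s*(?:;.*)?$", re.I)

def parse(line):
    """Return (mnemonic, size, source, destination) or None for other lines."""
    m = LINE.match(line)
    if not m:
        return None
    mnemonic = m.group(1).lower()
    size = (m.group(2) or "").lower()
    src, _, dst = m.group(3).rpartition(",")
    return mnemonic, size, src.strip().lower(), dst.strip().lower()

def is_dreg(operand):
    return re.fullmatch(r"d[0-7]", operand) is not None

def fuses(first, second):
    """True if the two parsed instructions match one of the listed patterns."""
    if first is None or second is None:
        return False
    m1, size1, src1, dst1 = first
    m2, _, _, dst2 = second
    if not is_dreg(dst1) or dst1 != dst2:
        return False
    if m1 == "move" and size1 == "l" and is_dreg(src1):
        return m2 in AFTER_MOVE_L
    if m1 == "moveq":
        return m2 in AFTER_MOVEQ
    return False

def find_fusible(lines):
    """Return index pairs of adjacent lines that look fusible."""
    parsed = [parse(line) for line in lines]
    pairs, i = [], 0
    while i + 1 < len(parsed):
        if fuses(parsed[i], parsed[i + 1]):
            pairs.append((i, i + 1))
            i += 2
        else:
            i += 1
    return pairs
```

Calling `find_fusible()` on the lines of a routine returns the positions of matching pairs, which makes it easier to spot places where an unrelated instruction has been scheduled between a move and its follow-up operation.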

Another feature is Bonding. Explanations and examples will follow at a later time.

Some other notes on performance tuning:
- Apollo Core executes many bitfield operations in a single cycle. This is very useful for VLC decoding (audio/video).
- Code alignment is not necessary. The Apollo ICache can fetch 16 bytes in a row from any even address.
Last edited by bax on Thu 2 Feb 2017 14:56, edited 2 times in total.

Post by guibrush » Tue 31 Jan 2017 17:03

Thank you very much for your hard work, bax!

Post by anchor » Sat 11 Mar 2017 06:32

Thanks, bax.

Since I can't find any documentation on the 68080, your posts are essential.
I have no information about execution times on the 68080, so I'm asking:
what is the best approach if I want to fill a screen with a given color?
I load the color values into registers, and then ...
1 - use movem.l register-list,<ea> + dbf
(more than 256 bits with one instruction)
2 - or use the new STORE instruction + dbf
(64 bits with one instruction)
3 - some other way

Post by SamuraiCrow » Sun 12 Mar 2017 11:04

Use method 2. The DBF should run in the second pipe, so each iteration stores 64 bits per clock.

Post by bax » Mon 13 Mar 2017 07:47

Sam is right. MOVEM stores 32 bits per cycle. You can verify the runtime behavior of your code with flype's CPUMon; it will tell you when something stalls.
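
To put numbers on the two methods, here is a rough back-of-the-envelope estimate in Python. It uses only the two rates mentioned in this thread (32 bits per cycle for MOVEM, 64 bits per cycle for STORE with the DBF paired in the second pipe) and ignores memory bandwidth, loop overhead and stalls. The 320x256 screen with one byte per pixel is a made-up example, and `fill_cycles` is an illustrative name.

```python
def fill_cycles(width, height, bytes_per_pixel=1):
    """Estimate store cycles for filling a screen, ignoring all stalls."""
    size = width * height * bytes_per_pixel
    return {
        "bytes": size,
        # MOVEM: 32 bits (4 bytes) stored per cycle.
        "movem_cycles": -(-size // 4),
        # STORE + DBF: 64 bits (8 bytes) per cycle, DBF in the second pipe.
        "store_cycles": -(-size // 8),
    }

estimate = fill_cycles(320, 256)
print(estimate["movem_cycles"], estimate["store_cycles"])  # 20480 10240
```

Under these assumptions the STORE loop needs half as many cycles as MOVEM for the same screen; a tool such as CPUMon is still the way to see what actually happens on the hardware.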