Apollo Accelerators

Apollo Accelerators Discussion Forum
bax
Post subject: AMMX Introduction Posted: Tue 6 Dec 2016 12:41
Depending on available time, I intend to outline in this thread what the SIMD extensions to the Apollo Core offer and how the functionality might apply to one coding problem or another. Please be aware that the extensions are a work in progress and might change without notice before an official release of a finished core (working title: Gold2).

AMMX, as Gunnar named it, is a 64 Bit SIMD extension. Apart from the fact that it shares the 64 Bit width with the MMX of a well-known company, the concept we followed is geared more towards the SIMD extensions in RISC architectures (AltiVec, Wireless MMX). In the current state of development, 32 registers are available for SIMD usage. These 32 registers include the well-known D0-D7 (extended to 64 Bit) and 24 new registers (E0-E23) which are SIMD exclusive. This way, a lot of work can be done in registers, reducing the strain on memory reads and writes considerably.

Most instructions follow a 3 operand logic D = A op B, where the result of the operation between A and B is stored in any register D. It must be noted at this point that the input operand A doesn't have to be a register. Any effective address in 68k notation is allowed, including immediates. Allow me to show some examples at this point.
   PADDW    D0,D1,D2	          ; 4x16 Bit addition D0+D1=D2
   PADDW    (A0),D1,D2	          ; same, from memory (unaligned)
   PADDW    #$8100810081008100,D1,D2 ; add 4x16 Bit constant
   PADDW.W  #$8100,D1,D2          ; same as above, with implicit splat
The latter two code lines above demonstrate a convenient feature of AMMX. You can specify immediates in SIMD code as well, something you don't easily find elsewhere. The constants can be given in full 64 Bit. While this may be useful for some applications, the 64 Bit immediates result in instruction words of 12 Bytes. As an alternative, we added a second way of specifying constants. The .W suffix in the last of the example mnemonics triggers the implicit distribution (splat) of the immediate data word to all four 16 Bit slots. This way, the latter two instructions are identical in their arithmetic operation. The difference with implicit splat is a reduction of the instruction word to 6 Bytes.
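To make the splat concrete, here is a small scalar model in Python. It is only a sketch: the helper names are made up for illustration, and the per-lane wraparound in `paddw` is an assumption, since nothing above says whether PADDW wraps or saturates.

```python
MASK16 = 0xFFFF

def splat16(word):
    """Replicate one 16-bit value into all four lanes of a 64-bit value."""
    return (word & MASK16) * 0x0001000100010001

def paddw(a, b):
    """Lane-wise 4x16 Bit addition (ASSUMPTION: each lane wraps modulo 2^16)."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        lane_sum = ((a >> shift) & MASK16) + ((b >> shift) & MASK16)
        result |= (lane_sum & MASK16) << shift
    return result

def instruction_bytes(short_immediate):
    """Two 16-bit instruction words plus either a 2-byte or an 8-byte immediate."""
    return 4 + (2 if short_immediate else 8)
```

With this model, `splat16(0x8100)` yields the same 64 Bit constant that the long form spells out, and the two encodings come out at 12 and 6 Bytes as stated above.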

These two concepts of 3 operand logic and immediates can help to save a number of move instructions that were common to 68k code.

In terms of data movement, two basic operations are supported: LOAD and STORE. While input data for the operations can be gathered by the <VEA> for one of the operands in the arithmetic operations, the destination is a register in the majority of instructions. Therefore, movement to memory needs to be done by STORE. Example:
   LOAD    (A0)+,D1	;D1=64 bit from any memory location, A0=A0+8
   PAVGB   (A1)+,D1,D1  ;8x unsigned byte average (a+b+1)>>1
   STORE   D1,(A2)+     ;write result
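The rounding average from the comment above, written out as a scalar Python model. This is a sketch only; the function name is illustrative, and the 64 Bit register is modelled as a plain integer holding 8 unsigned bytes.

```python
def pavgb(a, b):
    """Per-byte unsigned average with rounding: (a + b + 1) >> 1 for each of 8 lanes."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        result |= ((x + y + 1) >> 1) << shift
    return result
```

Because the intermediate sum is formed in more than 8 bits, averaging 255 with 255 gives 255 instead of overflowing.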
A special case of STORE is also provided, one that can selectively write the individual bytes. The STOREM Rn,Rm,<ea> will only write bytes of which the corresponding mask bit is set (both in MSB to LSB notation).
   moveq   #4,d3           ;yes yes, this will stall in the following <VEA> calculation
   LOAD    4(A0,D3.l*4),D1 ;D1=64 bit from any memory location
   moveq   #%01010101,D2   ;D2.b=bit mask which bytes (bit=1) are to be written
   STOREM  D1,D2,(A2)+     ;write every second byte from D1 
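A scalar sketch of the masked store in Python. The bit-to-byte mapping is my reading of "MSB to LSB notation" and should be treated as an assumption: mask bit 7 controls the first (most significant) byte of the register, mask bit 0 the last one. Memory is modelled as a `bytearray`.

```python
def storem(reg, mask, memory, addr):
    """Write only those bytes of the 64 Bit value whose mask bit is set.

    ASSUMPTION: mask bit 7 belongs to the first byte in memory order,
    mask bit 0 to the last one.
    """
    data = (reg & 0xFFFFFFFFFFFFFFFF).to_bytes(8, "big")
    for i in range(8):
        if mask & (0x80 >> i):
            memory[addr + i] = data[i]
```

With the mask `%01010101` from the example, bytes 1, 3, 5 and 7 are written and the other four memory bytes stay untouched.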
The third special STORE variant is targeted at 8 Bit pixel data. Typical operations in image/video processing result in intermediate results exceeding the 8 Bit range, which implies clipping before going back to 8 Bit. The Apollo features its own interpretation of PACKUSWB for this purpose. Clipping is done to (0,255). Example:
   LOAD    (A0)+,D1     ;4 signed words: a0.w a1.w a2.w a3.w
   LOAD    (A1)+,D2     ;4 signed words: b0.w b1.w b2.w b3.w
   PACKUSWB D1,D2,(A2)+ ;8 unsigned bytes: a0 a1 a2 a3 b0 b1 b2 b3 
   ; operation: vn.b = ( vn.w < 0 ) ? 0 : ( ( vn.w > 255 ) ? 255 : vn.w ); // n=0...7
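The clipping rule from the last comment line, as a scalar Python sketch. The function names are illustrative, and the registers are modelled as lists of four signed words each.

```python
def clip_u8(value):
    """Clamp a signed word to the unsigned byte range 0...255."""
    if value < 0:
        return 0
    if value > 255:
        return 255
    return value

def packuswb(a_words, b_words):
    """Pack 4 + 4 signed words into 8 unsigned bytes: a0 a1 a2 a3 b0 b1 b2 b3."""
    return [clip_u8(w) for w in list(a_words) + list(b_words)]
```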
One catch with SIMD is that you cannot always guarantee that you are able to lay out your data as needed by the arithmetic. That's why coders have been fond of the permute instruction, introduced with Motorola's PPC7400 (aka G4) series. The Apollo core offers one, too. Two input registers Ra and Rb can be permuted by a given permutation constant into the destination Rd. Each hex digit of the constant selects one of the 16 input bytes. Example:
 ;byte permutation key semantics for Rm,Rn 
 ; Rm m0 m1 m2 m3 m4 m5 m6 m7 = 0 1 2 3 4 5 6 7
 ; Rn n0 n1 n2 n3 n4 n5 n6 n7 = 8 9 a b c d e f
 ;
 ; ex1: word interleaving
 LOAD    (A0)+,D1     ;4 signed words: m0.w m1.w m2.w m3.w
 LOAD    (A1)+,D2     ;4 signed words: n0.w n1.w n2.w n3.w
 VPERM   #$018923ab,D1,D2,D3 ;D3: m0.w n0.w m1.w n1.w
 ; ex2: unsigned byte to words
 LOAD    (A0),D4      ;8 unsigned bytes m0 m1 m2 m3 m4 m5 m6 m7
 moveq   #0,d5        ;0.l
 VPERM   #$F0F1F2F3,D4,D5,D6 ; first  four bytes as words m0.w m1.w m2.w m3.w
 VPERM   #$F4F5F6F7,D4,D5,D7 ; second four bytes as words m4.w m5.w m6.w m7.w (into D7, so D6 is kept)
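The key semantics can be modelled in a few lines of Python. This is a sketch inferred from the two examples above, not a specification: I assume the leftmost hex digit of the key selects the most significant result byte, digits 0-7 pick bytes of the first register and 8-f bytes of the second.

```python
def vperm(key, rm, rn):
    """Build 8 result bytes; each hex digit of the 32 Bit key picks one of 16 source bytes.

    ASSUMPTION: digits are consumed left to right, matching result bytes 0...7.
    """
    pool = rm.to_bytes(8, "big") + rn.to_bytes(8, "big")
    picked = bytes(pool[(key >> (28 - 4 * i)) & 0xF] for i in range(8))
    return int.from_bytes(picked, "big")
```

For instance, `vperm(0xF0F1F2F3, 0x0102030405060708, 0)` gives `0x0001000200030004`, i.e. the first four bytes zero-extended to words, because digit F always selects the last byte of the zeroed register.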
Let's come to arithmetics. Bit-wise operations are:
   PAND  <VEA>,Rb,Rd
   POR   <VEA>,Rb,Rd
   PEOR  <VEA>,Rb,Rd
   PANDN <VEA>,Rb,Rd
Addition/Subtraction can be done on 8 Bit or 16 Bit.
   PADDB <VEA>,Rb,Rd    ;Rd = Rb + <VEA>
   PADDW <VEA>,Rb,Rd    ;
   PSUBB <VEA>,Rb,Rd    ;Rd = Rb - <VEA>
   PSUBW <VEA>,Rb,Rd    ;
One special case of add/sub is BFLYW. A common recurrence in signal transforms (FFT, DCT, DWT) is the butterfly, an operation where both the sum and the difference of two operands are required. To speed up such transforms, AMMX offers BFLYW <VEA>,Rb,Rd:Rd+1. Please note that the destination is actually a consecutive register pair (with an even index for the first one).
   BFLYW D0,D1,D2:D3 ; D2 = D1 + D0 , D3 = D1 - D0  (4 words each)
As a side note, we replaced 28 add+sub combinations by butterflies in an 8x8 iDCT, roughly 15% of the total instructions in that function block.
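For reference, a scalar Python sketch of what one BFLYW computes. The names are illustrative, registers are modelled as lists of four signed words, and wraparound on overflow is an assumption (saturation is not mentioned anywhere above).

```python
def wrap16(value):
    """Reduce an integer to a signed 16-bit word (ASSUMPTION: wraparound, no saturation)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def bflyw(vea, rb):
    """Return (Rd, Rd+1) = (Rb + <VEA>, Rb - <VEA>), four words each."""
    sums = [wrap16(b + a) for a, b in zip(vea, rb)]
    diffs = [wrap16(b - a) for a, b in zip(vea, rb)]
    return sums, diffs
```

One call replaces the separate add and subtract that a butterfly stage would otherwise need.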

Multiplies are currently offered by the PMUL88 <VEA>,Rb,Rd instruction. It multiplies four words with the given operand and shifts down by 8 Bits after the multiply (Rd = (Rb*<VEA>)>>8 ). Example:
   PMUL88.W #16,D0,D1   ; D1 = (D0*16)>>8 = D0/16
   PMUL88.W #1024,D0,D1 ; D1 = (D0*1024)>>8 = D0*4
   PMUL88.W D2,D3,D4    ;
The multiply is implemented with full throughput. The implicit downshift (>>8) can serve as short range shift replacement with the respective multipliers.
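The shift-replacement trick can be checked with a scalar Python model. Treat it as a sketch under assumptions: the text does not say whether the words are signed, so signed words with an arithmetic downshift and a 16-bit wrap of the result are assumed here.

```python
def wrap16(value):
    """Reduce an integer to a signed 16-bit word."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def pmul88(vea, rb):
    """Per word: Rd = (Rb * <VEA>) >> 8.

    ASSUMPTION: operands are signed words and the shift is arithmetic.
    """
    return [wrap16((b * a) >> 8) for a, b in zip(vea, rb)]
```

A multiplier of 16 acts as a right shift by 4 and a multiplier of 1024 as a left shift by 2, matching the two immediate examples above.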

Now, a second special operation pair is TRANS. It comes in two flavors, TRANSHi and TRANSLo. Normally, a matrix transpose is quite time consuming, as you can only shuffle two operands at a time with other ISAs (I just counted 18 instructions for an 8x4 block with 16 bits per element using Intel SSE in an old routine of mine). This overhead may well have a significant impact on SIMD execution speed when it comes to matrix operations. Apollo's TRANS operations transpose a 4x4 block with 16 bits per element from row to column order and vice versa. Example:
   LOAD (A0)+,E0  ; A0 B0 C0 D0 (A-D = 16 bit words)
   LOAD (A0)+,E1  ; A1 B1 C1 D1 (A-D = 16 bit words)
   LOAD (A0)+,E2  ; A2 B2 C2 D2 (A-D = 16 bit words)
   LOAD (A0)+,E3  ; A3 B3 C3 D3 (A-D = 16 bit words)
   ;Now transpose the first two words of E0-E3 into the output registers
   TRANSHi E0-E3,E4:E5 ; E4: A0 A1 A2 A3
                       ; E5: B0 B1 B2 B3
   ;Transpose the lower two words of E0-E3 into the chosen output registers
   TRANSLo E0-E3,E6:E7 ; E6: C0 C1 C2 C3
                       ; E7: D0 D1 D2 D3
   ; Done. Calculate or store...
As one can clearly see from the operand list, TRANS is a beast. It takes 32 Bytes as input and provides 16 Bytes output, with a throughput of 1. This requires some compromises. Technically speaking, an Apollo feature called "late write" is used here. This induces latency. Place two cycles' worth of instructions between a TRANS and the instruction referencing its result to avoid bubbles. The second compromise concerns input and output registers. The inputs are restricted to a consecutive block of registers, starting with an index divisible by 4, i.e. D0-D3,D4-D7,E0-E3,...,E20-E23. A similar restriction applies to the outputs. Here, the register index must be a multiple of two. TRANS does not accept memory locations or immediates.
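The data movement of the TRANS pair, as a scalar Python sketch. The function names are illustrative, and the four input registers are modelled as a list of four rows with four words each.

```python
def trans_hi(rows):
    """First two columns of a 4x4 word block, each returned as one row."""
    return [r[0] for r in rows], [r[1] for r in rows]

def trans_lo(rows):
    """Last two columns of a 4x4 word block, each returned as one row."""
    return [r[2] for r in rows], [r[3] for r in rows]

def transpose4x4(rows):
    """Full transpose, composed from one TRANSHi and one TRANSLo."""
    return [*trans_hi(rows), *trans_lo(rows)]
```

Two instructions therefore cover the complete 4x4 transpose, each producing one register pair.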

AMMX Instruction words and effective address format

An AMMX instruction consists of two 16 bit words, i.e. 32 bit in length before any extension words. The first word is organized as follows:
-------------------------------------------------------------
| Bit     | 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 |
| Content |  1  1  1  1  1  1  1  A  B  D <------VEA------> |
-------------------------------------------------------------
The second word contains the register indices and instruction numbering. Currently, 5 bits are in use for the op itself. The remaining 3 Bits are reserved for future opcodes.
-------------------------------------------------------------
| Bit     | 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 |
| Content | <- REG-B -> <- REG-D ->  0  0  0 <---- Op ----> |
-------------------------------------------------------------
AMMX generally provides the usual m68k addressing modes for one of the operands. For operations whose natural target is memory (STORE, for example), <VEA> applies to the destination. Otherwise, one of the inputs is of the type <VEA>. Exceptions to this scheme are VPERM and TRANS, where memory operands are not allowed.

Before the <VEA> itself is explained, some other notes on the Apollo Core. There have been requests to provide additional address registers. As a consequence, the Apollo Core offers the additional registers Bn = B0...B7. A number of scalar instructions have been implemented to support the new Bn. In terms of instruction format, these registers were carried over to the AMMX instructions. The distinction between classic addressing modes and the new register set(s) is made by Bits 8,7,6 (=A,B,D) of the first instruction word.

The most prominent use of the Bits A,B,D in AMMX is the bank selection for E8-E23. When one of the selector bits is set, E8-E23 are selected instead of D0-D7,E0-E7 for the Operands <REG-B> and <REG-D>. D0-D7 correspond to the bit combinations 0000...0111 and E0-E7 to 1000...1111 in <REG-B/D>. With bank selector bit on, the registers are E8-E23 with the consecutive bit combinations 0000...1111.
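The field layout of the two instruction words can be sketched as a small Python decoder. The bit positions are read off the two diagrams above. Which selector bit banks which operand is not spelled out, so pairing B with <REG-B> and D with <REG-D> is an assumption, and the function names are made up for illustration.

```python
def decode_ammx(word1, word2):
    """Split the two 16 bit instruction words into their fields."""
    if (word1 >> 9) != 0b1111111:
        raise ValueError("first word does not carry the 1111111 prefix")
    return {
        "A": (word1 >> 8) & 1,
        "B": (word1 >> 7) & 1,
        "D": (word1 >> 6) & 1,
        "vea": word1 & 0x3F,
        "reg_b": (word2 >> 12) & 0xF,
        "reg_d": (word2 >> 8) & 0xF,
        "op": word2 & 0x1F,
    }

def reg_name(index, bank):
    """Map a 4 bit register index plus its bank selector bit to a register name."""
    if bank:
        return "E%d" % (8 + index)      # 0000...1111 -> E8...E23
    if index < 8:
        return "D%d" % index            # 0000...0111 -> D0...D7
    return "E%d" % (index - 8)          # 1000...1111 -> E0...E7

def operand_names(fields):
    """ASSUMPTION: bit B banks <REG-B> and bit D banks <REG-D>."""
    return (reg_name(fields["reg_b"], fields["B"]),
            reg_name(fields["reg_d"], fields["D"]))
```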


The following table lists the valid <VEA> addressing modes for AMMX. Please note that the 68020+ memory indirect modes are not among the valid choices for AMMX. Also, the register direct mode only refers to 64 Bit registers, replacing An/Bn source operands by En.

Another difference between <EA> and <VEA> was chosen in terms of immediates. The default immediate encoding specifies the full 64 bit with four extension words. With A=1, the short variant with implicit splat is selected.

The other extension words to <VEA> modes are unchanged in comparison to <EA>. Please refer to the 68020+ manual for the common encoding.
 +---------+-------------------------------------------------+
 | MOD REG |  Effective Addressing Mode depending on A-Bit  |
 +---------+-------------------------------------------------+
 |         |        A=0              |         A=1           |
 +---------+-------------------------+-----------------------+
 | 000 --- | Dn                      |     E8...E15          |
 | 001 --- | E0-E7                   |     E16...E23         |
 | 010 --- | (An)                    |     (Bn)              |
 | 011 --- | (An)+                   |     (Bn)+             |
 | 100 --- | -(An)                   |     -(Bn)             |
 | 101 --- | (d16,An)                |     (d16,Bn)          |
 +---------+-------------------------------------------------+
 | 110 --- | (d8,An,Xn.SIZE*SCALE)   | (d8,Bn,Xn.SIZE*SCALE) |
 | 110 --- | (bd,An,Xn.SIZE*SCALE)   | (bd,Bn,Xn.SIZE*SCALE) |
 +---------+-------------------------------------------------+
 | 111 010 |                      (d16,PC)                   |
 | 111 011 |                (d8,PC,Xn.SIZE*SCALE)            |
 | 111 011 |                (bd,PC,Xn.SIZE*SCALE)            |
 | 111 000 |                      (xxxx).W                   |
 | 111 001 |                    (xxxxxxxx).L                 |
 | 111 100 | #<xxxxxxxxxxxxxxxx>.q   | #<xxxx>.w             |
 +---------+-------------------------------------------------+
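The table translates directly into a lookup, sketched here in Python. The function name is illustrative, and I assume the usual 68k field order with MOD in the upper three bits and REG in the lower three bits of the 6 bit <VEA> field. For MOD 110 and 111/011 the brief and full (bd) forms are distinguished by the extension word, as in <EA>, so only the brief form is returned.

```python
def vea_mode(vea, a_bit):
    """Describe a 6 bit <VEA> field (ASSUMPTION: MOD = bits 5-3, REG = bits 2-0)."""
    mod, reg = (vea >> 3) & 7, vea & 7
    base = "B%d" % reg if a_bit else "A%d" % reg
    if mod == 0:
        return "E%d" % (8 + reg) if a_bit else "D%d" % reg
    if mod == 1:
        return "E%d" % (16 + reg) if a_bit else "E%d" % reg
    if mod == 2:
        return "(%s)" % base
    if mod == 3:
        return "(%s)+" % base
    if mod == 4:
        return "-(%s)" % base
    if mod == 5:
        return "(d16,%s)" % base
    if mod == 6:
        return "(d8,%s,Xn.SIZE*SCALE)" % base
    special = {
        0: "(xxxx).W",
        1: "(xxxxxxxx).L",
        2: "(d16,PC)",
        3: "(d8,PC,Xn.SIZE*SCALE)",
        4: "#<xxxx>.w" if a_bit else "#<xxxxxxxxxxxxxxxx>.q",
    }
    if reg not in special:
        raise ValueError("not a valid <VEA> mode")
    return special[reg]
```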
--- More to come later ---


Last edited by bax on Fri 17 Feb 2017 22:13, edited 4 times in total.

SamuraiCrow
Post subject: Re: AMMX Introduction Posted: Tue 6 Dec 2016 14:16
Thanks for the post, Bax! I've needed this information for a long time!

