pack
InstructionThe pack instruction from Zbkb is the best RISC-V
instruction that nobody implements. On RV32, Zbkb contains the following
bit manipulation instructions which are not present in Zbb:
zip: mostly useless 1unzip: for when you executed a zip and you
didn’t mean tobrev8: likely useful for something, people keep adding
these to ISAs 2pack: pure magicpackh: also pretty greatExpressed in Verilog, the operation of pack is:
rd <= {rs2[15:0], rs1[15:0]};Expressed in English, pack is the concatenation of the
low halves of its two input registers. Its little brother
packh is the concatenation of the two least-significant
bytes into a zero-extended halfword.
Zbb dropped the pack and packh instructions
late in the ratification process because they
were cut based on a SPECint benchmark with early compiler support
and inertia was then in favour of leaving them out. A subset of
pack was retained in Zbb as the zext.h
pseudo-instruction (rs2 is zero)3,
but not the remaining encodings, and nothing for packh (as
zext.b is a pseudo-instruction for andi).
The rationale for the full instructions’ retention in Zbkb was likely to speed up loads from unaligned halfword and word fields on processors which lack native support for unaligned reads, since packed fields inside of octet streams are common in cryptographic workloads.
This post looks at three surprising uses of pack which
have nothing to do with word or halfword data. Maybe it will become your
favourite instruction too.
This excerpt from the RP2350
bootrom unpacks the immediate operand from an Armv8-M Base
BL (T1) or B.W (T4) instruction:
vexecute32_bw:
// r_inst[12:0] = 1 0 S imm10
addi r_tmp2, r_inst, -1024 // r_tmp2[12:0] = S !S !S imm10
slli r_tmp0, r_work2, 5 // concatenate imm11 to end (plus 5 incidental zeroes)
pack r_tmp2, r_tmp0, r_tmp2 // (it's called pack but I use it to unpack things???)
slli r_tmp2, r_tmp2, 3 // Sign-extend and scale by 2 overall (sll 5 + 3 - 7)
srai r_tmp2, r_tmp2, 7 // {{8{S}}, !S, !S, imm10, imm11, 1'b0}
bexti r_tmp0, r_work2, 13 // J1
bexti r_tmp1, r_work2, 11 // J2
sh1add r_tmp0, r_tmp0, r_tmp1 // {J1, J2}
slli r_tmp0, r_tmp0, 22
xor r_tmp0, r_tmp0, r_tmp2 // Mix the pasta and the sauce
add r_pc, r_pc, r_tmp0
next_instructionThe encoding of the BL instruction was designed using
the classic party game pin the tail on the donkey.4 It seems to break the bitfield
renderer on Arm’s
online docs so here is my bitfield diagram:
First halfword (r_inst):
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
+---+---+---+---+---+---+---------------------------------------+
| 1 | 1 | 1 | 1 | 0 | S | imm10[9:0] |
+---+---+---+---+---+---+---------------------------------------+
\________________/
op = 11110
Second halfword (r_work2):
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
+---+---+---+---+---+-------------------------------------------+
| 1 | 1 |J1 | 1 |J2 | imm11[10:0] |
+---+---+---+---+---+-------------------------------------------+
The offset is computed as:
I1 = !(J1 ^ S)
I2 = !(J2 ^ S)
imm32 = SignExtend( S : I1 : I2 : imm10 : imm11 : '0' , 32 )
bit: 24 23 22 21:12 11:1 0
Here pack is used for concatenation of unpacked bits.
There are very few instructions in RISC-V that can take bits from the
same locations in two different source registers and combine them
without overlap: the other ones I’m aware of are the sh*add
family, which also make an appearance in this excerpt. It looks like the
P extension might add more.
pack is profitable because it replaces the usual
s*li + or combinations, as well as masking the
bits above imm11 in r_work2 that would
otherwise contaminate the result. These bits are masked because they end
up above the concatenation boundary. One other detail that makes it
particularly applicable here is that the result is about to be shifted
anyway so the fact that the concatenation is at a 16-bit
boundary doesn’t really matter: the first comment describes the
pre-shift as “5 incidental zeroes” because the 5 is just subtracted from
the shift distance of the subsequent shift-by-8. This may seem
situational but it’s actually quite likely when reconstituting bitfields
that the next thing you do after combining some bits is to shift
them.
32 × 32 → 64-bit multiplication
using only the 32 × 32 → 32-bit
mul instruction is useful because high-half multiplies are
significantly slower than the low mul on some RISC-V
implementations. The instructions are mandatory if you have M or Zmmul,
but useful levels of performance are not.5
Computing the widening multiply requires a carry into bits
0 and 16 of the high word. You can accomplish
this by adding the low carry, then shifting the high carry by 16, then
adding the high carry. You could also just pack the two
carries and add them in one go:
// Values to be multiplied are in a0 and a1.
srli a2, a0, 16
srli a3, a1, 16
zext.h a0, a0
zext.h a1, a1
mul a4, a2, a3 // A_h * B_h
mul a5, a2, a1 // A_h * B_l
mul a2, a0, a3 // A_l * B_h
mul a0, a0, a1 // A_l * B_l
// Sum the crossed terms, save carry-in for hi[16]
add a5, a2, a5
sltu t0, a5, a2
// Add crossed terms to low word, save carry-in for hi[0]
slli a2, a5, 16
add a0, a0, a2
sltu a3, a0, a2
// Pack carries with correct significance
pack a3, a3, t0
// hi: A_h * B_h + (crossed terms >> 16) + ci0 + (ci1 << 16)
srli a5, a5, 16
add a1, a4, a5
add a1, a1, a3
// Product is in {a1, a0}The second argument to memset is a byte value that must
be replicated up to register width in order to use the widest possible
store instruction. This can be accomplished like
this:
packh a1, a1, a1
pack a1, a1, a1This can also be accomplished in one operation with
xperm8 rd, rs1, zero but the Zbkx instructions have higher
implementation complexity than pack/packh.
The best you can do without Zbkb or Zbkx is three instructions:
li t0, 0x01010101 // expands to lui + addi
mul a1, a1, t0This also needs to assume that a1 is zero
outside of the lower eight bits. The packh +
pack version and the xperm8 version are
independent of bits 31:8.
memset is one of the most-frequently-called C library
functions, often with quite a small length parameter for initialising
short arrays or structs on the stack. These savings on the O(1) part of
memset add up.
On RV64 a similar trick applies to generating repeating bit patterns
useful for SIMD-within-a-register tricks. This tends not to be useful on
RV32, since lui + addi generates any 32-bit
value, but there are sometimes minor code size savings.
I know, I promised three uses for pack at the start, but
it’s just that good. Also this one is packh, so it doesn’t
count. This is from a single-precision
floating-point add routine. a2 contains the exponent,
a6 contains the sign bit (smeared across the entire
register) and a4 contains the significand, with an implicit
one in bit 31 that needs to be cleared.
// Pack it and ship it
packh a2, a2, a6
slli a2, a2, 23
slli a4, a4, 1
srli a4, a4, 9
add a0, a4, a2
retThe result is the concatenation
sign : exponent : significand where the fields are 1, 8 and
23 bits in size respectively.
I think it’s a shame that pack and packh
didn’t make it into the standard B extension (made up of Zba, Zbb and
Zbs), and consequently not into RVA23. When it comes to Hazard3 I’m not
too bothered because I can choose what ISA variants I ship, and you can
bet they will always include Zbkb. Overall though I think the RISC-V
software landscape is made slightly poorer by portable software not
being able to assume the presence of these versatile and inexpensive
instructions.
The decision to drop pack and packh seems
to me somewhat arbitrary, and likely a symptom of a long and gruelling
ratification process and the pressure to ship the extensions. Several
people spoke up in their favour but it wasn’t enough to keep them on the
list.
⇥ Return to wren.wtf
One use for zip is Morton-order access
e.g. to swizzled textures, but iterating in Morton order is
already accomplished efficiently with masked addition as described here
or here.↩︎
The somewhat equivalent RBIT (synthesises
as rev8 + brev8) is useful on Armv7-M and
Armv8-M Main to synthesise the missing CTZ as
RBIT + CLZ. They’re also useful sometimes in
CRC calculations. The most important use of bit reverse instructions is
of course the efficient emulation of bit reverse instructions from other
architectures.↩︎
It’s actually a little different on RV64, but for brevity I’m just discussing the specifics of RV32.↩︎
The BL instruction (or the fujoshi
instruction) was originally specified as a separate 16-bit prefix and
16-bit suffix, with a range of 4 MB. Armv5T added a different
suffix that could go with the prefix (BLX immediate).
Armv6T2 stuffed in more bits to expand the range to 16 MB even though
the new bits were previously constant-1 and they needed to stay
backwards-compatible, hence the weird XNOR encoding. Finally the T32
encoding formally defined BL as a 32-bit instruction.↩︎
For example, Hazard3 with the configuration
MUL_FAST = 1, MULH_FAST =
0 has a single-cycle mul, but high-half
multiplies have the same cycle count as division. This is usually a
well-balanced configuration because the full widening multiplier is a
lot of gates, and mul is executed far more frequently than
mulh/mulhu/mulhsu.↩︎