Hot on the heels of my last post, which had sat in draft mode for too long until I finally (on a plane) counted the instruction cycles, comes a second hand-optimization of a C function into assembly language.  On the same trip I came across another interesting CS problem from HackerRank that reduces to a couple (20 or fewer) of Thumb assembly instructions.  In fact my original C solution had been much longer and more involved until a series of successive insights that reduced it to a single if-then-else test, followed by immediately evaluating one of three expressions (based on a register being negative, zero or positive) that yields a string-pointer return value in r0 which is either “YES” or “NO”.

This solution particularly elegantly illustrates two aspects of assembly optimization by hand:

  • Reducing the number of inputs as quickly as possible in order to avoid using unclobberable registers (r4 through r11 for ARM, which according to the ATPCS must be preserved across function calls)
  • Setting the condition flags as a result of performing necessary register manipulations, which partly involves ordering those instructions so the flag values are already set at the moment a branch must be decided.

But this is getting ahead of ourselves.  First let’s step back from native assembly language nuance to read the problem and review the C solution that HackerRank wants the coder to obtain.

Kangaroo problem description

Two kangaroo avatars are somewhere (maybe at the same integer) on a number line, and will each jump a certain (maybe the same) positive integer number toward +∞ at the same instant, over and over again.  Write a function given the initial position and jump distance of each kangaroo, that returns “YES” if they will ever land at the same point on the number line at the same time, i.e. after the nth jump for some value of n their position is the same, and “NO” otherwise.

The intended approach, especially if you’re in a job interview, is not to begin a jump-by-jump simulation until one of the kangaroos passes the other.  The reason is that a solution using this naive approach is tempting to obtain quickly by directly converting the problem statement into code:

  1. Set up a loop to increment the initial position of each kangaroo by its jump distance,
  2. Compare for equality (in which case we output “YES” then exit), otherwise
  3. Apply a slightly more complicated test to figure out whether to output “NO” or continue the next iteration of the loop since the inter-kangaroo distance is shrinking.

Before starting down this path, as with all interview questions, take a break before starting to write.  Try to get inside the head of your interviewer, and above all: talk about your insights into the problem as they occur to you.

Why Kangaroo is a good interview problem

Interviewers will see this approach as supporting mastery of the programming language claimed on the resume but—if “turned in” as the solution—also as evidence of lack of insight into the nature of computational problems.  Perhaps peers will see a hasty approach and fear the generation of code they’ll need to rewrite because it uses resources inefficiently or just plain ignores corner cases: If the jump distance of each kangaroo is the same, do you avoid an infinite loop that waits in vain for the laggard to catch up?  If extended to a machine integer size able to represent the initial distance from kangaroo 1 to kangaroo 2 as greater than the number of atoms on Earth, and the jump delta is just one unit, does your code overflow undetected or just take way longer than necessary to execute?

The first insight is to realize that the four inputs (two x positions and two v jump distances) can be immediately boiled down to just two quantities: the initial difference d between kangaroo 1 and kangaroo 2, and the catch-up distance c between kangaroo 1 and kangaroo 2 on each jump.  It is useful in the interest of compact code to expect the parameters of the “aft” kangaroo (the one starting closer to -∞ and trying to catch up) to be passed in x1 and v1.  But both c and d are signed quantities, so a simple test for the sign of d allows us to effectively swap the kangaroo parameters by forcing d positive and ensuring that the sign of c is in agreement with whether the “aft” kangaroo is catching up (c > 0) or falling behind (c < 0).  A third possibility is that the distance between them is constant, in which case whether they collide at the end of one jump is a function of whether they started at the same position.

Here is the compact solution in C:

// C submission for https://www.hackerrank.com/challenges/kangaroo
char* kangaroo(int x1, int v1, int x2, int v2) {
 static char* answer_yes = "YES";
 static char* answer_no = "NO";

 int d = x2 - x1; // dEFICIT of aft kangaroo behind fore kangaroo 
 int c = v1 - v2; // cLOSING in distance between kangaroos per jump

 if (d < 0) { // swap kangaroos since x1 > x2
  d = -d; // d is now nonnegative
  c = -c; // c sign must be flipped too
 }
 if (c < 0) // aft kangaroo gets farther away with each jump, not closer
  return answer_no;

 else if (c == 0) // kangaroos jump same distance, answer is their initial match
  return (d == 0) ? answer_yes : answer_no;

 else // aft kangaroo hits fore kangaroo iff c (positive) divides evenly into d
  return (d % c == 0) ? answer_yes : answer_no;
}

ARM implementation

For just about every modern architecture the majority of registers must be preserved by a called function, in the interest of low-overhead function calls.  Optimizing compilers will try to avoid doing so, not only for the stack hit but also because the flexibility to return to the caller prematurely (before the closing brace of the declaration) is lost.

Register usage

As seen from their order in the function declaration the parameters x1 and x2 arrive to kangaroo() in the even argument registers r0 and r2 respectively, while the parameters v1 and v2 arrive in the odd argument registers r1 and r3 respectively.  As the first step we’ll place c in r1 by subtracting r3 from r1 (neither value is needed again), and d in r2 by subtracting r0 from r2 (ditto).

The initial guess of return value placed in r0 is “answer_no” since the expected value from a Monte Carlo analysis shows that a kangaroo pair with random parameters are unlikely ever to collide.  As soon as a collision is known to be impossible, a “mov pc,lr” gets us out of the function with this correct return value.  If we do detect a kangaroo collision we’ll change r0 to “answer_yes” right before exiting.

Lastly r3 is available for calculating a remainder, in one of two ways according to whether the architecture supports hardware division.  No higher registers are needed, a good outcome since they would not only have to be saved on the stack but also restored at a common exit point that would result in more branching.

Condition flags

The reason for calculating c before d wasn’t alphabetical; the first decision that must be made is whether d is negative, indicating the parameters for the “fore” and “aft” kangaroo got swapped on the call.  The N flag will be set automatically by the subtraction that sets d, so it is not necessary to do a separate compare as long as c isn’t set after d (as the C code would suggest is possible).

The code

Putting the above three principles into action gives an implementation for kangaroo() that cannot be beaten by a compiler on either size or execution speed.  For compactness the answer_no and answer_yes string addresses used by the ldr pseudo-instruction are left undeclared in the snippet below.

PUBLIC __kangaroo
 SECTION `.text`:CODE:NOROOT(2)

THUMB
__kangaroo:    ;                 //r0    //r1    //r2    //r3
 subs r1,r1,r3 ;char* kangaroo(int x1, int v1, int x2, int v2);
 subs r2,r2,r0
 ldr  r0,=answer_no
 bpl  no_kangaroo_swap
 rsbs r2,r2,#0
 rsbs r1,r1,#0 
no_kangaroo_swap: 
 cmp  r1,#0
 bpl  c_nonnegative 
 mov  pc,lr
 
c_nonnegative:
 bne  c_nonzero
#if defined(ARMv7M) || defined(ARMv8M)
 cbnz r2,c_zero_return;
#else 
 cmp  r2,#0 
 bne  c_zero_return
#endif
 ldr  r1,=answer_yes
c_zero_return:
 mov  pc,lr 
 
c_nonzero:
#if defined(ARMv7M)
 udiv r3,r2,r1
 umul r3,r1,r3
 subs r3,r2,r3 
 bne  c_pos_return 
#else
divide_loop: 
 subs r2,r2,r1
 bmi  c_pos_return
 bne  divide_loop
#endif
 ldr  r1,=answer_yes
c_pos_return:
 mov  pc_lr

There are more optimizations possible of course to avoid division in specific cases, but are only worth considering if they are perceived as likely to occur.  For example, a jump delta c of 1 will always result in a YES answer when d > 0.  And jump delta c of 2 will always result in a YES answer when the LSB of d is also 0.  The temptation to make the function grow to the point of unintelligibility must be avoided.

Note that the compact instructions CBZ and CBNZ are back in the low-end cores effective with the ARMv8-M Baseline specification, meaning they’ll be there on the Cortex-M22 (and this code will save an instruction cycle and two bytes).

This post was inspired by my brief subscription last year to HackerRank practice coding challenges by email. It was like a getting an entire lower-division CS homework assignment every week again, for better or worse. Most of the problems in the easy-to-medium ratings are quickly solved in C, less obviously implemented in hand-rolled assembly code, but one in particular seemed amenable to the 6809 addressing modes.

My impetus to pick up 6809 code had in turn been inspired by a few things, in chronological order:

  1. returning to a former employer in 2013 and noticing the MC6809 DIPs stocked for lab used had not yet depleted from the most recent (but likely decades old) order
  2. printing the datasheet and discovering it had several 16-bit registers
  3. listening to the CoCo Crew Podcast and the hosts’ discussion of position-independent code on the Color Computer

Here is my official solution in C to the Funny String problem, where the problem defines a funny string as one in which the set formed by calculating absolute alphabetical distance between every two successive letters in the string is the same for the reversed string.  The example for the problem illustrates “acxz” as funny (because its difference set {2, 21, 2} is symmetric) and “bcxz” as not funny (because its set {1, 21, 2} is not).

void print_funny(const char* string) {
  int a;
  const char* x;
  const char* y;

  a = strlen(string);
  if (a > 1) {
    x = string+0;
    y = string+a;
    for (a = (a+1)/2; a; a--) {
      int b = *--y;
      b -= *(y-1);
      b = (b >= 0) ? b : -b;
      {
        int refdif = b;
        b = *x++;
        b -= *x;
        b = (b >= 0) ? b : -b;
        if ((b -= refdif) != 0) {
          printf("Not ");
          break;
        }
      }
    }
  }
  printf("Funny\n");
}

A similar implementation for the CoCo running Extended Color BASIC just needs to:

  1. find the string from the USRn() call, pointing X at the first character or quitting with an error code of -1 if the argument is too short (or not a string at all)
  2. grab the string length from the BASIC structure, setting Y to the final character and iterating through the following loop half as many times (rounded down since an odd-length string will just leave a difference in the center that will match itself) so that X and Y meet in the middle
  3. for every pair of two adjacent characters at/after X and at/after Y, compare both the differences

Generating the assembly listing file with asm6809 -l funny.lst funny.s shows the total instruction count for the nfunny() function is 29 and byte count is 50. Note the inclusion of a #define FASTDUP flag to optimize the processing of repeat letters, e.g. whitespace, which adds 6 instructions totalling 11 bytes but reducing those iterations from 55 cycles to just 39:

nfunny ldb  #$ff  ;register int16_t nfunny(register uint1_t a,
       tsta       ;                        register uint8_t* x) {
       beq  6f    ;
       lda  ,x    ;  if (a && (x[0] > 1)) { // x: string struct
       bita #$fe  ;    register uint8_t a = x[0]; // a: strlen, >= 2
       beq  6f    ;    register uint8_t* xp = x; // x: &string[0]
       leax [2,x] ;    register int8_t* x = *((int8_t**)xp + 2);
       leay a,x   ;    register int8_t* y = x+a; // y: &string[a]
       inca       ;
       lsra       ;    for (a = (a + 1)/2; a; a--) { // process half

1      ldb  ,-y   ;      register int8_t b = *--y;
       subb -1,y  ;      b -= *(y-1);
       bpl  2f    ;
       comb       ;
       incb       ;      b = (b >= 0) ? b : -b;
if FASTDUP        ;#if defined(FASTDUP)
2      bne  2f    ;      if (b == 0) { // quicker compares but larger
       ldb  ,x+   ;        b = *x++;
       subb ,x    ;        b -= *x;
       beq  4f    ;        if (b == 0) continue;
       clra       ;        else return b & 0x00ff;
       bra  7f    ;      } else
endif             ;#endif
2      stb  ,-s   ;      { // duplicates not optimized but short code
       ldb  ,x+   ;        uint8_t refdif = b;
       subb ,x    ;
       bpl  3f    ;        b = *x++;
       comb       ;        b -= *x;
       incb       ;        b = (b >= 0) ? b : -b;
3      subb ,s+   ;        if ((b -= refdif) != 0) // mustn't be funny
       bne  5f    ;          return b & 0x007f;
4      deca       ;      }
       bne  1b    ;    }
       clrb       ;    return 0; // confirmed to be funny
5      andb #$7f  ;  } else
6      sex        ;    return -1; // string invalid or too short
7      jmp  $b4f4 ;}

In the wild, and certainly depending on the language, enough strings will have sequential duplicate characters that the 11 extra bytes will be affordable.  The return value in the double-width accumulator will be passed back to Extended Color BASIC by the jump to the ROM routine at $b4f4:

  • -1, if the string is zero-length
  • 0, if the string is funny
  • >0, if the string is not funny

So the ECB program becomes (assuming the above routine was loaded at $2800):

200 DEFUSR1=&H2800
210 INPUT A$
220 IF USR1(A$)<>0 PRINT "NOT ";
230 PRINT "FUNNY"

 

I  thought I would put out a quick review of the book I referred to in yesterday’s post.  Here is a shot of the cover of ISBN 978-1466470033, bought on Amazon for $25 in September 2012 (now out of print though and listed as used thereon for the somewhat ridiculous markup of $974.11).

1466470038.01._SX450_SY635_SCLZZZZZZZ_

The third edition ISBN 978-1484921906 is out now, and costs under $20.

When the DEC Alpha processor first came out two decades ago as I was studying the MIPS R2000 in school, I naturally assumed I’d learn to program a 64-bit system in assembly someday. At the time I figured it’d be Intel’s rumored Merced architecture that they were developing alongside HP, but the eventual Itanium processor fizzled out right as it reached the server market. Customers just weren’t willing to gamble on the promised IA-64 software becoming available, not in the mature server business. The Alpha had never really taken off either (and eventually got killed by the HP merger after Compaq acquired DEC) so when Advanced Micro Devices quickly came out with their 64-bit architecture soon thereafter it seemed like a winner by default.

The “amd64” (as the Linux kernel refers to it) retained compatibility with the IA-32 instruction set so that users could still run all their old systems unmodified.  Intel soon followed suit with their own implementation of the “x86_64” architecture and while I prefer the AMD documentation set to Intel’s, neither takes a sufficiently tutorial approach for someone not raised on the architecture.

For those looking for a clean 64-bit assembly instruction set unburdened with 32-bit baggage let alone 16-bit, this book spends a minimum of time focusing on the legacy.  Registers may now be referred to by number, but since the architecture still isn’t fully orthogonal the lower-numbered ones are still referred to as rdx, rsi, etc.

Seyfarth’s book delivers on its promises for a very reasonably priced textbook of this quality.  After several chapters and well-contrived exercises, I felt about as comfortable formulating a solution in AMD64 as in ARM or MIPS.  I like the LaTeX look of the chapters as opposed to the typeset feel of a $100 text, and extra resources are available online at http://rayseyfarth.com/asm

One suggestion I have is that it is somewhat hard to read arbitrary 16-digit hex numbers in the examples.  C++14 allows the insertion of the tick digit separator inside hex (or any) literal constants now, so if there is any yasm syntax to break them up similarly I would suggest it for a possible fourth edition.

A few years ago I picked up on Amazon an excellent 64-bit assembly programming tutorial (on the AMD64 architecture a.k.a. x86_64), which since seems to have gone out of print. Written by a university professor, it captures that exact TeX feel of college handouts during my formative semester of MIPS programming in the early 90’s.

Careful formulation of a problem around an instruction-set architecture (ISA) can just about guarantee beating gcc optimizations, so I knew I wanted to use register modes that don’t nicely map onto C code. The ability to manipulate eight-byte words directly was too tempting to pass up. What could I do that combined this new register width with Intel holdovers from the 1970’s such as support for direct (via registers AH, BH, etc.) access to the lately pointless, second-lowest byte of a quadword?

I decided to count how many quadword-aligned memory locations represent the first of 8 identical bytes. In other words, when fetching aligned quadwords we will see if it consists of a repeating byte pattern (such as 0x2b2b2b2b2b2b2b2b), and if so we increment a counter before proceeding to examine the next 8 bytes.

Furthermore, I decided to repeat this quadword memory test on a large scale, in fact on a larger chunk than would ever be possible in a 32-bit address space. Choosing 4GB of .bss segment is sufficient to prove we’re running on a 64-bit machine. A gcc -O2 C implementation of the check function churns through a 32-bit memory space in a fairly consistent 1.8 seconds on a virtual machine in my i7 laptop:

typedef union {
  unsigned long q;
  unsigned int d[2];
  unsigned short h[4];
  unsigned char b[8];
} flexible_t;

unsigned long check(flexible_t x) {
  return (x.d[0] == x.d[1]) && (x.h[0] == x.h[1]) && (x.b[0] == x.b[1]);
}

However, gcc can only muster enough insight into what is happening to squeeze this function down to:

check:
mov %rdi,$rdx
xor %eax,%eax
shr $0x20,%rdx
cmp %edx,%edi
je  check3
check2:
repz retq
nop
check3:
mov %rdi,%rdx
shr $0x10,%rdx
cmp %di,%dx
jne check3
mov %edi,%edx
movzbl %dh,%eax
cmp %dil,%al
sete %al
retq
nopl 0x0(%rax,%rax,1)

So what’s the quickest AMD64 instruction sequence that checks a quadword for byte consistency?  Much shorter than gcc’s effort as it turns out, since the C representation of the function doesn’t give much insight into the most suitable x86 instructions for the task.

My 4GB search takes the same laptop just 1.5s, a full 17% faster since this tight loop has now been further optimized by a few instructions. Ray Seyfarth uses yasm for the course, so the code builds with the line in the top comment:

;;; ../yasm -f elf64 -g dwarf2 -l all8same.lst all8same.asm; gcc -o all8same all8same.o

 segment .bss
array resq 0x80000000 ; two billion quadwords (16GB), just because we can
;;; check() returns a binary value (in rax):
;;; 1 if all eight bytes in the 64-bit input (passed in rdi)
;;; 0 otherwise
 segment .text
check:             ;uint64_t check(uint64_t rdi) {
 xor     esi,esi   ; rsi = 0;
 mov     eax,1     ; rax = 1;
 mov     rdx,rdi   ; rdx = rdi;
 cmp     dl,dh     ; if ((rdx & 0xff) != ((rdx & 0xff00)>>8))
 cmovne  eax,esi   ;  rax = rsi;
 rol     edx,16    ; rdx = ((rdx & 0xffffffffffff0000)>>16)|((rdx & 0x0000ffffffffffff)<<16);
 xor     edx,edi   ; if (((rdx & 0xffffffff) ^= (rdi & 0xffffffff)) == 0)
 cmovnz  eax,esi   ;  rax = rsi;
 mov     rdx,rdi   ; rdx = rdi;
 rol     rdx,32    ; rdx = ((rdx & 0xffffffff00000000)>>32)|((rdx & 0x00000000ffffffff)<<32);
 xor     rdx,rdi   ; if ((rdx ^= rdi) == 0)
 cmovnz  eax,esi   ;  rax = rsi;
 ret               ; return rax;
                   ;}
 global main
main:
 ;; main() entry point
 push rbp
 mov rbp,rsp
 sub rsp,16
 mov r10,array ; long* r10 = array;
 xor r13d,r13d ; int r13d = 0;
 xor rbx,rbx ; for (int ebx = 0;
forloop: 
 cmp ebx,0x7fffffff ; ebx < 0x80000000;
 jg done ; ebx++)
 lea rcx,[r10+rbx*4] ;
 mov rdi,[rcx] ;
 call check ; 
 add r13d,eax ; r13d += check(array[ebx]);
 inc ebx ;
 jmp forloop ;
done: 
 ;; main() exit point
 mov eax,r13d ; return eax = r13d;
 leave
 ret

Further helping the pipeline is the lack of branches/jumps inside of the check() function aside from conditional moves so it executes in constant time and no branch prediction is needed.

It is nothing new anymore to design a lone IC into a small-footprint radio transmitter application, just providing power and a baseband input signal on one pair of pins and a simple antenna on another. Few designers would include stock PIC microcontrollers in this category of ready-to-go, monolithic RF systems. With the new peripheral set of some newer 8-bit offerings and a little creativity, however, the humble PIC can become a self-contained—if un-optimized—analog FM transmitter.  (Passive filters tend to be very useful as well, but ignore these for now.)

In fact no extra hardware beyond a powered PIC12F1501 is needed in order to modulate an input waveform out to a rudimentary antenna. Not much programming is required, either, once the peripherals have been configured.  So rather than sketch a schematic, I’ve just labelled the pins in comments at the top of the source code. An RF bandpass filter (not shown) is highly recommended to eliminate mixer products far away from the desired carrier frequency; these are visible in the FFT plot below.

This minimalist approach is possible because a heterodyne radio transmitter requires at the bare minimum just a local oscillator (LO) and a mixer to multiply the analog input with the carrier frequency. If phase modulation instead of amplitude modulation is to be accomplished, a voltage-controlled-oscillator (VCO) usually first converts the varying-envelope modulating signal into an intermediate-frequency (IF) constant-amplitude waveform.

The 12F1501 and 16F1700-series PIC parts actually possess just enough peripheral hardware even at the lowest pincounts to bring the dream of an inexpensive monolithic hobby transmitter within reach:

  1. a factory-trimmed internal LO built into most modern low-BOM-cost microcontrollers
  2. a serviceable VCO, cobbled together from two other blocks
  3. a mixer approximated by a single core-independent logic gate
  4. a differential antenna power amplifier in the form of a complementary waveform generator (CWG)

An eight-pin PIC12F1501 can be purchased for $0.77 online from http://www.microchipdirect.com, with the pins allocated as follows:

  • 3 pins for power/reset
  • 1 pin for the Vin audio signal
  • 2 differential pins for the antenna
  • 1 pin for optional LO input (to achieve carriers other than 16MHz)

The leftover eighth pin can function as a probing point for exposing either the LO frequency divided by 4, or the single-ended RF output.

The peripherals themselves route together internally to do nearly all the heavy lifting. In fact the only thing the 8-bit digital core is ever doing is copying 10-bit analog-to-digital-converter (ADC) samples into the 16-bit increment parameter field for the numerically controlled oscillator (NCO) as they become available from the conversion.  Together these form the VCO.  The NCO is overflow-based realized by a simple counter and therefore has extremely high phase noise, but it turned out to enable a good enough VCO for this demonstration.

Perhaps most surprising is that an analog mixer to produce an FM signal after the up-conversion can be approximated by a logic gate, if the baseband signal has been pre-modulated. Since all the FM information is contained in the phase of the signals, i.e. the rising and falling edges through the DC midpoint, two digital levels are enough to represent them. Consider the AC waveforms to have passed through a high-gain sgn(x) function so that a logic Low is a -1 and a logic High is a +1. Multiplying two such continuous time waveforms together to perform mixing in the frequency domain is exactly the exclusive not-or (an inverted XOR, the “^” operator in C) function since -1 * +1 = -1 (L ^ H = H ^ L = L) and -1 * -1 = +1 * +1 = +1 (L ^ L = H ^ H = H).

Rather than hook up a square wave generator I connected the handy 1kHz calibration signal output at the front of my scope. Waveform math at a high enough sample rate produced an on-screen FFT spectrum (in red):

An oscilloscope both provides a handy 1kHz square wave output and displays an FFT showing the first three harmonics of the PIC's FM of that square wave by a 16MHz "carrier."

An oscilloscope both provides a handy 1kHz square wave output and displays an FFT showing the first three harmonics of the PIC’s FM of that square wave by a 16MHz “carrier.”

16MHz is the maximum internal LO frequency, and while trimmed enough for an taking a screenshot is not very stable across time and temperature. The next available step downward in frequency is 8MHz.  By #undef’ining the INTCLOCK symbol in the code an external precision time reference (officially 20MHz or less, presumably from a crystal oscillator’s output) can be fed into the mixer, and will incidentally run the rest of the PIC logic as well—including the ADC. So be mindful that Nyquist frequency for sampling any audio also will increase or decrease proportionally relative to the default 16MHz internally generated carrier.

Finally, let’s give it a listen on an Agilent E4402B spectrum analyzer:

1kHz square-wave frequency modulated by the PIC onto its 16MHz clock.  The spectrum analyzer is demodulating it back to a 1kHz tone on a speaker!

1kHz square-wave frequency modulated by the PIC onto its 16MHz clock. The spectrum analyzer is demodulating it back to a 1kHz tone on a speaker!

This is a design exercise only to show what is possible. Standard disclaimers about ensuring compliance with local/regional regulations on spectrum usage apply.  FM deviation (modulation index) may be reduced by scaling each ADC output before copying it into the NCO control registers, but this step will further decrease Nyquist due to the extra processing that adds onto the conversion time.

This post is about a useful feature that assembly coders should lobby ARM to retrofit onto their v6-M instruction set.  It is a staple not only of Microchip’s PIC instruction set, but also ARM’s own later v7-M architecture update: the compare-and-branch-if-zero (or -not-zero). As far as I can tell the omission of cbz and cbnz from ARM v6-M is purely historical, since their 16-bit instruction encoding is a perfect fit for that architecture.  This pair represents, in fact, the only two–byte-long instructions to be found in v7-M but not in v6-M.

Even the 8-bit PIC has it

Before examining cbz and cbnz, some background is due on a well-known microcontroller platform that offers a complementary instruction pair for testing against zero and skipping forward.  The main PIC instruction pair that enables conditional branching is btfss/btfsc: “bit test file, skip if set/clear”.  Frequently one can frame a problem in such a way as to beat the compiler on speed and code size.

As an example, a two-instruction sequence, namely

; (PIC) skipping ahead based on a single-bit value being zero
      btfsc file,b
      goto  next
      .
      .
next: ...

implements a specific but frequent “if” instruction in C:

if (file & (1<<b)) {
 .
 .
}

Indeed when b is a constant in the range 0 to 7, Microchip’s XC8 compiler produces the btfsc code above.  When this pattern above is applied to the STATUS register and the bit for its Z(ero) flag, the three-instruction sequence

; (PIC) skipping ahead based on a variable's value being zero
      movf  file,f
      btfsc STATUS,Z
      goto  next
      .
      .
next: ...

realizes the ubiquitous test instruction in C:

if (file) {
 .
 .
}

It is not necessary always to unconditionally branch after the btfss/btfsc test.  If the contents of the C “if” block is a single operation such as

if (file) file2 = file;

that happens to map onto a single PIC instruction, a branch can be avoided and execution speed benefits.  Because the movf file,w to set the Z flag in STATUS copies it to the working register, the conditional instruction can be one that copies from the working register into the destination, namely file2:

; humans compile "if (file) file2 = file;" to 3 PIC instructions
      movf  file,w
      btfss STATUS,Z
      movwf file2
next: ...            ; always reached after 3 cycles

(Without any unconditional branch statements.)  In this example where we just want to set file2 to file if the result will be nonzero, XC8 even in its optimizing “PRO” mode doesn’t compile to the above code but rather branches immediately so that it has a second instruction slot available to generate a redundant movf file,w even though the desired value is already there:

; XC8 compiles "if (file) file2 = file;" to 5 PIC instructions
      movf  file,w
      btfsc STATUS,Z
      goto  next
      movf  file,w
      movwf file2
next: ...            ; reached after either 4 cycles or 5 cycles

In tight loops a 25-40% speed hit like this can be a big deal.  When using a compiler on any platform, one should always look through the assembly code generated in the innermost “for” statements.  In super-critical code I once even exploited the skip-next-instruction feature of btfsc to put another btfsc in that slot:

; hand-written PIC code to speed a tight "while" loop evaluation
      banksel PORTA
loop:                       ; do {
      movf  PORTA,w         ;  w = PORTA;
      btfsc PORTC,RC2       ; } while ((PORTC & (1<<RC2) == 0)||
      btfsc file,vbit       ;          (file & (1<<vbit) != 0));
      goto  loop            ;

No compiler would ever generate this, and I wouldn’t suggest pursuing these kind of structures as a matter of course.  (This was for a software implementation of a two-axis quadrature encoder where reaction time of the PIC directly impacted maximum line density of the zebra wheel.)

The gaping hole in ARM v6-M

For this same reason, I like the ARM v7-M Thumb instructions cbz and cbnz which can do the same “if (register)” test in a single instruction and even preserve the Z flag by incorporating the label (encoded as a forward branch up to 126 instructions) into the statement:

; "if (file) file2 = file;" takes just 3 ARM v7-M instructions
      ldr   r0,file
      cbz   r0,next
      str   r0,file2
next: ...

But even though this code fits in 6 bytes, my 8-pin-DIP LPC810 can’t run it! The instruction set for the Cortex M0 family ARM targeted at the 8-bit microcontroller segment includes all 16-bit-encodable Thumb instructions except cbz/cbnz.  An optimizing Cortex-M compiler will actually generate the following when it is restricted to the v6-M instruction set, and furthermore clobbers the Z flag:

; but "if (file) file2 = file;" requires 4 ARM v6-M instructions
      ldr   r0,file
      cmp   r0,#0
      beq   next
      str   r0,file2
next: ...

With a Cortex M0 target gcc has no choice but to compile a simple zero/nonzero test into the more verbose form above. Clobbering the Z flag furthermore increases code size and execution time. The only reason for cbz/cbnz to be absent is that neither was historically part of the original (pre-2009) Thumb instruction set on which ARM based v6-M.  From an encoding standpoint it fits in the 16-bit instruction format and we are after all trying to save code space on sub-4KiB M0+ parts. Plus, I actually find the cbz format in the first example a tad easier to read through with objdump.

Power to the people

By entering the low-end microcontroller market with Cortex-M ARM Ltd. expanded their user base from just a few hundred consumer electronics manufacturers, to thousands of white-goods makers and millions of hobbyists. Make your voice heard, and at least let them know publicly whenever they miss the boat on an easily incorporated feature like cbz/cbnz!

My employer sent me to an ARM assembly language class once. I learned in the course introduction what made ARM special compared to the other high-performance architectures of the day: conditional instructions, multiple-register load/store instructions, current processor status register (CPSR) and the coprocessor hardware model….to name a few.

This general set of guidelines held true through version 7 of the instruction set architecture (ISA). ARM v8 is published, and chips are already in production. It consists of both a new “Aarch64” definition, and an “Aarch32” update that retains instructions which would still be recognizable from an assembly listing. But is 64-bit ARM “really” ARM as we know it? Their own data brief claims that it is a rationalization of the instruction set.

“Rationalization” in this case may just mean the Acorn folks have retired, and this is the new crop of ISA designers who were getting snot wiped off their noses for them in 1980. As you would expect, based on updated statistics of compiled code performance and thirty more years of published papers they are tweaking the design as they see fit.