Announcing the first usable release of PIC16 zOS, which brings RTOS features such as as software interrupts and preemptive multitasking to microcontrollers with as little as 1KB of RAM and flash.

To download this GPL’ed set of macros, which should assemble in absolute mode for any enhanced-midrange PIC12 or PIC16 using either the mpasm or gpasm assembler, visit  It is now possible on our favorite $0.50 mixed-signal microcontrollers to start, sleep, fork and kill up to 5 concurrent tasks (called jobs) in an operating system optimized around its architecture.  Any of those jobs can be fully interactive time-shared terminals attached to independent UART pin pairs.  It thus hearkens back to earlier (but not quite as scaled-down) multi-user 8-bit systems from the 1980s such as Microware’s OS9 and its open-source re-implementation NitrOS9.

zOS Origins

What got me started in early 2017 on what eventually became zOS was not initially the desire to write an entire niche RTOS in assembly, but rather to address a common complaint I hear about programming the PIC core in that language. Without exception, everyone mentions having neglected the nuances about RAM banking at some point and tracked down the omitted banksel macro only after stepping through the code.

What if, I wondered, it was possible to refactor my historical library of PIC assembly programs into modules small enough to each run with 80 bytes of statically allocated RAM? In that case, they could be assigned their own job number that coincided with the instantaneous contents of the bank select register (BSR).  RAM banking wouldn’t be needed, since those 80 bytes would automatically appear between 0x20 and 0x6f when that job was running in the foreground.  Writing a special function register (SFR) in the range 0x0c to 0x1f would be handled either before the RTOS scheduler was initialized (many SFRs just set up the PIC operating mode and are never touched again) or by a separate job in the classical model of a hardware driver, activated either as a hardware or software interrupt.

Thinking small

What kinds of PIC jobs could satisfy such a stringent requirement of 80 bytes? My thinking went like this:

  • a console I/O job could use it to buffer most of a single line of text output while feeding characters one by one to the UART peripheral each time a TXIF interrupt indicated a prior one had been sent; a second job could do the same for TX2IF on a dual-UART part to give separate stdio and stderr consoles
  • with a little more code, this same I/O job could be registered to handle serial port input interrupts, implementing such on-board debugging facilities as a memory monitor/disassembler or even simple command interpreters that don’t need long ASCII buffers
  • a SPI driver job could manage short exchanges of bytes with any of the bus devices through their register interfaces, as well as longer data exchanges such as SD cards since…
  • a heap allocator job could implement malloc() and free() on the leftover RAM (above the 512 bytes required for a 5-job system) through one or two software interrupt (SWI) lines that it watches.  Its own 80 bytes are a table tracking the status of up to 40 memory address “handles” not yet released, with space-coalescing “garbage collection” algorithms run whenever this job comes to the foreground.

Even in a system running all of these simultaneously, there are one or two job slots left unoccupied for two main programs to run, or two instances of the same re-entrant main program.  With a certain code-factoring, the reduced application code due to multiple small running jobs sharing the same executable code can somewhat mitigate the already modest <1KB flash footprint of zOS itself.

This re-entrant capability is illustrated in the included olirelay.asm demonstration program, which assigns jobs 1 through 4 to the four pairs of optocouplers and reed relays on an Olimex board, and job 5 to a serial console logging the board’s operation for debugging.  Since the operation of the four channels is almost completely symmetric, their jobs are all initialized using the same memory address.  In this relay-control example, jobs in fact detect which GPIOs they are in charge of based on their job number, which they obtain by reading the BSR. This method allows relay jobs to be created, destroyed or put to sleep to manage the activity status of the channels or even to run other short jobs that would otherwise exceed the limit of 5.  When a job comes back into existence, it resumes management of the corresponding optocoupler/relay port.

zOS RAM overhead

Why a hard limit of 5 tasks, when most open-source RTOSes would allow more scalability? Put simply, this is the sweet spot that allows a tiny kernel to run usefully on even the smallest PIC devices.  The 80 bytes guaranteed available in RAM bank 0 is divided into 5 regions of 16 bytes and it is this constricted space that the minimalist scheduler searches extremely efficiently, since there are 16 bytes of state variables that must be stashed away:

  • a 15-bit starting address in the up to 32K of flash program storage, which is used both for cloning new instance(s) of an existing job and as a handle for finding which job number a particular program got assigned to, with the leftover high bit indicating a primitive “privileged” status that allows a job to start/kill/fork other jobs
  • a 15-bit Program Counter (PC) value indicating where the job would have fetched its next instruction when instead its execution got preempted, with the leftover high bit used to indicate that the job is asleep and shouldn’t be returned to the foreground prematurely (usually because it needs a specific interrupt to occur first)
  • STATUS register contents
  • WREG byte
  • STKPTR index
  • PCLATH byte (5-bit temp register or upper portion of PC, set before a jump/call)
  • two 16-bit pointers into RAM or flash memory, called FSR0 and FSR1
  • a 15-bit address of the interrupt service routine (ISR) registered by the job, or zero if none
  • a byte mask indicating which bits of PIRx registers are of interest for processing a HWI, or zero if none, to speed searching without a hardware vector table
  • a byte mask indicating which of the five SWI lines 3-7 is of interest, or zero if none (the lowest 3 bits of an SWI value are used to implement up to 7 zOS system calls to manage jobs)

Notice that much of this shadow register ordering matches that of the _SHAD registers in the uppermost bank the PIC uses for HW interrupts (HWI), except for the BSR which merely becomes an index 1-5 into these 16-byte records and doesn’t need to be explictly stored.  The SWI implementation, incidentally, is really just a jump to a fixed address (0x002) that immediately disables hardware interrupts so that it can stash the values of these registers in the shadowing area.  Then, upon returning from interrupt context the scheduler doesn’t need to know whether it was doing a HWI or SWI; it just executes a retfie instruction to restore the _SHAD values and everything just works.

The usage of the 16 bytes of global (unbanked) SRAM memory is allocated as follows:

  • 0x70 and 0x71 reserved for use by zOS, specifically the scheduler and interrupt handler
  • 0x72 to 0x7b ten unclaimed contiguous bytes; macros are provided for implementing a simple two-byte mailbox interface for inter-job communication
  • 0x7c to 0x7f arguments from user space to zOS system calls such as launching new jobs; the zOS_ARG() macro ensures interrupts are off before touching these bytes

Trading stack depth for space

By the way, rather than reserve another 5*32=160 bytes of memory to store a full copy of the stack for each process, I have assigned each running job either 2 or 3 levels of the stack (with the ISR allowed either 1 or 0 levels for function calling respectively). This corresponds with the earlier mid-range PIC family and turns out not to be a major impediment in well-compiled assembly language programs since adjacent return instructions can be avoided by jumping rather than calling. The 32 bytes of memory used as temporary storage for restoring the Rolodex-style stack (at the end of each context switch that happens every time an interrupt gets processed) also becomes usable as a dependable source of scratch memory during the processing of those interrupts so that ISR’s don’t need to use any statically or dynamically allocated memory.

Non-cooperative scheduling

It turned out that not only was writing the software interrupt interface (SWI) facility using the _SHAD registers straightforward (I’d done it as a one-day exercise in 2015) but getting a round-robin scheduler attached to the TMR0 interrupt was also surprisingly few lines of code. Every time an interrupt arrives, zOS searches for a job with a registered handler by a simple hashing of the PIFx bits against the mask bits stored in the last two bytes of the valid job record above.  Rather than performing the retfie itself, the ISR jumps to the scheduler which finds the next ready job after the one that caused the interrupt and restores its record into the _SHAD registers before finally executing a retfie.  The 8-bit TMR0 interrupt universally available across the PIC family is not special to zOS (aside from optionally driving heartbeat LED) and merely ensures that jobs are rescheduled at a certain minimum frequency to avoid starvation.

Programming in the zOS environment

With the history and design decisions described, most readers who have made it this far will want to know how to write code for it.  There is no C compiler interface yet, but as advertised the PIC assembly language gets a lot more amenable when bank switching ceases to be a consideration.

Rules of thumb:

  1. job number is the bank number; don’t touch the BSR with interrupts enabled since that’s the only way the scheduler knows which job was running and by extension where to shadow its state
  2. instead use the SWI facility to access hardware directly, similar to DOS programming in the old days; the SWI feature that zOS implements for the is extremely efficient and will automatically shadow all possible PIC registers as well as optionally return a byte value from the call in WREG
  3. if absolutely necessary to bank switch inside a job’s user space, just disable global interrupts with the INTCON.  This option must be exercised sparingly and the BSR must be unchanged upon re-enabling interrupts or everything will fall apart quickly

All symbols that zOS defines are prefixed with zOS_ so there will be no chance of a name collision with any of your own code.  Optionally, there is a bash script called “generic” that will change this prefix from zOS_ to RTOS if you prefer a more obvious convention.

zOS gives PIC a software interrupt capability; use it!

As described at the outset, the main way to avoid bank switching in application code is to isolate the hardware accesses with drivers accessed through either the traditional PIC hardware interrupt or through the new software interrupts.

Using a SWI interface to perform function calls with up to four bytes of argument neither impacts the stack as much as a call or callw instruction (since the PC is shadowed on the sixteenth, unclaimed stack word that is always available since by definition a context switch cannot happen while an interrupt is active) nor clobbers any of the shadowable registers.  This shadowing is as close as we could ever get to a midrange PIC procedure call standard protecting certain registers across subroutine calls.  Also, a task invoking its own SWI for convenience can specify a code of -1 rather than needing to explicitly mention its own SWI line to invoke it.  This is useful for writing re-entrant code in jobs that wouldn’t be possible if each instance had to know its own SWI.

Rather than burning one of 5 SWI lines for each of 5 jobs to allow them to communicate simple status updates to each other, the mailbox interface allows for lower-priority communications by writing global memory.  A find-by-address system call zOS_FND will tell if another job is running, and doubling the job number then adding 0x70 gives its global mailbox byte pair.

Sample code

The most basic zOS application would include the files and define a single program without any of its own interrupts, just accessing the OS (in this case to end a running job) through the zOS_SWI macro call:

        processor 16fxxxxx
        include p16fxxxxx

        org 0
        include      ; Assembles the <1K zOS kernel code
        include ; Just defines the other macros used below
        include setup.mcc    ; Output from Microchip Code Configurator
        zOS_INT 0,0          ; No hardware or software interrupts
        zOS_ADR prog1,zOS_UNP; Start prog1 as an unprivileged job
        zOS_LAU WREG
        btfsc   STATUS,Z     ; Successful launch returns a job 1 to 5
        zOS_ACT FSR0         ; Un-sleep job now (vs later, by another)
        zOS_RUN INTCON,INTCON; Starts scheduler; zOS is now in control

       ;;; program 1 setup code (example to loop 256 times then end)
p1cntr equ      0x20
       clrf     p1cntr       ; Local variable space runs 0x20 to 0x6f
       ;;; program 1's loop
       decfsz   p1cntr,f
       bra      p1loop
       ;;; jobs either sleep (zOS_SLP) or end (END) when nothing to do
       movf     zOS_ME       ; Put own job number in working register
       zOS_ARG  0            ; Turn off interrupts, pass job to system
       zOS_SWI  zOS_END      ; All jobs have perms to end themselves
       bra      $            ; Not reached (no caller so don't return) 


Adding interrupts

Assembly for a more typical set of multiple programs using zOS for interrupt handling would follow this example:

        processor 16fxxxxx
        include p16fxxxxx

        org 0                ; (zOS allows stub code inserted up front)
        include      ; Assembles the <1K zOS kernel code
        include ; Just defines the other macros used below
        include setup.mcc    ; Output from Microchip Code Configurator       pagesel main
        pagesel main
        goto    main         ; Moving zOS_RUN to bottom for aesthetics

        zOS_NAM "program one"; <=16 chars to appear in running job list
        ;;; just yield processor continuously and do all work in ISR
        zOS_SWI zOS_YLD
        bra     prog1

P1HWI   equ     ADIF         ; ADC ready flag bit 6 in PIR1 (example)
P1SWI   equ     zOS_SW3      ; SWI line must be in the range 3 to 7
        ;;; interrupt service routine for program 1
        banksel PIE1
        btfss   PIE1,P1HWI   ; HW interrupt enabled?
        bra     p1nothw
        banksel PIR1
        btfss   PIR1,P1HWI   ; HW interrupt occurred?
        bra     p1nothw
        bcf     PIR1,P1HWI   ; Clear the interrupt flag to acknowledge
        zOS_RFI              ; Return (presumably after processing)
        movlw   P1SWI
        andwf   zOS_MSK,w
        btfss   STATUS,Z 
        bra     p1soft
        zOS_RET              ; Matched hash but not our interrupt
        movlw   1
        zOS_RFS              ; Return a value (e.g. 1) in WREG to caller

        zOS_NAM "program two"; etc.
        movlw   low p1isr
        zOS_ARG 0
        movlw   high p1isr
        zOS_ARG 1
        zOS_INT P1HWI,P2SWI  ; Specify the interrupt handlers at launch
        zOS_ADR prog1,zOS_UNP
        zOS_LAU WREG
        btfsc   STATUS,Z     ; Successful launch returns a job 1 to 5

        ;;; instead of immediately activating the program(s), launch
        ;;; a 9600bps job-management console on EUSART number 1 (which
        ;;; uses PIR1 for TX/RC interrupts) and heartbeat on pin RA0
        zOS_MAN 1,.32000000/.9600,PIR1,LATA,RA0,0
        movlw   zOS_SW7
        movwi   0[FSR0]      ; register SWI 7 for printing characters
        zOS_RUN INTCON,INTCON; Starts scheduler; zOS is now in control

Demonstration programs

There are just two public demos at this point: olirelay.asm as described above and demo_hea.asm which churns the memory allocation code in a stress test.  Note that the dual-port RAM of the USB-enabled PIC16F15xx devices which Microchip inexplicably put at the bottom rather than the top of the linear memory map currently precludes a zOS driver for a USB driver job.  Resolving this is high on the priority list, since USB stacks practically beg to be run under an RTOS of some kind.

A 2048 adventure…stay tuned

In the shorter term, a subsequent post is in the works detailing my attempts to implement the 2048 game in a single job over a serial connection on a PIC.  Rather than factoring the code into multiple jobs it uses mainly the easy setup and extension of a console job, as well as the convenient software-interrupt I/O features of zOS.  It will, however, start a concurrent game in parallel on each UART if it finds a second (or third or fourth or fifth) one.  If only there were any quintuple-UART PIC16 parts, as in the PIC18 family, we could fill all 5 jobs with consoles!


Hot on the heels of my last post, which had sat in draft mode for too long until I finally (on a plane) counted the instruction cycles, comes a second hand-optimization of a C function into assembly language.  On the same trip I came across another interesting CS problem from HackerRank that reduces to a couple (20 or fewer) of Thumb assembly instructions.  In fact my original C solution had been much longer and more involved until a series of successive insights that reduced it to a single if-then-else test, followed by immediately evaluating one of three expressions (based on a register being negative, zero or positive) that yields a string-pointer return value in r0 which is either “YES” or “NO”.

This solution particularly elegantly illustrates two aspects of assembly optimization by hand:

  • Reducing the number of inputs as quickly as possible in order to avoid using unclobberable registers (r4 through r11 for ARM, which according to the ATPCS must be preserved across function calls)
  • Setting the condition flags as a result of performing necessary register manipulations, which partly involves ordering those instructions so the flag values are already set at the moment a branch must be decided.

But this is getting ahead of ourselves.  First let’s step back from native assembly language nuance to read the problem and review the C solution that HackerRank wants the coder to obtain.

Kangaroo problem description

Two kangaroo avatars are somewhere (maybe at the same integer) on a number line, and will each jump a certain (maybe the same) positive integer number toward +∞ at the same instant, over and over again.  Write a function given the initial position and jump distance of each kangaroo, that returns “YES” if they will ever land at the same point on the number line at the same time, i.e. after the nth jump for some value of n their position is the same, and “NO” otherwise.

The intended approach, especially if you’re in a job interview, is not to begin a jump-by-jump simulation until one of the kangaroos passes the other.  The reason is that a solution using this naive approach is tempting to obtain quickly by directly converting the problem statement into code:

  1. Set up a loop to increment the initial position of each kangaroo by its jump distance,
  2. Compare for equality (in which case we output “YES” then exit), otherwise
  3. Apply a slightly more complicated test to figure out whether to output “NO” or continue the next iteration of the loop since the inter-kangaroo distance is shrinking.

Before starting down this path, as with all interview questions, take a break before starting to write.  Try to get inside the head of your interviewer, and above all: talk about your insights into the problem as they occur to you.

Why Kangaroo is a good interview problem

Interviewers will see this approach as supporting mastery of the programming language claimed on the resume but—if “turned in” as the solution—also as evidence of lack of insight into the nature of computational problems.  Perhaps peers will see a hasty approach and fear the generation of code they’ll need to rewrite because it uses resources inefficiently or just plain ignores corner cases: If the jump distance of each kangaroo is the same, do you avoid an infinite loop that waits in vain for the laggard to catch up?  If extended to a machine integer size able to represent the initial distance from kangaroo 1 to kangaroo 2 as greater than the number of atoms on Earth, and the jump delta is just one unit, does your code overflow undetected or just take way longer than necessary to execute?

The first insight is to realize that the four inputs (two x positions and two v jump distances) can be immediately boiled down to just two quantities: the initial difference d between kangaroo 1 and kangaroo 2, and the catch-up distance c between kangaroo 1 and kangaroo 2 on each jump.  It is useful in the interest of compact code to expect the parameters of the “aft” kangaroo (the one starting closer to -∞ and trying to catch up) to be passed in x1 and v1.  But both c and d are signed quantities, so a simple test for the sign of d allows us to effectively swap the kangaroo parameters by forcing d positive and ensuring that the sign of c is in agreement with whether the “aft” kangaroo is catching up (c > 0) or falling behind (c < 0).  A third possibility is that the distance between them is constant, in which case whether they collide at the end of one jump is a function of whether they started at the same position.

Here is the compact solution in C:

// C submission for
char* kangaroo(int x1, int v1, int x2, int v2) {
 static char* answer_yes = "YES";
 static char* answer_no = "NO";

 int d = x2 - x1; // dEFICIT of aft kangaroo behind fore kangaroo 
 int c = v1 - v2; // cLOSING in distance between kangaroos per jump

 if (d < 0) { // swap kangaroos since x1 > x2
  d = -d; // d is now nonnegative
  c = -c; // c sign must be flipped too
 if (c < 0) // aft kangaroo gets farther away with each jump, not closer
  return answer_no;

 else if (c == 0) // kangaroos jump same distance, answer is their initial match
  return (d == 0) ? answer_yes : answer_no;

 else // aft kangaroo hits fore kangaroo iff c (positive) divides evenly into d
  return (d % c == 0) ? answer_yes : answer_no;

ARM implementation

For just about every modern architecture the majority of registers must be preserved by a called function, in the interest of low-overhead function calls.  Optimizing compilers will try to avoid doing so, not only for the stack hit but also because the flexibility to return to the caller prematurely (before the closing brace of the declaration) is lost.

Register usage

As seen from their order in the function declaration the parameters x1 and x2 arrive to kangaroo() in the even argument registers r0 and r2 respectively, while the parameters v1 and v2 arrive in the odd argument registers r1 and r3 respectively.  As the first step we’ll place c in r1 by subtracting r3 from r1 (neither value is needed again), and d in r2 by subtracting r0 from r2 (ditto).

The initial guess of return value placed in r0 is “answer_no” since the expected value from a Monte Carlo analysis shows that a kangaroo pair with random parameters are unlikely ever to collide.  As soon as a collision is known to be impossible, a “mov pc,lr” gets us out of the function with this correct return value.  If we do detect a kangaroo collision we’ll change r0 to “answer_yes” right before exiting.

Lastly r3 is available for calculating a remainder, in one of two ways according to whether the architecture supports hardware division.  No higher registers are needed, a good outcome since they would not only have to be saved on the stack but also restored at a common exit point that would result in more branching.

Condition flags

The reason for calculating c before d wasn’t alphabetical; the first decision that must be made is whether d is negative, indicating the parameters for the “fore” and “aft” kangaroo got swapped on the call.  The N flag will be set automatically by the subtraction that sets d, so it is not necessary to do a separate compare as long as c isn’t set after d (as the C code would suggest is possible).

The code

Putting the above three principles into action gives an implementation for kangaroo() that cannot be beaten by a compiler on either size or execution speed.  For compactness the answer_no and answer_yes string addresses used by the ldr pseudo-instruction are left undeclared in the snippet below.

PUBLIC __kangaroo

__kangaroo:    ;                 //r0    //r1    //r2    //r3
 subs r1,r1,r3 ;char* kangaroo(int x1, int v1, int x2, int v2);
 subs r2,r2,r0
 ldr  r0,=answer_no
 bpl  no_kangaroo_swap
 rsbs r2,r2,#0
 rsbs r1,r1,#0 
 cmp  r1,#0
 bpl  c_nonnegative 
 mov  pc,lr
 bne  c_nonzero
#if defined(ARMv7M) || defined(ARMv8M)
 cbnz r2,c_zero_return;
 cmp  r2,#0 
 bne  c_zero_return
 ldr  r1,=answer_yes
 mov  pc,lr 
#if defined(ARMv7M)
 udiv r3,r2,r1
 umul r3,r1,r3
 subs r3,r2,r3 
 bne  c_pos_return 
 subs r2,r2,r1
 bmi  c_pos_return
 bne  divide_loop
 ldr  r1,=answer_yes
 mov  pc_lr

There are more optimizations possible of course to avoid division in specific cases, but are only worth considering if they are perceived as likely to occur.  For example, a jump delta c of 1 will always result in a YES answer when d > 0.  And jump delta c of 2 will always result in a YES answer when the LSB of d is also 0.  The temptation to make the function grow to the point of unintelligibility must be avoided.

Note that the compact instructions CBZ and CBNZ are back in the low-end cores effective with the ARMv8-M Baseline specification, meaning they’ll be there on the Cortex-M22 (and this code will save an instruction cycle and two bytes).

This post was inspired by my brief subscription last year to HackerRank practice coding challenges by email. It was like a getting an entire lower-division CS homework assignment every week again, for better or worse. Most of the problems in the easy-to-medium ratings are quickly solved in C, less obviously implemented in hand-rolled assembly code, but one in particular seemed amenable to the 6809 addressing modes.

My impetus to pick up 6809 code had in turn been inspired by a few things, in chronological order:

  1. returning to a former employer in 2013 and noticing the MC6809 DIPs stocked for lab used had not yet depleted from the most recent (but likely decades old) order
  2. printing the datasheet and discovering it had several 16-bit registers
  3. listening to the CoCo Crew Podcast and the hosts’ discussion of position-independent code on the Color Computer

Here is my official solution in C to the Funny String problem, where the problem defines a funny string as one in which the set formed by calculating absolute alphabetical distance between every two successive letters in the string is the same for the reversed string.  The example for the problem illustrates “acxz” as funny (because its difference set {2, 21, 2} is symmetric) and “bcxz” as not funny (because its set {1, 21, 2} is not).

void print_funny(const char* string) {
  int a;
  const char* x;
  const char* y;

  a = strlen(string);
  if (a > 1) {
    x = string+0;
    y = string+a;
    for (a = (a+1)/2; a; a--) {
      int b = *--y;
      b -= *(y-1);
      b = (b >= 0) ? b : -b;
        int refdif = b;
        b = *x++;
        b -= *x;
        b = (b >= 0) ? b : -b;
        if ((b -= refdif) != 0) {
          printf("Not ");

A similar implementation for the CoCo running Extended Color BASIC just needs to:

  1. find the string from the USRn() call, pointing X at the first character or quitting with an error code of -1 if the argument is too short (or not a string at all)
  2. grab the string length from the BASIC structure, setting Y to the final character and iterating through the following loop half as many times (rounded down since an odd-length string will just leave a difference in the center that will match itself) so that X and Y meet in the middle
  3. for every pair of two adjacent characters at/after X and at/after Y, compare both the differences

Generating the assembly listing file with asm6809 -l funny.lst funny.s shows the total instruction count for the nfunny() function is 29 and byte count is 50. Note the inclusion of a #define FASTDUP flag to optimize the processing of repeat letters, e.g. whitespace, which adds 6 instructions totalling 11 bytes but reducing those iterations from 55 cycles to just 39:

nfunny ldb  #$ff  ;register int16_t nfunny(register uint1_t a,
       tsta       ;                        register uint8_t* x) {
       beq  6f    ;
       lda  ,x    ;  if (a && (x[0] > 1)) { // x: string struct
       bita #$fe  ;    register uint8_t a = x[0]; // a: strlen, >= 2
       beq  6f    ;    register uint8_t* xp = x; // x: &string[0]
       leax [2,x] ;    register int8_t* x = *((int8_t**)xp + 2);
       leay a,x   ;    register int8_t* y = x+a; // y: &string[a]
       inca       ;
       lsra       ;    for (a = (a + 1)/2; a; a--) { // process half

1      ldb  ,-y   ;      register int8_t b = *--y;
       subb -1,y  ;      b -= *(y-1);
       bpl  2f    ;
       comb       ;
       incb       ;      b = (b >= 0) ? b : -b;
if FASTDUP        ;#if defined(FASTDUP)
2      bne  2f    ;      if (b == 0) { // quicker compares but larger
       ldb  ,x+   ;        b = *x++;
       subb ,x    ;        b -= *x;
       beq  4f    ;        if (b == 0) continue;
       clra       ;        else return b & 0x00ff;
       bra  7f    ;      } else
endif             ;#endif
2      stb  ,-s   ;      { // duplicates not optimized but short code
       ldb  ,x+   ;        uint8_t refdif = b;
       subb ,x    ;
       bpl  3f    ;        b = *x++;
       comb       ;        b -= *x;
       incb       ;        b = (b >= 0) ? b : -b;
3      subb ,s+   ;        if ((b -= refdif) != 0) // mustn't be funny
       bne  5f    ;          return b & 0x007f;
4      deca       ;      }
       bne  1b    ;    }
       clrb       ;    return 0; // confirmed to be funny
5      andb #$7f  ;  } else
6      sex        ;    return -1; // string invalid or too short
7      jmp  $b4f4 ;}

In the wild, and certainly depending on the language, enough strings will have sequential duplicate characters that the 11 extra bytes will be affordable.  The return value in the double-width accumulator will be passed back to Extended Color BASIC by the jump to the ROM routine at $b4f4:

  • -1, if the string is zero-length
  • 0, if the string is funny
  • >0, if the string is not funny

So the ECB program becomes (assuming the above routine was loaded at $2800):

200 DEFUSR1=&H2800
210 INPUT A$
220 IF USR1(A$)<>0 PRINT "NOT ";


I  thought I would put out a quick review of the book I referred to in yesterday’s post.  Here is a shot of the cover of ISBN 978-1466470033, bought on Amazon for $25 in September 2012 (now out of print though and listed as used thereon for the somewhat ridiculous markup of $974.11).


The third edition ISBN 978-1484921906 is out now, and costs under $20.

When the DEC Alpha processor first came out two decades ago as I was studying the MIPS R2000 in school, I naturally assumed I’d learn to program a 64-bit system in assembly someday. At the time I figured it’d be Intel’s rumored Merced architecture that they were developing alongside HP, but the eventual Itanium processor fizzled out right as it reached the server market. Customers just weren’t willing to gamble on the promised IA-64 software becoming available, not in the mature server business. The Alpha had never really taken off either (and eventually got killed by the HP merger after Compaq acquired DEC) so when Advanced Micro Devices quickly came out with their 64-bit architecture soon thereafter it seemed like a winner by default.

The “amd64” (as the Linux kernel refers to it) retained compatibility with the IA-32 instruction set so that users could still run all their old systems unmodified.  Intel soon followed suit with their own implementation of the “x86_64” architecture and while I prefer the AMD documentation set to Intel’s, neither takes a sufficiently tutorial approach for someone not raised on the architecture.

For those looking for a clean 64-bit assembly instruction set unburdened with 32-bit baggage let alone 16-bit, this book spends a minimum of time focusing on the legacy.  Registers may now be referred to by number, but since the architecture still isn’t fully orthogonal the lower-numbered ones are still referred to as rdx, rsi, etc.

Seyfarth’s book delivers on its promises for a very reasonably priced textbook of this quality.  After several chapters and well-contrived exercises, I felt about as comfortable formulating a solution in AMD64 as in ARM or MIPS.  I like the LaTeX look of the chapters as opposed to the typeset feel of a $100 text, and extra resources are available online at

One suggestion I have is that it is somewhat hard to read arbitrary 16-digit hex numbers in the examples.  C++14 allows the insertion of the tick digit separator inside hex (or any) literal constants now, so if there is any yasm syntax to break them up similarly I would suggest it for a possible fourth edition.

A few years ago I picked up on Amazon an excellent 64-bit assembly programming tutorial (on the AMD64 architecture a.k.a. x86_64), which since seems to have gone out of print. Written by a university professor, it captures that exact TeX feel of college handouts during my formative semester of MIPS programming in the early 90’s.

Careful formulation of a problem around an instruction-set architecture (ISA) can just about guarantee beating gcc optimizations, so I knew I wanted to use register modes that don’t nicely map onto C code. The ability to manipulate eight-byte words directly was too tempting to pass up. What could I do that combined this new register width with Intel holdovers from the 1970’s such as support for direct (via registers AH, BH, etc.) access to the lately pointless, second-lowest byte of a quadword?

I decided to count how many quadword-aligned memory locations represent the first of 8 identical bytes. In other words, when fetching aligned quadwords we will see if it consists of a repeating byte pattern (such as 0x2b2b2b2b2b2b2b2b), and if so we increment a counter before proceeding to examine the next 8 bytes.

Furthermore, I decided to repeat this quadword memory test on a large scale, in fact on a larger chunk than would ever be possible in a 32-bit address space. Choosing 4GB of .bss segment is sufficient to prove we’re running on a 64-bit machine. A gcc -O2 C implementation of the check function churns through a 32-bit memory space in a fairly consistent 1.8 seconds on a virtual machine in my i7 laptop:

typedef union {
  unsigned long q;
  unsigned int d[2];
  unsigned short h[4];
  unsigned char b[8];
} flexible_t;

unsigned long check(flexible_t x) {
  return (x.d[0] == x.d[1]) && (x.h[0] == x.h[1]) && (x.b[0] == x.b[1]);

However, gcc can only muster enough insight into what is happening to squeeze this function down to:

mov %rdi,$rdx
xor %eax,%eax
shr $0x20,%rdx
cmp %edx,%edi
je  check3
repz retq
mov %rdi,%rdx
shr $0x10,%rdx
cmp %di,%dx
jne check3
mov %edi,%edx
movzbl %dh,%eax
cmp %dil,%al
sete %al
nopl 0x0(%rax,%rax,1)

So what’s the quickest AMD64 instruction sequence that checks a quadword for byte consistency?  Much shorter than gcc’s effort as it turns out, since the C representation of the function doesn’t give much insight into the most suitable x86 instructions for the task.

My 4GB search takes the same laptop just 1.5s, a full 17% faster since this tight loop has now been further optimized by a few instructions. Ray Seyfarth uses yasm for the course, so the code builds with the line in the top comment:

;;; ../yasm -f elf64 -g dwarf2 -l all8same.lst all8same.asm; gcc -o all8same all8same.o

 segment .bss
array resq 0x80000000 ; two billion quadwords (16GB), just because we can
;;; check() returns a binary value (in rax):
;;; 1 if all eight bytes in the 64-bit input (passed in rdi)
;;; 0 otherwise
 segment .text
check:             ;uint64_t check(uint64_t rdi) {
 xor     esi,esi   ; rsi = 0;
 mov     eax,1     ; rax = 1;
 mov     rdx,rdi   ; rdx = rdi;
 cmp     dl,dh     ; if ((rdx & 0xff) != ((rdx & 0xff00)>>8))
 cmovne  eax,esi   ;  rax = rsi;
 rol     edx,16    ; rdx = ((rdx & 0xffffffffffff0000)>>16)|((rdx & 0x0000ffffffffffff)<<16);
 xor     edx,edi   ; if (((rdx & 0xffffffff) ^= (rdi & 0xffffffff)) == 0)
 cmovnz  eax,esi   ;  rax = rsi;
 mov     rdx,rdi   ; rdx = rdi;
 rol     rdx,32    ; rdx = ((rdx & 0xffffffff00000000)>>32)|((rdx & 0x00000000ffffffff)<<32);
 xor     rdx,rdi   ; if ((rdx ^= rdi) == 0)
 cmovnz  eax,esi   ;  rax = rsi;
 ret               ; return rax;
 global main
 ;; main() entry point
 push rbp
 mov rbp,rsp
 sub rsp,16
 mov r10,array ; long* r10 = array;
 xor r13d,r13d ; int r13d = 0;
 xor rbx,rbx ; for (int ebx = 0;
 cmp ebx,0x7fffffff ; ebx < 0x80000000;
 jg done ; ebx++)
 lea rcx,[r10+rbx*4] ;
 mov rdi,[rcx] ;
 call check ; 
 add r13d,eax ; r13d += check(array[ebx]);
 inc ebx ;
 jmp forloop ;
 ;; main() exit point
 mov eax,r13d ; return eax = r13d;

Further helping the pipeline is the lack of branches/jumps inside of the check() function aside from conditional moves so it executes in constant time and no branch prediction is needed.

It is nothing new anymore to design a lone IC into a small-footprint radio transmitter application, just providing power and a baseband input signal on one pair of pins and a simple antenna on another. Few designers would include stock PIC microcontrollers in this category of ready-to-go, monolithic RF systems. With the new peripheral set of some newer 8-bit offerings and a little creativity, however, the humble PIC can become a self-contained—if un-optimized—analog FM transmitter.  (Passive filters tend to be very useful as well, but ignore these for now.)

In fact no extra hardware beyond a powered PIC12F1501 is needed in order to modulate an input waveform out to a rudimentary antenna. Not much programming is required, either, once the peripherals have been configured.  So rather than sketch a schematic, I’ve just labelled the pins in comments at the top of the source code. An RF bandpass filter (not shown) is highly recommended to eliminate mixer products far away from the desired carrier frequency; these are visible in the FFT plot below.

This minimalist approach is possible because a heterodyne radio transmitter requires at the bare minimum just a local oscillator (LO) and a mixer to multiply the analog input with the carrier frequency. If phase modulation instead of amplitude modulation is to be accomplished, a voltage-controlled-oscillator (VCO) usually first converts the varying-envelope modulating signal into an intermediate-frequency (IF) constant-amplitude waveform.

The 12F1501 and 16F1700-series PIC parts actually possess just enough peripheral hardware even at the lowest pincounts to bring the dream of an inexpensive monolithic hobby transmitter within reach:

  1. a factory-trimmed internal LO built into most modern low-BOM-cost microcontrollers
  2. a serviceable VCO, cobbled together from two other blocks
  3. a mixer approximated by a single core-independent logic gate
  4. a differential antenna power amplifier in the form of a complementary waveform generator (CWG)

An eight-pin PIC12F1501 can be purchased for $0.77 online from, with the pins allocated as follows:

  • 3 pins for power/reset
  • 1 pin for the Vin audio signal
  • 2 differential pins for the antenna
  • 1 pin for optional LO input (to achieve carriers other than 16MHz)

The leftover eighth pin can function as a probing point for exposing either the LO frequency divided by 4, or the single-ended RF output.

The peripherals themselves route together internally to do nearly all the heavy lifting. In fact the only thing the 8-bit digital core is ever doing is copying 10-bit analog-to-digital-converter (ADC) samples into the 16-bit increment parameter field for the numerically controlled oscillator (NCO) as they become available from the conversion.  Together these form the VCO.  The NCO is overflow-based realized by a simple counter and therefore has extremely high phase noise, but it turned out to enable a good enough VCO for this demonstration.

Perhaps most surprising is that an analog mixer to produce an FM signal after the up-conversion can be approximated by a logic gate, if the baseband signal has been pre-modulated. Since all the FM information is contained in the phase of the signals, i.e. the rising and falling edges through the DC midpoint, two digital levels are enough to represent them. Consider the AC waveforms to have passed through a high-gain sgn(x) function so that a logic Low is a -1 and a logic High is a +1. Multiplying two such continuous time waveforms together to perform mixing in the frequency domain is exactly the exclusive not-or (an inverted XOR, the “^” operator in C) function since -1 * +1 = -1 (L ^ H = H ^ L = L) and -1 * -1 = +1 * +1 = +1 (L ^ L = H ^ H = H).

Rather than hook up a square wave generator I connected the handy 1kHz calibration signal output at the front of my scope. Waveform math at a high enough sample rate produced an on-screen FFT spectrum (in red):

An oscilloscope both provides a handy 1kHz square wave output and displays an FFT showing the first three harmonics of the PIC's FM of that square wave by a 16MHz "carrier."

An oscilloscope both provides a handy 1kHz square wave output and displays an FFT showing the first three harmonics of the PIC’s FM of that square wave by a 16MHz “carrier.”

16MHz is the maximum internal LO frequency, and while trimmed enough for an taking a screenshot is not very stable across time and temperature. The next available step downward in frequency is 8MHz.  By #undef’ining the INTCLOCK symbol in the code an external precision time reference (officially 20MHz or less, presumably from a crystal oscillator’s output) can be fed into the mixer, and will incidentally run the rest of the PIC logic as well—including the ADC. So be mindful that Nyquist frequency for sampling any audio also will increase or decrease proportionally relative to the default 16MHz internally generated carrier.

Finally, let’s give it a listen on an Agilent E4402B spectrum analyzer:

1kHz square-wave frequency modulated by the PIC onto its 16MHz clock.  The spectrum analyzer is demodulating it back to a 1kHz tone on a speaker!

1kHz square-wave frequency modulated by the PIC onto its 16MHz clock. The spectrum analyzer is demodulating it back to a 1kHz tone on a speaker!

This is a design exercise only to show what is possible. Standard disclaimers about ensuring compliance with local/regional regulations on spectrum usage apply.  FM deviation (modulation index) may be reduced by scaling each ADC output before copying it into the NCO control registers, but this step will further decrease Nyquist due to the extra processing that adds onto the conversion time.

This post is about a useful feature that assembly coders should lobby ARM to retrofit onto their v6-M instruction set.  It is a staple not only of Microchip’s PIC instruction set, but also ARM’s own later v7-M architecture update: the compare-and-branch-if-zero (or -not-zero). As far as I can tell the omission of cbz and cbnz from ARM v6-M is purely historical, since their 16-bit instruction encoding is a perfect fit for that architecture.  This pair represents, in fact, the only two–byte-long instructions to be found in v7-M but not in v6-M.

Even the 8-bit PIC has it

Before examining cbz and cbnz, some background is due on a well-known microcontroller platform that offers a complementary instruction pair for testing against zero and skipping forward.  The main PIC instruction pair that enables conditional branching is btfss/btfsc: “bit test file, skip if set/clear”.  Frequently one can frame a problem in such a way as to beat the compiler on speed and code size.

As an example, a two-instruction sequence, namely

; (PIC) skipping ahead based on a single-bit value being zero
      btfsc file,b
      goto  next
next: ...

implements a specific but frequent “if” instruction in C:

if (file & (1<<b)) {

Indeed when b is a constant in the range 0 to 7, Microchip’s XC8 compiler produces the btfsc code above.  When this pattern above is applied to the STATUS register and the bit for its Z(ero) flag, the three-instruction sequence

; (PIC) skipping ahead based on a variable's value being zero
      movf  file,f
      btfsc STATUS,Z
      goto  next
next: ...

realizes the ubiquitous test instruction in C:

if (file) {

It is not necessary always to unconditionally branch after the btfss/btfsc test.  If the contents of the C “if” block is a single operation such as

if (file) file2 = file;

that happens to map onto a single PIC instruction, a branch can be avoided and execution speed benefits.  Because the movf file,w to set the Z flag in STATUS copies it to the working register, the conditional instruction can be one that copies from the working register into the destination, namely file2:

; humans compile "if (file) file2 = file;" to 3 PIC instructions
      movf  file,w
      btfss STATUS,Z
      movwf file2
next: ...            ; always reached after 3 cycles

(Without any unconditional branch statements.)  In this example where we just want to set file2 to file if the result will be nonzero, XC8 even in its optimizing “PRO” mode doesn’t compile to the above code but rather branches immediately so that it has a second instruction slot available to generate a redundant movf file,w even though the desired value is already there:

; XC8 compiles "if (file) file2 = file;" to 5 PIC instructions
      movf  file,w
      btfsc STATUS,Z
      goto  next
      movf  file,w
      movwf file2
next: ...            ; reached after either 4 cycles or 5 cycles

In tight loops a 25-40% speed hit like this can be a big deal.  When using a compiler on any platform, one should always look through the assembly code generated in the innermost “for” statements.  In super-critical code I once even exploited the skip-next-instruction feature of btfsc to put another btfsc in that slot:

; hand-written PIC code to speed a tight "while" loop evaluation
      banksel PORTA
loop:                       ; do {
      movf  PORTA,w         ;  w = PORTA;
      btfsc PORTC,RC2       ; } while ((PORTC & (1<<RC2) == 0)||
      btfsc file,vbit       ;          (file & (1<<vbit) != 0));
      goto  loop            ;

No compiler would ever generate this, and I wouldn’t suggest pursuing these kind of structures as a matter of course.  (This was for a software implementation of a two-axis quadrature encoder where reaction time of the PIC directly impacted maximum line density of the zebra wheel.)

The gaping hole in ARM v6-M

For this same reason, I like the ARM v7-M Thumb instructions cbz and cbnz which can do the same “if (register)” test in a single instruction and even preserve the Z flag by incorporating the label (encoded as a forward branch up to 126 instructions) into the statement:

; "if (file) file2 = file;" takes just 3 ARM v7-M instructions
      ldr   r0,file
      cbz   r0,next
      str   r0,file2
next: ...

But even though this code fits in 6 bytes, my 8-pin-DIP LPC810 can’t run it! The instruction set for the Cortex M0 family ARM targeted at the 8-bit microcontroller segment includes all 16-bit-encodable Thumb instructions except cbz/cbnz.  An optimizing Cortex-M compiler will actually generate the following when it is restricted to the v6-M instruction set, and furthermore clobbers the Z flag:

; but "if (file) file2 = file;" requires 4 ARM v6-M instructions
      ldr   r0,file
      cmp   r0,#0
      beq   next
      str   r0,file2
next: ...

With a Cortex M0 target gcc has no choice but to compile a simple zero/nonzero test into the more verbose form above. Clobbering the Z flag furthermore increases code size and execution time. The only reason for cbz/cbnz to be absent is that neither was historically part of the original (pre-2009) Thumb instruction set on which ARM based v6-M.  From an encoding standpoint it fits in the 16-bit instruction format and we are after all trying to save code space on sub-4KiB M0+ parts. Plus, I actually find the cbz format in the first example a tad easier to read through with objdump.

Power to the people

By entering the low-end microcontroller market with Cortex-M ARM Ltd. expanded their user base from just a few hundred consumer electronics manufacturers, to thousands of white-goods makers and millions of hobbyists. Make your voice heard, and at least let them know publicly whenever they miss the boat on an easily incorporated feature like cbz/cbnz!