Archives for category: Uncategorized

I  thought I would put out a quick review of the book I referred to in yesterday’s post.  Here is a shot of the cover of ISBN 978-1466470033, bought on Amazon for $25 in September 2012 (now out of print though and listed as used thereon for the somewhat ridiculous markup of $974.11).


The third edition ISBN 978-1484921906 is out now, and costs under $20.

When the DEC Alpha processor first came out two decades ago as I was studying the MIPS R2000 in school, I naturally assumed I’d learn to program a 64-bit system in assembly someday. At the time I figured it’d be Intel’s rumored Merced architecture that they were developing alongside HP, but the eventual Itanium processor fizzled out right as it reached the server market. Customers just weren’t willing to gamble on the promised IA-64 software becoming available, not in the mature server business. The Alpha had never really taken off either (and eventually got killed by the HP merger after Compaq acquired DEC) so when Advanced Micro Devices quickly came out with their 64-bit architecture soon thereafter it seemed like a winner by default.

The “amd64” (as the Linux kernel refers to it) retained compatibility with the IA-32 instruction set so that users could still run all their old systems unmodified.  Intel soon followed suit with their own implementation of the “x86_64” architecture and while I prefer the AMD documentation set to Intel’s, neither takes a sufficiently tutorial approach for someone not raised on the architecture.

For those looking for a clean 64-bit assembly instruction set unburdened with 32-bit baggage let alone 16-bit, this book spends a minimum of time focusing on the legacy.  Registers may now be referred to by number, but since the architecture still isn’t fully orthogonal the lower-numbered ones are still referred to as rdx, rsi, etc.

Seyfarth’s book delivers on its promises for a very reasonably priced textbook of this quality.  After several chapters and well-contrived exercises, I felt about as comfortable formulating a solution in AMD64 as in ARM or MIPS.  I like the LaTeX look of the chapters as opposed to the typeset feel of a $100 text, and extra resources are available online at

One suggestion I have is that it is somewhat hard to read arbitrary 16-digit hex numbers in the examples.  C++14 allows the insertion of the tick digit separator inside hex (or any) literal constants now, so if there is any yasm syntax to break them up similarly I would suggest it for a possible fourth edition.

A few years ago I picked up on Amazon an excellent 64-bit assembly programming tutorial (on the AMD64 architecture a.k.a. x86_64), which since seems to have gone out of print. Written by a university professor, it captures that exact TeX feel of college handouts during my formative semester of MIPS programming in the early 90’s.

Careful formulation of a problem around an instruction-set architecture (ISA) can just about guarantee beating gcc optimizations, so I knew I wanted to use register modes that don’t nicely map onto C code. The ability to manipulate eight-byte words directly was too tempting to pass up. What could I do that combined this new register width with Intel holdovers from the 1970’s such as support for direct (via registers AH, BH, etc.) access to the lately pointless, second-lowest byte of a quadword?

I decided to count how many quadword-aligned memory locations represent the first of 8 identical bytes. In other words, when fetching aligned quadwords we will see if it consists of a repeating byte pattern (such as 0x2b2b2b2b2b2b2b2b), and if so we increment a counter before proceeding to examine the next 8 bytes.

Furthermore, I decided to repeat this quadword memory test on a large scale, in fact on a larger chunk than would ever be possible in a 32-bit address space. Choosing 4GB of .bss segment is sufficient to prove we’re running on a 64-bit machine. A gcc -O2 C implementation of the check function churns through a 32-bit memory space in a fairly consistent 1.8 seconds on a virtual machine in my i7 laptop:

typedef union {
  unsigned long q;
  unsigned int d[2];
  unsigned short h[4];
  unsigned char b[8];
} flexible_t;

unsigned long check(flexible_t x) {
  return (x.d[0] == x.d[1]) && (x.h[0] == x.h[1]) && (x.b[0] == x.b[1]);

However, gcc can only muster enough insight into what is happening to squeeze this function down to:

mov %rdi,$rdx
xor %eax,%eax
shr $0x20,%rdx
cmp %edx,%edi
je  check3
repz retq
mov %rdi,%rdx
shr $0x10,%rdx
cmp %di,%dx
jne check3
mov %edi,%edx
movzbl %dh,%eax
cmp %dil,%al
sete %al
nopl 0x0(%rax,%rax,1)

So what’s the quickest AMD64 instruction sequence that checks a quadword for byte consistency?  Much shorter than gcc’s effort as it turns out, since the C representation of the function doesn’t give much insight into the most suitable x86 instructions for the task.

My 4GB search takes the same laptop just 1.5s, a full 17% faster since this tight loop has now been further optimized by a few instructions. Ray Seyfarth uses yasm for the course, so the code builds with the line in the top comment:

;;; ../yasm -f elf64 -g dwarf2 -l all8same.lst all8same.asm; gcc -o all8same all8same.o

 segment .bss
array resq 0x80000000 ; two billion quadwords (16GB), just because we can
;;; check() returns a binary value (in rax):
;;; 1 if all eight bytes in the 64-bit input (passed in rdi)
;;; 0 otherwise
 segment .text
check:             ;uint64_t check(uint64_t rdi) {
 xor     esi,esi   ; rsi = 0;
 mov     eax,1     ; rax = 1;
 mov     rdx,rdi   ; rdx = rdi;
 cmp     dl,dh     ; if ((rdx & 0xff) != ((rdx & 0xff00)>>8))
 cmovne  eax,esi   ;  rax = rsi;
 rol     edx,16    ; rdx = ((rdx & 0xffffffffffff0000)>>16)|((rdx & 0x0000ffffffffffff)<<16);
 xor     edx,edi   ; if (((rdx & 0xffffffff) ^= (rdi & 0xffffffff)) == 0)
 cmovnz  eax,esi   ;  rax = rsi;
 mov     rdx,rdi   ; rdx = rdi;
 rol     rdx,32    ; rdx = ((rdx & 0xffffffff00000000)>>32)|((rdx & 0x00000000ffffffff)<<32);
 xor     rdx,rdi   ; if ((rdx ^= rdi) == 0)
 cmovnz  eax,esi   ;  rax = rsi;
 ret               ; return rax;
 global main
 ;; main() entry point
 push rbp
 mov rbp,rsp
 sub rsp,16
 mov r10,array ; long* r10 = array;
 xor r13d,r13d ; int r13d = 0;
 xor rbx,rbx ; for (int ebx = 0;
 cmp ebx,0x7fffffff ; ebx < 0x80000000;
 jg done ; ebx++)
 lea rcx,[r10+rbx*4] ;
 mov rdi,[rcx] ;
 call check ; 
 add r13d,eax ; r13d += check(array[ebx]);
 inc ebx ;
 jmp forloop ;
 ;; main() exit point
 mov eax,r13d ; return eax = r13d;

Further helping the pipeline is the lack of branches/jumps inside of the check() function aside from conditional moves so it executes in constant time and no branch prediction is needed.

It is nothing new anymore to design a lone IC into a small-footprint radio transmitter application, just providing power and a baseband input signal on one pair of pins and a simple antenna on another. Few designers would include stock PIC microcontrollers in this category of ready-to-go, monolithic RF systems. With the new peripheral set of some newer 8-bit offerings and a little creativity, however, the humble PIC can become a self-contained—if un-optimized—analog FM transmitter.  (Passive filters tend to be very useful as well, but ignore these for now.)

In fact no extra hardware beyond a powered PIC12F1501 is needed in order to modulate an input waveform out to a rudimentary antenna. Not much programming is required, either, once the peripherals have been configured.  So rather than sketch a schematic, I’ve just labelled the pins in comments at the top of the source code. An RF bandpass filter (not shown) is highly recommended to eliminate mixer products far away from the desired carrier frequency; these are visible in the FFT plot below.

This minimalist approach is possible because a heterodyne radio transmitter requires at the bare minimum just a local oscillator (LO) and a mixer to multiply the analog input with the carrier frequency. If phase modulation instead of amplitude modulation is to be accomplished, a voltage-controlled-oscillator (VCO) usually first converts the varying-envelope modulating signal into an intermediate-frequency (IF) constant-amplitude waveform.

The 12F1501 and 16F1700-series PIC parts actually possess just enough peripheral hardware even at the lowest pincounts to bring the dream of an inexpensive monolithic hobby transmitter within reach:

  1. a factory-trimmed internal LO built into most modern low-BOM-cost microcontrollers
  2. a serviceable VCO, cobbled together from two other blocks
  3. a mixer approximated by a single core-independent logic gate
  4. a differential antenna power amplifier in the form of a complementary waveform generator (CWG)

An eight-pin PIC12F1501 can be purchased for $0.77 online from, with the pins allocated as follows:

  • 3 pins for power/reset
  • 1 pin for the Vin audio signal
  • 2 differential pins for the antenna
  • 1 pin for optional LO input (to achieve carriers other than 16MHz)

The leftover eighth pin can function as a probing point for exposing either the LO frequency divided by 4, or the single-ended RF output.

The peripherals themselves route together internally to do nearly all the heavy lifting. In fact the only thing the 8-bit digital core is ever doing is copying 10-bit analog-to-digital-converter (ADC) samples into the 16-bit increment parameter field for the numerically controlled oscillator (NCO) as they become available from the conversion.  Together these form the VCO.  The NCO is overflow-based realized by a simple counter and therefore has extremely high phase noise, but it turned out to enable a good enough VCO for this demonstration.

Perhaps most surprising is that an analog mixer to produce an FM signal after the up-conversion can be approximated by a logic gate, if the baseband signal has been pre-modulated. Since all the FM information is contained in the phase of the signals, i.e. the rising and falling edges through the DC midpoint, two digital levels are enough to represent them. Consider the AC waveforms to have passed through a high-gain sgn(x) function so that a logic Low is a -1 and a logic High is a +1. Multiplying two such continuous time waveforms together to perform mixing in the frequency domain is exactly the exclusive not-or (an inverted XOR, the “^” operator in C) function since -1 * +1 = -1 (L ^ H = H ^ L = L) and -1 * -1 = +1 * +1 = +1 (L ^ L = H ^ H = H).

Rather than hook up a square wave generator I connected the handy 1kHz calibration signal output at the front of my scope. Waveform math at a high enough sample rate produced an on-screen FFT spectrum (in red):

An oscilloscope both provides a handy 1kHz square wave output and displays an FFT showing the first three harmonics of the PIC's FM of that square wave by a 16MHz "carrier."

An oscilloscope both provides a handy 1kHz square wave output and displays an FFT showing the first three harmonics of the PIC’s FM of that square wave by a 16MHz “carrier.”

16MHz is the maximum internal LO frequency, and while trimmed enough for an taking a screenshot is not very stable across time and temperature. The next available step downward in frequency is 8MHz.  By #undef’ining the INTCLOCK symbol in the code an external precision time reference (officially 20MHz or less, presumably from a crystal oscillator’s output) can be fed into the mixer, and will incidentally run the rest of the PIC logic as well—including the ADC. So be mindful that Nyquist frequency for sampling any audio also will increase or decrease proportionally relative to the default 16MHz internally generated carrier.

Finally, let’s give it a listen on an Agilent E4402B spectrum analyzer:

1kHz square-wave frequency modulated by the PIC onto its 16MHz clock.  The spectrum analyzer is demodulating it back to a 1kHz tone on a speaker!

1kHz square-wave frequency modulated by the PIC onto its 16MHz clock. The spectrum analyzer is demodulating it back to a 1kHz tone on a speaker!

This is a design exercise only to show what is possible. Standard disclaimers about ensuring compliance with local/regional regulations on spectrum usage apply.  FM deviation (modulation index) may be reduced by scaling each ADC output before copying it into the NCO control registers, but this step will further decrease Nyquist due to the extra processing that adds onto the conversion time.

This post is about a useful feature that assembly coders should lobby ARM to retrofit onto their v6-M instruction set.  It is a staple not only of Microchip’s PIC instruction set, but also ARM’s own later v7-M architecture update: the compare-and-branch-if-zero (or -not-zero). As far as I can tell the omission of cbz and cbnz from ARM v6-M is purely historical, since their 16-bit instruction encoding is a perfect fit for that architecture.  This pair represents, in fact, the only two–byte-long instructions to be found in v7-M but not in v6-M.

Even the 8-bit PIC has it

Before examining cbz and cbnz, some background is due on a well-known microcontroller platform that offers a complementary instruction pair for testing against zero and skipping forward.  The main PIC instruction pair that enables conditional branching is btfss/btfsc: “bit test file, skip if set/clear”.  Frequently one can frame a problem in such a way as to beat the compiler on speed and code size.

As an example, a two-instruction sequence, namely

; (PIC) skipping ahead based on a single-bit value being zero
      btfsc file,b
      goto  next
next: ...

implements a specific but frequent “if” instruction in C:

if (file & (1<<b)) {

Indeed when b is a constant in the range 0 to 7, Microchip’s XC8 compiler produces the btfsc code above.  When this pattern above is applied to the STATUS register and the bit for its Z(ero) flag, the three-instruction sequence

; (PIC) skipping ahead based on a variable's value being zero
      movf  file,f
      btfsc STATUS,Z
      goto  next
next: ...

realizes the ubiquitous test instruction in C:

if (file) {

It is not necessary always to unconditionally branch after the btfss/btfsc test.  If the contents of the C “if” block is a single operation such as

if (file) file2 = file;

that happens to map onto a single PIC instruction, a branch can be avoided and execution speed benefits.  Because the movf file,w to set the Z flag in STATUS copies it to the working register, the conditional instruction can be one that copies from the working register into the destination, namely file2:

; humans compile "if (file) file2 = file;" to 3 PIC instructions
      movf  file,w
      btfss STATUS,Z
      movwf file2
next: ...            ; always reached after 3 cycles

(Without any unconditional branch statements.)  In this example where we just want to set file2 to file if the result will be nonzero, XC8 even in its optimizing “PRO” mode doesn’t compile to the above code but rather branches immediately so that it has a second instruction slot available to generate a redundant movf file,w even though the desired value is already there:

; XC8 compiles "if (file) file2 = file;" to 5 PIC instructions
      movf  file,w
      btfsc STATUS,Z
      goto  next
      movf  file,w
      movwf file2
next: ...            ; reached after either 4 cycles or 5 cycles

In tight loops a 25-40% speed hit like this can be a big deal.  When using a compiler on any platform, one should always look through the assembly code generated in the innermost “for” statements.  In super-critical code I once even exploited the skip-next-instruction feature of btfsc to put another btfsc in that slot:

; hand-written PIC code to speed a tight "while" loop evaluation
      banksel PORTA
loop:                       ; do {
      movf  PORTA,w         ;  w = PORTA;
      btfsc PORTC,RC2       ; } while ((PORTC & (1<<RC2) == 0)||
      btfsc file,vbit       ;          (file & (1<<vbit) != 0));
      goto  loop            ;

No compiler would ever generate this, and I wouldn’t suggest pursuing these kind of structures as a matter of course.  (This was for a software implementation of a two-axis quadrature encoder where reaction time of the PIC directly impacted maximum line density of the zebra wheel.)

The gaping hole in ARM v6-M

For this same reason, I like the ARM v7-M Thumb instructions cbz and cbnz which can do the same “if (register)” test in a single instruction and even preserve the Z flag by incorporating the label (encoded as a forward branch up to 126 instructions) into the statement:

; "if (file) file2 = file;" takes just 3 ARM v7-M instructions
      ldr   r0,file
      cbz   r0,next
      str   r0,file2
next: ...

But even though this code fits in 6 bytes, my 8-pin-DIP LPC810 can’t run it! The instruction set for the Cortex M0 family ARM targeted at the 8-bit microcontroller segment includes all 16-bit-encodable Thumb instructions except cbz/cbnz.  An optimizing Cortex-M compiler will actually generate the following when it is restricted to the v6-M instruction set, and furthermore clobbers the Z flag:

; but "if (file) file2 = file;" requires 4 ARM v6-M instructions
      ldr   r0,file
      cmp   r0,#0
      beq   next
      str   r0,file2
next: ...

With a Cortex M0 target gcc has no choice but to compile a simple zero/nonzero test into the more verbose form above. Clobbering the Z flag furthermore increases code size and execution time. The only reason for cbz/cbnz to be absent is that neither was historically part of the original (pre-2009) Thumb instruction set on which ARM based v6-M.  From an encoding standpoint it fits in the 16-bit instruction format and we are after all trying to save code space on sub-4KiB M0+ parts. Plus, I actually find the cbz format in the first example a tad easier to read through with objdump.

Power to the people

By entering the low-end microcontroller market with Cortex-M ARM Ltd. expanded their user base from just a few hundred consumer electronics manufacturers, to thousands of white-goods makers and millions of hobbyists. Make your voice heard, and at least let them know publicly whenever they miss the boat on an easily incorporated feature like cbz/cbnz!

My employer sent me to an ARM assembly language class once. I learned in the course introduction what made ARM special compared to the other high-performance architectures of the day: conditional instructions, multiple-register load/store instructions, current processor status register (CPSR) and the coprocessor hardware model….to name a few.

This general set of guidelines held true through version 7 of the instruction set architecture (ISA). ARM v8 is published, and chips are already in production. It consists of both a new “Aarch64” definition, and an “Aarch32” update that retains instructions which would still be recognizable from an assembly listing. But is 64-bit ARM “really” ARM as we know it? Their own data brief claims that it is a rationalization of the instruction set.

“Rationalization” in this case may just mean the Acorn folks have retired, and this is the new crop of ISA designers who were getting snot wiped off their noses for them in 1980. As you would expect, based on updated statistics of compiled code performance and thirty more years of published papers they are tweaking the design as they see fit.

Here’s a program I just wrote and assembled for an 8-bit PIC device, which assembles under MPASM (and gpasm) using  the instruction set of the 6502, rather than PIC native code:

        processor 16f1508

        LDA# 0x5a
        LDX# 0xef
store5a STA 0x2000,X
        BNE store5a 
        STA 0x2000,X
        LDA# 0xa5
        BIT 0x2000
        ROL A,
        BIT 0x2000
        BEQ store5a

This example works as intended, setting the 240 bytes of PIC linear memory to all Z’s in ASCII and then performing a couple of ultimately pointless bit tests and stack operations.  Besides the inclusion of a macro definition file to enable this translation a few deviations from standard 6502 syntax give it away: superfluous argument placeholders such as the comma after the ROtate Left Accumulator instruction (a second argument is necessary in case the first one were to have be indexed with X), as well as the strange absorption of the immediate-mode hash signs into the opcode itself.

Assembly-language translation for PIC processors has been done before, using the familiar shorthand of another processor’s instruction set but usually with more of an eye on efficiency and readability than full duplication.  The program snippet above, which would require just 27 bytes of program storage on a true 6502 machine, expands into 247 PIC 14-bit words, for a total of over 400 bytes or about 5% of this device’s flash memory.  Not surprisingly, the macros have to be defined fairly pessimistically and can’t attempt any sort of optimization after the fact on their own.  (Although an optimizing assembler could well do so.)  Mimicking the 6502 on a RAM-banking/ROM-paging PIC device with only one working register and a barely accessible return stack takes a lot of work behind the scenes to accomplish the following in particular:

  • the X and Y index registers (to which purpose I assign FSR0L and FSR1L just in case a pointer access is relative to page boundary and we fortuitously just  need to copy that page into FSR0H or FSR1H)
  • single-opcode stack push/pull operations
  • indexed memory accesses
  • result-less compare and bit test operations

Nevertheless, several 6502 opcodes do map onto a single PIC instruction and don’t require any sort of temporary storage, atomicity or memory protection. These are more along the lines of the assembly-language translations usually attempted:

  • LDA# (becomes: movlw)
  • DEX (becomes: decf FSR0L,f)
  • CLC (becomes: bcf STATUS,C)
  • ROL A (becomes: rlf WREG)

I have not yet tested the full set of 6502 macro definitions extensively, so there are probably several bugs still lurking.  Be sure to disable case sensitivity (MPLAB’s radio button, or gpasm -i) if using uppercase for the 6502 invocations, as I have in order to distinguish the two instruction sets above.

So what features of the 6502 are supported?  Basically, everything except the Binary Coded Decimal (BCD) mode and the oVerflow flag which are the only parts of 6502 that are completely alien to the PIC architecture.  Several opcodes do require special handling or otherwise have a different syntax than on a real 6502 machine:

  1. The Negative flag is supported by means of referring to bit 7 of a particular file register in an additional operand; result-less compares and bit tests require the intermediate results to be stored in a PIC file register.
  2. The immediate mode requires that the hash sign (#) indicating a literal value be placed on the opcode itself, since it is a totally different macro than for a memory access
  3. Indexed indirect (using X) and indirect indexed (using Y) addressing modes are supported, but in this case the question mark (?) character  is required in order to highlight the difference to the assembler after its preprocessor strips off the all-important parentheses.  The parenthesis can still be used in source code, in order to keep some semblance to the 6502 modes:
    • LDA? (0x2000),(X) in lieu of LDA (0x2000,X)…a rarely used 6502 mode anyway!
    • LDA? (0x2000),Y in lieu of LDA (0x2000),Y

Interrupts should work properly, whether using 6502 macros within the ISR or the interrupted code.   Seven of the PIC file registers (0x79 through 0x7f) get used for storing temporary values during the course of a macro.  If not using 6502 macros in interrupt routines, these locations can also be used for temporary storage by PIC code until the next 6502 macro:

  • 0x79 holds the state of the STATUS register in bits 4-0, as well as the GIE bit of INTCON in bit 7
  • 0x7a holds the value of the WREG/Accumulator
  • 0x7b holds the value of the Bank Select Register
  • 0x7c holds the value of the FSR0L/X register
  • 0x7d holds the value of FSR0H
  • 0x7e holds the value of the FSR1L/Y register
  • 0x7f holds the value of FSR1H

One reason for all the extra generated code is that for any instruction that accesses intermediate results in memory or the STATUS register, the GIE bit of INTCON is turned off (SEI in 6502 parlance).  As with previous posts, this is not an efficiency exercise.  I wanted to explore how closely these two 8-bit favorites really could be made to resemble each other.  The code generated will be far from optimized, but the fact that it can be done at all within MPASM’s relatively limited macro expansion capability is kind of neat.

Here’s a head-scratcher: 8-bit PIC processors have a reset vector (where code execution starts, i.e. the value of the PC at reset) of 0x000. The interrupt vector is just four instructions higher, 0x004. What can a PIC coming out of reset accomplish in just a few instructions? A fair amount, it turns out.

Most general-purpose demonstration code doesn’t try to fit any meaningful action into such tight quarters. Since any useful program is bound to be more than four instructions long and have interrupts enabled, one of these four is invariably dedicated to a “goto” instruction to skip past the interrupt service routine, or ISR. Microchip’s XC8 (formerly Hi-Tech’s PICC) compiler doesn’t bother trying to fill the other three with any useful code at all: it merely rewrites the page register PCLATH explicitly with its reset value of zero, increments the PC via an equally meaningless “goto”–instead of a 100% faster “nop”–and then repeats the process with a real page location and a real “goto.”

That’s right, disassembled XC8 output begins with the following overly cautious code, in this case for the trivial “main() {}” program:

0x000: movlp   0
0x001: goto    0x002
0x002: movlp   7
0x003: goto    0x7fc

So at least this compiler’s output doesn’t give much insight into useful purposes for the first four memory locations, prior to the obligatory “goto” no later than 0x003. The first two instructions become the equivalent of a 3-cycle “nop” instruction. The next two instructions explicitly jump to code located at the top of memory. So basically a six-cycle “goto” over the ISR.

Assuming one page selection before the “goto” really could become necessary (if the ISR were to grow beyond one 2kword page–enough for 16 different interrupt sources with 256-word handlers each!), memory at 0x002 and 0x003 could always hold handy constants readily visible from a hex dump. In my own code I’ll sometimes replace any “nop” instructions at 0x002 and 0x003 with “movlw” opcodes that are my code version, values of my “#define” constants or some other at-a-glance indicator:

0x000: movlp   0
0x001: goto    start
0x002: movlw   CODE_VERSION
0x003: movlw   (REV_MONTH<<4)|REV_DAY

But clearly the architects of PIC could have made the reset vector be 0x002, thereby forcing 0x000 and 0x001 to be spent on a jump in all cases. They clearly left the door open for some useful work to get done before burning two instruction cycles (still at least 400ns on a modern 20MHz part) on a “goto” to get to the main code section.

Think quick!

I recently had occasion to make full use of the three instructions in the bottom of PIC instruction memory before the jump over my ISR. As part of a power management application I wanted to pull down on a pin if code happened to start executing from a cold (power-on reset) boot and that pin was low anyway.  This pin also drove a common cathode net for a bunch of Schottky diodes whose anodes were on other PIC pins, in order to clamp any signal back to the main microprocessor to within 0.2V of ground.

Likewise, after reading that pin’s state as an input and thus ascertaining the nature of the reset (cold POR versus warm) I changed it to an output and charged a capacitor on it up to the supply rail to allow the anode pins to toggle as well as to retain this state bit for any subsequent resets.  This seemingly complex decision can be made to fit in three instructions, due to several fortuitous behaviors:

  1. the BSR register defaults to zero at reset and thus PORTA/PORTB/PORTC can be read immediately
  2. “movf PORTA,f” has the effect of copying the current (input) state of those RA pins into their (output) latch register LATA on the first instruction cycle
  3. accessing the BSR then subsequently changing TRISA/TRISB/TRISC (to expose certain LATA/LATB/LATC bit states) takes the remaining two instructions before the “goto”

There are of course registers that can be read to ascertain the nature of a reset (STATUS and PCON) but there is too much bank selection and masking required for each register to make the decision within three instructions, let alone act on it.  Time is somewhat of the essence in my circuit since not only could the capacitor lose its charge due to leakage, but I also put a high-value discharge resistor to ensure that the next true cold boot would be correctly recognized.  Getting everything done before incurring a branch penalty cuts the execution time almost in half!

The following snippet of generic code senses the level of the RA0 pin and then keeps it driven to that state, thereby making an external capacitor-based scheme quite effective for quickly detecting and preserving information about a warm versus cold boot:

0x000:  movf    PORTA,f ; void main(void) { LATA = PORTA;
0x001:  movlb   2       ; 
0x002:  bcf     TRISA,0 ; TRISA &= 0xfe; //RA0 locked to initial value }

All eight pins RA0 through RA7 could even be nailed down with “clrf TRISA”.  Try getting the C code I’ve provided in the comments to compile down as tightly!