I wish this collection of PDF tutorials had been around when I started working with the 8-bit PIC: David Meiklejohn’s 2012 series “Introduction to PIC Programming: Mid-Range Architecture and Assembly Language.”

Contrary to the title, they also make a good refresher (and source of sample code) for experts who just don’t take on a microcontroller design every single year or at every single employer. The treatment of the material and the quality of the writing are among the best. This is not snarky pablum pasted from forum posts but rather the output of a technically skilled author, for about half the price of a Newnes book on PIC that may not prove as useful. Find it online at the Gooligum website.

Forgot how many instruction cycles pass after a write to Timer0 before it resumes incrementing?

Forgot how many “nop” instructions to put near a “sleep” command, where, and why?

Forgot how to use a watchdog timer to periodically wake from sleep?

Although the answers can all be gleaned from the PIC datasheets, I find value in Meiklejohn’s more narrative and task-focused method of presentation.

One last note: It seems there is a new business model in place at the site, whereby only the introductory modules are free of charge. So a discussion of Timer0 comes for free, but the rather different Timer1 and Timer2 will cost you. In case you were hesitating, consider gaining access to the latter money well spent.

One of the first things that many programmers who’ve dabbled in Forth do upon learning Tcl, or vice versa in my case, is to try to implement lexical features of one inside of the other.  In mid-2011 I felt the urge to learn how to do some programming in a low-overhead version of Forth, more than a decade after I seriously got into the Tcl scripting language for similar reasons of wanting something flexible but floppy-bootable.
This trend surely stems from the fact that neither language imposes many rules of syntax beyond streams of whitespace-separated words.

Forth: cross-platform compiled code (careful, can come cryptic!)

Forth is actually quite venerable among compiled languages, having come on the scene in the 1970s after a long incubation in the service of Charles Moore’s individual programming efforts since 1958.  Natively supporting only 16-bit integers on an internal calculation stack that supplements the more universal “return” stack, many Forth dictionary versions also include support for floating point values–as well as “double” integers–by special processing of multiple adjacent stack locations.  Forth only really does one of two things once it’s up and running as an interpreter/compiler:

  1. accept numeric constants onto the stack, or
  2. try to jump to a subroutine that corresponds to a sequence of non-numeric input characters taken to be a “word” of code

This behavior results in the same postfix syntax as an RPN calculator, and when reading through any Forth code it is essential to go slowly enough to be able to visualize the topmost stack contents at any given point. Values are continually sent to and consumed from the stack, so adding two numbers and printing the result is as straightforward as the ubiquitous Forth example ( 2 2 + . ) in which the addition function ( + ) replaces two inputs on the stack with its output ( 4 ) which is then in turn consumed by the stack-printing function ( . ).
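The two-rule loop described above can be modeled in a few lines of Python. This is only a sketch under stated assumptions: the three-word dictionary and the name forth_eval are invented for illustration, and a real Forth searches its dictionary before attempting number conversion rather than the other way around.

```python
def forth_eval(line, stack=None):
    """Evaluate a line of whitespace-separated words, Forth style."""
    stack = [] if stack is None else stack
    output = []
    # Hypothetical three-word dictionary; a real Forth has hundreds of words.
    words = {
        "+": lambda s: s.append(s.pop() + s.pop()),
        "*": lambda s: s.append(s.pop() * s.pop()),
        ".": lambda s: output.append(str(s.pop())),
    }
    for word in line.split():
        try:
            stack.append(int(word))   # rule 1: numeric constants go on the stack
        except ValueError:
            words[word](stack)        # rule 2: anything else is looked up and run
    return stack, output

stack, output = forth_eval("2 2 + .")
print(" ".join(output))  # prints 4
```

After the line runs the stack is empty again, because ( . ) consumed the sum that ( + ) left behind.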

Tcl: omitting both vowels from the name, not just one

A Tcl interpreter similarly just plows through streams of characters, but goes a step further by not assigning particular significance to any elements that other languages might try to parse as expressions, including numerical constants!  You can write a procedure called “2”, name a variable “+”, or even have a procedure called “2+2” coexist with a variable by the same name.  The variable value would be accessed Bourne-shell style by preceding it with a dollar sign ($), whereas the procedure call would be the first word on a line or immediately following a function-calling left bracket ([) so there is never any ambiguity.

Tcl’s native format is the parameter list rather than an implied stack, but both being at their core a sequence of numbers there is an obvious natural correspondence between the two. If you actually want to perform a mathematical calculation in Tcl you put it in a list passed to a function that interprets its argument list as an expression, so that

expr 2 + 2

evaluates to “4”.  Once you think about it, this lack of any imposed interpretation on unquoted strings makes sense in today’s computing environment, now that we’ve moved away from just FORTRAN-like calculations and now trade all sorts of text over the Web that are encoded according to various contexts.

A brief aside: a procedure called left parenthesis

By the way, in both languages you can define most ASCII punctuation marks including a parenthesis to be procedure calls, so that whereas the Tcl definition

proc ( {arg args} {expr ( $arg $args}

would seem to contain unbalanced parentheses, it is actually defining “(” as a wrapper procedure. Like any other Tcl procedure it can take zero or more mandatory arguments (one in this case, arbitrarily called “arg”) followed by any number of optional arguments (always called “args”).  Now there are two equivalent forms of the simple arithmetic script above, with the first one taking just slightly longer to execute as it gets translated into the second:

  1. ( 2 + 2 )
  2. expr 2 + 2

but only if carefully minding the spaces since they play the significant role of list separators. The same syntax could be accomplished in Forth, also by creating a left-parenthesis function that reads ahead in the input stream.  At the end of this post, the left parenthesis will be redefined in its Forth context of beginning a comment string, akin to C’s /*.

Finding application in embedded systems

For this kind of elegant extensibility while still maintaining robustness, compact versions of Tcl–just like Forth before it–are now found running the debug shell on a lot of embedded devices.  It gives developers a convenient way to query the state of a product, call underlying C functions in the base firmware, define manufacturing test routines etc., all without having to develop the full overhead of an interpreter themselves.  This is exactly the pretext under which Tcl (“tool command language” but pronounced “tickle”) was originally developed: it’s better to have professional language designers doing the shell writing and init-file processing than those whose primary focus is the overall application program, and for whom constructing command syntaxes and formatting init files are an often ill-executed distraction from the larger application they’re trying to create.

Forth is a very handy small-footprint language, and probably the only integrated development/operating environment presenting an immediate, text-mode interface relatively unchanged since its brief popularity on 8-bit systems of the 1980s.  One of the complaints about Forth is that there are so many different “standard” dictionaries that different Forth dialects often use the same word to mean slightly different things, taking different parameters or returning differently formatted results.  And while Forth’s low overhead becomes less relevant in newly affordable 32-bit microcontroller environments, there are still exceptional recent versions of the language such as FlashForth that cater to 8- and 16-bit microcontrollers.  Studying Mikael Nordman’s creation is a joy, whether programming in it or just reading through his source code, and it actually turns even the simplest PIC demo boards into instant on-board development platforms for even rather complex programs.

Tcl in Forth

Returning to the theme of replicating Forth or Tcl syntax inside of the other, I briefly took a stab at implementing Tcl’s shell-style (preceded by $) variable substitution in new words “set” and “puts” for Forth as well as supporting bracket-bounded function calls. So after much trial and error I was able to replace typically jumbled Forth statements of the form

variable x 3 x !
variable y x @ y !

with the somewhat more intuitive code (including implicit declaration of variables) borrowed from Tcl:

set x 3
set y $x

or its single-line equivalent,

set y [set x 3]

But this exercise proved not very interesting after I had tested a few such constructs, and again the inconsistent dictionaries among Blazin’ Forth on the Commodore, FlashForth on my Microstick II and gforth on my Linux box served as a reminder that when it comes to portability, Forth is no C.

Fortcl: Forth in Tcl

Lots of people have implemented Forth in Tcl.  I decided to take a crack at it for the principal reason that most implementations I was running across seemed to use global variables, whereas I prefer to stick with a select few list arguments to any function in order to enable better execution tracing and reuse.  I use a global variable in Tcl only to access what would be a global variable in Forth; none are used for the mechanics of the Forth interpreter itself.

Since Ficl (pronounced “fickle”) comes up on a web search as the name of an extant but seemingly unmaintained SourceForge project to develop a command language similar to Tcl in Forth, I’ve called my second exercise for this post Fortcl (pronounced “forticle,” as in a “fortified follicle”).  As with the other Forth-inside-Tcl implementations at the previous link, defining a word defines a Tcl proc of the same name.  Mine just accepts three (possibly null) arguments as input and similarly returns three (possibly null) outputs: a calculation stack called “stack,” a return stack called “rstack” and a list of words in “args” left to process on the current line.  A proc forth2 keeps the process going from one word to the next on a line; it is wrapped in a proc forth that just passes in null stacks and everything typed after it and returns only the stack (thus not returning the “return stack,” apologies for the roundabout nomenclature).

So besides replacing global variables with cascading Tcl lists, what can I say is special about Fortcl?  As in any implementation of Forth, colon definitions (basically a procedure defined by a sequence of characters between a lone colon and a lone semicolon) require a word name followed by one or more list elements, but the hierarchical list support in Tcl means that each list element itself can be a list.  I utilize this fact to allow the definition of a word in terms of either other Fortcl words or (if the first argument is a list rather than a word) in Tcl itself.  So either of the following is an acceptable way of defining the increment ( 1+ ) word once the addition ( + ) word has been defined:

: 1+ 1 + ; # defined in terms of another Forth word ( + )


: 1+ {list [concat [lrange $stack 0 end-1] [expr 1 + [lindex $stack end]]] $rstack $args} ; # defined as a Tcl script since the first (and only) word in the definition list is a list

Note that hash ( # ) after a semicolon starts a comment in Tcl and therefore in Fortcl.  The former definition is faster to type, the latter faster to execute.
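To make the three-lists-in, three-lists-out convention concrete, here is a rough Python analogue. It is not the Fortcl source: the names plus, define, forth2 and forth are borrowed or invented for illustration, and the dictionary is passed explicitly as an assumption about how one might avoid globals.

```python
def plus(stack, rstack, args):
    """( a b -- a+b ) written natively, like the Tcl-script form of a word."""
    return stack[:-2] + [stack[-2] + stack[-1]], rstack, args

def define(dictionary, name, body):
    """Colon definition in terms of other words: body is a list of words."""
    def word(stack, rstack, args):
        # Run the body to completion, then hand back the caller's remaining args.
        stack, rstack, _ = forth2(dictionary, stack, rstack, list(body))
        return stack, rstack, args
    dictionary[name] = word

def forth2(dictionary, stack, rstack, args):
    """Consume words from args one at a time until none are left."""
    while args:
        word, args = args[0], args[1:]
        if word in dictionary:
            stack, rstack, args = dictionary[word](stack, rstack, args)
        else:
            stack = stack + [int(word)]
    return stack, rstack, args

def forth(dictionary, line):
    """Start with empty stacks and return only the calculation stack."""
    return forth2(dictionary, [], [], line.split())[0]

words = {"+": plus}
define(words, "1+", ["1", "+"])
print(forth(words, "3 1+"))  # prints [4]
```

Because every word receives and returns the same three values, any single word can be traced or called in isolation, which is the property the list-passing design is after.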

The left-parenthesis according to Forth

Another advantage of the latter definition of “1+” is that it supports the keyword “immediate” before the semicolon, which forces future definitions that use the word to execute its Tcl script during the definition (compile-time) rather than copying it into themselves for later execution (run-time).  It is good to do this for parenthesis-delimited Forth comments meant for human consumption only, so that they are stripped from any definitions and don’t slow down execution of the word:

: ( {concat [list $stack] [list $rstack] [list [lrange $args 0 [lsearch -exact $args )]-1]]} immediate ;

(_imm is now a defined word, and so as long as the “immediate” keyword is present comments will be removed from any definition.  There is no conflict with any other, non-immediate left-parenthesis proc that may exist, such as the wrapper for Tcl’s built-in proc expr at the top of this post.

Executing Forth words from Tcl

Regardless of which colon-definition mode (Forth or native, non-immediate Tcl) that 1+ was implemented in, execution similarly can take place on a line of Forth after the Tcl interpreter has recognized a call to proc forth:

forth var x 3 x ! x @ 1+ x ! x @
--> 4

or a line of Tcl that doesn’t even call proc forth for stack management, instead using lindex to strip off the two null lists (return stack and args remaining) that follow the answer in the first list (i.e. the resulting stack after the increment):

set x 3 ; set x [lindex [1+ $x] 0]
--> 4

At last, the code

After source-ing the base Fortcl to pick up the forth wrapper and more importantly the definition word ( : ), I have placed all words I’ve implemented to date in a separate, much longer source-able file consisting of only colon definitions that call it. An important note on this latter file is that there are a few ubiquitous Forth words that clash with Tcl keywords and thus had to be renamed in the forth definition:

  1. if (renamed to iff as in “if-forth” or “if and only if,” so use iff…else…then rather than if…else…then)
  2. variable (renamed to var)

Closing thoughts

Tcl is a fond post-adolescent memory for me, by the way.  John Ousterhout, its creator, was the instructor for the first computer science course I took at Berkeley, right around the time that Tcl was taking off. By day he was assigning me C and MIPS assembly homework, but outside of class he was rolling out this flexible interpreter to the world, probably still supporting Magic to some extent and somehow also raising a newborn daughter who as of this new year would be on the verge of turning 22 herself.

Thanks, ouster!



You know we’ve come a long way when it is more expedient to explain to children that a byte is a quantity corresponding to roughly a billionth of another unit far more fundamental to them in their daily lives: the gigabyte. A hundred bytes is an atomic-level scale to the newest generation.  In the spirit of Robby Boey’s Commodore 64 sprite art with his daughters, I wanted to show mine that a small, quickly written program could be quite usable in solving her favorite word game from the daily newspaper.

A cryptoquip is a letter-substitution cipher (a cryptogram) that always reveals some witty observation or pun once every letter has been substituted for another, with never any mapping onto the same letter and not necessarily any reciprocal mappings. These constraints, together with context clues and the somewhat limited set of two- and three-letter words, allow the enthusiast to gradually crack the substitution code, which changes from one cryptoquip to another.  A hint is often given to get started, for example “S equals E.”
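The puzzle rules can be written down as a small check. The following Python sketch assumes a key stored as a cipher-to-plain dictionary; the function names and the three-letter sample key are made up for illustration.

```python
import string

def is_valid_key(key):
    """Check a cipher->plain mapping against the cryptoquip rules."""
    no_fixed_points = all(c != p for c, p in key.items())   # no letter maps to itself
    one_to_one = len(set(key.values())) == len(key)         # no two letters share a target
    return no_fixed_points and one_to_one

def decode(cipher_text, key, unknown="_"):
    """Apply a partial key; unmapped letters show as a placeholder."""
    out = []
    for ch in cipher_text:
        if ch in string.ascii_uppercase:
            out.append(key.get(ch, unknown))
        else:
            out.append(ch)   # punctuation, digits and spaces pass through
    return "".join(out)

key = {"S": "E", "B": "H", "X": "T"}   # made-up partial key, hint "S equals E"
print(decode("XBS", key))               # prints THE
```

Note that nothing here requires the mapping to be reciprocal: S decoding to E says nothing about what E decodes to.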

Besides the newspaper, it is possible to find solvable cryptograms on the web.  And whether as an algorithms demonstration or targeted at the impatient puzzler, there exist various automated solver programs on the web that utilize a standard (or selectable) language dictionary to remove all subtle insights from the path toward the solution.

Of course my homebrew version is in assembly language, both to get the size down (the hex codes for the loop itself fit on one 40×25 screen) and to illustrate that a  carefully programmed text-based 6502 platform can be at least as responsive as a comparable app running on a smartphone taking up 100,000 times the memory footprint.  In my interface, any character typed on the keyboard appears in the upper left corner of the screen.  The first character pressed shows up there in reverse video, with all instances of it in the cipher (as determined by a nonprinting character in the row below them) also highlighted in reverse.  Here is a screenshot from a puzzle with only one clue, the letter B, remaining:


The answer is obviously that the letter B in this cipher maps onto the letter H in the solution.  Pressing a second key when letters are highlighted equates all the instances to that solution letter, by way of printing it in normal video in the row above.  As the final step the comma is mapped onto itself, since punctuation and digits do not typically get mapped as part of the puzzle:

[Image: puzzle2]

Below is loader code that runs on the Commodore 16 series.  It uses the TED-series trick of allocating and then de-allocating a graphics screen to open up 12K of memory underneath BASIC without interrupting the program being run:

[Image: crypto1-20]

To achieve the compact size of 146 bytes, it is necessary to input the puzzle on a blank screen using the Commodore screen editor or as part of the BASIC loader:

[Image: crypto30-230]

Substitution continues in a loop even after all cipher characters have been mapped, since no such check is performed.  Here is the assembly code in crasm’s format.  With limited tweaks it should run under any Commodore BASIC, since it makes use of the standard Kernal routines for screen I/O and the character codes that use bit 7 as a reverse-video flag.  The equivalent C code appears in the comment at the end of the lines:

cpu	6502
page	0,132

	;; start of character codes (screen memory directly manipulated)
	TEDSCR = $0c00		; const TEDSCR = 0x0c00;
	TEDPAG = $0c		; const TEDPAG = TEDSCR >> 8; /* high byte (page) */

	;; space character (both ROM table index and ASCII)
	spc = $20		; const char spc = ' ';

	;; KERNAL jump vectors
	GETIN = $ffe4		; extern char GETIN(void);
	CHOUT = $ffd2		; extern void CHOUT(char);
	PLOT = $fff0		; extern void PLOT(boolean, short&, short&);

	;; 2-byte pointer to "answer" letter and "crypto" letter below it
	leta = $da		; char* leta;
	letc = $dc		; char* letc;

	code			; char cryp(short, short, short);

				; void shll(void) {
shll	lda	#$80		;  unsigned short stack = 0x80;
	pha			;  do {
shl1	clc			;   char a;
	ldx	#$00		;   short x = 0;
	ldy	#$00		;   short y = 0;
	jsr	PLOT		;   PLOT(0,x,y);
shl2	jsr	GETIN		;   do a = GETIN();
	beq	shl2		;   while (a == 0);
	jsr	CHOUT		;   CHOUT(a);
	pla			;
	and	#$80		;   stack &= 0x80; /* get char at (x,y) */
	eor	#$80		;   stack ^= 0x80; /* flip state bit */
	ldx	#$00		;
	ldy	#$00		;
	jsr	cryp		;
	pha			;   stack = cryp(stack, x, y);
	and	#$7f		;
	bne	shl1		;  } while (stack & 0x7f);
	pla			;
	rts			;  return;
				; }

;;; user must pass in pointer (X row>=0, Y col>=0) to set search start location,
;;; with character passed in A, whose MSB indicates whether this character
;;; is for highlighting "crypto" letters (clear) or substituting them (set)

				; short cryp(short a, short x, short y) {
cryp	sta	letc+1		;

	sty	leta		;
	ldy	#$28		;
	lda	#TEDPAG		;
	sta	leta+1		;

	;; add 1 row for each count of x
	clc			; 
xrow	tya			;
	adc	leta		;
	dex			;
	bmi	xro2		;
	bcc	xro1		;
	inc	leta+1		;
xro1	sta	leta		;
	clc			;
	bcc	xrow		;  leta = TEDSCR + x*40 + y;

	;; A and C conveniently tell us the value to go into higher-pointer LSB
xro2	sta	letc		;
	lda	leta+1		;
	adc	#$00		;
	ldy	letc+1		;
	sta	letc+1		;  letc = leta + 40;

	;; if Y (and thus the initial A) was zero, we use character at this pos
	ldx	#$00		;
	tya			;
	and	#$7f		;
	bne	test		;  if ((a & 0x7f) == 0)
	tya			;
	and	#$80		;
	ora	(leta,x)	;
	tay			;   a = (a & 0x80) | *leta;

test	tya			;  for (--letc; ++letc < TEDSCR + 1024; ++leta)
	bmi	tes2		;   if (a >= 0) { /* highlight crypto letters */

tes1	cmp	(leta,x)	;
	bne	tes3		;
	lda	#spc		;
	cmp	(letc,x)	;
	bne	tes3		;    if ((*leta == a) && (*letc == spc)) {
	cmp	(leta,x)	;     if (*leta == spc)
	beq	done		;      break;
	tya			;
	ora	#$80		;
	sta	(leta,x)	;     *leta |= 0x80; /* reverse video */
	bmi	tes3		;    }

tes2	lda	(letc,x)	;
	bpl	tes3		;   } else if (*letc & 0x80) { /* show answers */
	and	#$7f		;
	sta	(letc,x)	;    *letc &= 0x7f; /* normal video */
	tya			;
	and	#$7f		;
	sta	(leta,x)	;    *leta = a & 0x7f;

tes3	inc	letc		;   }
	bne	tes4		;
	inc	letc+1		;
	lda	letc+1		;
	cmp	#TEDPAG+4	;
	bpl	done		;
tes4	inc	leta		;
	bne	test		;
	inc	leta+1		;
	bne	test		;

done	tya			;
	rts			;  return a;
				; }
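For readers who would rather not trace the pointer arithmetic, the two-keypress behavior can be modeled in Python. This is a simplified sketch, not a translation of the routine: it uses ASCII rather than Commodore screen codes, treats "a blank directly below" as the cipher-cell marker, and the function names are invented.

```python
WIDTH, SIZE = 40, 1000      # 40x25 text screen
SPC, REVERSE = 0x20, 0x80   # blank marker and reverse-video flag (bit 7)

def highlight(screen, letter):
    """First keypress: flag every cipher cell that holds `letter`."""
    for pos in range(SIZE - WIDTH):
        # Assumption: a cipher cell is one with a blank directly below it.
        if screen[pos] == letter and screen[pos + WIDTH] == SPC:
            screen[pos] |= REVERSE

def substitute(screen, letter):
    """Second keypress: clear each flag and write `letter` one row above."""
    for pos in range(WIDTH, SIZE):
        if screen[pos] & REVERSE:
            screen[pos] &= 0x7F
            screen[pos - WIDTH] = letter

screen = bytearray(b" " * SIZE)
screen[WIDTH:WIDTH + 6] = b"XBS XB"   # made-up cipher text on the second row
highlight(screen, ord("B"))
substitute(screen, ord("H"))
print(bytes(screen[0:6]), bytes(screen[WIDTH:WIDTH + 6]))
```

Keeping the highlight state in bit 7 of the screen cell itself is what lets the real routine get by without any separate table of selected positions.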


Let me begin by saying that this is not the correct way to do multiplication on the 6502. Here is a simple macro that multiplies two unsigned registers X and Y together as a sequence of additions and leaves the result as an unsigned 8-bit quantity in X, whereas any real routine should preserve the 16-bit result. You can still use it of course as long as you’re confident the result won’t exceed 255, either by the caller’s own design or by testing the high-order bits of X and Y. In such cases it may be faster than the canonical method, and takes up only 50 bytes. All memory for storing intermediate results is taken from the stack.

I started this exercise in fact as a way to explore the TSX and TXS instructions of the 6502. The X register is used to index into the hardware stack, which on a 6502 exists from $100 through $1ff. The macro runs in constant time, going through three loops eight times each, but as mentioned previously the upper 8 bits of the result are discarded.

It turns out that with the 6502 having so few registers, pure macro algorithms like this can only barely even exist at all and only with the help of TXS/TSX. By “pure macro” I mean using only registers and the stack; invoking the macro doesn’t require the caller to pass in the address of some allocated variable space to stash the intermediate results.

 ldy #$08
 bne *-4
 lda $0109,x
 ldy #$08
 bcs *+5
 lda #$00
 sta $0101,x
 bne *-12
 ldx #$00
 ldy #$08
 adc $0101,x
 bne *-9

And here’s what it would look like if it were compiled from the same algorithm in C++:

#include <cstdint>

inline void um8(uint8_t& x, uint8_t y) {
 // a models the accumulator, sp the stack pointer; the stack page is mem[0x100..0x1ff]
 static uint8_t a, sp = 0xff, mem[512];
 mem[0x100 + sp--] = a ; // pha
 mem[0x100 + sp--] = a = y ; // tya : pha
 a = x ; // txa
 x = sp ; // tsx
 for (y = 8; y > 0; y--) { // ldy #$08
  mem[0x100 + sp--] = a ; // pha
  x-- ; // dex
  a = (a << 1) & 255 ; // asl
 } // dey : bne *-4
 a = mem[x + 0x109] ; // lda $0109,x
 for (y = 8; y > 0; y--) { // ldy #$08
  int c = a >> 7; a = (a << 1) & 255 ; // asl
  if (c == 0) { // bcs *+5
   mem[0x100 + sp--] = a ; // pha
   mem[x + 0x101] = a = 0; // lda #$00 : sta $0101,x
   a = mem[0x100 + ++sp] ; // pla
  } //
  x++ ; // inx
 } // dey : bne *-12
 x = 0 ; // ldx #$00
 for (y = 8; y > 0; y--) { // ldy #$08
  x += (a = mem[0x100 + ++sp]) ; // txa : tsx : clc : adc $0101,x : tax : pla
 } // dey : bne *-9
 y = a = mem[0x100 + ++sp] ; // pla : tay
 a = mem[0x100 + ++sp] ; // pla

 return; /* value in x */
}

In the title I call this routine “in-place” multiplication in the sense that the routine itself is the only memory space used besides the stack. Such an algorithm may find use in an extremely constrained 6502 system, where every byte is precious but a multiplication routine is still needed.
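As a cross-check on the three-loop structure, here is a minimal Python model that uses a list in place of the hardware stack. It is a sketch of the technique only (shift-and-add with the partial products parked on a stack), with the bit order taken from the asl comments; it does not model register save and restore.

```python
def um8(x, y):
    """Low 8 bits of x*y using three fixed 8-pass loops and a list as the stack."""
    stack = []
    a = x
    for _ in range(8):            # loop 1: push x<<0 .. x<<7, truncated to 8 bits
        stack.append(a)
        a = (a << 1) & 0xFF
    a = y
    for i in range(8):            # loop 2: walk y from its top bit down
        carry = a >> 7
        a = (a << 1) & 0xFF
        if not carry:             # bit clear: that partial product contributes 0
            stack[7 - i] = 0
    total = 0
    for _ in range(8):            # loop 3: pop and accumulate, wrapping at 8 bits
        total = (total + stack.pop()) & 0xFF
    return total

print(um8(12, 13))  # prints 156
```

Every loop always runs eight passes regardless of the operands, which is where the constant execution time comes from.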

Invoking this macro is as close as it’s possible to get to pretending there’s an actual register-to-register multiply instruction on the 6502:

 ldx #FACTOR1
 ldy #FACTOR2
 um8 ; x contains (FACTOR1*FACTOR2)&255

Rarely do I get to write assembly code in my day job. Usually I’m doing mixed-signal controller board design and someone else is writing C code. If I’m getting paid to code, it’s usually also been in C. So I’m at maximum exhilaration when I both get to design a special-purpose PCB and code it up to squeeze every last bit of performance out of a microcontroller.
Two past PCB designs I did stand out for me in this fashion. Quite on the opposite end of the spectrum from the high-speed monster I just finished turning on, both were PCBs of four layers or fewer and based around relatively slow (32 MHz PIC24, USB 1.1) controller boards: one for 2-axis coordinated motion and the other for multicolor LED light shows.
Microchip of course makes the PIC18 series, some members of which do add USB to this time-honored set of 8-bit peripherals and 16-bit-wide, compiler-friendly instructions. But while both these applications had to support a low-cost overall solution, their real-time demands weren’t well-suited to an 8-bit microcontroller.
Microchip’s step-up PIC24FJ32GB offering has been a nice minimum-component-count platform over the last few years for any low-cost USB application that can fit in only 8K of RAM for variable storage but needs enough processing power to keep up with several motor bridges or serial peripherals. It comes in two different packages: 28-pin dual-inline (-002 suffix, 14 pins on either side, either through-hole or surface-mount) or 44 pins with 11 on each side of a square surface-mount package (-004 suffix, either leadless or quad-flatpack). The amount of RAM seems small, but frequently in a drive application 8K is sufficient to buffer the incoming USB packets and act on them in some meaningful way before moving on; longer-term retention isn’t required.

2-axis Cutting

The first board of my favorites is one I designed for a leading brand of craft cutter. As a two-axis motion controller, it dispensed with stepper motor drive (for reasons of size, audible noise and cost) and introduced DC servo motor control into the market for cutting pre-designed intricate shapes and fonts out of colored/textured paper. I didn’t actually write any of the code on the servo processor that takes cut vectors over USB and does the motion control; that was all done by a very capable embedded C developer and algorithms expert.
What I did do after I sent the board off for fab was program a companion 8-bit PIC device on the board in what has become an anachronism, now that Microchip offers 16-bit controllers with both USB OTG support and dual quadrature encoders in the form of their EP series. But back in 2011, it was necessary to choose: add a USB-to-UART bridge (such as the popular FT232 series or Microchip’s equivalent MCP2200) to a dsPIC33 motor controller, or offload the quadrature encoder interface (QEI) processing from the PIC24 and allow it to query the position counts every 1 ms over I2C from a dedicated microcontroller.
Since it was the lowest-cost alternative I used a PIC16LF1823 for the latter, and although I was just implementing existing QEI hardware blocks in software it was a fun exercise in interrupt latency reduction. I knew how many 125-ns cycles were burned for every possible ISR branch combination, since this drove the allowable line density on the encoder wheel. The bugs that manifested themselves whenever certain types of interrupts got dropped and the cuts designed to exacerbate them furthermore produced some interesting abstract shapes in the paper along the way.
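For readers unfamiliar with what a QEI block does, the following Python sketch shows the usual table-driven way of decoding two quadrature channels in software, plus the cycle-budget arithmetic. It is an illustration of the general technique, not the firmware from this board; the table layout, function names and the 40-cycle ISR figure are assumptions.

```python
# Index = (previous AB state << 2) | current AB state; value = count change.
# 0 marks "no movement" and also invalid double transitions (e.g. a dropped edge).
QUAD_TABLE = [ 0, +1, -1,  0,
              -1,  0,  0, +1,
              +1,  0,  0, -1,
               0, -1, +1,  0]

def count_edges(samples, start=0b00):
    """Accumulate a position count from successive 2-bit AB samples."""
    position, prev = 0, start
    for cur in samples:
        position += QUAD_TABLE[(prev << 2) | cur]
        prev = cur
    return position

def max_edge_rate(isr_cycles, cycles_per_second=8_000_000):
    """Edges per second the ISR can keep up with at 125 ns per cycle."""
    return cycles_per_second // isr_cycles

print(count_edges([0b01, 0b11, 0b10, 0b00]))  # prints 4
print(max_edge_rate(40))                      # prints 200000
```

A skipped state shows up as a double transition that the table scores as zero, which is one way dropped interrupts turn into lost position rather than an outright fault.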

LED light shows in a display case

The second board I did had to control red, green and blue LED illumination strips for up to 16 product showcase bins in a wooden cabinet. The support chip for this product was the LT8500 from Linear Tech, which provides exactly the 48 independent digital PWM channels that I needed for just a few dollars. The assembly code in the PIC24 acted as more than a bridge between the USB/RS-485 link and the LT8500: it maintained the state table in RAM and provided a protocol for quickly changing/ramping the colors upon a command trigger without any awkward latencies from one bin to the next. All color changes thus were synchronized by the LT8500 to within a few milliseconds, much better than the DMX-based prototype driven by a Microsoft box. Alas, a better career opportunity presented itself before this board went out for fab so I never got past the simulation phase.
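The idea of a RAM state table that ramps every channel in lockstep can be sketched in Python as follows. This is a hypothetical illustration rather than the PIC24 code: the function name, the linear interpolation and the 12-bit PWM range are all assumptions.

```python
CHANNELS = 48          # 16 bins x 3 colors
PWM_MAX = 4095         # assumes 12-bit PWM values

def ramp_frames(current, target, steps):
    """Yield `steps` full 48-channel frames moving linearly to `target`."""
    for step in range(1, steps + 1):
        # Integer interpolation; the last frame lands exactly on the target.
        yield [c + (t - c) * step // steps for c, t in zip(current, target)]

frames = list(ramp_frames([0] * CHANNELS, [PWM_MAX] * CHANNELS, 4))
print(len(frames), frames[0][0], frames[-1][0])  # prints 4 1023 4095
```

Because each frame carries all 48 values, every bin changes on the same update rather than one after another.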

Favorite Assembly Macros

I want to share a few favorite PIC24/dsPIC33 macros that came out of this project. The first one just does a copy up to and including the first zero-valued byte, i.e. strcpy(), which assembles down very compactly on this platform. The return value in W0 will always be 0:

.macro strcpy src,dst ; src and dst can be any register W1 through W15
 clr.b W0 ; clobber W0 since an ALU operation is needed to set flags
 ior.b W0,[\src++],[\dst++] ; keep copying, advancing both pointers
 bra nz, .-2 ; branch to the previous instruction if nonzero
.endm

So you can call this macro as follows:

mov #STRING,W1
mov #BUFFER,W2
strcpy W1,W2

The next PIC24 assembly macro is a bit more involved. It provides a “MOVe_Byte to High position” that is useful to abstract on this 16-bit processor. When packing halfwords into words the diverse instruction combinations in this macro tend to get executed a lot anyway, so we might as well encapsulate them into a single command that handles the cases for us.  The mov_bh macro (I would’ve defined it as “mov.bh” if macro names were allowed to contain the dot character) can be called with the operand forms that the mov instruction supports, such as register-to-register or file-to-register, with a companion mov_bhl macro covering the immediate/literal case, and we won’t see in the calling code any of the requisite swap instructions.  For example we want to be able to do any of the following:

mov #0xabcd,W1 ; W1:0xabcd
mov #0xfedc,W0 ; W1:0xabcd, W0:0xfedc
mov WREG,file ; W1:0xabcd, W0:0xfedc, file:0xfedc
mov_bhl 0x03,W1 ; W1:0x03cd, W0:0xfedc, file:0xfedc
mov_bh file,WREG ; W1:0x03cd, W0:0xdcdc, file:0xfedc
mov_bh W1,file ; W1:0x03cd, W0:0xdcdc, file:0xcddc
mov_bh W0,W1 ; W1:0xdccd, W0:0xdcdc, file:0xcddc
mov_bh W1,file ; W1:0xdccd, W0:0xdcdc, file:0xcddc
swap W0 ; W1:0xdccd, W0:0xdcdc, file:0xcddc
mov_bh WREG,file ; W1:0xdccd, W0:0xdcdc, file:0xdcdc
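The net effect of every form of the macro is the same byte shuffle, which can be stated in one line of Python. This is only a model of the intended result on 16-bit values, not of the instruction sequences; the function name simply mirrors the macro.

```python
def mov_bh(src, dst):
    """Return dst with its upper byte replaced by the lower byte of src."""
    return ((src & 0xFF) << 8) | (dst & 0xFF)

w1 = mov_bh(0x03, 0xABCD)
print(hex(w1))   # prints 0x3cd
```

The lower byte of the destination is untouched, which is why the assembly versions have to swap, write a byte, and swap back.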

The macro definition for mov_bh begins by mapping each register name that could get passed in, W0 (or WREG) through W15, onto its numeric equivalent.  This way we can both handle the WREG case and make sure that the register is a valid one in this range.

.equiv MOV_BH_REGISTER_W0, 0
.equiv MOV_BH_REGISTER_W1, 1
.equiv MOV_BH_REGISTER_W2, 2
.equiv MOV_BH_REGISTER_W3, 3
.equiv MOV_BH_REGISTER_W4, 4
.equiv MOV_BH_REGISTER_W5, 5
.equiv MOV_BH_REGISTER_W6, 6
.equiv MOV_BH_REGISTER_W7, 7
.equiv MOV_BH_REGISTER_W8, 8
.equiv MOV_BH_REGISTER_W9, 9
.equiv MOV_BH_REGISTER_W10, 10
.equiv MOV_BH_REGISTER_W11, 11
.equiv MOV_BH_REGISTER_W12, 12
.equiv MOV_BH_REGISTER_W13, 13
.equiv MOV_BH_REGISTER_W14, 14
.equiv MOV_BH_REGISTER_W15, 15
.equiv MOV_BH_REGISTER_WREG, 0
.equiv MOV_BH_REGISTER_wreg, 0
.equiv MOV_BH_REGISTER_w0, 0
.equiv MOV_BH_REGISTER_w1, 1
.equiv MOV_BH_REGISTER_w2, 2
.equiv MOV_BH_REGISTER_w3, 3
.equiv MOV_BH_REGISTER_w4, 4
.equiv MOV_BH_REGISTER_w5, 5
.equiv MOV_BH_REGISTER_w6, 6
.equiv MOV_BH_REGISTER_w7, 7
.equiv MOV_BH_REGISTER_w8, 8
.equiv MOV_BH_REGISTER_w9, 9
.equiv MOV_BH_REGISTER_w10, 10
.equiv MOV_BH_REGISTER_w11, 11
.equiv MOV_BH_REGISTER_w12, 12
.equiv MOV_BH_REGISTER_w13, 13
.equiv MOV_BH_REGISTER_w14, 14
.equiv MOV_BH_REGISTER_w15, 15

The next step is to handle the cases one by one.

.macro mov_bh src,dst ; mov register/file lower byte into upper byte
 .ifdef MOV_BH_REGISTER_\dst ; i.e. MOV_BH file,WREG or Ws,Wd
 .if MOV_BH_REGISTER_\dst ;
 swap \dst ;
 mov.b \src,\dst ;
 swap \dst ;
 .else ;
 swap W0 ;
 mov.b \src,\dst ;
 swap W0 ;
 .endif ;
 .else ; i.e. MOV_BH file,file or Ws,file
 .ifdef MOV_BH_REGISTER_\src ;
 .if MOV_BH_REGISTER_\src ; i.e. MOV_BH W1,file through W15,file
 push W0 ;
 mov \src,W0 ;
 mov.b WREG,(1+(\dst)) ;
 pop W0 ;
 .else ; i.e. MOV_BH WREG,file or W0,file
 mov.b WREG,(1+(\dst)) ;
 .endif ;
 .else ; i.e. MOV_BH file,file (via W0)
 push W0 ;
 mov.b \src,WREG ;
 mov.b WREG,(1+(\dst)) ;
 pop W0 ;
 .endif ;
 .endif ;
.endm

.macro mov_bhl src,dst ; mov literal byte into upper byte
 .ifdef MOV_BH_REGISTER_\dst ; i.e. MOV_BH #lit8,Wd
 swap \dst ;
 mov.b #\src,\dst ;
 swap \dst ;
 .else ; i.e. MOV_BH #lit8,file
 push W0 ;
 mov.b #\src,W0 ;
 mov.b WREG,(1+(\dst)) ;
 pop W0 ;
 .endif ;
.endm

That’s it for this retrospective of some of my favorite 16-bit microcontroller PCB designs. In the next post I’ll go back to 8-bit land with a MOS 6502 coding exercise.

Part one of this topic gave a cursory summary of the Commodore TED series of microcomputers from the 1980s, in particular the ability of the eponymous MOS 7360 integrated circuit to manage a situation known as “ROM-over-RAM.”  The total memory capacity (RAM plus ROM) of 6502-based machines frequently exceeded the 64K address space, but the RAM “covered” by ROM never went wasted…at least in the case of the series that hit the market as the Plus/4, Commodore 16 or Commodore 116.

Writes anywhere into the 64K address space always went through to the underlying RAM (with the necessary exception of a small range of memory-mapped registers and code routines to control the behavior of TED itself), but a read could come either from ROM or RAM sharing the same address.  So while some amount of programming footwork–and thus CPU cycles–was required in order to get a byte value back, one could always be stashed quickly away in a free block of RAM.

When developing on a small system with limited hardware, there frequently arises the problem of how to log the activity of a program along the way to getting it to function properly.  In-Circuit Emulation (ICE) and trace capability may not exist for a particular microcontroller, or may be beyond the hobbyist’s budget.  Simulation can only help find bugs to the extent that the actual on-chip peripherals aren’t required, and aren’t the root cause of the problem.

So what’s a developer to do?  Sending status updates out a UART to be displayed by a terminal emulator running on another machine is a frequent means of monitoring program progress.  The Commodore Plus/4 was in fact one of the few 1980s machines to wire a hardware UART such as the MOS 6551 Asynchronous Communications Interface Adapter to an I/O port, and was the only shipping member of the TED series to do so. (More frequently the MOS 6502 CPU had to bit-bang any serial stream, which of course would slow down the program being debugged.)  Back then, multiple UART capability–now taken for granted on all but the most pin-limited microcontrollers–was exceedingly rare except on the new 16-bit IBM clones, so when debugging any sort of serial application there again arose the dilemma of how to get the debugging stream out.

With the TED series members containing more than 32K RAM (the Plus/4 or an upgraded 16 or 116), debug logging was easy.  Just pick any range of memory above the BASIC variables or otherwise set to read from ROM.  Such memory is not normally in high demand because it cannot be read in a straightforward fashion, so vast quantities actually can be available.  For example, when the machine is first turned on the BASIC storage space begins at $1001 and ends at the first TED register, $FD00.  (And the power-on message is a fully 92% efficient “60671 bytes free” since $FD00 – $1001 = $ECFF.)  But the “heel” RAM of 192 bytes starting at $FF40 remains completely unclaimed, since preference must be given to the pan-Commodore “kernal” ROM vectors ($FFD2, $FFE4, etc.) occupying the same range of addresses.

So it would not be uncommon for my TED-series BASIC or machine language programs to POKE values into, say, a circular buffer that wraps through the uppermost 192 bytes of RAM.  Until the program terminates or a specific breakpoint is reached there is no need to view the buffer, so the cost of using TED to copy high RAM to a place where it can be read (or even making the RAM momentarily visible) is incurred only infrequently.
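The wrapping log described above can be sketched on a host machine in C. This is only a rough model, not code from the original programs: the `ram` array stands in for the RAM beneath ROM, and `log_byte` and `log_head` are hypothetical names chosen for illustration. Only the base address $FF40 and the 192-byte length come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOG_BASE 0xFF40u  /* start of the 192-byte "heel" RAM named in the text */
#define LOG_SIZE 192u     /* $FF40 through $FFFF */

static uint8_t ram[0x10000];  /* hypothetical stand-in for the RAM beneath ROM */
static size_t log_head;       /* offset of the next free slot in the buffer */

/* Rough equivalent of POKE LOG_BASE+log_head,value plus a wrapping increment. */
static void log_byte(uint8_t value)
{
    assert(log_head < LOG_SIZE);
    ram[LOG_BASE + log_head] = value;       /* writes always land in RAM */
    log_head = (log_head + 1u) % LOG_SIZE;  /* wrap back to $FF40 after $FFFF */
}
```

Each call costs one store and one index update, which is the point of the technique: the expensive step of reading the buffer back is deferred until the program stops.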

The new Microchip “EP” series of their popular 16-bit microcontroller family enables a similar FLASH-over-RAM configuration, not in those same terms but as a comeback appearance of the paging strategy that their 8-bit micros have always used.  Due to their limited (12-/14-/16-bit) instruction word size, the low-end series use a 7-bit literal address (0x00 through 0x7f) offset against a page number. Microchip’s original 16-bit families use a 24-bit instruction word to specify addresses in the range 0x0000 through 0x1fff as 13-bit-long literals, or the full 64KB range (0x0000 through 0xffff) by using the powerful register-indirect modes.

With the EP series Microchip has paved the way to access up to 16MB of data memory through the 16-bit register-indirect modes now offsetting against a 9-bit page number.  Yes, the famed Microchip data memory banking scheme that has been the bane of careless 8-bit assembly language programmers is back!  As soon as Microchip begins to offer devices with more than 64KB of SRAM–and they are getting perilously close, with the top of their line now at 54KB–this new mode will enable access to the high portions.  Any single 32KB page (other than the page corresponding to the lowest 32KB, which is always visible in the low half) can be made to show up in the high half of data memory.
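The 16MB figure follows from the arithmetic: 512 pages (a 9-bit page number) of 32KB each. The host-side C sketch below illustrates that, on the assumption that the page number simply supplies the address bits above a 15-bit offset. The function name `eds_linear_address` and the macros are made up for this example and are not Microchip identifiers.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BITS  9u                 /* page number width quoted in the text */
#define PAGE_SIZE  0x8000u            /* one 32KB page */
#define PAGE_COUNT (1u << PAGE_BITS)  /* 512 pages, 16MB in total */

/* Map a 16-bit effective address plus a page number to a linear address.
 * Assumption: linear address = page number : low 15 bits of the address. */
static uint32_t eds_linear_address(uint16_t page, uint16_t ea)
{
    if ((ea & 0x8000u) == 0u)
        return ea;  /* low half always shows the lowest 32KB; page is ignored */
    assert(page >= 1u && page < PAGE_COUNT);  /* page 0 is the low half itself */
    return ((uint32_t)page << 15) | (ea & 0x7FFFu);
}
```

Under this assumption, page 1 with address 0x8000 lands on linear address 0x8000, which is the start of the second 32KB of RAM.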

The feature that is reminiscent of the Commodore TED series is that there are actually two page registers, DSWPAG and DSRPAG, which can specify a different active page in data space for reading than for writing.  According to the reference manual this is intended to facilitate quick copying of memory between 32KB pages.  Furthermore, the longstanding Program Space Visibility (PSV) feature which maps read-only program space into the upper half of data space becomes just a special case of the new paging scheme, through the allocation of an extra bit to the DSRPAG register.

So let’s cook up a TED-like example with a PIC24EP- or dsPIC33EP-series controller that requires only 32KB of the onboard data RAM.  The excess will always be accessible for write operations if DSWPAG is set to 0x01.  Reads from the high half of the 64KB data space, however, will return read-only information from the program space (very useful for character sets, look-up tables and other static data since they don’t have to be initialized at runtime) if DSRPAG is set to a value of 0x200 or greater, that is, with the extra PSV-select bit set. The upper half of data space can now carry a different meaning on an indirect read operation via W8 (pull constant data from flash) than an indirect write operation via W9 (push another word of log trace data):

mov #0x0201,W0
mov W0,DSRPAG ; do high-memory reads from PSV page 1 of flash…
mov #0x0001,W0
mov W0,DSWPAG ; but still do writes to the second 32KB of SRAM
bset W8,#15
bset W9,#15 ; make sure both W8 and W9 are high-memory addresses
mov [W8++],W0 ; pull a word from flash for imminent use in the program
mov W0,[W9++] ; and log that word to a debug buffer for safe keeping
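
For readers without the hardware at hand, the read/write split can be modeled on a host machine in C. This is a sketch under stated assumptions rather than a description of the silicon: the `HighHalf` structure and the function names are hypothetical, and the page registers are abstracted away, so that reads of the upper half always come from a flash page and writes always go to an SRAM page.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_WORDS 16384u  /* one 32KB page holds 16384 16-bit words */

/* Hypothetical model of what the upper half of data space maps to. */
typedef struct {
    uint16_t flash_page[PAGE_WORDS];  /* read side: constants in program space */
    uint16_t sram_page[PAGE_WORDS];   /* write side: second 32KB of SRAM */
} HighHalf;

/* A read of a high-memory address returns the flash contents. */
static uint16_t high_read(const HighHalf *m, uint16_t ea)
{
    assert(ea & 0x8000u);  /* must be a high-memory address */
    return m->flash_page[(ea & 0x7FFFu) >> 1];
}

/* A write to the same address lands in SRAM instead. */
static void high_write(HighHalf *m, uint16_t ea, uint16_t value)
{
    assert(ea & 0x8000u);
    m->sram_page[(ea & 0x7FFFu) >> 1] = value;
}

/* Mirrors the pair mov [W8++],W0 and mov W0,[W9++] from the listing. */
static uint16_t fetch_and_log(HighHalf *m, uint16_t *w8, uint16_t *w9)
{
    uint16_t w0 = high_read(m, *w8);
    *w8 += 2;  /* word post-increment */
    high_write(m, *w9, w0);
    *w9 += 2;
    return w0;
}
```

Even when both pointers hold the same address, the constant in flash is left untouched while the log entry accumulates in SRAM behind it.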

Upcoming Swissembly topics will leave Commodore’s glory days behind, to explore other facets of modern embedded system design.

While the 64 became the best-selling single computer model of all time largely due to a comprehensive game library, Commodore International’s follow-up a few years later in the form of the Plus/4 was widely considered a flop.  Within a year, resellers had stopped carrying the new machine and Commodore was unloading them through closeout channels as varied as eastern-European public schools and mail-order/TV commercials in the U.S.  Not much more commercial software had been written for the new platform by the time of its demise than at its 1984 introduction (Micro Illustrator, Jack Attack…almost everything by Commodore’s own software group except my favorites Blazin’ Forth and Questprobe: Spider-Man.)

What the Plus/4’s detractors fail to grasp is that it completely resolved a major downer of the previous home machines: the 64 had shipped as soon as possible on the CEO’s orders and although its low-cost graphics and sound hardware shone in the hands of a software publisher, few of its features were unlocked for the beginning programmer.  All special hardware could be accessed only by PEEK and POKE into the register space. The box remained nearly useless to most users before they invested in boxed software, whereas even a Plus/4 ignored by the software industry represented a much more useful blank slate to the budding programmer.  This pretty much describes me when I picked up a gently used, if at all, Plus/4 from the local Computer Corner for $40 in 1987.

Sure, the earlier Commodore BASIC was a good vehicle for many of us kids learning how to program in between gaming sessions.  But once one reached the advanced sections of the manual and was trying to conjure up sophisticated user interactions with the same poor code structures it introduced at the outset (GOTO and GOSUB, nothing more) or even started running out of memory for variables, it quickly became clear that only a change of programming environment could result in the completion of a project worthy of the platform.

Contrast this with the Plus/4, a true programmer’s computer that wasn’t even intended to become one.  Commodore had been able to ship the 64 at a handsome profit due to its plethora of support chips–not to mention the CPU–being fabricated inside Commodore itself, and its gobs of DRAM plummeting in cost in accordance with Jack Tramiel’s calculated gamble.  This mass-market value proposition in turn did drive a burgeoning game, desktop publishing and home dialup market.  On the other hand, the 264–as the Plus/4 was originally called–was never intended to replace the 64 but rather stave off low-end threats from Sinclair in Europe and what would later become MSX in Japan.

While the VIC-20 and 64 had been rushed out the door with a barely adequate Microsoft BASIC 2.0 from the PET/CBM platform, for its next try Commodore had the time to really optimize the hardware and software design.  Dave Diorio embarked on the design of the MOS 7360 Text Editing Device (TED), a single-chip graphics, sound, I/O and memory management chip that would lend its abbreviation as a codename for the entire follow-up cost-down effort.  With TED and the MOS 7501 CPU as the centerpieces, Bil Herd reduced the vast count of packages on the motherboard to the bare minimum to achieve a real, usable computer while solving issues along the way that the thrifty design mindset had produced.  Fred Bowen and Terry Ryan squeezed in the improvements in the system software and left the door open for quick-boot ROM productivity applications intended for frequent use (since the system bus expansion port was intended to be taken up by the 1551 disk drive).  The end result was a family of computers that could span multiple segments of the market.

The 264 and its big-brother 364 were more nuanced machines than the unfocused Commodore marketing departments could appreciate, and they are valued even today by those willing to give up SID voices and VIC-II sprites in favor of:

  • A palette of 121 colors (up from 16 in the VIC duo) that made the output of paint programs actually attractive
  • A real hardware UART for RS-232 serial communications that didn’t hamstring the main processor
  • TEDMON, a built-in machine language monitor (typically sold for the 64 only as a ROM cartridge such as HESMON)
  • BASIC 3.5 with renumbering, loop structures, readable graphics and sound commands
  • A disk drive that accessed the same media at almost 10x the speed of the 1541
  • The ability to boot a ROMed application, called “3-plus-1” in the shipping version (but for everyone’s sake let’s pretend it was one of the originally intended single-function apps such as CALC/PLUS)

All these 264 features of course came at the expense of extra ROM capacity relative to the 64.  Yet the amount of available BASIC memory displayed at power-on actually increased, from 38kB to 59kB!

This trick was accomplished with the help of the new TED chip that could switch the upper half of the 264’s memory address space between ROM and RAM very easily. Besides allowing all system RAM to become graphics RAM, TED would take care of DRAM refresh also. The various ROMs normally were mapped to the top 32K, but instead of obliterating the underlying RAM a write operation always went straight through.  Read operations were possible by interacting with TED through a short assembly language routine kept in the lower 32K (always RAM), at address $0494, that could switch back and forth between modes one byte at a time.  Not lightning fast, but enough to allow BASIC to store its variables up underneath ROM and open up more space for program storage. And if the accesses were just writes, for example log entries for debugging, there was no penalty at all!

I was recently reminded of this scheme (upper 32K RAM-under-ROM, lower 32K always hardware registers or SRAM) when I began working with 2011’s “EP” series refresh of Microchip’s 16-bit PIC24 and dsPIC33 families: it is not uncommon to run out of a convenient way to output debug information in real time when programming microcontrollers with limited peripheral pins.  And so, just like on the 264 series, it is possible to build up a log or trace of moderate size in the upper bank of RAM, underneath read-only entries in the flash program space akin to ROM in the old days.

I’ll go into more detail of this deja-vu capability in part two.