Multiplying by 2.5 is not an uncommon assembly-language operation, for example when stretching a decimal percentage (0% to 100%) input from the user into a 7.92-bit (since counts 251 to 255 get omitted) hexadecimal value. Dedicating a routine to a fixed scaling by a particular constant is often a worthwhile trade of size for speed, avoiding multiply instructions and even branching thanks to fully unrolled shift/add loops.

Suppose we have a routine that accepts an integer in the range 0 to 100 representing the stimulus to be applied to an LED. We would ideally like to map this input linearly onto the range 0x00 to 0xff, but are perfectly willing to accept a mapping onto the range 0x00 to 0xfa because it can be done with zero differential nonlinearity inside 9 bits. This will only sacrifice (255-250)/255=2% of the available intensity, so a current DAC capable of 100mA will only output 98mA at the maximum setting. Frequently, having a perfectly linear response is more important than squeezing every last potential drop of potential from the analog peripheral.

Fundamentally, multiplying by 2.5 is just a multiplication by 4 (in sequential single-bit left shifts or in a single step using a barrel shifter) followed by an addition of the original value and then a division by two.
As stated, a 9th bit indicating whole numbers or halves (bit -1, i.e. below the LSB of the result register) even comes for free and can be returned in the carry bit simply by propagating the LSB of the dividend:

 ;;; optimized routine to perform a conversion from PWM percentage to fullscale
 ;;; unsigned 8-bit value (actually only 0xfa, not 0xff) by multiplying by 5/2
 ;;; instead of 255/100 (so fullscale becomes 102%, i.e. 2% gain error)
 ;;; the 9th (0.5LSB) bit of the output is always equivalent to the LSB of the
 ;;; input, so it is quite trivial to obtain zero DNL at 9-bit resolution, e.g.:
 ;;;    rrf    file,w    ;
 ;;;    clrf   PWM1DCL   ;
 ;;;    rrf    PWM1DCL,f ; PWM1DCL = (file & 1) ? 0x80 : 0x00;
 ;;;    movf   file,w    ; // by looking at LSB of file, no precision lost
 ;;;    x5div2 file
 ;;;    movwf  PWM1DCH   ; PWM1DCH = (file/2) * 5; // despite dropping LSB
 ;;; completion is a constant 12 instruction cycles: 3us assuming Fosc is 16MHz
 x5div2 macro reg
        movlw 0x7e     ; // 0 <= reg <= 100
        andwf reg,w    ; w = reg & 0x7e; // 0 <= w <= reg (even, round down)
        bcf   STATUS,C ;
        rlf   reg,f    ;
        rlf   reg,f    ; uint16_t c = reg *= 4; // 0 <= reg <= 400
        btfsc STATUS,C ; if (c > 0xff)
        iorlw 0x01     ;  w |= 1;
        addwf reg,f    ; c = reg += w;
        btfsc STATUS,C ; if (c > 0xff)
        iorlw 0x01     ;  w |= 1;
        rrf   WREG     ; // 0 <= (w&1)*256 + reg <= 500
        rrf   reg,f    ; reg = ((w&1)*256 + reg)/2; // 0 <= reg <= 250

What we have really accomplished by returning 2.5 times the 8-bit input in one 8-bit value plus one 2^(-1) bit is to return 5 times the input inside a 9-bit word. So a full-scale value of 100 becomes 500, just represented as 250 in the top 8 bits and 0 in the leftover bit (because the input was even)!


Here’s a program I just wrote and assembled for an 8-bit PIC device, which assembles under MPASM (and gpasm) using  the instruction set of the 6502, rather than PIC native code:

        processor 16f1508

        LDA# 0x5a
        LDX# 0xef
store5a STA 0x2000,X
        BNE store5a 
        STA 0x2000,X
        LDA# 0xa5
        BIT 0x2000
        ROL A,
        BIT 0x2000
        BEQ store5a

This example works as intended, setting the 240 bytes of PIC linear memory to all Z’s in ASCII and then performing a couple of ultimately pointless bit tests and stack operations.  Besides the inclusion of a macro definition file to enable this translation a few deviations from standard 6502 syntax give it away: superfluous argument placeholders such as the comma after the ROtate Left Accumulator instruction (a second argument is necessary in case the first one were to have be indexed with X), as well as the strange absorption of the immediate-mode hash signs into the opcode itself.

Assembly-language translation for PIC processors has been done before, using the familiar shorthand of another processor’s instruction set but usually with more of an eye on efficiency and readability than full duplication.  The program snippet above, which would require just 27 bytes of program storage on a true 6502 machine, expands into 247 PIC 14-bit words, for a total of over 400 bytes or about 5% of this device’s flash memory.  Not surprisingly, the macros have to be defined fairly pessimistically and can’t attempt any sort of optimization after the fact on their own.  (Although an optimizing assembler could well do so.)  Mimicking the 6502 on a RAM-banking/ROM-paging PIC device with only one working register and a barely accessible return stack takes a lot of work behind the scenes to accomplish the following in particular:

  • the X and Y index registers (to which purpose I assign FSR0L and FSR1L just in case a pointer access is relative to page boundary and we fortuitously just  need to copy that page into FSR0H or FSR1H)
  • single-opcode stack push/pull operations
  • indexed memory accesses
  • result-less compare and bit test operations

Nevertheless, several 6502 opcodes do map onto a single PIC instruction and don’t require any sort of temporary storage, atomicity or memory protection. These are more along the lines of the assembly-language translations usually attempted:

  • LDA# (becomes: movlw)
  • DEX (becomes: decf FSR0L,f)
  • CLC (becomes: bcf STATUS,C)
  • ROL A (becomes: rlf WREG)

I have not yet tested the full set of 6502 macro definitions extensively, so there are probably several bugs still lurking.  Be sure to disable case sensitivity (MPLAB’s radio button, or gpasm -i) if using uppercase for the 6502 invocations, as I have in order to distinguish the two instruction sets above.

So what features of the 6502 are supported?  Basically, everything except the Binary Coded Decimal (BCD) mode and the oVerflow flag which are the only parts of 6502 that are completely alien to the PIC architecture.  Several opcodes do require special handling or otherwise have a different syntax than on a real 6502 machine:

  1. The Negative flag is supported by means of referring to bit 7 of a particular file register in an additional operand; result-less compares and bit tests require the intermediate results to be stored in a PIC file register.
  2. The immediate mode requires that the hash sign (#) indicating a literal value be placed on the opcode itself, since it is a totally different macro than for a memory access
  3. Indexed indirect (using X) and indirect indexed (using Y) addressing modes are supported, but in this case the question mark (?) character  is required in order to highlight the difference to the assembler after its preprocessor strips off the all-important parentheses.  The parenthesis can still be used in source code, in order to keep some semblance to the 6502 modes:
    • LDA? (0x2000),(X) in lieu of LDA (0x2000,X)…a rarely used 6502 mode anyway!
    • LDA? (0x2000),Y in lieu of LDA (0x2000),Y

Interrupts should work properly, whether using 6502 macros within the ISR or the interrupted code.   Seven of the PIC file registers (0x79 through 0x7f) get used for storing temporary values during the course of a macro.  If not using 6502 macros in interrupt routines, these locations can also be used for temporary storage by PIC code until the next 6502 macro:

  • 0x79 holds the state of the STATUS register in bits 4-0, as well as the GIE bit of INTCON in bit 7
  • 0x7a holds the value of the WREG/Accumulator
  • 0x7b holds the value of the Bank Select Register
  • 0x7c holds the value of the FSR0L/X register
  • 0x7d holds the value of FSR0H
  • 0x7e holds the value of the FSR1L/Y register
  • 0x7f holds the value of FSR1H

One reason for all the extra generated code is that for any instruction that accesses intermediate results in memory or the STATUS register, the GIE bit of INTCON is turned off (SEI in 6502 parlance).  As with previous posts, this is not an efficiency exercise.  I wanted to explore how closely these two 8-bit favorites really could be made to resemble each other.  The code generated will be far from optimized, but the fact that it can be done at all within MPASM’s relatively limited macro expansion capability is kind of neat.

Here’s a head-scratcher: 8-bit PIC processors have a reset vector (where code execution starts, i.e. the value of the PC at reset) of 0x000. The interrupt vector is just four instructions higher, 0x004. What can a PIC coming out of reset accomplish in just a few instructions? A fair amount, it turns out.

Most general-purpose demonstration code doesn’t try to fit any meaningful action into such tight quarters. Since any useful program is bound to be more than four instructions long and have interrupts enabled, one of these four is invariably dedicated to a “goto” instruction to skip past the interrupt service routine, or ISR. Microchip’s XC8 (formerly Hi-Tech’s PICC) compiler doesn’t bother trying to fill the other three with any useful code at all: it merely rewrites the page register PCLATH explicitly with its reset value of zero, increments the PC via an equally meaningless “goto”–instead of a 100% faster “nop”–and then repeats the process with a real page location and a real “goto.”

That’s right, disassembled XC8 output begins with the following overly cautious code, in this case for the trivial “main() {}” program:

0x000: movlp   0
0x001: goto    0x002
0x002: movlp   7
0x003: goto    0x7fc

So at least this compiler’s output doesn’t give much insight into useful purposes for the first four memory locations, prior to the obligatory “goto” no later than 0x003. The first two instructions become the equivalent of a 3-cycle “nop” instruction. The next two instructions explicitly jump to code located at the top of memory. So basically a six-cycle “goto” over the ISR.

Assuming one page selection before the “goto” really could become necessary (if the ISR were to grow beyond one 2kword page–enough for 16 different interrupt sources with 256-word handlers each!), memory at 0x002 and 0x003 could always hold handy constants readily visible from a hex dump. In my own code I’ll sometimes replace any “nop” instructions at 0x002 and 0x003 with “movlw” opcodes that are my code version, values of my “#define” constants or some other at-a-glance indicator:

0x000: movlp   0
0x001: goto    start
0x002: movlw   CODE_VERSION
0x003: movlw   (REV_MONTH<<4)|REV_DAY

But clearly the architects of PIC could have made the reset vector be 0x002, thereby forcing 0x000 and 0x001 to be spent on a jump in all cases. They clearly left the door open for some useful work to get done before burning two instruction cycles (still at least 400ns on a modern 20MHz part) on a “goto” to get to the main code section.

Think quick!

I recently had occasion to make full use of the three instructions in the bottom of PIC instruction memory before the jump over my ISR. As part of a power management application I wanted to pull down on a pin if code happened to start executing from a cold (power-on reset) boot and that pin was low anyway.  This pin also drove a common cathode net for a bunch of Schottky diodes whose anodes were on other PIC pins, in order to clamp any signal back to the main microprocessor to within 0.2V of ground.

Likewise, after reading that pin’s state as an input and thus ascertaining the nature of the reset (cold POR versus warm) I changed it to an output and charged a capacitor on it up to the supply rail to allow the anode pins to toggle as well as to retain this state bit for any subsequent resets.  This seemingly complex decision can be made to fit in three instructions, due to several fortuitous behaviors:

  1. the BSR register defaults to zero at reset and thus PORTA/PORTB/PORTC can be read immediately
  2. “movf PORTA,f” has the effect of copying the current (input) state of those RA pins into their (output) latch register LATA on the first instruction cycle
  3. accessing the BSR then subsequently changing TRISA/TRISB/TRISC (to expose certain LATA/LATB/LATC bit states) takes the remaining two instructions before the “goto”

There are of course registers that can be read to ascertain the nature of a reset (STATUS and PCON) but there is too much bank selection and masking required for each register to make the decision within three instructions, let alone act on it.  Time is somewhat of the essence in my circuit since not only could the capacitor lose its charge due to leakage, but I also put a high-value discharge resistor to ensure that the next true cold boot would be correctly recognized.  Getting everything done before incurring a branch penalty cuts the execution time almost in half!

The following snippet of generic code senses the level of the RA0 pin and then keeps it driven to that state, thereby making an external capacitor-based scheme quite effective for quickly detecting and preserving information about a warm versus cold boot:

0x000:  movf    PORTA,f ; void main(void) { LATA = PORTA;
0x001:  movlb   2       ; 
0x002:  bcf     TRISA,0 ; TRISA &= 0xfe; //RA0 locked to initial value }

All eight pins RA0 through RA7 could even be nailed down with “clrf TRISA”.  Try getting the C code I’ve provided in the comments to compile down as tightly!

I wish this collection of PDF tutorials had been around when I started working with the 8-bit PIC: David Meiklejohn’s 2012 series “Introduction to PIC Programming: Mid-Range Architecture and Assembly Language.”

Contrary to the title, they also make a good refresher (and source of sample code) for experts who just don’t take on a microcontroller design every single year or at every single employer. The treatment of the material and the quality of the writing are among the best. This is not snarky pablum pasted from forum posts but rather the output of a technically skilled author, for about half the price of a Newnes book on PIC that may not prove as useful. Find it online at the Gooligum website

Forgot how many CPU cycles elapse before a write to the Timer0 value actually gets incremented?

Forgot how many “nop” instructions to put near a “sleep” command, where, and why?

Forgot how to use a watchdog timer to periodically wake from sleep?

Although the answers can all be gleaned from the PIC datasheets, I find value in Meiklejohn’s more narrative and task-focused method of presentation.

One last note: It seems there is a new business model in place at the site, whereby only the introductory modules are free of charge. So a discussion of Timer0 comes for free, but the rather different Timer1 and Timer2 will cost you. In case you were hesitating, consider gaining access to the latter money well spent.

One of the first things that many programmers who’ve dabbled in Forth do upon learning Tcl, or vice versa in my case, is to try to implement lexical features of one inside of the other.  In mid-2011 I felt the urge to learn how to do some programming in a low-overhead version of Forth, more than a decade after I seriously got into the Tcl scripting language for similar reasons of wanting something flexible but floppy-bootable.
This trend surely stems from the fact that neither language imposes many rules of syntax beyond streams of whitespace-separated words.

Forth: cross-platform compiled code (careful, can come cryptic!)

Forth is actually quite venerable among compiled languages, having come on the scene in the 1970’s after a long incubation in the service of Charles Moore’s individual programming efforts since 1958.  Natively supporting only 16-bit integers on an internal calculation stack that supplements the more universal “return” stack, many Forth dictionary versions also include support for floating point values–as well as “double” integers–by special processing of multiple adjacent stack locations.  Forth only really does one of two things once it’s up and running as an interpreter/compiler:

  1. accept numeric constants onto the stack, or
  2. try to jump to a subroutine that corresponds to a sequence of non-numeric input characters taken to be a “word” of code

This behavior results in the same postfix syntax as an RPN calculator, and when reading through any Forth code it is essential go slowly enough to be able to visualize the topmost stack contents at any given point. Values are continually sent to and consumed from the stack, so adding two numbers and printing the result is as straightforward as the ubiquitous Forth example ( 2 2 + . ) in which the addition function ( + ) replaces two inputs on the stack with its output ( 4 ) which is then in turn consumed by the stack-printing function ( . ).

Tcl: omitting both vowels from the name, not just one

A Tcl interpreter similarly just plows through streams of characters, but goes a step further by not assigning particular significance to any elements that other languages might try to parse as expressions, including numerical constants!  You can write a procedure called “2”, name a variable “+”, or even have a procedure called “2+2” coexist with a variable by the same name.  The variable value would be accessed Bourne-shell style by preceding it with a dollar sign ($), whereas the procedure call would be the first word on a line or immediately following a function-calling left bracket ([) so there is never any ambiguity.

Tcl’s native format is the parameter list rather than an implied stack, but both being at their core a sequence of numbers there is an obvious natural correspondence between the two. If you actually want to perform a mathematical calculation in Tcl you put it in a list passed to a function that interprets its argument list as an expression, so that

expr 2 + 2

evaluates to “4”.  Once you think about it, this lack of any imposed interpretation on unquoted strings makes sense in today’s computing environment, now that we’ve moved away from just FORTRAN-like calculations and now trade all sorts of text over the Web that are encoded according to various contexts.

A brief aside: a procedure called left parenthesis

By the way, in both languages you can define most ASCII punctuation marks including a parenthesis to be procedure calls, so that whereas the Tcl definition

proc ( {arg args} {expr ( $arg $args}

would seem to contain unbalanced parentheses, it is actually defining “(” as a wrapper procedure. Like any other Tcl procedure it can take zero or more mandatory arguments (one in this case, arbitrarily called “arg”) followed by any number of optional arguments (always called “args”).  Now there are two equivalent forms of the simple arithmetic script above, with the first one taking just slightly longer to execute as it gets translated into the second:

  1. ( 2 + 2 )
  2. expr 2 + 2

but only if carefully minding the spaces since they play the significant role of list separators. The same syntax could be accomplished in Forth, also by creating a left-parenthesis function that reads ahead in the input stream.  At the end of this post, the left parenthesis will be redefined in its Forth context of beginning a comment string, akin to C’s /*.

Finding application in embedded systems

For this kind of elegant extensibility while still maintaining robustness, compact versions of Tcl–just like Forth before it–are now found running the debug shell on a lot of embedded devices.  It gives developers a convenient way to query the state of a product, call underlying C functions in the base firmware, define manufacturing test routines etc., all without having to develop the full overhead of an interpreter themselves.  This is exactly the pretext under in which Tcl (“tool command language” but pronounced “tickle”) was originally developed: it’s better to have professional language designers doing the shell writing and init-file processing, than those whose primary focus is the overall application program and for whom constructing command syntaxes and formatting init files are an often ill-executed distraction from the larger application they’re trying to create.

Forth is a very handy small-footprint language, and probably the only integrated development/operating environment presenting an immediate, text-mode interface relatively unchanged since its brief popularity on 8-bit 80’s systems.  One of the complaints about Forth is that there are so many different “standard” dictionaries that  different Forth dialects often use the same word to mean slightly different things, taking different parameters or returning differently formatted results.  And while Forth’s low overhead becomes less relevant in newly affordable 32-bit microcontroller environments, there are still exceptional recent versions of the language such as FlashForth that cater to 8- and 16-bit microcontrollers.  Studying Mikael Nordman’s creation is a joy, whether to program in or just reading through his source code, and it actually makes even the simplest PIC demo boards instant on-board development platforms for even rather complex programs.

Tcl in Forth

Returning to the theme of replicating Forth or Tcl syntax inside of the other, I briefly took a stab at implementing Tcl’s shell-style (preceded by $) variable substitution in new words “set” and “puts” for Forth as well as supporting bracket-bounded function calls. So after much trial and error I was able to replace typically jumbled Forth statements of the form

variable x 3 x !
variable y x @ y !

with the somewhat more intuitive code (including implicit declaration of variables) borrowed from Tcl:

set x 3
set y $x

or its single-line equivalent,

set y [set x 3]

But this exercise proved not very interesting after I had tested a few such constructs, and again the inconsistent dictionaries among Blazin’ Forth on the Commodore, FlashForth on my Microstick II and gforth on my Linux box served as a reminder that when it comes to portability, Forth is no C.

Fortcl: Forth in Tcl

Lots of people have implemented Forth in Tcl.  I decided to take a crack at it for the principal reason that most implementations I was running across seemed to use global variables, whereas I prefer to stick with a select few list arguments to any function in order to enable better execution tracing and reuse.  I use a global variable in Tcl only to access what would be a global variable in Forth; none are used for the mechanics of the Forth interpreter itself.

Since Ficl (pronounced “fickle”) comes up on a web search as the name of an extant but seemingly unmaintained SourceForge project to develop a command language similar to Tcl in Forth, I’ve called my second exercise for this post Fortcl (pronounced “forticle,” as in a “fortified follicle”).  As with the other Forth-inside-Tcl implementations at the previous link, defining a word defines a Tcl proc of the same name.  Mine just accepts three (possibly null) arguments as input and similarly returns three (possibly null) outputs: a calculation stack called “stack,” a return stack called “rstack” and a list of words in “args” left to process on the current line.  A proc forth2 keeps the process going from one word to the next on a line; it is wrapped in a proc forth that just passes in null stacks and everything typed after it and returns only the stack (thus not returning the “return stack,” apologies for the roundabout nomenclature).

So besides replacing global variables with cascading Tcl lists, what can I say is special about Fortcl?  As in any implementation of Forth, colon definitions (basically a procedure defined by a sequence of characters between a lone colon and a lone semicolon) require a word name followed by one or more list elements, but the hierarchical list support in Tcl means that each list element itself can be a list.  I utilize this fact to allow the definition of a word in terms of either other Fortcl words or (if the first argument is a list rather than a word) in Tcl itself.  So either of the following is an acceptable way of defining the increment ( 1+ ) word once the addition ( + ) word has been defined:

: 1+ 1 + ; # defined in terms of another Forth word ( + )


: 1+ {list [concat [lrange $stack 0 end-1] [expr 1 + [lindex $stack end]]] $rstack $args} ; # defined as a Tcl script since the first (and only) word in the definition list is a list

Note that hash ( # ) after a semicolon starts a comment in Tcl and therefore in Fortcl.  The first is faster to type, the latter faster in execution time.

The left-parenthesis according to Forth

Another advantage of the latter definition of “1+” is that it supports the keyword “immediate” before the semicolon, which forces future definitions that use the word to execute its Tcl script during the definition (compile-time) rather than copying it into themselves for later execution (run-time).  It is good to do this for parenthesis-delimited Forth comments meant for human consumption only, so that they are stripped from any definitions and don’t slow down execution of the word:

: ( {concat [list $stack] [list $rstack] [list [lrange $args 0 [lsearch -exact $args )]-1]]} immediate ;

(_imm is now a defined word, and so as long as the “immediate” keyword is present comments will be removed from any definition.  There is no conflict with any other, non-immediate left-parenthesis proc that may exist, such as the wrapper for Tcl’s built-in proc expr at the top of this post.

Executing Forth words from Tcl

Regardless of which colon-definition mode (Forth or native, non-immediate Tcl) that 1+ was implemented in, execution similarly can take place on a line of Forth after the Tcl interpreter has recognized a call to proc forth:

forth var x 3 x ! x @ 1+ x ! x @
--> 4

or a line of Tcl that doesn’t even call proc forth for stack management, instead using lindex to strip off the two null lists (return stack and args remaining) that follow the answer in the first list (i.e. the resulting stack after the increment):

set x 3 ; set x [lindex [1+ $x] 0]
--> 4

At last, the code

After source-ing the base Fortcl to pick up the forth wrapper and more importantly the definition word ( : ), I have placed all words I’ve implemented to date in a separate, much longer source-able file consisting of only colon definitions that call it. An important note on this latter file is that there are a few ubiquitous Forth words that clash with Tcl keywords and thus had to be renamed in the forth definition:

  1. if (renamed to iff as in “if-forth” or “if and only if,” so use iff…else…then rather than if…else…then)
  2. variable (renamed to var)

Closing thoughts

Tcl is a fond post-adolescent memory for me, by the way.  John Ousterhout, its creator, was the instructor for the first computer science course I took at Berkeley, right around the time that Tcl was taking off. By day he was assigning me C and MIPS assembly homework, but outside of class he was rolling out this flexible interpreter to the world, probably still supporting Magic to some extent and somehow also raising a newborn daughter who as of this new year would be on the verge of turning 22 herself.

Thanks, ouster!



You know we’ve come a long way when it is more expedient to explain to children that a byte is a quantity corresponding to roughly a billionth of another unit far more fundamental to them in their daily lives: the gigabyte. A hundred bytes is an atomic-level scale to the newest generation.  In the spirit of Robby Boey’s Commodore 64 sprite art with his daughters, I wanted to show mine that a small, quickly written program could be quite usable in solving her favorite word game from the daily newspaper.

A crytpoquip is a letter-substitution cipher (a cryptogram) that always reveals some witty observation or pun once every letter has been substituted for another, with never any mapping onto the same letter and not necessarily any reciprocal mappings. These constraints, together with context clues and the somewhat limited set of two- and three- letter words, allows the enthusiast to gradually crack the substitution code, which changes from one cryptoquip to another.  A hint is often given to get started, for example “S equals E.”

Besides the newspaper, it is possible to find solvable cryptograms on the web.  And whether as an algorithms demonstration or targeted at the impatient puzzler, there exist various automated solver programs on the web that utilize a standard (or selectable) language dictionary to remove all subtle insights from the path toward the solution.

Of course my homebrew version is in assembly language, both to get the size down (the hex codes for the loop itself fit on one 40×25 screen) and to illustrate that a  carefully programmed text-based 6502 platform can be at least as responsive as a comparable app running on a smartphone taking up 100,000 times the memory footprint.  In my interface, any character typed on the keyboard appears in the upper left corner of the screen.  The first character pressed shows up there in reverse video, with all instances of it in the cipher (as determined by a nonprinting character in the row below them) also highlighted in reverse.  Here is a screenshot from a puzzle with only one clue, the letter B, remaining:


The answer is obviously that the letter B in this cipher maps onto the letter H in the solution.  Pressing a second key when letters are highlighted equates (by way of printing it in normal video in the row superior) all the instances to that solution letter.  As the final step the comma is mapped onto itself, since punctuation and digits do not typically get mapped as part of the puzzle:puzzle2

Below is loader code that runs on the Commodore 16 series.  It uses the TED-series trick of allocating and then de-allocating a graphics screen to open up 12K of memory underneath BASIC without interrupting the program being run:crypto1-20

To achieve the compact size of 146 bytes, it is necessary to input the puzzle on a blank screen using the Commodore screen editor or as part of the BASIC loader: crypto30-230

Substitution continues in a loop even after all cipher characters have been mapped, since no such check is performed.  Here is the assembly code in crasm’s format.  With limited tweaks it should run under any Commodore BASIC, since it makes use of the standard Kernal routines for screen I/O and the character codes that use bit 7 as a reverse-video flag.  The equivalent C code appears in the comment at the end of the lines:

cpu	6502
page	0,132

	;; start of character codes (screen memory directly manipulated)
	TEDSCR = $0c00		; const TEDSCR = 0x0c00;

	;; space character (both ROM table index and ASCII)
	spc = $20		; const char spc = ' ';

	;; KERNAL jump vectors
	GETIN = $ffe4		; extern char GETIN(void);
	CHOUT = $ffd2		; extern void CHOUT(char);
	PLOT = $fff0		; extern void PLOT(boolean, short&, short&);

	;; 2-byte pointer to "answer" letter and "crypto" letter below it
	leta = $da		; char* leta;
	letc = $dc		; char* letc;

	code			; char cryp(short, short, short);

				; void shll(void) {
shll	lda	#$80		;  unsigned short stack = 0x80;
	pha			;  do {
shl1	clc			;   char a;
	ldx	#$00		;   short x = 0;
	ldy	#$00		;   short y = 0;
	jsr	PLOT		;   PLOT(0,x,y);
shl2	jsr	GETIN		;   do a = GETIN();
	beq	shl2		;   while (a == 0);
	jsr	CHOUT		;   CHOUT(a);
	pla			;
	and	#$80		;   stack &= 0x80; /* get char at (x,y) */
	eor	#$80		;   stack ^= 0x80; /* flip state bit */
	ldx	#$00		;
	ldy	#$00		;
	jsr	cryp		;
	pha			;   stack = cryp(stack, x, y);
	and	#$7f		;
	bne	shl1		;  } while (stack & 0x7f);
	pla			;
	rts			;  return;
				; }

;;; user must pass in pointer (X row>=0, Y col>=0) to set search start location,
;;; with character passed in A, whose MSB indicates whether this character
;;; is for highlighting "crypto" letters (clear) or substituting them (set)

				; short cryp(short a, short x, short y) {
cryp	sta	letc+1		;

	sty	leta		;
	ldy	#$28		;
	lda	#TEDPAG		;
	sta	leta+1		;

	;; add 1 row for each count of x
	clc			; 
xrow	tya			;
	adc	leta		;
	dex			;
	bmi	xro2		;
	bcc	xro1		;
	inc	leta+1		;
xro1	sta	leta		;
	clc			;
	bcc	xrow		;  leta = TEDSCR + x*40 + y;

	;; A and C conveniently tell us the value to go into higher-pointer LSB
xro2	sta	letc		;
	lda	leta+1		;
	adc	#$00		;
	ldy	letc+1		;
	sta	letc+1		;  letc = leta + 40;

	;; if Y (and thus the initial A) was zero, we use character at this pos
	ldx	#$00		;
	tya			;
	and	#$7f		;
	bne	test		;  if ((a & 0x7f) == 0)
	tya			;
	and	#$80		;
	ora	(leta,x)	;
	tay			;   a = (a & 0x80) | *leta;

test	tya			;  for (--letc; ++letc < TEDSCR + 1024; ++leta)
	bmi	tes2		;   if (a >= 0) { /* highlight crypto letters */

tes1	cmp	(leta,x)	;
	bne	tes3		;
	lda	#spc		;
	cmp	(letc,x)	;
	bne	tes3		;    if ((*leta == a) && (*letc == spc)) {
	cmp	(leta,x)	;     if (*leta == spc)
	beq	done		;      break;
	tya			;
	ora	#$80		;
	sta	(leta,x)	;     *leta |= 0x80; /* reverse video */
	bmi	tes3		;    }

tes2	lda	(letc,x)	;
	bpl	tes3		;   } else if (*letc == a) { /* show answers */
	and	#$7f		;
	sta	(letc,x)	;    *letc &= 0x80; /* normal video */
	tya			;
	and	#$7f		;
	sta	(leta,x)	;    *leta = *letc & 0x7f;

tes3	inc	letc		;   }
	bne	tes4		;
	inc	letc+1		;
	lda	letc+1		;
	cmp	#TEDPAG+4	;
	bpl	done		;
tes4	inc	leta		;
	bne	test		;
	inc	leta+1		;
	bne	test		;

done	tya			;
	rts			;  return a;
				; }


Let me begin by saying that this is not the correct way to do multiplication on the 6502. Here is a simple macro that multiplies two unsigned registers X and Y together as a sequence of additions and leaves the result as an unsigned 8-bit quantity in X, whereas any real routine should preserve the 16-bit result. You can still use it of course as long as you’re confident the result won’t exceed 255, either by the caller’s own design or by testing the high-order bits of X and Y. In such cases it may be faster than the canonical method, and takes up only 50 bytes. All memory for storing intermediate results is taken from the stack.

I started this exercise in fact as a way to explore the TSX and TXS instructions of the 6502. The X register is used to index into the hardware stack, which on a 6502 exists from $100 through $1ff. It is constant-time, going through three loops eight times each, but as mentioned previously the upper 8 bits of the result are discarded.

It turns out that with the 6502 having so few registers, pure macro algorithms like this can only barely even exist at all and only with the help of TXS/TSX. By “pure macro” I mean using only registers and the stack; invoking the macro doesn’t require the caller to pass in the address of some allocated variable space to stash the intermediate results.

 ldy #$08
 bne *-4
 lda $0109,x
 ldy #$08
 bcs *+5
 lda #$00
 sta $0101,x
 bne *-12
 ldx #$00
 ldy #$08
 adc $0101,x
 bne *-9

And here’s what it would look like if it were compiled from the same algorithm in C++:

inline void um8(uint8_t& x, uint8_t y) {
 static uint8_t sp, mem[512];
 mem[sp--] = a ; // pha
 mem[sp--] = a = y ; // tya : pha
 a = x ; // txa
 x = sp ; // tsx
 for (y = 8; y > 0; y--){; // ldy #$08
  mem[sp--] = a ; // pha
  x-- ; // dex
  a = (a >= 1) & 255 ; // asl
 } ; // dey : bne *-4
 a = mem[x + 0x109] ; // lda $0109,x
 for (y = 8; y > 0; y--) {; // ldy #$08
  int c = a & 1; a >>= 1 ; // asl
  if (c == 0) { ; // bcs *+5
   mem[sp--] = a ; // pha
   mem[x + 0x101] = a = 0; // lda #$00 : sta $0101,x
   a = mem[++sp] ; // pla
  } ; //
  x++ ; // inx
 } ; // dey : bne *-12
 x = 0 ; // ldx #$00
 for (y = 8; y > 0; y--){; // ldy #$08
  x += (a = mem[++sp]) ; // txa : tsx : clc : adc $0101,x : tax : pla
 } ; // dey : bne *-9
 y = a = mem[++sp] ; // pla : tay
 a = mem[++sp] ; // pla

 return; /* value in x */

In the title I call this routine “in-place” multiplication in the sense that the routine itself is the only memory space used besides the stack. Such an algorithm may find use in an extremely constrained 6502 system, where every byte is precious but a multiplication routine is still needed.

Invoking this macro is as close as it’s possible to get to pretending there’s an actual register-to-register multiply instruction on the 6502:

 ldx #FACTOR1
 ldy #FACTOR2
 um8 ; x contains (FACTOR1*FACTOR2)&255