Hot on the heels of my last post, which sat in draft mode for too long until I finally (on a plane) counted the instruction cycles, comes a second hand-optimization of a C function into assembly language.  On the same trip I came across another interesting CS problem from HackerRank that reduces to a handful (20 or fewer) of Thumb assembly instructions.  My original C solution was much longer and more involved, until a series of insights reduced it to a single if-then-else test followed immediately by evaluating one of three expressions (based on a register being negative, zero or positive).  The result is a string-pointer return value in r0 that is either “YES” or “NO”.

This solution particularly elegantly illustrates two aspects of assembly optimization by hand:

  • Reducing the number of inputs as quickly as possible in order to avoid using unclobberable registers (r4 through r11 for ARM, which according to the ATPCS must be preserved across function calls)
  • Setting the condition flags as a result of performing necessary register manipulations, which partly involves ordering those instructions so the flag values are already set at the moment a branch must be decided.

But this is getting ahead of ourselves.  First let’s step back from native assembly language nuance to read the problem and review the C solution that HackerRank wants the coder to obtain.

Kangaroo problem description

Two kangaroo avatars are somewhere (maybe at the same integer) on a number line, and each will jump a certain (maybe the same) positive integer distance toward +∞ at the same instant, over and over again.  Write a function that, given the initial position and jump distance of each kangaroo, returns “YES” if they will ever land at the same point on the number line at the same time, i.e. after the nth jump for some value of n their position is the same, and “NO” otherwise.

The intended approach, especially if you’re in a job interview, is not a jump-by-jump simulation that runs until one of the kangaroos passes the other.  That naive approach is tempting because it can be obtained quickly by directly converting the problem statement into code:

  1. Set up a loop to increment the initial position of each kangaroo by its jump distance,
  2. Compare for equality (in which case we output “YES” then exit), otherwise
  3. Apply a slightly more complicated test to figure out whether to output “NO” or continue the next iteration of the loop since the inter-kangaroo distance is shrinking.
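
For reference, the three steps above might be sketched in C roughly as follows.  The name kangaroo_naive is hypothetical, and the “slightly more complicated test” is assumed here to be a comparison of the gap before and after each jump.  Positions are widened to long long only to make overflow less likely for modest inputs; that does not eliminate it.

```c
#include <stdlib.h>

/* Hypothetical name: a sketch of the naive jump-by-jump simulation. */
const char *kangaroo_naive(int x1, int v1, int x2, int v2) {
    /* Wider type postpones overflow for modest inputs; it is not a real fix. */
    long long p1 = x1, p2 = x2;

    for (;;) {
        long long gap_before = llabs(p2 - p1);

        /* Step 1: both kangaroos jump at the same instant. */
        p1 += v1;
        p2 += v2;

        /* Step 2: same spot after the same number of jumps. */
        if (p1 == p2)
            return "YES";

        /* Step 3 (assumed test): stop once the gap is no longer shrinking. */
        if (llabs(p2 - p1) >= gap_before)
            return "NO";
    }
}
```

With a closing speed of one unit and a large initial gap, this loop runs once per unit of distance, which is exactly the inefficiency an interviewer is looking for.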

Before starting down this path, as with all interview questions, pause before you begin to write.  Try to get inside the head of your interviewer, and above all: talk about your insights into the problem as they occur to you.

Why Kangaroo is a good interview problem

Interviewers will see the naive simulation as evidence of mastery of the programming language claimed on the resume but, if “turned in” as the solution, also as evidence of a lack of insight into the nature of computational problems.  Peers may see a hasty approach and fear code they’ll need to rewrite because it uses resources inefficiently or just plain ignores corner cases: If the jump distance of each kangaroo is the same, do you avoid an infinite loop that waits in vain for the laggard to catch up?  If the code is extended to a machine integer size able to represent an initial distance from kangaroo 1 to kangaroo 2 greater than the number of atoms on Earth, and the jump delta is just one unit, does your code overflow undetected or just take far longer than necessary to execute?

The first insight is to realize that the four inputs (two x positions and two v jump distances) can be immediately boiled down to just two quantities: the initial difference d between kangaroo 1 and kangaroo 2, and the catch-up distance c between kangaroo 1 and kangaroo 2 on each jump.  It is useful in the interest of compact code to expect the parameters of the “aft” kangaroo (the one starting closer to -∞ and trying to catch up) to be passed in x1 and v1.  But both c and d are signed quantities, so a simple test for the sign of d allows us to effectively swap the kangaroo parameters by forcing d positive and ensuring that the sign of c is in agreement with whether the “aft” kangaroo is catching up (c > 0) or falling behind (c < 0).  A third possibility is that the distance between them is constant, in which case whether they collide at the end of one jump is a function of whether they started at the same position.
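
Spelled out as a short derivation (using the d and c defined above, with n the number of jumps, assumed here to be a nonnegative integer), a meeting after n jumps requires:

```latex
\begin{aligned}
x_1 + n\,v_1 &= x_2 + n\,v_2 \\
n\,(v_1 - v_2) &= x_2 - x_1 \\
n\,c &= d
\end{aligned}
```

With d forced nonnegative, this leaves three cases.  If c > 0, the kangaroos meet only when n = d / c is an integer, i.e. when c divides d evenly.  If c = 0, the equation reduces to d = 0.  If c < 0 and d > 0, no nonnegative n satisfies it.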

Here is the compact solution in C:

// C submission for
char* kangaroo(int x1, int v1, int x2, int v2) {
 static char* answer_yes = "YES";
 static char* answer_no = "NO";

 int d = x2 - x1; // dEFICIT of aft kangaroo behind fore kangaroo 
 int c = v1 - v2; // cLOSING in distance between kangaroos per jump

 if (d < 0) { // swap kangaroos since x1 > x2
  d = -d; // d is now nonnegative
  c = -c; // c sign must be flipped too
 }

 if (c < 0) // aft kangaroo gets farther away with each jump, not closer
  return answer_no;

 else if (c == 0) // kangaroos jump same distance, answer is their initial match
  return (d == 0) ? answer_yes : answer_no;

 else // aft kangaroo hits fore kangaroo iff c (positive) divides evenly into d
  return (d % c == 0) ? answer_yes : answer_no;
}

ARM implementation

On just about every modern architecture the majority of registers must be preserved by a called function, in the interest of low-overhead function calls.  Optimizing compilers will try to avoid using those registers, not only because of the stack hit but also because the flexibility to return to the caller prematurely (before the closing brace of the function body) is lost.

Register usage

As seen from their order in the function declaration, the parameters x1 and x2 arrive at kangaroo() in the even argument registers r0 and r2 respectively, while the parameters v1 and v2 arrive in the odd argument registers r1 and r3 respectively.  As the first step we’ll place c in r1 by subtracting r3 from r1 (neither value is needed again), and d in r2 by subtracting r0 from r2 (ditto).

The initial guess of the return value placed in r0 is “answer_no”, since a Monte Carlo analysis shows that a kangaroo pair with random parameters is unlikely ever to collide.  As soon as a collision is known to be impossible, a “mov pc,lr” gets us out of the function with this correct return value.  If we do detect a kangaroo collision we’ll change r0 to “answer_yes” right before exiting.

Lastly r3 is available for calculating a remainder, in one of two ways according to whether the architecture supports hardware division.  No higher registers are needed, a good outcome since they would not only have to be saved on the stack but also restored at a common exit point that would result in more branching.
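
As a rough C model of those two ways of testing the remainder (the helper names are hypothetical, and both assume c > 0 and d > 0, i.e. the state after any swap):

```c
#include <stdbool.h>

/* Hypothetical helper: hardware-divide flavour.
 * Quotient, multiply back, subtract; the remainder is what is left over. */
bool divides_with_udiv(unsigned d, unsigned c) {
    unsigned q = d / c;      /* models a hardware unsigned divide */
    unsigned r = d - q * c;  /* remainder = d - (d / c) * c */
    return r == 0;
}

/* Hypothetical helper: no hardware divider.
 * Keep subtracting c from d; hitting exactly zero means c divides d,
 * going negative means it does not. Assumes d > 0 and c > 0. */
bool divides_by_subtraction(int d, int c) {
    for (;;) {
        d -= c;
        if (d < 0)
            return false;
        if (d == 0)
            return true;
    }
}
```

The subtraction loop takes roughly d / c iterations, so it is only attractive when no divider is available.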

Condition flags

The reason for calculating c before d wasn’t alphabetical; the first decision that must be made is whether d is negative, indicating the parameters for the “fore” and “aft” kangaroo got swapped on the call.  The N flag will be set automatically by the subtraction that sets d, so it is not necessary to do a separate compare as long as c isn’t set after d (as the C code would suggest is possible).

The code

Putting the above principles into action gives an implementation for kangaroo() that cannot be beaten by a compiler on either size or execution speed.  For compactness the answer_no and answer_yes string addresses used by the ldr pseudo-instruction are left undeclared in the snippet below.

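PUBLIC __kangaroo

__kangaroo:    ;                 //r0    //r1    //r2    //r3
 subs r1,r1,r3 ;char* kangaroo(int x1, int v1, int x2, int v2);
 subs r2,r2,r0 ; d = x2 - x1, sets N flag if d < 0
 ldr  r0,=answer_no ; initial guess (ldr leaves the flags alone)
 bpl  no_kangaroo_swap
 rsbs r2,r2,#0 ; d = -d
 rsbs r1,r1,#0 ; c = -c
no_kangaroo_swap:
 cmp  r1,#0
 bpl  c_nonnegative 
 mov  pc,lr ; c < 0: return "NO"
c_nonnegative:
 bne  c_nonzero
#if defined(ARMv7M) || defined(ARMv8M)
 cbnz r2,c_zero_return
#else
 cmp  r2,#0 
 bne  c_zero_return
#endif
 ldr  r0,=answer_yes ; c == 0 and d == 0
c_zero_return:
 mov  pc,lr 
c_nonzero:
#if defined(ARMv7M)
 udiv r3,r2,r1 ; r3 = d / c
 mul  r3,r1,r3 ; r3 = c * (d / c)
 subs r3,r2,r3 ; r3 = d % c
 bne  c_pos_return 
#else
divide_loop:
 subs r2,r2,r1 ; repeated subtraction in place of division
 bmi  c_pos_return
 bne  divide_loop
#endif
 ldr  r0,=answer_yes ; c divides d evenly
c_pos_return:
 mov  pc,lr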

There are of course more optimizations possible to avoid division in specific cases, but they are only worth considering if those cases are perceived as likely to occur.  For example, a jump delta c of 1 will always result in a YES answer when d > 0.  And a jump delta c of 2 will always result in a YES answer when the LSB of d is 0.  The temptation to make the function grow to the point of unintelligibility must be avoided.
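
Purely to illustrate what such special-casing would look like (and why it adds bulk), here is a minimal C sketch.  The helper name catches_up is hypothetical, and it assumes the swap has already happened so that d > 0 and c > 0.

```c
#include <stdbool.h>

/* Hypothetical helper: true when the aft kangaroo lands exactly on the
 * fore kangaroo. Assumes d > 0 and c > 0. */
bool catches_up(unsigned d, unsigned c) {
    if (c == 1)                 /* closing one unit per jump always lands on d */
        return true;
    if (c == 2)                 /* closing two units per jump needs an even d */
        return (d & 1u) == 0;   /* test the LSB instead of dividing */
    return d % c == 0;          /* general case still needs the remainder */
}
```

Each shortcut costs a compare and a branch on every call, so it only pays off when the special case is common.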

Note that the compact instructions CBZ and CBNZ are back in the low-end cores effective with the ARMv8-M Baseline specification, meaning they’ll be there on the Cortex-M23 (where this code will save an instruction cycle and two bytes).