A few years ago I picked up an excellent 64-bit assembly programming tutorial on Amazon (covering the AMD64 architecture, a.k.a. x86_64), which has since gone out of print. Written by a university professor, it captures that exact TeX feel of the college handouts from my formative semester of MIPS programming in the early ’90s.

Careful formulation of a problem around an instruction-set architecture (ISA) can just about guarantee beating gcc’s optimizations, so I knew I wanted to use register modes that don’t map nicely onto C code. The ability to manipulate eight-byte words directly was too tempting to pass up. What could I do that combined this new register width with Intel holdovers from the 1970s, such as direct access (via registers AH, BH, etc.) to the lately pointless second-lowest byte of a quadword?

I decided to count how many quadword-aligned memory locations hold 8 identical bytes. In other words, for each aligned quadword we fetch, we check whether it consists of a repeating byte pattern (such as 0x2b2b2b2b2b2b2b2b), and if so we increment a counter before moving on to the next 8 bytes.

Furthermore, I decided to repeat this quadword test on a large scale, in fact on a chunk larger than would ever fit in a 32-bit address space. A 4GB .bss segment is sufficient to prove we’re running on a 64-bit machine. A gcc -O2 C implementation of the check function churns through this 32-bit-sized memory space in a fairly consistent 1.8 seconds in a virtual machine on my i7 laptop:

typedef union {
  unsigned long q;
  unsigned int d[2];
  unsigned short h[4];
  unsigned char b[8];
} flexible_t;

unsigned long check(flexible_t x) {
  return (x.d[0] == x.d[1]) && (x.h[0] == x.h[1]) && (x.b[0] == x.b[1]);
}

However, gcc can only muster enough insight into what is happening to squeeze this function down to:

mov %rdi,%rdx
xor %eax,%eax
shr $0x20,%rdx
cmp %edx,%edi
je  check2
check1:
repz retq
check2:
mov %rdi,%rdx
shr $0x10,%rdx
cmp %di,%dx
jne check1
mov %edi,%edx
movzbl %dh,%eax
cmp %dil,%al
sete %al
retq
nopl 0x0(%rax,%rax,1)

So what’s the quickest AMD64 instruction sequence that checks a quadword for byte consistency? Much shorter than gcc’s effort, as it turns out, since the C formulation of the function doesn’t give the compiler much insight into the most suitable x86 instructions for the task.

My 4GB search takes the same laptop just 1.5s, a full 17% faster, since the tight loop has now been shortened by a few instructions. Ray Seyfarth uses yasm for his course, so the code builds with the line given in the top comment:

;;; ../yasm -f elf64 -g dwarf2 -l all8same.lst all8same.asm; gcc -o all8same all8same.o

 segment .bss
array resq 0x20000000 ; half a billion quadwords (4GB): more .bss than any 32-bit process could map
;;; check() returns a binary value (in rax):
;;; 1 if all eight bytes of the 64-bit input (passed in rdi) are identical,
;;; 0 otherwise
 segment .text
check:             ;uint64_t check(uint64_t rdi) {
 xor     esi,esi   ; rsi = 0;
 mov     eax,1     ; rax = 1;
 mov     rdx,rdi   ; rdx = rdi;
 cmp     dl,dh     ; if ((rdx & 0xff) != ((rdx & 0xff00)>>8))
 cmovne  eax,esi   ;  rax = rsi;
 rol     edx,16    ; rdx = ((rdx & 0xffff0000)>>16)|((rdx & 0x0000ffff)<<16);
 xor     edx,edi   ; if (((rdx & 0xffffffff) ^= (rdi & 0xffffffff)) != 0)
 cmovnz  eax,esi   ;  rax = rsi;
 mov     rdx,rdi   ; rdx = rdi;
 rol     rdx,32    ; rdx = ((rdx & 0xffffffff00000000)>>32)|((rdx & 0x00000000ffffffff)<<32);
 xor     rdx,rdi   ; if ((rdx ^= rdi) != 0)
 cmovnz  eax,esi   ;  rax = rsi;
 ret               ; return rax;
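The cmov sequence translates almost mechanically into C, which is handy for convincing yourself that three comparisons really cover all eight bytes. This is a reference model of my own, not code from the course; each clause mirrors one compare in the listing above:

```c
#include <stdint.h>

/* Reference model of the cmov-based check():
 * byte0 == byte1, word0 == word1, dword0 == dword1
 * together force all eight bytes to be identical. */
uint64_t check_ref(uint64_t x) {
  uint64_t ok = 1;
  if ((uint8_t)x != (uint8_t)(x >> 8)) /* cmp dl,dh ; cmovne */
    ok = 0;
  uint32_t lo = (uint32_t)x;
  if (((lo << 16) | (lo >> 16)) != lo) /* rol edx,16 ; xor edx,edi */
    ok = 0;
  if (((x << 32) | (x >> 32)) != x)    /* rol rdx,32 ; xor rdx,rdi */
    ok = 0;
  return ok;
}
```

The chain works because each equality doubles the width of the known-repeating prefix: equal low bytes make a repeating word, equal low words make a repeating dword, and equal dwords make a repeating quadword.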
 global main
main:               ;; main() entry point
 push rbp
 mov rbp,rsp
 push rbx            ; rbx and r13 are callee-saved,
 push r13            ; so preserve them for our caller
 mov r10,array       ; uint64_t* r10 = array;
 xor r13d,r13d       ; int r13d = 0;
 xor ebx,ebx         ; for (int ebx = 0;
forloop:
 cmp ebx,0x20000000  ;      ebx < 0x20000000;
 jge done            ;      ebx++)
 lea rcx,[r10+rbx*8] ;
 mov rdi,[rcx]       ;
 call check          ;
 add r13d,eax        ;   r13d += check(array[ebx]);
 inc ebx             ;
 jmp forloop         ;
done:               ;; main() exit point
 mov eax,r13d        ; return r13d;
 pop r13
 pop rbx
 pop rbp
 ret

Further helping the pipeline, the check() function contains no branches or jumps, only conditional moves, so it executes in constant time and no branch prediction is needed.
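As a closing aside of my own (not from the course material), there is an even terser formulation I didn’t use above: a quadword consists of one repeating byte exactly when it equals itself rotated by eight bits, since the rotation forces every byte to equal its neighbor. Modern compilers recognize the shift idiom and emit a single rotate plus a compare:

```c
#include <stdint.h>

/* x consists of eight identical bytes iff rotating x left by 8 bits
 * leaves it unchanged: byte i must equal byte (i+1) mod 8 for all i. */
int all_bytes_same(uint64_t x) {
  return x == ((x << 8) | (x >> 56));
}
```

With gcc -O2 this compiles down to roughly a rol, a cmp, and a sete, which is shorter still than the three-stage cmov ladder.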