I just realized that the PBP assembler code is using two-word "goto" instructions for those short branches instead of one-word "bra" instructions. That makes the worst case 20 cycles instead of 18 cycles per iteration through the inner loop (160 cycles per byte instead of 144). Ouch!
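For anyone who wants to check the arithmetic, here is a rough sketch in C. It assumes the inner loop runs once per bit, so 8 iterations per byte, which is what the 160-cycle figure implies. The names `ITERATIONS_PER_BYTE` and `cycles_per_byte` are made up for illustration and are not from PBP.

```c
/* Assumption: the inner loop runs once per bit, i.e. 8 times per byte. */
#define ITERATIONS_PER_BYTE 8u

/* Hypothetical helper: worst-case inner-loop cycles for one byte,
   given the worst-case cycle count of a single loop iteration. */
static unsigned cycles_per_byte(unsigned cycles_per_iteration)
{
    return cycles_per_iteration * ITERATIONS_PER_BYTE;
}
```

Calling `cycles_per_byte(20u)` gives the "goto" figure and `cycles_per_byte(18u)` gives the "bra" figure, so the difference between them is the per-byte cost of the longer branches.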