I made that exact mistake initially, hence the comment where the DIR variable is declared - yet I missed it when looking at you code...

I wonder why reading the pins individually doesn't work in your case. I only did it that way because CuriousOne wanted to be able to have the signals placed arbitrarily. I suspect though that reading them individually might be faster than masking and shifting.

I've been playing around with the math to go from encoder counts to actual frequency and I'm down to 110 cycles for calculating and stuffing the bits into the array. Had to resort to a tiny bit of ASM, getting the high word of a 16*16bit multiplication back from PBP without resorting to actually using LONGs.