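Your bit-banged read and write routines use a single @ NOP as a 1 us delay, but you're
running at 20 MHz, so each NOP takes only 200 ns (one instruction cycle is four oscillator
clocks: 4 / 20 MHz = 200 ns). You would need five @ NOPs in a row for a 1 us delay.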
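
As a rough sketch of what the delay would look like, assuming this is PicBasic Pro (where a leading @ inserts one line of assembly) and that the oscillator really is 20 MHz:

```
' Approx. 1 us delay at 20 MHz: 5 instruction cycles x 200 ns each
@ NOP
@ NOP
@ NOP
@ NOP
@ NOP
```

If you ever change the oscillator, the count changes with it: at 4 MHz a single @ NOP is already 1 us, since 4 / 4 MHz = 1 us per instruction cycle.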