Also, the PL (parallel load) signal needs to be high in order to clock data out of the serial data pin, but it has to go low in order to sample the data on the parallel pins. The simplest thing to do is to invert the CE signal and use that to drive PL: while CE is high (chip idle), PL is held low and the register keeps loading whatever is on the parallel pins; when CE goes low, PL goes high and the captured data can be clocked out. That is how I turned the HC165 into a simple SPI-like interface (Clock, Data out, and Chip Enable).
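To make the timing concrete, here is a small behavioural sketch in Python of that wiring. It is a simplified model, not a datasheet-accurate simulation: the class `HC165`, its methods, and `read_byte` are names made up for illustration, the serial input DS is assumed to be tied low, and the host is assumed to sample the data pin before each rising clock edge, because Q7 already shows the top bit as soon as the load ends.

```python
class HC165:
    """Simplified model of a 74HC165 whose /PL pin is driven by inverted /CE."""

    def __init__(self):
        self.parallel = 0  # levels currently on the parallel pins D0..D7
        self.reg = 0       # internal 8-bit shift register
        self.ce_n = 1      # /CE idles high (chip not selected)
        self.clk = 0
        self._update_load()

    def _update_load(self):
        # Inverter: /PL = NOT(/CE). /CE high -> /PL low -> asynchronous load.
        pl_n = 0 if self.ce_n else 1
        if pl_n == 0:
            self.reg = self.parallel & 0xFF

    def set_parallel(self, value):
        self.parallel = value & 0xFF
        self._update_load()  # register follows the pins while /PL is low

    def set_ce_n(self, level):
        self.ce_n = level
        self._update_load()

    def set_clk(self, level):
        rising = self.clk == 0 and level == 1
        self.clk = level
        # Shift only on a rising edge while /CE is low (so /PL is high).
        if rising and self.ce_n == 0:
            self.reg = (self.reg << 1) & 0xFF  # DS assumed tied low

    @property
    def q7(self):
        # Serial data out is the top bit of the register.
        return (self.reg >> 7) & 1


def read_byte(chip):
    """Bit-bang one byte out of the chip, MSB (D7) first."""
    chip.set_ce_n(0)  # select: /PL goes high, inputs are frozen in the register
    value = 0
    for _ in range(8):
        value = (value << 1) | chip.q7  # sample before the clock edge
        chip.set_clk(1)
        chip.set_clk(0)
    chip.set_ce_n(1)  # deselect: /PL goes low, register reloads from the pins
    return value


chip = HC165()
chip.set_parallel(0xA5)
print(hex(read_byte(chip)))
```

With a real SPI peripheral the same idea applies, but the mode has to be chosen so the first bit is sampled before the first shifting edge; which mode that is depends on the controller, so check it against the hardware.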