if you were to divide an acceleration move into a number of equally timed blocks,the number of blocks is high enough to ensure that the travel (encoder clicks?) for each block is always < 255 . each block would represent a relative target position based on required acceleration ,it will obviously turn out that the final block(the remainder) would violate the strict acceleration figure but it would always result in slightly lower acceleration ,i doubt that would matter much position being the more inportant criteria.
these blocks could be stored in an byte array , the setpoint for the pid routine would be incremented/decremented by the block travel[] at the block rate , and the preformance could be monitored by checking current position vs start pos+accumulated offsets at each block increment.
as long as the block duration is > than the pid update rate and you have enough memory for a large enough array , at worst the acceleration move could be divided into variable block rate stages to make things more manageable .