PDA

View Full Version : Optimizing CRC16 routine.



HenrikOlsson
- 16th January 2011, 16:03
Hi everyone,
This is kind of a doublepost as I originally put this question as a response in this thread (http://www.picbasic.co.uk/forum/showthread.php?t=11790) without noticing the thread was in the wishlist section of the forum. Hopefully it gets a bit more attention here.

Allright, I'm working a MODBUS stack for PBP. I found a CRC routine (I think it was in another thread here) which I've modified slightly and it works nicely. The problem is that it is, or seems, quite slow. Here's the current code:

MB_CRC = 65535
MB_k = MB_Buffer_Ptr - 3 ' Find out where in the array to stop.
FOR MB_i = 0 to MB_k ' Iterate thru the frame, stop before the last two bytes which are the actual CRC.
MB_CRC = MB_CRC ^ MB_Buffer[MB_i] ' Bitwise EXOR the current low byte of the CRC "value" with the byte in the buffer

For MB_j = 0 to 7 ' Cycle thru the low byte of the CRC value
If MB_CRC.Bit0 THEN ' If the LSB is set
MB_CRC = MB_CRC >> 1 ' We shift CRC on bit to the right....
MB_CRC = $A001 ^ MB_CRC ' and EXOR it with out polynomial for MODBUS CRC-16 ($A001)
ELSE
MB_CRC = MB_CRC >> 1 ' If the LSB is not set we simply shift CRC one bit to the right.
ENDIF
Next ' Next bit in the byte
NEXT ' Next byte in the frame
RETURN


This thing takes 1730 cycles, including a GOSUB and RETURN, to calculate the CRC value for 9 bytes (123456789). In the tread I linked to above Darrel has a, what looks to me, very similar routined which is said to do it in 825 cycles. For reference, here's Darrels code:

CRC VAR WORD
CRC_IN VAR BYTE
Idx VAR BYTE
CRC_Len CON 16
CRC_Poly CON $1021 ; x16 + x12 + x5 + 1
;-----CRC-16 -----------------------------------------------------
CRC16:
CRC.HighByte = CRC.HighByte ^ CRC_IN
FOR Idx = 0 to 7
IF CRC.0(CRC_Len-1) THEN
CRC = (CRC << 1) ^ CRC_Poly
ELSE
CRC = CRC << 1
ENDIF
NEXT Idx
RETURN
As you can see it's very simmilar, it uses another polynominal (which I don't think should make any difference (?)) and it shifts to the left instead of the right so it's another kind of CRC than what I need or I'd use it directly.

Does anyone have any idea why mine is so much slower than Darrels and/or what I might do to speed it up? MODBUS frames can get quite large and currently it takes more than 12000 cycles to go thru a frame of 67 bytes.

Thanks!
/Henrik.

cncmachineguy
- 16th January 2011, 16:21
Not saying I get this whole CRC thing, just looking at the 2 snippits, the biggest thing I see is you have a nested loop and Darrel doesn't. But that seems like it would change things by more than double. (approx difference between the 2)

Try using his poly instead of yours in your code to see if that changes your timing. I don't know if you can just do that, but if so it will eliminate that as a factor.

HenrikOlsson
- 16th January 2011, 16:45
Hi Bert,
I tested with $1021 instead of $A001 as the poly and it changed from 1730 to 1682 cycles - still nowhere nere ~800.

Yes, mine iterates thru the array of 9 bytes (in this case) in one go while Darrels would have to be called 9 times with a new byte in CRC_In each time. I don't think that is what makes it that much slower.

I'm new to this CRC thing as well and don't yet fully understand it enough to try to come with a completely different aproach.

Thanks!
/Henrik.

cncmachineguy
- 16th January 2011, 17:03
What if you combine these 2 lines:


MB_CRC = MB_CRC >> 1
MB_CRC = $A001 ^ MB_CRC

to be this:


MB_CRC = (MB_CRC >> 1) ^ $A001 ; or try DT's poly here just for consistancy


This would kinda make sense to me as why yours might be ~twice as long, PBP may be creating double code to do it in 2 steps.

HenrikOlsson
- 16th January 2011, 18:10
Hi,
Thank you for your ideas, I appreciate it.

Unfortunately it's the opposite of what you think. Originally I did have those two lines combined but splitting the statement into two increased the performance by ~9% (it went from 1894 to 1730 cycles for the 9 test-bytes) it also saved me 8 bytes of program space - which is secondary but still nice.

/Henrik.

Mike, K8LH
- 16th January 2011, 18:14
Greetings Henrik,

Is there a chance you have a lot of overhead associated with array handling? Have you looked at the assembly language generated by that routine?

I checked out the inner loop using BoostC and found worst case cycle times of 123 cycles in a "for/next" loop and 100 cycles in a "do/while" loop.

Cheerful regards, Mike


crcword ^= work; // 123 cycles (BoostC)
for(char i=0; i<8; i++) //
{ crcword >>= 1; //
if(status.C) //
crcword ^= 0x0A01; //
}


crcword ^= work; i = 0; // 100 cycles (BoostC)
do { //
crcword >>= 1; //
if(status.C) //
crcword ^= 0x0A01; //
i++; //
} while(i.3 == 0); //

HenrikOlsson
- 16th January 2011, 18:33
Hi Mike,
I'm using TMR1 to measure the execution time of the routine. I just moved the starting and stopping of the timer so it's around the inner FOR-NEXT loop. In other words the timer is started before the inner FOR and stopped after the inner NEXT effectively excluding the array handling of the outer FOR-NEXT loop from the timing.

This time I got 1533 cycles.

So, as far as I can see it is the inner loop that is eating away the time and I still don't understand why. I'll try a DO-WHILE construct instead of the FOR-NEXT and see what I'll get.

Thanks!
/Henrik.

Mike, K8LH
- 16th January 2011, 18:46
... <snip> ... I'm using TMR1 to measure the execution time of the routine ... <snip> ...
Could that be a problem? Do you have anything similar to the MPLAB Simulator Stopwatch?


... <snip>... This time I got 1533 cycles ... <snip> ...

My goodness, that doesn't sound quite right. That's close to what you got when you were processing nine bytes. Could there be a problem with the method you're using to count cycles?

I would expect you to get code similar to the assembler code generated from my BoostC "Do/While" loop;


crcword ^= work; i = 8; //
0040 0845 MOVF crc16_00000_arg_work, W
0041 06B7 XORWF gbl_crcword, F
0042 3008 MOVLW 0x08
0043 00C6 MOVWF crc16_00000_1_i

do { //
0044 label5

crcword >>= 1; //
0044 36B8 LSRF gbl_crcword+D'1', F
0045 0CB7 RRF gbl_crcword, F

if(status.C) //
0046 1C03 BTFSS gbl_status,0
0047 284C GOTO label6
004C label6

crcword ^= 0x0A01; //
0048 3001 MOVLW 0x01
0049 06B7 XORWF gbl_crcword, F
004A 300A MOVLW 0x0A
004B 06B8 XORWF gbl_crcword+D'1', F

} while(--i); //
004C 03C6 DECF crc16_00000_1_i, F
004D 1D03 BTFSS STATUS,Z
004E 2844 GOTO label5
}

HenrikOlsson
- 16th January 2011, 18:59
Hi Mike,
I tried a DO-WHILE construct and I'm quite surprised by the results. I moved the starting and stopping back to where I had it originally and with the DO-WHILE I get 1496 cycles instead of 1730 for the 9 test-bytes. A 67 byte array, previously taking ~12600 cycles now executes in ~10800. Getting there....

Here's the current ASM code:

0012A0 M _MB_CRC_16
0012A0 6831 M setf _MB_CRC
0012A2 6832 M setf (_MB_CRC) + 1
0012A4 0E03 M movlw 003h
0012A6 5C3F M subwf _MB_Buffer_Ptr, W
0012A8 6E42 M movwf _MB_k
0012AA 6A40 M clrf _MB_i
0012AC M L00108
0012AC 0004 M clrwdt
0012AE 5040 M movf _MB_i, W
0012B0 5C42 M subwf _MB_k, W
0012B2 A0D8 M btfss STATUS, C
0012B4 EF80 F009 M goto L00109
0012B8 5040 M movf _MB_i, W
0012BA 0F56 M addlw low (_MB_Buffer)
0012BC 6EE9 M movwf FSR0L
0012BE 0E01 M movlw (_MB_Buffer) >> 8
0012C0 6AEA M clrf FSR0H
0012C2 22EA M addwfc FSR0H, F
0012C4 CFEF F013 M movff INDF0, T1
0012C8 5013 M movf T1, W
0012CA 1A31 M xorwf _MB_CRC, F
0012CC 6A41 M clrf _MB_j
0012CE M L00110
0012CE 0004 M clrwdt
0012D0 B641 M btfsc _MB_j, 003h
0012D2 EF7D F009 M goto L00111
0012D6 0004 M clrwdt
0012D8 A031 M btfss _MB_CRC, 000h
0012DA EF78 F009 M goto L00112
0012DE 90D8 M bcf STATUS, C
0012E0 3232 M rrcf _MB_CRC + 1, F
0012E2 3231 M rrcf _MB_CRC, F
0012E4 0E01 M movlw low (0A001h)
0012E6 1A31 M xorwf _MB_CRC, F
0012E8 0EA0 M movlw (0A001h) >> 8
0012EA 1A32 M xorwf _MB_CRC + 1, F
0012EC EF7B F009 M goto L00113
0012F0 M L00112
0012F0 90D8 M bcf STATUS, C
0012F2 3232 M rrcf _MB_CRC + 1, F
0012F4 3231 M rrcf _MB_CRC, F
0012F6 M L00113
0012F6 2A41 M incf _MB_j, F
0012F8 D7EA M bra L00110
0012FA M L00111
0012FA 2A40 M incf _MB_i, F
0012FC A0D8 M btfss STATUS, C
0012FE D7D6 M bra L00108
001300 M L00109
001300 0012 M return

/Henrik.

Mike, K8LH
- 16th January 2011, 19:25
The inner loop starts at label "L00110" and uses about 18 instruction cycles (worst case) each loop iteration for a whopping total of right around 144 cycles per byte processed.

I have not check the rest of the code. I must get back to my studies.

Cheerful regards, Mike

cncmachineguy
- 16th January 2011, 19:54
Hi Henrik,
here is the snippit I am testing:


Main: 'DO WHATEVER YOU WANT HERE
tmr1=0
t1con = 1
For MB_j = 0 to 7 ' Cycle thru the low byte of the CRC value
If MB_CRC.Bit0 THEN ' If the LSB is set
MB_CRC = MB_CRC >> 1 ' We shift CRC on bit to the right....
MB_CRC = $A001 ^ MB_CRC ' and EXOR it with out polynomial for MODBUS CRC-16 ($A001)
ELSE
MB_CRC = MB_CRC >> 1 ' If the LSB is not set we simply shift CRC one bit to the right.
ENDIF
Next ' Next bit in the byte
t1con = 0
stopnow:
goto stopnow



This takes 169 cycles according to TMR1 setup as you see here. I have tried with both $A001 and $1021. Both are the same. Does this fall in line with what you have?

BTW, MPLAB stopwatch reports the same

Next I think I will try out DT's snippet

Mike, K8LH
- 16th January 2011, 20:50
I just realized that PBP assembler code is using two word "goto" instructions for those short branches instead of one word "bra" instructions. That makes it worst case 20 cycles instead of 18 cycles per interation through the inner loop (160 cycles per byte). Ouch!

HenrikOlsson
- 17th January 2011, 06:14
Bert,
That is exactly like I had it when trying to exclude the outer array handling loop from the timing. I ran 9 bytes thru and got ~1530 cycles. 1530/9=170 cycles per byte and your test shows 169 - so that matches pretty well. I DID get a slight difference when I changed the poly and I guess that depends on the actual value of the byte going thru the routine, if you run the 9 byte sequence 123456789 thru it you should see a slight difference between the two polynomial values.

Mike,
So obviously BoostC compiles this to faster code but doesn't the BoostC compiled code you posted use GOTO as well?

Question is why there's such a difference between Darrel's code and the one I'm using. The poly itself does make a slight difference, I've verified that but it's still way off. I haven't actually verified the claimed 825us (I've no reason to doubt it) but I'll do that tonight just to make sure.

Thanks a lot guys, I appreciate the help!
/Henrik.

Ioannis
- 17th January 2011, 12:15
Hi Henrik.

How did you declare MB_i variable? Byte or Word?

Also instead of this:



MB_CRC = 65535
MB_k = MB_Buffer_Ptr - 3 ' Find out where in the array to stop.
FOR MB_i = 0 to MB_k ' Iterate thru the frame, stop before the last two bytes which are the actual CRC.
MB_CRC = MB_CRC ^ MB_Buffer[MB_i] ' Bitwise EXOR the current low byte of the CRC "value" with the byte in the buffer

For MB_j = 0 to 7 ' Cycle thru the low byte of the CRC value
If MB_CRC.Bit0 THEN ' If the LSB is set
MB_CRC = MB_CRC >> 1 ' We shift CRC on bit to the right....
MB_CRC = $A001 ^ MB_CRC ' and EXOR it with out polynomial for MODBUS CRC-16 ($A001)
ELSE
MB_CRC = MB_CRC >> 1 ' If the LSB is not set we simply shift CRC one bit to the right.
ENDIF
Next ' Next bit in the byte
NEXT ' Next byte in the frame
RETURN


try this one:



MB_CRC = 65535
MB_k = MB_Buffer_Ptr - 3 ' Find out where in the array to stop.
FOR MB_i = 0 to MB_k ' Iterate thru the frame, stop before the last two bytes which are the actual CRC.
MB_CRC = MB_CRC ^ MB_Buffer[MB_i] ' Bitwise EXOR the current low byte of the CRC "value" with the byte in the buffer

For MB_j = 0 to 7 ' Cycle thru the low byte of the CRC value
MB_CRC = MB_CRC >> 1
If MB_CRC.Bit0 THEN MB_CRC = $A001 ^ MB_CRC ' If the LSB is set
' and EXOR it with out polynomial for MODBUS CRC-16 ($A001)
Next ' Next bit in the byte
NEXT ' Next byte in the frame
RETURN


Ioannis

Mike, K8LH
- 17th January 2011, 13:58
Mike,
So obviously BoostC compiles this to faster code but doesn't the BoostC compiled code you posted use GOTO as well?

I posted 16F1827 code. A slightly unfair comparison because 18F devices don't have a "lsrf" instruction.

I didn't realize you were using an 18F until you posted your assembler output, and so I tested the following code which is similar to yours, for an 18F in BoostC;


MB_CRC ^= work; //
for(MB_j=0; MB_j<8; MB_j++) //
{ if(MB_CRC.0) //
{ MB_CRC >>= 1; //
MB_CRC ^= 0xA001; //
}
else //
{ MB_CRC >>= 1; //
}
} // Next bit in the byte

Here's the BoostC output which uses "bra" instructions and runs in about 115 cycles (worst case);


0008 label1
0008 500A MOVF gbl_work, W
000A 1A05 XORWF gbl_MB_CRC, F
000C 6A09 CLRF gbl_MB_j
000E label2
000E 0E08 MOVLW 0x08
0010 6009 CPFSLT gbl_MB_j
0012 D7FA BRA label1
0014 A005 BTFSS gbl_MB_CRC,0
0016 D008 BRA label3
0018 90D8 BCF STATUS,C
001A 3206 RRCF gbl_MB_CRC+D'1', F
001C 3205 RRCF gbl_MB_CRC, F
001E 0E01 MOVLW 0x01
0020 1A05 XORWF gbl_MB_CRC, F
0022 0EA0 MOVLW 0xA0
0024 1A06 XORWF gbl_MB_CRC+D'1', F
0026 D003 BRA label4
0028 label3
0028 90D8 BCF STATUS,C
002A 3206 RRCF gbl_MB_CRC+D'1', F
002C 3205 RRCF gbl_MB_CRC, F
002E label4
002E 2A09 INCF gbl_MB_j, F
0030 D7EE BRA label2

Mike, K8LH
- 17th January 2011, 14:08
Mike,
So obviously BoostC compiles this to faster code but doesn't the BoostC compiled code you posted use GOTO as well?

Please forgive me if it seems I was suggesting one compiler is better than another. I'm just trying to help you characterize loop timing but I don't have PBP.

Cheerful regards, Mike

HenrikOlsson
- 17th January 2011, 16:40
Mike,
Not at all - at least I didn't read it like that and your help is much appreciated! I'm not that good at assembler though (hence the need for PBP ;-) ) Sorry for not mentioning I was using an 18 series PIC, I should've thought about that.

Ioannis,
MB_i is declared as a BYTE.

At this point the best I've got is (was) this:

MB_k = MB_Buffer_Ptr - 3 ' Find out where in the array to stop.
FOR MB_i = 0 to MB_k ' Iterate thru the frame, stop before the last two bytes which are the actual CRC.
MB_CRC = MB_CRC ^ MB_Buffer[MB_i] ' Bitwise EXOR the current low byte of the CRC "value" with the byte in the buffer

MB_j = 0

DO WHILE MB_j.3 = 0
If MB_CRC.Bit0 THEN ' If the LSB is set
MB_CRC = MB_CRC >> 1 ' We shift CRC on bit to the right....
MB_CRC = $A001 ^ MB_CRC ' and EXOR it with out polynomial for MODBUS CRC-16 ($A001)
ELSE
MB_CRC = MB_CRC >> 1 ' If the LSB is not set we simply shift CRC one bit to the right.
ENDIF
MB_j = MB_j + 1
LOOP
NEXT ' Next byte in the frame
RETURN
This works and does the 9 byte test-array in 1496 cycles which is faster than with the FOR-NEXT loop I originally had. I then tried changing the meat of it to:

DO WHILE MB_j.3 = 0
MB_CRC = MB_CRC >> 1
If MB_CRC.BIT0 THEN MB_CRC = $A001 ^ MB_CRC
MB_j = MB_j + 1
LOOP
This runs a little bit faster, 1394 cycles, but it calculates the wrong CRC value. This is because, the shifts is done before checking BIT0 while in the working code it checks BIT0 THEN shifts and XOR's (or not).

Since the shift right command uses the RRCF instruction which shifts thru the CARRY flag I tried:

DO WHILE MB_j.3 = 0
MB_CRC = MB_CRC >> 1
If STATUS.0 THEN MB_CRC = $A001 ^ MB_CRC 'Check carry, XOR if set.
MB_j = MB_j + 1
LOOP
This runs in 1414 cycles and produces the correct CRC value. Can anyone see any potential issues with doing it like this?

All in all we've gone from 1730 to 1414 cycles, that's a ~22% increase - not bad! I wonder what more we can squeeze out of it? ;-)

Thanks guys!
/Henrik.

Ioannis
- 17th January 2011, 18:53
I am pretty sure I put the shift inside the loop.

OK, why Darrel's routine is faster? Maybe is the shift? Can it be that Left Shift is faster than Right?

I see no other obvious reason.

Interresting.

Ioannis

HenrikOlsson
- 17th January 2011, 20:06
I am pretty sure I put the shift inside the loop
Absolutely, but shifting before checking bit0 is different from checking bit0 before shifting ;-)

I haven't yet tried Darrels version but I have looked at the ASM code it compiles to and as far as I can see it's now "longer" than what my current code compiles to. I'm actually starting to doubt it'll do the 9 byte test sequence in 825 cycles. Darrel, are you there?

Thanks!
/Henrik.

Bruce
- 17th January 2011, 23:52
Run Darrels' example through MPSIM with the stopwatch to see if it really takes only 825 cycles to process 9 bytes through his CRC routine.

Yours looks pretty darn tight as-is.