An aside from another thread discussing math execution time in PBP 2.50B: it seems that ALL add/subtract/multiply/divide operations are done as signed 31-bit operations, rather than being specialized for byte, word, long, or mixed operand sizes.
If I'm reading pbppi18L.Lib correctly, the worst case for an s31/s31 divide is about 1,127 cycles (signed, 2^30-1 divided by 1) and the best case is about 713 cycles (unsigned, 1 divided by 0).
The divide loop itself takes 34 cycles per iteration for 32 iterations: 1,088 cycles in the worst case, plus 25 cycles going into the loop and 14 cycles coming out of it.
Code:
;****************************************************************
;* DIV : 32 x 32 divide *
;* Input : R0 / R1 *
;* Output : R0 = quotient *
;* : R2 = remainder *
;* Notes : R2 = R0 MOD R1 *
;****************************************************************
         ifdef   DIVS_USED
         LIST
DIVS     clrf    R3 + 3          ; Clear sign difference indicator
         btfss   R0 + 3, 7       ; Check for R0 negative
         bra     divchkr1        ; Not negative
         btg     R3 + 3, 7       ; Flip sign indicator
         clrf    WREG            ; Clear W for subtracts
         negf    R0              ; Flip value to plus
         subfwb  R0 + 1, F
         subfwb  R0 + 2, F
         subfwb  R0 + 3, F
divchkr1 btfss   R1 + 3, 7       ; Check for R1 negative
         bra     divdo           ; Not negative
         btg     R3 + 3, 7       ; Flip sign indicator
         clrf    WREG            ; Clear W for subtracts
         negf    R1              ; Flip value to plus
         subfwb  R1 + 1, F
         subfwb  R1 + 2, F
         subfwb  R1 + 3, F
         bra     divdo           ; Skip unsigned entry
         NOLIST
DIV_USED = 1
         endif
         ifdef   DIV_USED
         LIST
DIV
         ifdef   DIVS_USED
         clrf    R3 + 3          ; Clear sign difference indicator
         endif
divdo    clrf    R2              ; Do the divide
         clrf    R2 + 1
         clrf    R2 + 2
         clrf    R2 + 3
         movlw   32              ; Loop count: one pass per dividend bit
         ; --- Proposed addition (pseudocode, not yet written or tested) ---
         ; Test the widest case first so only one preshift is applied:
         ; IF R0.word1 = 0 AND R0.byte1 = 0 THEN movlw 8 and preshift R0 left 24 bits
         ; ELSE IF R0.word1 = 0 THEN movlw 16 and preshift R0 left 16 bits
         ; ELSE IF R0.byte3 = 0 THEN movlw 24 and preshift R0 left 8 bits
         ; Only R0 (the dividend) needs testing and preshifting. R1 (the
         ; divisor) is never shifted by the loop, so preshifting it would
         ; change the quotient.
         ; --- End of proposed addition ---
         movwf   R3              ; R3 = iteration counter
divloop  rlcf    R0 + 3, W       ; Dividend MSB into carry (W discarded)
         rlcf    R2, F           ; Shift carry into remainder
         rlcf    R2 + 1, F
         rlcf    R2 + 2, F
         rlcf    R2 + 3, F
         movf    R1, W           ; Trial subtract: R2 = R2 - R1
         subwf   R2, F
         movf    R1 + 1, W
         subwfb  R2 + 1, F
         movf    R1 + 2, W
         subwfb  R2 + 2, F
         movf    R1 + 3, W
         subwfb  R2 + 3, F
         bc      divok           ; No borrow: C = 1 is the quotient bit
         movf    R1, W           ; Borrow: add R1 back to restore R2
         addwf   R2, F
         movf    R1 + 1, W
         addwfc  R2 + 1, F
         movf    R1 + 2, W
         addwfc  R2 + 2, F
         movf    R1 + 3, W
         addwfc  R2 + 3, F
         bcf     STATUS, C       ; Quotient bit = 0
divok    rlcf    R0, F           ; Shift quotient bit (C) into R0
         rlcf    R0 + 1, F
         rlcf    R0 + 2, F
         rlcf    R0 + 3, F
         decfsz  R3, F
         bra     divloop
         ifdef   DIVS_USED
         btfss   R3 + 3, 7       ; Should result be negative?
         bra     divdone         ; Not negative
         clrf    WREG            ; Clear W for subtracts
         negf    R0              ; Flip quotient to minus
         subfwb  R0 + 1, F
         subfwb  R0 + 2, F
         subfwb  R0 + 3, F
         negf    R2              ; Flip remainder to minus
         subfwb  R2 + 1, F
         subfwb  R2 + 2, F
         subfwb  R2 + 3, F
divdone
         endif
         movf    R0, W           ; Get low byte to W
         goto    DUNN
         NOLIST
DUNN_USED = 1
         endif
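Since the PIC18 assembly can't be run on a PC, here is a rough C model of the unsigned DIV loop for following (and testing) the logic. It is my own sketch, not code from pbppi18L.Lib; the function name `div32_model` is hypothetical, and the variables mirror R0, R1, R2, and the R3 loop count. Like the assembly, it keeps only 32 bits of the running remainder, so it matches ordinary division whenever the divisor is nonzero and at most 2^31, which covers everything the signed DIVS path can pass in after taking absolute values.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side C model of the unsigned DIV loop (hypothetical helper, not PBP code).
 * r0: dividend on entry, quotient on exit (as R0)
 * r1: divisor (as R1, never shifted)
 * r2: remainder on exit (as R2)
 * iterations: loop count loaded into R3 (32 in the current library) */
static void div32_model(uint32_t *r0, uint32_t r1, uint32_t *r2, int iterations)
{
    uint32_t q = *r0;
    uint32_t rem = 0;                         /* clrf R2 .. R2 + 3 */

    assert(iterations >= 0 && iterations <= 32);
    for (int i = 0; i < iterations; i++) {
        uint32_t msb = q >> 31;               /* rlcf R0 + 3, W: MSB into carry */
        rem = (rem << 1) | msb;               /* rlcf R2 .. R2 + 3 (bit 32 lost, as in the asm) */

        uint32_t carry;
        if (rem >= r1) {                      /* subtract without borrow: C = 1 */
            rem -= r1;
            carry = 1;
        } else {                              /* borrow: add R1 back, bcf STATUS, C */
            carry = 0;
        }
        q = (q << 1) | carry;                 /* divok: rlcf R0 .. R0 + 3 */
    }

    *r0 = q;
    *r2 = rem;
}
```

Dividing by zero never borrows, so the model (like the loop) comes back with a quotient of 0xFFFFFFFF and the dividend left in the remainder.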
I think (and I haven't played with it yet) that a few checks could go where the pseudocode IF lines sit in the listing above, between movlw 32 and movwf R3, to test whether the upper bytes of the dividend and/or divisor (or, going further, individual bits) are clear. If they are, preshift accordingly and lop off 8, 16, or 24 iterations of the loop (up to 30 if checking bit by bit).
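A byte-granular C sketch of that shortcut follows, using the hypothetical names `div_loop` and `div32_preshift`. One assumption is worth flagging: only the dividend is tested and preshifted. R1 is never shifted by the loop, so the passes that can be skipped are exactly the ones that feed leading zero bits of the dividend into a still-zero remainder; shifting the divisor too would change the quotient. The tests run from widest to narrowest so that only one preshift is ever applied.

```c
#include <assert.h>
#include <stdint.h>

/* Same restoring shift-subtract loop as DIV; q holds the (preshifted) dividend. */
static uint32_t div_loop(uint32_t q, uint32_t divisor, uint32_t *rem_out, int iterations)
{
    uint32_t rem = 0;

    assert(iterations >= 0 && iterations <= 32);
    for (int i = 0; i < iterations; i++) {
        rem = (rem << 1) | (q >> 31);         /* next dividend bit into remainder */
        uint32_t bit = rem >= divisor;        /* does the trial subtract succeed? */
        if (bit)
            rem -= divisor;
        q = (q << 1) | bit;                   /* quotient bit into bottom of q */
    }
    *rem_out = rem;
    return q;
}

/* Hypothetical div32_preshift: byte-granular version of the proposed checks.
 * Each skipped pass would shift a 0 into a zero remainder, fail the subtract
 * (for a nonzero divisor), and record a 0 quotient bit, so skipping it
 * changes nothing. */
static uint32_t div32_preshift(uint32_t dividend, uint32_t divisor,
                               uint32_t *rem_out, int *iterations_out)
{
    int iterations = 32;

    if ((dividend >> 8) == 0) {               /* upper 3 bytes clear */
        dividend <<= 24;
        iterations = 8;
    } else if ((dividend >> 16) == 0) {       /* upper word clear */
        dividend <<= 16;
        iterations = 16;
    } else if ((dividend >> 24) == 0) {       /* top byte clear */
        dividend <<= 8;
        iterations = 24;
    }
    *iterations_out = iterations;
    return div_loop(dividend, divisor, rem_out, iterations);
}
```

One caveat: with a divisor of 0 the shortened loop returns fewer one bits in the quotient than the full 32-pass loop does, so a real version might want to treat divide-by-zero separately.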
The worst case (long dividend/divisor) would still take the full 1,127 cycles (plus a few cycles for the checks), but in the best case (byte dividend/divisor) only 8 of the 32 loop iterations would remain, so the loop's share of the 713 cycles would drop to roughly a quarter.
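To see how the worst case scales with the number of passes, here is a tiny C helper (hypothetical name `worst_case_div_cycles`) built only from the figures quoted earlier: 25 cycles into the loop, 34 cycles per pass, and 14 cycles out. It deliberately leaves out the cost of the proposed checks and preshift.

```c
#include <assert.h>

/* Back-of-the-envelope worst-case estimate from the figures quoted in this
 * post. The cost of the proposed size checks and preshift is NOT included. */
static unsigned worst_case_div_cycles(unsigned passes)
{
    const unsigned entry_cycles = 25;
    const unsigned cycles_per_pass = 34;
    const unsigned exit_cycles = 14;

    assert(passes <= 32);
    return entry_cycles + cycles_per_pass * passes + exit_cycles;
}
```

With those numbers, 32 passes give the 1,127 cycles quoted above, while 24, 16, and 8 passes come to 855, 583, and 311 cycles. These assume every pass takes the slow path; a pass that takes the `bc divok` branch skips the add-back and is cheaper.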
The same could be said for the add, subtract, and multiply routines, although the gains would be minimal for each, and the PIC18 already has a built-in single-cycle 8x8 hardware multiplier.
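The hardware multiplier is why the multiply gains would be small. As a hypothetical illustration (not how pbppi18L.Lib is written), the C sketch below builds the low 32 bits of a 32x32 product from 8x8 partial products, the operation MULWF does in one cycle on a PIC18, and counts how many are needed when zero bytes are skipped. The names `mul8x8` and `mul32_bytes` are my own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for one hardware 8x8 multiply (MULWF on a PIC18 leaves the
 * 16-bit result in PRODH:PRODL). */
static uint16_t mul8x8(uint8_t a, uint8_t b)
{
    return (uint16_t)((unsigned)a * b);
}

/* Hypothetical mul32_bytes: low 32 bits of a * b from 8x8 partial products,
 * skipping zero bytes. *multiplies receives the number of 8x8 multiplies used. */
static uint32_t mul32_bytes(uint32_t a, uint32_t b, int *multiplies)
{
    uint32_t result = 0;
    int count = 0;

    assert(multiplies != NULL);
    for (int i = 0; i < 4; i++) {
        uint8_t ai = (uint8_t)(a >> (8 * i));
        if (ai == 0)
            continue;
        /* Partial products landing at byte 4 or above fall outside 32 bits. */
        for (int j = 0; i + j < 4; j++) {
            uint8_t bj = (uint8_t)(b >> (8 * j));
            if (bj == 0)
                continue;
            result += (uint32_t)mul8x8(ai, bj) << (8 * (i + j));
            count++;
        }
    }
    *multiplies = count;
    return result;
}
```

A full 32x32 product needs 10 partial products, a 16x16 one needs 4, and a byte-by-byte product needs just 1. Since each multiply is a single cycle, most of what skipping could save is the add-with-carry bookkeeping around each partial product rather than the multiplies themselves.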