Optimizing DIV

**Darrel Taylor** · - 13th September 2008, 22:07

It appears that along with some new errors, ...
you've added an average of 3 instruction cycles to the PBP DIV time.

According to the theory, the most optimization would be gained when using small numbers. So 8-bit/8-bit should see the greatest effect.

--Testing 8/8, Skip 0/0 this time--
Test- A=1-255, B=0-255
No ERRORs:
SkiMIN=762 SkiMAX=1018
PBPMIN=759 PBPMAX=1015

This is the latest test program ... SkiDiv5.pbp

These are the results at different OPT levels ...
SKI_DIV_OPT_1
SKI_DIV_OPT_2
SKI_DIV_OPT_3
SKI_DIV_OPT_0

This test program uses the Timing methods described in Post #2.

**skimask** · - 13th September 2008, 22:17

Originally Posted by Darrel Taylor

It appears that along with some new errors, ...

Hmmm... The code I posted last night at 22:17 worked fine for me as far as errors go. 0/0 was the only error I got, and I found the source of that one. Not much in the way of savings, though.

you've added an average of 3 instruction cycles to the PBP DIV time.
According to the theory, the most optimization would be gained when using small numbers. So 8-bit/8-bit should see the greatest effect.

That's what I'm getting also now that I've got a 'method' working.

**Darrel Taylor** · - 13th September 2008, 22:31

Originally Posted by skimask

It's working well in the short tests, sometimes knocking cycle counts down by more than 3/4, usually by about 1/2.

Keep in mind that the previous test program was only looking for the accuracy of the results, not the timing.

It used 2 divides, 1 for the quotient and another for the remainder.

Code:

        PBPQ = AL / BL                   ; do same in PBP
        PBPR = AL // BL

When it looked like you were getting 1/2 the cycle count, it was because PBP had to do the divide twice.

The new test program is geared towards Timing.
PBP only has to do it once now.
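
To make the doubling concrete, here is a rough Python model (not PBP, and not the library's actual code) of a shift-and-subtract divide. The function names are made up for illustration. One pass through the loop produces both the quotient and the remainder, so a test that issues `/` and `//` as two separate divides pays for the loop twice.

```python
def restoring_divmod(dividend, divisor, bits=32):
    """Shift-and-subtract divide: one pass yields quotient AND remainder."""
    quotient, remainder = dividend, 0
    mask = (1 << bits) - 1
    for _ in range(bits):
        # Move the top dividend bit into the running remainder.
        remainder = (remainder << 1) | ((quotient >> (bits - 1)) & 1)
        quotient = (quotient << 1) & mask
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1  # this quotient bit is a 1
    return quotient, remainder


def old_style_test(a, b):
    """Two separate divides, like PBPQ = AL / BL then PBPR = AL // BL."""
    q, _ = restoring_divmod(a, b)  # first pass, remainder thrown away
    _, r = restoring_divmod(a, b)  # second pass, quotient thrown away
    return q, r, 2 * 32            # loop iterations spent


def new_style_test(a, b):
    """One divide, both results taken from the same pass."""
    q, r = restoring_divmod(a, b)
    return q, r, 32


print(old_style_test(200, 7))  # (28, 4, 64)
print(new_style_test(200, 7))  # (28, 4, 32)
```

Same answers either way, but the old-style test spends twice the loop iterations on the reference side, which is what made the comparison look better than it was.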

**skimask** · - 13th September 2008, 22:49

Originally Posted by Darrel Taylor

It used 2 divides, 1 for the quotient and another for the remainder.

Ya, I noticed that last night after everything was running TWICE as fast.
I knew it was too good to be true.
I'm cooking on some numbers right now based on loop sizes of 9, 99, and 999 (step 1)...and so on...
After those get done cooking, I'm going to expand those numbers out to 32 bit with larger steps and let it cook again. Like you annotated, could take months!

**Darrel Taylor** · - 14th September 2008, 02:26

Originally Posted by skimask

I'm cooking on some numbers right now ... After those get done cooking, I'm going to expand those number out to ... and let it cook again. Like you annotated, could take months!

Go for it! I think the SkiDiv5 program should give you enough information to give it a worthy effort.

Just remember. I'm only the guy giving the results.
You've chosen to attempt bettering Jeff Schmoyer ...
Good luck.


**skimask** · - 14th September 2008, 02:29

Originally Posted by Darrel Taylor

Go for it! I think the SkiDiv5 program should give you enough information to give it a worthy effort.
Just remember. I'm only the guy giving the results.
You've chosen to attempt bettering Jeff Schmoyer ...
Good luck.


Working on it... I'm surely not trying to better Jeff! That would be fruitless. If this does work as planned, I'm sure it'll be one of those tradeoffs...speed for space.
You're the only guy giving the results? So far...
Standby...shouldn't be too long now...

**skimask** · - 14th September 2008, 02:56

Code I'm using now. Using MPLAB simulator to get accurate cycle counts.

Code:

resetplaceholder:	'18f4685 code
DEFINE	OSC		40	'40 MHz clock for proto work
DEFINE	NO_CLRWDT	1	'no extra clear watchdog timer instructions
DISABLE
DEFINE SKI_DIV_OPT 3
AL VAR LONG : BL VAR LONG : SkiQ VAR LONG : SkiR VAR LONG
PBPQ VAR LONG : PBPR VAR LONG : AW VAR WORD : BW VAR WORD : RW VAR WORD
ERROR VAR BIT : errorcount var long : maxnum var long : s1 var long
pbpq = al / bl	'need to use at least one PBP divide to kick in DIVS for now
al3 var al.byte3 : al2 var al.byte2 : al1 var al.byte1 : al0 var al.byte0
bl3 var bl.byte3 : bl2 var bl.byte2 : bl1 var bl.byte1 : bl0 var bl.byte0
'aliased so I can find it easily in MPSIM
testcode:
	@ nop	;stp = easy to search and add a breakpoint in SIM
		maxnum = 9 : s1 = 1 : gosub dodivide
		maxnum = 99 : s1 = 1 : gosub dodivide
		maxnum = 999 : s1 = 19 : gosub dodivide
		maxnum = 9999 : s1 = 193 : gosub dodivide
		maxnum = 99999 : s1 = 1193 : gosub dodivide
		maxnum = 999999 : s1 = 31279 : gosub dodivide
		maxnum = 9999999 : s1 = 112791 : gosub dodivide
		maxnum = 99999999 : s1 = 327913 : gosub dodivide
		maxnum = 999999999 : s1 = 1279137 : gosub dodivide
	@ nop	;stp
END
stop
dodivide:	For AL = 0 to maxnum step s1
			for BL = 0 to maxnum step s1
				PBPQ = AL / BL
			next BL
		next AL
		@ nop	;stp
		For AL = 0 to maxnum step s1
			for BL = 0 to maxnum step s1
				@ MOVE?NN	_AL, R0
				@ MOVE?NN	_BL, R1		; AL / BL
				@ L?CALL	#DIVS
				@ MOVE?ANN	R0, _SkiQ
				@ MOVE?NN	R2, _SkiR
			next BL
		next AL
		@ nop	;stp
		return

ASM
	ifdef DIVS_USED
  LIST
#DIVS
	clrf	R3 + 3		; Clear sign difference indicator
	btfss	R0 + 3, 7	; Check for R0 negative
	bra	#divchkr1	; Not negative
	btg	R3 + 3, 7	; Flip sign indicator
	clrf	WREG		; Clear W for subtracts
	negf	R0		; Flip value to plus
	subfwb	R0 + 1, F
	subfwb	R0 + 2, F
	subfwb	R0 + 3, F
#divchkr1
	btfss	R1 + 3, 7	; Check for R1 negative
	bra	#divdo		; Not negative
	btg	R3 + 3, 7	; Flip sign indicator
	clrf	WREG		; Clear W for subtracts
	negf	R1		; Flip value to plus
	subfwb	R1 + 1, F
	subfwb	R1 + 2, F
	subfwb	R1 + 3, F
	bra	#divdo		; Skip unsigned entry
  NOLIST
DIV_USED = 1
	endif
	ifdef DIV_USED
  LIST
#DIV
		ifdef DIVS_USED
	clrf	R3 + 3		; Clear sign difference indicator	
		endif
#divdo
	clrf	R2		; Do the divide
	clrf	R2 + 1
	clrf	R2 + 2
	clrf	R2 + 3
	movlw	32
	movwf	R3

;added to speed up s-31 divides by using byte and bit shifting,
;added checking for 32, 24, 16 and 8 bit divides
;and using those routines if R0 and R1 are small enough
		ifdef SKI_DIV_OPT
SkiOpt3	;shift down bytes if low bytes are 0'd
	movf    R0, W      ; IF R0(0)= 0 
	bnz     SkiOpt4

	movf    R1, W      ;   AND R1(0)= 0 then 
	bnz     SkiOpt4

	movff   R0 + 1, R0 + 0 ;      and preshift R0
	movff   R0 + 2, R0 + 1
	movff   R0 + 3, R0 + 2
	clrf    R0 + 3

	movff   R1 + 1, R1 + 0 ;      and R1 over 8 bits
	movff   R1 + 2, R1 + 1
	movff   R1 + 3, R1 + 2
	clrf    R1 + 3

	movlw   8              ;      loops - 8
	subwf   R3, F
	btfss   STATUS, Z      ; stop if no loops left (0/0)
	bra     SkiOpt3
	bra	#divdone

SkiOpt4	;shift down bits if low bits are 0'd
	btfsc	R0, 0	; if lowest bit set, go pick the divide width
	bra	skiopt5
	btfsc	R1, 0	; if lowest bit set, go pick the divide width
	bra	skiopt5

	bcf    	STATUS, C	;clr carry-shift over complete R0
	rrcf	R0 + 3, F	;shift R0+3, .0 into carry
	rrcf	R0 + 2, F	;shift R0+2
	rrcf	R0 + 1, F	;shift R0+1
	rrcf	R0 + 0, F	;shift R0+0

	bcf	STATUS, C	;clr carry-shift over complete R1
	rrcf	R1 + 3, F	;shift R1+3, .0 into carry
	rrcf	R1 + 2, F	;shift R1+2
	rrcf	R1 + 1, F	;shift R1+1
	rrcf	R1 + 0, F	;shift R1+0

	movlw	1		;subtract one from the loop count
	subwf	R3, F

	btfss	STATUS, Z	;stop if no more loops
	bra	SkiOpt4
	bra	#divdone

skiopt5	;check if can use different divide methods (32, 24, 16, 8)
	movlw	32		;load loop count
	movwf	R3
	movf	R0 + 3, W
	bnz	#divloop	;use normal div
	movf	R1 + 3, W
	bnz	#divloop	;use normal div

	movlw	24		;load loop count
	movwf	R3
	movf	R0 + 2, W
	bnz	#divloop24	;jump out to 24 bit
	movf	R1 + 2, W
	bnz	#divloop24	;jump out to 24 bit

	movlw	16		;load loop count
	movwf	R3
	movf	R0 + 1, W
	bnz	#divloop16	;jump out to 16 bit
	movf	R1 + 1, W
	bnz	#divloop16	;jump out to 16 bit

	movlw	8		;load loop count
	movwf	R3
	bra	#divloop8	;jump out to 8 bit
	
		endif
;above added to speed divide operations

#divloop	;32 bit
	rlcf	R0 + 3, W
	rlcf	R2, F
	rlcf	R2 + 1, F
	rlcf	R2 + 2, F
	rlcf	R2 + 3, F
	movf	R1, W
	subwf	R2, F
	movf	R1 + 1, W
	subwfb	R2 + 1, F
	movf	R1 + 2, W
	subwfb	R2 + 2, F
	movf	R1 + 3, W
	subwfb	R2 + 3, F
	bc	#divok
	movf	R1, W
	addwf	R2, F
	movf	R1 + 1, W
	addwfc	R2 + 1, F
	movf	R1 + 2, W
	addwfc	R2 + 2, F
	movf	R1 + 3, W
	addwfc	R2 + 3, F
	bcf	STATUS, C
#divok
	rlcf	R0, F
	rlcf	R0 + 1, F
	rlcf	R0 + 2, F
	rlcf	R0 + 3, F
	decfsz	R3, F
	bra	#divloop

		ifdef DIVS_USED
	btfss	R3 + 3, 7	; Should result be negative?
	bra	#divdone	; Not negative
	clrf	WREG		; Clear W for subtracts
	negf	R0		; Flip quotient to minus
	subfwb	R0 + 1, F
	subfwb	R0 + 2, F
	subfwb	R0 + 3, F
	negf	R2		; Flip remainder to minus
	subfwb	R2 + 1, F
	subfwb	R2 + 2, F
	subfwb	R2 + 3, F
	bra	#divdone
		endif

#divloop24	;24 bit
	rlcf	R0 + 2, W
	rlcf	R2, F
	rlcf	R2 + 1, F
	rlcf	R2 + 2, F
	movf	R1, W
	subwf	R2, F
	movf	R1 + 1, W
	subwfb	R2 + 1, F
	movf	R1 + 2, W
	subwfb	R2 + 2, F
	bc	#divok24
	movf	R1, W
	addwf	R2, F
	movf	R1 + 1, W
	addwfc	R2 + 1, F
	movf	R1 + 2, W
	addwfc	R2 + 2, F
	bcf	STATUS, C
#divok24
	rlcf	R0, F
	rlcf	R0 + 1, F
	rlcf	R0 + 2, F
	decfsz	R3, F
	bra	#divloop24

		ifdef DIVS_USED
	btfss	R3 + 3, 7	; Should result be negative?
	bra	#divdone	; Not negative
	clrf	WREG		; Clear W for subtracts
	negf	R0		; Flip quotient to minus
	subfwb	R0 + 1, F
	subfwb	R0 + 2, F
	subfwb	R0 + 3, F
	negf	R2		; Flip remainder to minus
	subfwb	R2 + 1, F
	subfwb	R2 + 2, F
	subfwb	R2 + 3, F
	bra	#divdone
		endif
		
#divloop16	;16 bit
	rlcf	R0 + 1, W
	rlcf	R2, F
	rlcf	R2 + 1, F
	movf	R1, W
	subwf	R2, F
	movf	R1 + 1, W
	subwfb	R2 + 1, F
	bc	#divok16
	movf	R1, W
	addwf	R2, F
	movf	R1 + 1, W
	addwfc	R2 + 1, F
	bcf	STATUS, C
#divok16
	rlcf	R0, F
	rlcf	R0 + 1, F
	decfsz	R3, F
	bra	#divloop16

		ifdef DIVS_USED
	btfss	R3 + 3, 7	; Should result be negative?
	bra	#divdone	; Not negative
	clrf	WREG		; Clear W for subtracts
	negf	R0		; Flip quotient to minus
	subfwb	R0 + 1, F
	subfwb	R0 + 2, F
	subfwb	R0 + 3, F
	negf	R2		; Flip remainder to minus
	subfwb	R2 + 1, F
	subfwb	R2 + 2, F
	subfwb	R2 + 3, F
	bra	#divdone
		endif
		
#divloop8	;8 bit
	rlcf	R0, W
	rlcf	R2, F
	movf	R1, W
	subwf	R2, F
	bc	#divok8
	movf	R1, W
	addwf	R2, F
	bcf	STATUS, C
#divok8
	rlcf	R0, F
	decfsz	R3, F
	bra	#divloop8

		ifdef DIVS_USED
	btfss	R3 + 3, 7	; Should result be negative?
	bra	#divdone	; Not negative
	clrf	WREG		; Clear W for subtracts
	negf	R0		; Flip quotient to minus
	subfwb	R0 + 1, F
	subfwb	R0 + 2, F
	subfwb	R0 + 3, F
	negf	R2		; Flip remainder to minus
	subfwb	R2 + 1, F
	subfwb	R2 + 2, F
	subfwb	R2 + 3, F
		endif
#divdone
	movf	R0, W		; Get low byte to W
	goto	DUNN
  NOLIST
DUNN_USED = 1
	endif
ENDASM
END
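
As a quick sanity check on the pre-shift idea in SkiOpt3/SkiOpt4, here is a small Python sketch (a host-side model with hypothetical names, not a translation of the assembly). It assumes unsigned operands and a nonzero divisor. Dropping the trailing zero bits that both operands share leaves the quotient unchanged, but the remainder comes out scaled down by the same shift, so it has to be shifted back up if the remainder matters.

```python
def preshift(a, b):
    """Drop the trailing zero bits that a and b have in common."""
    shift = 0
    # Stop on b == 0 so that 0/0 cannot loop forever.
    while b != 0 and (a | b) & 1 == 0:
        a >>= 1
        b >>= 1
        shift += 1
    return a, b, shift


a, b = 4800, 1792                 # 0x12C0 and 0x0700
sa, sb, shift = preshift(a, b)    # 75, 28, shift of 6
q, r = divmod(sa, sb)             # divide the smaller numbers

assert q == a // b                # quotient is unaffected by the pre-shift
assert r != a % b                 # raw remainder is NOT the true remainder
assert (r << shift) == a % b      # shifting it back restores it
print(q, r, r << shift)           # 2 19 1216
```

Whether the listing above needs that correction applied to R2 is worth verifying on real values; the sketch only shows the arithmetic.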

Cycle counts for various loop and step sizes

Code:

loop count   step size (s1)   PBP clock count   Ski clock count   increase
9            1                105,278           26,124            4.033
99           1                10,632,832        2,507,692         4.240
999          19               2,981,601         1,136,305         2.624
9999         193              2,869,397         1,174,807         2.442
99999        1193             7,479,745         3,913,171         1.911
999999       31279            1,081,858         722,091           1.498
9999999      112791           8,388,778         5,638,902         1.462
99999999     327913           98,376,761        98,167,489        1.002
999999999    1279137          646,580,978       663,354,535       0.973

Obviously, the biggest speed increase comes from using the various divide routines (32, 24, 16, 8), and once you get into the really big numbers, you lose cycles.
In a bit, I'm going to load this into my protoboard and see what happens (PBP div vs. Ski_Opt divides) as far as accuracy goes.
I still think shifting left and cutting the loop count ahead of time has merit. As I said, I know I did it on that 6809E back in the day. I just can't remember how...
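
For anyone following along without a PIC handy, the width selection can be modelled on a PC. This is only a Python sketch of the idea behind skiopt5 (names are hypothetical, operands are assumed unsigned): pick the narrowest of 8/16/24/32 bits that holds both operands, then run the shift-and-subtract loop that many times. The loop count is what shrinks for small numbers; for big numbers the loop is still 32 passes and the width checks are pure overhead, which fits the last rows of the table.

```python
def pick_width(a, b):
    """Narrowest of 8/16/24/32 bits that holds both operands."""
    for bits in (8, 16, 24):
        if a < (1 << bits) and b < (1 << bits):
            return bits
    return 32


def divide(a, b):
    """Shift-and-subtract divide using only as many passes as needed."""
    bits = pick_width(a, b)
    mask = (1 << bits) - 1
    q, r = a, 0
    for _ in range(bits):
        r = (r << 1) | ((q >> (bits - 1)) & 1)  # next dividend bit into r
        q = (q << 1) & mask
        if r >= b:
            r -= b
            q |= 1
    return q, r, bits  # bits == number of loop passes used


print(divide(200, 7))          # (28, 4, 8)
print(divide(70000, 300))      # (233, 100, 24)
print(divide(2**31 - 1, 3))    # (715827882, 1, 32)
```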

EDIT: Added the error checking code back in and put a WATCH on the error counter variable in MPSIM.
So far, the only error I've gotten is on 0/0. Ski_Opt comes back with zeros for both. PBP comes back with a maxed-out quotient (2^32 - 1) and 0 in the remainder.
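
The 0/0 difference falls out of the loop structure. Here is a Python model of it (a sketch with made-up names, not the actual library code): with a zero divisor the trial subtract never borrows, so the plain 32-pass loop shifts a 1 into every quotient bit. The second function models the early exit in SkiOpt3, where the loop count runs out while both operands are still zero and the routine leaves with quotient and remainder both cleared.

```python
def plain_divide(a, b, bits=32):
    """Plain shift-and-subtract loop with no special case for b == 0."""
    mask = (1 << bits) - 1
    q, r = a, 0
    for _ in range(bits):
        r = (r << 1) | ((q >> (bits - 1)) & 1)
        q = (q << 1) & mask
        if r >= b:      # with b == 0 this is always true
            r -= b
            q |= 1      # so every quotient bit becomes 1
    return q, r


def early_exit_divide(a, b):
    """Model of bailing out before the loop when both operands are zero."""
    if a == 0 and b == 0:
        return 0, 0
    return plain_divide(a, b)


print(plain_divide(0, 0))       # (4294967295, 0)
print(early_exit_divide(0, 0))  # (0, 0)
```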
Slow going in the sim with a step of 1 running from 0 to (2^31 -1)....
Just calculated out the situation as described in the line above...
Running on the SIM on my laptop (Dell Insp. 8200 P4 @ 1.7 GHz), it could take about 571,232,829 years, 5 months, 13 days to complete the loop as written!!!
Ya...not so much!!!
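
For the curious, the arithmetic behind an estimate like that can be sketched in Python. The nested AL and BL loops each take 2^31 values at a step of 1, so there are 2^62 divides to get through. The rate of 256 divides per second below is an assumption, picked only because it reproduces the quoted figure with 365-day years; the simulator speed actually measured isn't stated above.

```python
def sweep_duration(values_per_loop, divides_per_second):
    """Return (years, leftover days) for two nested loops of equal size."""
    total = values_per_loop ** 2              # AL loop times BL loop
    seconds = total // divides_per_second
    years, rest = divmod(seconds, 365 * 86400)
    return years, rest // 86400


# ASSUMPTION: 256 divides per second in the simulator (not from the post).
years, days = sweep_duration(2**31, 256)
print(years, days)  # 571232829 163  (163 days is about 5 months, 13 days)
```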
