64/32-Bit Division Algorithm
Coded in Thumb (ARM) | July 13, 2011
Given the simple mathematical operators in assembly such as addition and subtraction, it is
relatively easy to implement multiplication. The same cannot be said for division, since
there may be issues with not having registers of the correct size. As part of a summer
intern project, I along with another intern implemented the ability to divide a 64-bit
number by a 32-bit number, using just 32-bit registers.
Working With Negative Numbers
The division algorithm must be able to divide negative numbers. In order to improve efficiency, we use a sign flag, which will be set only if one of the numerator or denominator is negative. If we assume the 64-bit numerator comes in at registers R0 and R1 and the 32-bit denominator in R2, figuring out the sign flag (which we store in R8) is just the following (in ARM Thumb code):
MOV R7, R1
; Temporarily store the top half of the numerator into R7
EORS R7, R2
; Exclusive or it with the denominator
MOV R8, R7
; Save sign flag in R8 register
By XORing the top half of the numerator and the denominator, we are XORing their signs.
The result is negative if either one is negative and positive if both or neither are negative.
This makes the algorithm more efficient since it's easier to negate either the numerator or
denominator and work with positive numbers. When the result needs to be returned, the answer
is negated if the sign flag is set.
Short-Circuiting
In order to speed up the algorithm, a lot of short-circuiting was used for special cases in division. We check to see if the numerator and/or denominator fall in one of these special cases, then branch to a part of the code that handles this case without going into the main loop (which will be discussed later).- Division by 0: This is a simple case, as you can't divide by 0. The result is undefined, so we either return 0 or a number representing an undefined value.
- Division by 1: The answer is just the numerator.
- Numerator = Denominator: The result is 1.
- Numerator < Denominator: Since our algorithm does not deal with floats but instead gives an integer answer and the remainder, this case becomes a special case. The result is 0 with the numerator as the remainder.
- Dividing by a Power of 2: Division by a power of 2 is the same as right shifting the numerator. A naïve way to approach this would be to create a loop and repeatedly right shift the numerator. However, a much faster way is to check if the denominator is a power of 16, 8, 4, 2, and 1 (or a combination of those), and shift the numerator accordingly.
The Main Loop
Unfortunately, the main loop cannot be described in detail for proprietary reasons, but an illustration of this can be found here. Since the algorithm goes through this loop many times, the division algorithm becomes very slow. A faster way to implement it is to unroll the loop: explicitly writing out each iteration of the loop. For example, to unroll this loop:
for (int i = 1; i <= 3; i++) {
print "hi!";
}
The code would just be repeated 3 times:
print "hi!";
print "hi!";
print "hi!";
Since the number of iterations to run in a division algorithm vary, we need a mechanism to
check the number of loops that need to be done (in order to know where to jump in the unrolled
code). This is still faster than an actual loop since the loop invariant does not need to be
checked each time (which happens up to 48 times). This is assuming code space is not an issue.