64/32-Bit Division Algorithm

Coded in Thumb (ARM) | July 13, 2011
Given the simple mathematical operators in assembly such as addition and subtraction, it is relatively easy to implement multiplication. The same cannot be said for division, since there may be issues with not having registers of the correct size. As part of a summer intern project, I along with another intern implemented the ability to divide a 64-bit number by a 32-bit number, using just 32-bit registers.

Working With Negative Numbers

The division algorithm must be able to divide negative numbers. In order to improve efficiency, we use a sign flag, which will be set only if one of the numerator or denominator is negative. If we assume the 64-bit numerator comes in at registers R0 and R1 and the 32-bit denominator in R2, figuring out the sign flag (which we store in R8) is just the following (in ARM Thumb code):
MOV     R7, R1      ; Temporarily store the top half of the numerator into R7
EORS    R7, R2      ; Exclusive or it with the denominator
MOV     R8, R7      ; Save sign flag in R8 register
By XORing the top half of the numerator and the denominator, we are XORing their signs. The result is negative if either one is negative and positive if both or neither are negative.
This makes the algorithm more efficient since it's easier to negate either the numerator or denominator and work with positive numbers. When the result needs to be returned, the answer is negated if the sign flag is set.

Short-Circuiting

In order to speed up the algorithm, a lot of short-circuiting was used for special cases in division. We check to see if the numerator and/or denominator fall in one of these special cases, then branch to a part of the code that handles this case without going into the main loop (which will be discussed later).

The Main Loop

Unfortunately, the main loop cannot be described in detail for proprietary reasons, but an illustration of this can be found here. Since the algorithm goes through this loop many times, the division algorithm becomes very slow. A faster way to implement it is to unroll the loop: explicitly writing out each iteration of the loop. For example, to unroll this loop:
for (int i = 1; i <= 3; i++) {
    print "hi!";
}
The code would just be repeated 3 times:
print "hi!";
print "hi!";
print "hi!";
Since the number of iterations to run in a division algorithm vary, we need a mechanism to check the number of loops that need to be done (in order to know where to jump in the unrolled code). This is still faster than an actual loop since the loop invariant does not need to be checked each time (which happens up to 48 times). This is assuming code space is not an issue.

Conclusion

This project took us about 2 weeks, which seems like a long time for an algorithm! What took us the most time wasn't debugging, but getting the algorithm to work in the smallest number of clocks. We tried a number of clever ways to reduce cycles, such as loop unrolling as mentioned above.