202 ARITHMETIC 4.2.1 be expressed as (Managed web hosting) a normalized

May 7th, 2008

202 ARITHMETIC 4.2.1 be expressed as a normalized floating point number in the required range, special action is necessary.) N7. [Pack.] Put e and f together into the desired output representation. 1 Some simple examples of floating point addition are given in exercise 4. The following MIX subroutines, for addition and subtraction of numbers having the form (4), show how Algorithms A and N can be expressed as computer programs. The subroutines below are designed to take one input u from symbolic location ACC, and the other input v comes from register A upon entrance to the subroutine. The output w appears both in register A and location ACC. Thus, a fixed point coding sequence LDA A; ADD B; SUB C; STA D (7) would correspond to the floating point coding sequence LDA A, STA ACC; LDA B, JMP FADD; LDA C, JMP FSUEl; STA D. (8) Program A (Addition, subtraction, and normalization). The following program is a subroutine for Algorithm A, and it is also designed so that the normalization portion can be used by other subroutines that appear later in this section. In this program and in many others throughout this chapter, OFLO stands for a subroutine that prints out a message to the effect that MIX s overflow toggle was unexpectedly found to be on. The byte size b is assumed to be a multiple of 4. The normalization routine NORM assumes that r12 = e and rAX = f, where rA = 0 implies rX = 0 and r12 < b. 00 BYTE EQU 1(4:4) Byte size b 01 EXP E&U I:1 Definition of exponent field 0.2 FSUB STA TEMP Floating point subtraction subroutine: 03 LDAN TEMP Change sign of operand. 04 FADD STJ EXITF Floating point addition subroutine: 05 JOV OFLO Ensure overflow is off. 06 STA TEMP TEMP +-v. 07 LDX ACC rX + 21. 08 CMPA ACC @XI ) Steps Al. A2, A3 are combined here: 09 JGE IF Jump if e, 2 e, . 10 STX FU (0 : 4) FU+ fffffo. 11 LD2 ACC@XP) r12 + e, . 12 STA FV (0 : 4) 13 LDIN TEMP @XI ) rI1 + -e, . 1-4 JMP 4F 15 IH STA FU (0 : 4) FU t f f f f f 0 (u, v interchanged) . 16 LD2 TEMP (EXP) r12 + e, . 17 STX FV (0 : 4) 18 LDlN ACC (!ZXF ) rI1 t -e, .

Top web site - 4.2.1 SINGLE-PRECISION CALCULATIONS 201 N4. Scale right Nl.

May 6th, 2008

4.2.1 SINGLE-PRECISION CALCULATIONS 201 N4. Scale right Nl. underflow N7. Pack Fig. 3. Normalization of (e, f). base-b digits to the right of the radix point. If such a large accumulator is not available, it is possible to shorten the requirement to p + 2 or p + 3 places if proper precautions are taken; the details are given in exercise 5.1 A6. [Add.] Set fw +- fU + fV. A7. [Normalize.] (At this point (G,,, fiu) represents the sum of u and 21, but lfwj may have more than p digits, and it may be greater than unity or less than l/b.) Perform Algorithm N below, to normalize and round (ezu, jiu) into the final answer. 1 Algorithm N (Normalization). A raw exponent e and a raw fraction f are converted to normalized form, rounding if necessary to p digits. This algorithm assumes that ]f 1 < b. Nl. [Test f .] If If] 2 1 ( fraction overflow ), go to step N4. If f = 0, set e to its lowest possible value and go to step N7. N2. [Is f normalized?] If ] f ] 2 l/b, go to step N5. N3. [Scale left.] Shift f to the left by one digit position (i.e., multiply it by b), and decrease e by 1. Return to step N2. N4. [Scale right.] Shift f to the right by one digit position (i.e., divide it by b), and increase e by 1. N5. [Round.] Round f to p places. (We take this to mean that f is changed to the nearest multiple of b-p. It is possible that (bpf)modl = f so that there are two nearest multiples; if b is even, we choose the one that makes bPf + $b odd. Further discussion of rounding appears in Section 4.2.2.) It is important to note that this rounding operation can make If] = 1 ( rounding overflow ); in such a case, return to step N4. N6. [Check e.] If e is too large, i.e., larger than its allowed range, an exponent overflow condition is sensed. If e is too small, an exponent underflow condition is sensed. (See the discussion below; since the result cannot

200 ARITHMETIC 4.2.1 Al. Unpack i-; A7. Normalize (Web hosts)

May 6th, 2008

200 ARITHMETIC 4.2.1 Al. Unpack i-; A7. Normalize I Fig. 2. Floating point addition. Note: Since floating point arithmetic is inherently approximate, not exact, we will use round symbols $7 8, 8, 0 to stand for ffoating point addition, subtraction, multiplication, and division, respectively, in order to distinguish approximate operations from the true ones. The basic idea involved in floating point addition is fairly simple: Assuming that eu 2 e,, we take e, = eu, fW = fu. + fV/beu+~ (thereby aligning the radix points for a meaningful addition), and normalize the result. Several situations can arise that make this process nontrivial, and the following algorithm explains the method more precisely. Algorithm A (Floating point addition). Given base b, excess 4, p-digit, normalized floating point numbers u = (ezL, fu) and Y = (e,, fV), this algorithm forms the sum w = u$ w. The same procedure may be used for floating point subtraction, if -v is substituted for V. Al. [Unpack.] Separate the exponent and fraction parts of the representations of u and v. A2. [Assume e, 2 e,.] If e, < e,, interchange u and w. (In many cases, it is best to combine step A2 with step Al or with some of the later steps.) A3. [Set e,.] Set e, c ezL . A4. [Test e, -e, .] If e, -e, 2 p+ 2 (large difference in exponents), set fW c fu. and go to step A7. (Actually, since we are assuming that ZL is normalized, we could terminate the algorithm; but it is occasionally useful to be able to normalize a possibly unnormalized number by adding zero to it.) A5. [Scale right.] Shift f,, to the right e, -e, places; i.e., divide it by Vu- u. [Note: This will be a shift of up to p + 1 places, and the next step (which adds fiL to fV) thereby requires an accumulator capable of holding 2p + 1

4.2.1 SINGLE-PRECISION CALCULATIONS 199 used for this purpose, (Cheap web hosting)

May 5th, 2008

4.2.1 SINGLE-PRECISION CALCULATIONS 199 used for this purpose, notably characteristic and mantissa ; but it is an abuse of terminology to call the fraction part a mantissa, since this concept has quite a different meaning in connection with logarithms. Furthermore the English word mantissa means a worthless addition. ) The MIX computer assumes that its floating point numbers have the form (4 Here we have base b, excess q, floating point notation with four bytes of precision, where b is the byte size (e.g., b = 64 or b = loo), and q is equal to [$b]. The fraction part is f j j j j, and e is the exponent, which lies in the range 0 2 e < b. This internal representation is typical of the conventions in most existing computers, although b is a much larger base than usual. B. Normalized calculations. A floating point number (e, j) is normalized if the most significant digit of the representation of f is nonzero, so that l/b I VI < 1; (5) or if j = 0 and e has its smallest possible value. It is possible to tell which of two normalized floating point numbers has a greater magnitude by comparing the exponent parts first, and then testing the fraction parts only if the exponents are equal. Most floating point routines now in use deal almost entirely with normalized numbers: inputs to the routines are assumed to be normalized, and the outputs are always normalized. Under these conventions we lose the ability to represent a few numbers of very small magnitude-for example, the value (0, .OOOOOOOl) can t be normalized without producing a negative exponent-but we gain in speed, uniformity, and the ability to give relatively simple bounds on the relative error in our computations. (Unnormalized floating point arithmetic is discussed in Section 4.2.2.) Let us now study the normalized floating point operations in detail. At the same time we can consider the construction of subroutines for these operations, assuming that we have a computer without built-in floating point hardware. Machine-language subroutines for floating point arithmetic are usually writ- ten in a very machine-dependent manner, using many of the wildest idiosyn- crasies of the computer at hand; so floating point addition subroutines for two different machines usually bear little superficial resemblance to each other. Yet a careful study of numerous subroutines for both binary and decimal computers reveals that these programs actually have quite a lot in common, and it is possible to discuss the topics in a machine-independent way. The first (and by far the most difficult!) algorithm we shall discuss in this section is a procedure for floating point addition, (eu, .fd $ (euj .fv) = (k, full. (6)

Web site development - 198 ARITHMETIC 4.2 4.2. FLOATING POINT ARITHMETIC IN

May 4th, 2008

198 ARITHMETIC 4.2 4.2. FLOATING POINT ARITHMETIC IN THIS SECTION, we shall study the basic principles of doing arithmetic on floating point numbers, by analyzing the internal mechanisms underlying such calculations. Perhaps many readers will have little interest in this subject, since their computers either have built-in floating point instructions or their computer manufacturer has supplied suitable subroutines. But, in fact, the material of this section should not merely be the concern of computer-design engineers or of a small clique of people who write library subroutines for new machines; every well-rounded programmer ought to have a knowledge of what goes on during the elementary steps of floating point arithmetic. This subject is not at all as trivial as most people think; it involves a surprising amount of interesting information. 4.2.1. Single-Precision Calculations A. Floating point notation. We have discussed fixed point notation for numbers in Section 4.1; in such a case the programmer knows where the radix point is assumed to lie in the numbers he manipulates. For many purposes it is considerably more convenient to let the position of the radix point be dynamically variable or floating as a program is running, and to carry with each number an indication of its current radix point position. This idea has been used for many years in scientific calculations, especially for expressing very large numbers like Avogadro s number N = 6.02252 x 1023, or very small numbers like Planck s constant h = 1.0545 X 1O-27 erg sec. In this section we shall work with base b, excess q, Aoating point numbers with p digits: Such numbers will be represented by pairs of values (e, f), denoting (e, f) = f x bepq. 0) Here e is an integer having a specified range, and f is a signed fraction. We will adopt the convention that Ifl < 1; in other words, the radix point appears at the left of the positional representation of f. More precisely, the stipulation that we have pdigit numbers means that bpf is an integer, and that -bP < bPf < bP. (2) The term floating binary implies that b = 2, floating decimal implies b = 10, etc. Using excess-50 floating decimal numbers with 8 digits, we can write, for example, Avogadro s number N = (74, +.60225200) ; Planck s constant h = (24, +.10545000). (3) The two components e and f of a floating point number are called the exponent and the fraction parts, respectively. (Other names are occasionally

4.1 POSITIONAL (Free web space) NUMBER SYSTEMS 197 b 31. [M3.5]

May 3rd, 2008

4.1 POSITIONAL NUMBER SYSTEMS 197 b 31. [M3.5] A generalization of two s complement arithmetic, called 2-adic numbers, was invented about 1900 by K. Hensel. (In fact he treated p-adic numbers, for any prime p.) A 2-adic number may be regarded as a binary number u = (. .u3u2u1u0.u-1.. .U-&, whose representation extends infinitely far to the left, but only finitely many places to the right, of the binary point. Addition, subtraction, and multiplication of 2-adic numbers are done according to the ordinary procedures of arithmetic, which can in principle be extended indefinitely to the left. For example, 7=(… 000000000000111)2 ) = (. . .110110110110111)2 -7=(… 111111111111001)2 -f = (. . .001001001001001)2 $ = ( .000000000000001.11)2 & = (. .110011001100110.1)2 J–7 = (. .100000010110101)~ or (. .011111101001011)~ Here 7 appears as the ordinary binary integer seven, while -7 is its two s comple- ment (extending infinitely to the left); it is easy to verify that the ordinary procedure for addition of binary numbers will give -7 + 7 = ( . 00000)s = 0, when the procedure is continued indefinitely. The values of 3 and -+ are the unique 2-adic numbers that, when formally multiplied by 7, give 1 and -1, respectively. The values of 3 and & are examples of 2-adic numbers that are not 2-adic integers, since they have nonzero bits to the right of the binary point. The two values of J-7, which are negatives of each other, are the only 2-adic numbers that, when formally squared, yield the value (. . .111111111111001)2. a) Prove that any 2-adic number u can be divided by any nonzero 2-adic number v to obtain a unique 2-adic number w satisfying u = VW. (Hence the set of 2-adic numbers forms a field ; cf. Section 4.6.1.) b) Prove that the 2-adic representation of the rational number -1/(2n + 1) may be obtained as follows, when n is a positive integer: First find the ordinary binary expansion of +1/(2n + l), which has the periodic form (O.cucucu.. .)z for some string o of O s and 1 s. Then -1/(2n + 1) is the 2-adic number (. . ocycy)~. c) Prove that the representation of a 2-adic number u is ultimately periodic (that is, ZLN+X = UN for all large N, for some x 2 1) if and only if u is rational (that is, u = m/n, for some integers m and n). d) Prove that, when n is an integer, fi is a 2-adic number if and only if it satisfies nmod22k+3 = 22k for some nonnegative integer k. (Thus, the possibilities are either n mod S = 1, or n mod 32 = 4, etc.) 32. [M40] (I. Z. Ruzsa.) Prove that there are infinitely many integers whose ternary representation uses only O s and l s and whose quinary representation uses only O s, l s, and 2 s. 33. [M40] (D. A. Klarner.) Let D be any set of integers, let b be any positive integer, and let k, be the number of distinct integers that can be written as n-digit numbers (a,-1 . . . alac)b to base b with digits a, in D. Prove that the sequence (k,) satisfies a linear recurrence relation, and explain how to compute the generating function C,, kd. Illustrate your algorithm in the case b = 3 and D = (-1, 0,3}.

196 ARITHMETIC 4.1 27. [M.Zl] Show that every

May 2nd, 2008

196 ARITHMETIC 4.1 27. [M.Zl] Show that every nonzero integer has a unique reversing binary representa- tion 2-3 -2el + . + (-1)9, where es < el < .. < et. b 28. [M24] Show that every nonzero complex number of the form a + bi where a and b are integers has a unique revolving binary representation (1 + i) + i(l + i) -(1 + i) -i(1 + i) + . . + ?(l + i) , where es < el < ... < et. (Cf. exercise 27.) 29. [M35] (N. G. de Bruijn.) Let SO, S1, Ss, be sets of nonnegative integers; we will say that the collection {SO, ,571, Sz, . } has Property B if every nonnegative integer n can be written in the form n = so + Sl + s2 + . , s3 E s,, in exactly one way. (Property B implies that 0 E S, for all j, since n = 0 can only be represented as 0 + 0 + 0 + . . . .) Any mixed-radix number system with radices bo, bl, bz, . provides an example of a collection of sets satisfying Property B, if we let S, = (0, B3, , (b, -l)B,}, where B, = bobI . . b,-1; here the representation of n = so + Sl + s2 + '.. corresponds in an obvious manner to its mixed-radix repre- sentation (9). Furthermore, if the collection {SO, Si, Sz, . } has Property B, and if Ao, AI, A2, . . is any partition of the nonnegative integers (so that we have A0 U A1 U A2 U . = (0, 1,2,. . } and A, n A3 = 0 for i # j; some A3 s may be empty), then the collapsed collection {TO, ZJ, 7 2, . } also has Property B, where ? , is the set of all sums x2eA s, taken over all possible choices of si E S,. Prove that any collection {TO, T1, 7 2, , } that satisfies Property B may be obtained by collapsing some collection {SO, S1 , S2, . . } that corresponds to a mixed-radix number system. 30. [A4391 (N. G. de Bruijn.) The radix-(-2) number system shows us that every integer (positive, negative, or zero) has a unique representation of the form (-2)el + (--2)Q + . . + (--2)Q, el > e2 > f.. > et 2 0, t 2 0. The purpose of this exercise is to explore generalizations of this phenomenon. a) Let bc, bl, b2, . . be a sequence of integers such that every integer n has a unique representation of the form n = beI + be, + . . + b,, , el > e2 > … > et 2 0, t > 0. (Such a sequence (b,) is called a binary basis. ) Show that there is an index j such that b, is odd, but bk is even for all Ic # j. b) Prove that a binary basis (bn) can always be rearranged into the form do, 2di, 4dz, . . . = (2nd,), where each dk is odd. c) If each of do, di, ds, . in (b) is fl, prove that (b,) is a binary basis if and only if there are infinitely many +1 s and infinitely many -1 s. d) Prove that 7, -13. 2, 7 3 2 , -13. 23, . . , 7. 22k, -13. 22k+1, . is a binary basis, and find the representation of n = 1.

4.1 POSITIONAL NUMBER SYSTEMS 195 20. [HM%] (David (Web server version)

May 2nd, 2008

4.1 POSITIONAL NUMBER SYSTEMS 195 20. [HM%] (David W. Matula.) Consider a decimal number system that uses the digits D = {-l,O, 8,17,26,35,44,53,62,71} instead of (0, 1, ,9}. The result of exercise 19 implies (as in exercise 18) that all real numbers have an infinite decimal expansion using digits from D. In the usual decimal system, exercise 13 points out that some numbers have two representations. (a) Find a real number that has more than two D-decimal repre- sentations. (b) Show that no real number has infinitely many D-decimal representations. (c) Show that uncountably many numbers have two or more D-decimal representations. b 21. [MZ?] (C. E. Shannon.) Can every real number (positive, negative, or zero) be expressed in a balanced decimal system, i.e., in the form xkcn aklOk, for some integer n and some sequence a,, anPl, an-s, . , where each ai is one of the ten numbers {-4$,-3~,-2~,-1~,–2, I Z,1 11 2, 2&,3$, 4$}? (Note that zero is not one of the allowed digits, but we implicitly assume that a,+~, a,+~, are zero.) Find all representations of zero in this number system, and find all representations of unity. 22. [HM25] Let cy = -Crn>i 10Vm2. Given E > 0 and any real number x, prove that there is a decimal representation such that 0 < 1~ - xockcn ~10~1 < E, where each ak is allowed to be only one of the three values 0, 1, or o. (Note that no negative powers of 10 are used in this representation!) 23. [HM30] Let D be a set of b real numbers such that every positive real number has a representation xkcn. akbk with all ak 6 D. Exercise 20 shows that there may be many numbers without unique representations; but prove that the set 7 of all such numbers has measure zero. 24. [M35] Find infinitely many different sets D of ten nonnegative integers satisfying the following three conditions: (i) gcd(D) = 1; (ii) 0 E D; (iii) every positive real number can be represented in the form xkcn -~10 with all ak E D. 25. [A&5] (S. A. Cook.) Let b, u, and w be positive integers, where b > 2 and 0 < u < b . Show that the base b representation of u/v does not contain a run of m consecutive digits equal to b - 1, anywhere to the right of the radix point. (By convention, no runs of infinitely many (b -1) s are permitted in the standard base b representation.) b 26. [HAJ30] (N. S. Mendelsohn.) Let (&) b e a sequence of real numbers defined for all integers n, –co < n < 00, such that Pn < &+I; lim &=co; lim & = 0. n-02 n—o3 Let (c,) be an arbitrary sequence of positive integers that is defined for all integers n, -co < n < cu. Let us say that a number x has a generalized representation if there is an integer n and an infinite sequence of integers a,, anel, an-s, . such that x = Cksn akpk, where a, # 0, 0 < ak < ck, and ak < ck for infinitely many k. Show that every positive real number z has exactly one generalized representation if and only if &+i = c,, 7Lckpk for all n. (Consequently, the mixed-radix systems with integer bases have thisproperty; and mixed-radix systems with pi = (co + l)& , PZ = (ci + l)(co + l)&, . , p-1 = &/(c.-1 + l), . are the most general number systems of this type. )

Managed web hosting - 194 ARITHMETIC 4.1 5. (001 Explain why a

May 1st, 2008

194 ARITHMETIC 4.1 5. (001 Explain why a negative integer in nines complement notation has a repre- sentation in ten s complement notation that is always one greater, if the representations are regarded as positive. 6. [16] What are the largest and smallest pbit integers that can be represented in (a) signed-magnitude binary notation (including one bit for the sign), (b) two s complement notation, (c) ones complement notation? 7. [M.Z0] The text defines ten s complement notation only for integers represented in a single computer word. Is there a way to define a ten s complement notation for all real numbers, having infinite precision, analogous to the text s definition? Is there a similar way to define a nines complement notation for all real numbers? 8. [MIo] Prove Eq. (5). b 9. [15] Change the following octal numbers to hexadecimal notation, using the hexadecimal digits 0, 1, . . . , F: 12; 5655; .2550.276; 76545336; 3726755. 10. [M.%?] Generalize Eq. (5) to mixed-radix notation. 11. [z?] Design an algorithm that uses the -2 number system to compute the sum of (a,. . alao)- and (b, . . blbo)-l, obtaining the answer (c,+z . c~co)-~. 12. [~3] Specify algorithms that convert (a) the binary signed magnitude number f(arl . . . ao)z to its negabinary form (&+I . . . bo)-2; and (b) the negabinary number Pn+l . . . bo)-z to its signed magnitude form &(a,+1 . . . a0)2. b 13. [A&1] In the decimal system there are some numbers with two infinite decimal expansions; e.g., 2.3599999.. . = 2.3600000.. . Does the negadecimal (base -10) system have unique expansions, or are there real numbers with two different infinite expansions in this base also? 14. [14] Multiply (11321)2i by itself in the quater-imaginary system using the method illustrated in the text. 15. [A4.24] What are the sets s={~m~yk~ ak an allowable digit } , - analogous to Fig. 1, for the negative decimal and for the quater-imaginary number systems? 16. [A&.24] Design an algorithm to add 1 to (a,. . . ala~)~-l in the i-l number system. 17. [A&N] It may seem peculiar that i -1 has been suggested as a number-system base, instead of the similar but intuitively simpler number i + 1. Can every complex number a+ bi, where a and b are integers, be represented in a positional number system to base i f 1, using only the digits 0 and l? 18. [HiV3.2] Show that the set S of Fig. 1 is a closed set that contains a neighborhood of the origin. (Consequently, every complex number has a binary representation to base i -1.) b 19. [Z3] (David W. Matula.) Let D be a set of b integers, containing exactly one solution to the congruence 2 3 j (modulo b) for 0 5 j < b. Prove that all integers m (positive, negative, or zero) can be represented in the form m = (a,. . . aO)b, where all the a3 are in D, if and only if all integers in the range 1 5 m < u can be so represented, where 1 = -max{ a ( a E D}/(b -l), u = -min{ a 1 a E D}/(b -1). For example, D = {-l,O,. . , b -2} satisfies the conditions for all b 2 3. [Hint: Design an algorithm that constructs a suitable representation.]

4.1 POSITIONAL NUMBER SYSTEMS 193 Mixed-radix systems are (Web hosting top)

April 30th, 2008

4.1 POSITIONAL NUMBER SYSTEMS 193 Mixed-radix systems are familiar in everyday life, when we deal with units of measure. For example, the quantity 3 weeks, 2 days, 9 hours, 22 minutes, 57 seconds, and 492 milliseconds is equal to seconds 7, 24, 60, 60; 10004g2 1 3, 2, , 22, 57; The quantity 10 pounds, 6 shillings, and thruppence ha penny was once equal to [loI SE> 1i! 41 pence in British currency, before Great Britain changed to a purely debirnil monetary system. It is possible to add and subtract mixed-radix numbers by using a straightfor- ward generalization of the usual addition and subtraction algorithms, provided of course that the same mixed-radix system is being used for both operands (see exercise 4.3.1-9). Similarly, we can easily multiply or divide a mixed-radix number by small integer constants, using simple extensions of the familiar pencil- and-paper methods. Mixed-radix systems were first discussed in full generality by Georg Cantor [Zeitschrift fiir Math. und Physik 14 (1869), 121-1281. Exercises 26 and 29 give further information about them. Some questions concerning irrational radices have been investigated by W. Parry, Acta Mathematics, Acad. Sci. Hung., 11 (1960), 401-416. Besides the systems described in this section, several other ways to represent numbers are mentioned elsewhere in this series of books: the binomial number system (exercise 1.2.6-56); the Fibonacci number system (exercises 1.2.8-34, 5.4.2-10); the phi number system (exercise 1.2.8-35); modular representations (Section 4.3.2); Gray code (Section 7.2.1); and roman numerals (Section 9.1). EXERCISES 1. [IS] Express -10, -9, , 9, 10 in the number system whose base is -2. b 2. [,%$I Consider the following four number systems: (a) binary (signed magnitude); (b) negabinary (radix -2); (c) balanced ternary; and (d) radix b = &. Use each of these four number systems to express each of the following three numbers: (i) -49; (ii) -33 (show the repeating cycle); (iii) 7r (to a few significant figures). 3. [Zoo] Express -49 + i in the quater-imaginary system. 4. [IS] Assume that we have a MIX program in which location A contains a number for which the radix point lies between bytes 3 and 4, while location B contains a number whose radix point lies between bytes 2 and 3. (The leftmost byte is number 1). Where will the radix point be, in registers A and X, after the following instructions? 4 LDA A b) LDA A WJL BD SRAX 5 DIV B 1