Archive for January, 2008

Web hosting reseller - 64 RANDOM NUMBERS 3.3.2 F. Permutation test. Divide

Thursday, January 24th, 2008

64 RANDOM NUMBERS 3.3.2 F. Permutation test. Divide the input sequence into n groups oft elements each, that is, (Ujt, Ujt+l, . . . , Ujt++r) for 0 5 j < n. The elements in each group can have t! possible relative orderings; the number of times each ordering appears is counted, and a chi-square test is applied with k = t! and with probability l/t! for each ordering. For example, if t = 3 we would have six possible categories, according to whether Usj < Usj+l < Us3+z or U3j < U3j+2 < U3j+l or *** or U3j+2 < U3j+l < U3je We assume in this test that equality between U s does not occur; such an assumption is justified, for the probability that two U s are equal is zero. A convenient way to perform the permutation test on a computer makes use of the following algorithm, which is of interest in itself: Algorithm P (Analyze a permutation). Given a sequence of distinct elements (ht..., Ut), we compute an integer f(Ul, . . . , Ut) such that 0 2 f(U1, . f * , Ut) < t!, and f(Ul,. . . , Ut) = f(Vl,. . . , Vt) if and only if (VI,. . . , Ut) and (VI,. . . ,Vt) have the same relative ordering. Pl. [Initialize.] Set r c t, f c 0. (During this algorithm we will have 0 5 f < t!/r!.) P2. [Find maximum.] Find the maximum of {VI,. . . , UT}, and suppose that U, is the maximum. Set f t T. f + s - 1. P3. [Exchange.] Exchange U, c* Us. P4. [Decrease r.] Decrease r by one. If r > 1, return to step P2. 1 Note that the sequence (VI,. . . , Ut) will have been sorted into ascending or- der when this algorithm stops. To prove that the result f uniquely characterizes the initial order of (VI,. . ., Ut), we note that Algorithm P can be run backwards: For r = 2, 3, . . . , t, set s c f modr, f t [f/r], and exchange U,. Us. It is easy to see that this will undo the effects of steps P2-P4; hence no two permutations can yield the same value of f, and Algorithm P performs as advertised. The essential idea that underlies Algorithm P is a mixed-radix representation called the factorial number system : Every integer in the range 0 2 f < t! can be uniquely written in the form f = (. * .(6-l x (t -1) + G-2) x (t -2) + * * * + cz) x 2 + Cl = (t - l)! Q-1 + (t - 2)! Q-2 + * * * + 2! cz + l! Cl (7) where the digits Cj are integers satisfying Cl 5 Cj L j, for 1 2 j < t. (8) In Algorithm P, cr.-1 = s - 1 when step P2 is performed for a given value of T.

3.3.2 EMPIRICAL TESTS 63 Algorithm C (Data for (Frontpage web hosting)

Thursday, January 24th, 2008

3.3.2 EMPIRICAL TESTS 63 Algorithm C (Data for coupon collector s test). Given a sequence of integers Yo, K, , with 0 5 Y, < d, this algorithm counts the lengths of n consecutive coupon collector segments. At the conclusion of the algorithm, COUNT[r] is the number of segments with length r, for d 5 T < t, and COUNT[t] is the number of segments with length 2 t. Cl. [Initialize.] Set j c -1, s c 0, and set COUNT[r] +- 0 for d 5 r 5 t. C2. [Set q,r zero.] Set o c T c 0, and set OCCURS[k] c 0 for 0 5 k < d. C3. [Next observation.] Increase T and j by 1. If OCCURS[Yj] # 0, repeat this step. C4. [Complete set?] Set OCCURS[Y~] c 1 and o t q + 1. (The subsequence observed so far contains q distinct values; if q = d, we therefore have a complete set.) If q < d, return to step C3. C5. [Record the length.] If r > t, increase COUNT[t] by one, otherwise increase COUNT[r] by one. C6. [n found?] Increase s by one. If s < n, return to step C2. 1 For an example of this algorithm, see exercise 7. We may think of a boy collecting d types of coupons, which are randomly distributed in his breakfast cereal boxes; he must keep eating more cereal until he has one coupon of each type. A chi-square test is to be applied to COUNT[d], COUNT[d + 11, . . . , COUNT[t], with k = t-d+ 1, after Algorithm C has counted n lengths. The corresponding probabilities are d

62 RANDOM NUMBERS 3.3.2 D. Poker test (Partition

Wednesday, January 23rd, 2008

62 RANDOM NUMBERS 3.3.2 D. Poker test (Partition test). The classical poker test considers 72 groups of . . five successive integers, (Ys3, Ysj+l, . . . , Ysj+4) for 0 5 j < n, and observes which of the following seven patterns is matched by each quintuple: All different: abcde Full house: aaabb One pair: aabcd Four of a kind: aaaab Two Pairs: aabbc Five of a kind: aaaaa Three of a kind: aaabc A chi-square test is based on the number of quintuples in each category. It is reasonable to ask for a somewhat simpler version of this test, to facilitate the programming involved. A good compromise would simply be to count the number of distinct values in the set of five. We would then have five categories: 5 different = all different; 4 different = one pair; 3 different = two pairs, or three of a kind; 2 different = full house, or four of a kind; 1 different = five of a kind. This breakdown is easier to determine systematically, and the test is nearly as good. In general we can consider n groups of k successive numbers, and we can count the number of k-tuples with T different values. A chi-square test is then made, using the probability p = d(d-l)…(d–+l) 7 dk that there are r different. (The Stirling numbers {F} are defined in Section 1.2.6, and they can readily be computed using the formulas given there.) Since the probability p, is very small when r = 1 or 2, we generally lump a few categories of low probability together before the chi-square test is applied. To derive the proper formula for p,, we must count how many of the d k-tuples of numbers between 0 and d - 1 have exactly T different elements, and divide the total by dk. Since d(d - 1). . . (d -T + 1) is the number of ordered choices of r things from a set of d objects, we need only show that {F} is the number of ways to partition a set of k elements into exactly T parts. Therefore exercise 1.2.6-64 completes the derivation of Eq. (5). E. Coupon collector s test. This test is related to the poker test somewhat as the gap test is related to the frequency test. The sequence Yo, Yr, . . . is used, and we observe the lengths of segments YJ+l, Yy+s, . . . , YJ+r required to get a complete set of integers from 0 to d -1. Algorithm C describes this precisely:

3.3.2 EMPIRICAL TESTS 61 G2. Set T zero

Tuesday, January 22nd, 2008

3.3.2 EMPIRICAL TESTS 61 G2. Set T zero G4. Increase T Fig. 6. Gathering data for the gap test. (Algorithms for the coupon-collector s test and the run test are similar.) Gl. [Initialize.] Set j t -1, s t 0, and set COUNT[r] t 0 for 0 2 T 2 t. 62. [Set T zero.] Set r t 0. 63. [o 2 Uj < ,B?] Increase j by 1. If Uj 2 LY and Uj < p, go to step G5. 64. [Increase r.] Increase r by one, and return to step G3. 65. [Record gap length. ] (A gap of length r has now been found.) If r 2 t, increase COUNT[t] by one, otherwise increase COUNT[?+] by one. G6. [n gaps found?] Increase s by one. If s < n, return to step G2. 1 After this algorithm has been performed, the chi-square test is applied to the k = t + 1 values of COUNT[O], COUNT[l], . . . , COUNT[t], using the following probabilities: PO = P, Pl = P(l -PI, P2 = P(1 - P)2, . . . , Pt-1 = P(l -Py, Pt = (1 -PI . (4) Here p = ,L3-LY, the probability that Q 5 Uj < p. The values of n and t are to be chosen, as usual, so that each of the values of COUNT[r] is expected to be 5 or more, preferably more. .! The gap test is often applied with Q: = 0 or p = 1 in order to omit one of the comparisons in step G3. The special cases (0, p) = (0, 3) or (4, 1) give rise to tests that are sometimes called runs above the mean and runs below the mean, respectively. The probabilities in Eq. (4) are easily deduced, so this derivation is left to the reader. Note that the gap test as described above observes the lengths of n gaps; it does not observe the gap lengths among n numbers. If the sequence (Un) is sufficiently nonrandom, Algorithm G may not terminate. Other gap tests that examine a fixed number of U s have also been proposed (see exercise 5).

60 RANDOM NUMBERS 3.3.2 The equanimity of your (Web hosts)

Tuesday, January 22nd, 2008

60 RANDOM NUMBERS 3.3.2 The equanimity of your average tosser of coins depends upon a law which ensures that he will not upset himself by losing too much nor upset his opponent by winning too often. -TOM STOPPARD. Rosencrantz & Guildenstern are Dead (1966) B. Serial test. More generally, we want pairs of successive numbers to be uniformly distributed in an independent manner. The sun comes up just about as often as it goes down, in the long run, but this doesn t make its motion random. To carry out the serial test, we simply count the number of times that the pair (Ys,, Y2jtl) = (4, r) occurs, for 0 2 j < n; these counts are to be made for each pair of integers (4, T) with 0 5 4, r < d, and the chi-square test is applied to these k = d2 categories with probability l/d2 in each category. As with the equidistribution test, d may be any cocvenient number, but it will be somewhat smaller than the values suggested above since a valid chi-square test should have n large compared to k (say n 2 5d2 at least). Clearly we can generalize this test to triples, quadruples, etc., instead of pairs (see exercise 2); however, the value of d must then be severely reduced in order to avoid having too many categories. When quadruples and larger numbers of adjacent elements are considered, we therefore make use of less exact tests such as the poker test or the maximum test described below. Note that 2n numbers of the sequence (2) are used in this test in order to make 12 observations. It would be a mistake to perform the serial test on the pairs (Yo, Yi), (Yi, Y2), . . . , (Y,-i, Yn); can the reader see why? We might perform another serial test on the pairs (Yzj+i, Yzj+2), and expect the sequence to pass both tests. Alternatively, I. J. Good has proved that if d is prime, and if the pairs (Yo, Yl), (Yi, Ys), . . . , (Y,-.1, Y,) are used, and if we use the usual chi-square method to compute both the statistics V2 for the serial test and VI for the frequency test on Yo, . . . , Y,-, with the same value of d, then V2 -2Vl should have the chi-square distribution with (d - 1)2 degrees of freedom when n is large. (See Proc. Cambridge Phil. Sot. 49 (1953), 276-284; Annals Math. Stat. 28 (1957), 262-264.) C. Gap test. Another test is used to examine the length of gaps between occurrences of lJ, in a certain range. If CY and p are two real numbers with 0 5 cy. < p 5 1, we want to consider the lengths of consecutive subsequences u,, uj+1, , Uj+r in which UY+~ lies between CY and ,0 but the other U s do not. (This subsequence of T + 1 numbers represents a gap of length T.) Algorithm G (Data for gap test). The following algorithm, applied to the sequence (1) for any given values of cy and p, counts the number of gaps of lengths 0, 1, . . . , t -1 and the number of gaps of length 2 t, until 12 gaps have been tabulated.

3.3.2 EMPIRICAL TESTS 59 (Free web space) b 24. [4U] Experiment

Monday, January 21st, 2008

3.3.2 EMPIRICAL TESTS 59 b 24. [4U] Experiment with various probability distributions (p, q, T) on three categories, where p + q + r = 1, by computing the exact distribution of the chi-square statistic V for various TL, thereby determining how accurate ari approximation the chi-square distribution with two degrees of freedom really is. 3.3.2. Empirical Tests In this section we shall discuss ten kinds of specific tests that have been applied to sequences in order to investigate their randomness. The discussion of each test has two parts: (a) a plug-in description of how to perform the test; and (b) a study of the theoretical basis for the test. (Readers lacking mathe- matical training may wish to skip over the theoretical discussions. Conversely, mathematically-inclined readers may find the associated theory quite interest- ing, even if they never intend to test random number generators, since some instructive combinatorial questions are involved here.) Each test is applied to a sequence KM = uo, Ul, u2,. . . (1) of real numbers, which purports to be independently and uniformly distributed between zero and one. Some of the tests are designed primarily for integer-valued sequences, instead of the real-valued sequence (1). In this case, the auxiliary sequence (Y,)=yo,K,yz,…, (2) which is defined by the rule K = LdUnJ, (3) is used instead. This is a sequence of integers that purports to be independently and uniformly distributed between 0 and d -1. The number d is chosen for convenience; for example, we might have d = 64 = 26 on a binary computer, so that Y, represents the six most significant bits of the binary representation of U,. The value of d should be large enough so that the test is meaningful, but not so large that the test becomes impracticably difficult to carry out. The quantities U,, Y,, and d will have the above significance throughout this section, although the value of d will probably be differenl in different tests. A. Equidistribution test (Frequency test). The first requirement that sequence (1) must meet is that its numbers are, in fact, uniformly distributed between zero and one. There are two ways to make this test: (a) Use the Kolmogorov- Smirnov test, with F(z) = 5 for 0 < II: 5 1. (b) Let d be a convenient number, e.g., 100 on a decimal computer, 64 or 128 on a binary computer, and use the sequence (2) instead of (1). For each integer T, 0 5 r < d, count the number of times that Y3 = r for 0 5 j < n, and then apply the chi-square test using Ic = d and probability p, = l/d for each category. The theory behind this test has been covered in Section 3.3.1.

58 RANDOM NUMBERS 3.3.1 b 16. (Web hosting domain) [MM%] Generalize

Sunday, January 20th, 2008

58 RANDOM NUMBERS 3.3.1 b 16. [MM%] Generalize Theorem 1.2.11.3A to find the value of 7(x + 1, x + .zdz + Y)lUX + l), for large result to z and fixed y, z. find the approximate Disregard solution, terms t, to of the answer the equation that are 0(1/x). Use this Y($ po) =P, for large v and fixed p, thereby accounting for the asymptotic formulas indicated in Table 1. [Hint: See exercise 1.2.11.3-8.1 17. [HM96] Let t be a fixed real number. For 0 2 k 5 n, let by convention, let P&x) = 1. Prove the following relations: a)P~k(x)=~i+ dX~~~ldX,i …/ : dXk+l~zktldl* …lz2dxl. b) P,~(z) = (x + t) /n! -(x + t) - /(n -I)!. (k -t)k c) Pnk(x) -Pn(k-l)(X) = ~ P(+-k)o(x -k), if 1 5 k 5 n. k! d) Obtain a general formula for P&(x), and apply it to the evaluation of Eq. (24). 18. [A4%7] Give a simple reason why K; has the same probability distribution as K,+. 19. [HM@ Develop tests, analogous to the Kolomogrov-Smirnov test, for use with multivariate distributions F(xi, . , x7) = probability that (Xl < x1, . . . , X, 5 x7). (Such procedures could be used, for example, in place of the serial test in the next section.) 20. [HM41] Deduce further terms of the asymptotic behavior of the KS distribution, extending (28). 21. [M40] Although the text states that the KS test should be applied only when F(x) is a continuous distribution function, it is, of course, possible to try to compute K,f and K; even when the distribution has jumps. Analyze the probable behavior of K,f and K, for various discontinuous distributions F(x). Compare the effectiveness of the resulting statistical test with the chi-square test on several samples of random numbers. 22. [HM46] Investigate the improved KS test suggested in the answer to exercise 6. 23. [M,% .2] (T. Gonzalez, S. Sahni, and W. R. Franta.) (a) Suppose that the maxi- mum value in formula (13) for the KS statistic K$ occurs at a given index j where ]nF(Xj)] = k. Prove that F(Xy) = rnaxi~i~~{ F(Xi) 1 [nF(X,)] = k }. (b) Design an algorithm that calculates Kz and K; in O(n) steps (without sorting).

3.3.1 GENERAL TEST PROCEDURES 57 8. [OO] The (Web server certificate)

Saturday, January 19th, 2008

3.3.1 GENERAL TEST PROCEDURES 57 8. [OO] The text describes an experiment in which 20 values of the statistic K& were obtained in the study of a random sequence. These values were plotted, to obtain Fig. 4, and a KS statistic was computed from the resulting graph. Why were the table entries for n = 20 used to study the resulting statistic, instead of the table entries for n = 101 b 9. [%I] The experiment described in the text consisted of plotting 20 values of K&, computed from the maximum of 5 test applied to different parts of a random sequence. We could have computed also the corresponding 20 values of K,; since Kc has the same distribution as K&, we could lump together the 40 values thus obtained (that is, 20 of the K$ s and 20 of the K, s), and a KS test could be applied so that we would get new values K&, K,. Di scuss the merits of this idea. b 10. [XI] Suppose a chi-square test is done by making n observations, and the value V is obtained. Now we repeat the test on these same n observations over again (getting, of course, the same results), and we put together the data from both tests, regarding it as a single chi-square test with 2n observations. (This procedure violates the text s stipulation that all of the observations must be independent of one another.) How is the second value of V related to the first one? 11. [IO] Solve exercise 10 substituting the KS test for the chi-square test. 12. [A&?81 Suppose a chi-square test is made on a set of n observations, assuming that ps is the probability that each observation falls into category s; but suppose that in actual fact the observations have probability qs # ps of falling into category s. (Cf. exercise 3.) We would, of course, like the chi-square test to detect the fact that the ps assumption was incorrect. Show that this will happen, if n is large enough. Prove also the analogous result for the KS test. 13. [M.Z4] Prove that Eqs. (13) are equivalent to Eqs. (11). b 14. [HA&Z6] Let 2, be given by Eq. (18). Show directly by using Stirling s approxima- tion that the multinomial probability n!pyL..p;ti/Y,!…Y,! = e -V 2/~(2nn)k-1pl .pk + o(n-k 2), if Zr,Zs, . . . , zk are bounded as n + 00. (This idea leads to a proof of the chi-square test that is much closer to first principles, and requires less handwaving, than the derivation in the text.) 15. [HM24] Polar coordinates in two dimensions are conventionally defined by the equations x = r cos 8 and y = r sin 8. For the purposes of integration, we have dx dy = r dr do. More generally, in n-dimensional space we can let xk = r sin81 . . . sin&l cos&, l

56 RANDOM (Web server setup) NUMBERS 3.3.1 The more precise asymptotic

Saturday, January 19th, 2008

56 RANDOM NUMBERS 3.3.1 The more precise asymptotic formulas in Table 2 follow from results obtained by D. A. Darling [Theory of Prob. and Appl. 5 (1960) 356-3611, who proved among other things that K,+ 5 s with probability 1 - CZSZ(l -$s/fi + 0(1/n)) (28) for all fixed s 2 0. EXERCISES 1. [OO] What line of the chi-square table should be used to check whether or not the value V = 7& of Eq. (5) is improbably high? 2. [Zoo] If two dice are loaded so that, on one die, the value 1 will turn up exactly twice as often as any of the other values, and the other die is similarly biased towards 6, compute the probability ps that a total of exactly s will appear on the two dice, for 2 2 s 2 12. b 3. [23] Some dice that were loaded as described in the previous exercise were rolled 144 times, and the following values were observed: value of s = 2 3 4 5 6 7 8 9 10 11 12 observed number, Y, = 2 6 10 16 18 32 20 13 16 9 2 Apply the chi-square test to these values, using the probabilities in (l), pretending it is not known that the dice are in fact faulty. Does the chi-square test detect the bad dice? If not, explain why not. b 4. [z?] The author actually obtained the data in experiment 1 of (9) by simulating dice in which one was normal, the other was loaded so that it always turned up 1 or 6. (The latter two possibilities were equally probable.) Compute the probabilities that replace (1) in this case, and by using a chi-square test decide if the results of that experiment are consistent with the dice being loaded in this way. 5. [,B] Let F(z) be the uniform distribution, Fig. 3(b). Find K& and KG for the following 20 observations: 0.414, 0.732, 0.236, 0.162, 0.259, 0.442, 0.189, 0.693, 0.098, 0.302, 0.442, 0.434, 0.141, 0.017, 0.318, 0.869, 0.772, 0.678, 0.354, 0.718, and state whether these observations are significantly different from expected behavior with respect to either of these two tests. 6. [A&V] Consider Fn(z), as given in Eq. (lo), for fixed 2. What is the probability that Fn(z) = s/n, given an integer s? What is the mean value of Fn(z)? What is the standard deviation? 7. [MU] Show that Kz and K; can never be negative. What is the largest possible value K,+ can be?

3.3.1 GENERAL TEST PROCEDURES 55 To study the (Web hosting directory)

Friday, January 18th, 2008

3.3.1 GENERAL TEST PROCEDURES 55 To study the distribution of K,f and K;, we begin with the following basic fact: If X is a random variable with the continuous distribution F(z), then F(X) is a uniformly distributed real number between 0 and 1. To prove this, we need only verify that if 0 5 y 5 1 we have F(X) 5 y with probability y. Since F is continuous, F(Q) = y for some ~0; thus the probability that F(X) 5 y is the probability that X 5 20. By definition, the latter probability is F(Q), that is, it is y. Let YJ = nF(X,), for 1 2 j 5 n, where the X s have been sorted as in Step 2 above. Then the variables Yy are essentially the same as independent, uniformly distributed random numbers between 0 and 1 that have been sorted into nondecreasing order, Yl 2 YZ 5 . .. 2 Y,; and the first equation of (13) may be transformed into Kz = Lmax(l-Y1,2-Yz ,…, n-Y,). 6 If 0 5 t 5 n, the probability that K,+ 5 t/J;; is therefore the probability that Yj 2 j-t for 1 5 j 5 n. This is not hard to express in terms of n-dimensional integrals, where aj = max(j -t, 0). (24) The denominator here is immediately evaluated: it is found to be nn/n!, which makes sense since the hypercube of all vectors (yl, ~2,. . . , yn) with 0 2 yj < n has volume nn, and it can be divided into n! equal parts corresponding to each possible ordering of the y s. The integral in the numerator is a little more difficult, but it yields to the attack suggested in exercise 17, and we get the general formula probability that (KL 5 $) = ~o~<~(~)(ic-t)*(t+li-ic)ntl- (25) The distribution of K; is exactly the same. Equation (25) was first obtained by Z. W. Birnbaum and Fred H. Tingey [Annals Math. Stat. 22 (1951), 592-5961; it may be used to extend Table 2. In his original paper, Smirnov proved that lim probability that (KL 2 s) = 1 - C2 , if s > 0. (26) 12 00 This together with (25) implies that, for all s 2 0, we have