Archive for January, 2008

Web design course - 74 RANDOM NUMBERS 3.3.2 7. [08] Apply the

Thursday, January 31st, 2008

74 RANDOM NUMBERS 3.3.2 7. [08] Apply the coupon collector s test procedure (Algorithm C) with d = 3 and 72 = 7, to the following sequence: 1101221022120202001212201010201121. What lengths do the seven subsequences have? b 8. [MZ?] How many U s need to be examined, on the average, in the coupon collec- tor s test (Algorithm C) before 72 complete sets have been found, assuming that the sequence is random? What is the standard deviation? [Hint: See Eq. 1.2.9-28.1 9. [A4,%?1] Generalize the coupon collector s test so that the search stops as soon as w distinct values have been found, where w is a fixed positive integer less than or equal to d. What probabilities should be used in place of (6)? 10. [A&% ] Solve exercise 8 for the more general coupon collector s test described in exercise 9. 11. [OO] The runs up in a particular permutation are displayed in (9); what are the runs down in that permutation? 12. [ZOO] Let V0, Ul, . . . , V,-l be n distinct numbers. Write an algorithm that determines the lengths of all ascending runs in the sequence. When your algorithm terminates, COUNT[r] should be the number of runs of length r, for 1 5 r 5 5, and COUNF[S] should be the number of rdns of length 6 or more. 13. [MZ?] Show that (16) is the number of permutations of p+q+l distinct elements having the pattern (15) . b 14. [A&5] If we throw away the element that immediately follows a run, so that when Xj is greater than X3+1 we start the next run with X3+2, the run lengths are independent, and a simple chi-square test may be used (instead of the horribly compli- cated method derived in the text). What are the appropriate run-length probabilities for this simple run test? 15. [A4101 In the maximum-of-t test, why are Vi, V, , . . . , VAwl supposed to be uniformly distributed between zero and one? b 16. [IS] (a) Mr. J. H. Quick (a student) wanted to perform the maximum-of-t test for various values of t. Letting Zj, = max(Vj, U3+1,. . . , Uj+t–l), he found a clever way to go from the sequence ZO(~-~), Z1(t-l), , to the sequence Zot, Zlt, . . , using very little time and space. What was his bright idea? (b) He decided to modify the maximum-of-t method so that the jth observation would be max(u, , . . . , Uj+,-,); in other words, he took V, = Zj, instead of V, = Z ctjjt as the text says. He reasoned that all of the Z s should have the same distribution, so the test is even stronger if each ZJt, 0 5 j < n, is used instead of just every tth one. But when he tried a chi-square equidistribution test on the values of V$, he got extremely high values of the statistic V, which got even higher as t increased. Why did this happen? 17. (A4Z5] (a) Given any numbers UO, . . , Un-l, Vi,. . . , Vn-l, let 1 ai=- c uk, v=-; c vk. n O

3.3.2 EMPIRICAL TESTS 73 unpublished], is unlike the (Adult web hosting)

Thursday, January 31st, 2008

3.3.2 EMPIRICAL TESTS 73 unpublished], is unlike the others in that it was not developed before the advent of computers; it is specifically intended for computer use. The reader probably wonders, Why are there so many tests? It has been said that more computer time is spent testing random numbers than using them in applications! This is untrue, although it is possible to go overboard in testing. The need for making several tests has been amply documented. It has been recorded, for example, that some numbers generated by a variant of the middle- square method have passed the frequency test, gap test, and poker test, yet flunked the serial test. Linear congruential sequences with small multipliers have been known to pass many tests, yet fail on the run test because there are too few runs of length one. The maximum-of-t test has also been used to ferret out some bad generators that otherwise seemed to perform respectably. Perhaps the main reason for doing extensive testing on random number generators is that people misusing Mr. X s random number generator will hardly ever admit that their programs are at fault: they will blame the generator, until Mr. X can prove to them that his numbers are sufficiently random. On the other hand, if the source of random numbers is only for Mr. X s personal use, he might decide not to bother to test them, since the techniques recommended in this chapter have a high probability of being satisfactory. EXERCISES 1. [10] Why should the serial test described in part B be applied to (Yo, Yl), (Yz, Ys), . , (Ygnp2,Yzn–1) instead of to (Yo, Yl), (Yl, Yz), . . . , (Y,-,, Y,)? 2. [IO] State an appropriate way to generalize the serial test to triples, quadruples, etc., instead of pairs. b 3. [MX?] How many U s need to be examined in the gap test (Algorithm G) before n gaps have been found, on the average, assuming that the sequence is random? What is the standard deviation of this quantity? 4. [&?I Prove that the probabilities in (4) are correct for the gap test. 5. [Mz?] The classical gap test used by Kendall and Babington-Smith considers the numbers UO, U1, , UN-1 to be a cyclic sequence with UN+3 identified with U,. Here N is a fixed number of U s that are to be subjected to the test. If n of the numbers UO, , UN-~ fall into the range cr 5 U, < /3, there are n gaps in the cyclic sequence. Let 2, be the number of gaps of length r, for 0 5 r < t, and let Zt be the number of gaps of length 2 t; show that the quantity V = CoCVCt(Zr -np,) /np, should have the chi-square distribution with t degrees of freedom, in the limit as N goes to infinity, where p, is given in Eq. (4). 6. [40] (H. Geiringer.) A frequency count of the first 2000 decimal digits in the representation of e = 2.71828.. gave a x2 value of 1.06, indicating that the actual frequencies of the digits 0, 1, . , 9 are much too close to their expected values to be considered randomly distributed. (In fact, x2 > 1.15 with probability 99.9 percent.) The same test applied to the first 10,000 digits of e gives the reasonable value x2 = 8.61; but the fact that the first 2000 digits are so evenly distributed is still surprising. Does the same phenomenon occur in the representation of e to other bases? [See AMA4 72 (1965), 483-500.1

72 RANDOM NUMBERS 3.3.2 L. Historical remarks and (Web site design and hosting)

Wednesday, January 30th, 2008

72 RANDOM NUMBERS 3.3.2 L. Historical remarks and further discussion. Statistical tests arose naturally in the course of scientists efforts to prove or disprove hypotheses about various observed data. The best known early papers dealing with the testing of artificially generated numbers for randomness are two articles by M. G. Kendall and B. Babington-Smith in the Journd of the Royal Statisticd Society 101 (1938), 147-166, and in the supplement to that journal, 6 (1939), 51-61. These papers were concerned with the testing of random digits between 0 and 9, rather than random real numbers; for this purpose, the authors discussed the frequency test, serial test, gap test, and poker test, although they misapplied the serial test. Kendall and Babington-Smith also used a variant of the coupon collector s test; the method described in this section was introduced by R. E. Greenwood in Math. Comp. 9 (1955), l-4. The run test has a rather interesting history. Originally, tests were made on runs up and down at once: a run up would be followed by a run down, then another run up, and so on. Note that the run test and the permutation test do not depend on the uniform distribution of the U s, they depend only on the fact that Vi = Uj occurs with probability zero when i # j; therefore these tests can be applied to many types of random sequences. The run test in primitive form was originated by J. Bienayme [Comptes Rendus 81 (Paris: Acad. Sciences, 1875), 417-4231. S ome sixty years later, W. 0. Kermack and A. G. McKendrick published two extensive papers on the subject (Proc. Royal Society Edinburgh 57 (1937), 228-240, 332-3761; as an example they stated that Edinburgh rainfall between the years 1785 and 1930 was entirely random in character with respect to the run test (although they examined only the mean and standard deviation of the run lengths). Several other people began using the test, but it was not until 1944 that the use of the chi-square method in connection with this test was shown to be incorrect. The paper by H. Levene and J. Wolfowitz in Annals Math. Stat. 15 (1944), 58-69, introduced the correct run test (for runs up and down, alternately) and discussed the fallacies in earlier misuses of that test. Separate tests for runs up and runs down, as proposed in the text above, are more suited to computer application, so we have not given the more complex formulas for the alternate-up-and-down case. See the survey paper by D. E. Barton and C. L. Mallows, Annals Math. Stat. 36 (1965), 236-260. Of all the tests we have discussed, the frequency test and the serial correla- tion test seem to be the weakest, in the sense that nearly all random number generators pass these tests. Theoretical grounds for the weakness of these tests are discussed briefly in Section 3.5 (cf. exercise 3.5-26). The run test, on the other hand, is a rather strong test: the results of exercises 3.3.3-23 and 24 sug- gest that linear congruential generators tend to have runs somewhat longer than normal if the multiplier is not large enough, so the run test of exercise 14 is definitely to be recommended. The collision test is also highly recommended, since it has been especially designed to detect the deficiencies of many poor generators that have unfor- tunately become widespread. This test, which is based on ideas of H. Delgas Christiansen [Inst. Math. Stat. and Oper. Res., Tech. Univ. Denmark (Oct. 1975),

3.3.2 EMPIRICAL TESTS 71 coefficient is not expected (Web hosting billing)

Tuesday, January 29th, 2008

3.3.2 EMPIRICAL TESTS 71 coefficient is not expected to be exactly zero. (See exercise 18.) A good value of C will be between pn -2a, and pn + 2a,, where -1 1 n(n -3) Pn = -on=—n > 2. (25) n-l n-l J n+l We expect C to be between these limits about 95 percent of the time. Equations (25) are only conjectured at this time, since the exact distribution of C is not known when the U s are uniformly distributed. For the theory when the U s have the normal distribution, see the paper by Wilfrid J. Dixon, Annals Math. Stat. 15 (1944), 119-144. Empirical evidence suggests that we may safely use the formulas for the mean and standard deviation that have been derived from the assumption of the normal distribution, without much error; these are the values that have been listed in (25). It is known that lim,,, &(T, = 1; cf. the article by Anderson and Walker, Annals Math. Stat. 35 (1964), 1296-1303, where more general results about serial correlations of dependent sequences are derived. Instead of simply computing the correlation coefficient between the obser- vations ( UO, U1, . . . , Unpl) and their immediate successors (VI, . . . , Unpl, UO), we can also compute it between (UO, U1,. . , Unpl) and any cyclically shifted sequence (U,, . . . , Unpl, UO, . . . , U,-,); the cyclic correlations should be small for 0 < 4 < n. A straightforward computation of Eq. (24) for all 4 would require about n2 multiplications, but it is actually possible to compute all the correlations in only O(n log n) steps by using fast Fourier transforms. (See Section 4.6.4; cf. also L. P. Schmid, CACM 8 (1965), 115.) K. Tests on subsequences. It frequently happens that the external program using our random sequence will call for numbers in batches. For example, if the program works with three random variables X, Y, and 2, it may consistently invoke the generation of three random numbers at a time. In such applications it is important that the subsequences consisting of every third term of the original sequence be random. If the program requires 4 numbers at a time, the sequences uo, uq, u2q,. * * ; ~1,~q+1,~2q+1,…; . . . . uq-1,7Jzq-1, u3q-1,. . . can each be put through the tests described above for the original sequence UO, Ul, u2, . . . . Experience with linear congruential sequences has shown that these derived sequences rarely if ever behave less randomly than the original sequence, unless 4 has a large factor in common with the period length. On a binary computer with m equal to the word size, for example, a test of the subsequences for 4 = 8 will tend to give the poorest randomness for all 4 < 16; and on a decimal computer, g = 10 yields the subsequences most likely to be unsatisfactory. (This can be explained somewhat on the grounds of potency, since such values of 9 will tend to lower the potency.)

70 RANDOM NUMBERS 3.3.2 (Web host sites) Sl. [Initialize.] Set A[j]

Monday, January 28th, 2008

70 RANDOM NUMBERS 3.3.2 Sl. [Initialize.] Set A[j] e 0 for 0 2 j 5 n; then set A[l] t 1 and j, e ji t 1. Then do step S2 exactly n - 1 times and go on to step S3. S2. [Update probabilities.] (Each time we do this step it corresponds to tossing a ball into an urn; A[j] represents the probability that exactly j of the urns are occupied.) Set jr +ji+l. Thenforjcjl,ji-l,…, jh(inthis order), set A[j] +- (j/mkUl + ((1 + l/m) -(jlm))Ab -11. If ALI has become very small as a result of this calculation, say A[j] < 10m2 , set A[j] + 0; and in such a case, if j = j, decrease ji by 1, or if j = jo increase jo by 1. S3. [Compute the answers.] In this step we make use of an auxiliary table (Tl I r2 f . . * , Ttmax ) = (.Ol, .05, .25, .50, .75, .95, .99, 1.00) containing the specified percentage points of interest. Set p t 0, t t 1, and j c j. -1. Do the following iteration until t = tmax: Increase j by 1, and set p e p + A[j]; then if p > Tt, output n -j -1 and 1 - p (meaning that with probability 1 -p there are at most n -j -1 collisions) and repeatedly increase t by 1 until p 5 Tt. I J. Serial correlation test. We may also compute the following statistic: n(U0Ul + GU2 + . . . + un-2un–l + un-1Uo) -(VII + Ul + . . . + un-1y = n(U$ + UT + .*a+ q-,)-(uo + u1+…+ un-ly . (23) This is the serial correlation coefficient, which is a measure of the amount lJ+l depends on Uj. Correlation coefficients appear frequently in statistics; if we have n quantities uo, Ul, f U,-I and n others Vi, VI, . . . , V,-I , the correlation coefficient between them is defined to be C= n m4w -cc WC vj) (24 J(nCu3 -Euj)2)(nCV~ -(,IEV,)2) All summations in this formula are to be taken over the range 0 5 j < n; Eq. (23) is the special case V, = U(j+l) modn. (Note: The denominator of (24) is zero when UO = VI = … = Un-l or VO = VI = … = V,-1; we exclude this case from discussion.) A correlation coefficient always lies between -1 and +l. When it is zero or very small, it indicates that the quantities Uj and I+ are (relatively speaking) independent of each other, but when the correlation coefficient is fl it indicates total linear dependence; in fact Vj = cx f PiIJj for all j in such a case, for some constants cr and p. (See exercise 17.) Therefore it is desirable to have C in Eq. (23) close to zero. In actual fact, since U&J1 is not completely independent of UIU2, the serial correlation

Web hosting unlimited bandwidth - 3.3.2 EMPIRICAL TESTS 69 Suppose we have m

Monday, January 28th, 2008

3.3.2 EMPIRICAL TESTS 69 Suppose we have m urns and we throw n balls at random into those urns, where m is much greater than n. Most of the balls will land in urns that were previously empty, but if a ball falls into an urn that already contains at least one ball we say that a collision has occurred. The collision test counts the number of collisions, and a generator passes this test if it doesn t induce too many or too few collisions. To fix the ideas, suppose m = 220 and n = 214. Then each urn will receive only one 64th of a ball, on the average. The probability that a given urn will contain exactly Ic balls is pk = (z)mpk(l -m-l)n-k, so the expected number of collisions per urn is ,&>l(k-l)pk = xkBO k&-E,>, pk = n/m-l+po. Since po = (1 - m-l)n = 1 - n/m + (y)mm2 + small& terms, we find that the average total number of collisions taken over all m urns is very slightly less than n2/2m = 128. We can use the collision test to rate a random number generator in a large number of dimensions. For example, when m = 220 and n = 214 we can test the 20-dimensional randomness of a number generator by letting d = 2 and forming 20-dimensional vectors Vj = (Yzo~, Yzo~+~, . . . , Yzo~+~~) for 0 2 j < n. It suffices to keep a table of m = 22o bits to determine collisions, one bit for each possible value of the vector V,; on a computer with 32 bits per word, this amounts to 215 words. Initially all 220 bits of this table are cleared to zero; then for each V,, if the corresponding bit is already 1 we record a collision, otherwise we set the bit to 1. This test can also be used in 10 dimensions with d = 4, and so on. To decide if the test is passed, we can use the following table of percentage points when m = 220 and n = 214: collisions 5 101 108 119 126 134 145 153 with probability .009 .043 .244 .476 .742 .946 .989 The theory underlying these probabilities is the same we used in the poker test, Eq. (5); the probability that c collisions occur is the probability that n - c urns are occupied, namely m(m-l)…(m-n+c+l) n mn 1n-c I Although m and n are very large, it is not difficult to compute these probabilities using the following method: Algorithm S (Percentage points for collision test). Given m and n, this algorithm determines the distribution of the number of collisions that occur when n balls are scattered into m urns. An auxiliary array A[O], A[l], . . . , A[n] of floating point numbers is used for the computation; actually A[j] will be nonzero only for jo 2 j 2 j,, and ~ 1 - ~ 0 will be at most of order log n, so it would be possible to get by with considerably less storage.

68 RANDOM NUMBERS 3.3.2 Form (Web design software) the matrix C

Sunday, January 27th, 2008

68 RANDOM NUMBERS 3.3.2 Form the matrix C of the covariances of the R s; for example, Crs = covar(Rr , Rs), while CIt = covar(Rr , R:). When t = 6, we have C = nC1 + C2 =n + if n > 12. Now form A = (ai?), the inverse of the matrix C, and compute Cl

Web hosting account - 3.3.2 EMPIRICAL TESTS 67 are (~+q+l)~~q)-(Pll~~l)-(p+~+l)+l (16) ways

Saturday, January 26th, 2008

3.3.2 EMPIRICAL TESTS 67 are (~+q+l)~~q)-(Pll~~l)-(p+~+l)+l (16) ways to arrange them in the order (15), as shown in exercise 13; and there are (n -p -q -l)! ways to arrange the remaining elements. Thus there are 1 1 times 16 ways in all, and dividing by n! we get the (p+:+&-P-q-1. ( 1 desired formula. From relations (14) a rather lengthy calculation leads to mean(R ,) = mean(Z,, + . . . + Z,,) = (n + l)Pl(P + 111 - (P - 1)/P!, l n, (18) where t = max(p, q), s = p + q, and ^, j(p, q, n) = (n + 1) s(l -pq) + pq 1Y -L)+(Y) (lg) ( (P + lY(q + (s + I)! , (s2 -s - 2)pq -s2 - p2q2 + 1 t. (P + lY(q + l)! . This expression for the covariance is unfortunately quite complicated, but it is necessary for a successful run test as described above. From these formulas it is easy to compute mean = mean(R ,) -mean(R ,+,), covar(Rp, Rb) = covar(RL, Rb) -covar(R ,+, , R ,), (20) covar(Rp, R4) = covar(R,, R ,) -covar(Rp, Rb+,). In Annals Math. Stat. 15 (1944), 163-165, J. Wolfowitz proved that the quantities Rl, R2, . . . , RtF1, R: become normally distributed as n –+ 00, subject to the mean and covariance expressed above; this implies that the following test for runs is valid: Given a sequence of n random numbers, compute the number of runs R, of length p for 1 2 p < t, and also the number of runs Ri of length t or more. Let &I = RI -mean( . . . . &t-l = Rt-1 -mean(Rt-I), &t = R$ -mean( (21)

66 RANDOM (Hosting your own web site) NUMBERS 3.3.2 Given any permutation on

Saturday, January 26th, 2008

66 RANDOM NUMBERS 3.3.2 Given any permutation on n elements, let Z,, = 1 if position i is the beginning of an ascending run of length p or more, and let Zpi = 0 otherwise. For example, consider the permutation (9) with n = 10; we have and all other Z s are zero. With this notation, R; = Z,I + Z,2 + . . . + z,n (12) is the number of runs of length > p, and Rp=R;-R;+l (13) is the number of runs of length p exactly. Our goal is to compute the mean value of Rp, and also the covariance covar(RP, R4) = mean((RP -mean(R,))(R, -mean(R, which measures the interdependence of Rp and R,. These mean values are to be computed as the average over the set of all n! permutations. Equations (12) and (13) show that the answers can be expressed in terms of the mean values of Z,, and of ZpiZql, so as the first step of the derivation we obtain the following results (assuming that i < j): Zpi = (P + hlMP + 1)!1 ifisn-p+l; otherwise. 3 (0 il: + wql(P + lY(q + l)!, ifi+p 1. Note that Z~iZ ~~ is either zero or one, so the summation consists of counting all permutations Ur Us . . . U, for which Z,, = .Z ,j = 1, that is, all permutations such that a-1 > u% < . < Uifp-1 > Ui+p < . . . < Ui+p+q-l. (15) The number of such permutations may be enumerated as follows: there are (P+q+l n ) ways to choose the elements for the positions indicated in (15); there

3.3.2 EMPIRICAL TESTS 65 G. Run test. A (My web server)

Friday, January 25th, 2008

3.3.2 EMPIRICAL TESTS 65 G. Run test. A sequence may also be tested for runs up and runs down. This means we examine the length of monotone subsequences of the original sequence, i.e., segments that are increasing or decreasing. As an example of the precise definition of a run, consider the sequence of ten numbers 1298536704 ; putting a vertical line at the left and right and between X, and X3+1 whenever X, > X3+1, we obtain /l 2 9181513 6 710 41, (9) which displays the runs up : there is a run of length 3, followed by two runs of length 1, followed by another run of length 3, followed by a run of length 2. The algorithm of exercise 12 shows how to tabulate the length of runs up. Unlike the gap test and the coupon collector s test (which are in many other respects similar to this test), we should not apply a &i-square test to the above data, since adjacent runs are not independent. A long run will tend to be followed by a short run, and conversely. This lack of independence is enough to invalidate a straightforward chi-square test. Instead, the following statistic may be computed, when the run lengths have been determined as in exercise 12: V = k c (COUNT[~] - &,)(couNT[~] -nbY)az3, (10) 11w16 where n is the length of the sequence, and the coefficients at3 and bi are equal to fall a12 a13 a14 a15 a16 (4529.4 9044.9 13568 18091 22615 27892 a21 a22 a23 a24 a25 a26 9044.9 18097 27139 36187 45234 55789 a31 a32 a33 a34 a35 a36 13568 27139 40721 54281 67852 83685 a41 a42 a43 a44 a45 a46 18091 36187 54281 72414 90470 111580 a51 a52 a53 a54 a55 a56 22615 45234 67852 90470 113262 139476 ia61 a62 a63 a64 a65 a66 I= L 27892 55789 83685 111580 139476 172860 (11) (bl bz b3 h b5 b6) = (8 & & +& & &). (The values of az3 shown here are approximate only; exact values may be obtained by using formulas derived below.) The statistic V in (10) should have the chi- square distribution with six (not five) degrees of freedom, when n is large. The value of n should be, say, 4000 or more. The same test can be applied to runs down. A vastly simpler and more practical run test appears in exercise 14, so a reader who is interested only in testing random number generators should skip the next few pages and go on to the maximum-of-t test after looking at exercise 14. On the other hand it is instructive from a mathematical standpoint to see how a complicated run test with interdependent runs can be treated, so we shall now digress for a moment.