9

Don't use rand(): a guide to random number generators in C++

 2 years ago
source link: http://codeforces.com/blog/entry/61587
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Don't use rand(): a guide to random number generators in C++

By neal, 3 years ago,

Don't use rand(). Why? Let's jump right into some code. What value will the following code print, approximately?

#include <cstdlib>
#include <iostream>
using namespace std;

const int ITERATIONS = 1e7;

int main() {
    double sum = 0;

    for (int i = 0; i < ITERATIONS; i++)
        sum += rand() % 1000000;

    cout << "Average value: " << sum / ITERATIONS << '\n';
}

Should be about 500,000, right? Turns out it depends on the compiler, and on Codeforces it prints 16382, which isn't even close. Try it out yourself.

What's happening here?

If you look up C++ documentation on rand(), you'll see that it returns "a pseudo-random integral value between 0 and RAND_MAX." Click again on RAND_MAX and you'll see that "This value is implementation dependent. It's guaranteed that this value is at least 32767." On the Codeforces machines, it turns out RAND_MAX is exactly 32767. That's so small!

It doesn't stop there though; random_shuffle() also uses rand(). Recall that in order to perform a random shuffle, we need to generate random indices up to n, the size of the array. But if rand() only goes up to 32767, what happens if we call random_shuffle() on an array with significantly more elements than that? Time for some more code. What would you expect the following code to print?

#include <algorithm>
#include <iostream>
#include <vector>
using namespace std;

const int N = 3000000;

double average_distance(const vector<int> &permutation) {
    double distance_sum = 0;

    for (int i = 0; i < N; i++)
        distance_sum += abs(permutation[i] - i);

    return distance_sum / N;
}

int main() {
    vector<int> permutation(N);

    for (int i = 0; i < N; i++)
        permutation[i] = i;

    random_shuffle(permutation.begin(), permutation.end());
    cout << average_distance(permutation) << '\n';
}

This computes the average distance that each value moves in the random shuffle. If you work out a bit of math, you'll find that the answer on a perfectly random shuffle should be 1a648f8709223b9fa2328efe8411e9f202bff4c5.png = 1,000,000. Even if you don't want to do the math, you can observe that the answer is between cbf25ae6a8f2f0effa5ee0e4c69cea0c3a56d49d.png = 1,500,000, the average distance for index 0, and 1d78021314703e229e6f20754e20467b48f0a073.png = 750,000, the average distance for index cbf25ae6a8f2f0effa5ee0e4c69cea0c3a56d49d.png.

Well, once again the code above disappoints; it prints out 64463. Try it yourself. In other words, random_shuffle() moved each element a distance of 2% of the length of the array on average. Based on my testing, the implementation of random_shuffle() on Codeforces matches the following exactly:

    for (int i = 1; i < N; i++)
        swap(permutation[i], permutation[rand() % (i + 1)]);

So naturally if RAND_MAX is much less than N, this shuffle will be problematic.

rand() itself has more quality problems than just RAND_MAX being small though; it is typically implemented as a relatively simple linear congruential generator. On the Codeforces compiler, it looks like this:

static long holdrand = 1L;

void srand(unsigned int seed) {
  holdrand = (long) seed;
}

int rand() {
  return (((holdrand = holdrand * 214013L + 2531011L) >> 16) & 0x7fff);
}

In particular, linear congruential generators (LCGs) suffer from extreme predictability in the lower bits. The k-th bit (starting from k = 0, the lowest bit) has a period of at most 2k + 1 (i.e., how long until the sequence takes to repeat). So the lowest bit has a period of just 2, the second lowest a period of 4, etc. This is why the function above discards the lowest 16 bits, and the resulting output is at most 32767.

What's the solution?

Don't worry, as of C++11 there are much better random number generators available in C++. The only thing you need to remember is to use mt19937, included in the <random> header. This is a Mersenne Twister based on the prime 219937 - 1, which also happens to be its period. It's a much higher-quality RNG than rand(), in addition to being much faster (389 ms to generate and add 108 numbers from mt19937 in Custom Invocation, vs. 1170 ms for rand()). It also produces full 32-bit unsigned outputs between 0 and 232 - 1 = 4294967295, rather than maxing out at a measly 32767.

To replace random_shuffle(), you can now call shuffle() and pass in your mt19937 as the third argument; the shuffle algorithm will use your provided generator for shuffling.

C++11 also gives you some nifty distributions. uniform_int_distribution gives you perfectly uniform numbers, without the bias of mod -- i.e., rand() % 10000 is more likely to give you a number between 0 and 999 than a number between 9000 and 9999, since 32767 is not a perfect multiple of 10000. There are many other fun distributions as well including normal_distribution and exponential_distribution.

To give you a more concrete idea, here's some code using several of the tools mentioned above. Note that the code seeds the random number generator using a high-precision clock. This is important for avoiding hacks specifically tailored to your code, since using a fixed seed means that anyone can determine what your RNG will output. For more details, see How randomized solutions can be hacked, and how to make your solution unhackable.

One last thing: if you want 64-bit random numbers, just use mt19937_64 instead.

#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>
using namespace std;

const int N = 3000000;

double average_distance(const vector<int> &permutation) {
    double distance_sum = 0;

    for (int i = 0; i < N; i++)
        distance_sum += abs(permutation[i] - i);

    return distance_sum / N;
}

int main() {
    mt19937 rng(chrono::steady_clock::now().time_since_epoch().count());
    vector<int> permutation(N);

    for (int i = 0; i < N; i++)
        permutation[i] = i;

    shuffle(permutation.begin(), permutation.end(), rng);
    cout << average_distance(permutation) << '\n';

    for (int i = 0; i < N; i++)
        permutation[i] = i;

    for (int i = 1; i < N; i++)
        swap(permutation[i], permutation[uniform_int_distribution<int>(0, i)(rng)]);

    cout << average_distance(permutation) << '\n';
}

Both shuffles result in almost exactly 106 average distance, like we originally expected.

Additional References

This post was inspired in part by Stephan T. Lavavej's talk "rand() Considered Harmful": https://channel9.msdn.com/Events/GoingNative/2013/rand-Considered-Harmful

If you want even faster, higher-quality random number generators, take a look at this site by Sebastiano Vigna.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK