Why the dip in speed increase for generating 400,000,000 random numbers?

Spread the love

Question Description

I’m generating around 400,000,000 (400 million) random numbers in parallel on an Intel i7 with 4 cores (8 threads hyperthreaded) on macOS with 8 GB RAM.

However, I’m also generating 400,000,000 random numbers on a DigitalOcean server with 20 cores on Debian with 64 GB RAM.

Here’s the code:

import multiprocessing
import random

rangemin = 1
rangemax = 9

def randomGenPar_backend(backinput):
    return random.randint(rangemin, rangemax)

def randomGenPar(num):
    pool = multiprocessing.Pool()
    return pool.map(randomGenPar_backend, range(0, num))

randNum = 400000000

random.seed(999)
randomGenPar(randNum)

These are the results of the benchmark:

5,000,000 Random Numbers:
1 Core: 5.984
8 Core: 1.982

50,000,000 Random Numbers:
1 Core: 57.28
8 Core: 19.799
20 Core: 18.257
Times Benefit (20 core vs. 8 core) = 1.08

100,000,000 Random Numbers:
1 Core: 115
8 Core: 40.434
20 Core: 31.652
Times Benefit (20 core vs. 8 core) = 1.28

200,000,000 Random Numbers:
8 Core: 87
20 Core: 60
Times Benefit (20 core vs. 8 core) = 1.45

300,000,000 Random Numbers:
8 Core: 157
20 Core: 88
Times Benefit (20 core vs. 8 core) = 1.78

400,000,000 Random Numbers:
8 Core: 202
20 Core: 139
Times Benefit (20 core vs. 8 core) = 1.45 (DIP!)

500,000,000 Random Numbers:
8 Core: 280
20 Core: 171
Times Benefit (20 core vs. 8 core) = 1.64 (INCREASE!)

600,000,000 Random Numbers:
8 Core: 342
20 Core: 198
Times Benefit (20 core vs. 8 core) = 1.73

700,000,000 Random Numbers:
8 Core: 410
20 Core: 206
Times Benefit (20 core vs. 8 core) = 1.99

800,000,000 Random Numbers:
8 Core: 482
20 Core: 231
Times Benefit (20 core vs. 8 core) = 2.09

Usually, the more random numbers that are generated, the more the parallelism of the 20 core CPU can be used. Therefore, the “times increase” of speed from 8 core to 20 core increases over time.

However, after 300 million random numbers, this decreases, and increases again until 800 million (I haven’t tested further).

Why is this? Is there a specific reason? Was it just random? (I’ve repeated this twice, and gotten the same result both times)

EDIT: If it makes any difference, I’m using the time function to time the execution of the script. Also, the OS isn’t the same on both machines (8 core – macOS, 20 core – Debian).

Practice As Follows

Two possible explanations come to mind.

This could be an artifact of garbage collection kicking in. An easy experiment would be to shut-off GC and see if the “dip” persists:

>>> import gc
>>> gc.disable()

Another possibility is that this is an artifact of list growth using realloc() under-the-hood. Lists implemented are fixed length arrays of pointers. When map() grows the list with append(), the C function call realloc() is called periodically to resize the array of pointers. Often, this call is very cheap because none of the data has to be moved. However, if even a single byte in memory “obstructs” the resize, then all of the data has to be relocated. This is very expensive and could cause your the “dip” if at that point in execution multiprocessing is creating an obstructing byte.

To test this hypothesis, you could use imap() instead of map() and feed the results into collections.deque() instead of a list(). The deque implementation does not use relloc so its performance is consistent in the face of fragmented memory (internally, it just makes repeated calls to malloc() to obtained fixed length memory blocks).

Leave a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.