SciPy Image Convolution Benchmark

This is an image processing benchmark. It performs image convolution using the SciPy package, with the computation parallelized across a pool of workers. For a given number of iterations, the benchmark reports the average time and standard deviation of the computation.

We can scale the benchmark with respect to:

* the number of workers,
* the workload, by controlling the size of the NumPy array (image) that each worker operates on and the number of work items, i.e., the number of images that the workers will process.

Listing 33 scipy_scale_work.py: A SciPy image convolution benchmark
import dragon
import argparse
import multiprocessing
import numpy as np
import scipy.signal
import time

def get_args():
    parser = argparse.ArgumentParser(description="Basic SciPy test")
    parser.add_argument("--num_workers", type=int, default=4, help="number of workers")
    parser.add_argument("--iterations", type=int, default=10, help="number of iterations to do")
    parser.add_argument(
        "--burns", type=int, default=2, help="number of iterations to burn/ignore in order to warm up"
    )
    parser.add_argument("--dragon", action="store_true", help="run with dragon objs")
    parser.add_argument("--size", type=int, default=1000, help="size of the array")
    parser.add_argument(
        "--mem", type=int, default=(32 * 1024 * 1024), help="overall footprint of image dataset to process"
    )
    my_args = parser.parse_args()

    return my_args

def f(args):
    image, random_filter = args
    # Do some image processing.
    return scipy.signal.convolve2d(image, random_filter)[::5, ::5]

if __name__ == "__main__":
    args = get_args()

    if args.dragon:
        print("using dragon runtime")
        multiprocessing.set_start_method("dragon")
    else:
        print("using regular mp libs/calls")
        multiprocessing.set_start_method("spawn")

    num_cpus = args.num_workers

    print(f"Number of workers: {num_cpus}")
    pool = multiprocessing.Pool(num_cpus)
    image = np.zeros((args.size, args.size))
    nimages = int(float(args.mem) / float(image.size))
    print(f"Number of images: {nimages}")
    images = []
    for j in range(nimages):
        images.append(np.zeros((args.size, args.size)))
    filters = [np.random.normal(size=(4, 4)) for _ in range(nimages)]

    all_iter = args.iterations + args.burns
    results = np.zeros(all_iter)
    for i in range(all_iter):
        start = time.perf_counter()
        res = pool.map(f, zip(images, filters))
        del res
        finish = time.perf_counter()
        results[i] = finish - start

    print(f"Average time: {round(np.mean(results[args.burns:]), 2)} second(s)")
    print(f"Standard deviation: {round(np.std(results[args.burns:]), 2)} second(s)")

Arguments list

dragon scipy_scale_work.py [-h] [--num_workers NUM_WORKERS] [--iterations ITERATIONS]
                           [--burns BURN_ITERATIONS] [--dragon] [--size IMAGE_SIZE]
                           [--mem MEMORY_FOOTPRINT]

--num_workers       number of workers, int, default=4
--iterations        number of iterations, int, default=10
--burns             number of iterations to burn, int, default=2
--dragon            use the Dragon calls, action="store_true"
--size              size of the numpy array used to create an image, int, default=1000
--mem               overall memory footprint of image dataset to process, int, default=(32 * 1024 * 1024)
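
As a concrete example of how --mem and --size together determine the number of work items, the short sketch below reproduces the computation from the listing (number of images = overall footprint divided by elements per image), for the default values and for the parameters of the 64-node run shown later; the variable names here are illustrative only:

mem, size = 32 * 1024 * 1024, 1000            # defaults
print(int(float(mem) / float(size * size)))   # 33 images, as printed by the single-node runs

mem, size = 536870912, 256                    # parameters of the 64-node run
print(int(float(mem) / float(size * size)))   # 8192 images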

How to run

System Information

We ran the experiments on a Cray XC-50 system with an Intel Xeon(R) Platinum 8176 CPU @ 2.10GHz (28 cores per socket, single socket).

The Dragon version used is 0.3.

Single node

Base Multiprocessing

In order to use the standard multiprocessing library (the "spawn" start method), run the code with dragon scipy_scale_work.py, i.e., without the --dragon flag:

> salloc --nodes=1 --exclusive
> dragon scipy_scale_work.py --num_workers 4
[stdout: p_uid=4294967296] using regular mp libs/calls
Number of workers: 4
[stdout: p_uid=4294967296] Number of images: 33
Average time: 0.5 second(s)
Standard deviation: 0.01 second(s)

Dragon runtime

In order to use Dragon runtime calls, run with dragon scipy_scale_work.py --dragon:

> salloc --nodes=1 --exclusive
> dragon scipy_scale_work.py --dragon --num_workers 4
[stdout: p_uid=4294967296] using dragon runtime
Number of workers: 4
[stdout: p_uid=4294967296] Number of images: 33
Average time: 0.39 second(s)
Standard deviation: 0.0 second(s)

From the above single-node results, we can see that the Dragon runtime outperforms base multiprocessing.

Multi-node

Dragon runtime

In the multi-node case, the placement of processes follows a round-robin scheme across the allocated nodes. More information can be found in Processes.
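
As a quick way to observe this placement, the following minimal sketch (not part of the benchmark, and assuming the Dragon runtime is available and the script is launched with dragon on a multi-node allocation) has each pool worker report the node it runs on; with round-robin placement the output contains more than one distinct hostname:

import dragon
import multiprocessing
import socket

def hostname(_):
    # Each pool worker reports the node it was placed on.
    return socket.gethostname()

if __name__ == "__main__":
    multiprocessing.set_start_method("dragon")
    pool = multiprocessing.Pool(8)
    # Round-robin placement across the allocation shows up as more than
    # one distinct hostname in the output.
    print(sorted(set(pool.map(hostname, range(64)))))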

For running on 2 nodes with 4 workers:

> salloc --nodes=2 --exclusive
> dragon scipy_scale_work.py --dragon --num_workers 4
[stdout: p_uid=4294967296] using dragon runtime
Number of workers: 4
[stdout: p_uid=4294967296] Number of images: 33
Average time: 1.46 second(s)
Standard deviation: 0.02 second(s)

For running on 2 nodes with 32 workers:

> salloc --nodes=2 --exclusive
> dragon ./scipy_scale_work.py --dragon --num_workers 32
[stdout: p_uid=4294967296] using dragon runtime
Number of workers: 32
[stdout: p_uid=4294967296] Number of images: 33
Average time: 1.34 second(s)
Standard deviation: 0.61 second(s)

For running on 4 nodes with 32 workers:

> salloc --nodes=4 --exclusive
> dragon ./scipy_scale_work.py --dragon --num_workers 32
[stdout: p_uid=4294967296] using dragon runtime
Number of workers: 32
[stdout: p_uid=4294967296] Number of images: 33
Average time: 1.11 second(s)
Standard deviation: 0.02 second(s)

In the above cases, we keep the amount of work fixed (Number of images: 33) and scale the number of nodes or the number of workers. We see that the performance keeps improving, as expected.

Finally, we run on 64 nodes with 896 workers:

> salloc --nodes=64 --exclusive
> dragon ./scipy_scale_work.py --dragon --mem 536870912 --size 256 --num_workers 896
[stdout: p_uid=4294967296] using dragon runtime
Number of workers: 896
[stdout: p_uid=4294967296] Number of images: 8192
Average time: 1.46 second(s)
Standard deviation: 0.03 second(s)

In this last experiment, relative to the previous run we scale the number of nodes by a factor of 16, the number of workers by a factor of 28, and the amount of work by a factor of 16. This is a substantially heavier experiment than the previous ones. Dragon was able to scale, and the performance remained almost the same as before.
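
For reference, these scale factors follow directly from the run parameters, relative to the 4-node, 32-worker experiment:

print(64 / 4)                           # nodes: 16x
print(896 / 32)                         # workers: 28x
print(536870912 / (32 * 1024 * 1024))   # --mem footprint: 16x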