SciPy Image Convolution Benchmark
This is an image processing benchmark. It performs image convolution using the SciPy package, with the computation done in parallel by a pool of workers. For a given number of iterations, the benchmark reports the average time and standard deviation of the computation.
The benchmark can scale with respect to:
* the number of workers,
* the workload, controlled by the size of the NumPy array (image) that each worker operates on and by the number of work items, i.e., the number of images the workers process (see the sketch below).
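A minimal sketch, using the default values and mirroring the computation in the benchmark code below, of how these two knobs determine the number of work items:

# --mem is the dataset footprint; --size is the side of each square image.
# The number of images is the footprint divided by the elements per image,
# exactly as computed in the benchmark code that follows.
mem = 32 * 1024 * 1024    # default --mem
size = 1000               # default --size
nimages = mem // (size * size)
print(nimages)  # 33

The full benchmark code follows.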
import dragon
import argparse
import multiprocessing
import numpy as np
import scipy.signal
import time


def get_args():
    parser = argparse.ArgumentParser(description="Basic SciPy test")
    parser.add_argument("--num_workers", type=int, default=4, help="number of workers")
    parser.add_argument("--iterations", type=int, default=10, help="number of iterations to do")
    parser.add_argument(
        "--burns", type=int, default=2, help="number of iterations to burn/ignore in order to warm up"
    )
    parser.add_argument("--dragon", action="store_true", help="run with dragon objs")
    parser.add_argument("--size", type=int, default=1000, help="size of the array")
    parser.add_argument(
        "--mem", type=int, default=(32 * 1024 * 1024), help="overall footprint of image dataset to process"
    )
    my_args = parser.parse_args()

    return my_args


def f(args):
    image, random_filter = args
    # Do some image processing: convolve, then downsample by a factor of 5.
    return scipy.signal.convolve2d(image, random_filter)[::5, ::5]


if __name__ == "__main__":
    args = get_args()

    if args.dragon:
        print("using dragon runtime")
        multiprocessing.set_start_method("dragon")
    else:
        print("using regular mp libs/calls")
        multiprocessing.set_start_method("spawn")

    num_cpus = args.num_workers

    print(f"Number of workers: {num_cpus}")
    pool = multiprocessing.Pool(num_cpus)
    image = np.zeros((args.size, args.size))
    # Number of images = dataset footprint / elements per image.
    nimages = int(float(args.mem) / float(image.size))
    print(f"Number of images: {nimages}")
    images = [np.zeros((args.size, args.size)) for _ in range(nimages)]
    filters = [np.random.normal(size=(4, 4)) for _ in range(nimages)]

    all_iter = args.iterations + args.burns
    results = np.zeros(all_iter)
    for i in range(all_iter):
        start = time.perf_counter()
        res = pool.map(f, zip(images, filters))
        del res
        finish = time.perf_counter()
        results[i] = finish - start

    pool.close()
    pool.join()

    print(f"Average time: {round(np.mean(results[args.burns:]), 2)} second(s)")
    print(f"Standard deviation: {round(np.std(results[args.burns:]), 2)} second(s)")
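To make the worker function concrete: scipy.signal.convolve2d defaults to mode="full", so a 1000x1000 image convolved with a 4x4 filter yields a 1003x1003 result, and the [::5, ::5] slice keeps every fifth row and column. A minimal check of those shapes:

import numpy as np
import scipy.signal

image = np.zeros((1000, 1000))
random_filter = np.random.normal(size=(4, 4))

# mode="full" (the default) gives shape (1000 + 4 - 1, 1000 + 4 - 1).
full = scipy.signal.convolve2d(image, random_filter)
assert full.shape == (1003, 1003)

# Downsampling keeps indices 0, 5, ..., 1000: ceil(1003 / 5) = 201 per axis.
out = full[::5, ::5]
assert out.shape == (201, 201)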
Arguments list
dragon scipy_scale_work.py [-h] [--num_workers NUM_WORKERS] [--iterations ITERATIONS]
                           [--burns BURNS] [--dragon] [--size SIZE] [--mem MEM]

--num_workers  number of workers, int, default=4
--iterations   number of iterations to time, int, default=10
--burns        number of warm-up iterations to burn/ignore, int, default=2
--dragon       use the Dragon runtime, action="store_true"
--size         size of the NumPy array used to create an image, int, default=1000
--mem          overall memory footprint of the image dataset to process, int, default=(32 * 1024 * 1024)
How to run
System Information
We ran the experiments on a Cray XC-50 system with an Intel Xeon(R) Platinum 8176 CPU @ 2.10GHz (28 cores per socket, single socket).
The Dragon version used is 0.3.
Single node
Base Multiprocessing
To use the standard multiprocessing library, run the code with dragon scipy_scale_work.py:
> salloc --nodes=1 --exclusive
> dragon scipy_scale_work.py --num_workers 4
[stdout: p_uid=4294967296] using regular mp libs/calls
Number of workers: 4
[stdout: p_uid=4294967296] Number of images: 33
Average time: 0.5 second(s)
Standard deviation: 0.01 second(s)
Dragon runtime
To use the Dragon runtime, run with dragon scipy_scale_work.py --dragon:
> salloc --nodes=1 --exclusive
> dragon scipy_scale_work.py --dragon --num_workers 4
[stdout: p_uid=4294967296] using dragon runtime
Number of workers: 4
[stdout: p_uid=4294967296] Number of images: 33
Average time: 0.39 second(s)
Standard deviation: 0.0 second(s)
From the above results, we can see that Dragon outperforms base multiprocessing on this benchmark (an average of 0.39 seconds versus 0.5 seconds).
Multi-node
Dragon runtime
In the multi-node case, the placement of processes follows a round-robin scheme across the allocated nodes. More information can be found in Processes.
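As an illustration only (this is not Dragon's API, just a sketch of the placement scheme with hypothetical node names), round-robin placement maps worker i to node i % len(nodes):

from itertools import cycle

# Hypothetical node names; Dragon discovers the actual allocation itself.
nodes = ["node01", "node02"]
workers = range(4)

# Round-robin: worker i lands on node i % len(nodes).
placement = {w: n for w, n in zip(workers, cycle(nodes))}
print(placement)  # {0: 'node01', 1: 'node02', 2: 'node01', 3: 'node02'}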
For running on 2 nodes with 4 workers:
> salloc --nodes=2 --exclusive
> dragon scipy_scale_work.py --dragon --num_workers 4
[stdout: p_uid=4294967296] using dragon runtime
Number of workers: 4
[stdout: p_uid=4294967296] Number of images: 33
Average time: 1.46 second(s)
Standard deviation: 0.02 second(s)
For running on 2 nodes with 32 workers:
> salloc --nodes=2 --exclusive
> dragon ./scipy_scale_work.py --dragon --num_workers 32
[stdout: p_uid=4294967296] using dragon runtime
Number of workers: 32
[stdout: p_uid=4294967296] Number of images: 33
Average time: 1.34 second(s)
Standard deviation: 0.61 second(s)
For running on 4 nodes with 32 workers:
> salloc --nodes=4 --exclusive
> dragon ./scipy_scale_work.py --dragon --num_workers 32
[stdout: p_uid=4294967296] using dragon runtime
Number of workers: 32
[stdout: p_uid=4294967296] Number of images: 33
Average time: 1.11 second(s)
Standard deviation: 0.02 second(s)
In the cases above, we keep the amount of work fixed (Number of images: 33) and scale the number of nodes or the number of workers. We see that performance keeps improving, as expected.
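As a quick sanity check, we can turn the averages reported above into throughput (images per second); the sketch below uses only the results quoted in this section:

nimages = 33
averages = {
    "2 nodes, 4 workers": 1.46,
    "2 nodes, 32 workers": 1.34,
    "4 nodes, 32 workers": 1.11,
}
for config, avg in averages.items():
    # Throughput improves as we add workers and nodes.
    print(f"{config}: {nimages / avg:.1f} images/s")  # 22.6, 24.6, 29.7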
Finally, we run on 64 nodes with 896 workers:
> salloc --nodes=64 --exclusive
> dragon ./scipy_scale_work.py --dragon --mem 536870912 --size 256 --num_workers 896
[stdout: p_uid=4294967296] using dragon runtime
Number of workers: 896
[stdout: p_uid=4294967296] Number of images: 8192
Average time: 1.46 second(s)
Standard deviation: 0.03 second(s)
In this last experiment, we scale the number of nodes by a factor of 16, the number of workers by a factor of 28, and the amount of work by a factor of 16. This is a substantially heavier experiment than the previous ones, yet Dragon was able to scale, and the average time remained almost the same as before.
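The scaling factors quoted above follow directly from the run parameters; a minimal arithmetic check:

# Scaling factors relative to the 4-node, 32-worker run with default --mem.
print(64 / 4)                       # nodes: 16.0
print(896 / 32)                     # workers: 28.0
print(536870912 / (32 * 1024**2))   # dataset footprint: 16.0
# Number of images in the last run: footprint / elements per 256x256 image.
print(536870912 // (256 * 256))     # 8192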