Multiprocessing Pool versus MPI-based Pool
++++++++++++++++++++++++++++++++++++++++++

.. currentmodule:: dragon.mpbridge.context

Dragon :ref:`multiprocessing ` presents a notably different programming model for parallel
applications than options like MPI that are more familiar to the HPC community. Users coming from
an HPC background might be curious about the performance advantages or disadvantages of using
Dragon, and this benchmark shows one approach to using MPI to implement a processing solution
similar to :external+python:py:class:`multiprocessing.pool.Pool`. The
:example_benchmark:`dragon_pool_vs_mpi.py` benchmark measures the performance of a map-like
workload using either :py:meth:`~dragon.mpbridge.context.DragonContext.Pool` or an MPI-based
equivalent, attempting to be fair in both implementation complexity and optimization.

Figure :numref:`pool_v_mpi_perf` shows the performance improvement of
:py:meth:`~dragon.mpbridge.context.DragonContext.Pool` over the MPI implementation for a map-like
workload. Two sets of data are shown: a case where each work item requires 0.1 seconds of
processing time (imbalance=1) and a case where work items require a randomly selected amount of
processing time between 0.1 and 0.8 seconds (imbalance=8).

For balanced workloads (imbalance=1), MPI outperforms Dragon on a single node and has equivalent
performance up to 4 nodes. Beyond 4 nodes, Dragon rapidly outperforms the MPI implementation and
is 32X faster on 32 nodes. The exact cause of this requires deeper investigation, but we believe
it is related to Dragon's design for managing the many-to-one communication patterns that occur
in this case as workers communicate nearly synchronously back to the primary orchestration
process. For the imbalanced case (imbalance=8), Dragon consistently outperforms the MPI
implementation and is 6X faster at 32 nodes.
:py:meth:`~dragon.mpbridge.context.DragonContext.Pool` is designed for imbalanced workloads
because work items are fetched as workers become idle. This is difficult to achieve with the MPI
implementation.

.. _pool_v_mpi_perf:

.. figure:: images/pool_v_mpi.png
   :align: center
   :scale: 20 %

   Performance improvement using a Dragon `Pool` over an MPI-based implementation. Values higher
   than 1 indicate better performance by that factor.

All data was gathered on a Cray EX system equipped with a single HPE Slingshot 200Gbps NIC on each
node. To run the same benchmarks:

.. code-block:: console

   # in an allocation of up to 32 nodes
   dragon dragon_pool_vs_mpi.py --items_per_worker=4 --imbalance_factor=1
   dragon dragon_pool_vs_mpi.py --items_per_worker=4 --imbalance_factor=8
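
For reference, the following is a minimal sketch of what the Dragon `Pool` side of such a
map-like, imbalanced workload can look like. It is not taken from
:example_benchmark:`dragon_pool_vs_mpi.py`; the worker function and the parameters
(``work_item``, ``num_workers``, ``items_per_worker``, ``imbalance_factor``) are illustrative
only, and it assumes Dragon is installed and the script is launched with the ``dragon`` command.

.. code-block:: python

   import random
   import time
   import multiprocessing as mp

   import dragon  # noqa: F401  -- registers the "dragon" start method


   def work_item(seconds):
       # Simulate a single work item that takes `seconds` of processing time.
       time.sleep(seconds)
       return seconds


   if __name__ == "__main__":
       # Select Dragon's multiprocessing implementation before creating Pool objects.
       mp.set_start_method("dragon")

       # Illustrative parameters, not the benchmark's defaults.
       num_workers = 32
       items_per_worker = 4
       imbalance_factor = 8

       # Each item costs between 0.1 s and 0.1 s * imbalance_factor, mirroring the
       # imbalanced case described above.
       items = [
           random.uniform(0.1, 0.1 * imbalance_factor)
           for _ in range(num_workers * items_per_worker)
       ]

       with mp.Pool(num_workers) as pool:
           start = time.perf_counter()
           results = pool.map(work_item, items)
           elapsed = time.perf_counter() - start

       print(f"processed {len(results)} items in {elapsed:.2f} s")

Because `Pool` hands out work items as workers become idle, a slow item delays only the worker
that drew it; a static, rank-based partitioning of the same items would leave ranks with cheap
items waiting at the end of the map.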