Distributed Dictionary Performance

How well does the DDict perform? Dragon performance improves with each release; the results here are for Dragon v0.12.1. In the gups_ddict.py benchmark, inspired by the classic GUPS (Giga Updates Per Second) benchmark, a large number of client processes each put or get a unique set of key/value pairs into or from the DDict. The keys are always 128 bytes in this implementation, while the values vary in length. Fig. 39 below shows the aggregate bandwidth measured across the clients for writing key/value pairs into a DDict sharded across up to 512 nodes on a Cray EX system. For the largest value sizes, the DDict achieves roughly one third of the hardware-limited network bandwidth and scales linearly with the number of nodes.

../_images/ddict_put.png — Fig. 39 Aggregate bandwidth for the `put` operation on a `DDict`.
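To make the "one third of hardware-limited bandwidth" comparison concrete, the following minimal sketch shows one way such numbers could be computed. It is not taken from gups_ddict.py: the function names are hypothetical, and it assumes that the hardware limit is simply the NIC line rate (one 200 Gbps NIC per node, as on the test system) and that aggregate bandwidth is total bytes moved by all clients divided by wall-clock time.

```python
# Sketch only: helper names are hypothetical and not part of Dragon.
# Assumption: the hardware limit is the raw NIC line rate, ignoring
# protocol overhead and any intra-node traffic.

NIC_GBPS = 200.0  # one 200 Gbps NIC per node


def nic_peak_gb_per_s(nodes, nic_gbps=NIC_GBPS):
    """Aggregate NIC line rate in GB/s (decimal) across `nodes` nodes."""
    return nodes * nic_gbps / 8.0  # 8 bits per byte


def aggregate_bandwidth_gb_per_s(bytes_per_client, elapsed_s):
    """Total bytes moved by all clients divided by wall-clock seconds."""
    return sum(bytes_per_client) / elapsed_s / 1e9


def fraction_of_peak(measured_gb_per_s, nodes):
    """Measured aggregate bandwidth as a fraction of the NIC line rate."""
    return measured_gb_per_s / nic_peak_gb_per_s(nodes)


if __name__ == "__main__":
    peak = nic_peak_gb_per_s(512)
    print(f"512-node NIC line rate: {peak:.0f} GB/s")
    print(f"one third of that:      {peak / 3:.0f} GB/s")
```

Under these assumptions, a 512-node run has a line rate of 12,800 GB/s, so one third corresponds to a little under 4,300 GB/s. The measured values themselves should be read from the figure.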

Fig. 40 shows similar data, but with start_batch_put() and end_batch_put() used to aggregate operations, which can eliminate some of the overhead of communicating with managers. Compared with basic put() operations, this optimization is most effective at lower client node counts and for values smaller than 1 MB. For example, 4 KB values on a single node achieve 5.6X higher throughput with batched operations. At large node counts, however, batched operations may reduce performance.

../_images/ddict_batch_put.png — Fig. 40 Aggregate bandwidth for the batched `put` operation on a `DDict`.
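The batched pattern described above can be sketched as follows. This is an illustration under assumptions, not the benchmark's code: `RecordingDict` is a hypothetical stand-in so the example runs without Dragon, `put_batched` is a made-up helper name, and the sketch assumes a real DDict accepts dict-style assignment between start_batch_put() and end_batch_put() called without arguments.

```python
class RecordingDict(dict):
    """Hypothetical stand-in for a DDict; it only records batch boundaries."""

    def __init__(self):
        super().__init__()
        self.batch_open = False
        self.batches = 0

    def start_batch_put(self):
        self.batch_open = True

    def end_batch_put(self):
        self.batch_open = False
        self.batches += 1


def put_batched(d, items):
    """Write all (key, value) pairs inside a single batch."""
    d.start_batch_put()
    try:
        for key, value in items:
            d[key] = value
    finally:
        # Close the batch even if one of the writes raises.
        d.end_batch_put()


# 128-byte keys and 4 KB values, mirroring the sizes discussed in the text.
items = [(i.to_bytes(128, "little"), bytes(4096)) for i in range(8)]
d = RecordingDict()
put_batched(d, items)
```

Because batching helps most at low node counts and small values and may hurt at large node counts, it is worth measuring both variants at the target scale before adopting one.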

Fig. 41 shows the same measurement for get() operations. This path was recently optimized for read-heavy use cases, such as AI training data loading, which accounts for get() frequently achieving higher performance than put() in the v0.12.1 release.

../_images/ddict_get.png — Fig. 41 Aggregate bandwidth for the `get` operation on a `DDict`.

A new feature added in v0.12 is the ability to freeze() a DDict. A frozen DDict gives clients more direct access to dictionary buffers and eliminates some otherwise required copy overhead. This optimization is most effective at low client node counts and large value sizes, as seen in Fig. 42. For example, 64 MB values on a single node achieve 2X higher read throughput with a frozen DDict, and 16 MB values on two nodes achieve 1.5X higher throughput.

../_images/ddict_frozen_get.png — Fig. 42 Aggregate bandwidth for the `get` operation on a frozen `DDict`.
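A frozen DDict fits a write-once, read-many workflow such as loading training data. The sketch below shows that ordering only. `FreezableDict` is a hypothetical stand-in so the example runs without Dragon, `load_then_read` is a made-up helper name, and the sketch assumes all writes are finished before freeze() is called and that dict-style reads work on a real DDict.

```python
class FreezableDict(dict):
    """Hypothetical stand-in for a DDict; it only records that freeze() ran."""

    def __init__(self):
        super().__init__()
        self.frozen = False

    def freeze(self):
        self.frozen = True


def load_then_read(d, samples, epochs):
    """Populate `d`, freeze it, then read every value once per epoch.

    Returns the total number of value bytes read.
    """
    # Phase 1: all writes happen before the dictionary is frozen.
    for key, value in samples.items():
        d[key] = value

    # Phase 2: freeze once, then serve only reads.
    d.freeze()

    total_bytes = 0
    for _ in range(epochs):
        for key in samples:
            total_bytes += len(d[key])
    return total_bytes


samples = {f"sample-{i}".encode(): bytes(1024) for i in range(4)}
d = FreezableDict()
bytes_read = load_then_read(d, samples, epochs=3)
```

Since the benefit reported above is concentrated at low node counts and large values, the gain for small values or many nodes may be smaller.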

All data was gathered on a Cray EX system equipped with a single 200 Gbps HPE Slingshot NIC on each node. To run the same benchmarks:

export DRAGON_DEFAULT_SEG_SZ=21474836480
dragon gups_ddict.py --benchit
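The value given to DRAGON_DEFAULT_SEG_SZ above is easier to read in binary units. The short sketch below does the conversion, assuming the variable is a size in bytes; `seg_size_bytes` is a hypothetical helper for deriving a different value on systems with more or less memory per node.

```python
GIB = 2**30  # bytes per gibibyte

# Value used in the benchmark commands above.
SEG_SZ = 21474836480


def seg_size_bytes(gib):
    """Byte count for a segment size given in GiB."""
    return gib * GIB


if __name__ == "__main__":
    print(f"{SEG_SZ} bytes = {SEG_SZ / GIB:g} GiB")
    print(f"export DRAGON_DEFAULT_SEG_SZ={seg_size_bytes(20)}")
```

The value works out to exactly 20 GiB.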