Proxy

Dragon can start a runtime on a remote system and let a local system connect to it through a proxy. This allows Dragon applications to burst across system boundaries. A prototypical example is a Jupyter notebook running on a laptop that connects to a high-performance computing (HPC) cluster to run expensive computations: the Dragon runtime is started on the remote HPC system, and the local notebook connects to it via a proxy. As long as the remote runtime is not shut down, the local client can enable and disable the proxy as many times as needed.

Setup

  1. Ensure that passwordless SSH access is set up from the local system to the remote system. The proxy mechanism currently uses SSH to connect to the remote system and forward communication between the proxy client and proxy server.

  2. Ensure that the Dragon infrastructure is installed and configured on both the local and remote systems.

  3. Ensure that the remote Python environment has access to the same code and dependencies as the local system. The code can live in a different path from the one where the server was started, but it must be accessible to the remote runtime.
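The first setup step can be checked from the local system before attaching. The sketch below is one possible pre-flight check, not part of Dragon: the helper names `build_ssh_check_command` and `passwordless_ssh_works` are hypothetical, and the check assumes an OpenSSH client, whose `BatchMode=yes` option makes `ssh` fail instead of prompting for a password.

```python
import subprocess


def build_ssh_check_command(host, connect_timeout=10):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",  # never prompt; fail if a password would be needed
        "-o", f"ConnectTimeout={connect_timeout}",
        host,
        "true",  # trivial remote command; exit status 0 means the login worked
    ]


def passwordless_ssh_works(host, connect_timeout=10):
    """Return True if `ssh host true` succeeds without any interactive prompt."""
    try:
        result = subprocess.run(
            build_ssh_check_command(host, connect_timeout),
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=connect_timeout + 5,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh is not installed, or the connection attempt hung
        return False
    return result.returncode == 0
```

Calling `passwordless_ssh_works("my.remote.system")` before looking up the runtime gives an early, explicit failure if key-based login is not configured.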

Hello World Example

The following listings show a simple example of starting a proxy server on a remote system and connecting to it from a local system with a proxy client. The proxy server can be reused with almost no change for any proxy client. The proxy client code must be modified to specify the remote system and the remote paths used for publishing and shutdown.

Listing 68 Proxy server run on the remote system
import os
import time
import dragon.infrastructure.parameters as dparm
import dragon.workflows.runtime as runtime

def wait_for_exit(exit_file_path):
    # block until the client creates the exit file
    while not os.path.exists(exit_file_path):
        time.sleep(1)
    time.sleep(1)
    # only the process with index 0 removes the exit file
    if dparm.this_process.index == 0:
        os.remove(exit_file_path)

if __name__ == '__main__':
    name = 'proxy_runtime'
    publish_path = os.getcwd()  # path where sdesc is published; must be accessible by the client
    exit_path = os.path.join(os.getcwd(), 'exit_client')  # path where the client creates the exit file; must be writable by the client

    sdesc = runtime.publish(name, publish_path)
    print(f'Runtime serialized descriptor: {sdesc}', flush=True)
    wait_for_exit(exit_path)
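The server loop in Listing 68 waits indefinitely for the exit file. Where an upper bound on the server lifetime is wanted, a variant with an optional timeout could look like the sketch below. The `poll_interval` and `timeout` parameters are assumptions added for illustration and are not part of Dragon; the cleanup of the exit file shown in the listing is omitted here and would still be needed.

```python
import os
import time


def wait_for_exit(exit_file_path, poll_interval=1.0, timeout=None):
    """Poll until the exit file exists; return False if `timeout` seconds pass first."""
    # timeout=None keeps the original behavior of waiting forever
    deadline = None if timeout is None else time.monotonic() + timeout
    while not os.path.exists(exit_file_path):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True
```

The return value lets the caller distinguish a client-requested shutdown (`True`) from a timeout (`False`).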
Listing 69 Proxy client run on the local system
import dragon
import multiprocessing as mp
import os
import socket

from dragon.native.process import Process
import dragon.workflows.runtime as runtime

def signal_exit(exit_path):
    # create an empty file that the proxy server is polling for
    with open(exit_path, "w"):
        pass

def shutdown_remote_runtime(exit_path, remote_working_dir):
    exit_proc = Process(target=signal_exit, args=(exit_path,), cwd=remote_working_dir)
    exit_proc.start()
    exit_proc.join()

def howdy(q):
    q.put(
        f"howdy from {socket.gethostname()} - local num cores is {os.cpu_count()}, runtime available cores is {mp.cpu_count()}"
    )

def remote_work(proxy, remote_working_dir):
    # environment for processes in the target runtime (not used further in this example)
    if proxy is not None:
        proc_env = proxy.get_env()
    else:
        proc_env = os.environ.copy()

    q = mp.Queue()
    procs = []

    print("Launching remote runtime processes...", flush=True)
    for _ in range(2):
        # using native process so we can set cwd
        p = Process(target=howdy, args=(q,), cwd=remote_working_dir)
        p.start()
        procs.append(p)

    for p in procs:
        msg = q.get()
        print(f"Message from remote runtime: {msg}", flush=True)

    for p in procs:
        p.join()

    # when running in proxy mode, explicitly deleting the processes and queue can be helpful
    for p in procs:
        del p
    del q


if __name__ == '__main__':

    # paths to find remote runtime sdesc and signal exit
    system = "my.remote.system"
    runtime_name = "proxy_runtime"
    publish_dir = "/my/remote/publish/dir"
    exit_path = "/my/remote/publish/dir/exit_client"

    # paths to files used during remote runtime execution
    remote_working_dir = "/my/remote/working/dir"

    mp.set_start_method("dragon")
    runtime_sdesc = runtime.lookup(system, runtime_name, 30, publish_dir=publish_dir)
    proxy = runtime.attach(runtime_sdesc, remote_cwd=remote_working_dir)

    print("\n")

    # run remote work with proxy
    proxy.enable()
    remote_work(proxy, remote_working_dir)
    proxy.disable()

    # run remote work without proxy
    remote_work(None, os.getcwd())

    # run remote work with proxy again
    proxy.enable()
    remote_work(proxy, remote_working_dir)
    # signal client's exit
    shutdown_remote_runtime(exit_path, remote_working_dir)
    proxy.disable()

Using Pickle by Value

Cloudpickle is used to serialize functions and objects sent to the remote runtime. Cloudpickle documents an experimental feature that serializes modules by value rather than by reference, which is the default. This feature may be helpful when using proxies to ensure that the remote runtime has access to the same code as the client. See the [Cloudpickle documentation](https://github.com/cloudpipe/cloudpickle?tab=readme-ov-file#overriding-pickles-serialization-mechanism-for-importable-constructs) for more information. The following listing shows how a local module can be organized to use cloudpickle's register_pickle_by_value so that the module's path does not need to be available on the remote system.

Listing 70 My module that is only on the local system
import dragon
import multiprocessing as mp
import os
import socket

from dragon.native.process import Process
import dragon.workflows.runtime as runtime

def signal_exit(exit_path):
    # create an empty file that the proxy server is polling for
    with open(exit_path, "w"):
        pass

def shutdown_remote_runtime(exit_path):
    exit_proc = Process(target=signal_exit, args=(exit_path,))
    exit_proc.start()
    exit_proc.join()

def howdy(q):
    q.put(
        f"howdy from {socket.gethostname()} - local num cores is {os.cpu_count()}, runtime available cores is {mp.cpu_count()}"
    )

def remote_work():
    q = mp.Queue()
    procs = []

    print("Launching remote runtime processes...", flush=True)
    for _ in range(2):
        # using the native Dragon Process
        p = Process(target=howdy, args=(q,))
        p.start()
        procs.append(p)

    for p in procs:
        msg = q.get()
        print(f"Message from remote runtime: {msg}", flush=True)

    for p in procs:
        p.join()

    # when running in proxy mode, explicitly deleting the processes and queue can be helpful
    for p in procs:
        del p
    del q

In this example, my_local_module.py contains code that is only available on the local system. Registering the module with cloudpickle.register_pickle_by_value ensures that it is serialized and sent to the remote runtime when needed, so the remote runtime can execute code from the module even though the module is not installed on the remote system.

Listing 71 Proxy client run on the local system
import dragon
import multiprocessing as mp
import os

import dragon.workflows.runtime as runtime
import cloudpickle
import my_local_module  # module only on local system
# This may have performance impacts, and recursive imports may not work properly; see the cloudpickle docs.
cloudpickle.register_pickle_by_value(my_local_module)

if __name__ == '__main__':

    # paths to find remote runtime sdesc and signal exit
    system = "my.remote.system"
    runtime_name = "proxy_runtime"
    publish_dir = "/my/remote/publish/dir"
    exit_path = "/my/remote/publish/dir/exit_client"

    mp.set_start_method("dragon")
    runtime_sdesc = runtime.lookup(system, runtime_name, 30, publish_dir=publish_dir)
    proxy = runtime.attach(runtime_sdesc, remote_cwd=publish_dir)

    print("\n")

    # run remote work with proxy
    proxy.enable()
    my_local_module.remote_work()
    proxy.disable()

    # run remote work without proxy
    my_local_module.remote_work()

    # run remote work with proxy again
    proxy.enable()
    my_local_module.remote_work()
    # signal client's exit
    my_local_module.shutdown_remote_runtime(exit_path)
    proxy.disable()
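When deciding which modules need to be registered by value, it can help to check which ones the remote Python environment can already import. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and it is assumed to be executed in the remote environment (for example, in a process launched through the proxy).

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from `module_names` that cannot be found for import."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # raised when a parent package is absent or the name is malformed
            spec = None
        if spec is None:
            missing.append(name)
    return missing
```

Any name returned by `missing_modules` is a candidate either for installation on the remote system or, for pure-Python local code, for cloudpickle.register_pickle_by_value on the client.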

Tips and Tricks

Remote:

  • Ensure that the named file used to signal exit is not present on the remote system before starting the proxy server.

  • Ensure that the remote working directory is accessible and writable by the client process.

  • When using a proxy, the remote runtime continues to run even if the local client disconnects. Be sure to signal the remote runtime to shut down when finished, to avoid leaving stray runtimes running on the remote system.

  • Multiple clients can attach to the same remote runtime via proxies. Each client must look up and attach to the remote runtime separately. The remote runtime manages resources for all connected clients, and the clients share those resources. If a client disconnects, the other clients can still use the remote runtime as long as it is not shut down. If a client shuts down the remote runtime, all clients lose access to it and to any resources within it. Clients are likely to hang on remote operations if the remote runtime is shut down while they are connected.
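The first tip above, removing a leftover exit file before the proxy server starts, can be sketched as a small helper. The name `clear_stale_exit_file` is hypothetical and not part of Dragon; it is assumed to run once on the remote system before the server is launched.

```python
from pathlib import Path


def clear_stale_exit_file(exit_file_path):
    """Remove a leftover exit file; return True if one was found and removed."""
    path = Path(exit_file_path)
    try:
        path.unlink()
    except FileNotFoundError:
        # nothing left over from a previous run
        return False
    return True
```

Without such a check, a stale file from an earlier run would make the server's wait loop return immediately.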

Local:

  • Ensure that the descriptor is correct and can be found by the client when looking up the runtime. If the client cannot find the runtime, check that the publish path is correct and accessible by the client, and that the descriptor is being published to the correct place by the server.

  • When using a proxy, it is helpful to explicitly delete any Dragon processes and queues created within the remote runtime to ensure proper cleanup.

  • When attaching to a remote runtime via a proxy, you can specify a different working directory for the remote processes using the remote_cwd parameter.

  • Use the get_env() method of the proxy to obtain the correct environment variables for processes running in the remote runtime.

  • If the local runtime appears hung, check the remote system for error messages or issues with the Dragon runtime. Many errors are returned to the local runtime when using a proxy; however, some issues may only be visible on the remote system. In particular, missing dependencies or code on the remote system may cause errors that are not visible on the local system.

  • To use, on the local system, a memory object that was created in the remote runtime, instantiate the object within the proxy enable/disable block. It can then be used outside of that block until the remote runtime is shut down.

  • Process and ProcessGroup objects created within the remote runtime cannot be used on the local system when the proxy is disabled, but can be used when the proxy is enabled. Be sure to only use these objects when the proxy is enabled to avoid errors. You can disable and re-enable the proxy as needed to check the status of remote processes. In the future, we may add functionality to automatically route process management calls through the proxy when a process was created in the remote runtime to avoid this issue.
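Because several of the tips above depend on code running strictly between proxy.enable() and proxy.disable(), it may be convenient to wrap the pair in a context manager so that the proxy is disabled even if the enclosed code raises. This is a minimal sketch: `proxy_enabled` is a hypothetical helper, not part of Dragon, and it relies only on the enable() and disable() methods used in the listings.

```python
from contextlib import contextmanager


@contextmanager
def proxy_enabled(proxy):
    """Enable the proxy for the duration of a `with` block, then disable it."""
    proxy.enable()
    try:
        yield proxy
    finally:
        # runs even if the block raises, so the proxy is not left enabled
        proxy.disable()
```

With this helper, the client code could be written as `with proxy_enabled(proxy): remote_work(proxy, remote_working_dir)`, which keeps remote Process and ProcessGroup usage visibly scoped to the enabled region.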