Created by Moritz Wundke
Concurrent aims to be a different type of task distribution system compared to what MPI-like systems do. It adds a simple but powerful application abstraction layer to distribute the logic of an entire application onto a swarm of clusters, holding similarities with volunteer computing systems.
Traditional distributed task systems just execute simple tasks on the distributed system and wait for results. Concurrent goes one step further by letting the tasks and the application decide what to do. The programming paradigm is thus fully asynchronous, with no waiting for results, and based on notifications once a computation has been performed.
Concurrent is built upon a flexible, pluggable component framework. Most of the framework can be swapped out in a few steps and tweaked by installing plug-ins.
Applications themselves are plug-ins that are loaded into the environment and executed.
Components are singleton instances per ComponentManager. They implement the functionality of a given Interface and talk to each other through an ExtensionPoint.
Example setup of Components linked together through ExtensionPoints.
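A minimal sketch of the pattern, assuming a Trac-style component API (which concurrent's model resembles); every class and method name below is hypothetical:

# Component, Interface, implements and ExtensionPoint are provided by
# the framework's component module (exact import path omitted here).

class ILogObserver(Interface):
    # A hypothetical Interface for components that receive log events.
    def log_event(message):
        pass

class ConsoleLogger(Component):
    # A Component implementing the functionality of the Interface.
    implements(ILogObserver)

    def log_event(self, message):
        print(message)

class LogBroadcaster(Component):
    # The ExtensionPoint yields every enabled Component implementing
    # ILogObserver within the same ComponentManager.
    observers = ExtensionPoint(ILogObserver)

    def broadcast(self, message):
        for observer in self.observers:
            observer.log_event(message)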
The configuration system allows us to configure ExtensionPoints via config files.
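For instance, a config file might enable components and bind an ExtensionPoint to concrete implementations; the INI-style layout and option names below are purely illustrative, not the framework's actual schema:

[components]
consolelogger = enabled

[logbroadcaster]
observers = ConsoleLogger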
Our distributed system is based on a 3-node setup, though more classes are involved for flexibility.
Concurrent comes with two task scheduling policies, one optimized for heterogeneous systems and another for homogeneous systems.
Generic scheduling execution flow. A GenericTaskScheduler uses a GenericTaskScheduleStrategy to send work to a GenericTaskManager of the target slave.
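A rough sketch of this flow; apart from the three class names above, all attribute and method names are hypothetical:

class GenericTaskScheduleStrategy(object):
    def pick_slave(self, slaves, task):
        # e.g. choose the least loaded slave; a heterogeneous-system
        # strategy could also weigh each slave's hardware.
        return min(slaves, key=lambda slave: slave.pending_tasks)

class GenericTaskScheduler(object):
    def __init__(self, strategy, slaves):
        self.strategy = strategy
        self.slaves = slaves

    def schedule(self, task):
        # Let the strategy pick the target slave, then hand the task
        # to that slave's GenericTaskManager.
        slave = self.strategy.pick_slave(self.slaves, task)
        slave.task_manager.push_task(task)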
ZMQ push/pull scheduling execution flow. A ZMQTaskScheduler pushes work onto a global work socket. The ZMQTaskManagers of the slaves pull that work from it and perform the processing. The result is then pushed back to the master node.
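A minimal pyzmq sketch of the push/pull pattern; the port numbers and the single-process layout are for illustration only, since in concurrent the master and the slaves are separate nodes:

import zmq

context = zmq.Context()

# Master: bind a PUSH socket for outgoing work and a PULL socket
# for incoming results.
work_out = context.socket(zmq.PUSH)
work_out.bind('tcp://*:5557')
results_in = context.socket(zmq.PULL)
results_in.bind('tcp://*:5558')

# Slave (normally a separate process or machine): pull work from the
# global work socket, process it, push the result back to the master.
work_in = context.socket(zmq.PULL)
work_in.connect('tcp://localhost:5557')
results_out = context.socket(zmq.PUSH)
results_out.connect('tcp://localhost:5558')

work_out.send_json({'value': 4})
task = work_in.recv_json()
results_out.send_json({'result': task['value'] ** 2})
print(results_in.recv_json())  # {'result': 16}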
Concurrent comes with a complex transport module that features TCP and ZMQ socket clients, servers and proxies.
Registering RPC methods is straightforward in concurrent. Just register a method with the given server or client instance:
# JSON-RPC flavor: a slave registers itself with the master over HTTP.
@jsonremote(self.api_service_v1)
def register_slave(request, node_id, port, data):
    self.stats.add_avg('register_slave')
    return self.register_node(node_id, web.ctx['ip'], port, data, NodeType.slave)

# TCP flavor: the same registration exposed on the ZMQ server.
@tcpremote(self.zmq_server, name='register_slave')
def register_slave_tcp(handler, request, node_id):
    self.stats.add_avg('register_slave_tcp')
    return self.register_node_tcp(handler, request, node_id, NodeType.slave)
Concurrent comes with a set of samples that outline the power of the framework. They also show how an application for concurrent should be created and what has to be avoided.
Sample implemented using plain tasks and an ITaskSystem. It comes in two flavors: tasks optimized on the data usage side and non-optimized tasks.
Execution time comparison between an ITaskSystem and a plain task implementation using the non-optimized tasks.
Execution time comparison between an ITaskSystem and a plain task implementation using the optimized tasks. Both approaches gain, but the plain tasks see the biggest boost.
Sample using plain tasks and an ITaskSystem. A simple active-wait task is used to compare the framework and its setup against Amdahl's Law.
Execution time comparison between an ITaskSystem and a plain task implementation. Both are nearly identical.
CPU usage of the sample indicating that we are using all available CPU power for the benchmark.
Very simple sample with low data transmission. It uses brute force to reverse a numerical MD5 hash, knowing the range of numbers. The concurrent implementation is about 4 times faster than the sequential version.
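The core of such a brute force is tiny; here is a sketch of what one task's work could look like (the function and the range splitting are illustrative, not the sample's actual code):

import hashlib

def crack_md5_range(target_hex, start, end):
    # Hash every number in [start, end) until one matches the target
    # digest; each concurrent task would receive its own sub-range.
    for n in range(start, end):
        if hashlib.md5(str(n).encode()).hexdigest() == target_hex:
            return n
    return None

target = hashlib.md5(b'123456').hexdigest()
print(crack_md5_range(target, 0, 1000000))  # -> 123456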
The DNA sample is just a simple way to try to overload the system with over 10,000 separate tasks. Each task requires a considerable amount of data, so sending it all at once has its drawbacks.
The tests and benchmarks have been performed on a single machine, so we set up the worst possible distributed network, resulting in network overloads. Thus the sample is not intended for real use; it outlines the stability of the framework. As with the other samples, the amount of data sent through the network is considerable, and the sample itself reaches a memory load of up to 3 GB on the master.
Sending huge amounts of data is the real bottleneck of any distributed task framework. We spend more time sending data than performing the actual computation. In some cases, sending fewer tasks with more data each is better than sending thousands of small tasks.
A fundamental law when creating any concurrent computing system is Amdahl's Law. Analyzing the performance of our samples, especially the Mandelbrot and benchmark samples, gives us a fairly good speedup compared to the maximum speedup that could be achieved according to the law.
Image courtesy of Wikipedia
Speedup S = 1/((1-P) + (P/N)):
S ≈ 2.4 for P = 0.67 and N = 8
S ≈ 7.02 for P = 0.98 and N = 8
S = 8 for P = 1 and N = 8
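The numbers above can be reproduced in a few lines of Python:

def amdahl_speedup(p, n):
    # Amdahl's Law: maximum speedup for parallel fraction p on n processors.
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.67, 0.98, 1.0):
    print('P = %.2f -> speedup %.2f' % (p, amdahl_speedup(p, 8)))
# P = 0.67 -> speedup 2.42
# P = 0.98 -> speedup 7.02
# P = 1.00 -> speedup 8.00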