I recently needed to choose between two competing designs: a simple one that needed millions of tasks or a complex one that needed only a few processing threads.

Plan A was to leverage the power of Task-based asynchronous programming in .NET to create a simple solution that had tasks waiting on a mix of other tasks, timers, events, and more; but, it would need millions of tasks, all running in parallel. Plan B was a lot of complicated code, but also higher performance from having substantially less overhead by using just a few processing threads to continually churn through work and keep every CPU core busy.

Common wisdom suggests starting with the simple solution and only when that proves non-performant to start optimizing performance. However, in this case, optimizing performance wouldn't involve merely tweaking a few lines of code or a few expensive method calls; instead, it would require a fundamentally different architecture and design, which would involve a complete rewrite of the entire system.

Therefore, an experiment was in order at the outset. I needed to find out if the overhead of a million tasks would have no effect on throughput, hurt throughput a little, or kill it entirely.

TL;DR

You can use millions of tasks, provided throughput is modest (a few thousand operations per second), at the cost of a modest amount of memory and negligible CPU overhead.

Avoid using tasks in extreme throughput scenarios (millions of operations per second), the CPU overhead is huge.

Promise Tasks and Delegate Tasks

In .NET, there are tasks based on TaskCompletionSource, which represent a future result. There are also tasks based on delegates, that represent executable code. Stephen Cleary refers to these as promise tasks and delegate tasks respectively, and that's the terminology I'll use in this post.

Promise tasks get their name from the fact that TaskCompletionSource is equivalent to what is known as a promise; in promise tasks, the Task itself is what is known as a future.

It is worthwhile distinguishing promise tasks and delegate tasks as promise tasks are merely objects and there are no threads or scheduling to worry about. The state of a promise task is controlled entirely through TaskCompletionSource. This means that promise tasks are very efficient and there's no reason to worry about them impacting throughput.

Conversely, delegate tasks are heavier: they require scheduling and execute on threads. If millions of delegate tasks were running in parallel, the context switching between threads would kill throughput and no work would get done. Therefore, .NET schedules the tasks to run one after another. This linearization improves performance and eliminates the overhead of context switching.

However, the delegate tasks used in Plan A do not run for a while and then terminate; they run for a while and then wait, then when their conditions are met, they run again for a while, and the process repeats. Some of these tasks will run like this for seconds, while others will do this for minutes, hours, or even days. In other words, Plan A will stress the task scheduler and I want to know what impact this will have on throughput.

Actor Model

At this point, you're probably thinking that the millions of tasks in Plan A sound like actors and that I should be using an actor model with something like Akka.NET.

It is possible to solve this problem using an actor model. However, the actor model is not a particularly good fit. When an actor is executing, it is single-threaded, it cannot be interrupted to process new messages until it has finished the current one. For this reason, it's normally recommended to push long-running operations out of the actor and into tasks.

But, then you're back to tasks and if all your actors are like this, the actor model is not really providing much benefit. In fact, you must now worry about synchronization between actors and tasks. Furthermore, having the code for a single concern split between multiple classes doesn't help readability, testability, or any other ancillary objective.

There are many different solutions, the actor model is one of them. For the purposes of this post, I'm bundling all these alternate solutions under the label Plan B as they trade complexity for potentially higher performance.

Experiments

To find the overhead of tasks, I decided to measure the maximum throughput of tasks using a trivial operation with negligible cost: counting using the increment operator.

I created four tests:

Signalled Counters using Delegate Task
Signalled Counters using Multi-threading
Time Delayed Counters using Delegate Task
Time Delayed Counters using Multi-threading

In tests one and three, the counters are implemented by delegate tasks with local state and are scheduled by the task scheduler. In tests two and four, the counters are implemented by objects and are scheduled by a fixed number of dedicated threads.

In tests one and two, the counters are allowed to increment when signalled. Dedicated threads process a queue of unsignalled counters, signalling them sequentially. In test one, signalling is handled by a promise task; in test two, signalling is handled by another queue.

In tests three and four, the counters are allowed to increment when a time delay has elapsed. In test three, the counters await a Task.Delay. In test four, priority queues are used to track the earliest each counter can run.

Results

The task-based approaches (tests 1 and 3) support 10x fewer counters than the task-less approaches (tests 2 and 4) as the tasks consume 10x the memory. Under x86, the task-based approaches support 5 million counters before exhausting the available memory; whereas, the task-less approaches support 50 million counters.

With signalled counters, the task-based approach (test 1) had 5x less throughput than the task-less approach (test 2). The task-based approach had a throughput of 1.2 million counts per second; whereas, the task-less approach had a throughput of 6.4 million counts per second.

With time delayed counters, the task-based approach (test 3) had 10x less throughput than the task-less approach (test 4). The task-based approach had a throughput of 0.5 million counts per second; whereas, the task-less approach had a throughput of 5.1 million counts per second.

Here's a summary of the results:

                 Test | Max Counters |  Throughput |
--------------------- |------------- |------------ |
    1. Signalled Task |    5 million | 1.2 million |
  2. Signalled Thread |   50 million | 6.4 million |
   3. Time Delay Task |    5 million | 0.5 million |
 4. Time Delay Thread |   50 million | 5.1 million |

Analysis

From these results, we see that there is a real and significant overhead to using tasks. In situations with very low cost operations and extreme throughput requirements, it makes sense to avoid tasks.

It is perhaps a little surprising that time delay task only achieves 42% of signalled task throughput, but time delay thread maintains 80% of signalled thread throughput. The cause of the additional overhead in time delay task is Task.Delay, which uses a doubly linked list to provide O(1) inserts and deletes at the expense of O(n) scans for triggering timers. In contrast, time delay thread uses a priority queue, which provides O(log n) inserts and deletes and importantly, O(1) access to the next timer.

The Task.Delay implementation is more general and efficiently handles common scenarios like timeouts where the timers never fire. However, when the timers are expected to fire, as in the time delay tests, the priority queue performs better. The time delay task's throughput could be improved by implementing custom promise task based timers that are backed by a priority queue; this would help in situations where many timers are expected to fire.

Realistic Throughput

So far, the experiments have only investigated situations with very inexpensive operations. What happens when the operations are complex? My problem requires millions of tasks, but only a small percentage need to execute at any given moment, the remainder are merely waiting. The expected throughput requirements are between 1,000 and 10,000 operations per second.

To investigate these more modest throughput requirements, I reran the signalled tests using different parameters to simulate this scenario. The following tests used 1.2 million counters, but instead of retriggering every counter as fast as possible, a fixed number of counters are retriggered every second. After running the tests until they reached a steady-state, I recorded the CPU and memory usage of the task-based and task-less approaches. Each test was run multiple times to eliminate noise.

The results were as follows:

  Operations | Task-based Approach | Task-less Approach |  
  Per Second |      CPU |   Memory |     CPU |   Memory |
-------------|--------- |--------- |-------- |--------- |
       1,000 |    0.1 % |   324 MB |     0 % |    41 MB |
      10,000 |      1 % |   328 MB |     0 % |    43 MB |
      20,000 |      2 % |   348 MB |     0 % |    45 MB |
      50,000 |      5 % |   378 MB |     0 % |    51 MB |
     100,000 |     10 % |   430 MB |     0 % |    60 MB |

As in the earlier tests, we see that the task-based approach uses significantly more memory and has significantly more overhead. However, even with over a million tasks, the memory usage is quite manageable: only a few hundred megabytes. Furthermore, only the memory usage is significantly influenced by the number of tasks; the CPU usage and overhead is primarily determined by the number of operations per second.

Therefore, for problems requiring a modest throughput of a few thousand operations per second, the overhead of a task-based solution is negligible and is unlikely to significantly impact overall performance. However, as throughput increases, the overhead becomes significant and it makes sense to avoid tasks.

Conclusion

Tasks are lightweight, but they're not free. If you have extreme throughput requirements you should avoid tasks. But, for modest throughput requirements, a few thousand requests per second, using tasks is not going to impact you too much. You can even have millions of tasks, provided you don't need to execute more than a few thousand of them in any given second.

The default timer implementation is not optimized for extensive use of Task.Delay or more specifically for timers that are guaranteed to fire. You can improve performance significantly by using a priority queue based implementation if you require that sort of timer in task form.

Maximizing Throughput - The Overhead of 1 Million Tasks