Description
tl;dr: The Cadence server side is vulnerable to database problems, particularly large partitions, when users start and close the same workflow quickly (as is the case with cron workflows)
What this affects
Both the NoSQL and SQL implementations, for all versions.
Context
Cadence stores timers for workflows in the executions table. I'll refer to the Cassandra implementation here by way of illustration; some details may differ for the SQL implementations.
Timers are removed after the task-processing queue processes them, which typically happens immediately after the visibility_ts ack-level passes. Timers set to fire in the future therefore have a visibility_ts set for when they're due to fire, and this is currently the only means by which they're removed. For workflows that have completed, timers set further into the future which haven't yet fired remain in the table until that visibility_ts point in time is passed.
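The removal rule above can be sketched with a toy model (the row shapes and field names here are illustrative, not Cadence's actual schema): a timer row is only deleted once the queue's ack-level passes its visibility_ts, even if its workflow has already completed.

```python
# Toy model: timer rows are deleted only when the ack level passes
# their visibility_ts; workflow completion alone does not remove them.

def process_timers(timer_rows, ack_level):
    """Delete timer rows whose visibility_ts has been passed; return survivors."""
    return [t for t in timer_rows if t["visibility_ts"] > ack_level]

timers = [
    {"run_id": "run-1", "visibility_ts": 100},        # already due: gets removed
    {"run_id": "run-1", "visibility_ts": 9_000_000},  # far-future start-to-close timeout
]

# The workflow for run-1 completes at t=200, but only the due timer goes
# away; the far-future timeout row survives until t=9_000_000.
remaining = process_timers(timers, ack_level=200)
print(remaining)
```

The surviving row is exactly the kind of record that piles up for quickly-completing workflows with long timeouts.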
This works fine and is as-designed. However, it faces problems when timers are set very far into the future and created in the same partition in large numbers. This is trivially done by a customer starting a cron workflow, for example, with a very long start-to-close execution timeout and having it start and complete quickly.
The net effect (in the examples I was observing) was that small cron workflows spinning quickly with large workflow timeouts came to dominate the records on the database shard with their disused timers (in the instances I saw it was the workflow-execution start-to-close timeout, but presumably this is possible with all timers). Because all runs of a cron workflow land on a single shard, they create large 'lumps' of data which cause the database to choke.
Operator's Solution
For operators tackling this, doing an anti-join on the executions table's type-1 (concrete-execution) and type-3 (timer) records and removing the excess timer rows is a short-term solution.
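The anti-join amounts to something like the following sketch (the row dicts are a stand-in for exported table rows, not real CQL): collect the run IDs that still have a type-1 concrete-execution record, then flag any type-3 timer row whose run ID isn't among them.

```python
# Sketch of the operator clean-up: anti-join timer rows (type 3) against
# concrete executions (type 1); timers with no surviving execution record
# are orphans and can be deleted.

TYPE_EXECUTION, TYPE_TIMER = 1, 3

rows = [
    {"type": TYPE_EXECUTION, "run_id": "run-live"},
    {"type": TYPE_TIMER, "run_id": "run-live", "visibility_ts": 500},
    {"type": TYPE_TIMER, "run_id": "run-closed", "visibility_ts": 9_000_000},  # orphan
]

live_runs = {r["run_id"] for r in rows if r["type"] == TYPE_EXECUTION}
orphan_timers = [r for r in rows if r["type"] == TYPE_TIMER
                 and r["run_id"] not in live_runs]
print([t["run_id"] for t in orphan_timers])  # candidates for deletion
```

In practice this would run against an export of the executions table, since Cassandra can't express the anti-join directly in a query.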
Actual solution
Workflows should presumably clean up their timers when they close and pass retention; adding this as a small fix should be fairly easy.
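The shape of the fix might look like this sketch (the function and storage layout are hypothetical, not Cadence's actual persistence API): when an execution's record is removed at retention, delete any still-unfired timer rows for the same run in the same operation.

```python
# Minimal sketch of the proposed fix: deleting an execution at retention
# also deletes its outstanding timer rows, so nothing lingers until a
# far-future visibility_ts.

def delete_execution_with_timers(rows, run_id):
    """Remove the concrete-execution row AND any timer rows for run_id."""
    return [r for r in rows if r["run_id"] != run_id]

rows = [
    {"type": 1, "run_id": "run-1"},
    {"type": 3, "run_id": "run-1", "visibility_ts": 9_000_000},  # unfired timeout
    {"type": 1, "run_id": "run-2"},
]

rows = delete_execution_with_timers(rows, "run-1")
print(rows)  # run-1's timer went with its execution; run-2 is untouched
```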