Description
tl;dr: The Cadence server side is vulnerable to database problems, particularly large partitions, when users start and close the same workflow quickly (as is the case with cron workflows)
What this affects
Both the NoSQL and SQL implementations, for all versions.
Context
Cadence stores timers for workflows in the executions table. I'll refer to the Cassandra implementation here by way of illustration; some details may differ for the SQL implementations.
Timers are removed after the task-processing queue processes them, which typically happens immediately after the visibility_ts ack-level passes. Timers set to fire in the future therefore have a visibility_ts set for when they're due to fire, and this is currently the only means by which they're removed. For workflows that have completed, timers set further into the future which haven't yet fired remain in the table until that visibility_ts point in time is passed.
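The removal rule above can be sketched with a toy model (the row shapes and field names here are illustrative, not Cadence's actual schema): a timer row is only deleted once the queue's ack-level passes its visibility_ts, even if its workflow has already completed.

```python
# Toy model: timer rows are deleted only when the ack level passes
# their visibility_ts; workflow completion alone does not remove them.

def process_timers(timer_rows, ack_level):
    """Delete timer rows whose visibility_ts has been passed; return survivors."""
    return [t for t in timer_rows if t["visibility_ts"] > ack_level]

timers = [
    {"run_id": "run-1", "visibility_ts": 100},        # already due: gets removed
    {"run_id": "run-1", "visibility_ts": 9_000_000},  # far-future start-to-close timeout
]

# The workflow for run-1 completes at t=200, but only the due timer goes
# away; the far-future timeout row survives until t=9_000_000.
remaining = process_timers(timers, ack_level=200)
print(remaining)
```

The surviving row is exactly the kind of record that piles up for quickly-completing workflows with long timeouts.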
This works fine and is as-designed. However, it faces problems when timers are set very far into the future and created in the same partition in large numbers. This is trivially done by a customer starting a cron workflow, for example, with a very long start-to-close execution timeout and having it start and complete quickly.
The net effect (in the examples I was observing) was that small cron workflows spinning quickly with large workflow timeouts came to dominate the records on the database shard with their disused timers (in the instances I saw it was the workflow-execution start-to-close timeout, but presumably this is possible with all timers). Because all runs of a cron workflow land on a single shard, they create large 'lumps' of data which cause the database to choke.
Operator's Solution
For operators tackling this, doing an anti-join on the executions table's type-1 (concrete-execution) and type-3 (timer) records and removing the excess timer rows is a short-term solution.
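The anti-join amounts to something like the following sketch (the row dicts are a stand-in for exported table rows, not real CQL): collect the run IDs that still have a type-1 concrete-execution record, then flag any type-3 timer row whose run ID isn't among them.

```python
# Sketch of the operator clean-up: anti-join timer rows (type 3) against
# concrete executions (type 1); timers with no surviving execution record
# are orphans and can be deleted.

TYPE_EXECUTION, TYPE_TIMER = 1, 3

rows = [
    {"type": TYPE_EXECUTION, "run_id": "run-live"},
    {"type": TYPE_TIMER, "run_id": "run-live", "visibility_ts": 500},
    {"type": TYPE_TIMER, "run_id": "run-closed", "visibility_ts": 9_000_000},  # orphan
]

live_runs = {r["run_id"] for r in rows if r["type"] == TYPE_EXECUTION}
orphan_timers = [r for r in rows if r["type"] == TYPE_TIMER
                 and r["run_id"] not in live_runs]
print([t["run_id"] for t in orphan_timers])  # candidates for deletion
```

In practice this would run against an export of the executions table, since Cassandra can't express the anti-join directly in a query.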
Actual solution
Workflows should presumably clean up their timers when they close and pass retention; adding this as a small fix should be fairly easy.
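The shape of the fix might look like this sketch (the function and storage layout are hypothetical, not Cadence's actual persistence API): when an execution's record is removed at retention, delete any still-unfired timer rows for the same run in the same operation.

```python
# Minimal sketch of the proposed fix: deleting an execution at retention
# also deletes its outstanding timer rows, so nothing lingers until a
# far-future visibility_ts.

def delete_execution_with_timers(rows, run_id):
    """Remove the concrete-execution row AND any timer rows for run_id."""
    return [r for r in rows if r["run_id"] != run_id]

rows = [
    {"type": 1, "run_id": "run-1"},
    {"type": 3, "run_id": "run-1", "visibility_ts": 9_000_000},  # unfired timeout
    {"type": 1, "run_id": "run-2"},
]

rows = delete_execution_with_timers(rows, "run-1")
print(rows)  # run-1's timer went with its execution; run-2 is untouched
```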