
Timers / large-partition vulnerability with workflows with long timeouts & cron #7568

@davidporter-id-au

Description

tl;dr The Cadence server is vulnerable to database problems, particularly large partitions, when users start and close the same workflow quickly (as is the case with cron workflows).

What this affects

Both the NoSQL and SQL implementations, for all versions.

Context

In the Cassandra implementation, Cadence stores workflow timers in the executions table. (I'll use Cassandra here by way of illustration; some details may differ for the SQL implementations.)

Timers are removed after the task-processing queue processes them, typically immediately after the visibility_ts ack level passes. Timers set to fire in the future therefore have their visibility_ts set to the time they're due to fire, and this is currently the only means by which they're removed. For workflows that have already completed, timers set further into the future which haven't fired yet remain in the table until that visibility_ts point in time passes.
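To make that lifecycle concrete, here's a minimal sketch of the dispose-on-ack behaviour, assuming the Cassandra executions schema where type 3 marks timer-task rows. Column names are illustrative and the real clustering key includes more columns; this is a sketch, not the actual queue-processor code.

```go
package sketch

import (
	"time"

	"github.com/gocql/gocql"
)

// rowTypeTimerTask is the `type` value used for timer-task rows in the
// executions table (assumption for this sketch; type 1 = concrete execution).
const rowTypeTimerTask = 3

// drainTimers sketches how the timer queue disposes of timer rows: read
// tasks whose visibility_ts has fallen below the ack level, process them,
// then delete them. Timers ahead of the ack level are never touched, which
// is why far-future timers for closed workflows linger.
func drainTimers(session *gocql.Session, shardID int, ackLevel time.Time) error {
	iter := session.Query(
		`SELECT visibility_ts, task_id FROM executions
		 WHERE shard_id = ? AND type = ? AND visibility_ts < ?`,
		shardID, rowTypeTimerTask, ackLevel,
	).Iter()

	var visTS time.Time
	var taskID int64
	for iter.Scan(&visTS, &taskID) {
		// ... fire the timer / dispatch the task here ...
		if err := session.Query(
			`DELETE FROM executions
			 WHERE shard_id = ? AND type = ? AND visibility_ts = ? AND task_id = ?`,
			shardID, rowTypeTimerTask, visTS, taskID,
		).Exec(); err != nil {
			return err
		}
	}
	return iter.Close()
}
```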

This works fine and is as designed. However, it runs into problems when timers are set very far into the future and created in the same partition in large numbers. This is trivially done by a customer starting a cron workflow, for example, with a very long start-to-close execution timeout and having it start and complete quickly.
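To make the failure mode concrete, here's a minimal sketch of that pattern using the Cadence Go client; the workflow name, task list, and exact timeout values are placeholders:

```go
package sketch

import (
	"context"
	"time"

	"go.uber.org/cadence/client"
)

// startSpinningCron shows the problematic combination: a cron workflow
// that starts and completes in seconds, but carries a very long execution
// timeout. Every run enqueues a workflow-timeout timer a year out; the run
// finishes long before that, so the dead timer rows pile up on the shard.
func startSpinningCron(ctx context.Context, c client.Client) error {
	opts := client.StartWorkflowOptions{
		ID:       "quick-cron",
		TaskList: "example-tasklist",
		// Fires every minute...
		CronSchedule: "* * * * *",
		// ...but each run is allowed to take a year. This is the
		// combination that litters the partition with unfired timers.
		ExecutionStartToCloseTimeout: 365 * 24 * time.Hour,
	}
	_, err := c.StartWorkflow(ctx, opts, "quickCronWorkflow")
	return err
}
```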

The net effect of this (in the examples I observed) was that small cron workflows spinning quickly with large workflow timeouts quickly came to dominate the records on the database shard with their disused timers (in the instances I saw it was the workflow execution start-to-close timeout, but presumably this is possible with all timers). Because all runs of a cron workflow land on a single shard, they create large 'lumps' of data which cause the database to choke.

Operator's Solution

For operators tackling this, doing an anti-join on the executions table's type-1 (concrete-execution) and type-3 (timer) records and removing the excess records is a short-term solution.
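A hedged sketch of that anti-join follows. It assumes the timer rows expose domain_id / workflow_id / run_id as columns they can be joined on; depending on the schema version the workflow identity may instead live inside the serialized timer blob, in which case the rows need decoding first. Verify against your schema before running anything like this.

```go
package sketch

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

// Assumed row-type discriminators in the executions table.
const (
	rowTypeExecution = 1 // concrete execution
	rowTypeTimer     = 3 // timer task
)

// cleanOrphanedTimers scans a shard's timer rows and deletes any whose
// workflow no longer has a concrete-execution row: those timers will never
// fire usefully and only bloat the partition.
func cleanOrphanedTimers(session *gocql.Session, shardID int) error {
	iter := session.Query(
		`SELECT domain_id, workflow_id, run_id, visibility_ts, task_id
		 FROM executions WHERE shard_id = ? AND type = ?`,
		shardID, rowTypeTimer,
	).Iter()

	var domainID, runID gocql.UUID
	var workflowID string
	var visTS time.Time
	var taskID int64
	for iter.Scan(&domainID, &workflowID, &runID, &visTS, &taskID) {
		var count int64
		if err := session.Query(
			`SELECT count(*) FROM executions
			 WHERE shard_id = ? AND type = ? AND domain_id = ?
			   AND workflow_id = ? AND run_id = ?`,
			shardID, rowTypeExecution, domainID, workflowID, runID,
		).Scan(&count); err != nil {
			return err
		}
		if count == 0 {
			// No live execution row: this timer is orphaned.
			if err := session.Query(
				`DELETE FROM executions
				 WHERE shard_id = ? AND type = ? AND domain_id = ?
				   AND workflow_id = ? AND run_id = ?
				   AND visibility_ts = ? AND task_id = ?`,
				shardID, rowTypeTimer, domainID, workflowID, runID,
				visTS, taskID,
			).Exec(); err != nil {
				return err
			}
			log.Printf("removed orphaned timer task %d", taskID)
		}
	}
	return iter.Close()
}
```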

Actual solution

Workflows presumably should clean up their timers when they close and pass retention; adding this as a small fix should be pretty easy.
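A sketch of what that fix might look like at the persistence layer, again assuming the Cassandra schema used above; the timerKey type and the exact delete statements are hypothetical:

```go
package sketch

import (
	"time"

	"github.com/gocql/gocql"
)

const (
	rowTypeExecution = 1 // concrete execution
	rowTypeTimer     = 3 // timer task
)

// timerKey identifies one leftover timer row for a closed workflow
// (hypothetical type for illustration).
type timerKey struct {
	VisibilityTS time.Time
	TaskID       int64
}

// deleteWorkflowWithTimers sketches the proposed fix: when retention
// removes a closed workflow's concrete-execution row, delete its leftover
// timer rows in the same logged batch so they cannot outlive the workflow.
// All rows share the shard_id partition key, so the batch stays local.
func deleteWorkflowWithTimers(
	session *gocql.Session,
	shardID int,
	domainID, runID gocql.UUID,
	workflowID string,
	timers []timerKey,
) error {
	batch := session.NewBatch(gocql.LoggedBatch)
	batch.Query(
		`DELETE FROM executions
		 WHERE shard_id = ? AND type = ? AND domain_id = ?
		   AND workflow_id = ? AND run_id = ?`,
		shardID, rowTypeExecution, domainID, workflowID, runID,
	)
	for _, t := range timers {
		batch.Query(
			`DELETE FROM executions
			 WHERE shard_id = ? AND type = ? AND visibility_ts = ? AND task_id = ?`,
			shardID, rowTypeTimer, t.VisibilityTS, t.TaskID,
		)
	}
	return session.ExecuteBatch(batch)
}
```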
