akash-R-A-J commented Jan 5, 2026

Addresses issue #2001, in which workers deadlock when many identical actions run simultaneously.

Root causes and fixes

  1. File permit exhaustion
    download_to_directory used unbounded parallelism, which could exhaust file permits
    under high concurrency. This is fixed by introducing
    MAX_CONCURRENT_FILE_OPS = 64 and switching to buffer_unordered to bound
    concurrent file operations.

  2. Race condition during action registration
    create_and_add_action checked for duplicate actions after async work had begun,
    allowing identical actions to pass the check concurrently. This is fixed by
    registering a placeholder entry before async work begins, with cleanup on failure,
    ensuring duplicates are rejected deterministically.
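
To make the first fix concrete without pulling in an async runtime, here is a minimal std-only sketch of capping in-flight work. It is not the code in this PR: the PR bounds a stream of futures with buffer_unordered, while this sketch swaps in a fixed pool of worker threads to show the same idea. `run_bounded` and its peak counter are hypothetical names used only for illustration; only the constant name and value come from the description above.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Cap named in this PR; everything else in this sketch is hypothetical.
#[allow(dead_code)]
const MAX_CONCURRENT_FILE_OPS: usize = 64;

/// Runs `op` over every item with at most `limit` operations in flight.
/// Returns the results in completion order plus the observed peak concurrency.
fn run_bounded<T, R, F>(items: Vec<T>, limit: usize, op: F) -> (Vec<R>, usize)
where
    T: Send + 'static,
    R: Send + 'static,
    F: Fn(T) -> R + Send + Sync + 'static,
{
    let queue = Arc::new(Mutex::new(VecDeque::from(items)));
    let results = Arc::new(Mutex::new(Vec::new()));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let op = Arc::new(op);

    // One worker per permitted slot, so concurrency can never exceed `limit`.
    let workers: Vec<_> = (0..limit.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let results = Arc::clone(&results);
            let in_flight = Arc::clone(&in_flight);
            let peak = Arc::clone(&peak);
            let op = Arc::clone(&op);
            thread::spawn(move || loop {
                // The queue lock is released before `op` runs.
                let next = queue.lock().unwrap().pop_front();
                let Some(item) = next else { break };
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                let out = op(item);
                in_flight.fetch_sub(1, Ordering::SeqCst);
                results.lock().unwrap().push(out);
            })
        })
        .collect();

    for worker in workers {
        worker.join().unwrap();
    }
    let collected = std::mem::take(&mut *results.lock().unwrap());
    (collected, peak.load(Ordering::SeqCst))
}
```

Calling `run_bounded(paths, MAX_CONCURRENT_FILE_OPS, open_and_write)` with a hypothetical `open_and_write` function would keep at most 64 file operations open at once, however many paths are queued. Results come back in completion order rather than input order, which is also how buffer_unordered behaves.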

Fixes #2001

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How has this been tested?

  • cargo check
  • cargo test download_to_directory
  • Pre-commit hooks via nix develop (formatting, linting, Vale docs checks)

Note: There is currently no test in the codebase that reproduces the exact concurrent
identical-action scenario described in #2001. This change addresses the identified
root causes deterministically.
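
As a starting point for such a test, the early-registration pattern from the second fix can be exercised in isolation. The sketch below is a simplified, synchronous stand-in and not the code in this PR: `ActionRegistry`, `Slot`, `RegisterError`, and `create_and_add` are hypothetical names, and the slow async setup is modeled as a plain closure.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// State of a registered action (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Slot {
    Pending,
    Ready(String),
}

#[derive(Debug, PartialEq)]
enum RegisterError {
    Duplicate,
    SetupFailed(String),
}

struct ActionRegistry {
    slots: Mutex<HashMap<String, Slot>>,
}

impl ActionRegistry {
    fn new() -> Self {
        ActionRegistry { slots: Mutex::new(HashMap::new()) }
    }

    /// Reserves `id` with a placeholder before running `setup`, so a second
    /// caller with the same id is rejected even while setup is still running.
    fn create_and_add<F>(&self, id: &str, setup: F) -> Result<(), RegisterError>
    where
        F: FnOnce() -> Result<String, String>,
    {
        {
            // Check and insert under one lock acquisition: no gap for a race.
            let mut slots = self.slots.lock().unwrap();
            if slots.contains_key(id) {
                return Err(RegisterError::Duplicate);
            }
            slots.insert(id.to_string(), Slot::Pending);
        } // Lock released here, before the slow work starts.

        match setup() {
            Ok(value) => {
                self.slots.lock().unwrap().insert(id.to_string(), Slot::Ready(value));
                Ok(())
            }
            Err(reason) => {
                // Cleanup on failure so the id can be registered again later.
                self.slots.lock().unwrap().remove(id);
                Err(RegisterError::SetupFailed(reason))
            }
        }
    }

    fn is_registered(&self, id: &str) -> bool {
        self.slots.lock().unwrap().contains_key(id)
    }
}
```

Because the duplicate check and the placeholder insert happen under the same lock, exactly one of any number of concurrent callers with the same id can proceed. This sketch does not handle a panic inside `setup`, which would leave the placeholder behind; a drop guard would be one way to cover that case.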

Checklist

  • Tests pass locally
  • Change is contained in a single commit


CLAassistant commented Jan 5, 2026

All committers have signed the CLA.

MarcusSorealheis (Collaborator) commented Jan 5, 2026

Thanks for the PR. Before we review it, we'd like to see all tests passing.

Address issue TraceMachina#2001 where workers deadlock when numerous identical actions run simultaneously.

Root causes and fixes:

1. File permit exhaustion: download_to_directory used unbounded parallelism. Added MAX_CONCURRENT_FILE_OPS=64 with buffer_unordered.

2. Race condition: create_and_add_action checked for duplicates AFTER async work. Added early registration with placeholder BEFORE async work.

Fixes TraceMachina#2001

akash-R-A-J (Author) commented Jan 6, 2026

All CI checks are now passing. Thanks for taking another look. @MarcusSorealheis

Development

Successfully merging this pull request may close these issues.

Deadlock when numerous identical actions run at the same time