fix (orchestrator): resolve EPERM errors on Windows during checkpoint #20
Hey! I ran into EPERM (operation not permitted) errors while running MemoryBench on Windows, specifically when the system saves the checkpoint file during a run (or when stopping the run, which triggers a final save). It looks like fs.renameSync sometimes conflicts with file locks (likely antivirus software or the OS briefly holding onto handles).
I fixed this by:
Wrapping the atomic renameSync operation in a retry loop.
Adding exponential backoff to handle transient EBUSY or EPERM locks, which are common on Windows.
This preserves the safety of atomic writes (vital for crash resilience) while making checkpoint saves robust against Windows file locking issues.
On Linux/macOS, the rename normally succeeds on the first try, so the retry loop never sleeps and there is no performance impact.
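For reviewers who want the shape of the change at a glance, here is a minimal sketch of the retry-with-backoff pattern described above. It is not the exact code in this PR: the names renameWithRetry, saveCheckpointAtomic, and sleepSync, the 5-attempt limit, and the 10 ms base delay are all illustrative assumptions. The injectable rename parameter exists only to make the sketch easy to exercise without a real file lock.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Error codes treated as transient lock conflicts (assumption, taken from the description).
const RETRYABLE_CODES = new Set(["EPERM", "EBUSY"]);

// Block the current thread for `ms` milliseconds without busy-waiting.
function sleepSync(ms: number): void {
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}

// Hypothetical helper: rename with exponential backoff on transient errors.
// Returns the number of attempts it took; rethrows once attempts run out
// or when the error is not in RETRYABLE_CODES.
function renameWithRetry(
  from: string,
  to: string,
  maxAttempts = 5,
  baseDelayMs = 10,
  rename: (from: string, to: string) => void = fs.renameSync,
): number {
  for (let attempt = 1; ; attempt++) {
    try {
      rename(from, to);
      return attempt;
    } catch (err) {
      const code = (err as { code?: string }).code;
      const retryable = code !== undefined && RETRYABLE_CODES.has(code);
      if (!retryable || attempt >= maxAttempts) throw err;
      // Delays grow as 10, 20, 40, ... ms with the default base delay.
      sleepSync(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}

// Hypothetical helper: write to a temp file, then rename it over the target,
// so readers see either the old checkpoint or the complete new one.
function saveCheckpointAtomic(filePath: string, data: string): void {
  const tmpPath = `${filePath}.tmp`;
  fs.writeFileSync(tmpPath, data);
  renameWithRetry(tmpPath, filePath);
}

// Usage: write a checkpoint into a fresh temp directory and read it back.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "checkpoint-"));
const target = path.join(dir, "checkpoint.json");
saveCheckpointAtomic(target, JSON.stringify({ step: 1 }));
console.log(fs.readFileSync(target, "utf8"));
```

Errors other than EPERM and EBUSY are rethrown immediately in this sketch, so genuine failures (for example a missing directory) still surface without delay.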