Checklist:
- I've included steps to reproduce the bug.
- I've included the version of argo rollouts.
Describe the bug
When you are in the middle of a canary release (e.g. 70% stable, 30% canary) and another release overwrites it, argo-rollouts tries to start over with the new version (e.g. 95% stable, 5% canary).
During this flip to the new version we check the health of the stable RS in UpdateHash, but there is a specific case where it returns no error yet also doesn't do what we assumed it did: the DestinationRule is left untouched.
Because no error surfaces, we continue in trafficrouting and start the scale down with reconcileOtherReplicaSets(), as sketched below.
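To make the failure mode concrete, here is a minimal Go sketch of the control flow as I understand it. The types and function bodies are simplified stand-ins, not the actual argo-rollouts code:

```go
package main

import "fmt"

// Simplified stand-in for the real ReplicaSet type.
type ReplicaSet struct {
	Name      string
	Replicas  int32
	Available int32
}

func (rs ReplicaSet) FullyAvailable() bool { return rs.Available >= rs.Replicas }

// updateHash mirrors the problematic shape: when stable is not fully
// available it logs "delaying destination rule switch" and returns nil,
// so the DestinationRule still points at the old canary subset.
func updateHash(stable ReplicaSet, canaryHash string, destRuleCanary *string) error {
	if !stable.FullyAvailable() {
		fmt.Printf("delaying destination rule switch: ReplicaSet %s not fully available\n", stable.Name)
		return nil // no error, but also no update
	}
	*destRuleCanary = canaryHash
	return nil
}

func main() {
	stable := ReplicaSet{Name: "xxx-6c89fd9477", Replicas: 10, Available: 7} // pods being deleted
	destRuleCanary := "7779987b4d"                                           // old canary subset

	// The caller treats a nil error as "hash updated" and proceeds to scale
	// down the old ReplicaSets (reconcileOtherReplicaSets in the real code),
	// even though the DestinationRule was never rewritten.
	if err := updateHash(stable, "6595cc886", &destRuleCanary); err == nil {
		fmt.Printf("scaling down old RS; DestinationRule canary subset is still: %s\n", destRuleCanary)
	}
}
```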
The net effect is a situation where:
- DestinationRule is still pointing to old-canary
- old-canary scale down is triggered
- new-canary is healthy but not getting traffic
- once the scale down completes, clients get 503/UH because no endpoints are left
This does eventually resolve itself, e.g. once the stable RS recovers.
To Reproduce
- Trigger a deployment with Istio and Canary mode. Use subset for traffic routing
- Let it run to a phase where the rollout is midway (e.g. a 70/30 split)
- Continuously cause the stable RS to be not fully available, e.g. by deleting its pods randomly in a loop (see the sketch after these steps)
- Trigger a new deployment
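For the pod-deletion step, a client-go loop like the following works. This is a sketch: the namespace, kubeconfig location, and the rollouts-pod-template-hash label value are assumptions to adjust for your cluster.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumed values: adjust the namespace and label selector for your Rollout.
	const namespace = "default"
	const stableSelector = "rollouts-pod-template-hash=6c89fd9477"

	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	for {
		// List the stable RS pods and delete one at random so the RS
		// never reaches full availability.
		pods, err := client.CoreV1().Pods(namespace).List(context.TODO(),
			metav1.ListOptions{LabelSelector: stableSelector})
		if err != nil || len(pods.Items) == 0 {
			time.Sleep(2 * time.Second)
			continue
		}
		victim := pods.Items[rand.Intn(len(pods.Items))].Name
		fmt.Println("deleting", victim)
		_ = client.CoreV1().Pods(namespace).Delete(context.TODO(), victim, metav1.DeleteOptions{})
		time.Sleep(2 * time.Second)
	}
}
```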
Running these steps should produce logs like:
New weights: &TrafficWeights{Canary:WeightDestination{Weight:5,ServiceName:,PodTemplateHash:<new-canary-rs>,},Stable:WeightDestination{Weight:95,ServiceName:,PodTemplateHash:<stable-rs>,},Additional:[]WeightDestination{},Verified:nil,}
Previous weights: &TrafficWeights{Canary:WeightDestination{Weight:70,ServiceName:,PodTemplateHash:<old-canary-rs>,},Stable:WeightDestination{Weight:30,ServiceName:,PodTemplateHash:<stable-rs>,},Additional:[]WeightDestination{},Verified:nil,}
delaying destination rule switch: ReplicaSet <stable-rs> not fully available
The key point is that we do not see:
DestinationRule <rule> subset updated (canary: <new-rs>, stable: <stable-rs>)
Expected behavior
There are two routes we could take (both sketched after this list):
- Delay the whole process until stable is fully healthy, e.g. return an error from updateHash(). This would essentially revert fix: abort scenario where canary/stable service is not provided #4299
- Ignore stable not being fully up and continue in cases where a stable + canary pair already exists
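A minimal sketch of what the two options could look like. Again, the types and names are simplified stand-ins for illustration, not the actual argo-rollouts code:

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-in for the real ReplicaSet type.
type ReplicaSet struct {
	Name           string
	FullyAvailable bool
}

// Option 1: surface the problem instead of silently skipping the update.
// The reconcile loop would then retry rather than scale anything down.
func updateHashStrict(stable ReplicaSet) error {
	if !stable.FullyAvailable {
		return fmt.Errorf("delaying destination rule switch: ReplicaSet %s not fully available", stable.Name)
	}
	// ... update the DestinationRule subsets here ...
	return nil
}

// Option 2: only require full availability on the first switch; if a
// stable+canary pair already exists, proceed with the subset update anyway.
func updateHashLenient(stable ReplicaSet, canaryAlreadyExists bool) error {
	if !stable.FullyAvailable && !canaryAlreadyExists {
		return errors.New("stable not available and no existing canary subset")
	}
	// ... update the DestinationRule subsets here ...
	return nil
}

func main() {
	stable := ReplicaSet{Name: "xxx-6c89fd9477", FullyAvailable: false}
	fmt.Println("strict: ", updateHashStrict(stable))        // errors, reconcile retries
	fmt.Println("lenient:", updateHashLenient(stable, true)) // proceeds with the update
}
```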
Screenshots
6c89fd is stable, 777998 is old-canary, 6595cc is new-canary
Version
v1.8.3 of argo-rollouts
Logs
I've had to cut out some logs to reduce noise, but this is the key sequence: even though the old RS is noted as 'still referenced' during the original run, on the next sync the Rollout is updated to point to the new canary even though we haven't updated the DestinationRule subset.
2025-07-31T16:47:58Z: New weights: &TrafficWeights{Canary:WeightDestination{Weight:5,ServiceName:,PodTemplateHash:6595cc886,},Stable:WeightDestination{Weight:95,ServiceName:,PodTemplateHash:6c89fd9477,},Additional:[]WeightDestination{},Verified:nil,}
2025-07-31T16:47:58Z: Skip scale down of older RS 'xxx-7779987b4d': still referenced
2025-07-31T16:47:58Z: Rollout step 1/9 completed (setWeight: 5)
2025-07-31T16:47:58Z: Reconciliation completed
2025-07-31T16:47:58Z: Started syncing rollout
2025-07-31T16:47:58Z: Reconciling 1 old ReplicaSets (total pods: n)
2025-07-31T16:47:58Z: scaling down intermediate RS 'xxx-7779987b4d'
2025-07-31T16:47:58Z: delaying destination rule switch: ReplicaSet xxx-6c89fd9477 not fully available
2025-07-31T16:48:32Z: DestinationRule wealthsimple subset updated (canary: 6595cc886, stable: 6c89fd9477)
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.