Checklist:
- I've included steps to reproduce the bug.
- I've included the version of argo rollouts.
Describe the bug
When you are in the middle of a canary release (e.g. 70% stable, 30% canary) and another release overwrites it, argo-rollouts tries to start over with the new version (e.g. 95% stable, 5% canary).
During this flip to the new version we check the health of the stable RS in UpdateHash, but there is a specific case where it returns no error yet also doesn't do what we assumed it did: the DestinationRule is left untouched.
Because no error surfaces, we continue in trafficrouting and start the scale down with reconcileOtherReplicaSets(), as sketched below.
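To make the failure mode concrete, here is a minimal Go sketch of the control flow as I understand it. The types and function bodies are simplified stand-ins, not the actual argo-rollouts code:

```go
package main

import "fmt"

// Simplified stand-in for the real ReplicaSet type.
type ReplicaSet struct {
	Name      string
	Replicas  int32
	Available int32
}

func (rs ReplicaSet) FullyAvailable() bool { return rs.Available >= rs.Replicas }

// updateHash mirrors the problematic shape: when stable is not fully
// available it logs "delaying destination rule switch" and returns nil,
// so the DestinationRule still points at the old canary subset.
func updateHash(stable ReplicaSet, canaryHash string, destRuleCanary *string) error {
	if !stable.FullyAvailable() {
		fmt.Printf("delaying destination rule switch: ReplicaSet %s not fully available\n", stable.Name)
		return nil // no error, but also no update
	}
	*destRuleCanary = canaryHash
	return nil
}

func main() {
	stable := ReplicaSet{Name: "xxx-6c89fd9477", Replicas: 10, Available: 7} // pods being deleted
	destRuleCanary := "7779987b4d"                                           // old canary subset

	// The caller treats a nil error as "hash updated" and proceeds to scale
	// down the old ReplicaSets (reconcileOtherReplicaSets in the real code),
	// even though the DestinationRule was never rewritten.
	if err := updateHash(stable, "6595cc886", &destRuleCanary); err == nil {
		fmt.Printf("scaling down old RS; DestinationRule canary subset is still: %s\n", destRuleCanary)
	}
}
```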
The net effect is a situation where:
- DestinationRule is still pointing to old-canary
- old-canary scale down is triggered
- new-canary is healthy but not getting traffic
- once the scale down completes, clients get 503/UH because no endpoints are left
This does eventually resolve itself, e.g. once the stable RS recovers.
To Reproduce
- Trigger a deployment with Istio and Canary mode. Use subset for traffic routing
- Let it run to a phase where the rollout is midway (e.g. a 70/30 split)
- Continuously cause the stable RS to be not fully available, e.g. by deleting its pods randomly in a loop (see the sketch after these steps)
- Trigger a new deployment
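For the pod-deletion step, a client-go loop like the following works. This is a sketch: the namespace, kubeconfig location, and the rollouts-pod-template-hash label value are assumptions to adjust for your cluster.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumed values: adjust the namespace and label selector for your Rollout.
	const namespace = "default"
	const stableSelector = "rollouts-pod-template-hash=6c89fd9477"

	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	for {
		// List the stable RS pods and delete one at random so the RS
		// never reaches full availability.
		pods, err := client.CoreV1().Pods(namespace).List(context.TODO(),
			metav1.ListOptions{LabelSelector: stableSelector})
		if err != nil || len(pods.Items) == 0 {
			time.Sleep(2 * time.Second)
			continue
		}
		victim := pods.Items[rand.Intn(len(pods.Items))].Name
		fmt.Println("deleting", victim)
		_ = client.CoreV1().Pods(namespace).Delete(context.TODO(), victim, metav1.DeleteOptions{})
		time.Sleep(2 * time.Second)
	}
}
```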
Running these steps should produce logs like:
New weights: &TrafficWeights{Canary:WeightDestination{Weight:5,ServiceName:,PodTemplateHash:<new-canary-rs>,},Stable:WeightDestination{Weight:95,ServiceName:,PodTemplateHash:<stable-rs>,},Additional:[]WeightDestination{},Verified:nil,}
Previous weights: &TrafficWeights{Canary:WeightDestination{Weight:70,ServiceName:,PodTemplateHash:<old-canary-rs>,},Stable:WeightDestination{Weight:30,ServiceName:,PodTemplateHash:<stable-rs>,},Additional:[]WeightDestination{},Verified:nil,}
delaying destination rule switch: ReplicaSet <stable-rs> not fully available
The key point is that we do not see:
DestinationRule <rule> subset updated (canary: <new-rs>, stable: <stable-rs>)
Expected behavior
There are two routes we could take (both sketched after this list):
- Delay the whole process until stable is fully healthy, e.g. return an error from updateHash(). This would essentially revert fix: abort scenario where canary/stable service is not provided #4299
- Ignore stable not being fully up and continue in cases where a stable + canary pair already exists
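A minimal sketch of what the two options could look like. Again, the types and names are simplified stand-ins for illustration, not the actual argo-rollouts code:

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-in for the real ReplicaSet type.
type ReplicaSet struct {
	Name           string
	FullyAvailable bool
}

// Option 1: surface the problem instead of silently skipping the update.
// The reconcile loop would then retry rather than scale anything down.
func updateHashStrict(stable ReplicaSet) error {
	if !stable.FullyAvailable {
		return fmt.Errorf("delaying destination rule switch: ReplicaSet %s not fully available", stable.Name)
	}
	// ... update the DestinationRule subsets here ...
	return nil
}

// Option 2: only require full availability on the first switch; if a
// stable+canary pair already exists, proceed with the subset update anyway.
func updateHashLenient(stable ReplicaSet, canaryAlreadyExists bool) error {
	if !stable.FullyAvailable && !canaryAlreadyExists {
		return errors.New("stable not available and no existing canary subset")
	}
	// ... update the DestinationRule subsets here ...
	return nil
}

func main() {
	stable := ReplicaSet{Name: "xxx-6c89fd9477", FullyAvailable: false}
	fmt.Println("strict: ", updateHashStrict(stable))        // errors, reconcile retries
	fmt.Println("lenient:", updateHashLenient(stable, true)) // proceeds with the update
}
```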
Screenshots
6c89fd is stable, 777998 is old-canary, 6595cc is new-canary
Version
v1.8.3 of argo-rollouts
Logs
I've had to cut out some logs to reduce noise, but this is the key sequence: even though the old RS is noted as 'still referenced' during the original run, on the next sync the Rollout is updated to point to the new canary even though we haven't updated the DestinationRule subset.
2025-07-31T16:47:58Z: New weights: &TrafficWeights{Canary:WeightDestination{Weight:5,ServiceName:,PodTemplateHash:6595cc886,},Stable:WeightDestination{Weight:95,ServiceName:,PodTemplateHash:6c89fd9477,},Additional:[]WeightDestination{},Verified:nil,}
2025-07-31T16:47:58Z: Skip scale down of older RS 'xxx-7779987b4d': still referenced
2025-07-31T16:47:58Z: Rollout step 1/9 completed (setWeight: 5)
2025-07-31T16:47:58Z: Reconciliation completed
2025-07-31T16:47:58Z: Started syncing rollout
2025-07-31T16:47:58Z: Reconciling 1 old ReplicaSets (total pods: n)
2025-07-31T16:47:58Z: scaling down intermediate RS 'xxx-7779987b4d'
2025-07-31T16:47:58Z: delaying destination rule switch: ReplicaSet xxx-6c89fd9477 not fully available
2025-07-31T16:48:32Z: DestinationRule wealthsimple subset updated (canary: 6595cc886, stable: 6c89fd9477)
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.