From e564e3d8927e39d15dfb74674e4f06be4533110c Mon Sep 17 00:00:00 2001
From: walteck
Date: Wed, 15 Mar 2023 11:47:08 +0000
Subject: [PATCH 01/10] adding details around deployment strategies

---
 patterns/deployment.md | 110 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 110 insertions(+)
 create mode 100644 patterns/deployment.md

diff --git a/patterns/deployment.md b/patterns/deployment.md
new file mode 100644
index 00000000..23b5d02e
--- /dev/null
+++ b/patterns/deployment.md
@@ -0,0 +1,110 @@
+# Deployment Strategies
+
+When developing services and applications, development teams must consider their approach to deploying new releases.
+
+This is a fundamental step in the design of the solution: development teams must consider the technical impact of different strategies alongside the impact to service during deployments. This starts at the point of building the CI pipelines and permeates through to final live deployment pipelines. Different approaches may be applicable depending on the chosen platform, for example serverless deployments as opposed to VM-based deployments.
+
+Deployments must be repeatable and [idempotent](../practices/continuous-integration.md#deploy-what-you-tested), so that deploying the same version twice will result in the same deployed environment.
+
+## CI/CD pipeline-based deployment
+
+As part of the development process, a [CI/CD pipeline](../practices/continuous-integration.md) will be created. This pipeline will be designed to run unit, integration and other tests, along with formatting/linting, code quality, security and accessibility checks, all of which must pass before merging into the main branch.
+
+### Immutable signed deployment artefacts
+
+Following a merge into the main branch development teams must ensure that all appropriate tests are performed, once all tests have passed a [deployment artefact](../practices/continuous-integration.md#deploy-what-you-tested) should be created if applicable. This means that there is a fully tested, known-good, signed deployment artefact that can be used to confidently deploy to other environments. These artefacts should be immutable, tagged, signed and stored in a location where they are accessible to all environments (e.g., GitHub or an s3 bucket).
+
+### Promotion through path-to-live environments
+
+These artefacts should be used to deploy to environments on your path-to-live, and then finally to production/live. The development team should think about what environments are required; some example environments are:
+
+1. Dev environments
+1. Integration test environments
+1. User Acceptance Test environments
+1. Non functional / pre-production / live-like environment
+1. Production environment
+
+The pre-production environment must be used to test deployments to ensure that the new release will go smoothly into the production environment.
+
+Other things worth considering are whether the team requires a different testing environment for any manual/accessibility testing (but think how this can be automated in the CI/CD pipeline). Does the project require supplier testing against the latest version before it is promoted to production, and is a separate environment required for this?
+
+Finally, development teams must ensure that no lower environment (e.g., dev, test) has access to a higher environment (e.g., preprod/production). It is an anti-pattern to push artefacts up to a higher environment from a lower one, or to run a deployment runner instance in dev that deploys to production. Development teams must always use a pull model from their higher-level environments.
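+As a minimal illustration of that pull model, here is a sketch that assumes artefacts are published to an S3 bucket with a SHA-256 checksum object stored alongside them; the bucket, key layout and function name are illustrative rather than a prescribed implementation:
+
+```python
+import hashlib
+
+import boto3
+
+
+def pull_release_artefact(bucket: str, version: str, dest: str) -> str:
+    """Fetch and verify an immutable, versioned artefact from the artefact store.
+
+    The deploying environment initiates the download with read-only
+    credentials; the store never pushes into (or even needs to know
+    about) the target environment.
+    """
+    s3 = boto3.client("s3")
+    artefact_key = f"releases/{version}/service.tar.gz"
+    checksum_key = f"releases/{version}/service.tar.gz.sha256"
+
+    s3.download_file(bucket, artefact_key, dest)
+
+    # Verify against the published checksum so a tampered or corrupted
+    # artefact is never deployed. Verifying a full signature (e.g. on a
+    # signed tag or image) strengthens this further.
+    expected = s3.get_object(Bucket=bucket, Key=checksum_key)["Body"].read().decode().split()[0]
+    with open(dest, "rb") as f:
+        actual = hashlib.sha256(f.read()).hexdigest()
+    if actual != expected:
+        raise RuntimeError(f"Checksum mismatch for {artefact_key}: refusing to deploy")
+    return dest
+```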
+
+## Continuous deployment vs approval gateways
+
+The holy grail of CI/CD is continuous deployment, every Pull Request completed in dev gets promoted through environments and tests automatically until it is automatically deployed to production.
+
+However this presents challenges due to clinical, security, service and other approval requirements, and the release windows assigned to specific products. In these situations development teams should consider Continuous Delivery, and automatically build artefacts that are ready to deploy to production when the RFC for deployment is approved. As confidence grows teams should look sto migrate to more frequent smaller deployments in discussion with their Live service teams.
+
+Further, teams must implement approval gateways in their CD pipeline, and assess building of integration with service management tools to automate releases to production during the release window once approved.
+
+## Manual deployments
+
+Manual deployments should be avoided. If they are required then development teams should minimise the required access rights the deploying person requires. This person should not need full admin rights to execute the process.
+
+For example, one possible way of starting manual deployments in AWS (e.g., rolling back to a specific version) would be to have a user who only has write access to one SSM parameter store value, which is the version to install. Amazon EventBridge can monitor a change to this variable, and automatically trigger an AWS CodeBuild pipeline that uses a more privileged service role to perform the installation.
+
+## Zero-downtime deployments
+
+Development teams must build systems and deployment patterns in such a way as to achieve zero downtime deployments, where this is not required, due to service levels and complexity teams should highlight and ensure the decision has been ratified through a Key Architectural Decision document being presented at appropriate boards.
+
+To achieve zero downtime deployments the implemented deployment strategy must be “Blue-Green”.
+
+Blue-green deployments are a technique for safely rolling out updates to a software application. It involves maintaining two identical production environments and ensuring that traffic is routed to only one of these environments. At deployment time, one of the environments is “Live” and receiving all production traffic, while the other environment is idle. At deployment time the release is tested and deployed to the idle environment (known as the green environment) (some teams may elect to deploy to the live environment keeping the idle environment available as a fall back). Once initial smoke tests have been completed the production traffic can be routed to the green environment with the blue environment becoming idle. The new release should be allowed to soak for a defined period, if any issues arise with the release, it can be quickly rolled back by routing traffic away from the green environment back to the idle blue environment.
+
+After a suitable period, the blue environment can then be uplifted to the latest release, or alternatively stood down until the next release is ready.
+
+Utilising this mechanism means that teams can rapidly fail away from an issue associated with the release onto a known good version of the service.
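+One possible implementation of the traffic switch, sketched below, assumes an AWS Application Load Balancer with one target group per leg; the ARNs are placeholders, and other mechanisms (weighted DNS, service mesh routing) work equally well:
+
+```python
+import boto3
+
+elbv2 = boto3.client("elbv2")
+
+
+def route_traffic(listener_arn: str, green_tg_arn: str, blue_tg_arn: str, green_weight: int) -> None:
+    """Shift production traffic between the blue and green legs.
+
+    green_weight=100 completes the cut-over; green_weight=0 is the
+    instant roll-back to the known good blue leg.
+    """
+    elbv2.modify_listener(
+        ListenerArn=listener_arn,
+        DefaultActions=[{
+            "Type": "forward",
+            "ForwardConfig": {
+                "TargetGroups": [
+                    {"TargetGroupArn": green_tg_arn, "Weight": green_weight},
+                    {"TargetGroupArn": blue_tg_arn, "Weight": 100 - green_weight},
+                ]
+            },
+        }],
+    )
+```
+
+Because both legs stay registered, rolling back is a single fast routing change rather than a redeployment.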
+
+## Roll-back strategies
+
+Engineering teams should consider the potential impact to their service of a failed deployment and the appropriate mechanisms to ensure they are able to revert a deployment quickly and easily. This failed deployment mechanism should ideally be tested regularly as part of the test lifecycle. Ideally this mechanism would be to revert the service to the leg that has not been deployed to. In some cases, teams may decide that it is not possible to achieve a roll back; in these cases the deployment must be flagged with Service management and Engineering management as a “High Risk” deployment. Teams should expect to justify why it is not possible to provide a rollback option and should consider alternative risk mitigation activities.
+
+## Example deployment pipeline stages
+
+The following table includes steps that development teams should consider when planning their deployment strategy. It assumes a CI/CD pipeline deploying each merge to main into production; development teams would need to review these steps against their specific needs and governance cycles.
+
+| Step | Description | Actor | Stage |
+| :---: | --- | --- | --- |
+| 1 | Developer makes changes to their branch. | Developer | Development |
+| 2 | Changes are committed to their remote branch. | Developer | Development |
+| 3 | Majority of tests run (including linting, security). Ideally all tests if acceptable runtime. | CI tooling | CI |
+| 4 | Peer Review. | Development team | Development |
+| 5 | Merge approved. | Development team | Development |
+| 6 | Committed to main branch. | Development team | Development |
+| 7 | All tests run (including linting, security, code quality, accessibility, etc) | CI tooling | CI |
+| 8 | Artefacts built. | CI tooling | CI |
+| 9 | Deployment of artefacts on live-like preprod environment | CI tooling | CI |
+| 10 | New preprod leg built / idle leg released to. | CI tooling | CI |
+| 11 | Initial smoke tests run | CI tooling | CI |
+| 12 | Switch traffic to new preprod leg. | CI tooling | CI |
+| a | Run required pre-release performance tests / smoke tests / soak tests. | CI tooling | CI |
+| b | Monitoring | CI tooling | CI |
+| c | Testing period complete. | CI tooling | CI |
+| 13 | Confirmation that there are no issues, ready to deploy to production. | CI tooling | CI |
+| 14 | Ensure RFCs have been approved | CI tooling | CI |
+| 15 | Confirmation that all traffic is currently being processed by one production leg. | CI tooling | CI |
+| 16 | New production leg built / idle leg released to. | CI tooling | CI |
+| 17 | Initial smoke tests run. | CI tooling | CI |
+| 18 | Traffic is migrated to new leg. | CI tooling | CI |
+| a | Monitoring occurs. | CI tooling | CI |
+| b | Further smoke tests run in parallel with monitoring. | CI tooling | CI |
+| c | Soak period started. | CI tooling | CI |
+| d | Monitoring continues. | CI tooling | CI |
+| e | Soak period complete. | CI tooling | CI |
+| 19 | Release is deployed to second production leg. | CI tooling | CI |
+| 20 | Initial smoke tests run. | CI tooling | CI |
+| 21 | Traffic is migrated to second production leg. | CI tooling | CI |
+| a | Monitoring occurs. | CI tooling | CI |
+| b | Further smoke tests run in parallel with monitoring. | CI tooling | CI |
+| c | Soak period started. | CI tooling | CI |
+| d | Monitoring continues. | CI tooling | CI |
+| e | Soak period complete. | CI tooling | CI |
+| 22 | Release is marked as successful. Update RFC | CI tooling | CI |
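+The confirmation and RFC gates (steps 13 and 14) can be automated rather than manual. The sketch below shows the shape this can take; the callables are hypothetical integration points with monitoring and a service management tool, not real APIs:
+
+```python
+from typing import Callable
+
+
+def promote_to_production(
+    version: str,
+    soak_is_clean: Callable[[str], bool],     # hypothetical monitoring check
+    rfc_is_approved: Callable[[str], bool],   # hypothetical service-management lookup
+    deploy: Callable[[str, str], None],       # hypothetical deployment step
+) -> None:
+    """Refuse to deploy unless every gate from the table above has passed."""
+    if not soak_is_clean(version):
+        raise RuntimeError(f"{version}: pre-production soak/monitoring was not clean")
+    if not rfc_is_approved(version):
+        raise RuntimeError(f"{version}: RFC not approved, refusing to deploy")
+    deploy("production", version)
+```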
+
+### Failure mode.
+
+If initial smoke tests fail OR monitoring identifies increased failures / another indicator OR further smoke tests fail:
+1. Traffic is migrated back to previously healthy leg.
+1. Release is marked as failed.

From 5595047d2bff916eec3f2a23456f379e1f8e0df6 Mon Sep 17 00:00:00 2001
From: walteck
Date: Wed, 15 Mar 2023 12:15:43 +0000
Subject: [PATCH 02/10] resolving linting issues

---
 patterns/deployment.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/patterns/deployment.md b/patterns/deployment.md
index 23b5d02e..2f3879e5 100644
--- a/patterns/deployment.md
+++ b/patterns/deployment.md
@@ -12,7 +12,7 @@ As part of the development process, a [CI/CD pipeline](../practices/continuous-i
 
 ### Immutable signed deployment artefacts
 
-Following a merge into the main branch development teams must ensure that all appropriate tests are performed, once all tests have passed a [deployment artefact](../practices/continuous-integration.md#deploy-what-you-tested) should be created if applicable. This means that there is a fully tested, known-good, signed deployment artefact that can be used to confidently deploy to other environments. These artefacts should be immutable, tagged, signed and stored in a location where they are accessible to all environments (e.g., GitHub or an s3 bucket). 
+Following a merge into the main branch development teams must ensure that all appropriate tests are performed, once all tests have passed a [deployment artefact](../practices/continuous-integration.md#deploy-what-you-tested) should be created if applicable. This means that there is a fully tested, known-good, signed deployment artefact that can be used to confidently deploy to other environments. These artefacts should be immutable, tagged, signed and stored in a location where they are accessible to all environments (e.g., GitHub or an s3 bucket).
 
 ### Promotion through path-to-live environments
 
@@ -32,7 +32,7 @@ Finally, development teams must ensure that no lower environment (e.g., dev, tes
 
 ## Continuous deployment vs approval gateways
 
-The holy grail of CI/CD is continuous deployment, every Pull Request completed in dev gets promoted through environments and tests automatically until it is automatically deployed to production. 
+The holy grail of CI/CD is continuous deployment, every Pull Request completed in dev gets promoted through environments and tests automatically until it is automatically deployed to production.
 
 However this presents challenges due to clinical, security, service and other approval requirements, and the release windows assigned to specific products. In these situations development teams should consider Continuous Delivery, and automatically build artefacts that are ready to deploy to production when the RFC for deployment is approved. As confidence grows teams should look sto migrate to more frequent smaller deployments in discussion with their Live service teams.
 
@@ -102,9 +102,9 @@ The following table includes steps that development teams should consider when p
 | e | Soak period complete. | CI tooling | CI |
 | 22 | Release is marked as successful. Update RFC | CI tooling | CI |
 
-### Failure mode.
+### Failure mode
 
 If initial smoke tests fail OR monitoring identifies increased failures / another indicator OR further smoke tests fail:
+
 1. Traffic is migrated back to previously healthy leg.
 1. Release is marked as failed.
-
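The soak-and-monitor logic behind this failure mode can be scripted. The sketch below is illustrative only: it assumes the legs sit behind an AWS Application Load Balancer publishing the standard `HTTPCode_Target_5XX_Count` CloudWatch metric, and the threshold and soak length are placeholders to tune per service.

```python
import time
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

ERROR_BUDGET_PER_MINUTE = 5  # illustrative threshold; tune per service


def soak_is_healthy(load_balancer: str, minutes: int = 30) -> bool:
    """Watch the new leg's 5XX count for the soak period.

    Returns False as soon as the error budget is breached, so the
    pipeline can migrate traffic back to the previously healthy leg
    and mark the release as failed.
    """
    for _ in range(minutes):
        now = datetime.now(timezone.utc)
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/ApplicationELB",
            MetricName="HTTPCode_Target_5XX_Count",
            Dimensions=[{"Name": "LoadBalancer", "Value": load_balancer}],
            StartTime=now - timedelta(minutes=1),
            EndTime=now,
            Period=60,
            Statistics=["Sum"],
        )
        errors = sum(point["Sum"] for point in stats["Datapoints"])
        if errors > ERROR_BUDGET_PER_MINUTE:
            return False  # caller routes traffic back and fails the release
        time.sleep(60)
    return True
```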
From 22695a896585af318f1e0b5878a62c83235d056d Mon Sep 17 00:00:00 2001
From: walteck
Date: Wed, 15 Mar 2023 17:25:18 +0000
Subject: [PATCH 03/10] updates following review

---
 patterns/deployment.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/patterns/deployment.md b/patterns/deployment.md
index 2f3879e5..d466b4ee 100644
--- a/patterns/deployment.md
+++ b/patterns/deployment.md
@@ -12,7 +12,7 @@ As part of the development process, a [CI/CD pipeline](../practices/continuous-i
 
 ### Immutable signed deployment artefacts
 
-Following a merge into the main branch development teams must ensure that all appropriate tests are performed, once all tests have passed a [deployment artefact](../practices/continuous-integration.md#deploy-what-you-tested) should be created if applicable. This means that there is a fully tested, known-good, signed deployment artefact that can be used to confidently deploy to other environments. These artefacts should be immutable, tagged, signed and stored in a location where they are accessible to all environments (e.g., GitHub or an s3 bucket).
+Following a merge from your feature branch development teams must ensure that all appropriate tests are performed, once all tests have passed a [deployment artefact](../practices/continuous-integration.md#deploy-what-you-tested) should be created if applicable. This means that there is a fully tested, known-good, signed deployment artefact that can be used to confidently deploy to other environments. These artefacts should be immutable, tagged, signed and stored in a location where they are accessible to all environments (e.g., GitHub or an s3 bucket).
 
 ### Promotion through path-to-live environments
 
@@ -24,19 +24,19 @@ These artefacts should be used to deploy to environments on your path-to-live, a
 1. Non functional / pre-production / live-like environment
 1. Production environment
 
-The pre-production environment must be used to test deployments to ensure that the new release will go smoothly into the production environment.
+Manual deployments should be restricted whereever possible, even in development environments automation of deployments and the ability to rapidly spin up and down environments is key, manua deployments can lead to infrastructure being left running and potential issues with deployments that are costly to identify and resolve. The pre-production environment must be used to test deployments to ensure that the new release will go smoothly into the production environment.
 
 Other things worth considering are whether the team requires a different testing environment for any manual/accessibility testing (but think how this can be automated in the CI/CD pipeline). Does the project require supplier testing against the latest version before it is promoted to production, and is a separate environment required for this?
 
-Finally, development teams must ensure that no lower environment (e.g., dev, test) has access to a higher environment (e.g., preprod/production).
It is an anti-pattern to push artefacts up to a higher environment from a lower one, or to run a deployment runner instance in dev that deploys to production. Development teams must always use a pull model from their higher-level environments. Development teams should adopt a "management" account to perform deployment / orchestration activities to prevent cross environment communication / interactions. ## Continuous deployment vs approval gateways -The holy grail of CI/CD is continuous deployment, every Pull Request completed in dev gets promoted through environments and tests automatically until it is automatically deployed to production. +The target approach of CI/CD is continuous deployment, every Pull Request completed in dev gets promoted through environments and tests automatically until it is automatically deployed to production. However this presents challenges due to clinical, security, service and other approval requirements, and the release windows assigned to specific products. In these situations development teams should consider Continuous Delivery, and automatically build artefacts that are ready to deploy to production when the RFC for deployment is approved. As confidence grows teams should look sto migrate to more frequent smaller deployments in discussion with their Live service teams. -Further, teams must implement approval gateways in their CD pipeline, and assess building of integration with service management tools to automate releases to production during the release window once approved. +Further, teams must implement approval gateways in their CD pipeline, and assess building of integration with service management tools to automate releases to production during the release window once approved. These approval gateways may just include code review depending on the context and risks for the specific service. ## Manual deployments @@ -46,11 +46,11 @@ For example, one possible way of starting manual deployments in AWS (e.g., rolli ## Zero-downtime deployments -Development teams must build systems and deployment patterns in such a way as to achieve zero downtime deployments, where this is not required, due to service levels and complexity teams should highlight and ensure the decision has been ratified through a Key Architectural Decision document being presented at appropriate boards. +Development teams must build systems and deployment patterns in such a way as to achieve zero downtime deployments, where this is not required, due to service levels and complexity teams should highlight and ensure the decision has been ratified through a Key Architectural Decision document. To achieve zero downtime deployments the implemented deployment strategy must be “Blue-Green”. -Blue-green deployments are a technique for safely rolling out updates to a software application. It involves maintaining two identical production environments and ensuring that traffic is routed to only one of these environments. At deployment time, one of the environments is “Live” and receiving all production traffic, while the other environment is idle. At deployment time the release is tested and deployed to the idle environment (known as the green environment) (some teams may elect to deploy to the live environment keeping the idle environment available as a fall back). Once initial smoke tests have been completed the production traffic can be routed to the green environment with the blue environment becoming idle. 
The new release should be allowed to soak for a defined period, if any issues arise with the release, it can be quickly rolled back by routing traffic away from the green environment back to the idle blue environment. +Blue-green deployments are a technique for safely rolling out updates to a software application. It involves maintaining two identical production environments (or distinct components, e.g. by cylcing through nodes and deploying one at a time while the service continues to handle live load and bringing the new updated services into load in a controlled manner) and ensuring that traffic is routed to only one of these environments. At deployment time, one of the environments is “Live” and receiving all production traffic, while the other environment is idle. At deployment time the release is tested and deployed to the idle environment (known as the green environment) (some teams may elect to deploy to the live environment keeping the idle environment available as a fall back). Once initial smoke tests have been completed the production traffic can be routed to the green environment with the blue environment becoming idle. The new release should be allowed to soak for a defined period, if any issues arise with the release, it can be quickly rolled back by routing traffic away from the green environment back to the idle blue environment. Development teams should look to run automated tests, through their pipeline, following the deploy to build confidence that the deployment was successful. After a suitable period, the blue environment can then be uplifted to the latest release, or alternatively stood down until the next release is ready. From ca33be7bc21e4e2eb2aa8f9b080e57073f10c246 Mon Sep 17 00:00:00 2001 From: walteck Date: Mon, 20 Mar 2023 13:19:42 +0000 Subject: [PATCH 04/10] Further updates expanding on a number of areas --- patterns/deployment.md | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/patterns/deployment.md b/patterns/deployment.md index d466b4ee..012f6352 100644 --- a/patterns/deployment.md +++ b/patterns/deployment.md @@ -24,9 +24,9 @@ These artefacts should be used to deploy to environments on your path-to-live, a 1. Non functional / pre-production / live-like environment 1. Production environment -Manual deployments should be restricted whereever possible, even in development environments automation of deployments and the ability to rapidly spin up and down environments is key, manua deployments can lead to infrastructure being left running and potential issues with deployments that are costly to identify and resolve. The pre-production environment must be used to test deployments to ensure that the new release will go smoothly into the production environment. +Manual deployments should be restricted whereever possible, even in development environments automation of deployments and the ability to rapidly spin up and down environments is key, manual deployments can lead to infrastructure being left running and potential issues with deployments that are costly to identify and resolve. The pre-production environment must be used to test deployments to ensure that the new release will go smoothly into the production environment. -Other things worth considering are whether the team requires a different testing environment for any manual/accessibility testing (but think how this can be automated in the CI/CD pipeline). 
Does the project require supplier testing against the latest version before it is promoted to production, and is a separate environment required for this? +Other things worth considering are whether the team requires a different testing environment for any manual/accessibility testing (but think how this can be automated in the CI/CD pipeline). Does the project require supplier testing against the latest version before it is promoted to production, and is a separate environment required for this? Development teams should also consider the implications on their spend of multiple environments, particularly where there is a high reliance on Cloud-based testing environments in large Programmes. Teams should implement controls to shut down environments and should give particular thought to the frequency and duration of ephemeral environments. Finally, development teams must ensure that no lower environment (e.g., dev, test) has access to a higher environment (e.g., preprod/production). It is an anti-pattern to push artefacts up to a higher environment from a lower one, or to run a deployment runner instance in dev that deploys to production. Development teams must always use a pull model from their higher-level environments. Development teams should adopt a "management" account to perform deployment / orchestration activities to prevent cross environment communication / interactions. @@ -34,7 +34,7 @@ Finally, development teams must ensure that no lower environment (e.g., dev, tes The target approach of CI/CD is continuous deployment, every Pull Request completed in dev gets promoted through environments and tests automatically until it is automatically deployed to production. -However this presents challenges due to clinical, security, service and other approval requirements, and the release windows assigned to specific products. In these situations development teams should consider Continuous Delivery, and automatically build artefacts that are ready to deploy to production when the RFC for deployment is approved. As confidence grows teams should look sto migrate to more frequent smaller deployments in discussion with their Live service teams. +However this presents challenges due to clinical, security, service and other approval requirements, and the release windows assigned to specific products. In these situations development teams should consider Continuous Delivery, and automatically build artefacts that are ready to deploy to production when the RFC for deployment is approved. As confidence grows teams should look to migrate to more frequent smaller deployments in discussion with their Live service teams. Further, teams must implement approval gateways in their CD pipeline, and assess building of integration with service management tools to automate releases to production during the release window once approved. These approval gateways may just include code review depending on the context and risks for the specific service. @@ -60,6 +60,8 @@ Utilising this mechanism means that teams can rapidly fail away from an issue as Engineering teams should consider the potential impact to their service of a failed deployment and the appropriate mechanisms to ensure they are able to revert a deployment quickly and easily. This failed deployment mechanism should ideally be tested regularly as part of the test lifecycle. Ideally this mechanism would be to revert the service to the leg that has not been deployed to. 
In some cases, teams may decide that it is not possible to achieve a roll back; in these cases the deployment must be flagged with Service management and Engineering management as a “High Risk” deployment. Teams should expect to justify why it is not possible to provide a rollback option and should consider alternative risk mitigation activities.
+
+Particular care should be taken around deployments for serverless architectures, with thought given to idempotent data workflows and clear version control strategies. Other strategies to reduce the risk in such situations are more frequent and smaller changes.
+
 ## Example deployment pipeline stages
 
 The following table includes steps that development teams should consider when planning their deployment strategy. It assumes a CI/CD pipeline deploying each merge to main into production; development teams would need to review these steps against their specific needs and governance cycles.
@@ -108,3 +110,7 @@ If initial smoke tests fail OR monitoring identifies increased failures / anoth
 1. Traffic is migrated back to previously healthy leg.
 1. Release is marked as failed.
+
+### Game days and chaos testing
+
+Development teams should look at implementing regular game days to cover how their system and their teams handle failures. These game days should consider the release and deployment process, as this is a key window for issues to arise and one that can complicate investigations due to the impact of the change being deployed. Teams should also explore and implement [Chaos Engineering](../practices/testing.md#other-tools-to-consider) techniques.
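+A game day needs a controlled way to inject the failures being rehearsed. Below is a deliberately simple sketch; the decorator and the wrapped function are hypothetical, and anything like this belongs in test environments only:
+
+```python
+import random
+from functools import wraps
+
+
+def chaos(failure_rate: float):
+    """Wrap a dependency call so a game day can inject failures on demand."""
+    def decorator(fn):
+        @wraps(fn)
+        def wrapper(*args, **kwargs):
+            if random.random() < failure_rate:
+                raise ConnectionError(f"chaos: injected failure in {fn.__name__}")
+            return fn(*args, **kwargs)
+        return wrapper
+    return decorator
+
+
+@chaos(failure_rate=0.2)  # fail roughly one call in five during the exercise
+def call_downstream_service(request_id: str) -> dict:
+    ...  # the real downstream integration in the actual service
+```
+
+Managed fault-injection services and the chaos engineering tools referenced above offer the same idea at infrastructure level.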
From fc3125e035d9a9b8631a0425d103d5133545bf95 Mon Sep 17 00:00:00 2001
From: walteck
Date: Mon, 20 Mar 2023 13:27:12 +0000
Subject: [PATCH 05/10] resolving markdown format issue - lint

---
 patterns/deployment.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/patterns/deployment.md b/patterns/deployment.md
index 012f6352..51c43645 100644
--- a/patterns/deployment.md
+++ b/patterns/deployment.md
@@ -26,7 +26,7 @@ These artefacts should be used to deploy to environments on your path-to-live, a
 
 Manual deployments should be restricted whereever possible, even in development environments automation of deployments and the ability to rapidly spin up and down environments is key, manual deployments can lead to infrastructure being left running and potential issues with deployments that are costly to identify and resolve. The pre-production environment must be used to test deployments to ensure that the new release will go smoothly into the production environment.
 
-Other things worth considering are whether the team requires a different testing environment for any manual/accessibility testing (but think how this can be automated in the CI/CD pipeline). Does the project require supplier testing against the latest version before it is promoted to production, and is a separate environment required for this? Development teams should also consider the implications on their spend of multiple environments, particularly where there is a high reliance on Cloud-based testing environments in large Programmes. Teams should implement controls to shut down environments and should give particular thought to the frequency and duration of ephemeral environments. 
+Other things worth considering are whether the team requires a different testing environment for any manual/accessibility testing (but think how this can be automated in the CI/CD pipeline). Does the project require supplier testing against the latest version before it is promoted to production, and is a separate environment required for this? Development teams should also consider the implications on their spend of multiple environments, particularly where there is a high reliance on Cloud-based testing environments in large Programmes. Teams should implement controls to shut down environments and should give particular thought to the frequency and duration of ephemeral environments.
 
 Finally, development teams must ensure that no lower environment (e.g., dev, test) has access to a higher environment (e.g., preprod/production). It is an anti-pattern to push artefacts up to a higher environment from a lower one, or to run a deployment runner instance in dev that deploys to production. Development teams must always use a pull model from their higher-level environments. Development teams should adopt a "management" account to perform deployment / orchestration activities to prevent cross environment communication / interactions.

From 93a118360159a6fb60ec1cd2c8c19c379b506725 Mon Sep 17 00:00:00 2001
From: walteck
Date: Wed, 29 Mar 2023 13:14:38 +0100
Subject: [PATCH 06/10] update guidance on use of management accounts

---
 patterns/deployment.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/patterns/deployment.md b/patterns/deployment.md
index 51c43645..e41e7c88 100644
--- a/patterns/deployment.md
+++ b/patterns/deployment.md
@@ -28,7 +28,9 @@ Manual deployments should be restricted whereever possible, even in development
 
 Other things worth considering are whether the team requires a different testing environment for any manual/accessibility testing (but think how this can be automated in the CI/CD pipeline). Does the project require supplier testing against the latest version before it is promoted to production, and is a separate environment required for this? Development teams should also consider the implications on their spend of multiple environments, particularly where there is a high reliance on Cloud-based testing environments in large Programmes. Teams should implement controls to shut down environments and should give particular thought to the frequency and duration of ephemeral environments.
 
-Finally, development teams must ensure that no lower environment (e.g., dev, test) has access to a higher environment (e.g., preprod/production). It is an anti-pattern to push artefacts up to a higher environment from a lower one, or to run a deployment runner instance in dev that deploys to production. Development teams must always use a pull model from their higher-level environments. Development teams should adopt a "management" account to perform deployment / orchestration activities to prevent cross environment communication / interactions.
+Finally, development teams must ensure that no lower environment (e.g., dev, test) has access to a higher environment (e.g., preprod/production). It is an anti-pattern to push artefacts up to a higher environment from a lower one, or to run a deployment runner instance in dev that deploys to production. Development teams must always use a pull model from their higher-level environments. In this scenario the build artefacts would get pushed to a repository, for example published to GitHub and each environment would pull this release down as part of their deployment process.
+ +Using a "management" account to perform deployment / orchestration activities should be avoided as this can lead to issues with users having too much permission in the management account to support the work they need to perform in a development environment and these permissions leaking into the access for the production environment. ## Continuous deployment vs approval gateways From 7202ad5c2d3f38a5f13a8bb05f7795ceeac977dd Mon Sep 17 00:00:00 2001 From: walteck Date: Wed, 29 Mar 2023 13:18:17 +0100 Subject: [PATCH 07/10] removed trailing white space --- patterns/deployment.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/patterns/deployment.md b/patterns/deployment.md index e41e7c88..57d15a2c 100644 --- a/patterns/deployment.md +++ b/patterns/deployment.md @@ -28,7 +28,7 @@ Manual deployments should be restricted whereever possible, even in development Other things worth considering are whether the team requires a different testing environment for any manual/accessibility testing (but think how this can be automated in the CI/CD pipeline). Does the project require supplier testing against the latest version before it is promoted to production, and is a separate environment required for this? Development teams should also consider the implications on their spend of multiple environments, particularly where there is a high reliance on Cloud-based testing environments in large Programmes. Teams should implement controls to shut down environments and should give particular thought to the frequency and duration of ephemeral environments. -Finally, development teams must ensure that no lower environment (e.g., dev, test) has access to a higher environment (e.g., preprod/production). It is an anti-pattern to push artefacts up to a higher environment from a lower one, or to run a deployment runner instance in dev that deploys to production. Development teams must always use a pull model from their higher-level environments. In this scenario the build artefacts would get pushed to a repository, for example published to GitHub and each environment would pull this release down as part of their deployment process. +Finally, development teams must ensure that no lower environment (e.g., dev, test) has access to a higher environment (e.g., preprod/production). It is an anti-pattern to push artefacts up to a higher environment from a lower one, or to run a deployment runner instance in dev that deploys to production. Development teams must always use a pull model from their higher-level environments. In this scenario the build artefacts would get pushed to a repository, for example published to GitHub and each environment would pull this release down as part of their deployment process. Using a "management" account to perform deployment / orchestration activities should be avoided as this can lead to issues with users having too much permission in the management account to support the work they need to perform in a development environment and these permissions leaking into the access for the production environment. 
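The separation of deployment permissions discussed above can be made concrete with one narrowly scoped role per environment. A sketch under stated assumptions: the account IDs and role names are illustrative, and the MFA enforcement assumes each role's trust policy demands it:

```python
import boto3


def deployment_session(environment: str, mfa_serial: str, mfa_code: str) -> boto3.Session:
    """Assume the per-environment deployment role with short-lived credentials.

    Each environment trusts only its own role, so credentials scoped to
    dev can never be replayed against pre-production or production.
    """
    role_arns = {
        "dev":        "arn:aws:iam::111111111111:role/deploy-dev",
        "preprod":    "arn:aws:iam::222222222222:role/deploy-preprod",
        "production": "arn:aws:iam::333333333333:role/deploy-production",
    }
    sts = boto3.client("sts")
    credentials = sts.assume_role(
        RoleArn=role_arns[environment],
        RoleSessionName=f"deploy-{environment}",
        SerialNumber=mfa_serial,   # MFA device required by the role's trust policy
        TokenCode=mfa_code,
        DurationSeconds=900,       # short-lived credentials only
    )["Credentials"]
    return boto3.Session(
        aws_access_key_id=credentials["AccessKeyId"],
        aws_secret_access_key=credentials["SecretAccessKey"],
        aws_session_token=credentials["SessionToken"],
    )
```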
From a8cc3c6d5cffc48b9602e6304e330837f1737b05 Mon Sep 17 00:00:00 2001
From: walteck
Date: Mon, 3 Apr 2023 08:49:28 +0100
Subject: [PATCH 08/10] linking to deployment from principles page

---
 patterns/deployment.md | 4 ++--
 principles.md          | 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/patterns/deployment.md b/patterns/deployment.md
index 57d15a2c..59bd489b 100644
--- a/patterns/deployment.md
+++ b/patterns/deployment.md
@@ -38,7 +38,7 @@ The target approach of CI/CD is continuous deployment, every Pull Request comple
 
 However this presents challenges due to clinical, security, service and other approval requirements, and the release windows assigned to specific products. In these situations development teams should consider Continuous Delivery, and automatically build artefacts that are ready to deploy to production when the RFC for deployment is approved. As confidence grows teams should look to migrate to more frequent smaller deployments in discussion with their Live service teams.
 
-Further, teams must implement approval gateways in their CD pipeline, and assess building of integration with service management tools to automate releases to production during the release window once approved. These approval gateways may just include code review depending on the context and risks for the specific service.
+Further, teams must implement approval gateways in their CD pipeline, and assess building integration with service management tools to automate releases to production during the release window once approved. These approval gateways may just include code review depending on the context and risks for the specific service. Teams should review the [Governance as a side effect](https://github.com/NHSDigital/software-engineering-quality-framework/blob/main/patterns/governance-side-effect.md) pattern and should aspire to bake as many of these checks and reviews into the delivery process as possible, as this will negate or minimise the need for downstream reviews.
 
 ## Manual deployments
 
@@ -52,7 +52,7 @@ Development teams must build systems and deployment patterns in such a way as to
 
 To achieve zero downtime deployments the implemented deployment strategy must be “Blue-Green”.
 
-Blue-green deployments are a technique for safely rolling out updates to a software application. It involves maintaining two identical production environments (or distinct components, e.g. by cylcing through nodes and deploying one at a time while the service continues to handle live load and bringing the new updated services into load in a controlled manner) and ensuring that traffic is routed to only one of these environments. At deployment time, one of the environments is “Live” and receiving all production traffic, while the other environment is idle. At deployment time the release is tested and deployed to the idle environment (known as the green environment) (some teams may elect to deploy to the live environment keeping the idle environment available as a fall back). Once initial smoke tests have been completed the production traffic can be routed to the green environment with the blue environment becoming idle. The new release should be allowed to soak for a defined period, if any issues arise with the release, it can be quickly rolled back by routing traffic away from the green environment back to the idle blue environment. Development teams should look to run automated tests, through their pipeline, following the deploy to build confidence that the deployment was successful.
+Blue-green deployments are a technique for safely rolling out updates to a software application. It involves maintaining two identical production environments (or distinct components, e.g. by cycling through nodes and deploying one at a time while the service continues to handle live load, bringing the newly updated services into load in a controlled manner) and ensuring that traffic is routed to only one of these environments. At deployment time, one of the environments is “Live” and receiving all production traffic, while the other environment is idle. The release is tested and deployed to the idle environment (known as the green environment); some teams may elect to deploy to the live environment instead, keeping the idle environment available as a fallback. Once initial smoke tests have been completed, the production traffic can be routed to the green environment, with the blue environment becoming idle. The new release should be allowed to soak for a defined period; if any issues arise with the release, it can be quickly rolled back by routing traffic away from the green environment back to the idle blue environment. Development teams should look to run automated tests, through their pipeline, following the deploy to build confidence that the deployment was successful.
 
 After a suitable period, the blue environment can then be uplifted to the latest release, or alternatively stood down until the next release is ready.

diff --git a/principles.md b/principles.md
index 6d24e586..efe28509 100644
--- a/principles.md
+++ b/principles.md
@@ -56,6 +56,8 @@ The following practices support the principle of building quality in.
 
 **Collaborative analysis and elaboration** with the right people involved to examine a change from all angles to ensure requirements are fully understood at the point the work will be done.
 
+**[Deploy safely.](patterns/deployment.md)** Use automation and repeatable processes to ensure that products can be deployed safely.
+
 **[Deliver incrementally.](patterns/little-and-often.md)** Establish build-measure-learn loops to keep the system simple and to ensure it meets evolving user needs.
 
 **Pair programming**. Avoid quality issues by combining the skills and experience of two developers instead of one. Take advantage of navigator and driver roles. Also consider cross-discipline (e.g. dev-test) pairing.

From bd544d8fd3b3e1a318f4e0a2b2c2b3d5e6f7a8b9 Mon Sep 17 00:00:00 2001
From: walteck
Date: Tue, 18 Apr 2023 21:55:12 +0100
Subject: [PATCH 09/10] further review updates

---
 patterns/deployment.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/patterns/deployment.md b/patterns/deployment.md
index 59bd489b..88cc6fb4 100644
--- a/patterns/deployment.md
+++ b/patterns/deployment.md
@@ -12,7 +12,7 @@ As part of the development process, a [CI/CD pipeline](../practices/continuous-i
 
 ### Immutable signed deployment artefacts
 
-Following a merge from your feature branch development teams must ensure that all appropriate tests are performed, once all tests have passed a [deployment artefact](../practices/continuous-integration.md#deploy-what-you-tested) should be created if applicable. This means that there is a fully tested, known-good, signed deployment artefact that can be used to confidently deploy to other environments. These artefacts should be immutable, tagged, signed and stored in a location where they are accessible to all environments (e.g., GitHub or an s3 bucket). 
+Following a merge from your feature branch, development teams must ensure that all appropriate tests are performed. Once all tests have passed, a [deployment artefact](../practices/continuous-integration.md#deploy-what-you-tested) should be created if applicable. This means that there is a fully tested, known-good, signed deployment artefact that can be used to confidently deploy to other environments. These artefacts should be immutable, tagged, signed and stored in a location where they are accessible to all environments (e.g., GitHub or an S3 bucket).
 
 ### Promotion through path-to-live environments
 
@@ -24,31 +24,31 @@ These artefacts should be used to deploy to environments on your path-to-live, a
 1. Non functional / pre-production / live-like environment
 1. Production environment
 
-Manual deployments should be restricted whereever possible, even in development environments automation of deployments and the ability to rapidly spin up and down environments is key, manual deployments can lead to infrastructure being left running and potential issues with deployments that are costly to identify and resolve. The pre-production environment must be used to test deployments to ensure that the new release will go smoothly into the production environment.
+Manual deployments should be restricted wherever possible. Even in development environments, automation of deployments and the ability to rapidly spin up and down environments is key: manual deployments can lead to infrastructure being left running, and to potential issues with deployments that are costly to identify and resolve. The pre-production environment must be used to test deployments to ensure that the new release will go smoothly into the production environment.
 
-Other things worth considering are whether the team requires a different testing environment for any manual/accessibility testing (but think how this can be automated in the CI/CD pipeline). Does the project require supplier testing against the latest version before it is promoted to production, and is a separate environment required for this? Development teams should also consider the implications on their spend of multiple environments, particularly where there is a high reliance on Cloud-based testing environments in large Programmes. Teams should implement controls to shut down environments and should give particular thought to the frequency and duration of ephemeral environments.
+Other things worth considering are whether the team requires a different testing environment for any manual/accessibility testing (but think how this can be automated in the CI/CD pipeline). Does the project require supplier testing against the latest version before it is promoted to production, and is a separate environment required for this? Development teams should also consider the cost implications of multiple environments, particularly where there is a high reliance on Cloud-based testing environments. Teams should implement controls to shut down environments and should give particular thought to the frequency and duration of ephemeral environments.
 
-Finally, development teams must ensure that no lower environment (e.g., dev, test) has access to a higher environment (e.g., preprod/production). It is an anti-pattern to push artefacts up to a higher environment from a lower one, or to run a deployment runner instance in dev that deploys to production. Development teams must always use a pull model from their higher-level environments. In this scenario the build artefacts would get pushed to a repository, for example published to GitHub and each environment would pull this release down as part of their deployment process.
+Finally, development teams should ensure that no lower environment (e.g., dev, test) has access to a higher environment (e.g., preprod/production). It is an anti-pattern to push artefacts up to a higher environment from a lower one, or to run a deployment runner instance in dev that deploys to production. Development teams should consider using a pull model from their higher-level environments. In this scenario the build artefacts would get pushed to a repository, for example published to GitHub, and each environment would pull this release down as part of its deployment process.
 
-Using a "management" account to perform deployment / orchestration activities should be avoided as this can lead to issues with users having too much permission in the management account to support the work they need to perform in a development environment and these permissions leaking into the access for the production environment.
+Using a "management" account to perform deployment / orchestration activities can lead to issues with users having too much permission in the management account to support the work they need to perform in a development environment and these permissions leaking into the access for the production environment. Therefore, it is essential to restrict permissions and to utilise separate roles to enable deployments to different environments. Access to the Management account must be restricted and monitored at the same level as access to the production accounts that this account supports.
 
 ## Continuous deployment vs approval gateways
 
 The target approach of CI/CD is continuous deployment, every Pull Request completed in dev gets promoted through environments and tests automatically until it is automatically deployed to production.
 
-However this presents challenges due to clinical, security, service and other approval requirements, and the release windows assigned to specific products. In these situations development teams should consider Continuous Delivery, and automatically build artefacts that are ready to deploy to production when the RFC for deployment is approved. As confidence grows teams should look to migrate to more frequent smaller deployments in discussion with their Live service teams.
+However this presents challenges due to clinical, security, service and other approval requirements, and the release windows assigned to specific products. In these situations development teams should consider Continuous Delivery, and automatically build artefacts that are ready to deploy to production. As confidence grows teams should look to migrate to more frequent smaller deployments in discussion with their Live service teams.
 
 Further, teams must implement approval gateways in their CD pipeline, and assess building integration with service management tools to automate releases to production during the release window once approved. These approval gateways may just include code review depending on the context and risks for the specific service. Teams should review the [Governance as a side effect](https://github.com/NHSDigital/software-engineering-quality-framework/blob/main/patterns/governance-side-effect.md) pattern and should aspire to bake as many of these checks and reviews into the delivery process as possible, as this will negate or minimise the need for downstream reviews.
 
 ## Manual deployments
 
-Manual deployments should be avoided. If they are required then development teams should minimise the required access rights the deploying person requires. This person should not need full admin rights to execute the process.
+Manual deployments and [all access to production](../practices/security.md#infrastructure-security) should be avoided. If they are required, development teams should minimise the access rights the deploying person needs. This person should not need full admin rights to execute the process.
 
 For example, one possible way of starting manual deployments in AWS (e.g., rolling back to a specific version) would be to have a user who only has write access to one SSM parameter store value, which is the version to install. Amazon EventBridge can monitor a change to this variable, and automatically trigger an AWS CodeBuild pipeline that uses a more privileged service role to perform the installation.
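The SSM parameter trigger described above keeps the operator's privileges down to a single write. This sketch shows the operator-facing half; the parameter name is illustrative, and the EventBridge rule and CodeBuild project that react to the change are configured separately:

```python
import boto3


def request_deployment(version: str) -> None:
    """Request a deployment/rollback by writing the desired version.

    This is all the operator's credentials allow: one put_parameter call
    against one parameter. EventBridge detects the change and triggers
    the privileged CodeBuild pipeline, so no human needs admin rights.
    """
    ssm = boto3.client("ssm")
    ssm.put_parameter(
        Name="/myservice/production/desired-version",
        Value=version,
        Type="String",
        Overwrite=True,
    )


request_deployment("1.4.2")  # e.g. roll back to a known good version
```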
 ## Zero-downtime deployments
 
-Development teams must build systems and deployment patterns in such a way as to achieve zero downtime deployments, where this is not required, due to service levels and complexity teams should highlight and ensure the decision has been ratified through a Key Architectural Decision document.
+Development teams must build systems and deployment patterns in such a way as to achieve zero downtime deployments. Where this is not required, due to service levels and complexity, teams should highlight this and ensure the decision has been ratified through an [Any Decision Record](../any-decision-record-template.md) document.
 
 To achieve zero downtime deployments the implemented deployment strategy must be “Blue-Green”.

From 45fa373a82e789c43163534d23e1c148e81c9c0f Mon Sep 17 00:00:00 2001
From: walteck
Date: Wed, 19 Apr 2023 14:41:33 +0100
Subject: [PATCH 10/10] additional notes around secure deployments

---
 patterns/deployment.md | 4 ++--
 practices/security.md  | 4 ++++
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/patterns/deployment.md b/patterns/deployment.md
index 88cc6fb4..6279e7d2 100644
--- a/patterns/deployment.md
+++ b/patterns/deployment.md
@@ -30,13 +30,13 @@ Other things worth considering are whether the team requires a different testing
 
 Finally, development teams should ensure that no lower environment (e.g., dev, test) has access to a higher environment (e.g., preprod/production). It is an anti-pattern to push artefacts up to a higher environment from a lower one, or to run a deployment runner instance in dev that deploys to production. Development teams should consider using a pull model from their higher-level environments. In this scenario the build artefacts would get pushed to a repository, for example published to GitHub, and each environment would pull this release down as part of its deployment process.
 
-Using a "management" account to perform deployment / orchestration activities can lead to issues with users having too much permission in the management account to support the work they need to perform in a development environment and these permissions leaking into the access for the production environment. Therefore, it is essential to restrict permissions and to utilise separate roles to enable deployments to different environments. Access to the Management account must be restricted and monitored at the same level as access to the production accounts that this account supports.
+Using a "management" account to perform deployment / orchestration activities can lead to issues with users having too much permission in the management account to support the work they need to perform in a development environment and these permissions leaking into the access for the production environment. Therefore, if this approach is used, it is essential to [restrict permissions and to utilise separate roles](../practices/security.md#infrastructure-security) to enable deployments to different environments. Access to the Management account must be restricted and monitored at the same level as access to the production accounts that this account supports.
 
 ## Continuous deployment vs approval gateways
 
 The target approach of CI/CD is continuous deployment, every Pull Request completed in dev gets promoted through environments and tests automatically until it is automatically deployed to production.
 
-However this presents challenges due to clinical, security, service and other approval requirements, and the release windows assigned to specific products. In these situations development teams should consider Continuous Delivery, and automatically build artefacts that are ready to deploy to production. As confidence grows teams should look to migrate to more frequent smaller deployments in discussion with their Live service teams.
+However, this presents challenges due to clinical, security, service and other approval requirements, and the release windows assigned to specific products. In these situations development teams should consider Continuous Delivery, and automatically build artefacts that are ready to deploy to production. As confidence grows teams should look to migrate to more frequent smaller deployments.
 
 Further, teams must implement approval gateways in their CD pipeline, and assess building integration with service management tools to automate releases to production during the release window once approved. These approval gateways may just include code review depending on the context and risks for the specific service. Teams should review the [Governance as a side effect](https://github.com/NHSDigital/software-engineering-quality-framework/blob/main/patterns/governance-side-effect.md) pattern and should aspire to bake as many of these checks and reviews into the delivery process as possible, as this will negate or minimise the need for downstream reviews.

diff --git a/practices/security.md b/practices/security.md
index 21841586..d47d3815 100644
--- a/practices/security.md
+++ b/practices/security.md
@@ -108,6 +108,10 @@ The remainder of this page gives more detailed and specific recommendations to b
 - **Secure the route** to infrastructure: all access to infrastructure (production or otherwise) must be via a secured route, for example via a hardened bastion only accessible via a VPN (with MFA challenge), and with an audit of usage.
 - Ensure infrastructure **IAM** is robust
   - Strong passwords and MFA
+- **Secure deployment** infrastructure.
+  - [Manual deployments should be avoided.](../patterns/deployment.md#manual-deployments)
+  - Deployment routes should be secured and treated as access to production systems.
+  - Consider the way code is [promoted through development environments to production.](../patterns/deployment.md#promotion-through-path-to-live-environments)
Example IAM policy to prevent assume role without MFA (click to expand)