Modern software development teams building cloud applications often fail to follow these nine key principles:
- Configuration and secrets should be external to the code
- All aspects of the system should be observable
- The underlying database should be designed to evolve
- All interfaces should be versioned
- Automated testing of key components is a must
- All code should go through analysis
- Interfaces should be owned by only one team
- All components should always be deployed separately
- There should be a golden data set that contains everything a running system needs
For consultants, two additional items are critical to success:
- Measure everything
- Document everything
The Fine Nine
Or, how to improve the chances that your project succeeds
The Twelve-Factor App methodology was released in 2011 and describes best practices for building software-as-a-service (SaaS) applications. Although it was originally developed by Heroku, the platform-as-a-service company, it is now widely acknowledged as the foundation for building modern cloud applications. Many other buzzwords are thrown around in cloud development (containers, serverless, single-page applications, and WebAssembly, to name a few), but the beauty and staying power of the twelve-factor ideas is that they sit above the implementation details: the design goals are agnostic of the underlying technologies used to implement the solution.
It is in that spirit that I want to introduce you to the Fine Nine: nine things that I have seen project teams forget again and again, and that always cause pain. I believe these nine are necessary, but not sufficient, to ensure success. I have also included two bonus factors that specifically target consultants but are not limited to them. I have left out many practices simply because they are already common. For example, I don’t mention that all code should be kept in source control, as that is now ubiquitous. Likewise, creating continuous integration/continuous delivery (CI/CD) pipelines is not listed, because in 2023 it should be assumed that any new project will have a CI/CD pipeline in place from day one.
1. Configuration and secrets
One binary for all environments. No more secrets or “magic strings” in your code.
This should be a side effect of using a modern CI/CD pipeline, but I have worked with teams that went to heroic efforts to inject environment-specific data into their code as it was being built. Your pipeline should produce a binary (or equivalent), and that artifact should be deployed into each of your environments without modification. The main benefit is that you always know exactly what code is running in each environment, which gives you a much better chance of reproducing production issues in lower environments. As a side benefit, previous binaries remain available to roll back to if there is a problem. Several of my clients broke this rule, and I invariably had to log in to production servers to check binary versions when there were errors we couldn’t reproduce in QA or locally. Azure Key Vault is an example of a product built to store configuration and secrets outside of your binaries.
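A minimal Python sketch of the idea, assuming settings reach the process as environment variables (for example, populated at deploy time from a store such as Azure Key Vault). The variable names `DATABASE_URL`, `API_KEY`, and `LOG_LEVEL` are hypothetical; the point is only that nothing environment-specific is baked into the artifact, so the same build can run everywhere.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Environment-specific values, resolved when the process starts, not at build time."""
    database_url: str
    api_key: str
    log_level: str


REQUIRED = ("DATABASE_URL", "API_KEY")  # hypothetical variable names


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    # Default to the real process environment; tests can pass a plain dict.
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if name not in env]
    if missing:
        # Fail fast at startup instead of falling back to a hard-coded "magic string".
        raise RuntimeError(f"missing required settings: {', '.join(missing)}")
    return Settings(
        database_url=env["DATABASE_URL"],
        api_key=env["API_KEY"],
        log_level=env.get("LOG_LEVEL", "INFO"),  # optional, with a safe default
    )
```

The same artifact then picks up QA values in QA and production values in production; only the environment it starts in changes.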
2. Observability
Logging is more than just capture. Data ≠ information.
You should build software so that the intermediate steps in a process are available for interrogation. Decomposing a system into services is a great way to decouple components and create stateless computational units, but if you can’t watch the internal traffic, troubleshooting is still a series of shots in the dark. When you have an issue in production, you need to be able to isolate the problem code without resorting to debugging the entire application. I have a client right now with two teams arguing over who is causing an issue, because none of the traffic between their two components is easily observable. The source team logs a serialized object into a database and then sends a message without updating any status on the entry. The second team deserializes the incoming message and logs a digest version of it. Neither team will take ownership, and neither knows for sure that it is not the cause, because no one can prove what is happening between the components. Elastic Observability is an example of a tool used to capture data streams and surface them as information.
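One way to make the traffic between two components observable is to stamp every message with a correlation ID and have both sides log the full payload under that ID as structured JSON. The sketch below assumes that approach; `send`, `receive`, and `log_event` are hypothetical names, and in practice the log lines would be shipped to an observability tool rather than only written through Python’s `logging` module.

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("boundary")


def log_event(component: str, stage: str, correlation_id: str, payload: dict) -> str:
    # One structured JSON line per hop lets a log tool join both sides by correlation_id.
    record = json.dumps(
        {
            "component": component,
            "stage": stage,
            "correlation_id": correlation_id,
            "payload": payload,
        },
        sort_keys=True,
    )
    logger.info(record)
    return record


def send(payload: dict) -> dict:
    # Producer: stamp the message with an ID before it leaves the component.
    message = {"correlation_id": str(uuid.uuid4()), "body": payload}
    log_event("source", "sent", message["correlation_id"], payload)
    return message


def receive(message: dict) -> dict:
    # Consumer: log the full body as received (not a digest) under the same ID.
    log_event("consumer", "received", message["correlation_id"], message["body"])
    return message["body"]
```

With both "sent" and "received" records carrying the same ID, either team can show exactly what crossed the boundary.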
3. Evolutionary database design
Everyone stores something. Plan to allow multiple versions of your code to be running against the same database.
Databases are just code. Schema definitions and stored procedures should always be treated as code and stored in source control. When you change a code library, you ensure backward compatibility, and a database should do the same. If we refactor code that external components rely on, we don’t make breaking changes within a release cycle: we deprecate the old interface as we introduce the new one, and then we give consumers time to adjust. One of my clients lost days of data when they had to revert to a backup because a release had code issues and the old version of the application wasn’t compatible with the new database schema. Nothing else was wrong with the new database; the changes could have stayed if they had been made with backward compatibility in mind. Refactoring Databases: Evolutionary Database Design is a great book on this technique.
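A common way to apply this is the expand/contract pattern: first add new schema elements alongside the old ones, run both application versions against the expanded schema, and remove the old elements only once nothing depends on them. The sketch below uses an in-memory SQLite database and hypothetical table and column names to show a v1 and a v2 of the code sharing one schema.

```python
import sqlite3

# Hypothetical schema: v1 of the application reads and writes customers.name.
V1_SCHEMA = "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"

# "Expand" step: add the new column as nullable and leave the old one in place,
# so v1 code keeps working if the application has to be rolled back.
EXPAND = "ALTER TABLE customers ADD COLUMN display_name TEXT"


def v1_insert(conn, name):
    conn.execute("INSERT INTO customers (name) VALUES (?)", (name,))


def v2_insert(conn, name):
    # During the transition, v2 writes both the old and the new column.
    conn.execute(
        "INSERT INTO customers (name, display_name) VALUES (?, ?)", (name, name)
    )


def v2_read(conn):
    # v2 prefers the new column but falls back to the old one for rows written by v1.
    rows = conn.execute("SELECT COALESCE(display_name, name) FROM customers ORDER BY id")
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(V1_SCHEMA)
conn.execute(EXPAND)
v2_insert(conn, "Ada")
v1_insert(conn, "Grace")  # the old code still works against the new schema
print(v2_read(conn))  # ['Ada', 'Grace']
```

The "contract" step (dropping `name`) would ship in a later release, once no deployable version still reads or writes it.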
4. Version your interfaces
Plan to allow multiple versions of your code to be running side by side against the same APIs.
If you can run only one version of your software in an environment, then you can’t do a soft rollout to a partial audience without a lot of difficulty. Taking the time to build your APIs so that multiple versions can coexist in the same environment, or so that they are backward compatible, lets you control feature launches at a more granular level. Many clients release updates to all of their customers at once because that is the only way their applications can be delivered. Then they get customer feedback and have to rush fixes or changes to meet customer needs. A pilot rollout to a few selected clients can help you avoid the rush and the bad experiences by getting feedback on new versions early. An API management tool like Apigee can help teams put these practices in place.
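As an illustration, the sketch below serves two versions of a hypothetical `/orders` endpoint side by side and points only a pilot group at the new version by default. The handler names, response shapes, and `PILOT_CUSTOMERS` set are assumptions; in a real system this routing would more likely live in an API gateway or management layer than in application code.

```python
from typing import Callable, Dict, Tuple

Handler = Callable[[dict], dict]


# Hypothetical handlers: v2 changes the response shape, so v1 must stay available.
def get_order_v1(req: dict) -> dict:
    return {"id": req["id"], "total": 1999}  # total in cents


def get_order_v2(req: dict) -> dict:
    return {"id": req["id"], "total": {"amount": 1999, "currency": "USD"}}


ROUTES: Dict[Tuple[str, str], Handler] = {
    ("v1", "/orders"): get_order_v1,
    ("v2", "/orders"): get_order_v2,
}

PILOT_CUSTOMERS = {"acme"}  # hypothetical pilot group for the v2 rollout


def dispatch(path: str, req: dict) -> dict:
    # Paths look like "/v1/orders"; both versions are served at the same time.
    _, version, resource = path.split("/", 2)
    return ROUTES[(version, "/" + resource)](req)


def default_version(customer: str) -> str:
    # Soft rollout: only pilot customers are pointed at v2 by default.
    return "v2" if customer in PILOT_CUSTOMERS else "v1"
```

Because v1 keeps answering, widening the pilot (or rolling it back) is a change to `PILOT_CUSTOMERS`, not a redeployment.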
5. Test just enough
Run tests on code check-in and have meaningful tests. Don’t drive to a coverage number.
When a defect is reported against your code, add a test that reproduces the behavior and then make it pass. You don’t have to (and shouldn’t try to) achieve 100% test coverage unless the software you are working on is safety-critical. Most of us don’t work on software for rockets or air traffic control systems, so 70% coverage is fine if it is the right 70% and we actually run our tests. I am working with a client whose team reports 75% test coverage, but when I look at the test repository, the tests haven’t changed in more than two years. The tests are not run as part of the CI/CD pipeline and are kept isolated from the code in their own source code repository. I believe that stale tests that are never run are actually worse than no tests at all, because they provide an unwarranted feeling of safety.
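The "reproduce the defect in a test, then make it pass" workflow might look like this with Python’s built-in `unittest`. The `apply_discount` function and its defect (discounts above 100% producing negative prices) are hypothetical examples.

```python
import unittest


def apply_discount(price_cents: int, percent: int) -> int:
    """Return the price after a percentage discount, in whole cents (rounded down)."""
    # Fix for the hypothetical reported defect: percentages above 100
    # used to produce negative prices, so out-of-range values are now rejected.
    if not 0 <= percent <= 100:
        raise ValueError(f"percent out of range: {percent}")
    return price_cents * (100 - percent) // 100


class ApplyDiscountRegressionTest(unittest.TestCase):
    def test_discount_over_100_is_rejected(self):
        # Written first to reproduce the defect, then made to pass by the fix above.
        with self.assertRaises(ValueError):
            apply_discount(1999, 150)

    def test_full_discount_is_free(self):
        self.assertEqual(apply_discount(1999, 100), 0)

    def test_typical_discount(self):
        self.assertEqual(apply_discount(2000, 25), 1500)
```

Run in the pipeline on every check-in (for example via `python -m unittest`), a test like this keeps the defect from quietly returning.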
6. Analyze everything
Run static analysis, linters, and cyclomatic complexity tools on every check-in and discuss the results.
Tools like SonarQube can easily be added to your CI/CD pipelines and can keep you from checking in code with obvious security flaws like SQL injection vulnerabilities. Linters keep our code style consistent so that we can understand what the code is doing more quickly and easily. Finally, cyclomatic complexity tools count the number of independent paths through each function, flagging code whose branching makes it hard to read, test, and maintain. I have a client right now that is going through a security audit and running code analysis for the first time. It will take the whole team multiple sprints to fix all the issues found, and the software will be better for it. However, if they had been running the analysis all along, they would have seen each issue as it was introduced, and fixing it would have taken only minutes at each check-in. In the process, the development team would have learned to avoid those issues altogether.
7. Align ownership and responsibility with your interfaces
Clearly separate concerns and data and don’t cross boundaries.
This one is more of a business and team structure factor than a technical one, but an ineffective team or business structure can lead to severe technical issues. The ratio of APIs to development teams should be many-to-one, not many-to-many. If more than one team owns an interface or API, then it is much harder to keep the interface focused and much easier to violate the single-responsibility principle (SRP). Shared ownership can also cause release issues, because different teams may be working on different cycles. Several of my clients with shared components have had trouble releasing high-priority features because another team introduced low-priority features into the shared codebase and then got pulled onto higher-priority issues. Two teams generate two different sets of priorities, and one team usually ends up taking on more work than originally planned to clean up what the other team abandoned.
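The many-to-one rule can be checked mechanically if each interface declares its owning team. The sketch below assumes a hypothetical list of (interface, team) declarations, for example gathered from repository metadata, and reports any interface claimed by more than one team. A team owning several APIs is fine; an API with several owners is not.

```python
from collections import defaultdict

# Hypothetical ownership declarations: (interface, owning team) pairs.
DECLARATIONS = [
    ("orders-api", "team-checkout"),
    ("payments-api", "team-payments"),
    ("inventory-api", "team-fulfillment"),
    ("inventory-api", "team-checkout"),  # a second owner breaks many-to-one
]


def shared_interfaces(declarations):
    """Return {interface: sorted owning teams} for interfaces with more than one owner."""
    owners = defaultdict(set)
    for interface, team in declarations:
        owners[interface].add(team)
    return {name: sorted(teams) for name, teams in owners.items() if len(teams) > 1}
```

A pipeline could fail the build whenever `shared_interfaces(...)` is non-empty, turning the ownership rule into something enforced rather than remembered.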
8. Deploy separate components separately
No “big bang” deployments. Database, UI, and each API deploy separately every time from day one.
Deploying components individually and testing between deployments confirms that our components are loosely coupled. Software should always be designed with versioned interfaces and an evolutionary database, as described above, and deploying components separately lets us continuously test that premise. It is fine for components to depend on specific versions of other components for some functionality, but those dependencies should be documented and planned for. The system should degrade gracefully if the version required for a piece of functionality is not available. There are numerous libraries for handling this, so we don’t have to (and shouldn’t!) write our own code to do it. One client had to delay releasing features because the components in the application were tightly coupled and required coordinated database changes, and one component team (out of 10) ran into technical issues and missed the launch date. The features that the late team was implementing were not a high priority, but they blocked high-priority features for two sprints while that team fixed its issue.
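The following sketch illustrates graceful degradation under stated assumptions: a hypothetical reporting API reports its version as a `major.minor.patch` string, and a hypothetical bulk-export feature requires 2.3.0 or later. The hand-rolled version parsing only keeps the example self-contained; a well-tested library (for example, the `packaging` project’s version handling in Python, or a feature-flag library) is the better choice in production.

```python
from typing import Set, Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    # Minimal "major.minor.patch" parser, for illustration only.
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


# Hypothetical requirement: bulk export needs the reporting API at 2.3.0 or later.
BULK_EXPORT_MIN = parse_version("2.3.0")


def available_features(reporting_api_version: str) -> Set[str]:
    features = {"single_export"}  # works with every supported version
    # Tuples compare element by element, so (2, 10, 0) >= (2, 3, 0) holds.
    if parse_version(reporting_api_version) >= BULK_EXPORT_MIN:
        features.add("bulk_export")
    # Otherwise the feature is simply hidden; nothing fails and no release is blocked.
    return features
```

If the reporting team ships late, the rest of the release still goes out with bulk export switched off.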
9. Create a golden data set
Any data that needs to be in place for our system to run should be scripted and deployable as a component.
This can be a hard task, but there are two compelling reasons to do it. First, you can stand up a new environment quickly, without manual intervention. On past projects that didn’t put this in place, we ran into several problems that could have been easily avoided. In one case, we had to share a single environment for UAT and demos because environments were expensive to run and too cumbersome to set up and tear down each time one was needed. Because of this, we had to stop UAT each time the sales team went to a trade show, and the QA team had to manually reset the demo accounts to a known state. With a golden data set, we could have spun environments up and down in hours rather than weeks, and no environment would have accumulated so much manual setup that deallocating it was too expensive. It would also have allowed us to stand up an environment to run the full suite of automated tests against.
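A golden data set can be as simple as an idempotent seed script that runs as its own deployable step. The sketch below assumes a SQLite database and hypothetical `roles` and `plans` tables; because each row is upserted by primary key, the same script can populate a fresh environment or reset a drifted demo environment to a known state.

```python
import sqlite3

# Hypothetical golden data: reference rows every environment needs to run.
GOLDEN_DATA = {
    "roles": [(1, "admin"), (2, "sales"), (3, "qa")],
    "plans": [(1, "trial"), (2, "standard")],
}

SCHEMA = """
CREATE TABLE IF NOT EXISTS roles (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE IF NOT EXISTS plans (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
"""


def seed(conn: sqlite3.Connection) -> None:
    conn.executescript(SCHEMA)
    for table, rows in GOLDEN_DATA.items():
        # Table names come only from the constant above, never from user input.
        # INSERT OR REPLACE keyed on the primary key makes reruns idempotent:
        # rows are reset to the known state instead of being duplicated.
        conn.executemany(f"INSERT OR REPLACE INTO {table} (id, name) VALUES (?, ?)", rows)
    conn.commit()
```

Running `seed` twice leaves exactly the same data as running it once, which is what makes it safe to wire into every environment’s deployment.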
The second compelling reason for doing this is that it forces you to know, capture, and document all the data required to run the system. If you don’t have a golden data set and practice using it, institutional knowledge can end up obscuring requirements. On one project, the QA team set up the environment’s base data manually, following an order of creation that they knew from experience. That order came from dependencies between features that had grown as the software evolved but were not documented anywhere.
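Capturing the order of creation explicitly removes the need for that institutional knowledge. Assuming the seed steps and their dependencies are declared as data (the step names below are hypothetical), Python’s standard `graphlib` module (3.9+) can derive a valid creation order and raises `CycleError` if the dependencies are circular.

```python
from graphlib import TopologicalSorter

# Hypothetical seed steps: each key lists the steps whose data must exist first.
SEED_DEPENDENCIES = {
    "roles": set(),
    "organizations": set(),
    "users": {"roles", "organizations"},
    "demo_accounts": {"users"},
    "sample_orders": {"demo_accounts"},
}


def seed_order(dependencies):
    # static_order() yields every step after all of its prerequisites and raises
    # graphlib.CycleError for circular dependencies, surfacing ordering problems early.
    return list(TopologicalSorter(dependencies).static_order())
```

The dependency map doubles as documentation: anyone can read which data depends on which, instead of relying on the QA team’s memory.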
Bonus 1: Measure, measure, measure
“You cannot manage what you do not measure.” – Peter Drucker
As a consultant, it is my job to make things better for my clients. It is much easier to keep them happy if I can prove with cold, hard facts that things are better. Measuring performance before and after refactoring, the number of defects, page hits, and page load times gives us a better picture of how our applications are being used and where improvements can have the largest impact. For example, an administrative page that is used once or twice a quarter and loads in five seconds should be prioritized below a page that loads in one and a half seconds but is used hourly by every user. Evidence- and metric-based prioritization of optimization is vital. How else can you prove that the new version is better than the old one?
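The page comparison above can be turned into a simple calculation: multiply each page’s measured load time by how often it is loaded to get the total time users spend waiting on it. The user count, working hours, and working days behind the dashboard figure below are hypothetical, and total wait time is only one possible weighting; business value may matter as much.

```python
from dataclasses import dataclass


@dataclass
class PageMetrics:
    name: str
    load_seconds: float     # measured load time
    loads_per_quarter: int  # measured page hits


def total_wait(page: PageMetrics) -> float:
    # Cumulative seconds users spend waiting on this page each quarter.
    return page.load_seconds * page.loads_per_quarter


def prioritize(pages):
    # Largest total wait first: that is where optimization pays off most.
    return sorted(pages, key=total_wait, reverse=True)


pages = [
    PageMetrics("admin_settings", 5.0, 2),
    # Hypothetical: 50 users, hourly use, 8 hours a day, 63 working days a quarter.
    PageMetrics("dashboard", 1.5, 50 * 8 * 63),
]
print([p.name for p in prioritize(pages)])  # ['dashboard', 'admin_settings']
```

Here the faster dashboard accounts for 37,800 seconds of waiting per quarter against 10 seconds for the slower admin page.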
Bonus 2: Document, document, document
“You cannot prove what you do not document.” – NOT Peter Drucker
As a consultant, one thing I have learned again and again is that you can’t have too much documentation. You should document your design decisions using Lightweight Architecture Decision Records (LADR) to capture the context and consequences of every choice you make. Store them in a document repository and present the decisions in a public forum. Then, six months later, when you are asked why you chose XYZ rather than ABC, you can show why. I had a client this year who questioned why we chose Keycloak for a project, and the architect in charge when the decision was made didn’t have the documentation to back it up. It was not a fun conversation. Conversely, I had a client last year who, because of cost concerns, asked why we were using Cosmos DB on a different project, and we had the documentation to show which requirements drove the decision (ease of development and anticipated growth). As a result, we were able to have a great conversation about the decreased development costs and increased scalability compared with the alternatives.
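Decision records are typically short Markdown files kept next to the code. The sketch below renders one using the widely used Status / Context / Decision / Consequences layout from Michael Nygard’s ADR format; the class, field names, and file-naming scheme are assumptions for illustration, not a prescribed LADR tool.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DecisionRecord:
    number: int
    title: str
    context: str       # the forces and requirements at the time
    decision: str      # what was chosen
    consequences: str  # trade-offs accepted, good and bad
    status: str = "Accepted"
    decided_on: date = field(default_factory=date.today)

    def to_markdown(self) -> str:
        sections = [
            ("Status", self.status),
            ("Context", self.context),
            ("Decision", self.decision),
            ("Consequences", self.consequences),
        ]
        lines = [f"# {self.number}. {self.title}", "", f"Date: {self.decided_on.isoformat()}"]
        for heading, body in sections:
            lines += ["", f"## {heading}", "", body]
        return "\n".join(lines) + "\n"

    def filename(self) -> str:
        # Zero-padded numbers keep records sorted in the repository.
        slug = "-".join(self.title.lower().split())
        return f"{self.number:04d}-{slug}.md"
```

Writing `to_markdown()` to `filename()` in a shared repository gives every decision a permanent, searchable home for the "why did we choose this?" conversation.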
I hope you find these examples and lessons valuable for your cloud development teams. There are many other things that make or break projects that are not listed here, but these are the high-value ones that I see overlooked most often.