Lifecycle Management And Operations
Avoid Deployment Fragmentation
Delivering an on-premise offering in addition to an existing hosted offering may result in two different ways to deploy the application. This leads to a bifurcation of team responsibilities and doubles the amount of work.
It is possible to unify deployments by migrating to Kubernetes as the primary platform for both deployments.
Kubernetes provides a way to abstract away the details of underlying infrastructure like disks, load balancers and network security rules. You can read more about Kubernetes in its documentation.
In addition to adopting Kubernetes, Helm, the Kubernetes-native package manager, should be used to split components into independent packages.
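For illustration, splitting an application into per-component charts might look like the sketch below. The component names and layout are hypothetical; `helm package` would then build each chart independently:

```shell
#!/bin/sh
# Sketch: one Helm chart per component, so each can be versioned and
# released on its own schedule. Component names are hypothetical.
set -e

for component in api worker frontend; do
  mkdir -p "charts/$component/templates"
  cat > "charts/$component/Chart.yaml" <<EOF
apiVersion: v2
name: $component
version: 0.1.0
description: Independently versioned chart for the $component component
EOF
done

# Each chart can now be packaged separately, e.g.:
#   helm package charts/api
ls charts
```

The same charts can then target both the on-premise cluster and a managed cloud cluster, which is what keeps the two deployment paths from diverging.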
Once the migration to Kubernetes and Helm is complete, the on-premise edition becomes just another deployment target alongside cloud deployments.
For example, many of our customers use Telekube’s supported upstream Kubernetes as a deployment target for on-premise and a managed Kubernetes service like GKE or AKS for their cloud deployments.
Simplify Installation Complexity
Installing a highly available, distributed system is difficult on infrastructure you control, not to mention infrastructure that you don’t. Setting up dozens of components and dependencies leads to a multi-step installation process that is very hard for untrained, on-site personnel to execute and debug.
Some installations will fail, and it will take many hours of back and forth with the customer to troubleshoot the root cause. Completing an installation may take days and numerous attempts to get right. Eventually, customers may abandon the idea entirely in frustration, which can damage your reputation.
Automating as many steps as possible removes the human factor from the installation process. However, it can be difficult to automate the installation of complex applications for every environment. It is a good idea to limit the types of supported environments and to only support specific components so that you can safely implement automation (see the section “Handle Incompatible Resources”). There also needs to be an easy way to log and share information externally for debugging if something goes wrong.
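For instance, a minimal debug-report collector could bundle environment details into a single archive for the customer to share with support. A sketch (the report contents and file names are hypothetical):

```shell
#!/bin/sh
# Sketch: collect basic environment info into one tarball that the
# customer can attach to a support ticket after a failed install.
set -e

REPORT_DIR=$(mktemp -d)

# Capture whatever is useful for post-mortem debugging.
uname -a   > "$REPORT_DIR/os-info.txt"
df -h      > "$REPORT_DIR/disk-usage.txt"
env | sort > "$REPORT_DIR/environment.txt"   # beware: may contain secrets
# A real collector would also grab installer logs, e.g.:
#   cp /var/log/install.log "$REPORT_DIR/" 2>/dev/null || true

tar -czf install-report.tar.gz -C "$REPORT_DIR" .
echo "report written to install-report.tar.gz"
```

A single shareable artifact like this avoids the slow back-and-forth of asking the customer to run one diagnostic command at a time.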
Speaking of installation failures, support teams should have the ability to roll the installation back to the point of failure rather than starting over. This saves hours when failures occur late in the installation process.
Telekube automates the installation, reducing it to a single command that installs Kubernetes alongside all dependencies and application containers. It also has a simple way to collect operational reports that capture all available information, and it allows for manually overriding the automated installation if a failure occurs.
Handle Incompatible Resources
When you don’t control the infrastructure, installation problems can be caused by a variety of things outside of your control - slow disks, networks or an old OS distribution provided by the customer. This can cause hours of troubleshooting, and it will be unclear why the installation failed on a seemingly correct setup and configuration.
Always specify and enforce system requirements for disk space and speed, network characteristics, and OS distribution with every installation. The system should refuse to install unless the requirements are met, and it should report clear errors for the requirements that are not. It is usually not enough to provide guidance to customers in the form of documentation because they often ignore it. We use a set of pre-checks to specify the list of supported operating systems, disk speed and capacity, network bandwidth and open ports.
Also, equip your services teams with lightweight tools to pre-check system readiness, like our gravity status tool, which can run even before the installation has begun to make sure that basic requirements are satisfied.
Here is some advice on more specific requirements to consider:
- When using Kubernetes, require a separate disk for etcd (the internal Kubernetes database) and any other database that you ship. This requirement can be lifted for trial deploys, but make sure to include it in production specifications.
- Screen out slow network-attached storage by setting minimum performance requirements for storage volumes. Even a bar as low as 20 MB/s will eliminate completely incompatible or broken storage.
- Always set capacity requirements for the temporary, root, and database partitions. You will be surprised how often you will get VMs with minimal disk space available if you don’t.
- Apply baseline network throughput requirements. Setting something as low as 5 MB/s will spare you from troubleshooting congested networks.
- Specify and encode all networking and port requirements needed for the application to run.
- Start with one or two of the most popular supported OS distributions. Typically, larger customers have RHEL available. This will spare you from troubleshooting a range of 5 different distros and kernels. Here are our guidelines on supported distributions, for reference.
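A stripped-down sketch of such pre-checks in shell (the distro whitelist, the 1 GB disk threshold, and the port list are illustrative values, not Telekube’s actual requirements):

```shell
#!/bin/sh
# Sketch of install-time pre-checks: refuse to proceed unless every
# requirement is met, with an explicit error for each failure.

check_distro() { # usage: check_distro <os-release ID>
  case "$1" in rhel|centos|ubuntu|debian) return 0 ;; *) return 1 ;; esac
}

check_disk() { # usage: check_disk <mount-point> <min-free-GB>
  free_kb=$(df -Pk "$1" | awk 'NR==2 {print $4}')
  [ "$free_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

check_port_free() { # usage: check_port_free <port>
  ! ( command -v ss >/dev/null && ss -ltn 2>/dev/null | grep -q ":$1 " )
}

run_prechecks() {
  fail=0
  check_distro "$(. /etc/os-release 2>/dev/null; echo "$ID")" \
    || { echo "FAIL: unsupported OS distribution"; fail=1; }
  check_disk / 1      || { echo "FAIL: less than 1GB free on /"; fail=1; }
  check_port_free 6443 || { echo "FAIL: port 6443 already in use"; fail=1; }
  [ "$fail" -eq 0 ] && echo "pre-flight checks passed"
  return "$fail"
}

run_prechecks || echo "refusing to install until requirements are met"
```

Each check prints a specific error instead of letting the installation proceed and fail hours later for an unclear reason.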
Simplify Upgrade Complexity
Installing distributed systems on infrastructure you do not fully control is hard, but upgrading them is an order of magnitude more complex. Sometimes only certain components of the system need to be upgraded, but it may be difficult to upgrade only those components in a safe way.
Upgrade failures can turn into quagmires. During an on-premise upgrade, there is no easy way to reinstall the OS or add new nodes to the rotation. Any part of the upgrade can fail at any time due to known or unknown circumstances like power outages, the system running out of disk space, or containers hanging because of older kernels. Complex updates will contribute to longer upgrade cycles, as customers will be wary of the risk and of spending 2-3 days upgrading the system.
Our upgrade process consists of a single command that launches a full cluster and application upgrade. However, if the upgrade fails, it can be easily resumed from the explicit stage it last completed, instead of starting from the beginning.
This approach makes it possible to continue the upgrade even in the face of unexpected failures and it keeps the cluster running during failures. This also leaves a good impression on the customer, as you know at which stage the upgrade failed and can provide insight as to why it happened, which is difficult with a black box upgrade procedure.
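A minimal sketch of this resume-from-last-completed-stage idea, with hypothetical stage names and placeholder commands standing in for real upgrade steps:

```shell
#!/bin/sh
# Sketch: stage-based upgrade with a state file. Each completed stage is
# recorded; on a re-run, completed stages are skipped, so a failed
# upgrade resumes from the point of failure instead of starting over.
set -e
STATE_FILE=upgrade.state
touch "$STATE_FILE"

run_stage() { # usage: run_stage <name> <command...>
  name=$1; shift
  if grep -qx "$name" "$STATE_FILE"; then
    echo "skip: $name (already completed)"
    return 0
  fi
  echo "run:  $name"
  "$@"                        # a failure here aborts, leaving state intact
  echo "$name" >> "$STATE_FILE"
}

run_stage pull-images    true   # 'true' is a placeholder command
run_stage upgrade-etcd   true
run_stage upgrade-nodes  true
run_stage upgrade-app    true
echo "upgrade complete"
```

Because the state file names the exact stage where a failure occurred, support can explain to the customer what broke and why, instead of debugging a black box.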
Handle Application Upgrade Failures
Having a platform like Telekube is helpful but it’s not a magic bullet. If the application is not architected correctly, an upgrade can lead to failed database migrations and lost data, which can lead to many hours of troubleshooting and rollbacks.
We offer some guides and training on implementing proper application upgrade procedures. Here is just one example:
During the upgrade process, we strongly advise taking automatic backups of the system and draining write traffic to the database to avoid conflicts during migration. In addition, we recommend using a test suite like robotest to run automated regression and upgrade testing with every code and deployment change.
We also offer upgrade hooks that your application can use with Telekube. Here is a sample application upgrade process that can be automated with the upgrade hook:
- Run migrations as a separate process for the cluster, instead of as part of each individual service’s startup.
- Switch the product landing page and API endpoints to an “upgrade page” to prevent writes to the database during the migration process.
- Drain off the traffic to databases.
- Take a backup of the data.
- Run migrations on the database.
- Check that the migrations ran safely by using a simple sanity test.
- Upgrade services.
- Switch the traffic back to the services from the landing page.
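The steps above could be wired into a single hook script along these lines. Every function here is a stub that stands in for a real command, and all names are hypothetical:

```shell
#!/bin/sh
# Sketch of the upgrade sequence as one hook script: any step failing
# aborts the sequence before traffic is switched back.
set -e

enable_maintenance_page() { echo "maintenance page on";  }
drain_db_traffic()        { echo "db traffic drained";   }
backup_data()             { echo "backup taken";         }
run_migrations()          { echo "migrations applied";   }
sanity_check()            { echo "sanity check passed";  }
upgrade_services()        { echo "services upgraded";    }
restore_traffic()         { echo "traffic restored";     }

enable_maintenance_page   # writes are rejected from here on
drain_db_traffic
backup_data               # roll back to this point if anything below fails
run_migrations
sanity_check              # verify migrations before touching services
upgrade_services
restore_traffic
echo "upgrade hook finished"
```

Keeping the backup step before the migration means a failed migration can be rolled back to a known-good state rather than troubleshot in place.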
Create Consistent Application Environments
Even though Kubernetes and Docker can abstract away infrastructure differences, now you have to maintain Docker and Kubernetes and make sure they are consistent across deployments. Each version is slightly different and you will encounter slightly different behavior with various combinations of software, OS distributions and storage engines. This will introduce fragmentation, and your ops team will be constantly asking customers questions about component versions and their respective configurations.
We create a “bubble of consistency” by using the following methodologies:
- We package Kubernetes and all of its dependencies, including etcd, docker, dnsmasq, systemd, etc. We test to make sure these dependencies are compatible before installation. This helps to ensure conflicting software is not running on the host during the install.
- We isolate the processes in a special Linux container, which minimizes interaction with distribution packages.
- The runtime section of our application manifest sets up approved Docker storage drivers that are production-ready and can work reliably without losing data.
- We only support the most popular OS distributions and specify other requirements. Components are tested before each release.
- Doing these things means the support and services teams never have to ask questions like which Docker or dnsmasq version is installed, because the packages are predetermined and tested for reliability and supportability.
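One way to make this “bubble of consistency” auditable is to pin the bundled component versions in a manifest and verify them at install time. A sketch with illustrative component names and versions; the `installed_version` stub stands in for querying the host (e.g. `docker --version`):

```shell
#!/bin/sh
# Sketch: pin component versions in a manifest and verify each one,
# so support never has to ask "which Docker version is installed?".
set -e

cat > versions.manifest <<'EOF'
docker=20.10.8
etcd=3.4.9
dnsmasq=2.85
EOF

# Stub: a real check would query the host for the installed version.
installed_version() { grep "^$1=" versions.manifest | cut -d= -f2; }

while IFS='=' read -r component expected; do
  actual=$(installed_version "$component")
  if [ "$actual" = "$expected" ]; then
    echo "ok: $component $actual"
  else
    echo "MISMATCH: $component expected $expected, found $actual"
    exit 1
  fi
done < versions.manifest
echo "all pinned components verified"
```

A mismatch fails loudly at install time, which is far cheaper than discovering an unsupported Docker version during a production incident.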