The Going On-Premise Survival Handbook

Lifecycle Management And Operations

Avoid Deployment Fragmentation

Challenge

Delivering an on-premise offering in addition to an existing hosted offering may result in two different ways to deploy the application. This bifurcates team responsibilities and can double the amount of work.

Solution

It is possible to unify deployments by migrating to Kubernetes as the primary platform for both deployments.

Kubernetes provides a way to abstract away the details of underlying infrastructure like disks, load balancers and network security rules. You can read more about Kubernetes in its documentation.

In addition to adopting Kubernetes, use Helm (the Kubernetes-native package manager) to split components into independent packages.

Once the migration to Kubernetes and Helm is complete, the on-premise edition becomes just another deployment target alongside cloud deployments.

For example, many of our customers use Telekube’s supported upstream Kubernetes as a deployment target for on-premise and a managed Kubernetes service like GKE or AKS for their cloud deployments.
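As a rough illustration of treating on-premise as "just another deployment target", the sketch below layers small per-target overrides on top of one shared set of defaults, in the spirit of how packaged configuration is typically overlaid per environment. All keys, values and target names here are hypothetical and are not taken from any real chart.

```python
from copy import deepcopy


def merge_values(base, override):
    """Recursively merge override into a copy of base; override wins on conflicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_values(merged[key], value)
        else:
            # Scalars, lists and type mismatches are replaced wholesale.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults shared by every deployment target.
BASE = {
    "replicas": 3,
    "image": {"repository": "example/app", "tag": "1.4.2"},
    "storage": {"className": "standard", "sizeGi": 50},
}

# Hypothetical per-target overrides; only the differences are listed.
TARGETS = {
    "cloud": {"storage": {"className": "managed-ssd"}},
    "on_premise": {"replicas": 1, "storage": {"className": "local-disk"}},
}


def values_for(target):
    """Return the effective configuration for one deployment target."""
    return merge_values(BASE, TARGETS[target])
```

Keeping the overrides this small is the point of the design: everything not listed under a target is shared, so the two deployments cannot drift apart silently.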

Simplify Installation Complexity

Challenge

Installing a highly available, distributed system is difficult on infrastructure you control, not to mention infrastructure that you don’t. Setting up dozens of components and dependencies leads to a multi-step installation process that is very hard for untrained, on-site personnel to execute and debug.

Some installations will fail, and it will take many hours to troubleshoot the root cause while going back and forth with the customer. Completing an installation may take days and numerous attempts to get right. Eventually, customers may abandon the idea entirely in frustration, which can damage your reputation.

Solution

Automating as many steps as possible removes the human factor from the installation process. However, it can be difficult to automate the installation of complex applications for every environment. It is a good idea to limit the types of supported environments and to support only specific components so that you can safely implement automation (see the section “Handle Incompatible Resources”). There also needs to be an easy way to log and share information externally for debugging if something goes wrong.

Speaking of installation failures, support teams should be able to return the installation to the point of failure and resume from there rather than starting over. This can save hours when failures occur late in the installation process.

Telekube automates the installation and reduces the number of installation steps to one command, which installs Kubernetes alongside all dependencies and application containers. It also has a simple way to collect operational reports that capture as much diagnostic information as possible, and it allows for manually overriding the automated installation if a failure occurs.
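To make the idea of an easily shareable debug report concrete, here is a minimal sketch that bundles log files and a few host facts into a single archive that on-site staff can send back. It is not how Telekube builds its reports; the function name, archive layout and the facts recorded are assumptions made for illustration.

```python
import io
import json
import os
import platform
import tarfile
import time


def build_debug_report(log_paths, out_path, extra=None):
    """Bundle logs plus basic host facts into one gzip tarball.

    Returns the list of member names written to the archive.
    """
    facts = {
        "collected_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "platform": platform.platform(),
        "missing_logs": [],
    }
    facts.update(extra or {})
    members = []
    with tarfile.open(out_path, "w:gz") as tar:
        for path in log_paths:
            if not os.path.isfile(path):
                # Record the gap instead of failing: a partial report
                # is still more useful than no report.
                facts["missing_logs"].append(path)
                continue
            name = "logs/" + os.path.basename(path)
            tar.add(path, arcname=name)
            members.append(name)
        # Write the collected facts last so they include missing_logs.
        payload = json.dumps(facts, indent=2).encode("utf-8")
        info = tarfile.TarInfo("facts.json")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
        members.append("facts.json")
    return members
```

A single file with a predictable layout keeps the back-and-forth with the customer short: support can ask for one artifact instead of a list of commands to run.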

Handle Incompatible Resources

Challenge

When you don’t control the infrastructure, installation problems can be caused by a variety of things outside of your control: slow disks, slow networks or an old OS distribution provided by the customer. This can cause hours of troubleshooting, and it will be unclear why the installation failed with a seemingly correct setup and configuration.

Solution

Always specify and enforce system requirements for disk space and speed, network, and OS distribution with every installation. The system should refuse to install unless the requirements are met, and it should clearly report which requirements are not met. It is usually not enough to provide guidance to customers in the form of documentation because they often ignore it. We use a set of pre-checks to specify the list of operating systems, disk speed and capacity, network bandwidth and open ports.
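A minimal sketch of such pre-checks might look like the following. The supported OS list and function names are hypothetical, and the port check only tests whether a local port is free to bind, which is a simplification of a real network check. The installer would refuse to continue if the returned list is not empty.

```python
import shutil
import socket

# Hypothetical support matrix: (ID, VERSION_ID) pairs from /etc/os-release.
SUPPORTED_OS = {("ubuntu", "16.04"), ("centos", "7"), ("rhel", "7")}


def parse_os_release(text):
    """Parse /etc/os-release style KEY=value text into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def check_os(fields):
    """Return None if the OS is supported, else an error message."""
    ident = (fields.get("ID", ""), fields.get("VERSION_ID", ""))
    if ident in SUPPORTED_OS:
        return None
    return "unsupported OS: %s %s" % ident


def check_disk(path, min_free_gib):
    """Return None if path has enough free space, else an error message."""
    free_gib = shutil.disk_usage(path).free / 2**30
    if free_gib >= min_free_gib:
        return None
    return "%s has %.1f GiB free, need %d GiB" % (path, free_gib, min_free_gib)


def check_port_free(port, host="127.0.0.1"):
    """Return None if the port can be bound locally, else an error message."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as err:
            return "port %d is not available: %s" % (port, err)
    return None


def run_prechecks(os_release_text, data_dir, min_free_gib, ports):
    """Run every check and return the list of failures (empty means ready)."""
    results = [
        check_os(parse_os_release(os_release_text)),
        check_disk(data_dir, min_free_gib),
    ]
    results.extend(check_port_free(port) for port in ports)
    return [message for message in results if message is not None]
```

Running all checks and reporting every failure at once, rather than stopping at the first, lets the customer fix the environment in one pass.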

Also, equip your services teams with lightweight tools to pre-check system readiness, like our gravity status tool, which can run even before the installation has begun to make sure that basic requirements are satisfied.

Here is some advice on more specific requirements to consider:

Simplify Upgrade Complexity

Challenge

Installing distributed systems on infrastructure you do not fully control is hard, but upgrading them is an order of magnitude more complex. Sometimes only certain components of the system need to be upgraded, but it may be difficult to upgrade only those components in a safe way.

Upgrade failures can turn into quagmires. During an on-premise upgrade, there is no easy way to reinstall the OS or add new nodes to the rotation. Any part of the upgrade can fail at any time due to known or unknown circumstances such as power outages, the system running out of disk space, or containers hanging because of older kernels. Complex updates contribute to longer upgrade cycles, as customers will be wary of the risk and of spending 2-3 days upgrading the system.

Solution

Our upgrade process consists of a single command that launches a full cluster and application upgrade. If the upgrade fails, it can be resumed from the last stage it completed instead of starting from the beginning.

This approach makes it possible to continue the upgrade even in the face of unexpected failures and it keeps the cluster running during failures. This also leaves a good impression on the customer, as you know at which stage the upgrade failed and can provide insight as to why it happened, which is difficult with a black box upgrade procedure.
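The general idea of a staged, resumable upgrade can be sketched as follows. This is not Telekube's implementation; the class, the state file format and the stage names in the usage note are assumptions chosen to show the technique: record each completed stage durably, and on the next run skip whatever is already recorded.

```python
import json
import os


class UpgradeFailed(Exception):
    """Raised when a stage fails; carries the stage name for reporting."""

    def __init__(self, stage, cause):
        super().__init__("stage %r failed: %s" % (stage, cause))
        self.stage = stage


class ResumableUpgrade:
    def __init__(self, stages, state_path):
        self.stages = stages          # ordered list of (name, callable)
        self.state_path = state_path  # where completed stages are recorded

    def _load_completed(self):
        if not os.path.exists(self.state_path):
            return []
        with open(self.state_path) as handle:
            return json.load(handle)["completed"]

    def _save_completed(self, completed):
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a half-written state file behind.
        tmp_path = self.state_path + ".tmp"
        with open(tmp_path, "w") as handle:
            json.dump({"completed": completed}, handle)
        os.replace(tmp_path, self.state_path)

    def run(self):
        """Run all stages not yet completed; return the completed list."""
        completed = self._load_completed()
        for name, action in self.stages:
            if name in completed:
                continue  # finished in an earlier attempt
            try:
                action()
            except Exception as err:
                raise UpgradeFailed(name, err) from err
            completed.append(name)
            self._save_completed(completed)
        return completed
```

With hypothetical stages such as ("drain", ...), ("upgrade", ...) and ("health", ...), calling run() again after a failure re-executes only the failed stage and those after it, and the UpgradeFailed exception tells support exactly which stage to investigate.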

Handle Application Upgrade Failures

Challenge

Having a platform like Telekube is helpful but it’s not a magic bullet. If the application is not architected correctly, an upgrade can lead to failed database migrations and lost data, which can lead to many hours of troubleshooting and rollbacks.

Solution

We offer some guides and training on implementing proper application upgrade procedures. Here is just one example:

During the upgrade process, we strongly advise taking automatic backups of the system and draining write traffic to the database to avoid conflicts during migration. In addition, we recommend using a test suite like robotest to run automated regression and upgrade testing with every code and deployment change.
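The backup-then-drain advice can be sketched as a small routine: take a backup, pause writes, apply migrations, and restore the backup if any migration fails. The database interface below (backup, restore, pause_writes, resume_writes) is hypothetical, and the in-memory class exists only to make the sketch runnable.

```python
from contextlib import contextmanager


class InMemoryDatabase:
    """Stand-in for a real database, for illustration only."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.accepting_writes = True
        self._backups = []

    def backup(self):
        self._backups.append(dict(self.rows))
        return len(self._backups) - 1  # backup id

    def restore(self, backup_id):
        self.rows = dict(self._backups[backup_id])

    def pause_writes(self):
        self.accepting_writes = False

    def resume_writes(self):
        self.accepting_writes = True


@contextmanager
def writes_paused(db):
    """Drain write traffic for the duration of the block, then resume it."""
    db.pause_writes()
    try:
        yield
    finally:
        db.resume_writes()


def upgrade_database(db, migrations):
    """Apply migrations in order; on any failure restore the backup and re-raise."""
    backup_id = db.backup()
    with writes_paused(db):
        try:
            for migration in migrations:
                migration(db)
        except Exception:
            # Undo partial changes before writes are allowed again.
            db.restore(backup_id)
            raise
```

Restoring inside the paused-writes block matters: by the time writes resume, the data is either fully migrated or back to its pre-upgrade state, never in between.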

We also offer upgrade hooks that your application can use with Telekube. Here is a sample application upgrade process that can be automated with the upgrade hook:

Create Consistent Application Environments

Challenge

Even though Kubernetes and Docker can abstract away infrastructure differences, you now have to maintain Docker and Kubernetes themselves and make sure they are consistent across deployments. Each version is slightly different, and you will encounter slightly different behavior with various combinations of software, OS distributions and storage engines. This introduces fragmentation, and your ops team will constantly be asking customers about component versions and their respective configurations.

Solution

We create a “bubble of consistency” by using the following methodologies:
