The Going On-Premise Survival Handbook

The Road To Production

Manage Open Source Software Dependencies

Challenge

Customers will often ask for a full list of the third-party software shipped with the product, including all versions and dependencies. This is to make sure there is no copyright infringement and to reduce the likelihood of vulnerabilities. Producing the list requires scanning the product for licenses, collecting all the software versions and assessing the license dependencies. This can take some time and can block a deal until it is completed.

Solution

We recommend using Fossa to set up ongoing scans for every pull request. If you do come across a restrictive license, we recommend checking in with a copyright lawyer (we use Silicon Legal), who can provide guidance and assistance with your questions. You can also reference TLDR to educate yourself on the most common licenses.
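To get a feel for what such a scan collects, here is a minimal sketch in Python that lists installed Python packages with their versions and declared licenses. It is not how Fossa works; the `RESTRICTIVE_MARKERS` list and the `classify_license` and `inventory` helpers are hypothetical names chosen for illustration, and the marker list is an assumed policy, not legal advice.

```python
from importlib import metadata

# Assumed policy for illustration only. Substring matching is deliberately
# conservative: "LGPL" also contains "GPL" and will be flagged for review.
RESTRICTIVE_MARKERS = ("GPL", "AGPL", "SSPL")


def classify_license(license_text):
    """Return 'restrictive', 'unknown' or 'ok' for a declared license string."""
    if not license_text or license_text.strip().upper() in ("", "UNKNOWN"):
        return "unknown"
    upper = license_text.upper()
    if any(marker in upper for marker in RESTRICTIVE_MARKERS):
        return "restrictive"
    return "ok"


def inventory():
    """List (name, version, license, verdict) for each installed distribution."""
    rows = []
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") or "unknown"
        license_text = meta.get("License") or ""
        rows.append((name, dist.version, license_text, classify_license(license_text)))
    return sorted(rows, key=lambda row: row[0].lower())
```

Anything that comes back as `restrictive` or `unknown` is a candidate for manual review. A real scan also has to cover transitive dependencies in every language and base image you ship.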

For Docker containers, use private registries with security scanning capabilities that can show the software and all common vulnerabilities reported (CVEs). Quay.io is a good example.

Pass Security Audits

Challenge

It is likely that one of the major reasons your customer requested an on-prem install is tight security requirements. This will usually lead to a full security audit to get a green light on the production deploy, especially if the customer is a regulated entity like a bank or government agency. You may need to redesign a deployment on short notice if vulnerabilities are discovered.

Solution

On the application level, here are some important steps to take to make sure a Kubernetes application is ready for a customer-driven external audit:

Application security

Infrastructure security audits vary in the level of thoroughness, but usually they all consist of network security scans and application black box scans.

A network security scanner will find any ports that respond with plain-text HTTP or that use weak ciphers and older protocols like SSLv2. An application security scanner will find basic vulnerabilities, for example if the server discloses its version to non-authenticated clients or contains dependencies with versions known to be vulnerable to CSRF attacks. In addition, a security auditor can conduct a more advanced review by trying to find hard-coded secrets in the code or to break into the application.
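The search for hard-coded secrets can be approximated before the audit with a simple pattern scan. The sketch below is a minimal Python illustration; the `PATTERNS` table and the `scan_text` helper are hypothetical, and the regular expressions are assumptions covering only a few obvious cases, far fewer than a dedicated scanner would check.

```python
import re

# Illustrative patterns only; real scanners ship much larger rule sets.
PATTERNS = {
    "private key block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "aws access key id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "assigned secret literal": re.compile(
        r"(?i)(password|passwd|secret|api_?key|token)\w*\s*[:=]\s*['\"][^'\"]{6,}['\"]"
    ),
}


def scan_text(text):
    """Return (line_number, label) for every line matching a secret pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings
```

A line such as `db_password = "hunter2hunter2"` is reported, while `password = os.environ["DB_PASSWORD"]` is not, because the value is read at runtime rather than embedded in the source.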

Here are some guidelines on how to get the application ready for the audit:

Once your product is ready to move past the POC stage, it is helpful to engage a third-party security review agency to conduct an external review. We recommend Cure53, as we have had positive experiences working with them over the past years and they will publish their work upon request.

Kubernetes Deployment Security

Kubernetes deployments have their own security gotchas that will be important at the time of audit.

Set up a restrictive Kubernetes deployment by following fine-grained security policies. For example, make sure that containers are not privileged and do not run as root unless they need to.
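The privileged and root checks can be automated against pod specs before an auditor sees them. The following is a minimal sketch, assuming the pod spec has already been loaded into a Python dictionary (for example from a manifest); `audit_pod_spec` is a hypothetical helper, while `privileged`, `runAsNonRoot` and `runAsUser` are the standard Kubernetes `securityContext` fields.

```python
def audit_pod_spec(pod_spec):
    """Return findings for containers that are privileged or may run as root."""
    findings = []
    pod_ctx = pod_spec.get("securityContext") or {}
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext") or {}
        if ctx.get("privileged"):
            findings.append(f"{name}: runs privileged")
        # Container-level settings take precedence over pod-level ones.
        run_as_non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot"))
        run_as_user = ctx.get("runAsUser", pod_ctx.get("runAsUser"))
        # Without an explicit non-zero user or runAsNonRoot, the image decides,
        # so the container may end up running as root.
        if run_as_user == 0 or (run_as_user is None and not run_as_non_root):
            findings.append(f"{name}: may run as root")
    return findings
```

An empty result means none of the containers is privileged and each one either sets a non-zero `runAsUser` or is covered by `runAsNonRoot`.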

Use Kubernetes secrets to store infrastructure secrets like API keys and database passwords.
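On the application side, a Kubernetes secret typically arrives either as a file in a mounted volume or as an environment variable. Here is a minimal sketch of reading one; the `read_secret` helper and the `/etc/secrets` mount path are assumptions made for illustration, so use whatever path your deployment actually mounts.

```python
import os
from pathlib import Path


def read_secret(name, mount_dir="/etc/secrets"):
    """Read a secret from a mounted file, falling back to an environment variable."""
    path = Path(mount_dir) / name
    if path.is_file():
        return path.read_text().rstrip("\n")
    # Fallback convention assumed here: "db-password" -> DB_PASSWORD
    value = os.environ.get(name.upper().replace("-", "_"))
    if value is None:
        raise KeyError(f"secret {name!r} not found in {mount_dir} or environment")
    return value
```

Failing loudly when the secret is missing keeps a misconfigured deployment from silently starting with an empty password.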

If the application is not ready to set up and handle TLS in a scalable way on its own (for example, Python or Node.js services), it is helpful to set up a proxy sidecar container that terminates TLS and sends traffic to the local app. You can read more about sidecar containers on the Kubernetes Blog.

Telekube itself is reasonably audit-ready by using mutual TLS on the control plane and following the security best practices for Kubernetes deployments.

Offer And Enforce Evaluations

Challenge

It’s much easier to monitor usage of hosted software than installable software. With installable software, you need to figure out a way to monitor and enforce usage according to the license. In addition, without enforcement, evaluation or POC periods can extend beyond the intended time frame, which leads to longer sales cycles.

Solution

Selling downloadable software requires a certain level of trust. Our position is that if someone really wants to pirate your software, they will likely succeed. Instead of spending expensive engineering cycles creating “unhackable” software, we recommend limiting your dealings to reputable customers who would not risk their reputation by knowingly using your software illegally.

Many customers will not want to report usage automatically back to you, given one of the reasons for running the application on-prem may be data privacy. So you’ll need to allow for some other reporting mechanism in the contract. Many customers will send quarterly summary reports and, in general, usage is usually bucketed into tiers or plans so that fine-grained usage reporting is not necessary.

There are also several third-party vendors that take care of license enforcement. In our experience they are either too complex or designed for legacy software, so adoption for SaaS offerings is a challenge. Initially, you may not need license enforcement to cover all use cases, but a time-based “reminder” flow for trials is a good minimal implementation.
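A time-based reminder flow can be very small. The sketch below is one possible shape, not a description of any vendor's product; `trial_status`, the 30-day trial length and the 7-day reminder window are all assumptions chosen for the example.

```python
from datetime import date, timedelta


def trial_status(start, today, trial_days=30, remind_before_days=7):
    """Return (state, days_left), where state is 'active', 'reminder' or 'expired'."""
    expires = start + timedelta(days=trial_days)
    days_left = (expires - today).days
    if days_left < 0:
        return "expired", days_left
    if days_left <= remind_before_days:
        return "reminder", days_left
    return "active", days_left
```

For a trial started on `date(2024, 1, 1)`, calling it with `date(2024, 1, 27)` returns `("reminder", 4)`. What the application does in each state (a banner, an email, a hard stop) is a product decision that belongs in the contract as well as the code.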

Telekube does have a way to define a limited trial license in the application manifest. This will shut down the software or limit the number of servers it is used on during the trial period to motivate the customer to close faster.

Highly Available (HA) Database Deployments

Challenge

It is very difficult to deploy a traditional database on-premise in a highly available manner without risking data loss. Unfortunately, Kubernetes does not bring an out-of-the-box solution to the problem.

Solution

There are entire books written about this. To keep it short, if you don’t have significant in-house expertise with a database, find a good partner that will provide a production-ready deployment of the database on Kubernetes that they will support. For example, we partner with Citus Data to deliver production ready HA Postgres with on-premise deployments.

Simplify The Operations Of Kubernetes

Challenge

Kubernetes is a complex system that consists of a distributed database (etcd), overlay network (VXLAN), container engine (Docker), Docker registry and many other components like iptables rules to keep in mind.

A successful install is just the beginning. The platform will degrade over time. Here are just some of the problems we have encountered in the past:

What happens if it fails, and how do you troubleshoot when you don’t have access to the infrastructure? How does the customer even know if the platform is in a degraded state?

Solution

There is no easy solution for this problem. However, here are some steps we have taken to help manage Kubernetes:

Our tool, gravity status, helps to diagnose the most common reasons for cluster failure, reducing time to resolution. The tool provides fast checks on some common outages that we have seen in the past. Gravity uses our monitoring system, satellite, that constantly checks the parameters of the system, not only during the install, but after the platform has been set up.

Telekube provides integrated alerting.

We offer training for field teams to help them understand Kubernetes and Docker architecture so they can become more efficient during troubleshooting sessions with customers.

Recover From Failures

Challenge

Recovering a partially failed system can be harder than setting up a new one, as you don’t have fresh hardware to begin with and have to repair the system in place. In the absence of published runbooks, services teams will struggle to provide fast assistance to the customer.

Solution

We have published a series of runbooks targeting the most common cluster failure and recovery scenarios. We review the runbooks with customers, breaking clusters and recovering them, so services teams are comfortable providing assistance on the spot.

Monitor And Troubleshoot Deployments

Challenge

Actively monitoring a multitude of on-premise deployments is difficult. You may not even have access to the deployments. In order to provide proper support you need a consistent and scalable way to assess the situation and troubleshoot your deployments when issues arise.

Solution

Every install of your application should ship with pre-built alerts, application specific metrics and a dashboard so services teams and customers will get the same monitoring and visibility no matter which environment they are in.

Metrics and Alerts

Metrics and alerts go side by side: anomalies in metrics trigger alerts. Telekube integrates with the TICK stack and Grafana to create built-in application dashboards and alerts. The Google SRE book has great advice on setting up proper alerting and monitoring in the application.
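The relationship between a metric and an alert can be illustrated with a simple rule that fires only after several consecutive samples breach a threshold, which avoids paging on a single spike. This is a generic Python sketch, not the TICK stack's alerting API; the `ThresholdAlert` class and its parameters are hypothetical.

```python
from collections import deque


class ThresholdAlert:
    """Fire when the last `consecutive` samples all exceed `threshold`."""

    def __init__(self, threshold, consecutive=3):
        self.threshold = threshold
        self.window = deque(maxlen=consecutive)

    def observe(self, value):
        """Record one sample and return True while the alert condition holds."""
        self.window.append(value > self.threshold)
        return len(self.window) == self.window.maxlen and all(self.window)
```

Feeding it CPU readings of 95, 96, 50, 91, 92, 93 with a threshold of 90 raises the alert only on the last sample, once three breaches in a row have been seen.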

Here are some tips on how to set it up with Telekube:

Logging

The 12-Factor app manifesto provides good guidance on setting up the logs as structured event streams. Docker and Kubernetes make it easy to send logs for every application by capturing logs sent to stdout and stderr.
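As an illustration of logs as structured event streams, here is a minimal sketch that emits one JSON object per log line to stdout using Python's standard `logging` module. The `JsonFormatter` and `make_logger` names and the field names in the event are assumptions for the example; pick whatever schema your log pipeline expects.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON event."""

    def format(self, record):
        event = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            event["exception"] = self.formatException(record.exc_info)
        return json.dumps(event)


def make_logger(name, stream=sys.stdout):
    """Return a logger that writes JSON events to the given stream only."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Calling `make_logger("billing").info("invoice %s sent", 42)` writes one JSON line to stdout, where the container runtime captures it without the application managing any log files.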

Telekube uses a 12-factor setup when deploying applications with Kubernetes, so the logs can be captured later. We can forward logs to the endpoint of the customer’s choice using a log forwarder configuration.

Status checks

Application metrics and alerts are great for debugging. However, most of the time customers only need an answer to one question: “Is everything up and running?” That’s why it’s important to provide “self checkers” or “smoke tests”, which are programs running in the cluster to make sure everything is in a good state. Once checkers detect any failure, they communicate to the customer that the system is in a degraded state via the UI and alerts.

Our customers write application-specific “smoke test” programs and integrate them with status hooks to give customers a clear, visible notification if the application has been degraded.
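A smoke test runner can be as simple as a set of named checks reduced to a single answer. The sketch below is a minimal Python illustration; `run_checks` is a hypothetical helper and does not represent the status hook interface, which is product-specific.

```python
def run_checks(checks):
    """Run named checks; a check passes if it returns a truthy value without raising."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:
            # A crashing check counts as a failure, not as a crash of the runner.
            results[name] = f"error: {exc}"
    status = "healthy" if all(v == "ok" for v in results.values()) else "degraded"
    return status, results
```

Each check is a zero-argument callable, for example one that opens a database connection or requests a health endpoint. The overall status answers the customer's one question, and the per-check results tell the support team where to look.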

Sending reports

In most cases there is no easy way to access on-prem deployments, so we use our cluster management tooling to take a snapshot of all system logs and metrics and ship it to the development and support teams for inspection.

Set Up Data Storage And Integrity

Challenge

There may not be any block storage available to use. Even if it is available, it may not be possible to integrate with it in a timely manner.

Solution

Here are some general storage recommendations:

How to Maintain Data Integrity

Elastic block storage solutions at cloud providers hide the frequency of data corruption by using software and hardware-powered data replication strategies. When going on-prem, this often won’t be available and as a result you will encounter data corruption much more often.

Gravitational uses the gravity backup subsystem to provide a way to back up and restore the important application state. We set up alerts to detect the absence of backups for a period of time to make sure they are happening.
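An alert on missing backups boils down to comparing the newest backup timestamp with a maximum allowed age. Here is a minimal sketch of that check; `backup_alert` and the 24-hour limit are assumptions for illustration and do not describe the gravity backup subsystem itself.

```python
from datetime import datetime, timedelta, timezone


def backup_alert(backup_times, now, max_age=timedelta(hours=24)):
    """Return an alert message if no backup is newer than max_age, else None."""
    if not backup_times:
        return "no backups found"
    age = now - max(backup_times)
    if age > max_age:
        return f"latest backup is {age} old (limit {max_age})"
    return None
```

In practice `backup_times` would come from listing the external backup location, and `now` from `datetime.now(timezone.utc)`. Passing the clock in explicitly keeps the check easy to test.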

Backups should be external - stored outside of the cluster’s storage. This makes it possible to quickly recover the system in case of data corruption. Solutions like zfs-snapshotting on the same disk won’t work when the disk is corrupted.

We test backup and restore functionality for every release in an automated way with the robotest suite to make sure that backups work as intended. Otherwise, any release can introduce regressions.

Access Deployments Remotely

Challenge

Many customers will not allow remote access to their infrastructure. The ones that do will require robust security measures in place. So how do you set up a secure way to get access or get the data you need to troubleshoot problems if you can’t get access?

Solution

In cases where remote access to the customer’s infrastructure is not possible, we rely on automated cluster management tools to get a snapshot view of the customer’s infrastructure. In addition, we provide customers with training using runbooks to solve problems, with escalation available.

In the cases when access is possible, you usually need to meet some restrictive requirements:

Gravitational developed an open source SSH framework called Teleport to meet these requirements. Telekube fully integrates with Teleport and adds abilities to fine-tune remote access management.

Configure Networking

Challenge

Kubernetes ships with a specific set of requirements for overlay networks. Customers can face problems especially in cases of complex network topologies. For example, they could experience problems making custom subnet ranges routable within their data center.
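One class of subnet problems, an overlay range that collides with ranges already in use in the data center, can be caught before installation. The sketch below uses Python's standard `ipaddress` module; `find_conflicts` is a hypothetical helper and the CIDR values in the usage note are made-up examples.

```python
import ipaddress


def find_conflicts(overlay_cidr, existing_cidrs):
    """Return the existing ranges that overlap the proposed overlay range."""
    overlay = ipaddress.ip_network(overlay_cidr)
    return [
        cidr for cidr in existing_cidrs
        if overlay.overlaps(ipaddress.ip_network(cidr))
    ]
```

For example, `find_conflicts("10.244.0.0/16", ["10.0.0.0/8", "192.168.1.0/24"])` returns `["10.0.0.0/8"]`, a signal to pick a different overlay range or agree on one with the customer's network team.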

Solution

Telekube uses the simplest possible overlay networking for Kubernetes, VXLAN, which encapsulates all traffic in UDP packets. It does not need any special routing, only basic connectivity between machines. You can read more about VXLAN here.
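To make the encapsulation concrete, here is a minimal Python sketch of the 8-byte VXLAN header defined in RFC 7348 and the MTU cost of wrapping a frame in it. The helper names are hypothetical, and this illustrates the wire format only, not Telekube's networking code. Port 4789 is the IANA-assigned VXLAN port; a given deployment may be configured to use a different one.

```python
import struct

VXLAN_UDP_PORT = 4789        # IANA-assigned port; deployments may differ
VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field is valid


def vxlan_header(vni):
    """Build the 8-byte VXLAN header for a 24-bit network identifier."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: 8 flag bits + 24 reserved bits. Word 2: 24-bit VNI + 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def encapsulate(vni, inner_ethernet_frame):
    """Return the UDP payload: VXLAN header followed by the original frame."""
    return vxlan_header(vni) + inner_ethernet_frame


def inner_mtu(underlay_mtu=1500):
    """MTU left for pod traffic over an IPv4 underlay."""
    # Outer IPv4 (20) + outer UDP (8) + VXLAN (8) + inner Ethernet header (14)
    return underlay_mtu - 50
```

Because the result is an ordinary UDP datagram between hosts, the data center network only needs to route host-to-host traffic. The 50 bytes of overhead explain why overlay interfaces on a 1500-byte underlay commonly run with an MTU of 1450.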
