The Road To Production
Manage Open Source Software Dependencies
Many times you will be asked to provide a full list of third-party software used, with all the versions and dependencies shipped with the product. This is to make sure there is no copyright infringement and to reduce the likelihood of vulnerabilities. It requires scanning the product for licenses, collecting all the software versions and assessing the license dependencies. This can take some time and can block a deal until it is completed.
We recommend using Fossa to set up ongoing scans for every pull request. If you do come across a restrictive license, we recommend checking in with a copyright lawyer (we use Silicon Legal), who can provide guidance and assistance with your questions. You can also reference TLDRLegal to educate yourself on the most common licenses.
For Docker containers, use private registries with security scanning capabilities that can show the software and all reported Common Vulnerabilities and Exposures (CVEs). Quay.io is a good example.
Pass Security Audits
It is likely that one of the major reasons your customer requested an on-prem install is tight security requirements. This will usually lead to a full security audit to get a green light on the production deploy, especially if the customer is a regulated entity like a bank or government agency. You may need to redesign a deployment on short notice if vulnerabilities are discovered.
On the application level, there are some important steps to take to make sure a Kubernetes application is ready for a customer-driven external audit.
Infrastructure security audits vary in their level of thoroughness, but they usually all consist of network security scans and application black-box scans.
A network security scan will find any ports that respond with plain-text HTTP or that use weak ciphers and older protocols like SSLv2. An application security scan will find basic vulnerabilities, for example if a server discloses its version to non-authenticated clients or contains dependencies with versions known to be vulnerable to CSRF attacks. In addition, a security auditor can conduct a more advanced review by trying to find hard-coded secrets in the code or break into the application.
Here are some guidelines on how to get the application ready for the audit:
- Set up mutual TLS in your application using side-car patterns. As a rule of thumb, there should be no unencrypted data floating between servers.
- Do not use the same static passwords/API keys for every install; make sure you generate them on the fly during the installation process.
- Disable weak ciphers, use Mozilla’s recommendations as a starting point.
- A common gotcha with TLS: if the web page or endpoint is external (customer facing), make sure TLS ciphers and certificates are configurable, as all large customers have their own guidelines and requirements.
- Focus on common web security issues by going through OWASP Top 10.
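As a concrete example of the cipher guidance above, here is a sketch of an nginx TLS block following Mozilla’s “intermediate” profile. The certificate paths are placeholders, and for external-facing endpoints these values should stay configurable per customer:

```nginx
server {
    listen 443 ssl http2;
    # Certificate paths are placeholders; large customers will supply their own.
    ssl_certificate     /etc/ssl/certs/app.pem;
    ssl_certificate_key /etc/ssl/private/app.key;

    # Modern protocols only; SSLv2/SSLv3/TLSv1.0/1.1 and weak ciphers disabled.
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305;
    ssl_prefer_server_ciphers off;
}
```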
Once your product is ready to go post-POC, it is helpful to engage a third-party security review agency to conduct an external review. We recommend Cure53, as we have had positive experiences working with them over the past years and they will publish their work upon request.
Kubernetes Deployment Security
Kubernetes deployments have their own security gotchas that will be important at audit time.
Set up a restrictive Kubernetes deployment by following fine-grained security policies. For example, you should make sure that containers are not privileged and do not run as root if they don’t need to.
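As a sketch, a pod spec implementing these restrictions might look like the following (the pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app                        # hypothetical pod name
spec:
  securityContext:
    runAsNonRoot: true             # refuse to start if the image runs as root
    runAsUser: 1000
  containers:
  - name: app
    image: example.com/app:1.0     # placeholder image
    securityContext:
      privileged: false
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
```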
Use Kubernetes secrets to store infrastructure secrets like API keys and database passwords.
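A minimal sketch of this pattern, assuming a hypothetical `app-secrets` Secret generated at install time and consumed as an environment variable:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: app-secrets                # hypothetical name
type: Opaque
stringData:                        # plain values; stored base64-encoded by Kubernetes
  database-password: generated-at-install-time
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: example.com/app:1.0     # placeholder image
    env:
    - name: DB_PASSWORD
      valueFrom:
        secretKeyRef:
          name: app-secrets
          key: database-password
```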
If the application is not ready to set up and handle TLS in a scalable way on its own (for example python or nodejs services), it is helpful to set up a proxy sidecar container terminating TLS and sending traffic to the local app. You can read more about side-car containers on the Kubernetes Blog.
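A sketch of the side-car pattern: a hypothetical pod runs the plain-HTTP app next to an nginx container that terminates TLS and forwards traffic over the loopback interface. The image names and the `app-tls` Secret are assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-tls-proxy
spec:
  containers:
  - name: app                      # plain-HTTP app, listens on localhost only
    image: example.com/app:1.0     # placeholder image
  - name: tls-proxy                # terminates TLS, forwards to the app over loopback
    image: nginx:1.25
    ports:
    - containerPort: 443
    volumeMounts:
    - name: tls-certs
      mountPath: /etc/nginx/certs
      readOnly: true
  volumes:
  - name: tls-certs
    secret:
      secretName: app-tls          # assumes the certs were created as a Secret
```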
Offer And Enforce Evaluations
It’s much easier to monitor usage of hosted software than installable software. With installable software, you need to figure out a way to monitor and enforce usage according to the license. In addition, evaluation or POC periods can extend beyond the intended time frame without enforcement, which leads to longer sales cycles.
Selling downloadable software requires a certain level of trust. Our position is that if someone really wants to pirate your software, they will likely succeed. Instead of spending expensive engineering cycles creating “unhackable” software, we recommend limiting your dealings to reputable customers who would not risk their reputation by knowingly using your software illegally.
Many customers will not want to report usage automatically back to you, given one of the reasons for running the application on-prem may be data privacy. So you’ll need to allow for some other reporting mechanism in the contract. Many customers will send quarterly summary reports and, in general, usage is usually bucketed into tiers or plans so that fine-grained usage reporting is not necessary.
There are also several third-party vendors that take care of license enforcement. In our experience, they are either too complex or designed for legacy software, so adoption for SaaS offerings is a challenge. Initially, you may not need license enforcement to cover all use cases, but a time-based “reminder” flow for trials is a good minimal implementation.
Telekube does have a way to define a limited trial license in the application manifest. This will shut down the software or limit the number of servers it runs on during the trial period to motivate the customer to close faster.
Highly Available (HA) Database Deployments
It is very difficult to deploy a traditional database on-premises in a highly available manner without risking data loss. Unfortunately, Kubernetes does not bring an out-of-the-box solution to the problem.
There are entire books written about this. To keep it short, if you don’t have significant in-house expertise with a database, find a good partner that will provide a production-ready deployment of the database on Kubernetes that they will support. For example, we partner with Citus Data to deliver production ready HA Postgres with on-premise deployments.
Simplify The Operations Of Kubernetes
Kubernetes is a complex system that consists of a distributed database (etcd), an overlay network (VXLAN), a container engine (Docker), a Docker registry and many other components, like iptables rules, to keep in mind.
A successful install is just the beginning. The platform will degrade over time. Here are just some of the problems we have encountered in the past:
- Security teams automatically blocking ports and stopping services without warning.
- Monitoring daemons set up by the customer consuming all RAM on the host.
- Running out of disk space.
- Customer’s DNS server blocking queries.
What happens if it fails, and how do you troubleshoot when you don’t have access to the infrastructure? How does the customer even know the platform is in a degraded state?
There is no easy solution for this problem, however here are some steps we have taken to help manage Kubernetes:
Our tool, gravity status, helps diagnose the most common reasons for cluster failure, reducing time to resolution. The tool provides fast checks on some common outages we have seen in the past. Gravity uses our monitoring system, satellite, which constantly checks the parameters of the system, not only during the install, but after the platform has been set up.
Telekube provides integrated alerting.
We offer training for field teams to help them understand Kubernetes and Docker architecture so they can become more efficient during troubleshooting sessions with customers.
Recover From Failures
Recovering a partially failed system can be harder than setting up a new one, as you don’t have fresh hardware to begin with and have to repair the system in place. In the absence of published runbooks, services teams will struggle to provide fast assistance to the customer.
We have published a series of runbooks targeting the most common cluster failure and recovery scenarios. We review the runbooks with customers, breaking clusters and recovering them, so services teams are comfortable providing assistance on the spot.
Monitor And Troubleshoot Deployments
Actively monitoring a multitude of on-premise deployments is difficult. You may not even have access to the deployments. In order to provide proper support you need a consistent and scalable way to assess the situation and troubleshoot your deployments when issues arise.
Every install of your application should ship with pre-built alerts, application specific metrics and a dashboard so services teams and customers will get the same monitoring and visibility no matter which environment they are in.
Metrics and Alerts
Metrics and alerts go side by side - anomalies in metrics trigger alerts. Telekube integrates with the TICK stack and Grafana to create built-in application dashboards and alerts. The Google SRE book has great advice on setting up proper alerting and monitoring in the application.
Here are some tips on how to set it up with Telekube:
- Use a TICK stack integration to ship pre-built dashboards.
- Set up built-in alerts using the Kapacitor integration.
- Set up retention policies and rollups for application metrics or use the ones shipped with Telekube by default.
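As an illustration of the Kapacitor integration, here is a minimal TICKscript alert. The measurement and field names assume the Telegraf `disk` plugin, and the threshold and log path are illustrative:

```
// disk_alert.tick -- fires when disk usage crosses 90%
stream
    |from()
        .measurement('disk')
    |alert()
        .crit(lambda: "used_percent" > 90)
        .message('disk usage at {{ index .Fields "used_percent" }}% on {{ index .Tags "host" }}')
        .log('/var/log/disk_alert.log')
```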
The 12-Factor App manifesto provides good guidance on setting up logs as structured event streams. Docker and Kubernetes make it easy to collect logs from every application by capturing anything sent to stdout and stderr.
Telekube uses a 12-factor setup when deploying applications with Kubernetes, so the logs can be captured later. We can forward logs to the endpoint of the customer’s choice using a log forwarder configuration.
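A minimal sketch of 12-factor-style logging: structured JSON events written to stdout, so Docker and Kubernetes can capture and forward them unchanged. The logger name and fields are illustrative:

```python
import json
import logging
import sys

# Format every record as a single JSON line on stdout, so the container
# runtime can capture it and a log forwarder can parse it downstream.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)   # stdout, never a local file
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")                # hypothetical logger name
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("server started")                    # emits a JSON line on stdout
```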
Application metrics and alerts are great for debugging, however most of the time customers only need an answer to one question: “Is everything up and running?” That’s why it’s important to provide “self checkers” or “smoke tests” - programs running in the cluster to make sure everything is in a good state. Once the checkers detect a failure, they communicate to the customer that the system is in a degraded state via the UI and alerts.
Our customers write application-specific “smoke test” programs and integrate them with status hooks to give customers a clear, visible notification if the application has been degraded.
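A hypothetical smoke-test runner along these lines, aggregating named checks into a single “is everything up?” answer. The check names are placeholders; real checks would probe the database, queues, and so on:

```python
from typing import Callable, Dict

# Run each named check and report an overall status. A check that
# raises an exception counts as a failure rather than crashing the runner.
def run_smoke_tests(checks: Dict[str, Callable[[], bool]]) -> dict:
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    degraded = [name for name, ok in results.items() if not ok]
    return {
        "status": "degraded" if degraded else "ok",
        "failed": degraded,
        "checks": results,
    }

# Example usage -- the "database" and "queue" checks are stand-ins:
result = run_smoke_tests({
    "database": lambda: True,
    "queue": lambda: False,
})
# result["status"] == "degraded", result["failed"] == ["queue"]
```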
In most cases there is no easy way to access on-prem deployments, so we use our cluster management tooling to take a snapshot of all system logs and metrics and ship it to the development and support teams for inspection.
Set Up Data Storage And Integrity
There may not be any block storage available to use. Even if it is available, it may not be possible to integrate with it in a timely manner.
Here are some general storage recommendations:
- You may need to rely on disk storage as the only available storage solution. Use Kubernetes host volumes to use local disk, and state clear disk requirements in Telekube’s application manifest.
- If the customer has an external NFS server, provide integration with NFS endpoints powered by Kubernetes pluggable volumes to connect your application to it.
- Use clustered database deployments that are designed to work on bad hardware, like the Cassandra-powered S3 storage system, Pithos. Avoid unproven and experimental storage systems designed to work with Kubernetes, or systems with a large operational footprint like Ceph, unless you have an in-house Ceph team to handle the support load.
- For services doing simple metadata storage, consider using custom resources provided by Kubernetes. Custom resources provide a powerful abstraction, generating a versioned, secure API with RBAC, using etcd as the backing storage.
- Avoid deploying risky storage methods, unless you have seasoned data storage expertise on the team. As a rule of thumb, try not to experiment with data storage combinations as a part of the on-prem release. Make sure that any deployment with mission critical data is vetted with a storage expert.
- Consider the operational costs of any database. For example, the ELK stack is easy to deploy, but extremely hard and expensive to manage.
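As a sketch of the host-volume recommendation above, here is a hypothetical pod using local disk via `hostPath`. The paths and image are placeholders; the corresponding disk requirements would be stated in Telekube’s application manifest:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: example.com/app:1.0     # placeholder image
    volumeMounts:
    - name: data
      mountPath: /var/lib/app
  volumes:
  - name: data
    hostPath:
      path: /opt/app/data          # local disk on the node; a documented requirement
      type: DirectoryOrCreate
```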
How to Maintain Data Integrity
Elastic block storage solutions at cloud providers hide the frequency of data corruption by using software and hardware-powered data replication strategies. When going on-prem, this often won’t be available, and as a result you will encounter data corruption much more often.
Gravitational uses the gravity backup subsystem to provide a way to backup and restore the important application state. We set up alerts to detect the absence of backups for a period of time to make sure they are happening.
Backups should be external - stored outside of the cluster’s storage. This makes it possible to quickly recover the system in case of data corruption. Solutions like zfs-snapshotting on the same disk won’t work when the disk is corrupted.
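A minimal sketch of an external backup along these lines: archive the application state, then ship the archive off the cluster. The S3 destination is an assumption; any storage outside the cluster’s own disks works:

```python
import datetime
import tarfile
import tempfile
from pathlib import Path

# Archive a state directory into a compressed tarball and return its path.
# Shipping the archive off-cluster is left as a commented assumption below.
def backup_state(state_dir: str) -> str:
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    archive = str(Path(tempfile.gettempdir()) / f"app-backup-{stamp}.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        # Keep paths in the archive relative to the state dir itself
        tar.add(state_dir, arcname=Path(state_dir).name)
    # Ship the archive outside the cluster (hypothetical bucket name):
    # boto3.client("s3").upload_file(archive, "example-offsite-backups", Path(archive).name)
    return archive
```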
We test backup and restore functionality for every release in an automated way with the robotest suite to make sure backups work as intended; otherwise, every release could introduce regressions.
Access Deployments Remotely
Many customers will not allow remote access to their infrastructure. The ones that do will require robust security measures in place. So how do you set up a secure way to get access or get the data you need to troubleshoot problems if you can’t get access?
In cases where remote access to the customer’s infrastructure is not possible, we rely on automated cluster management tools to get a snapshot view of the customer’s infrastructure. In addition, we provide them with training using runbooks to solve problems, with escalation available.
In the cases when access is possible, you usually need to meet some restrictive requirements:
- Ability to limit the time and duration of access.
- Limit access privileges using role based access controls.
- Never open any inbound internet accessible ports.
- Audit and record every action performed.
- Use second-factor authentication and have the ability to revoke access completely.
- Use approved crypto standards and protocols and turn off weak ciphers.
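As an illustration of the last two requirements, here is a hypothetical `sshd_config` fragment for an SSH-based access path. The values are illustrative, and the second-factor prompt also requires PAM configuration on the host:

```
# /etc/ssh/sshd_config fragment (illustrative values)
PermitRootLogin no
PasswordAuthentication no                               # keys only
AuthenticationMethods publickey,keyboard-interactive    # key plus second factor
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com,aes128-gcm@openssh.com
MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com
LoginGraceTime 30
```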
Kubernetes ships with a specific set of requirements for overlay networks. Customers can face problems, especially with complex network topologies. For example, they could have trouble making custom subnet ranges routable within their data center.
Telekube uses the simplest possible overlay network for Kubernetes, VXLAN, which encapsulates all traffic in UDP packets, does not need any special routing, and only requires the simplest connectivity between machines. You can read more about VXLAN here.