Facing Challenges at the Edge

November 10, 2020

Tenry Fu
Spectro Cloud

Edge applications are here ... edge platforms and apps no longer present a chicken and egg problem, but platform and app manageability (and just about everything else) is still a challenge. We won't see mainstream adoption until a proper enterprise management platform exists which addresses these concerns. Here I'll lay out some challenges and potential designs and solutions in more detail.

First, let's compare edge environments to a traditional enterprise's multi-cluster and multi-cloud environments.

What is unique about the edge environments? Edge may have many locations (think smart restaurant, retail shops, airport, cruise ships, etc.), and these locations will not necessarily have a dedicated IT staff for operating and maintaining hardware and software; low touch and plug-n-play management is required.

Edge also may suffer from unreliable networks, the internet connection may temporarily get disconnected or be very slow, and workloads may still have to continue to work in those environments.

Because of the lack of IT administration on site, ideally the edge should directly work on bare metal hardware with less management and cost overhead (instead of a full blown datacenter virtualization layer).

And unlike the typical datacenter or cloud environment, each edge location will only need one Kubernetes cluster. These edge location Kubernetes clusters should be centrally managed from a datacenter or cloud, but the management solution at the edge should be part of an end-to-end edge stack, without requiring extra footprint for a separate management appliance.

Assuming you agree with these basic assumptions and requirements, let's get to the detail of potential solutions to address these challenges.

HW Discovery, Registration, and Infrastructure Provisioning

Because edge compute would optimally run on bare metal without a virtualization layer and be as low touch and plug-n-play as possible, one potential approach is for each compute node to be flashed with one lightweight bootstrap image. This image is very similar to a mobile phone's recovery partition, it will only have basic functionality to interact with the central management system to flash the rest of node partitions. This is a pretty clean and well understood approach.

How might this work? Upon boot up, the compute node could automatically register with the central management system, along with some metadata, such as its site location, capacity, hw info, and the management system will put it into a group of newly discovered/unassigned nodes.

Later, during the infrastructure provisioning process, the central management system admin can create an edge cluster with a full stack cluster definition that is composed of selected stack components: base OS (e.g., ubuntu 20.04), Kubernetes (e.g., k3s), CSI and CNI integrations, any necessary add-ons such as logging and monitoring, and also the additional edge applications (e.g., facial recognition, smart retailer ordering system).

Then the admin can assign the node to the edge cluster with different roles such as master nodes, worker nodes, etc. If things are not working properly, the node can be booted back into the bootstrap partition and start over. This approach requires centralized management and internet connectivity at this initial phase, but would give admins more fine grained control and flexibility.

Another approach could be a fully baked image for offline provisioning. The central management system can generate a full-stack offline provisioning image based on the cluster definition the admin has created.

The first compute node to come online becomes the master and worker node, and registers itself with the central management system.

The second node online within the local network will automatically discover the first node through broadcast and join the cluster as a worker node, and so on.

Once the cluster has more than three nodes, the initial master node can pick two extra worker nodes to make them also perform dual duty as master/worker to expand the cluster to be a multi-master HA setup.

The first approach is more flexible, but a two-step provisioning process (bootstrap + infra provisioning) takes more time and requires an internet connection.

The second approach works in a plug-n-play fashion without requiring an internet connection, but an admin has less control over each node's role, and since the full image can be pretty big, it probably would be better to preflash the nodes before shipping them to the edge site. Personally I prefer the second approach when possible, but it's likely that some form of both will be supported.

Manageability

Remote, or Centralized Remote Management, is really the key to making edge solutions successful. The remote management should cover both infrastructure stack and application updates.

For infrastructure updates, the cleanest approach is immutable and rolling updates, i.e., start with a new infrastructure node with the updated infra image, have it join the cluster, tear down an old node, and repeat. This is the common practice for Kubernetes upgrades in virtualized environments, whether it is in private clouds or public clouds.

However, for bare metal, it is not so easy, since there is no spare node to rotate with. One possible solution is creating an A-B dual partition: if the current infrastructure stack is on A partition, it can write the updated infra stack to B partition, and flip the boot table to boot from B partition. If post reboot there is anything wrong, it is fairly easy to revert back to A partition without losing data.

Once the infrastructure lifecycle management is resolved, the rest of add-on and application lifecycle management is pretty easy. After all, most of these components are just HELM charts, and it is easy to do rolling updates in K8s environment.

Visibility and monitoring are also very important when dealing with 100s to 1000s of edge clusters. Each cluster should embed logging and monitoring solutions, periodically ship the log to a central system, and also send system status and metrics to a central TSDB. Because of the scalability requirements, multi-cluster monitoring systems are recommended (Prometheus alone may not be able to scale to monitor a large number of clusters).
Also, because the network is not necessarily reliable at edge locations, the system should have the capability to cache the logs and metrics locally until it is able to send them to the central management system.

It is also worth noting that the central management system should just be a management plane, not a traditional control plane. In Kubernetes, the master nodes are the local control plane. In the event of management plane failure, all application workloads can still run locally without any issue. This is also applicable if the edge cluster lost internet connection to the management plane due to network issues.

Workload Placement

For edge micro datacenters, because the application workload can be part of a cluster profile stack definition, the workload placement in general is not an issue.

However, if it is the case that different edge locations need to have different kinds of workloads, since the central management system has all edge cluster metadata, it is easy to have a rule-based or machine learning based placement engine perform intent driven placement. The user can specify some intent such as SLA, location, latency, cost, performance requirements, and the workload placement engine can decide which edge location(s) to deploy the workload to.

Service Mesh and Data Sharing

Because of the nature of edge network unreliability and because each edge cluster is running independently, workloads at the edge probably will not have many multi-cluster service mesh use cases compared to their datacenter and cloud counterparts. However, there can be use cases where some data needs to be shared and some services need to be accessed by all edge locations. Such data and services will more likely be hosted in public clouds or a datacenter, but allow edge clusters to keep some local cached data for a period of time. These will be dealt on a case by case basis, and are more likely to be handled in the application layer rather than the edge infrastructure.

In conclusion, Kubernetes will be ideal for edge, but edge cluster management is still challenging. As of today there are no commercial solutions that have addressed all the challenges, but several building block technologies have emerged. Given the interest in edge computing, we should see robust edge management solutions in the very near future.

Tenry Fu is Co-Founder and CEO of Spectro Cloud

Industry News

GitLab Duo with Amazon Q Released

April 17, 2025

GitLab announced the general availability of GitLab Duo with Amazon Q.

Perforce Delphix Partners with Liquibase

April 17, 2025

Perforce Software and Liquibase announced a strategic partnership to enhance secure and compliant database change management for DevOps teams.

Spacelift Launches Saturnhead AI

April 17, 2025

Spacelift announced the launch of Saturnhead AI — an enterprise-grade AI assistant that slashes DevOps troubleshooting time by transforming complex infrastructure logs into clear, actionable explanations.

CodeSecure Integrates with FOSSA

April 16, 2025

CodeSecure and FOSSA announced a strategic partnership and native product integration that enables organizations to eliminate security blindspots associated with both third party and open source code.

Bauplan Launches with $7.5 Million in Seed Funding

April 16, 2025

Bauplan, a Python-first serverless data platform that transforms complex infrastructure processes into a few lines of code over data lakes, announced its launch with $7.5 million in seed funding.

Perforce Introduces Kafka Service Bundle

April 15, 2025

Perforce Software announced the launch of the Kafka Service Bundle, a new offering that provides enterprises with managed open source Apache Kafka at a fraction of the cost of traditional managed providers.

LambdaTest Launches HyperExecute MCP Server

April 14, 2025

LambdaTest announced the launch of the HyperExecute MCP Server, an enhancement to its AI-native test orchestration platform, HyperExecute.

Cloudflare Announces Workers VPC and VPC Private Link

April 14, 2025

Cloudflare announced Workers VPC and Workers VPC Private Link, new solutions that enable developers to build secure, global cross-cloud applications on Cloudflare Workers.

Nutrient Expands Cloud-Based Services

April 14, 2025

Nutrient announced a significant expansion of its cloud-based services, as well as a series of updates to its SDK products, aimed at enhancing the developer experience by allowing developers to build, scale, and innovate with less friction.

Check Point Recognized for #1 AI-Powered Cyber Security Platform by Miercom

April 10, 2025

Check Point® Software Technologies Ltd.(link is external) announced that its Infinity Platform has been named the top-ranked AI-powered cyber security platform in the 2025 Miercom Assessment.

Orca Introduces Bitbucket App

April 10, 2025

Orca Security announced the Orca Bitbucket App, a cloud-native seamless integration for scanning Bitbucket Repositories.

Live API for Gemini Models in Preview

April 10, 2025

The Live API for Gemini models is now in Preview, enabling developers to start building and testing more robust, scalable applications with significantly higher rate limits.

Backslash Security Digital Twin Approach to Application Security Gains Traction as Legacy Tools Fall Short

April 09, 2025

Backslash Security(link is external) announced significant adoption of the Backslash App Graph, the industry’s first dynamic digital twin for application code.

SmartBear Releases API Hub for Test

April 09, 2025

SmartBear launched API Hub for Test, a new capability within the company’s API Hub, powered by Swagger.

Akamai Announces App & API Protector Hybrid

April 09, 2025

Akamai Technologies introduced App & API Protector Hybrid.

DEVOPSdigest

HW Discovery, Registration, and Infrastructure Provisioning

Manageability

Workload Placement

Service Mesh and Data Sharing

Industry News

Upcoming Webinars

On-Demand Webinars

Analyst Reports

White Papers

Media Partners

The Latest

Hot Topics

HW Discovery, Registration, and Infrastructure Provisioning

Manageability

Workload Placement

Service Mesh and Data Sharing

Related Links

Industry News

Search form

Upcoming Webinars

On-Demand Webinars

Analyst Reports

White Papers

Media Partners

User login

The Latest

Hot Topics