Operating edge applications in thousands of locations: 10 things you wish you knew
When you first start deploying edge workloads, everything looks simple enough: a few sites, some containers, and a stable network. But eventually it is time to scale. The number of sites grows into the hundreds or thousands, and the target environments at the edge start to show surprising behaviors.

What works well in the lab starts to show its limits in the field. The challenges are less about the software itself and more about operations, distribution, and reliability.
Here are the top 10 lessons we have learned from our experience at Avassa.
1: The control plane isn’t just a bigger Kubernetes
A centralized control plane works well when managing a few hundred nodes with stable connections. At the edge, the situation is different. Thousands of sites behave independently, and network stability cannot be assumed.
Design for local autonomy. Each site must be able to operate safely on its own, continue to function offline, and synchronize only when the network allows it. This changes how you think about orchestration, availability, and data consistency.
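As a rough illustration, the agent on each site can run a local reconcile loop that always applies the locally cached desired state and treats synchronization with the central control plane as a best-effort step. This is only a minimal sketch in Python; the state file path, the control-plane endpoint, and the reconcile_locally logic are hypothetical placeholders.

```python
import json, time, urllib.request

STATE_FILE = "/var/lib/edge/site-state.json"              # local source of truth (placeholder path)
CONTROL_PLANE = "https://control.example.com/api/state"   # hypothetical central endpoint

def load_local_state() -> dict:
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def reconcile_locally(state: dict) -> None:
    # Apply the locally cached desired state; this must never block on the network.
    pass  # site-specific logic goes here

def try_sync(state: dict) -> dict:
    # Best-effort: report local status upstream and fetch new desired state.
    req = urllib.request.Request(
        CONTROL_PLANE,
        data=json.dumps(state).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

while True:
    state = load_local_state()
    reconcile_locally(state)              # always runs, even fully offline
    try:
        new_state = try_sync(state)       # only succeeds when the network allows it
        with open(STATE_FILE, "w") as f:
            json.dump(new_state, f)
    except OSError:
        pass                              # offline or filtered: keep running on cached state
    time.sleep(60)
```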
2: A scalable fleet manager and zero-touch maintenance
Once you accept that every site is essentially its own small cluster, a new challenge appears: managing all of them over time. Manual intervention doesn’t scale.
Fleet management needs to be fully automated. Software updates, configuration changes, and certificate renewals must occur without requiring anyone to log in or run commands. The fleet management capabilities must cover both the infrastructure layer and the application layer.
Edge environments require systems that can maintain and upgrade themselves safely.
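One way to picture zero-touch maintenance is a small self-maintenance loop on each site that compares the installed version with the version the fleet manager wants, upgrades when they differ, and rolls back automatically if a local health check fails. The sketch below assumes a hypothetical edge-agent CLI and version files; it is not any particular product's interface.

```python
import subprocess, time

def installed_version() -> str:
    # Hypothetical: the version currently running on this site.
    return open("/var/lib/edge/agent-version").read().strip()

def desired_version() -> str:
    # Hypothetical: the version the fleet manager has scheduled for this site.
    return open("/var/lib/edge/desired-version").read().strip()

def healthy() -> bool:
    # Hypothetical local health probe run after an upgrade.
    return subprocess.run(["edge-agent", "health"]).returncode == 0

def upgrade(version: str) -> None:
    # Hypothetical upgrade command; in practice this must cover both the
    # infrastructure layer and the application layer.
    subprocess.run(["edge-agent", "upgrade", "--to", version], check=True)

while True:
    current, target = installed_version(), desired_version()
    if current != target:
        upgrade(target)
        if not healthy():
            upgrade(current)   # automatic rollback: nobody logs in to fix it
    time.sleep(300)
```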
3: Connectivity through segmented networks
Edge systems often operate behind strict network segmentation and firewalls, following ISA-95 or Purdue model boundaries. Direct inbound connections are usually blocked, and outbound access may be limited or filtered.
Communication must therefore be designed as outbound-initiated and proxy-aware, often using relays across DMZs. Control and monitoring traffic must be able to tolerate delays, retries, and asymmetric connectivity.
Cloud tools that assume public endpoints and full reachability do not work under these conditions. Reliable edge orchestration depends on protocols and proxies that respect existing security zones.
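A minimal Python sketch of outbound-initiated, proxy-aware communication: the site always dials out, honors the standard HTTP_PROXY/HTTPS_PROXY/NO_PROXY environment variables (which urllib picks up automatically), and retries with capped exponential backoff instead of assuming the path is always available. The upstream URL is a placeholder.

```python
import time, urllib.request

UPSTREAM = "https://control.example.com/api/heartbeat"   # placeholder endpoint

def send_heartbeat(payload: bytes, max_attempts: int = 6) -> bool:
    # urllib reads HTTP_PROXY / HTTPS_PROXY / NO_PROXY from the environment,
    # so the same code can route through a DMZ relay without changes.
    delay = 2.0
    for _ in range(max_attempts):
        try:
            req = urllib.request.Request(UPSTREAM, data=payload, method="POST")
            with urllib.request.urlopen(req, timeout=15):
                return True
        except OSError:
            time.sleep(delay)             # tolerate filtered or flaky outbound paths
            delay = min(delay * 2, 300)   # capped exponential backoff
    return False                          # give up for now; the next cycle retries
```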
4: Certificates and trust require real engineering
At scale, managing certificates is not a small detail. Each site, host, and service requires its own set of credentials that expire, need to be renewed, and occasionally fail.
Without automation and clear ownership of the trust model, you will eventually face expired or mismatched certificates in production. It is worth investing early in a structured and automated approach to PKI.
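As an illustration of the kind of automation that helps, the sketch below scans a directory of site certificates, reads each expiry date with the standard openssl x509 -enddate command, and triggers renewal well before the deadline. The site-pki renew command is a stand-in for whatever issuing mechanism your PKI actually uses.

```python
import datetime, glob, subprocess

RENEW_BEFORE = datetime.timedelta(days=30)    # renew well ahead of expiry

def cert_expiry(path: str) -> datetime.datetime:
    # `openssl x509 -enddate -noout` prints e.g. "notAfter=Mar 14 12:00:00 2026 GMT"
    out = subprocess.run(["openssl", "x509", "-enddate", "-noout", "-in", path],
                         capture_output=True, text=True, check=True).stdout
    return datetime.datetime.strptime(out.strip().split("=", 1)[1],
                                      "%b %d %H:%M:%S %Y %Z")

def renew(path: str) -> None:
    # Hypothetical hook into whatever mechanism issues this site's certificates.
    subprocess.run(["site-pki", "renew", "--cert", path], check=True)

for cert in glob.glob("/etc/edge/pki/*.pem"):
    if cert_expiry(cert) - datetime.datetime.utcnow() < RENEW_BEFORE:
        renew(cert)
```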
5: Container images must be available locally
When running tests, it is easy to pull container images directly from a public or central registry. At a large scale, this does not work reliably.
A local or site-local registry ensures that applications can start after reboots, upgrades, or network outages. It removes a common source of operational failures and provides a predictable deployment path.
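One common pattern is to mirror images into the site-local registry while the uplink is available and deploy only from the local copy afterwards. The sketch below drives the real skopeo copy command from Python; the registry address and image name are hypothetical.

```python
import subprocess

LOCAL_REGISTRY = "registry.site.local:5000"   # hypothetical site-local registry

def mirror(image: str) -> str:
    """Copy an image into the site-local registry and return the local reference."""
    local_ref = f"{LOCAL_REGISTRY}/{image.split('/', 1)[-1]}"
    # skopeo replicates the image, so later restarts never depend on the WAN.
    # --dest-tls-verify=false is only appropriate for a plain-HTTP local registry.
    subprocess.run(["skopeo", "copy", "--dest-tls-verify=false",
                    f"docker://{image}", f"docker://{local_ref}"], check=True)
    return local_ref

# Mirror once while the uplink is available, then deploy from the local copy.
local_image = mirror("ghcr.io/example/sensor-collector:1.4.2")
```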
6: Observability must be distributed
Sending every log and metric from every container to a central dashboard does not make the system observable; it only makes it noisy. At scale, raw data without context becomes unusable.
Observability should start at the edge. Logs and metrics need to be aggregated, filtered, and contextualized locally, so that what is forwarded upstream already represents useful information. Each site should be able to summarize its own state and report only what matters, not every event line.
This makes troubleshooting faster and the overall monitoring system more meaningful. Central systems should consume insights, not raw output.
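A small sketch of edge-side aggregation: parse structured log lines locally, count events per severity, and forward upstream only a per-site summary plus the few high-severity events that deserve full context. The site identifier and forwarding policy are placeholders.

```python
import json, sys
from collections import Counter

FORWARD_LEVELS = {"ERROR", "CRITICAL"}   # only these keep their full event lines

def summarize(lines) -> dict:
    """Aggregate locally; forward raw events only when they carry real signal."""
    counts, forwarded = Counter(), []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            counts["unparsed"] += 1
            continue
        level = str(event.get("level", "INFO")).upper()
        counts[level] += 1
        if level in FORWARD_LEVELS:
            forwarded.append(event)       # full context only where it matters
    return {"site": "site-042",           # hypothetical site identifier
            "counts": dict(counts),
            "events": forwarded}

if __name__ == "__main__":
    print(json.dumps(summarize(sys.stdin)))
```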
7: People are part of the system
Even with a high degree of automation, people still need to understand and troubleshoot the platform and, even more importantly, the edge applications running on it.
Clear interfaces, remote interactive troubleshooting tools, understandable aggregated states, and safe tools for local operators make a large difference when diagnosing issues in the field. Operational clarity improves reliability more than most technical optimizations.
8: Configuration drift is unavoidable
With thousands of sites, variations in hardware, capabilities, and network characteristics are expected.
Rather than enforcing identical configurations everywhere, use structured parameters and templates to handle these differences in a controlled way. This makes deployments more robust and reduces the risk of subtle mismatches.
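For example, a single shared template can be rendered with a small, controlled set of per-site parameters instead of maintaining one hand-edited configuration per site. The sketch below uses Python's string.Template; the parameter names and site identifiers are invented for illustration.

```python
from string import Template

# One shared template, a small set of controlled per-site parameters.
SPEC_TEMPLATE = Template("""\
application: sensor-collector
replicas: $replicas
storage_path: $storage_path
gpu_enabled: $gpu_enabled
""")

SITE_PARAMETERS = {   # hypothetical per-site overrides
    "store-014": {"replicas": 1, "storage_path": "/mnt/ssd", "gpu_enabled": False},
    "plant-003": {"replicas": 2, "storage_path": "/mnt/nvme", "gpu_enabled": True},
}

def render(site: str) -> str:
    return SPEC_TEMPLATE.substitute(SITE_PARAMETERS[site])

print(render("plant-003"))
```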
9: Scaling tests must be continuous
Testing scalability once is not enough. Performance changes with new versions, additional features, and larger datasets.
Run scale tests regularly that simulate site creation, application deployments, and update cycles. This provides an early warning when behavior starts to drift, giving confidence before each release.
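A continuous scale test can be as simple as a script in the release pipeline that simulates registering many sites and deploying applications, then tracks the latency distribution from release to release. In the sketch below, the sleep stands in for the real API calls, and the site count and percentiles are arbitrary choices.

```python
import concurrent.futures, statistics, time

def simulate_site(i: int) -> float:
    """Stand-in for registering a site and deploying an application via the API."""
    start = time.monotonic()
    # e.g. api.create_site(f"sim-site-{i}"); api.deploy(f"sim-site-{i}", "demo-app")
    time.sleep(0.01)                      # placeholder for the real API calls
    return time.monotonic() - start

def run_scale_test(n_sites: int = 1000, workers: int = 50) -> None:
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(simulate_site, range(n_sites)))
    # Compare the distribution release over release to catch drift early.
    print(f"p50={statistics.median(latencies):.3f}s "
          f"p95={statistics.quantiles(latencies, n=20)[18]:.3f}s")

run_scale_test()
```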
10: Keep the system as simple as possible
When a system grows, it is tempting to add more layers and abstractions. In most cases, that adds complexity faster than it adds capability.
Focus on reducing dependencies and keeping components loosely coupled. A simple, well-understood system is easier to scale, maintain, and debug.
Closing thoughts
Many organizations underestimate the effort required to build and maintain an edge platform. The early stages often go well, internal prototypes run smoothly, and a few connected sites validate the concept. The real challenges appear later, when the rollout extends to hundreds or thousands of sites and the system must operate continuously.
Cloud-based tooling cannot be simply reused for this purpose. The assumptions that make cloud platforms efficient, such as stable networks, shared storage, and centralized control, do not apply to the edge.
Building reliable, performant systems that can manage large fleets of edge sites is a demanding engineering task. It requires careful design, operational experience, and automation from the start.
The teams that succeed are usually those that recognize early that the edge is its own environment: distributed, at times unreliable, and different by nature. They design their systems accordingly.
Scalability report
We regularly perform large-scale tests to validate Control Tower performance in realistic environments. Read our Scalability Test Report on managing applications in 10 000+ locations.

