
Deploying SD-WAN Across 500+ Retail Stores: Lessons Learned

Deploying networking equipment to a handful of offices is one problem. Deploying it to 500+ retail stores scattered across Indian metros, tier-2 cities, and semi-urban locations is a fundamentally different problem. This is the story of what we learned doing it.

A national retail chain approached us with a straightforward ask: reliable connectivity for POS terminals, CCTV backhaul, and a centralized inventory system. Their existing setup was a patchwork — local ISP connections managed by store staff, VPN concentrators in a colo, and an IT team that spent most of its time firefighting connectivity issues.

The numbers:

  • 500+ stores across India — metros, tier-2 cities, and towns
  • 2–4 POS terminals per store, each needing sub-second transaction response
  • 99.9% uptime target for POS connectivity (translates to at most 8.76 hours of downtime per year)
  • Zero on-site IT staff — store employees are retail workers, not network engineers
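
The 99.9% number is worth translating into a concrete budget before signing up for it:

```python
# Downtime budget implied by a 99.9% uptime target over one year.
hours_per_year = 365 * 24                        # 8760
allowed_downtime_hours = hours_per_year * (1 - 0.999)
print(round(allowed_downtime_hours, 2))          # 8.76 hours per year
```

Spread evenly, that is under 44 minutes of POS downtime per store per month.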

ISP Quality Variance: The Elephant in the Room

If you have only deployed networks in metros, you have a skewed understanding of Indian ISP quality. In Bangalore or Mumbai, you can get reasonably reliable fiber from multiple providers. In a tier-3 town in Madhya Pradesh, your options might be:

  1. A local cable operator reselling bandwidth with no SLA
  2. BSNL broadband with variable latency
  3. 4G from one of the major telcos

None of these individually meet a 99.9% uptime target. Together, with intelligent failover, they can.

Every Hopbox device at a retail site connects to at least two independent WAN links. The key word is independent — two connections from the same ISP hitting the same last-mile infrastructure is not diversity.
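
The arithmetic behind that claim: suppose, purely for illustration, that each link on its own delivers 99% availability. If failures are truly independent, both links are down at the same time only 1% of 1% of the time, i.e. 0.01%, which is 99.99% combined availability:

```python
# Combined availability of independent WAN links: the site is down only
# when every link fails at once. The 99% per-link figure below is an
# illustrative assumption, not a measured number from this deployment.

def combined_availability(per_link):
    p_all_down = 1.0
    for availability in per_link:
        p_all_down *= (1.0 - availability)
    return 1.0 - p_all_down

# Two independent 99% links comfortably beat the 99.9% target:
print(combined_availability([0.99, 0.99]))   # ~0.9999
# Two links sharing the same last mile are one link in disguise:
print(combined_availability([0.99]))         # ~0.99
```

The multiplication only holds when the failure modes really are independent, which is exactly why shared last-mile infrastructure breaks the model.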

Our standard configurations per tier:

City Tier    | Primary Link          | Secondary Link  | Notes
Metro        | Fiber (Jio/Airtel)    | 4G backup       | Fiber is reliable enough for primary
Tier-2       | Broadband (local ISP) | 4G (Jio/Airtel) | Different last-mile infrastructure
Tier-3/Rural | 4G (Airtel)           | 4G (Jio)        | Different towers, different backhaul

For tier-3 locations, we sometimes add a third link if the two available 4G providers share tower infrastructure in that area. The goal is always: no single point of failure in the last mile.

Each store gets a Hopbox CPE running OpenWrt. The hardware is purpose-selected for the deployment:

  • Dual WAN ports + USB for 4G dongle
  • Enough flash for OpenWrt + our management packages
  • Low power draw (important — many stores have unstable power, and the Hopbox runs off a small UPS)

The device ships pre-configured. Store staff literally plug in two cables (or one cable + one 4G dongle), connect POS terminals to LAN ports, and power it on. The device phones home to our management platform, pulls its site-specific configuration, and establishes WireGuard tunnels to our hub infrastructure.

# What the device does on first boot (simplified)
1. DHCP on WAN interfaces -> get connectivity
2. NTP sync -> correct clock (critical for WireGuard/TLS)
3. Connect to management server via HTTPS
4. Pull site-specific config (WireGuard keys, VLAN assignments, QoS policies)
5. Establish WireGuard tunnels to regional hubs
6. Report status to Prometheus endpoint
7. Begin health monitoring on all WAN links
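
The last step, continuous WAN health monitoring, is what drives failover decisions. A minimal sketch of a per-link state machine; the thresholds here are illustrative assumptions, not the production values:

```python
# Per-link health state machine: mark a link down after consecutive
# probe failures, and up again after consecutive successes, so that a
# single lost ping never triggers a failover. Thresholds are illustrative.

FAIL_THRESHOLD = 3      # consecutive failures before declaring "down"
RECOVER_THRESHOLD = 2   # consecutive successes before declaring "up"

class LinkHealth:
    def __init__(self):
        self.up = True
        self.fail_streak = 0
        self.ok_streak = 0

    def record(self, probe_ok):
        """Feed one probe result; returns True if the link state flipped."""
        if probe_ok:
            self.ok_streak += 1
            self.fail_streak = 0
            if not self.up and self.ok_streak >= RECOVER_THRESHOLD:
                self.up = True
                return True      # state flip: trigger failback
        else:
            self.fail_streak += 1
            self.ok_streak = 0
            if self.up and self.fail_streak >= FAIL_THRESHOLD:
                self.up = False
                return True      # state flip: trigger failover
        return False
```

Hysteresis like this (different up and down thresholds) keeps a flapping 4G link from bouncing traffic back and forth every few seconds.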

Shipping 500+ devices is not the hard part. Getting them installed correctly is.

We built an Ansible-based provisioning pipeline that generates per-site configurations from a central inventory:

# Simplified site inventory entry
sites:
  - site_id: RET-MH-PUN-042
    region: west
    city: Pune
    tier: metro
    wan1:
      type: fiber
      isp: airtel
      static_ip: false
    wan2:
      type: 4g
      isp: jio
      apn: jionet
    lan:
      pos_vlan: 10
      cctv_vlan: 20
      mgmt_vlan: 99
    hub: hub-west-01.hopbox.in
# Generate configs for all sites
ansible-playbook site-provision.yml -i inventory/retail-client/
# Output: per-device config tarballs ready for flashing
# RET-MH-PUN-042.tar.gz
# RET-KA-BLR-017.tar.gz
# ...
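
Under the hood this is plain templating: each inventory entry is rendered into per-device config files. A toy sketch of the idea; the real pipeline uses Ansible/Jinja2 roles and full OpenWrt UCI syntax, so the field names and template below are illustrative:

```python
from string import Template

# Toy stand-in for one Jinja2 template in the real pipeline: render a
# WAN interface stanza from an inventory entry. The "device" field is
# a hypothetical addition for this sketch.
WAN_TEMPLATE = Template(
    "config interface '$name'\n"
    "        option proto 'dhcp'\n"
    "        option device '$device'   # $isp\n"
)

site = {
    "site_id": "RET-MH-PUN-042",
    "wan1": {"type": "fiber", "isp": "airtel", "device": "eth0"},
    "wan2": {"type": "4g", "isp": "jio", "device": "wwan0"},
}

def render_site(site):
    """Concatenate one stanza per WAN interface in the inventory entry."""
    return "".join(
        WAN_TEMPLATE.substitute(name=name, device=w["device"], isp=w["isp"])
        for name, w in site.items() if name.startswith("wan")
    )

print(render_site(site))
```

Keeping all site-specific state in the inventory means a device can be re-flashed from scratch at any time without hand edits.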

We did not deploy 500 sites at once. The rollout was phased:

  1. Pilot (20 stores): Mix of metro and tier-2 sites. Two weeks of monitoring. This is where we discovered that certain ISP + 4G dongle combinations had DNS resolution issues that our initial config did not account for.
  2. Phase 1 (100 stores): Ironed out the DNS issue, standardized UPS requirements, created the “store staff installation guide” — a single laminated page with photos.
  3. Phase 2 (200 stores): Introduced remote hands support — a phone number store staff call if the LEDs do not match the laminated guide.
  4. Phase 3 (remaining stores): Largely automated. Ship device, call store, walk through install in 10 minutes.

Retail does not care about bandwidth. They do not care about latency in the abstract. They care about one thing: can the POS terminal process a transaction right now?

We defined uptime as: POS terminal can reach the payment gateway AND the inventory server with latency under 200ms. Not “WAN link is up.” Not “device is reachable.” The actual application path.
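
A sketch of what that application-path check looks like in practice: the site counts as "up" only when every POS-critical endpoint answers within the 200ms budget. Hostnames and ports here are placeholders, not the real gateway addresses:

```python
import socket
import time

# Application-path health check: probe the actual endpoints the POS
# depends on, not just the WAN interface state.
LATENCY_BUDGET_MS = 200

def probe(host, port, timeout=2.0):
    """TCP connect probe; returns latency in ms, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None

def pos_path_healthy(endpoints):
    """All endpoints must be reachable under budget, else the site is 'down'."""
    for host, port in endpoints:
        latency = probe(host, port)
        if latency is None or latency > LATENCY_BUDGET_MS:
            return False
    return True
```

A TCP connect is a stand-in here; a fuller probe would complete a TLS handshake or a dummy transaction against the gateway's test endpoint.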

Every Hopbox device exports metrics to Prometheus via a lightweight exporter:

# Key metrics exported per device
hopbox_wan_link_status{interface="wan1",isp="airtel"} 1
hopbox_wan_link_status{interface="wan2",isp="jio"} 1
hopbox_wan_latency_ms{interface="wan1",target="8.8.8.8"} 12.4
hopbox_wan_latency_ms{interface="wan2",target="8.8.8.8"} 45.7
hopbox_wan_loss_percent{interface="wan1"} 0.0
hopbox_wan_loss_percent{interface="wan2"} 1.2
hopbox_pos_gateway_reachable{gateway="paytm"} 1
hopbox_pos_gateway_latency_ms{gateway="paytm"} 34.2
hopbox_uptime_seconds 2592000

The NOC dashboard (Grafana) shows:

  • Real-time map of all sites, color-coded by health
  • Aggregated uptime percentage over rolling 30-day window
  • ISP-level reliability comparison (which ISPs are causing the most failovers?)
  • Alert feed for sites dropping below single-link operation
# Prometheus alerting rules (simplified)
groups:
  - name: retail-sdwan
    rules:
      - alert: SiteDownToSingleLink
        expr: sum(hopbox_wan_link_status) by (site_id) < 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Site {{ $labels.site_id }} operating on single WAN link"
      - alert: POSGatewayUnreachable
        expr: hopbox_pos_gateway_reachable == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "POS gateway unreachable from {{ $labels.site_id }}"
      - alert: SiteFullyOffline
        expr: sum(hopbox_wan_link_status) by (site_id) == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Site {{ $labels.site_id }} fully offline - all WAN links down"

1. Power Fails More Often Than ISPs

Power outages are the single biggest cause of site downtime — not ISP failures. Every site needs a UPS for the Hopbox device and the primary network switch. We learned this the hard way when monsoon-season power fluctuations took down dozens of sites simultaneously.

2. Standardize the 4G Hardware

We tested with a specific 4G dongle model in the lab. Stores received whatever was locally available. Some models had firmware bugs that caused them to stop reconnecting after a network switch. We ended up standardizing on two approved models and shipping them with the Hopbox devices.

3. Never Trust the ISP's DNS

Our initial config used the ISP’s DNS servers. Some local ISPs run DNS resolvers that go down more often than the actual link. We switched every device to use our internal PowerDNS resolvers (reached over the WireGuard tunnel) with a public fallback. DNS-related POS failures dropped to near zero.

# /etc/resolv.conf on Hopbox devices
# Primary: internal PowerDNS over WireGuard tunnel
nameserver 10.200.0.53
# Fallback: public resolvers via WAN
nameserver 1.1.1.1
nameserver 8.8.8.8

4. Centralized Management Means Centralized Responsibility

When 500 sites depend on your management platform, that platform is your single point of failure. We run our management infrastructure across multiple availability zones, with devices designed to operate autonomously if they lose contact with the management plane. The device keeps its last-known-good configuration and continues forwarding traffic — it just stops reporting metrics until connectivity to the management server is restored.
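
The last-known-good behaviour described above can be sketched as a tiny fetch-or-fallback routine. The cache path and the fetch callable are illustrative, not the real agent's API:

```python
import json

# Persist every config that was successfully fetched; fall back to the
# cached copy when the management plane is unreachable. In reality the
# fetch is an HTTPS pull and errors are broader than OSError.
def load_config(fetch_from_mgmt, lkg_path="/tmp/hopbox-config.lkg.json"):
    try:
        config = fetch_from_mgmt()
    except OSError:
        with open(lkg_path) as f:                   # mgmt plane down:
            return json.load(f), "last-known-good"  # keep forwarding
    with open(lkg_path, "w") as f:
        json.dump(config, f)                        # refresh the cache
    return config, "fresh"
```

The important property is that the forwarding plane never blocks on the management plane; only metric reporting pauses.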

Store staff cannot troubleshoot networks. Every alert that hits the NOC has a runbook. Every runbook starts with remote remediation (SSH into the device, restart a service, push a config update). Only if remote remediation fails do we dispatch a technician.

After full deployment and 6 months of operation:

  • POS uptime: % across all sites (target: 99.9%)
  • Average failover time: seconds
  • Remote resolution rate: % of issues resolved without on-site visit
  • ISP-triggered failovers per month: across all sites
  • Mean time to resolve critical alerts: minutes

The deployment is ongoing — new stores come online monthly, and we continuously refine QoS policies and failover thresholds based on the data flowing back from 500+ production devices.
