
Deploying SD-WAN Across 500+ Retail Stores: Lessons Learned

Deploying networking equipment to a handful of offices is one problem. Deploying it to 500+ retail stores scattered across Indian metros, tier-2 cities, and semi-urban locations is a fundamentally different problem. This is the story of what we learned doing it.

A national retail chain approached us with a straightforward ask: reliable connectivity for POS terminals, CCTV backhaul, and a centralized inventory system. Their existing setup was a patchwork — local ISP connections managed by store staff, VPN concentrators in a colo, and an IT team that spent most of its time firefighting connectivity issues.

The numbers:

  • 500+ stores across India — metros, tier-2 cities, and towns
  • 2–4 POS terminals per store, each needing sub-second transaction response
  • 99.9% uptime target for POS connectivity (translates to at most 8.76 hours of downtime per year)
  • Zero on-site IT staff — store employees are retail workers, not network engineers
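
The 99.9% number is worth translating into a concrete budget before signing up for it:

```python
# Downtime budget implied by a 99.9% uptime target over one year.
hours_per_year = 365 * 24                        # 8760
allowed_downtime_hours = hours_per_year * (1 - 0.999)
print(round(allowed_downtime_hours, 2))          # 8.76 hours per year
```

Spread evenly, that is under 44 minutes of POS downtime per store per month.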

ISP Quality Variance: The Elephant in the Room

If you have only deployed networks in metros, you have a skewed understanding of Indian ISP quality. In Bangalore or Mumbai, you can get reasonably reliable fiber from multiple providers. In a tier-3 town in Madhya Pradesh, your options might be:

  1. A local cable operator reselling bandwidth with no SLA
  2. BSNL broadband with variable latency
  3. 4G from one of the major telcos

None of these individually meet a 99.9% uptime target. Together, with intelligent failover, they can.

Every Hopbox device at a retail site connects to at least two independent WAN links. The key word is independent — two connections from the same ISP hitting the same last-mile infrastructure is not diversity.
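
The arithmetic behind that claim: suppose, purely for illustration, that each link on its own delivers 99% availability. If failures are truly independent, both links are down at the same time only 1% of 1% of the time, i.e. 0.01%, which is 99.99% combined availability:

```python
# Combined availability of independent WAN links: the site is down only
# when every link fails at once. The 99% per-link figure below is an
# illustrative assumption, not a measured number from this deployment.

def combined_availability(per_link):
    p_all_down = 1.0
    for availability in per_link:
        p_all_down *= (1.0 - availability)
    return 1.0 - p_all_down

# Two independent 99% links comfortably beat the 99.9% target:
print(combined_availability([0.99, 0.99]))   # ~0.9999
# Two links sharing the same last mile are one link in disguise:
print(combined_availability([0.99]))         # ~0.99
```

The multiplication only holds when the failure modes really are independent, which is exactly why shared last-mile infrastructure breaks the model.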

Our standard configurations per tier:

City Tier    | Primary Link          | Secondary Link  | Notes
Metro        | Fiber (Jio/Airtel)    | 4G backup       | Fiber is reliable enough for primary
Tier-2       | Broadband (local ISP) | 4G (Jio/Airtel) | Different last-mile infrastructure
Tier-3/Rural | 4G (Airtel)           | 4G (Jio)        | Different towers, different backhaul

For tier-3 locations, we sometimes add a third link if the two available 4G providers share tower infrastructure in that area. The goal is always: no single point of failure in the last mile.

Each store gets a Hopbox CPE running OpenWrt. The hardware is purpose-selected for the deployment:

  • Dual WAN ports + USB for 4G dongle
  • Enough flash for OpenWrt + our management packages
  • Low power draw (important — many stores have unstable power, and the Hopbox runs off a small UPS)

The device ships pre-configured. Store staff literally plug in two cables (or one cable + one 4G dongle), connect POS terminals to LAN ports, and power it on. The device phones home to our management platform, pulls its site-specific configuration, and establishes WireGuard tunnels to our hub infrastructure.

# What the device does on first boot (simplified)
1. DHCP on WAN interfaces -> get connectivity
2. NTP sync -> correct clock (critical for WireGuard/TLS)
3. Connect to management server via HTTPS
4. Pull site-specific config (WireGuard keys, VLAN assignments, QoS policies)
5. Establish WireGuard tunnels to regional hubs
6. Report status to Prometheus endpoint
7. Begin health monitoring on all WAN links
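
The last step, continuous WAN health monitoring, is what drives failover decisions. A minimal sketch of a per-link state machine; the thresholds here are illustrative assumptions, not the production values:

```python
# Per-link health state machine: mark a link down after consecutive
# probe failures, and up again after consecutive successes, so that a
# single lost ping never triggers a failover. Thresholds are illustrative.

FAIL_THRESHOLD = 3      # consecutive failures before declaring "down"
RECOVER_THRESHOLD = 2   # consecutive successes before declaring "up"

class LinkHealth:
    def __init__(self):
        self.up = True
        self.fail_streak = 0
        self.ok_streak = 0

    def record(self, probe_ok):
        """Feed one probe result; returns True if the link state flipped."""
        if probe_ok:
            self.ok_streak += 1
            self.fail_streak = 0
            if not self.up and self.ok_streak >= RECOVER_THRESHOLD:
                self.up = True
                return True      # state flip: trigger failback
        else:
            self.fail_streak += 1
            self.ok_streak = 0
            if self.up and self.fail_streak >= FAIL_THRESHOLD:
                self.up = False
                return True      # state flip: trigger failover
        return False
```

Hysteresis like this (different up and down thresholds) keeps a flapping 4G link from bouncing traffic back and forth every few seconds.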

Shipping 500+ devices is not the hard part. Getting them installed correctly is.

We built an Ansible-based provisioning pipeline that generates per-site configurations from a central inventory:

# Simplified site inventory entry
sites:
  - site_id: RET-MH-PUN-042
    region: west
    city: Pune
    tier: metro
    wan1:
      type: fiber
      isp: airtel
      static_ip: false
    wan2:
      type: 4g
      isp: jio
      apn: jionet
    lan:
      pos_vlan: 10
      cctv_vlan: 20
      mgmt_vlan: 99
    hub: hub-west-01.hopbox.in
# Generate configs for all sites
ansible-playbook site-provision.yml -i inventory/retail-client/
# Output: per-device config tarballs ready for flashing
# RET-MH-PUN-042.tar.gz
# RET-KA-BLR-017.tar.gz
# ...
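
Under the hood this is plain templating: each inventory entry is rendered into per-device config files. A toy sketch of the idea; the real pipeline uses Ansible/Jinja2 roles and full OpenWrt UCI syntax, so the field names and template below are illustrative:

```python
from string import Template

# Toy stand-in for one Jinja2 template in the real pipeline: render a
# WAN interface stanza from an inventory entry. The "device" field is
# a hypothetical addition for this sketch.
WAN_TEMPLATE = Template(
    "config interface '$name'\n"
    "        option proto 'dhcp'\n"
    "        option device '$device'   # $isp\n"
)

site = {
    "site_id": "RET-MH-PUN-042",
    "wan1": {"type": "fiber", "isp": "airtel", "device": "eth0"},
    "wan2": {"type": "4g", "isp": "jio", "device": "wwan0"},
}

def render_site(site):
    """Concatenate one stanza per WAN interface in the inventory entry."""
    return "".join(
        WAN_TEMPLATE.substitute(name=name, device=w["device"], isp=w["isp"])
        for name, w in site.items() if name.startswith("wan")
    )

print(render_site(site))
```

Keeping all site-specific state in the inventory means a device can be re-flashed from scratch at any time without hand edits.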

We did not deploy 500 sites at once. The rollout was phased:

  1. Pilot (20 stores): Mix of metro and tier-2 sites. Two weeks of monitoring. This is where we discovered that certain ISP + 4G dongle combinations had DNS resolution issues that our initial config did not account for.
  2. Phase 1 (100 stores): Ironed out the DNS issue, standardized UPS requirements, created the “store staff installation guide” — a single laminated page with photos.
  3. Phase 2 (200 stores): Introduced remote hands support — a phone number store staff call if the LEDs do not match the laminated guide.
  4. Phase 3 (remaining stores): Largely automated. Ship device, call store, walk through install in 10 minutes.

Retail does not care about bandwidth. They do not care about latency in the abstract. They care about one thing: can the POS terminal process a transaction right now?

We defined uptime as: POS terminal can reach the payment gateway AND the inventory server with latency under 200ms. Not “WAN link is up.” Not “device is reachable.” The actual application path.
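
A sketch of what that application-path check looks like in practice: the site counts as "up" only when every POS-critical endpoint answers within the 200ms budget. Hostnames and ports here are placeholders, not the real gateway addresses:

```python
import socket
import time

# Application-path health check: probe the actual endpoints the POS
# depends on, not just the WAN interface state.
LATENCY_BUDGET_MS = 200

def probe(host, port, timeout=2.0):
    """TCP connect probe; returns latency in ms, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None

def pos_path_healthy(endpoints):
    """All endpoints must be reachable under budget, else the site is 'down'."""
    for host, port in endpoints:
        latency = probe(host, port)
        if latency is None or latency > LATENCY_BUDGET_MS:
            return False
    return True
```

A TCP connect is a stand-in here; a fuller probe would complete a TLS handshake or a dummy transaction against the gateway's test endpoint.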

Every Hopbox device exports metrics to Prometheus via a lightweight exporter:

# Key metrics exported per device
hopbox_wan_link_status{interface="wan1",isp="airtel"} 1
hopbox_wan_link_status{interface="wan2",isp="jio"} 1
hopbox_wan_latency_ms{interface="wan1",target="8.8.8.8"} 12.4
hopbox_wan_latency_ms{interface="wan2",target="8.8.8.8"} 45.7
hopbox_wan_loss_percent{interface="wan1"} 0.0
hopbox_wan_loss_percent{interface="wan2"} 1.2
hopbox_pos_gateway_reachable{gateway="paytm"} 1
hopbox_pos_gateway_latency_ms{gateway="paytm"} 34.2
hopbox_uptime_seconds 2592000

The NOC dashboard (Grafana) shows:

  • Real-time map of all sites, color-coded by health
  • Aggregated uptime percentage over rolling 30-day window
  • ISP-level reliability comparison (which ISPs are causing the most failovers?)
  • Alert feed for sites dropping below single-link operation
# Prometheus alerting rules (simplified)
groups:
  - name: retail-sdwan
    rules:
      - alert: SiteDownToSingleLink
        expr: sum(hopbox_wan_link_status) by (site_id) < 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Site {{ $labels.site_id }} operating on single WAN link"
      - alert: POSGatewayUnreachable
        expr: hopbox_pos_gateway_reachable == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "POS gateway unreachable from {{ $labels.site_id }}"
      - alert: SiteFullyOffline
        expr: sum(hopbox_wan_link_status) by (site_id) == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Site {{ $labels.site_id }} fully offline - all WAN links down"

1. Power Fails More Often Than ISPs

Power outages are the single biggest cause of site downtime — not ISP failures. Every site needs a UPS for the Hopbox device and the primary network switch. We learned this the hard way when monsoon-season power fluctuations took down dozens of sites simultaneously.

2. Standardize the 4G Hardware

We tested with a specific 4G dongle model in the lab. Stores received whatever was locally available. Some models had firmware bugs that caused them to stop reconnecting after a network switch. We ended up standardizing on two approved models and shipping them with the Hopbox devices.

3. Never Trust the ISP's DNS

Our initial config used the ISP’s DNS servers. Some local ISPs run DNS resolvers that go down more often than the actual link. We switched every device to use our internal PowerDNS resolvers (reached over the WireGuard tunnel) with a public fallback. DNS-related POS failures dropped to near zero.

# /etc/resolv.conf on Hopbox devices
# Primary: internal PowerDNS over WireGuard tunnel
nameserver 10.200.0.53
# Fallback: public resolvers via WAN
nameserver 1.1.1.1
nameserver 8.8.8.8

4. Centralized Management Means Centralized Responsibility

When 500 sites depend on your management platform, that platform is your single point of failure. We run our management infrastructure across multiple availability zones, with devices designed to operate autonomously if they lose contact with the management plane. The device keeps its last-known-good configuration and continues forwarding traffic — it just stops reporting metrics until connectivity to the management server is restored.
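
The last-known-good behaviour described above can be sketched as a tiny fetch-or-fallback routine. The cache path and the fetch callable are illustrative, not the real agent's API:

```python
import json

# Persist every config that was successfully fetched; fall back to the
# cached copy when the management plane is unreachable. In reality the
# fetch is an HTTPS pull and errors are broader than OSError.
def load_config(fetch_from_mgmt, lkg_path="/tmp/hopbox-config.lkg.json"):
    try:
        config = fetch_from_mgmt()
    except OSError:
        with open(lkg_path) as f:                   # mgmt plane down:
            return json.load(f), "last-known-good"  # keep forwarding
    with open(lkg_path, "w") as f:
        json.dump(config, f)                        # refresh the cache
    return config, "fresh"
```

The important property is that the forwarding plane never blocks on the management plane; only metric reporting pauses.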

Store staff cannot troubleshoot networks. Every alert that hits the NOC has a runbook. Every runbook starts with remote remediation (SSH into the device, restart a service, push a config update). Only if remote remediation fails do we dispatch a technician.

After full deployment and 6 months of operation:

  • POS uptime: % across all sites (target: 99.9%)
  • Average failover time: seconds
  • Remote resolution rate: % of issues resolved without on-site visit
  • ISP-triggered failovers per month: across all sites
  • Mean time to resolve critical alerts: minutes

The deployment is ongoing — new stores come online monthly, and we continuously refine QoS policies and failover thresholds based on the data flowing back from 500+ production devices.
