
Automating 900+ Network Devices with Ansible

At Hopbox, we manage over 900 SD-WAN appliances deployed across India. Each device runs OpenWrt, has 2-4 WAN links, maintains WireGuard tunnels, and runs local DNS resolution. Keeping all of them consistent, updated, and correctly configured is a non-trivial operational challenge.

Ansible is the backbone of our device automation. This post covers why we chose it, how we structure our inventory and playbooks, and the hard-won lessons from running Ansible against hundreds of embedded Linux devices.

We evaluated several options — SaltStack, custom scripts over SSH, a purpose-built agent — and landed on Ansible for these reasons:

  • Agentless. Our CPE devices run OpenWrt on constrained hardware (2-4GB RAM, mSATA SSD). We don’t want a persistent agent consuming resources. Ansible needs only SSH and Python (or raw mode for devices without Python).
  • SSH-based. Every Hopbox device is reachable over its WireGuard tunnel via SSH. No additional ports, no additional daemons.
  • Idempotent (mostly). Ansible’s declarative model means we can re-run playbooks safely. In practice, idempotency on OpenWrt requires some care — more on that below.
  • Ecosystem. The ansible.netcommon and community.general collections provide modules for network config templating, file management, and service control that work well on Linux-based network devices.

Static inventory files don’t work when devices are provisioned and decommissioned regularly. We wrote a custom dynamic inventory script that queries the Hopbox Cloud API:

#!/usr/bin/env python3
"""Ansible dynamic inventory from Hopbox API."""
import json
import os

import requests

HOPBOX_API = os.environ.get("HOPBOX_API_URL", "https://api.hopbox.net")
HOPBOX_TOKEN = os.environ["HOPBOX_API_TOKEN"]


def get_inventory():
    headers = {"Authorization": f"Bearer {HOPBOX_TOKEN}"}
    resp = requests.get(f"{HOPBOX_API}/v1/devices", headers=headers, timeout=30)
    resp.raise_for_status()
    devices = resp.json()["devices"]

    inventory = {"_meta": {"hostvars": {}}, "all": {"children": []}}

    # Group by region, collecting per-host connection vars as we go
    regions = {}
    for device in devices:
        region = device["region"]
        if region not in regions:
            regions[region] = []
        regions[region].append(device)
        inventory["_meta"]["hostvars"][device["hostname"]] = {
            "ansible_host": device["tunnel_ip"],
            "ansible_user": "root",
            "ansible_ssh_private_key_file": "/etc/ansible/keys/hopbox-automation",
            "site_id": device["site_id"],
            "wan_links": device["wan_links"],
            "firmware_version": device["firmware_version"],
            "hardware_model": device["hardware_model"],
        }

    for region, region_devices in regions.items():
        group_name = f"region_{region}"
        inventory[group_name] = {
            "hosts": [d["hostname"] for d in region_devices]
        }
        inventory["all"]["children"].append(group_name)

    return inventory


if __name__ == "__main__":
    print(json.dumps(get_inventory(), indent=2))

This gives us groups like region_north, region_south, etc., with per-host variables including tunnel IP, site ID, current firmware version, and hardware model.
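To make the output shape concrete, here is the grouping logic in isolation; the device records below are hypothetical stand-ins for the API response:

```python
# Standalone sketch of the inventory grouping; device records are made up.
devices = [
    {"hostname": "cpe-del-001", "region": "north", "tunnel_ip": "10.77.0.11"},
    {"hostname": "cpe-blr-002", "region": "south", "tunnel_ip": "10.77.0.42"},
    {"hostname": "cpe-del-003", "region": "north", "tunnel_ip": "10.77.0.57"},
]

inventory = {"_meta": {"hostvars": {}}, "all": {"children": []}}
for d in devices:
    group = f"region_{d['region']}"
    # Each region becomes a group; hosts are keyed by hostname
    inventory.setdefault(group, {"hosts": []})["hosts"].append(d["hostname"])
    if group not in inventory["all"]["children"]:
        inventory["all"]["children"].append(group)
    inventory["_meta"]["hostvars"][d["hostname"]] = {"ansible_host": d["tunnel_ip"]}

print(inventory["all"]["children"])        # ['region_north', 'region_south']
print(inventory["region_north"]["hosts"])  # ['cpe-del-001', 'cpe-del-003']
```

Running `ansible-inventory -i inventory/hopbox_inventory.py --list` prints the same structure for the real fleet, which is a quick way to sanity-check the script.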

Our Ansible repository is structured as:

ansible/
  inventory/
    hopbox_inventory.py
  playbooks/
    firmware-upgrade.yml
    config-push.yml
    dns-zone-sync.yml
    wireguard-rekey.yml
    diagnostics.yml
  roles/
    hopbox-base/
    hopbox-wan/
    hopbox-dns/
    hopbox-tunnel/
  templates/
    network.j2
    dhcp.j2
    firewall.j2
    wireguard.j2
  group_vars/
    all.yml
    region_north.yml
    region_south.yml
  ansible.cfg

Firmware upgrades are the highest-risk operation. A bad firmware push to 900 devices simultaneously would be catastrophic. We use a staged rolling update strategy:

playbooks/firmware-upgrade.yml
---
- name: Hopbox firmware upgrade (rolling)
  hosts: "{{ target_group | default('canary') }}"
  serial: "{{ batch_size | default(10) }}"
  max_fail_percentage: 5
  gather_facts: false
  vars:
    firmware_url: "https://releases.hopbox.net/firmware/{{ firmware_version }}/hopbox-{{ hardware_model }}.img.gz"
    firmware_checksum: "sha256:{{ firmware_sha256 }}"
  tasks:
    - name: Check current firmware version
      ansible.builtin.command: cat /etc/hopbox-version
      register: current_version
      changed_when: false

    - name: Skip if already on target version
      ansible.builtin.meta: end_host
      when: current_version.stdout == firmware_version

    - name: Verify available disk space before download
      ansible.builtin.shell: |
        available=$(df /tmp | tail -1 | awk '{print $4}')
        if [ "$available" -lt 102400 ]; then
          echo "INSUFFICIENT_SPACE"
          exit 1
        fi
      changed_when: false

    - name: Download firmware image
      ansible.builtin.get_url:
        url: "{{ firmware_url }}"
        dest: /tmp/firmware.img.gz
        checksum: "{{ firmware_checksum }}"
        timeout: 300

    - name: Apply firmware via sysupgrade
      ansible.builtin.command: sysupgrade -v /tmp/firmware.img.gz
      async: 300
      poll: 0
      register: sysupgrade_job

    - name: Wait for device to come back online
      ansible.builtin.wait_for_connection:
        delay: 60
        timeout: 300

    - name: Verify new firmware version
      ansible.builtin.command: cat /etc/hopbox-version
      register: new_version
      changed_when: false
      failed_when: new_version.stdout != firmware_version

    - name: Run post-upgrade health check
      ansible.builtin.uri:
        url: "http://{{ ansible_host }}:8080/health"
        return_content: true
      register: health
      until: health.json is defined and health.json.status == "ok"
      retries: 3
      delay: 10

The rollout process:

  1. Canary group (5 devices across different regions) — deploy, monitor for 24 hours.
  2. Early adopters (50 devices) — deploy in batches of 10, with max_fail_percentage: 5.
  3. Full fleet — deploy in batches of 20-50, monitored via Grafana dashboards.

If any batch exceeds the failure threshold, Ansible halts and we investigate.
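In practice, each stage is the same playbook run with different extra-vars. The invocations below are illustrative: the version number is made up, and `FW_SHA256` stands in for however you look up the release checksum.

```sh
# Stage 1: canary (the playbook defaults to target_group=canary)
ansible-playbook playbooks/firmware-upgrade.yml \
  -e firmware_version=2.4.1 -e firmware_sha256="$FW_SHA256" -e batch_size=5

# Stage 3: full fleet, larger batches
ansible-playbook playbooks/firmware-upgrade.yml \
  -e target_group=all -e firmware_version=2.4.1 \
  -e firmware_sha256="$FW_SHA256" -e batch_size=50
```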

OpenWrt uses UCI (Unified Configuration Interface) for configuration. We template the key config files using Jinja2:

{# templates/network.j2 #}
config interface 'loopback'
    option device 'lo'
    option proto 'static'
    option ipaddr '127.0.0.1'
    option netmask '255.0.0.0'

config interface 'lan'
    option device 'br-lan'
    option proto 'static'
    option ipaddr '{{ lan_ip | default("192.168.1.1") }}'
    option netmask '{{ lan_netmask | default("255.255.255.0") }}'

{% for link in wan_links %}
config interface 'wan{{ loop.index0 }}'
    option device '{{ link.device }}'
    option proto '{{ link.proto }}'
{% if link.proto == 'static' %}
    option ipaddr '{{ link.ipaddr }}'
    option netmask '{{ link.netmask }}'
    option gateway '{{ link.gateway }}'
{% endif %}
    option metric '{{ link.metric | default(loop.index0 * 10) }}'
{% endfor %}
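The wan_links loop can be exercised off-device with plain Jinja2, which is handy when reviewing template changes. The sample link data below is invented; the point is to show the default-metric behavior:

```python
# Render just the wan_links portion of the template with sample data.
from jinja2 import Template

tmpl = Template(
    "{% for link in wan_links %}"
    "config interface 'wan{{ loop.index0 }}'\n"
    "    option device '{{ link.device }}'\n"
    "    option proto '{{ link.proto }}'\n"
    "    option metric '{{ link.metric | default(loop.index0 * 10) }}'\n\n"
    "{% endfor %}"
)

out = tmpl.render(wan_links=[
    {"device": "eth1", "proto": "dhcp"},                # no metric: default 0
    {"device": "wwan0", "proto": "dhcp", "metric": 5},  # explicit metric wins
])
print(out)
```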

The corresponding playbook pushes the template and restarts networking:

playbooks/config-push.yml
---
- name: Push network configuration
  hosts: "{{ target_group }}"
  serial: 20
  gather_facts: false
  tasks:
    - name: Template network configuration
      ansible.builtin.template:
        src: templates/network.j2
        dest: /etc/config/network
        mode: '0644'
      register: network_config

    - name: Template firewall configuration
      ansible.builtin.template:
        src: templates/firewall.j2
        dest: /etc/config/firewall
        mode: '0644'
      register: firewall_config

    - name: Restart networking if config changed
      ansible.builtin.command: /etc/init.d/network restart
      async: 60
      poll: 0
      when: network_config.changed

    - name: Wait for connectivity after restart
      ansible.builtin.wait_for_connection:
        delay: 10
        timeout: 120
      when: network_config.changed

    - name: Reload firewall if config changed
      ansible.builtin.command: /etc/init.d/firewall reload
      when: firewall_config.changed

Every Hopbox device runs a local PowerDNS Recursor for DNS resolution. We sync zone overrides and blocklists via Ansible:

playbooks/dns-zone-sync.yml
---
- name: Sync DNS zone configuration
  hosts: all
  serial: 50
  gather_facts: false
  tasks:
    - name: Sync forward zone overrides
      ansible.builtin.copy:
        src: "files/dns/forward-zones.conf"
        dest: /etc/pdns-recursor/forward-zones.conf
        mode: '0644'
      register: forward_zones

    - name: Sync blocklist
      ansible.builtin.copy:
        src: "files/dns/blocklist.lua"
        dest: /etc/pdns-recursor/blocklist.lua
        mode: '0644'
      register: blocklist

    - name: Restart recursor if config changed
      ansible.builtin.command: /etc/init.d/pdns-recursor restart
      when: forward_zones.changed or blocklist.changed

Ansible’s strength is idempotency, but OpenWrt presents some challenges:

  1. UCI vs flat files. OpenWrt’s uci commands are the “correct” way to manage configuration, but Ansible’s template module writes flat files. We chose flat files because they’re easier to template and diff, but this means we bypass UCI’s internal state. We run uci commit as a post-task to keep UCI in sync.

  2. No systemd. OpenWrt uses procd and init.d scripts, not systemd. The standard Ansible service module can usually drive them, but we often fall back to command for reliability.

  3. Minimal Python. Some of our older devices don’t have Python installed. For those, we use ansible.builtin.raw for basic commands and ensure Python is installed as a bootstrap step.

  4. Package state. opkg (OpenWrt’s package manager) doesn’t have robust state management. We maintain a list of required packages and use a simple shell task to install missing ones rather than relying on the opkg module.
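The package-state task in (4) looks roughly like this. It is a sketch, not our exact script: `required_packages` is an assumed list variable from group_vars, and the `list-installed` grep is the simplest check that works with opkg's output format.

```yaml
- name: Ensure required packages are installed via opkg
  ansible.builtin.shell: |
    opkg update >/dev/null
    missing=""
    for pkg in {{ required_packages | join(' ') }}; do
      opkg list-installed | grep -q "^${pkg} " || missing="$missing $pkg"
    done
    if [ -n "$missing" ]; then
      opkg install $missing
      echo "INSTALLED:$missing"
    fi
  register: opkg_result
  changed_when: "'INSTALLED:' in opkg_result.stdout"
```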

Network automation failures can take a site offline. Our safety nets:

  • Pre-flight checks. Every playbook starts with connectivity verification and a config backup.
  • Config backups. Before any config change, we archive /etc/config/ to a timestamped tarball on the device and pull a copy to our central backup store.
  • Automatic rollback. For critical config changes (network, firewall), we use a “dead man’s switch” — a cron job scheduled 5 minutes in the future that restores the backup. The playbook cancels the cron job only after verifying connectivity post-change.
  • max_fail_percentage. Every playbook that touches the fleet has a failure threshold. If too many devices fail, the run stops.
# Rollback dead man's switch pattern
# (ansible_date_time requires gathered facts, or a setup call, on this play)
- name: Schedule automatic rollback in 5 minutes
  ansible.builtin.cron:
    name: "config-rollback"
    minute: "{{ '%M' | strftime(ansible_date_time.epoch | int + 300) }}"
    hour: "{{ '%H' | strftime(ansible_date_time.epoch | int + 300) }}"
    job: "/usr/local/bin/hopbox-config-restore.sh"
    state: present

# ... apply config changes ...

- name: Cancel rollback after successful verification
  ansible.builtin.cron:
    name: "config-rollback"
    state: absent

We don’t run playbooks directly from laptops. All automation goes through a CI pipeline:

  1. Lint. ansible-lint checks for best practices and common mistakes.
  2. Dry run. --check --diff against a test group of devices.
  3. Canary deploy. Apply to canary devices, run health checks.
  4. Approval gate. Manual approval required before fleet-wide deployment.
  5. Rolling deploy. Apply to the fleet in batches with monitoring.
# .gitlab-ci.yml (simplified)
stages:
  - lint
  - dry-run
  - canary
  - deploy

lint:
  stage: lint
  script:
    - ansible-lint playbooks/

dry-run:
  stage: dry-run
  script:
    - ansible-playbook playbooks/$PLAYBOOK --check --diff -l test_devices

canary:
  stage: canary
  script:
    - ansible-playbook playbooks/$PLAYBOOK -l canary
    - ./scripts/run-health-checks.sh canary
  when: manual

deploy:
  stage: deploy
  script:
    - ansible-playbook playbooks/$PLAYBOOK -l all --forks 50
  when: manual
A few rules we now treat as non-negotiable:

  1. Never push to the entire fleet at once. Always use serial and max_fail_percentage. Always.
  2. Test on real hardware. VMs don’t catch OpenWrt-specific issues. We keep a rack of test devices that mirror production hardware.
  3. Backup before every change. Disk is cheap. Downtime isn’t.
  4. Ansible vault for secrets. WireGuard keys, API tokens, and SSH keys are all managed via Ansible Vault. No plaintext secrets in the repository.
  5. Keep playbooks simple. Complex logic belongs in scripts on the device, not in 200-line Jinja2 templates. Ansible should orchestrate, not compute.

Ansible isn’t a silver bullet for network automation, especially on embedded Linux. But its agentless model, SSH-based execution, and declarative approach make it the right tool for managing a fleet of OpenWrt-based SD-WAN devices at our scale. The key is investing in guardrails — rolling updates, automatic rollbacks, health checks, and CI pipelines — so that automation accelerates your operations without amplifying your mistakes.
