Running PowerDNS at Scale: Serving 240k Queries/Hour Across 900+ Sites
We run the DNS infrastructure at Hopbox on PowerDNS — both the Authoritative Server and the Recursor. Across 900+ customer sites, we handle around 240,000 queries per hour (roughly 67 queries per second). That number sounds large, but in the DNS world it’s honestly modest. What makes it interesting is the operational side: keeping it reliable, observable, and fast.
This post is a look at how we run PowerDNS in production — the architecture, the monitoring, the tuning, and the incidents that taught us things the docs didn’t cover.
Architecture: Authoritative + Recursor Split
We run the Authoritative Server and the Recursor as separate processes. This is the recommended PowerDNS deployment model, and for good reason — they have fundamentally different jobs.
The Authoritative Server (pdns) answers queries for zones we own. It’s backed by a PostgreSQL database via the gpgsql backend, which stores all zone data. Zone updates come through our internal API and are written directly to the database.
The Recursor (pdns-recursor) handles recursive resolution for our infrastructure and customer services that need outbound DNS. It forwards to upstream resolvers for zones we don’t own, and is configured to query our own authoritative server for our zones.
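For illustration, the recursor side of that split can be configured roughly like this — a sketch using old-style recursor.conf settings, where the addresses and zone names are placeholders, not our real config:

```
# recursor.conf (sketch — addresses and zone names are placeholders)
local-port=5353
# Queries for zones we own go straight to our authoritative server
forward-zones=example-customer.com=203.0.113.5:53
# Everything else is forwarded to the upstream resolvers
forward-zones-recurse=.=198.51.100.53
```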
```
                    ┌─────────────────┐
Internet ──────────>│  pdns (auth)    │──── PostgreSQL
                    │  port 53        │
                    └─────────────────┘

                    ┌─────────────────┐
Internal  ─────────>│  pdns-recursor  │──── Upstream resolvers
services            │  port 5353      │──── pdns (auth) for local zones
                    └─────────────────┘
```

Both run on dedicated hosts. We don’t colocate DNS with application workloads — a noisy neighbor causing CPU spikes is the last thing you want affecting DNS latency.
Lua Records for Health-Check-Based Responses
One of PowerDNS’s more powerful features is Lua records. These let you write small Lua scripts that run at query time to generate dynamic responses. We use them for basic health-check-based DNS — if a backend is down, its IP gets pulled from the response.
Here’s a simplified version of what a Lua A record looks like in our setup:
```lua
-- Lua record content for a load-balanced A record
ifportup(443, {'203.0.113.10', '203.0.113.11', '203.0.113.12'},
         {selector='random', backupSelector='all'})
```

This checks port 443 on each IP and only returns the healthy ones. If all backends fail the health check, it falls back to returning all of them (the `backupSelector='all'` bit) — the idea being that a maybe-working answer is better than no answer.
You can verify the behavior with dig:
```
$ dig @ns1.hopbox.net lb.example-customer.com A +short
203.0.113.10
203.0.113.12
```

In this case, `.11` is down and got excluded. A minute later when it recovers:
```
$ dig @ns1.hopbox.net lb.example-customer.com A +short
203.0.113.10
203.0.113.11
203.0.113.12
```

The caveat: Lua records add latency. The health checks run asynchronously in the background, but the Lua execution itself has overhead. For high-QPS names, we set short TTLs (30-60 seconds) and rely on downstream caching rather than running the Lua logic on every query.
Monitoring with Prometheus
PowerDNS has solid built-in metrics exposed via its HTTP API. We scrape these with Prometheus and visualize in Grafana. Here are the metrics we actually alert on:
Key Authoritative Metrics
```yaml
# prometheus alert rules (simplified)
groups:
  - name: pdns-auth
    rules:
      - alert: HighServfailRate
        expr: rate(pdns_auth_servfail_answers_total[5m]) / rate(pdns_auth_queries_total[5m]) > 0.01
        for: 5m
        annotations:
          summary: "SERVFAIL rate above 1%"

      - alert: PacketCacheHitRateLow
        expr: rate(pdns_auth_packetcache_hits_total[5m]) / (rate(pdns_auth_packetcache_hits_total[5m]) + rate(pdns_auth_packetcache_misses_total[5m])) < 0.5
        for: 10m
        annotations:
          summary: "Packet cache hit rate below 50%"

      - alert: QueryRateSpike
        expr: rate(pdns_auth_queries_total[5m]) > 200
        for: 2m
        annotations:
          summary: "Query rate exceeding 200 qps"
```

The SERVFAIL rate is the most important one. A spike in SERVFAILs almost always means something is broken — a backend database issue, a corrupt zone, or a Lua record error. We keep this under 0.1% in normal operation.
Key Recursor Metrics
For the recursor, cache hit ratio is king:
```
$ rec_control get cache-hits cache-misses
cache-hits	1847293
cache-misses	312847
```

That’s roughly an 85% cache hit ratio, which is healthy. If this drops below 70%, something is wrong — usually a TTL configuration issue or a cache size problem.
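Turning those counters into a ratio is trivial, but worth scripting if you check it often. A minimal sketch, using the numbers from the output above:

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of lookups served from cache (0.0 when there is no traffic)."""
    total = hits + misses
    return hits / total if total else 0.0

# Values from the rec_control output above
ratio = cache_hit_ratio(1847293, 312847)
print(f"{ratio:.1%}")  # 85.5%
```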
Zone Transfer Debugging
We use AXFR (full zone transfers) to replicate zones to secondary nameservers. When zone transfers break, the debugging process usually starts with dig:
```
$ dig @ns1.hopbox.net example-customer.com AXFR

; <<>> DiG 9.18.28 <<>> @ns1.hopbox.net example-customer.com AXFR
;; global options: +cmd
example-customer.com.       3600  IN  SOA  ns1.hopbox.net. hostmaster.hopbox.net. 2026032601 3600 900 604800 86400
example-customer.com.       3600  IN  NS   ns1.hopbox.net.
example-customer.com.       3600  IN  NS   ns2.hopbox.net.
example-customer.com.       300   IN  A    203.0.113.50
example-customer.com.       3600  IN  MX   10 mail.example-customer.com.
mail.example-customer.com.  3600  IN  A    203.0.113.51
example-customer.com.       3600  IN  SOA  ns1.hopbox.net. hostmaster.hopbox.net. 2026032601 3600 900 604800 86400
```

If AXFR fails, it’s usually one of three things:
- ACL misconfiguration — the secondary’s IP isn’t in `allow-axfr-ips`.
- Serial number not incremented — PowerDNS uses NOTIFY + SOA serial checks to trigger transfers. If you update records without bumping the serial, secondaries won’t pull.
- Network issues — AXFR uses TCP, and sometimes firewall rules only allow UDP/53.
The third one catches people more often than you’d think. DNS is “UDP port 53” in everyone’s mental model, but zone transfers, large responses, and increasingly normal queries all use TCP. If your firewall rules don’t allow TCP/53, you will have a bad time.
```
# Quick check if TCP zone transfer works
$ dig @ns2.hopbox.net example-customer.com SOA +tcp +short
ns1.hopbox.net. hostmaster.hopbox.net. 2026032601 3600 900 604800 86400
```

Performance Tuning

PowerDNS has two caching layers on the authoritative side that matter most:
Packet Cache
The packet cache stores raw DNS response packets keyed by the query. If the exact same question comes in again, PowerDNS returns the cached packet without touching the backend at all. This is the single biggest performance lever.
```
# pdns.conf
packet-cache-ttl=60
max-packet-cache-entries=1000000
```

We set the packet cache TTL to 60 seconds. This means even records with a 300-second TTL get re-queried from the backend every 60 seconds, but the packet cache handles the thundering herd in between. For our workload, this is the right tradeoff between freshness and performance.
Query Cache
The query cache sits between the packet cache and the backend. It caches the results of backend lookups (the SQL queries against PostgreSQL). This helps when the packet cache misses but the backend data hasn’t changed.
```
# pdns.conf
query-cache-ttl=20
max-cache-entries=1000000
```

We keep this TTL lower than the packet cache — 20 seconds. The idea is that the query cache should be a safety net, not the primary caching layer.
Measuring the Impact
You can check cache effectiveness at runtime:
```
$ pdns_control ccounts
packetcache-hit=1523847
packetcache-miss=287492
query-cache-hit=198273
query-cache-miss=89219
```

If your packet cache hit rate is below 80%, either your traffic has very high cardinality (lots of unique queries) or your cache TTL is too low.
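When we want this outside of Prometheus, a tiny parser over that key=value output does the job. A sketch, assuming the output format shown above (the sample counters are the ones from our output):

```python
def parse_ccounts(output: str) -> dict[str, int]:
    """Parse key=value lines (as printed above) into a dict of counters."""
    counts = {}
    for line in output.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            counts[key.strip()] = int(value.strip())
    return counts

def hit_rate(counts: dict[str, int], prefix: str) -> float:
    """Hit rate for a cache whose counters are <prefix>-hit / <prefix>-miss."""
    hits = counts.get(f"{prefix}-hit", 0)
    misses = counts.get(f"{prefix}-miss", 0)
    total = hits + misses
    return hits / total if total else 0.0

sample = """\
packetcache-hit=1523847
packetcache-miss=287492
query-cache-hit=198273
query-cache-miss=89219
"""
counts = parse_ccounts(sample)
print(f"packet cache: {hit_rate(counts, 'packetcache'):.1%}")  # packet cache: 84.1%
```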
Incident: The SOA Serial Overflow
This one was fun. A customer’s zone had a SOA serial that was close to the 32-bit unsigned integer maximum (4294967295). An automated system was incrementing the serial on every record change, and it was burning through serial numbers fast.
When it overflowed, the serial wrapped to 0. Our secondary nameservers saw the serial go from ~4.3 billion to 0 and concluded the zone had been reset. Per RFC 1982 serial number arithmetic, this is technically handled — but not all implementations agree on what “handled” means.
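RFC 1982 compares serials with wraparound arithmetic: i is "less than" j when (j - i) mod 2^32 falls strictly between 0 and 2^31. A quick sketch of that rule shows why a wrap past the maximum is, on paper, just another increment:

```python
SERIAL_SPACE = 2**32  # SOA serials are 32-bit unsigned
HALF = 2**31

def serial_lt(i: int, j: int) -> bool:
    """RFC 1982 serial number comparison: is serial i 'less than' serial j?"""
    return 0 < (j - i) % SERIAL_SPACE < HALF

# Wrapping from the maximum to 0 is a forward step...
assert serial_lt(4294967295, 0)
# ...and the reverse direction is not:
assert not serial_lt(0, 4294967295)
# Equal serials are neither less nor greater:
assert not serial_lt(5, 5)
```

The catch is that "forward" only means a jump of less than 2^31, and real implementations handle the edge cases differently, which is why we prefer serials that only ever move in small steps.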
The fix was straightforward: reset the serial to a date-based format (YYYYMMDDNN) and configure our tooling to use that scheme going forward. But the debugging took a while because the symptoms were subtle — intermittent resolution failures on the secondary, but only for this one zone.
```
# What the SOA looked like after the fix
$ dig @ns1.hopbox.net affected-zone.com SOA +short
ns1.hopbox.net. hostmaster.hopbox.net. 2026032601 3600 900 604800 86400
```

Incident: Lua Record Timeout Cascade

We had a Lua record that checked backend health via TCP port probes. One day, a network issue caused the port probes to start timing out instead of failing fast. The default timeout was 2 seconds per probe, and the Lua record checked 5 backends sequentially.
That’s 10 seconds of blocking per query. The thread pool backed up, legitimate queries started queuing, and within minutes our response latency went through the roof.
The fix was two-fold:
- Reduced the health check timeout to 500ms
- Configured the probes to run in parallel rather than sequentially
```lua
-- After: parallel checks with shorter timeout
ifportup(443, {'203.0.113.10', '203.0.113.11', '203.0.113.12'},
         {selector='random', backupSelector='all', timeout=500})
```

The broader lesson: any blocking I/O in the DNS query path is a risk. DNS is expected to be fast — clients typically wait 1-3 seconds before retrying with a different resolver. If your authoritative server takes 10 seconds to respond, the client has already moved on.
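The same pattern — parallel probes with a hard cap on wall time — is worth having outside of Lua too, for standalone health checks. A hypothetical Python sketch (the helper names are made up, not part of PowerDNS):

```python
import socket
from concurrent.futures import ThreadPoolExecutor

def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """TCP connect probe; returns within `timeout` seconds, success or not."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def healthy_backends(hosts, port=443, timeout=0.5):
    """Probe all backends concurrently, so wall time is ~one timeout, not N."""
    if not hosts:
        return []
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        up = list(pool.map(lambda h: port_open(h, port, timeout), hosts))
    return [h for h, ok in zip(hosts, up) if ok]
```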
PowerDNS has been solid for us. It’s not the flashiest DNS server, but it’s well-documented, has good operational tooling, and the Lua records feature is genuinely useful for dynamic DNS use cases. If you’re evaluating DNS server software for a production deployment, it’s worth a serious look.