Running PowerDNS at Scale: Serving 240k Queries/Hour Across 900+ Sites
We run the DNS infrastructure at Hopbox on PowerDNS — both the Authoritative Server and the Recursor. Across 900+ customer sites, we handle around 240,000 queries per hour (roughly 67 queries per second). That number sounds large, but in the DNS world it’s honestly modest. What makes it interesting is the operational side: keeping it reliable, observable, and fast.
This post is a look at how we run PowerDNS in production — the architecture, the monitoring, the tuning, and the incidents that taught us things the docs didn’t cover.
Architecture: Authoritative + Recursor Split
We run the Authoritative Server and the Recursor as separate processes. This is the recommended PowerDNS deployment model, and for good reason — they have fundamentally different jobs.
The Authoritative Server (pdns) answers queries for zones we own. It’s backed by a PostgreSQL database via the gpgsql backend, which stores all zone data. Zone updates come through our internal API and are written directly to the database.
The Recursor (pdns-recursor) handles recursive resolution for our infrastructure and customer services that need outbound DNS. It forwards to upstream resolvers for zones we don’t own, and is configured to query our own authoritative server for our zones.
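For illustration, the recursor side of that split can be configured roughly like this — a sketch using old-style recursor.conf settings, where the addresses and zone names are placeholders, not our real config:

```
# recursor.conf (sketch — addresses and zone names are placeholders)
local-port=5353
# Queries for zones we own go straight to our authoritative server
forward-zones=example-customer.com=203.0.113.5:53
# Everything else is forwarded to the upstream resolvers
forward-zones-recurse=.=198.51.100.53
```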
```
                    ┌─────────────────┐
Internet ──────────>│  pdns (auth)    │──── PostgreSQL
                    │  port 53        │
                    └─────────────────┘

                    ┌─────────────────┐
Internal  ─────────>│  pdns-recursor  │──── Upstream resolvers
services            │  port 5353      │──── pdns (auth) for local zones
                    └─────────────────┘
```

Both run on dedicated hosts. We don’t colocate DNS with application workloads — a noisy neighbor causing CPU spikes is the last thing you want affecting DNS latency.
Lua Records for Health-Check-Based Responses
One of PowerDNS’s more powerful features is Lua records. These let you write small Lua scripts that run at query time to generate dynamic responses. We use them for basic health-check-based DNS — if a backend is down, its IP gets pulled from the response.
Here’s a simplified version of what a Lua A record looks like in our setup:
```lua
-- Lua record content for a load-balanced A record
ifportup(443, {'203.0.113.10', '203.0.113.11', '203.0.113.12'},
         {selector='random', backupSelector='all'})
```

This checks port 443 on each IP and only returns the healthy ones. If all backends fail the health check, it falls back to returning all of them (the `backupSelector='all'` bit) — the idea being that a maybe-working answer is better than no answer.
You can verify the behavior with dig:
```
$ dig @ns1.hopbox.net lb.example-customer.com A +short
203.0.113.10
203.0.113.12
```

In this case, `.11` is down and got excluded. A minute later when it recovers:
```
$ dig @ns1.hopbox.net lb.example-customer.com A +short
203.0.113.10
203.0.113.11
203.0.113.12
```

The caveat: Lua records add latency. The health checks run asynchronously in the background, but the Lua execution itself has overhead. For high-QPS names, we set short TTLs (30-60 seconds) and rely on downstream caching rather than running the Lua logic on every query.
Monitoring with Prometheus
PowerDNS has solid built-in metrics exposed via its HTTP API. We scrape these with Prometheus and visualize in Grafana. Here are the metrics we actually alert on:
Key Authoritative Metrics
```yaml
# prometheus alert rules (simplified)
groups:
  - name: pdns-auth
    rules:
      - alert: HighServfailRate
        expr: rate(pdns_auth_servfail_answers_total[5m]) / rate(pdns_auth_queries_total[5m]) > 0.01
        for: 5m
        annotations:
          summary: "SERVFAIL rate above 1%"

      - alert: PacketCacheHitRateLow
        expr: rate(pdns_auth_packetcache_hits_total[5m]) / (rate(pdns_auth_packetcache_hits_total[5m]) + rate(pdns_auth_packetcache_misses_total[5m])) < 0.5
        for: 10m
        annotations:
          summary: "Packet cache hit rate below 50%"

      - alert: QueryRateSpike
        expr: rate(pdns_auth_queries_total[5m]) > 200
        for: 2m
        annotations:
          summary: "Query rate exceeding 200 qps"
```

The SERVFAIL rate is the most important one. A spike in SERVFAILs almost always means something is broken — a backend database issue, a corrupt zone, or a Lua record error. We keep this under 0.1% in normal operation.
Key Recursor Metrics
For the recursor, cache hit ratio is king:
```
$ rec_control get cache-hits cache-misses
cache-hits	1847293
cache-misses	312847
```

That’s roughly an 85% cache hit ratio, which is healthy. If this drops below 70%, something is wrong — usually a TTL configuration issue or a cache size problem.
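Turning those counters into a ratio is trivial, but worth scripting if you check it often. A minimal sketch, using the numbers from the output above:

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of lookups served from cache (0.0 when there is no traffic)."""
    total = hits + misses
    return hits / total if total else 0.0

# Values from the rec_control output above
ratio = cache_hit_ratio(1847293, 312847)
print(f"{ratio:.1%}")  # 85.5%
```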
Zone Transfer Debugging
We use AXFR (full zone transfers) to replicate zones to secondary nameservers. When zone transfers break, the debugging process usually starts with dig:
```
$ dig @ns1.hopbox.net example-customer.com AXFR

; <<>> DiG 9.18.28 <<>> @ns1.hopbox.net example-customer.com AXFR
;; global options: +cmd
example-customer.com.       3600  IN  SOA  ns1.hopbox.net. hostmaster.hopbox.net. 2026032601 3600 900 604800 86400
example-customer.com.       3600  IN  NS   ns1.hopbox.net.
example-customer.com.       3600  IN  NS   ns2.hopbox.net.
example-customer.com.       300   IN  A    203.0.113.50
example-customer.com.       3600  IN  MX   10 mail.example-customer.com.
mail.example-customer.com.  3600  IN  A    203.0.113.51
example-customer.com.       3600  IN  SOA  ns1.hopbox.net. hostmaster.hopbox.net. 2026032601 3600 900 604800 86400
```

If AXFR fails, it’s usually one of three things:
- ACL misconfiguration — the secondary’s IP isn’t in `allow-axfr-ips`.
- Serial number not incremented — PowerDNS uses NOTIFY + SOA serial checks to trigger transfers. If you update records without bumping the serial, secondaries won’t pull.
- Network issues — AXFR uses TCP, and sometimes firewall rules only allow UDP/53.
The third one catches people more often than you’d think. DNS is “UDP port 53” in everyone’s mental model, but zone transfers, large responses, and increasingly normal queries all use TCP. If your firewall rules don’t allow TCP/53, you will have a bad time.
```
# Quick check if TCP zone transfer works
$ dig @ns2.hopbox.net example-customer.com SOA +tcp +short
ns1.hopbox.net. hostmaster.hopbox.net. 2026032601 3600 900 604800 86400
```

Performance Tuning

PowerDNS has two caching layers on the authoritative side that matter most:
Packet Cache
The packet cache stores raw DNS response packets keyed by the query. If the exact same question comes in again, PowerDNS returns the cached packet without touching the backend at all. This is the single biggest performance lever.
```
# pdns.conf
packet-cache-ttl=60
max-packet-cache-entries=1000000
```

We set the packet cache TTL to 60 seconds. This means even records with a 300-second TTL get re-queried from the backend every 60 seconds, but the packet cache handles the thundering herd in between. For our workload, this is the right tradeoff between freshness and performance.
Query Cache
The query cache sits between the packet cache and the backend. It caches the results of backend lookups (the SQL queries against PostgreSQL). This helps when the packet cache misses but the backend data hasn’t changed.
```
# pdns.conf
query-cache-ttl=20
max-cache-entries=1000000
```

We keep this TTL lower than the packet cache — 20 seconds. The idea is that the query cache should be a safety net, not the primary caching layer.
Measuring the Impact
You can check cache effectiveness at runtime:
```
$ pdns_control ccounts
packetcache-hit=1523847
packetcache-miss=287492
query-cache-hit=198273
query-cache-miss=89219
```

If your packet cache hit rate is below 80%, either your traffic has very high cardinality (lots of unique queries) or your cache TTL is too low.
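When we want this outside of Prometheus, a tiny parser over that key=value output does the job. A sketch, assuming the output format shown above (the sample counters are the ones from our output):

```python
def parse_ccounts(output: str) -> dict[str, int]:
    """Parse key=value lines (as printed above) into a dict of counters."""
    counts = {}
    for line in output.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            counts[key.strip()] = int(value.strip())
    return counts

def hit_rate(counts: dict[str, int], prefix: str) -> float:
    """Hit rate for a cache whose counters are <prefix>-hit / <prefix>-miss."""
    hits = counts.get(f"{prefix}-hit", 0)
    misses = counts.get(f"{prefix}-miss", 0)
    total = hits + misses
    return hits / total if total else 0.0

sample = """\
packetcache-hit=1523847
packetcache-miss=287492
query-cache-hit=198273
query-cache-miss=89219
"""
counts = parse_ccounts(sample)
print(f"packet cache: {hit_rate(counts, 'packetcache'):.1%}")  # packet cache: 84.1%
```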
Incident: The SOA Serial Overflow
This one was fun. A customer’s zone had a SOA serial that was close to the 32-bit unsigned integer maximum (4294967295). An automated system was incrementing the serial on every record change, and it was burning through serial numbers fast.
When it overflowed, the serial wrapped to 0. Our secondary nameservers saw the serial go from ~4.3 billion to 0 and concluded the zone had been reset. Per RFC 1982 serial number arithmetic, this is technically handled — but not all implementations agree on what “handled” means.
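RFC 1982 compares serials with wraparound arithmetic: i is "less than" j when (j - i) mod 2^32 falls strictly between 0 and 2^31. A quick sketch of that rule shows why a wrap past the maximum is, on paper, just another increment:

```python
SERIAL_SPACE = 2**32  # SOA serials are 32-bit unsigned
HALF = 2**31

def serial_lt(i: int, j: int) -> bool:
    """RFC 1982 serial number comparison: is serial i 'less than' serial j?"""
    return 0 < (j - i) % SERIAL_SPACE < HALF

# Wrapping from the maximum to 0 is a forward step...
assert serial_lt(4294967295, 0)
# ...and the reverse direction is not:
assert not serial_lt(0, 4294967295)
# Equal serials are neither less nor greater:
assert not serial_lt(5, 5)
```

The catch is that "forward" only means a jump of less than 2^31, and real implementations handle the edge cases differently, which is why we prefer serials that only ever move in small steps.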
The fix was straightforward: reset the serial to a date-based format (YYYYMMDDNN) and configure our tooling to use that scheme going forward. But the debugging took a while because the symptoms were subtle — intermittent resolution failures on the secondary, but only for this one zone.
```
# What the SOA looked like after the fix
$ dig @ns1.hopbox.net affected-zone.com SOA +short
ns1.hopbox.net. hostmaster.hopbox.net. 2026032601 3600 900 604800 86400
```

Incident: Lua Record Timeout Cascade

We had a Lua record that checked backend health via TCP port probes. One day, a network issue caused the port probes to start timing out instead of failing fast. The default timeout was 2 seconds per probe, and the Lua record checked 5 backends sequentially.
That’s 10 seconds of blocking per query. The thread pool backed up, legitimate queries started queuing, and within minutes our response latency went through the roof.
The fix was two-fold:
- Reduced the health check timeout to 500ms
- Configured the probes to run in parallel rather than sequentially
```lua
-- After: parallel checks with shorter timeout
ifportup(443, {'203.0.113.10', '203.0.113.11', '203.0.113.12'},
         {selector='random', backupSelector='all', timeout=500})
```

The broader lesson: any blocking I/O in the DNS query path is a risk. DNS is expected to be fast — clients typically wait 1-3 seconds before retrying with a different resolver. If your authoritative server takes 10 seconds to respond, the client has already moved on.
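The same pattern — parallel probes with a hard cap on wall time — is worth having outside of Lua too, for standalone health checks. A hypothetical Python sketch (the helper names are made up, not part of PowerDNS):

```python
import socket
from concurrent.futures import ThreadPoolExecutor

def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """TCP connect probe; returns within `timeout` seconds, success or not."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def healthy_backends(hosts, port=443, timeout=0.5):
    """Probe all backends concurrently, so wall time is ~one timeout, not N."""
    if not hosts:
        return []
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        up = list(pool.map(lambda h: port_open(h, port, timeout), hosts))
    return [h for h, ok in zip(hosts, up) if ok]
```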
PowerDNS has been solid for us. It’s not the flashiest DNS server, but it’s well-documented, has good operational tooling, and the Lua records feature is genuinely useful for dynamic DNS use cases. If you’re evaluating DNS server software for a production deployment, it’s worth a serious look.