[Chapter 8] 8.5 Planning for Disasters

8.5 Planning for Disasters

It's a fact of life on a network that things go wrong. Hardware fails, software has bugs, and people very occasionally make mistakes. Sometimes this results in minor inconvenience, like having a few users lose connections. Sometimes the results are catastrophic and involve the loss of important data and valuable jobs.

Because the Domain Name System relies so heavily on the network, it is vulnerable to network outages. Thankfully, the design of DNS takes into account the imperfection of networks: it allows for multiple, redundant name servers, retransmission of queries, retrying zone transfers, and so on.

The Domain Name System doesn't protect itself from every conceivable calamity, though. There are types of network failure - some of them quite common - that DNS doesn't or can't protect against. But with a small investment of time and money, you can minimize the threat of these outages.

8.5.1 Outages

Power outages, for example, are relatively common in many parts of the world. In some parts of the U.S., thunderstorms or tornadoes may cause a site to lose power, or to have only intermittent power, for an extended period. Elsewhere, typhoons, volcanoes, or construction work may interrupt your electrical service.

If all your hosts are down, of course, you don't need name service. Quite often, however, sites have problems when power is restored. Following our recommendations, they run their name servers on file servers and big multiuser machines. And when the power comes up, those machines are naturally the last to boot - because all those disks need to be fscked first! Which means that all the hosts on-site that are quick to boot do so without the benefit of name service.

This can cause all sorts of wonderful problems, depending on how your hosts' startup files are written. UNIX hosts often execute some variant of:

    /etc/ifconfig lan0 inet `hostname` netmask 255.255.128.0 up
    /etc/route add default site-router 1

to bring up their network interface. Using host names in the commands (`hostname` expands to the local host name and site-router is the name of the local router) is admirable for two reasons:

It lets the administrators change the router's IP address without changing all the startup files on-site.
It lets the administrators change the host's IP address by changing the IP address in only one file.

Unfortunately, the route command will fail without name service. The ifconfig command will fail only if the local host's name and IP address don't appear in the host's /etc/hosts file, so it's a good idea to leave at least that data in each host's /etc/hosts.
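One way to confirm that a host can still run its ifconfig command without name service is to check that its own name appears in /etc/hosts. The following is a minimal sketch of such a check; the function name hosts_has_name is ours, not part of any standard tool, and it assumes the usual host table layout of an address followed by names and aliases.

```shell
#!/bin/sh
# hosts_has_name FILE NAME   (hypothetical helper, for illustration only)
# Exit status 0 if NAME appears as a host name or alias on any
# non-comment line of FILE; non-zero otherwise.
hosts_has_name() {
    awk -v name="$2" '
        { sub(/#.*/, "") }     # drop comments before looking at fields
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { exit !found }    # field 1 is the address, so it is skipped
    ' "$1"
}

# Possible use in a startup script:
#   hosts_has_name /etc/hosts "`hostname`" ||
#       echo "warning: local host name missing from /etc/hosts" >&2
```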

By the time the startup sequence reaches the route command, the network interface will be up, and the host will use name service to map the name of the router to an IP address. And since the host has no default route until the route command is executed, the only name servers it can reach are those on the local subnet.

If the booting host can reach a working name server on its local subnet, it can execute the route command successfully. Quite often, however, one or more of the name servers it can reach aren't yet running. What happens then depends on the contents of resolv.conf.

In BIND 4.9 and BIND 8, the resolver will only fall back to the host table if there is only one name server listed in resolv.conf (or if no name server is listed, and the resolver defaults to using a name server on the local host). If only one name server is configured, the resolver will query it, and if the network returns an error each time the resolver sends a query, the resolver will fall back to searching the host table. The errors that cause the resolver to fall back include:

Receipt of an ICMP port unreachable message
Receipt of an ICMP network unreachable message
Inability to send the UDP packet (e.g., because networking is not yet running on the local host)[11]
[11] Check Chapter 6, Configuring Hosts, for vendor-specific enhancements to and variants on this resolver algorithm.

If the host running the one configured name server isn't running at all, though, the resolver won't receive any errors. The name server is effectively a black hole. After about 75 seconds of trying, the resolver will just time out and return a null answer to the application that called it. Only if the name server host has actually started networking - but not yet started the name server - will the resolver get an error: an ICMP port unreachable message.

Overall, the single name server configuration does work if you have name servers available on each net, but perhaps not as elegantly as we might like. If the local name server hasn't come up when a host on its network reboots, the route command will fail.

This may seem awkward, but it's not nearly as bad as what happens with multiple servers. With multiple servers listed in resolv.conf, BIND never falls back to the host table after the primary network interface has been ifconfiged. The resolver simply loops through the name servers, querying them until one answers or the 75-plus second timeout is reached.

This is especially problematic during bootup. If none of the configured name servers are available, the resolver will time out without returning an IP address, and adding the default route will fail.
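The resolver behavior described in this section can be summarized as a small decision table. The sketch below is only a model of that description, not BIND code: the function name resolver_outcome and the condition labels (answer, error, silent) are our own, and it assumes the primary network interface has already been configured.

```shell
#!/bin/sh
# resolver_outcome NSERVERS CONDITION   (illustrative model, not BIND code)
#   NSERVERS   number of nameserver lines in resolv.conf
#              (0 means the resolver defaults to the local host)
#   CONDITION  answer - a configured name server responds
#              error  - every query draws a network error (ICMP port or
#                       network unreachable, or the packet can't be sent)
#              silent - no response and no error (server host is down)
resolver_outcome() {
    nservers=$1
    condition=$2
    case $condition in
        answer) echo "answer from name server"; return 0 ;;
        error|silent) ;;
        *) echo "unknown condition: $condition" >&2; return 2 ;;
    esac
    # Only the single-server configuration falls back, and only on errors.
    if [ "$nservers" -le 1 ] && [ "$condition" = error ]; then
        echo "fall back to host table"
    else
        echo "time out with no answer"
    fi
}
```

Calling resolver_outcome 1 error reports a fallback to the host table, while resolver_outcome 3 error and resolver_outcome 1 silent both report a timeout, which is the case that breaks the route command at boot.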

8.5.2 Recommendations

Our recommendation, as primitive as it sounds, is to hardcode the IP address of the default router into the startup file, or to use an external file (many systems use /etc/defaultrouter). This will ensure that your host's networking will start correctly.
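If you keep the router's address in an external file, the startup script needs to read it without consulting name service. Here is a minimal sketch of one way to do that; the helper name default_router_ip is hypothetical, and the sketch assumes the file holds one entry per line and that only a dotted-quad address is safe to use at boot. The route syntax in the comment is the one shown earlier and may differ on your system.

```shell
#!/bin/sh
# default_router_ip FILE   (hypothetical helper, for illustration only)
# Print the first entry in FILE that is a dotted-quad IP address.
# Comments, blank lines, and host names are skipped, since a host name
# would send us back to the name server we are trying not to depend on.
default_router_ip() {
    awk '
        { sub(/#.*/, "") }
        $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { print $1; found = 1; exit }
        END { exit !found }
    ' "$1"
}

# Possible use in a startup script:
#   router=`default_router_ip /etc/defaultrouter` &&
#       /etc/route add default "$router" 1
```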

An alternative is to list just a single, reliable name server on your host's local net in resolv.conf. This will allow you to use the name of the default router in the startup file, as long as you make sure that the router's name appears in /etc/hosts (in case your reliable name server isn't running when the host reboots). Of course, if the host running the reliable name server isn't running when your host reboots, all bets are off. You won't fall back to /etc/hosts, because there won't be any networking running to return an error to your host.

If your vendor's version of BIND allows configuration of the order in which services are queried, or will fall back from DNS to /etc/hosts if DNS doesn't find an answer, take advantage of it! In the former case, you can configure the resolver to check /etc/hosts first, and then keep a "stub" /etc/hosts file on each host, including the default router and the local host's name. In the latter situation, just make sure such a "stub" /etc/hosts exists; no other configuration should be necessary.
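A "stub" host table needs only a few lines. The sketch below prints one; the function name stub_hosts and the example names and addresses in the comment are placeholders of ours, and how you tell the resolver to consult the file is vendor-specific, so it isn't shown here.

```shell
#!/bin/sh
# stub_hosts HOSTNAME HOSTIP ROUTERNAME ROUTERIP   (hypothetical helper)
# Print a minimal host table: loopback, the local host, and the
# default router, one tab-separated "address name" pair per line.
stub_hosts() {
    printf '127.0.0.1\tlocalhost\n'
    printf '%s\t%s\n' "$2" "$1"
    printf '%s\t%s\n' "$4" "$3"
}

# Possible use (example addresses only):
#   stub_hosts "`hostname`" 192.0.2.10 site-router 192.0.2.1 > /etc/hosts
```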

A last, promising prospect is to do away with setting the default route manually by using ICMP Router Discovery Messages. This extension to the ICMP protocol, described in RFC 1256, uses broadcast or multicast messages to dynamically discover and advertise routers on a network. Sun includes an implementation of this protocol in recent versions of Solaris as /usr/sbin/in.rdisc, and newer versions of Cisco's Internetwork Operating System (IOS) support it too.

And what if your default route is added correctly, but the name servers still haven't come up? This can affect sendmail, NFS, and a slew of other services. sendmail won't canonicalize host names correctly without DNS, and your NFS mounts may fail.

The best solution to this problem is to run a name server on a host with uninterruptible power. If you rarely experience extended power loss, battery backup might be enough. If your outages are longer, and name service is critical to you, you should consider an uninterruptible power system (UPS) with a generator of some kind.

If you can't afford luxuries like these, you might just try to track down the fastest-booting host around and run a name server on it. Hosts with filesystem journaling should boot especially quickly, since they don't need to fsck. Hosts with small filesystems should boot quickly, too, since they don't have as much filesystem to check.

Once you've located the right host, you'll need to make sure the host's IP address appears in the resolv.conf files of all the hosts that need full-time name service. You'll probably want to list the backed-up host last, since during normal operation, hosts should use the name server closest to them. Then, after a power failure, your critical applications will still have name service, albeit at a small sacrifice in performance.
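To audit your hosts, you can check which name server each resolv.conf lists last. The following is a rough sketch, assuming the resolver honors at most three nameserver directives, as the stock BIND resolver does; the function name last_nameserver is ours.

```shell
#!/bin/sh
# last_nameserver FILE   (hypothetical helper, for illustration only)
# Print the address on the last nameserver line the resolver would use.
# Assumption: only the first three nameserver lines count.
# Exit status is non-zero if FILE has no nameserver lines at all.
last_nameserver() {
    awk '
        $1 == "nameserver" && n < 3 { last = $2; n++ }
        END { if (n) print last; exit !n }
    ' "$1"
}

# Possible use, with 192.0.2.99 standing in for the backed-up server:
#   [ "`last_nameserver /etc/resolv.conf`" = 192.0.2.99 ] ||
#       echo "backed-up name server is not listed last" >&2
```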

