1776 Fault-Freedom II
1776 has been known for SCO-compatible RAID software for a decade. The latest version of its Fault-Freedom product, dubbed Fault-Freedom II, extends the company’s dedication to high-availability data integrity by using clustering. Fault-Freedom II lets you mirror up to 16 disk slices and 16 processes on one machine; if a problem occurs, a second, backup system automatically takes over as the fail-over. System clustering and fail-over have been available on larger Unix servers for a number of years, but Fault-Freedom II is the first complete system for SCO we’ve tested.
A number of features of Fault-Freedom II stand out at first glance. The Motif interface is clean and easy to use. Fail-over conditions can be modified to suit your requirements. Automatic alerting is possible through a number of methods, such as pop-up windows, e-mail, and fax (we rigged our test system to page us through a modem when fail-overs occurred). And an API allows programmers to customize both systems and applications so that Fault-Freedom II routines can be handled more easily. As usual with clustering software, there are limitations too. First, you cannot mirror a boot disk (this limitation is not specific to Fault-Freedom II; it applies to all clustering software). You can’t mix mirrored and shared slices on the same system, either. If you use RAID with Fault-Freedom II, it must be hardware-based, as software RAID will conflict with the fail-over system.
Setting up Fault-Freedom II is a little time-consuming. You need to decide which system is to be mirrored and which will be the fail-over. The fail-over can be a regular machine and does not need to be dedicated to its stand-by role. The two machines do not need to have similar hardware configurations. On our test network, we used our ALR Revolution 2XL server running SCO OpenServer 5 as the primary and an older Pentium II Pro system as the fail-over. SCSI disk capacities were approximately the same but composed of different combinations of disk sizes (two 9.1 GB and one 4 GB on the server, four 6 GB on the fail-over). Memory was different, too, with 128 MB on the server and 64 MB on the fail-over. Hardware boards can be completely different, as can software. Our two machines had different add-in board setups, including SCSI controllers.
The two machines in the cluster can be anywhere on the network as long as they are not too far apart. 1776 estimates that packet transmission shouldn’t take more than three-quarters of a second between the two devices. On our network, the two devices were installed in an office and a basement, separated by 400 feet on a 100Base-T network. You can set Fault-Freedom II to switch the IP address of the fail-over if the primary fails, allowing all network interaction to proceed properly. We used a TCP/IP Ethernet setup, but any SCO-supported network can be used with Fault-Freedom II.
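To get a rough feel for whether two machines fall inside that three-quarters-of-a-second estimate, something like the following sketch could be used. It is not part of Fault-Freedom II and does not reflect how the product measures delay: timing a TCP connection is our own assumption, and the function names are hypothetical.

```python
import socket
import time

# 1776's estimate from the review: no more than three-quarters of a second.
LATENCY_BUDGET_SECONDS = 0.75


def measure_connect_time(host, port, timeout=2.0):
    """Return the seconds taken to open a TCP connection to host:port."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection opened; close it again immediately
    return time.monotonic() - start


def within_latency_budget(samples, budget=LATENCY_BUDGET_SECONDS):
    """Return True if the slowest sample does not exceed the budget."""
    if not samples:
        raise ValueError("need at least one latency sample")
    return max(samples) <= budget
```

In practice you would collect several samples, for example `[measure_connect_time("failover.example.com", 80) for _ in range(5)]` (the host name is a placeholder), and pass them to `within_latency_budget`. Judging by the slowest sample rather than the average is a deliberately conservative choice.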
The method described for setting up a primary and fail-over machine is the most common, providing an automatic backup for a single server. However, Fault-Freedom II can just as easily mirror the fail-over to the primary, so that the two machines are effectively mirroring each other. This would be handy for installations where one machine acts as an application server and another as a Web server, for example. If one fails, the other assumes its role.
Installing Fault-Freedom II is simple and took about two minutes. Configuring the system, on the other hand, took a few hours. The software is supplied on two diskettes and is installed with custom or scoadmin. You need to plan in advance which slices of the disks will be mirrored. As mentioned, up to 16 slices or partitions can be handled in each direction. You don’t need to mirror an entire disk, so you can select one or two partitions for mirroring from each physical hard drive in your system (except the boot partition). A mirrored partition requires that the target (fail-over) partition be dedicated, which usually means you can’t mount the backup partition unless you want to risk data integrity problems. Fault-Freedom II is clever in that it allows mirroring to be suspended either on demand (such as for a tape backup) or at scheduled intervals (to reduce network traffic or to allow for site-to-site backups overnight). Daily maintenance and monitoring of Fault-Freedom II is usually done through the Motif interface, and is minimal except when you want to change a configuration.
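The planning rules above (at most 16 slices in each direction, no boot partition, and a dedicated, unmounted target) lend themselves to a quick sanity check before you start configuring. The sketch below is purely illustrative: the `Slice` record and `check_mirror_plan` function are our own hypothetical names, not anything supplied by 1776.

```python
from dataclasses import dataclass

# Limit quoted in the review: up to 16 slices in each direction.
MAX_MIRRORED_SLICES = 16


@dataclass
class Slice:
    name: str                     # hypothetical label for the slice
    is_boot: bool = False         # boot partitions cannot be mirrored
    target_mounted: bool = False  # the fail-over target should be dedicated


def check_mirror_plan(slices):
    """Return a list of problems; an empty list means the plan fits the rules."""
    problems = []
    if len(slices) > MAX_MIRRORED_SLICES:
        problems.append(
            f"{len(slices)} slices planned, limit is {MAX_MIRRORED_SLICES}"
        )
    for s in slices:
        if s.is_boot:
            problems.append(f"{s.name}: boot partition cannot be mirrored")
        if s.target_mounted:
            problems.append(f"{s.name}: target partition must not be mounted")
    return problems
```

Calling `check_mirror_plan([Slice("data1"), Slice("root", is_boot=True)])` would report one problem, for the boot partition.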
The documentation supplied with Fault-Freedom II comes in a small two-inch three-ring binder, complete with properly inserted page tabs. The printing leaves a little to be desired: it looks as though the pages were laser printed with a minimum of formatting or design, then reproduced from the raw laser sheets. A professional design and better printing would improve the look of the manual, but what really matters is the content, which is complete.
Fail-overs from primary to backup can be triggered by several events, such as a disk slice failure, a process failure, a machine shutdown, a command failure, or a specific command. On our test system we instructed the fail-over to take over for most of these conditions. When we deliberately unplugged a SCSI cable to one drive with four mirrored slices, the backup kicked in almost immediately. Despite a slight slow-down of requests served over the network from the backup, the system behaved as though nothing had happened. When we killed the HTTP daemon on our server, the backup’s daemon started up and handled Web requests right away. We also monitored the entire system (which most users will do), so when the power was interrupted and the UPS signaled a shutdown, the backup took over all roles of the server, including switching IP addresses dynamically. The fail-over process worked within seconds every time we tried it. The only problem we encountered in our testing was a complete freeze-up of our backup system at one point. The problem is known to 1776, and an appendix in the documentation explains how to set kernel parameters to prevent it. We followed the instructions and the issue went away.
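Conceptually, the configurable triggers amount to a set of enabled conditions that each incoming event is checked against. The sketch below models that idea only; it is not the Fault-Freedom II API, and every name in it is hypothetical. The enabled set is an example selection matching the three tests we describe, not the product’s defaults.

```python
from enum import Enum, auto


class Condition(Enum):
    """Fail-over triggers listed in the review (names are our own)."""
    DISK_SLICE_FAILURE = auto()
    PROCESS_FAILURE = auto()
    MACHINE_SHUTDOWN = auto()
    COMMAND_FAILURE = auto()
    OPERATOR_COMMAND = auto()


# Example selection: the three conditions exercised in our tests
# (pulled SCSI cable, killed HTTP daemon, UPS-signaled shutdown).
ENABLED = {
    Condition.DISK_SLICE_FAILURE,
    Condition.PROCESS_FAILURE,
    Condition.MACHINE_SHUTDOWN,
}


def should_fail_over(event, enabled=ENABLED):
    """Return True if the event is one of the enabled trigger conditions."""
    return event in enabled
```

With this selection, `should_fail_over(Condition.PROCESS_FAILURE)` is true, while a `COMMAND_FAILURE` event would be ignored until that condition is added to the set.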
1776’s Fault-Freedom II is not going to be ideal for every network, partly because of the cost and partly because of the setup and maintenance involved. However, software like Fault-Freedom II can be thought of as RAID for an entire server (or even two servers). If your data is important, you ought to protect it. If you need to have your server available all the time, then clustering software like Fault-Freedom II is the best way to go. Fault-Freedom II does its job well, and adds another tool to a system administrator’s arsenal against SCO system problems.