Tim Parker and Nancy Breaux
The term supercomputer has been in widespread use for many years. Most people think of supercomputers as a large, expensive, fast number-cruncher designed for complex calculation. Supercomputers belong at NASA, meteorological stations, universities and research organizations. The ability to have supercomputer power in your own office or home was unthinkable a few years ago, and when dealing with single PCs the performance is getting close to but not quite good enough to qualify as a supercomputer. There are a few ways you can have supercomputing capabilities using simple, inexpensive personal computers, though. All hinge around the concept of clustering.
Clustering computers to share their computational power for larger problems is not a new idea. Clusters have been around since the Digital Equipment VAX days, and large UNIX workstation manufacturers like IBM and Sun have offered clustering software capabilities. The change in focus towards inexpensive personal computers really started at NASA, where a number of old 80486-based computers were networked together into a cluster. Since then, the focus on clustering of personal computers has lead to many new developments, as well as a new category of supercomputer: shared computation power. With terms like Beowulf cluster entering our vocabulary, it’s time to take a look at what clustering is, what it does, what you need to create and manage a cluster, and how Linux can fit into the picture.
Let’s start with a few definitions. The word cluster actually refers to many different technologies and techniques, all intended to couple several machines together to provide either more computational resources than single machines can offer or to offer high availability servers (called HA clustering). HA clustering is used for mission-critical and never-down servers such as Web servers and large corporate application servers. By having a number of machines on a network that can perform the same tasks, failure in one machine simply allows another to assume the failed machine’s tasks. The creation of these redundant clusters is simple in Linux and UNIX, requiring only monitoring routines to handle the fail-over tasks as quickly as possible. One other benefit of HA clusters is shared load handling: if you have five servers all ready to handle requests, they will be able to handle a much higher load than a single server.
Clusters are often developed specifically for their computation power. Instead of spending millions of dollars on a supercomputer, companies and organizations could spend a few thousands on personal computers and use their shared, aggregate CPU powers. The original cluster at NASA arose for this reason. The NASA engineers needed around 1Gflops (one billion operations per second) for their project but couldn’t afford a supercomputer. Instead, they combined 16 66MHz 80486 systems into a cluster, writing software to allow the machines to cooperatively crunch the numbers. In 1994, this was the first Beowulf cluster (called Wiglaf, after Beowulf’s aide). The total cost of Wiglaf was well under $50,000, much less than the cost of a supercomputer to provide the same raw crunching power.
The high-performance cluster setups also have other uses than raw CPU power. The issue of bulk disk servers arises, since many low-power CPUs can be used for large files and filesystems. Fast file servers require fast CPUs to handle file I/O: spreading the loads across many machines reduces the amount of CPU required for each disk. Even better, with parallel Linux or UNIX clusters, a filesystem can be constructed that is as fast or faster than very expensive disk-based fileservers at a fraction of the cost. Bulk disk servers are ideal for database applications.
All clusters are not the same. There are several ways to implement clusters, and to some extent the method of implementing depends on the machines being used in the cluster and the purpose of the cluster. There are two technologies that compete in the cluster world: MPI (Message Passing Interface) and PVM (Parallel Virtual Machine). MPI is a published standard and is gaining rapid support from the clustering community including vendors and military applications. PVM is an implementation developed at one university and is not a standard. Clusters can be implemented without either MPI or PVM. However, one or the other message passing protocol is usually used for parallel programming. PVM is the de facto standard for clusters.
PVM was designed to support heterogenous environments (in other words, many different kinds of machines making up a cluster) while MPI was not designed for clusters at all. MPI was designed for programming massively parallel processor (MPP) machines such as the Cray T3D. MPI is suitable for any parallel computer that uses message passing to communicate between processors. The goal was to write portable libraries for number crunching, so that what you wrote (often in some flavor of FORTRAN) for your Cray could be ported to your IBM MPP. Clusters were an afterthought for MPI. Hence, the only clusters that are easily supported by MPI are those consisting of machines all of the same type. Nevertheless, if you want the application code you wrote for your Cray supercomputer to run on a bunch of Linux boxes, MPI is probably the first choice.
If you are starting developing an application from scratch and want a little more flexibility in load balancing, machine types, and fault tolerance, PVM is probably your best bet (as long as you don't mind writing your code in C. Although MPI does allow for heterogeneous networks, there are some limitations to MPI. Most notably, MPI does not allow for dynamic sizing of the cluster. While suitable for identical machines coupled by a high-speed network, it has no built-in way of handling faults. If new machines are added to the cluster while it is operational, they are not used. PVM, on the other hand, does allow dynamic sizing of the cluster, allowing machines to be added and dropped at will, adapting the clustering software to handle the changes easily.
You will often hear the term Beowulf cluster. A Beowulf cluster is simply a cluster of Linux-based machines although the term is sometimes used for the software component as well. In general, though, a Beowulf cluster is a set of PCs networked together in a dedicated subnet (usually 100Mbps Ethernet), with one machine designated the “master” and the others are “headless” (no monitor, keyboard, etc). The master is responsible for managing the cluster and for communications between the cluster and the outside world. A Beowulf cluster can be either MPI or PVM. The term Beowulf cluster refers both to the hardware and software parts. The software part consists of the system software made available by www.beowulf.org. Among the software provided are packages that allows the process ID space to span multiple nodes, special Ethernet drivers, a package to allow Ethernet channel bonding, and other tools. Either PVM or MPI can be used to write programs to run on the cluster, but when referring to the “Beowulf cluster software” you are specifically referring to the system software and extensions to the Linux kernel made available by Scyld computing corporation. In addition there are commercial variants of what is available as software under GNU licenses. It is important that the Beowulf cluster subnet be high speed. This usually means 100Mbps Ethernet switch, ATM, or similar system that can keep up with high bandwidth demands. Channel bonding was a solution to the problem when only 10Mbps Ethernet was available and switches were prohibitively expensive.
A neat use of cluster has to do with managing very large data sets. If the data set to be crunched is sufficiently large that it cannot reside in a single machine’s RAM and if the computation can be spread across multiple CPUs so a subset of the total data can fit in a single machine’s RAM, then “superlinear speedup” is possible. Superlinear speedup occurs then several processors can perform faster than a single processor machine would be capable of, given the effect of memory and disk swapping. In other words, if you have 8 processors in a cluster, the performance of the cluster would be higher than 8 times faster than that of a single processor due to the software overhead involved in memory swapping. Beowulf clusters are likely targets for superlinear speedup when using large data sets such as image analysis and matrix calculations.
There are three different categories of clustering application. The first, a simplest, is called “embarrassingly parallel problems”. These are computational problems where the parallelism is easy to exploit using SIMD (single instruction multiple datastream). In other words, these are problems where calculations must be repeatedly performed on a large data set or on multiple data sets, and the amount of data is more than can be reasonably crunched on a single CPU. By joining many machines into a cluster, the amount of data to be crunched can be shared over all the machines and each has its own software to control the number crunching. With embarrassingly parallel problems there is no need for machines to talk to each other except to download raw data and transmit results. Embarrassingly parallel problems are quite simple to code because there is no need for CPU to CPU conversations, no cross thread or process communications, and no need for repeated updates of data status. One of the most widely-used examples of embarrassingly parallel problems is distributed.net, where computers download a chunk of encrypted data and analyze it for key to the encryption, receiving raw data in chunks and sending results back to an Internet-based server.
More complex are tasks that require some communications between the machines that make up a cluster. These are not embarrassingly parallel problems because of the need for interprocess communications. Sharing these complex calculations among multiple machines means that the software to control the cluster is far more complicated to code and often requires complex algorithms. Linear algrebra routines involving large matrices are typical of such applications. Some of these matrix calculations have scalable algorithms associated with them, but most do not. In fact, this continues to be an active area of research, working towards the discovery of algorithms that can work well in a cluster environment. The term “matrix calculations” is often applied to these problems because matrix manipulation (such as matrix inversions) are typical tasks. Matrix calculations tend to be scalable and portable, but the overhead in coding and maintaining the cluster components can be significant.
The third type of clustering application is actually a modification of embarrassingly parallel problems. It is called “grid” computing, because it uses the Internet as a computational network. Grid computing is designed to allow many individual machines to download a client and data set from a server, crunch the data, and transfer the results back to the server, all over the Internet. The seti@home project is an example of grid computing, as are the applications offered by distributed.net. Grid computing extends the concept of using background CPU power to the entire Internet and its millions of machines.
Setting up a Linux cluster
In order to set up a Beowulf cluster of your own you need two things: Linux machines on a network together, and software to handle the clustering. Typically, a Linux cluster will have a single master and a number of headless machines, but all can be full systems with no performance penalty. A fast network (preferably 100Mbps Ethernet) with a dedicated switch is ideal for maximum performance. The disks on the Linux machines are often set up very similar to each other (often cloned from a single source) and RPC (Remote Procedure Call) is used for intermachine communications. Finally, the filesystems are often set up using NFS to fast remote access.
One problem with this approach to clustering is the security issue. RPC and NFS are both vulnerable to hacking, so a firewall or closed network loop is ideal to prevent intrusion and tampering. A closed network is a necessity with a Beowulf cluster meant for number crunching. The reason is that outside network traffic will interfere with the interprocess communication. The increase in variance of the communication time would make writing fast, efficient number crunching programs almost impossible. A developer needs a reasonable estimate of the mean and variance of the interprocess communication time in order to write effect parallel code to rival the performance of traditional multiprocessor mainframes.
The software component you can use on your Linux cluster depends to some extent on your application, and also on what is available to you. If you want to set up a high-availability cluster, your software will be different than a number-crunching cluster.
For a high-availability cluster, you need software that handles failure of a node on the network, and also provides some load-balancing feature so your machines are used optimally. MOSIX (www.mosix.cs.huji.ac.il) is one of the widest used Linux HA clustering applications and it handles load-balancing across the cluster. MOSIX requires no special modification to the Linux operating system or the system layout, which makes it easy to install and remove later. Although not completely transparent to users, MOSIX is non-intrusive and works well.
The Piranha (available from www.redhat.com and other Linux sites) software package has been popular lately, too. Piranha is intended for HA clustering or for load-balancing, but it doesn’t do a good job of handling both tasks at once. Piranha’s strength is its ease of setup and use, and it is tightly integrated into several current Linux releases including RedHat 6.2.
The Linux Virtual Server project (www.linuxvirtualserver.org) provides a set of kernel modifications and drivers for setting up load-balanced clusters using Linux systems. The Virtual Server approach is to allow scalable, reliable clusters to be constructed using a variety of Linux platforms. So far, the Virtual Server system has been plagued by some problems, but these are quickly being ironed out and the Linux Virtual Server idea holds great promise.
The final HA clustering software package worth noting comes from TurboLinux, with their TurboCluster EnFuzion package. TurboLinux is a well-known vendor of optimized Linux operating systems for workstation use, and TurboCluster is a superset of the TurboLinux release. TurboCluster is designed to allow integration between Linux and Sun Solaris platforms into a single cluster, and support for Windows NT as part of the cluster is also included.
Computational clusters share number-crunching tasks across several machines. Most computational cluster approaches require modification to the software load on a Linux system, as well as special drivers for the applications to be run themselves. The more widely known and oft-used computational cluster package is Beowulf (www.beowulf.org). To install a Beowulf-based cluster you really need to dedicate the machines to the cluster task, and not use them as stand-alone workstations. Setting up and managing a Beowulf cluster requires a fair amount of system administration skill, as well as some ongoing management tasks.
An alternative to Beowulf is Evolocity (www.linuxnetworx.com) which is a commercial cluster software application. Evolocity provides a complete software package including management tools, as well as support. Although not inexpensive, Evolocity is a good choice for organizations and companies that require a single point of contact for their application, as well as a solid technical support structure.
Clusters are quickly becoming an important use for Linux, as well as for aging personal computers. Setting up clusters is much simpler than many users assume, and can easily be done by any knowledgeable Linux user. While the dedicated high-speed networks may be beyond the reach of home or small office setups, the ability to load share and provide redundancy inexpensively is attractive and an excellent use of PCs.