Brian.J.Watson@hp.com
Revision History

Revision | Date | Revised by | Notes
---|---|---|---
1.04 | 2002-05-29 | bjw | LDP review
1.03 | 2002-05-23 | bpm | LDP review
1.02 | 2002-05-13 | bjw | Fixed minor typos and errors
1.00 | 2002-05-09 | bjw | Initial release
Provide a friendly testing and debugging environment for SSI developers.
Let developers test hundred-node clusters with only ten or so physical machines.
The raison d'être of the SSI Clustering project is to provide a full, highly available SSI environment for Linux. Goals for this project include availability, scalability and manageability, using standard servers. Technology pieces include: membership, single root and single init, single process space and process migration, load leveling, single IPC, device and networking space, and single management space.
The SSI project was seeded with HP's NonStop Clusters for UnixWare (NSC) technology. It also leverages other open source technologies, such as Cluster Infrastructure (CI), Global File System (GFS), keepalive/spawndaemon, Linux Virtual Server (LVS), and the Mosix load-leveler, to create the best general-purpose clustering environment on Linux.
The CI project is developing a common infrastructure for Linux clustering by extending the Cluster Membership Subsystem (CLMS) and Internode Communication Subsystem (ICS) from HP's NonStop Clusters for UnixWare (NSC) code base.
GFS is a parallel physical file system for Linux. It allows multiple computers to simultaneously share a single drive. The SSI Clustering project uses GFS for its single, shared root. GFS was originally developed and open-sourced by Sistina Software. Later they decided to close the GFS source, which prompted the creation of the OpenGFS project to maintain a version of GFS that is still under the GPL.
keepalive is a process monitoring and restart daemon that was ported from HP's NonStop Clusters for UnixWare (NSC). It offers significantly more flexibility than the respawn feature of init.
spawndaemon provides a command-line interface for keepalive. It's used to control which processes keepalive monitors, along with various other parameters related to monitoring and restart.
Keepalive/spawndaemon is currently incompatible with the GFS shared root. keepalive makes use of shared writable memory-mapped files, which OpenGFS does not yet support. They are mentioned here only for the sake of completeness.
LVS allows you to build highly scalable and highly available network services over a set of cluster nodes. LVS offers various ways to load-balance connections (e.g., round-robin, least connection, etc.) across the cluster. The whole cluster is known to the outside world by a single IP address.
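To give a feel for what LVS does, here is a minimal sketch of how a standalone LVS director might be configured with the standard ipvsadm tool. The addresses and port are placeholders, and this is independent of the SSI integration described below.

host# ipvsadm -A -t 10.0.0.1:80 -s rr                    # virtual web service on the cluster IP, round-robin scheduling
host# ipvsadm -a -t 10.0.0.1:80 -r 192.168.50.1:80 -m    # first real server, forwarded via NAT (masquerading)
host# ipvsadm -a -t 10.0.0.1:80 -r 192.168.50.2:80 -m    # second real server

Clients connect only to 10.0.0.1; the director spreads their connections across the real servers.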
The SSI project will become more tightly integrated with LVS in the future. An advantage will be greatly reduced administrative overhead, because SSI kernels have the information necessary to automate most LVS configuration. Another advantage will be that the SSI environment allows much tighter coordination among server nodes.
LVS support is turned off in the current binary release of SSI/UML. To experiment with it you must build your own kernel as described in Section 4.
The Mosix load-leveler provides automatic load-balancing within a cluster. Using the Mosix algorithms, the load of each node is calculated and compared to the loads of the other nodes in the cluster. If it's determined that a node is overloaded, the load-leveler chooses a process to migrate to the best underloaded node.
Only the load-leveling algorithms have been taken from Mosix. The SSI Clustering project is using its own process migration model, membership mechanism and information sharing scheme.
The Mosix load-leveler is turned off in the current binary release of SSI/UML. To experiment with it you must build your own kernel as described in Section 4.
User-Mode Linux (UML) allows you to run one or more virtual Linux machines on a host Linux system. It includes virtual block, network, and serial devices to provide an environment that is almost as full-featured as a hardware-based machine.
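For readers who have not used UML before, a single non-SSI UML instance is normally booted by hand, roughly as shown below; the filename is a placeholder, and the SSI scripts described later handle all of this for you.

host$ ./linux ubd0=root_fs mem=128M

Here ubd0= attaches a filesystem image file as the virtual machine's root block device, and mem= sets the amount of memory the virtual machine sees.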
High performance (HP) clusters, typified by Beowulf clusters, are constructed to run parallel programs (weather simulations, data mining, etc.).
Load-leveling clusters, typified by Mosix, are constructed to allow a user on one node to spread his workload transparently across all nodes in the cluster. This can be very useful for compute-intensive, long-running jobs that aren't massively parallel.
Web-service clusters, typified by the Linux Virtual Server (LVS) project and Piranha, do a different kind of load leveling. Incoming web service requests are load-leveled by a front end system across a set of standard servers.
Storage clusters, typified by Sistina's GFS and the OpenGFS project, consist of nodes which supply parallel, coherent, and highly available access to filesystem data.
Database clusters, typified by Oracle 9i RAC (formerly Oracle Parallel Server), consist of nodes which supply parallel, coherent, and HA access to a database.
High Availability clusters, typified by Lifekeeper, FailSafe and Heartbeat, are also often known as failover clusters. Resources, most importantly applications and nodes, are monitored. When a failure is detected, scripts are used to fail over IP addresses, disks, and filesystems, as well as restarting applications.
For more information about how SSI clustering compares to the cluster types above, read Bruce Walker's Introduction to Single System Image Clustering.
More testing is needed to know what the appropriate system requirements are. User feedback would be most useful, and can be sent to <ssic-linux-devel@lists.sf.net>.
The latest version of this HOWTO will always be made available on the SSI project website, in a variety of formats:
Feedback is most certainly welcome for this document. Please send your additions, comments and criticisms to the following email address: <ssic-linux-devel@lists.sf.net>.
First you need to download an SSI-ready root image. The compressed image weighs in at over 150MB, which will take more than six hours to download over a 56K modem, or about 45 minutes over a 500Kbps broadband connection.
The image is based on Red Hat 7.2. This means the virtual SSI cluster will be running Red Hat, but it does not matter which distribution you run on the host system. A more advanced user can make a new root image based on another distribution. This is described in Section 5.
After downloading the root image, extract and install it.
host$ tar jxvf ~/ssiuml-root-rh72-0.6.5-1.tar.bz2
host$ su
host# cd ssiuml-root-rh72
host# make install
host# Ctrl-D
Download the UML utilities. Extract, build, and install them.
host$ tar jxvf ~/uml_utilities_20020428.tar.bz2
host$ su
host# cd tools
host# make install
host# Ctrl-D
Download the SSI/UML utilities. Extract, build, and install them.
host$ tar jxvf ~/ssiuml-utils-0.6.5-1.tar.bz2
host$ su
host# cd ssiuml-utils
host# make install
host# Ctrl-D
host$ ssi-start 2
This command boots nodes 1 and 2. It displays each console in a new xterm. The nodes run through their early kernel initialization, then seek each other out and form an SSI cluster before booting the rest of the way. If you're anxious to see what an SSI cluster can do, skip ahead to Section 3.
You'll probably notice that two other consoles are started. One is the lock server node, which is an artefact of how the GFS shared root is implemented at this time. The console is not a node in the cluster, and it won't give you a login prompt. For more information about the lock server, see Section 7.3. The other console is for the UML virtual networking switch daemon. It won't give you a prompt, either.
Note that only one SSI/UML cluster can be running at a time. It can, however, be run as a non-root user.
The argument to ssi-start is the number of nodes that should be in the cluster. It must be a number between 1 and 15. If this argument is omitted, it defaults to 3. The fifteen-node limit is arbitrary and can easily be raised in future releases.
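For example, to boot a five-node cluster:

host$ ssi-start 5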
To substitute your own SSI/UML files for the ones in /usr/local/lib and /usr/local/bin, provide your pathnames in ~/.ssiuml/ssiuml.conf. Values to override are KERNEL, ROOT, CIDEV, INITRD, and INITRD_MEMEXP. This feature is only needed by an advanced user.
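A hypothetical ~/.ssiuml/ssiuml.conf might look like the following. The variable names come from the list above, but the paths and exact syntax here are only an illustration.

# ~/.ssiuml/ssiuml.conf -- example overrides (paths are placeholders)
KERNEL=/home/me/ssi/linux-ssi/linux
ROOT=/home/me/ssi/root_fs
CIDEV=/home/me/ssi/root_cidev
INITRD=/home/me/ssi/initrd-ssi.img
INITRD_MEMEXP=/home/me/ssi/initrd-memexp.img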
You can take down the entire cluster at once with
host$ ssi-stop
If ssi-stop hangs, interrupt it and shoot all the linux-ssi processes before trying again.
host$ killall -9 linux-ssi
host$ ssi-stop
The following demos should familiarize you with what an SSI cluster can do.
node1# cd ~/dbdemo
node1# ./dbdemo alphabet
The dbdemo program is also listening on its terminal device for certain command keys.
Table 1. Command Keys for dbdemo
Key | Description |
---|---|
1-9 | move to that node and continue with the next record |
Enter | periodically moves to a random node until you press a key |
q | quit |
node3# ps -ef | grep dbdemo
node3# where_pid <pid>
2
If you like, you can download the source for dbdemo. It's also available as a tarball in the /root/dbdemo directory.
node1# onnode 2 vi /tmp/newfile
Confirm that it's on node 2 with where_pid. You need to get its PID first.
node3# ps -ef | grep vi
node3# where_pid <pid>
2
node3# cat /tmp/newfile
some text
node3# kill <pid>
Make a FIFO on the shared root.
node1# mkfifo /fifo
On node 1, echo the word "something" into the FIFO.
node1# echo something >/fifo
node2# cat /fifo
something
This demonstrates that FIFOs are clusterwide and remotely accessible.
On node 3, write "Hello World" to the console of node 1.
node3# echo "Hello World" >/devfs/node1/console
Building your own kernel and ramdisk is necessary if you want to work from the latest CVS code or experiment with features that are disabled in the binary release, such as LVS support or the Mosix load-leveler.
Otherwise, feel free to skip this section.
The latest SSI release can be found at the top of this release list. At the time of this writing, the latest release is 0.6.5.
Download the latest release. Extract it.
host$ tar jxvf ~/ssi-linux-2.4.16-v0.6.5.tar.bz2
Determine the corresponding kernel version number from the release name. It appears before the SSI version number. For the 0.6.5 release, the corresponding kernel version is 2.4.16.
Follow these instructions to do a CVS checkout of the latest SSI code. The modulename is ssic-linux.
You also need to check out the latest CI code. Follow these instructions to do that. The modulename is ci-linux.
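If you are unfamiliar with anonymous CVS on SourceForge, the checkout usually looks something like the lines below. The host names and repository paths shown here are assumptions based on the standard SourceForge layout of the time; the linked instructions give the authoritative values.

host$ cvs -d:pserver:anonymous@cvs.ssic-linux.sourceforge.net:/cvsroot/ssic-linux login
host$ cvs -z3 -d:pserver:anonymous@cvs.ssic-linux.sourceforge.net:/cvsroot/ssic-linux co ssic-linux
host$ cvs -d:pserver:anonymous@cvs.ci-linux.sourceforge.net:/cvsroot/ci-linux login
host$ cvs -z3 -d:pserver:anonymous@cvs.ci-linux.sourceforge.net:/cvsroot/ci-linux co ci-linux

Press Enter when the anonymous login asks for a password.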
To do a developer checkout, you must be a CI or SSI developer. If you are interested in becoming a developer, read Section 8.3 and Section 8.4.
Determine the corresponding kernel version with
host$ head -4 ssic-linux/ssi-kernel/Makefile
VERSION = 2
PATCHLEVEL = 4
SUBLEVEL = 16
EXTRAVERSION =
In this case, the corresponding kernel version is 2.4.16. If you're paranoid, you might want to make sure the corresponding kernel version for CI is the same.
host$ head -4 ci-linux/ci-kernel/Makefile
VERSION = 2
PATCHLEVEL = 4
SUBLEVEL = 16
EXTRAVERSION =
They will only differ when I'm merging them up to a new kernel version. There is a window between checking in the new CI code and the new SSI code. I'll do my best to minimize that window. If you happen to see it, wait a few hours, then update your sandboxes.
host$ cd ssic-linux
host$ cvs up -d
host$ cd ../ci-linux
host$ cvs up -d
host$ cd ..
Download the appropriate kernel source. Get the version you determined in Section 4.1. Kernel source can be found on this U.S. server or any one of these mirrors around the world.
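For example, assuming the usual kernel.org directory layout, the 2.4.16 source could be fetched with:

host$ wget http://www.kernel.org/pub/linux/kernel/v2.4/linux-2.4.16.tar.bz2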
Extract the source. This will take a little time.
host$ tar jxvf ~/linux-2.4.16.tar.bz2
or
host$ tar zxvf ~/linux-2.4.16.tar.gz
Apply the patch in the SSI source tree.
host$ cd linux
host$ patch -p1 <../ssi-linux-2.4.16-v0.6.5/ssi-linux-2.4.16-v0.6.5.patch
host$ cd linux
host$ patch -p1 <../ssic-linux/3rd-party/uml-patch-2.4.18-22
Copy CI and SSI code into place.
host$ cp -alf ../ssic-linux/ssi-kernel/. .
host$ cp -alf ../ci-linux/ci-kernel/. .
Apply the GFS patch from the SSI sandbox.
host$ patch -p1 <../ssic-linux/3rd-party/opengfs-ssi.patch
Configure the kernel using the supplied UML configuration file.

host$ cp config.uml .config
host$ make oldconfig ARCH=um
Build the kernel image and modules.
host$ make dep linux modules ARCH=um
Download any version of OpenGFS after 0.0.92, or check out the latest source from CVS.
Apply the appropriate kernel patches from the kernel_patches directory to your kernel source tree. Make sure you enable the /dev filesystem, but do not have it automatically mount at boot. (When you configure the kernel select 'File systems -> /dev filesystem support' and unselect 'File systems -> /dev filesystem support -> Automatically mount at boot'.) Build the kernel as usual, install it, rewrite your boot block and reboot.
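As a rough sketch, the host kernel work might look like the following. It assumes the patches apply with -p1 from the top of your host kernel tree and that you boot with LILO; check the OpenGFS documentation for the exact patches, patch level, and boot-loader steps for your setup.

host$ cd /usr/src/linux
host$ for p in ~/opengfs/kernel_patches/*; do patch -p1 <$p; done
host$ make menuconfig        # enable /dev filesystem support, leave "Automatically mount at boot" off
host$ make dep bzImage modules
host$ su
host# make modules_install
host# cp arch/i386/boot/bzImage /boot/vmlinuz-gfs    # add a matching entry to /etc/lilo.conf first
host# lilo                                           # rewrite the boot block
host# reboot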
Configure, build and install the GFS modules and utilities.
host$ cd opengfs
host$ ./autogen.sh --with-linux_srcdir=host_kernel_source_tree
host$ make
host$ su
host# make install
Configure two aliases for one of the host's network devices. The first alias should be 192.168.50.1, and the other should be 192.168.50.101. Both should have a netmask of 255.255.255.0.
host# ifconfig eth0:0 192.168.50.1 netmask 255.255.255.0
host# ifconfig eth0:1 192.168.50.101 netmask 255.255.255.0
Look at the contents of /proc/partitions and select two device names that you're not using for anything else, then make two loopback device nodes with those names. For example:
host# mknod /dev/ide/host0/bus0/target0/lun0/part1 b 7 1
host# mknod /dev/ide/host0/bus0/target0/lun0/part2 b 7 2
Finally, load the necessary GFS modules and start the lock server daemon.
host# modprobe gfs
host# modprobe memexp
host# memexpd
host# Ctrl-D
Your host system now has GFS support.
Loopback mount the shared root.
host$ su
host# losetup /dev/loop1 root_cidev
host# losetup /dev/loop2 root_fs
host# passemble
host# mount -t gfs -o hostdata=192.168.50.1 /dev/pool/pool0 /mnt
Install the modules into the root image.
host# make modules_install ARCH=um INSTALL_MOD_PATH=/mnt
host# Ctrl-D
You have to repeat some of the steps you did in Section 4.5. Extract another copy of the OpenGFS source. Call it opengfs-uml. Add the following line to make/modules.mk.in.
KSRC := /root/linux-ssi
INCL_FLAGS := -I. -I.. -I$(GFS_ROOT)/src/include -I$(KSRC)/include \
+             -I$(KSRC)/arch/um/include \
              $(EXTRA_INCL)
DEF_FLAGS := -D__KERNEL__ -DMODULE $(EXTRA_FLAGS)
OPT_FLAGS := -O2 -fomit-frame-pointer
Configure, build and install the GFS modules and utilities for UML.
host$ cd opengfs-uml
host$ ./autogen.sh --with-linux_srcdir=UML_kernel_source_tree
host$ make
host$ su
host# make install DESTDIR=/mnt
Chroot into the root image and build the initial ramdisk with cluster_mkinitrd.

host# /usr/sbin/chroot /mnt
host# cluster_mkinitrd --uml initrd-ssi.img 2.4.16-21um
Move the new initrd out of the root image, then unmount everything and clean up.

host# mv /mnt/initrd-ssi.img ~username
host# chown username ~username/initrd-ssi.img
host# umount /mnt
host# passemble -r all
host# losetup -d /dev/loop1
host# losetup -d /dev/loop2
host# Ctrl-D
host$ cd ..
Stop the currently running cluster and start again.
host$ ssi-stop
host$ ssi-start
You should see a three-node cluster booting with your new kernel. Feel free to take it through the exercises in Section 3 to make sure it's working correctly.
Download the Red Hat 7.2 root image from the User-Mode Linux (UML) project. As with the root image you downloaded in Section 2.1, it is over 150MB.
Extract the image.
host$ bunzip2 -c root_fs.rh72.pristine.bz2 >root_fs.ext2
Loopback mount the image.
host$ su
host# mkdir /mnt.ext2
host# mount root_fs.ext2 /mnt.ext2 -o loop,ro
Make a blank GFS root image. You also need to create an accompanying lock table image. Be sure you've added support for GFS to your host system by following the instructions in Section 4.5.
host# dd of=root_cidev bs=1024 seek=4096 count=0
host# dd of=root_fs bs=1024 seek=2097152 count=0
host# chmod a+w root_cidev root_fs
host# losetup /dev/loop1 root_cidev
host# losetup /dev/loop2 root_fs
Enter the following pool information into a file named pool0cidev.cf.
poolname pool0cidev
subpools 1
subpool 0 0 1 gfs_data
pooldevice 0 0 /dev/loop1 0
Enter the following pool information into a file named pool0.cf.
poolname pool0
subpools 1
subpool 0 0 1 gfs_data
pooldevice 0 0 /dev/loop2 0
Write the pool information to the loopback devices.
host# ptool pool0cidev.cf
host# ptool pool0.cf
Create the pool devices.
host# passemble
Enter the following lock table into a file named gfscf.cf.
datadev: /dev/pool/pool0
cidev: /dev/pool/pool0cidev
lockdev: 192.168.50.101:15697
cbport: 3001
timeout: 30
STOMITH: NUN
name:none
node: 192.168.50.1 1 SM: none
node: 192.168.50.2 2 SM: none
node: 192.168.50.3 3 SM: none
node: 192.168.50.4 4 SM: none
node: 192.168.50.5 5 SM: none
node: 192.168.50.6 6 SM: none
node: 192.168.50.7 7 SM: none
node: 192.168.50.8 8 SM: none
node: 192.168.50.9 9 SM: none
node: 192.168.50.10 10 SM: none
node: 192.168.50.11 11 SM: none
node: 192.168.50.12 12 SM: none
node: 192.168.50.13 13 SM: none
node: 192.168.50.14 14 SM: none
node: 192.168.50.15 15 SM: none
Write the lock table to the cidev pool device.
host# gfsconf -c gfscf.cf
Format the root disk image.
host# mkfs_gfs -p memexp -t /dev/pool/pool0cidev -j 15 -J 32 -i /dev/pool/pool0
Mount the root image.
host# mount -t gfs -o hostdata=192.168.50.1 /dev/pool/pool0 /mnt
Copy the ext2 root to the GFS image.
host# cp -a /mnt.ext2/. /mnt
Clean up.
host# umount /mnt.ext2
host# rmdir /mnt.ext2
host# Ctrl-D
host$ rm root_fs.ext2
The latest release can be found at the top of the Cluster-Tools section of this release list. At the time of this writing, the latest release is 0.6.5.
Download the latest release. Extract it.
host$ tar jxvf ~/cluster-tools-0.6.5.tar.bz2
Follow these instructions to do a CVS checkout of the latest Cluster Tools code. The modulename is cluster-tools.
To do a developer checkout, you must be a CI developer. If you are interested in becoming a developer, read Section 8.3 and Section 8.4.
Install Cluster Tools onto the new root image.

host$ su
host# cd cluster-tools
host# make install_ssi_redhat UML_ROOT=/mnt
If you built a kernel, as described in Section 4, then follow the instructions in Section 4.4 and Section 4.7 to install kernel and GFS modules onto your new root.
Otherwise, mount the old root image and copy the modules directory from /mnt/lib/modules. Then remount the new root image and copy the modules into it.
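In other words, something like the following sketch. It assumes the old root image is still available on the host and that each image is mounted on /mnt one at a time, as shown in Section 4.6; the temporary directory is just an example.

host# cp -a /mnt/lib/modules /tmp/ssi-modules          # with the old root image mounted on /mnt
host# umount /mnt
host# # ...remount the new root image on /mnt (see Section 4.6)...
host# cp -a /tmp/ssi-modules/. /mnt/lib/modules
host# rm -rf /tmp/ssi-modules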
/etc/rc3.d/S25netfs
/etc/rc3.d/S50snmpd
/etc/rc3.d/S55named
/etc/rc3.d/S55sshd
/etc/rc3.d/S56xinetd
/etc/rc3.d/S80sendmail
/etc/rc3.d/S85gpm
/etc/rc3.d/S85httpd
/etc/rc3.d/S90crond
/etc/rc3.d/S90squid
/etc/rc3.d/S90xfs
/etc/rc3.d/S91smb
/etc/rc3.d/S95innd
You might also want to copy dbdemo and its associated alphabet file into /root/dbdemo. This lets you run the demo described in Section 3.1.
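Assuming you have downloaded the dbdemo program and its alphabet file to the host (the source filenames here are placeholders), copying them in might look like:

host# mkdir -p /mnt/root/dbdemo
host# cp ~/dbdemo ~/alphabet /mnt/root/dbdemo/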
Finally, unmount the new root image and clean up.

host# umount /mnt
host# passemble -r all
host# losetup -d /dev/loop1
host# losetup -d /dev/loop2
Start with the installation instructions for SSI.
If you'd like to install SSI from CVS code, follow these instructions to checkout modulename ssic-linux, and these instructions to checkout modulenames ci-linux and cluster-tools. Read the INSTALL and INSTALL.cvs files in both the ci-linux and ssic-linux sandboxes. Also look at the README file in the cluster-tools sandbox.
For more information, read Section 7.
Start with the SSI project homepage. In particular, the documentation may be of interest. The SourceForge project summary page also has some useful information.
If you have a question or concern, post it to the <ssic-linux-devel@lists.sf.net> mailing list. If you'd like to subscribe, you can do so through this web form.
If you are working from a CVS sandbox, you may also want to sign up for the ssic-linux-checkins mailing list to receive checkin notices. You can do that through this web form.
Start with the CI project homepage. In particular, the documentation may be of interest. The SourceForge project summary page also has some useful information.
If you have a question or concern, post it to the <ci-linux-devel@lists.sf.net> mailing list. If you'd like to subscribe, you can do so through this web form.
If you are working from a CVS sandbox, you may also want to sign up for the ci-linux-checkins mailing list to receive checkin notices. You can do that through this web form.
SSI clustering currently depends on the Global File System (GFS) to provide a single root. The open-source version of GFS is maintained by the OpenGFS project. They also have a SourceForge project summary page.
Right now, GFS requires either a DMEP-equipped shared drive or a lock server outside the cluster. The lock server is the only software solution for coordinating disk access, and it is not truly HA. There are plans to make OpenGFS support IBM's Distributed Lock Manager (DLM), which would distribute the lock server's responsibilities across all the nodes in the cluster. If any node fails, the locks it managed would fail over to other nodes. This would be a true HA software solution for coordinating disk access.
If you have a question or concern, post it to the <opengfs-users@lists.sf.net> mailing list. If you'd like to subscribe, you can do so through this web form.
The User-Mode Linux (UML) project has a homepage and a SourceForge project summary page.
If you have a question or concern, post it to the <user-mode-linux-user@lists.sf.net> mailing list. If you'd like to subscribe, you can do so through this web form.
While using the SSI clustering software, you may run into bugs or features that don't work as well as they should. If so, browse the SSI and CI bug databases to see if someone has seen the same problem. If not, either post a bug yourself or post a message to <ssic-linux-devel@lists.sf.net> to discuss the issue further.
It is important to be as specific as you can in your bug report or posting. Simply saying that the SSI kernel doesn't boot or that it panics is not enough information to diagnose your problem.
There is already some documentation for SSI and CI, but more would certainly be welcome. If you'd like to write instructions for users or internals documentation for developers, post a message to <ssic-linux-devel@lists.sf.net> to express your interest.
Debugging is a great way to get your feet wet as a developer. Browse the SSI and CI bug databases to see what problems need to be fixed. If a bug looks interesting, but is assigned to a developer, contact them to see if they are actually working on it.
After fixing the problem, send your patch to <ssic-linux-devel@lists.sf.net> or <ci-linux-devel@lists.sf.net>. If it looks good, a developer will check it into the repository. After submitting a few patches, you'll probably be invited to become a developer yourself. Then you'll be able to check in your own work.
After fixing a bug or two, you may be inclined to work on enhancing or adding an SSI feature. You can look over the SSI and CI project lists for ideas, or you can suggest something of your own. Before you start working on a feature, discuss it first on <ssic-linux-devel@lists.sf.net> or <ci-linux-devel@lists.sf.net>.