GlusterFS 3.2.6 for XenServer 6.0
I’ve been wanting to test GlusterFS running natively under XenServer for quite some time. In order to do so, I needed to compile GlusterFS specifically for XenServer 6.0. I wouldn’t do this on a production server.
Here’s how I did it, and the resulting RPMs.
Compiling GlusterFS
FUSE is already included in XenServer 6.0, so all we really need to do is install the compiler. To get the compiler, I followed instructions found on the Citrix XenServer forums. Log into your XenServer and run the following:
# cd /etc/yum.repos.d
# mv CentOS-Base.repo CentOS-Base.repo-old
# mv CentOS-Base.repo.orig CentOS-Base.repo
# yum install gcc make automake
Once the compiler is installed use wget to fetch the latest GlusterFS source. Do not untar and ungzip the archive. We’ll also install some support utilities needed for compiling. They are not needed for installing.
# wget http://download.gluster.com/pub/gluster/glusterfs/LATEST/glusterfs-3.2.6.tar.gz
# yum install flex
# yum install bison
# yum install python-ctypes
# yum install fusermount
# yum install readline
# yum install rpm-build
# yum install libibverbs-devel
Once that is done, we’ll build the GlusterFS RPMs.
# rpmbuild -ba glusterfs.spec
# cd /usr/src/redhat/RPMS/i386/
# rpm -ivh glusterfs-core-3.2.6-1.i386.rpm glusterfs-fuse-3.2.6-1.i386.rpm
Installing glusterfs-geo-replication requires a newer version of rsync than XenServer ships with; you can just ‘yum install rsync’ if you need it.
The system is installed and almost ready to go, just load the FUSE module and run the gluster commands as per normal.
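Before mounting anything, it can help to confirm that the kernel really has FUSE registered. The following is a minimal sketch, not part of GlusterFS or XenServer: the helper name `fuse_available` is my own, and it assumes the kernel lists its known filesystem types in /proc/filesystems.

```shell
#!/bin/bash
# fuse_available is a hypothetical helper, not a GlusterFS or XenServer tool.
# It succeeds if the filesystem list (default: /proc/filesystems) contains
# a "fuse" entry. -w matches whole words only, so fusectl and fuseblk do not count.
fuse_available() {
    grep -qw fuse "${1:-/proc/filesystems}"
}
```

As root, something like `fuse_available || modprobe fuse` would then load the module only when it is missing.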
The RPM’s can be installed in a standard install of XenServer, if I remember correctly.
glusterfs-core-3.2.6-1.i386.rpm
glusterfs-fuse-3.2.6-1.i386.rpm
glusterfs-geo-replication-3.2.6-1.i386.rpm
glusterfs-rdma-3.2.6-1.i386.rpm
I have done no testing on these, besides verifying that *core* and *fuse* install and mount a GlusterFS export.
Using GlusterFS as a pseudo DropBox clone
Well, that didn’t last long. Gluster worked fine for a while, then suddenly I couldn’t log in to XFCE on my laptop. I mounted the drive under Gluster as /home (removing the Gluster mount), and it all worked well again. I’m not sure what happened yet, but /home disliked being under Gluster. I’m looking for another solution to replicate my /home.
First, a bit about GlusterFS:
GlusterFS is a scale-out NAS file system developed by Gluster. It aggregates various storage servers over Ethernet or Infiniband RDMA interconnect into one large parallel network file system. GlusterFS is based on a stackable user space design without compromising performance. It has found a variety of applications including cloud computing, biomedical sciences and archival storage. GlusterFS is free software, licensed under GNU GPL v3 license.
The Gluster company was recently purchased by RedHat.
Now a bit about DropBox:
Dropbox is a Web-based file hosting service operated by Dropbox, Inc. that uses cloud storage to enable users to store and share files and folders with others across the Internet using file synchronization.
The portion of DropBox I’ve tried to duplicate is the storage part. File sharing with other users is not a goal here. I need access to my data no matter where I am, even if I don’t have my laptop with me, and I like my data backed up offsite. In my case, the offsite data is behind a VPN, so all I need to get access to my latest data is something that can attach to the VPN. My iPhone can do that. Data syncing works only on the following OSes (the OS needs good FUSE support):
Recommended: Redhat Enterprise GNU/Linux on x86-64 architecture and Ext3 file system
Tested and Supported: Fedora, Debian, CentOS, Ubuntu GNU/Linux on x86-64 architecture and Ext3 (or Ext4 from Linux-2.6.30 onwards) file system, Oracle Solaris 64bit + ZFS
Known to Work: Mac OS X 64bit + HFS+, FreeBSD 64bit + UFS
GlusterFS is fairly portable to any 64-bit POSIX compliant operating system and disk file system with extended attribute support. Native client port requires FUSE device support.
I run mine on Ubuntu 64-bit with EXT4 or XFS as the underlying filesystem.
So why don’t I use DropBox? I really dislike having my data sitting on other people’s servers. My data, my responsibility.
The setup I have is currently on three systems: my laptop (thinkpad), my desktop (quad), and a virtual server sitting in a colo (geo). Thinkpad I have with me all the time. It uses GlusterFS’s ‘replica’ mode to duplicate its data onto quad. Quad then uses GlusterFS’s ‘geo-replication’ mode to get my data onto geo.
There are drawbacks to the method I’m using here. For example, if I’m at a client’s location and not on my internal network, thinkpad doesn’t sync with quad, and therefore data doesn’t get to geo. This is a limitation I can work with, and everything gets in sync eventually (see below). Most of the time, I have some sort of VPN to my internal network, so syncs with quad work, albeit slowly. I’m currently running GlusterFS 3.2.5, but the new GlusterFS 3.3 (now in beta) may solve some of my syncing issues with the new pro-active self heal.
First, we need to install GlusterFS onto all three systems: (thinkpad and quad are running Xubuntu 11.10, while geo is running Ubuntu Server 10.04)
# wget http://download.gluster.com/pub/gluster/glusterfs/LATEST/Ubuntu/11.10/glusterfs_3.2.5-1_amd64.deb
# apt-get install nfs-common
# dpkg --install glusterfs_3.2.5-1_amd64.deb
nfs-common is installed, since GlusterFS also provides an NFS Server component. We won’t be using it, but it’s a dependency for the .deb. We need to make sure the gluster daemon starts at system boot and is currently running (on quad and thinkpad; geo does not need it).
# update-rc.d glusterd defaults
# /etc/init.d/glusterd start
Once GlusterFS is installed, we need to set up replication between thinkpad and quad. In my case, I’m syncing all of /home. It’s about 230 GB right now, but after the initial sync, things run very smoothly. I started with an empty /home on quad. Having existing data in quad’s /home may cause problems. The first thing we’ll do is tell the systems about each other:
thinkpad# gluster peer probe quad
Probe successful
That’s it, both quad and thinkpad know about each other. Now we’ll create the sync volume. But first, I unmounted /home on quad and thinkpad and mounted them in a new position. On thinkpad that is /gluster/export. Since quad didn’t have a separate disk for /home, I created a directory called /raid/gluster/export on my RAID5 array.
thinkpad# gluster volume create sync1 replica 2 transport tcp thinkpad:/gluster/export quad:/raid/gluster/export
Creation of sync1 has been successful
Please start the volume to access data
thinkpad# gluster volume start sync1
We’ll just do a quick check to see how things look.
thinkpad# gluster volume info
Volume Name: sync1
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: thinkpad:/gluster/export
Brick2: quad:/raid/gluster/export
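If you want a script to act on that check rather than reading it by eye, the "Status:" line can be pulled out of the `gluster volume info` output. This is only a sketch: `volume_status` is a name I made up, and it assumes the output keeps the "Key: value" layout shown above.

```shell
#!/bin/bash
# volume_status is a hypothetical helper: it reads `gluster volume info`
# output on stdin and prints the value of the "Status:" line.
volume_status() {
    awk -F': ' '$1 == "Status" { print $2 }'
}
```

A call such as `gluster volume info sync1 | volume_status` should then print the volume state, which a boot script could compare against "Started" before mounting /home.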
Now we’ll mount them at /home for both thinkpad and quad. I’ll show the thinkpad fstab entry, and let you figure out the quad one.
/dev/sda6 /gluster/export ext4 noatime,nodiratime 0 3
thinkpad:/sync1 /home glusterfs noatime,nodiratime 0 0
At this point, we need to force a sync to occur. Depending on the amount of data you have, it could take a while.
thinkpad# find /home -noleaf -print0 | xargs --null stat >/dev/null
Once that’s done, both systems will remain in sync, as long as they are on the same network. I wouldn’t recommend using both systems at the same time.
One of the issues here is that thinkpad is a laptop. I take it with me when I see clients, for vacations, to the coffee shop, etc. That means it’s not always connected to a network, and if it is, not necessarily the same network quad is connected to. I need to force a sync when thinkpad is back where it can see quad. The way I deal with that is quite straightforward: I force a self heal whenever the laptop is powered back on. I initially put the self heal in /etc/rc.local, but Ubuntu seems to run that fairly early in the boot process (S03). So, I created /etc/init.d/local, and set it to run last in the start-up scripts.
#!/bin/bash
#
# this is done after everything else
#
mount /home
# if we are on AC power, we'll force a sync with glusterfs
if /usr/bin/on_ac_power; then
    nice find /home -noleaf -print0 | xargs --null stat >/dev/null &
fi
First I mount /home. The glusterd daemon is started after the disks are mounted, so I force it to happen here. Then, if we are plugged in to wall power, I start a self heal. I should probably do a check for the correct network as well, but the process is easy enough to kill off if I don’t need it.
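The network check mentioned above could look something like the sketch below. None of this is in my actual init script: the function names are hypothetical, the peer is assumed to answer ICMP ping, and I use the portable `xargs -0` spelling and leave out `-noleaf` so the walk also runs outside GNU find.

```shell
#!/bin/bash
# All function names here are hypothetical; only find, xargs, stat and ping
# are real tools.

# Stat every file under a directory, which is what triggers the self heal.
force_heal() {
    find "$1" -print0 | xargs -0 stat >/dev/null
}

# Succeed only if the peer answers a single ping within two seconds.
peer_reachable() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# Heal only when the peer (for example quad) can be reached.
heal_if_reachable() {
    local peer="$1" dir="$2"
    if peer_reachable "$peer"; then
        force_heal "$dir"
    else
        echo "peer $peer not reachable, skipping self heal" >&2
        return 1
    fi
}
```

In the init script, the `find` line would then become something like `nice heal_if_reachable quad /home &`, still inside the AC power test.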
I do something similar with quad, but since quad rarely power cycles, I created a cron job that forces a sync at 1:00 AM.
Now I need to get my data off to geo. Using GlusterFS’s geo-replication I have it sync the data from quad to geo. I chose quad, since it’s essentially always connected to the Internet, with a static VPN to geo.
quad# ssh-keygen
quad# cp .ssh/id_rsa /etc/glusterd/geo-replication/secret.pem
quad# cp .ssh/id_rsa.pub /etc/glusterd/geo-replication/secret.pem.pub
quad# ssh-copy-id root@geo
password:
quad# gluster volume geo-replication sync1 geo:/geo start
Starting geo-replication session between sync1 geo:/geo has been successful
The main difference between replication and geo-replication is that replication is synchronous. Geo-replication is asynchronous, and the data on geo may be out of sync for short periods. It works quite well over the Internet. In fact, GlusterFS uses rsync as the underlying transport for geo-replication. A status check will show you if everything is running.
quad# gluster volume geo-replication sync1 geo:/geo status
MASTER SLAVE STATUS
--------------------------------------------------------------------------------
sync1 geo:/geo OK
And there you have it. I have automatic syncing of data across systems, and access to all of my data from anywhere I need it. It works well for me.
I am looking forward to GlusterFS 3.3, to see how it will help with disconnected/reconnected syncs.
XenServer, iSCSI, and GlusterFS(NFS)
I’ve been running XenServer on iSCSI for quite some time. Performance and reliability have not always been the best, and sometimes it seems as though XenServer gets confused with snapshots and its LVM management.
I decided I needed to either a) switch to NFS or b) change to a different iSCSI provider.
Before I decided anything, I performed some tests. What kinds of IOPS was I getting from NFS vs iSCSI? What about my transfer rate? All this information needed to be gathered from a VM. This is the result of my tests.
Before we start, I’d like to explain why I’ve chosen to use the GlusterFS NFS server over the kernel NFS server. There are three reasons: 1) The kernel NFS server was not stable in my situation, and XenServer lost the NFS share often enough to cause issues. 2) The kernel NFS server was slower in most cases than the Gluster NFS server. Unfortunately, I no longer have the numbers to prove it, so I’ll have to run those tests again another time. 3) I could get iSCSI failover using DRBD and Gluster failover using its replication (both over a separate NIC). In theory, I could have done that with kernel NFS and DRBD as well, but I decided against it.
I was interested in both IOPS and speed of transfer. I needed good IOPS in order to host an SQL Server and a Zimbra server on the SR, while I needed good speeds to host a file server or two.
I did not do any tests using the GlusterFS replication or iSCSI/DRBD.
Rebooting a hung HVM in XenServer
Citrix XenServer has a problem that seems to have become significantly more prominent in the latest 5.6 FP1 release.
HVM (Hardware Virtual Machine) is a true virtualized server. It uses the Intel VT-x and AMD-V extensions to provide virtualization. XenServer provides a QEMU based emulation layer for devices.
PVM (Para-Virtualized Machine) is a virtualized server that requires modified kernel that is ‘Xen aware’ to run. This kernel has special ties into Xen’s ABI, and typically runs faster than an HVM.
Windows based VM’s are para-virtualized by default. Since Citrix/Xen cannot modify the Windows kernel, they use specific device drivers to provide the para-virtualization layer. Linux based VM’s can be PVM’s or HVM’s, based on their support for Xen. Since Citrix has access to the Linux kernel source, they can directly modify (or distributions can modify) the kernel to talk to the Xen ABI. The end result is that XenServer only supports para-virtualization with specific Linux distributions. If you want to run a distribution that is not supported by Citrix, it by default becomes an HVM.
The ‘hung HVM’ issue I’ve seen only affects Linux HVM’s. The Linux HVM essentially locks, becoming non-responsive to pings, local console, basically everything. Using the tools provided by XenServer (XenCenter or command line) has no effect. The VM cannot be shut down or rebooted. If this were a real machine, you could pull the power cord and deal with the results later. At least you could get the machine up and running again.
My past approach to rebooting a hung HVM was to migrate or shut down all the VM’s running on the XenServer, reboot the XenServer itself, and then bring the VM’s back. This was a task that would usually have to wait until after business hours. The hit on server downtime and impact on users was usually too great to do the reboot during the day.
A discussion on the Citrix forums finally brought a solution to light. It’s entirely command line based, can be done during business hours, and doesn’t affect any other running VM. A good solution.
- Find the UUID of the hung VM.
You can do this via the command line with ‘xe vm-list’ or via XenCenter.
- Find the Domain ID of the hung VM.
Run ‘list_domains’ from the command line, and match the UUID with the ID number.
id | uuid | state
0 | 2fe455fe-3185-4abc-bff6-a3e9a04680b0 | R
47 | 267227f3-a59e-dafe-b183-82210cf51ec4 | B
59 | 298817fb-8a3e-7501-11e0-045a8aa860ff | B
60 | 46e3d5aa-2f02-dfdc-b053-9a8ac56ec5d1 | B
61 | 16cf3204-eb17-5a12-e8d0-c72087bda690 | B
62 | 1f9053b5-c6ca-40bb-504e-3017c37e7281 | H
63 | ddaec491-097a-e271-362b-f2f985e26e4a | R
65 | 55f3b225-4f65-d1ea-aa19-add44c5acce7 | B
66 | 7adef6fd-9171-5426-b333-6fb1b57b8e60 | B H
67 | 6046dc13-f70b-8398-56fb-069c22440a7c | B
68 | f201cd94-a501-00c2-d21e-8c2f03ea167b | B H
In our case, UUID 1f9053b5-c6ca-40bb-504e-3017c37e7281 is hung, which is Domain ID 62.
- Run destroy_domain on the Domain ID.
# /opt/xensource/debug/destroy_domain -domid 62
- The VM will still show itself as running, so now, we need to reboot it.
# xe vm-reboot name-label='name of the VM' --force
- The VM is now rebooted, and you can bring it up as if you had just pulled the plug. That is, check for some disk corruption, etc.
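Matching the UUID to the Domain ID by eye is easy to get wrong on a busy host, so the lookup can be scripted. This is a sketch only: `domid_for_uuid` is a helper name I invented, and it assumes list_domains keeps the pipe-separated "id | uuid | state" layout shown above.

```shell
#!/bin/bash
# domid_for_uuid is a hypothetical helper: it reads list_domains output on
# stdin and prints the domain ID whose UUID matches the first argument.
domid_for_uuid() {
    awk -F'|' -v uuid="$1" '
        { gsub(/[ \t]/, "", $1); gsub(/[ \t]/, "", $2) }   # strip padding around the columns
        $2 == uuid { print $1 }
    '
}
```

On the XenServer host, `list_domains | domid_for_uuid 1f9053b5-c6ca-40bb-504e-3017c37e7281` would print the ID to hand to destroy_domain. I would still look at the number before destroying anything.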
Thanks go to the Citrix Forum folks for helping us come up with a solution.
Which Hard Drive Failed in my Linux software RAID array?
One of the items I’m always asked about, when it comes to replacing a failed drive in Linux software RAID, is “Which physical drive failed?”
Linux maps its hard drives using UDEV, and doesn’t guarantee that a drive that is mounted at sda today will be mounted as sda tomorrow. It may mount as sdb, or sdc. So, when you get a failed drive, and /proc/mdstat tells you /dev/sdc1 failed, how do you know which physical drive it really is? In RAID-5, if you pull the wrong physical drive, chances are you’ve lost all your data. In RAID-6, you get an extra chance.
The answer all boils down to the drive’s serial number. You need to find the mapping between the mounted drive and its serial number via ‘lshw’.
# lshw -class disk
*-disk:4
    description: ATA Disk
    product: WDC WD2001FASS-0
    vendor: Western Digital
    physical id: 0.0.0
    bus info: scsi@5:0.0.0
    logical name: /dev/sde
    version: 01.0
    serial: WD-WMAUR0279543
    size: 1863GiB (2TB)
    capabilities: partitioned partitioned:dos
    configuration: ansiversion=5 signature=0007bd23
The two lines from the above output that concern us are ‘logical name’ and ‘serial’. The serial number will match what is printed on the hard drive’s label, and the logical name will match the failed drive in /proc/mdstat.
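With more than a couple of drives, it is handy to boil the lshw output down to just those two fields. The following is a sketch under assumptions: `disk_serials` is a made-up helper name, and it relies on lshw printing ‘logical name’ before ‘serial’ for each disk, as in the output above.

```shell
#!/bin/bash
# disk_serials is a hypothetical helper: it reads `lshw -class disk` output
# on stdin and prints one "logical-name serial" pair per disk.
disk_serials() {
    awk '
        /logical name:/ { name = $3 }                  # e.g. /dev/sde
        /serial:/       { print name, $2; name = "" }  # pair it with the serial that follows
    '
}
```

Running `lshw -class disk | disk_serials` as root should give a short list that can go straight into the drive list kept with the system.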
Just to make things easier for me, rather than pull all the drives out of a box looking for the proper one, I keep a list of the drives, their positions and serial numbers. I keep this list with the system, and reference it if I need to replace a drive. I also make sure to update it with the new drive’s information.
Another way I do it, is to simply use a label maker to place the serial number into a more visible position on the drive.
All you have to be able to do is see the serial number clearly. It’s way better than pulling them all until you find the right one.
BackupPC with ssh/rsync/VSS on Windows Server
I back up several Windows 2003 servers with BackupPC and rsync over ssh. This is basically how I do it.
- Download cygwin 1.7+ from the regular place
- Install the default system, plus the following: cygrunsrv, openssh, and rsync
- I also install joe (my editor of choice), and procps (for top)
- Start a cygwin shell and type:
# ssh-host-config -y
# cygrunsrv -S sshd
- I then attempt to ssh into my BackupPC system from cygwin. It’s a nice test, and it creates the .ssh directory for me.
- Copy BackupPC’s public key over
# scp root@backuppc:/var/lib/backuppc/BackupPCkey.pub ~/.ssh/authorized_keys
- Install the following scripts to Administrator’s home directory
pre-backuppc.sh
#!/bin/bash
# script to create shadow copies of Windows drives and export them to a
# drive letter for BackupPC backup
# the shadow copies get mounted to c:\shadow\(drive letter). The directory
# structure must exist
##############################################################################
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see
# http://www.gnu.org/licenses/
##############################################################################
# Launches passed input via 'at' to get around $USERNAME=SYSTEM
# problem under ssh login where the shell lacks permissions to run
# commands like vshadow or dosdev
function at_launch ()
{
local h m s wait1 command
if [ -n "$3" ] ; then # quoted test, so an empty or missing third argument does not break [
command="${1} ${2} >> ${3}"
else
command="${1} ${2}"
fi
set -- $(date +"%H %M %S")
h=$((10#$1)) #Note explicitly use base 10 so that 08 and 09 not interpreted as bad octal
m=$((10#$2 +1)) #Advance minutes by 1
s=$((10#$3))
wait1=$((60 - $s))
[ $s -gt 55 ] && let "m += 1" "wait1 += 60" # Make sure >5 seconds left
[ $m -ge 60 ] && let "m %= 60" "h += 1" #Overflow minutes
let "h %= 24"
at $h:$m $(cygpath -w $(which bash.exe)) -c \"$command\"
# > /dev/null
echo Running \'$command\' at $h:$m
return $wait1
}
# create the command to shadow the drive, and wait 2 minutes plus the seconds
# before the at command runs before returning (make sure shadow copy is made)
function shadow_drive ()
{
date
echo Shadowing Drive $@
local wait1
drive=$@
at_launch "/cygdrive/c/WINDOWS/vshadow.exe -p" $drive "/home/Administrator/vshadow-out"
wait1=$?
let "wait1 += 120"
echo sleeping for $wait1
sleep $wait1
date
echo done sleep
}
# get the guids from the vshadow-out file and place them into the shadow-guids
# file for the post-backuppc script's use (and ours for mapping to directory)
function get_guids ()
{
echo Getting shadow copy GUIDS
cat ~/vshadow-out | grep "* SNAPSHOT ID" | awk '{print $5}' >> ~/shadow-guids
}
function map_shadow ()
{
echo Mapping GUID $1 to $2
local wait1
local guid="$1"
local dir="$2"
at_launch /cygdrive/c/WINDOWS/vshadow.exe -el=$guid,$dir "/home/Administrator/map.out"
wait1=$?
let "wait1 +=30"
sleep $wait1
}
date
# get rid of the guids file if it exists
rm -f ~/shadow-guids
rm -f ~/vshadow-out
rm -f ~/map.out
sleep 10
# create the snapshots
shadow_drive c:
shadow_drive h:
shadow_drive j:
# get the guids into a single file
get_guids
loop=0
# create the shadow directory structure AFTER we make the shadow copies
# the post-backuppc.sh script deletes this tree after removing the mounts
mkdir /cygdrive/c/shadow
mkdir /cygdrive/c/shadow/c
mkdir /cygdrive/c/shadow/h
mkdir /cygdrive/c/shadow/j
# loop through the guids and map to mount point
# assumes guids in file are in order of shadows created
while read line ;
do
if [ $loop == 0 ] ; then
map_shadow $line "c:\\\\\shadow\\\\\c"
fi
if [ $loop == 1 ] ; then
map_shadow $line "c:\\\\\shadow\\\\\h"
fi
if [ $loop == 2 ] ; then
map_shadow $line "c:\\\\\shadow\\\\\j"
fi
let "loop += 1"
done < ~/shadow-guids
post-backuppc.sh
#!/bin/bash
# script to delete the shadow copies used by backuppc
while read line ;
do
vshadow -ds=$line
done < ~/shadow-guids
# now clean up the directory structure
#rmdir /cygdrive/c/shadow
- In the BackupPC config for the system (I use the web page to edit the config), add the following:
DumpPreUserCmd: $sshPath -c blowfish -q -x -l Administrator <windows server name> /usr/bin/bash -l -c /home/Administrator/pre-backuppc.sh
DumpPostUserCmd: $sshPath -c blowfish -q -x -l Administrator <windows server name> /usr/bin/bash -l -c /home/Administrator/post-backuppc.sh
* sometime between the release of Cygwin 1.7 and the current version, ssh login became case sensitive. I used to log in with ‘-l administrator’ and now I have to log in with ‘-l Administrator’. That one set me back a bit.
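The trickiest part of pre-backuppc.sh is the time arithmetic in at_launch, so here it is pulled out on its own. This is a sketch rather than a replacement: `next_slot` is a hypothetical name, it takes the hour, minute and second as arguments instead of calling `date`, and it needs bash because of the `10#` base prefix.

```shell
#!/bin/bash
# next_slot is a hypothetical, side-effect-free version of the arithmetic in
# at_launch. Arguments: hour minute second (zero-padded values are fine).
# It prints "H:M WAIT": the time to hand to `at` and the seconds until then.
next_slot() {
    # 10# forces base 10, so 08 and 09 are not rejected as bad octal
    local h=$((10#$1)) m=$((10#$2 + 1)) s=$((10#$3))
    local wait1=$((60 - s))
    if [ "$s" -gt 55 ]; then        # fewer than 5 seconds left: use the following minute
        m=$((m + 1))
        wait1=$((wait1 + 60))
    fi
    if [ "$m" -ge 60 ]; then        # minute overflow rolls into the next hour
        m=$((m % 60))
        h=$((h + 1))
    fi
    h=$((h % 24))                   # hour overflow wraps past midnight
    echo "$h:$m $wait1"
}
```

For example, `next_slot $(date +"%H %M %S")` shows what at_launch would schedule right now. Keeping the sum in a function like this makes the midnight and end-of-minute cases easy to try by hand.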
Once you make a change to the backups, make sure you do a full backup right away. It’s a good test, and BackupPC doesn’t like its backup directory structure changed during incrementals.
And that’s it. I have BackupPC back up /cygdrive/c/shadow, and all is well. Now, if someone can tell me how to color/syntax highlight bash code in WordPress, I’d be happy.
Portions of GPLv3 code in this script were taken from http://sourceforge.net/apps/mediawiki/backuppc/index.php?title=User_Scripts_-_Client_-_Windows_VSS, which is a much more complicated script than I needed.
