
Using GlusterFS as a pseudo Dropbox clone

Posted by admin in Blog, January 6, 2012

Well, that didn’t last long. Gluster worked fine for a while, then suddenly I couldn’t log in to XFCE on my laptop. I removed the Gluster mount and mounted the underlying drive directly as /home, and it all worked well again. I’m not sure what happened yet, but /home disliked being under Gluster. I’m looking for another solution to replicate my /home.

 

First, a bit about GlusterFS:

GlusterFS is a scale-out NAS file system developed by Gluster. It aggregates various storage servers over Ethernet or InfiniBand RDMA interconnects into one large parallel network file system. GlusterFS is based on a stackable user space design without compromising performance. It has found a variety of applications including cloud computing, biomedical sciences and archival storage. GlusterFS is free software, licensed under the GNU GPL v3.

The Gluster company was recently purchased by Red Hat.

Now a bit about Dropbox:

Dropbox is a Web-based file hosting service operated by Dropbox, Inc. that uses cloud storage to enable users to store and share files and folders with others across the Internet using file synchronization.

The portion of Dropbox I’ve tried to duplicate is the storage part. File sharing with other users is not a goal here. I need access to my data no matter where I am, even if I don’t have my laptop with me, and I like my data backed up offsite. In my case, the offsite data is behind a VPN, so all I need to get access to my latest data is something that can attach to the VPN. My iPhone can do that. Data syncing works only on the following operating systems (the OS needs good FUSE support):

Recommended: Redhat Enterprise GNU/Linux on x86-64 architecture and Ext3 file system
Tested and Supported: Fedora, Debian, CentOS, Ubuntu GNU/Linux on x86-64 architecture and Ext3 (or Ext4 from Linux-2.6.30 onwards) file system, Oracle Solaris 64bit + ZFS
Known to Work: Mac OS X 64bit + HFS+, FreeBSD 64bit + UFS

GlusterFS is fairly portable to any 64-bit POSIX compliant operating system and disk file system with extended attribute support. Native client port requires FUSE device support.

I run mine on Ubuntu 64-bit with EXT4 or XFS as the underlying filesystem.

So why don’t I use Dropbox? I really dislike having my data sitting on other people’s servers. My data, my responsibility.

The setup I have currently spans three systems: my laptop (thinkpad), my desktop (quad), and a virtual server sitting in a colo (geo). Thinkpad I have with me all the time. It uses GlusterFS’s ‘replica’ mode to duplicate its data onto quad. Quad then uses GlusterFS’s ‘geo-replication’ mode to get my data onto geo.

There are drawbacks to the method I’m using here. For example, if I’m at a client’s location and not on my internal network, thinkpad doesn’t sync with quad, and therefore data doesn’t get to geo. This is a limitation I can work with, and everything gets in sync eventually (see below). Most of the time, I have some sort of VPN to my internal network, so syncs with quad work, albeit slowly. I’m currently running GlusterFS 3.2.5, but the new GlusterFS 3.3 (now in beta) may solve some of my syncing issues with the new proactive self heal.

First, we need to install GlusterFS on all three systems (thinkpad and quad are running Xubuntu 11.10, while geo is running Ubuntu Server 10.04):

# wget http://download.gluster.com/pub/gluster/glusterfs/LATEST/Ubuntu/11.10/glusterfs_3.2.5-1_amd64.deb
# apt-get install nfs-common
# dpkg --install glusterfs_3.2.5-1_amd64.deb

nfs-common is installed because GlusterFS also provides an NFS server component. We won’t be using it, but it’s a dependency for the .deb. We need to make sure the gluster daemon starts at system boot and is currently running (on quad and thinkpad; geo does not need it).

# update-rc.d glusterd defaults
# /etc/init.d/glusterd start

Once GlusterFS is installed, we need to set up replication between thinkpad and quad. In my case, I’m syncing all of /home. It’s about 230 GB right now, but after the initial sync, things run very smoothly. I started with an empty /home on quad. Having existing data in quad’s /home may cause problems. The first thing we’ll do is tell the systems about each other:

thinkpad# gluster peer probe quad
Probe successful

That’s it: both quad and thinkpad know about each other. Now we’ll create the sync volume. But first, I unmounted /home on quad and thinkpad and mounted the underlying storage in a new position. On thinkpad that is /gluster/export. Since quad didn’t have a separate disk for /home, I created a directory called /raid/gluster/export on my RAID5 array.

thinkpad# gluster volume create sync1 replica 2 transport tcp thinkpad:/gluster/export quad:/raid/gluster/export
Creation of sync1 has been successful
Please start the volume to access data
thinkpad# gluster volume start sync1

We’ll just do a quick check to see how things look.

thinkpad# gluster volume info
Volume Name: sync1
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: thinkpad:/gluster/export
Brick2: quad:/raid/gluster/export
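If you would rather check this from a script than by eye, a small filter over that output is enough. The following is only a sketch: `volume_is_started` is a hypothetical helper name, not part of the gluster CLI, and it assumes the "Volume Name:" and "Status:" lines look like the output shown here.

```shell
#!/bin/bash
# Sketch: read `gluster volume info` output on stdin and succeed only if
# the named volume reports "Status: Started".
# volume_is_started is a hypothetical helper, not a gluster command.
volume_is_started() {
  local volume="$1"
  awk -v vol="$volume" '
    # Remember which volume the following lines belong to.
    $1 == "Volume" && $2 == "Name:" { current = $3 }
    # Flag a match when that volume is started.
    $1 == "Status:" && current == vol && $2 == "Started" { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}
```

It would be used as `gluster volume info | volume_is_started sync1 && echo "sync1 is up"`.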

Now we’ll mount them at /home for both thinkpad and quad.  I’ll show the thinkpad fstab entry, and let you figure out the quad one.

/dev/sda6         /gluster/export   ext4         noatime,nodiratime 0 3
thinkpad:/sync1   /home             glusterfs    noatime,nodiratime 0 0
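For the quad side, my assumption is that the client entry simply mirrors the thinkpad one with the server name swapped. Quad's brick is a directory on the RAID array, so it needs no extra device line. The helper below, `gluster_home_fstab_line`, is a hypothetical name used only to print such a line. Treat it as a sketch, not a tested configuration.

```shell
#!/bin/bash
# Sketch: print a GlusterFS client fstab line for /home.
# gluster_home_fstab_line is a hypothetical helper; the mount options are
# copied from the thinkpad entry, and using them on quad is an assumption.
gluster_home_fstab_line() {
  local server="$1" volume="$2"
  printf '%-17s %-17s %-12s %s\n' \
    "$server:/$volume" /home glusterfs 'noatime,nodiratime 0 0'
}

# Print the line that would go into /etc/fstab on quad.
gluster_home_fstab_line quad sync1
```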

At this point, we need to force a sync to occur. Depending on the amount of data you have, it could take a while.

thinkpad# find /home -noleaf -print0 | xargs --null stat >/dev/null

Once that’s done, both systems will remain in sync, as long as they are on the same network.  I wouldn’t recommend using both systems at the same time.

One of the issues here is that thinkpad is a laptop. I take it with me when I see clients, on vacations, to the coffee shop, etc. That means it’s not always connected to a network, and if it is, not necessarily the same network quad is connected to. I need to force a sync when thinkpad is back where it can see quad. The way I deal with that is quite straightforward: I force a self heal whenever the laptop is powered back on. I initially put the self heal in /etc/rc.local, but Ubuntu seems to run that fairly early in the boot process (S03). So I created /etc/init.d/local and set it to run last in the start-up scripts.

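#!/bin/bash
#
# this is done after everything else
#
# glusterd starts after the local disks are mounted, so mount /home here
mount /home
# if we are on AC power, we'll force a sync with glusterfs
if /usr/bin/on_ac_power; then
  # stat every file in the background at low priority to trigger the self heal
  nice find /home -noleaf -print0 | xargs --null stat >/dev/null &
fi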

First I mount /home.  The glusterd daemon is started after the disks are mounted, so I force it to happen here.  Then, if we are plugged in to wall power, I start a self heal.  I should probably do a check for the correct network as well, but the process is easy enough to kill off if I don’t need it.
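The network check mentioned above could be as simple as pinging quad before starting the heal. This is a minimal sketch, not something I run: `peer_reachable` and the `PROBE` override are hypothetical names, and reachability is only a stand-in for being on the right network, since quad will also answer over a slow VPN.

```shell
#!/bin/bash
# Sketch: succeed only when the replica peer answers a probe.
# peer_reachable and PROBE are hypothetical names; PROBE lets the probe
# command be swapped out, for example when testing without a network.
peer_reachable() {
  local host="$1"
  # Default probe: one ICMP echo request with a 2-second timeout
  # (Linux iputils ping options).
  ${PROBE:-ping -c 1 -W 2} "$host" >/dev/null 2>&1
}

# Intended use in /etc/init.d/local, in place of the plain on_ac_power test:
#   if /usr/bin/on_ac_power && peer_reachable quad; then
#     nice find /home -noleaf -print0 | xargs --null stat >/dev/null &
#   fi
```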

I do something similar with quad, but since quad rarely power cycles, I created a cron job that forces a sync at 1:00 AM.
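A sketch of what that cron job could call is below. `force_self_heal` is a hypothetical function name, and the body is the same find and stat pipeline used for the initial sync, wrapped so the mount point can be passed in.

```shell
#!/bin/bash
# Sketch of a script cron could run on quad.
# force_self_heal is a hypothetical name. It stats every file under the
# given mount point (default /home), which is the same trigger used for
# the initial sync.
force_self_heal() {
  local mount_point="${1:-/home}"
  nice find "$mount_point" -noleaf -print0 | xargs --null stat >/dev/null
}
```

Saved as a script that ends by calling `force_self_heal /home`, it could be scheduled with a crontab entry such as `0 1 * * * /usr/local/sbin/force-self-heal`, where the path is only an example.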

Now I need to get my data off to geo.  Using GlusterFS’s geo-replication I have it sync the data from quad to geo.  I chose quad, since it’s essentially always connected to the Internet, with a static VPN to geo.

quad# ssh-keygen
quad# cp .ssh/id_rsa /etc/glusterd/geo-replication/secret.pem
quad# cp .ssh/id_rsa.pub /etc/glusterd/geo-replication/secret.pem.pub
quad# ssh-copy-id root@geo
password:
quad# gluster volume geo-replication sync1 geo:/geo start
Starting geo-replication session between sync1 geo:/geo has been successful

The main difference between replication and geo-replication is that replication is synchronous.  Geo-replication is asynchronous, and the data on geo may be out of sync for short periods.  It works quite well over the Internet.  In fact, GlusterFS uses rsync as the underlying transport for geo-replication.  A status check will show you if everything is running.

quad# gluster volume geo-replication sync1 geo:/geo status

MASTER     SLAVE        STATUS
--------------------------------------------------------------------------------
sync1      geo:/geo     OK
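The same status output can feed a scripted check, for example from a nightly cron job that should complain when the session is not healthy. The sketch below assumes the three-column MASTER / SLAVE / STATUS layout shown here; `georep_is_ok` is a hypothetical helper name, not a gluster command.

```shell
#!/bin/bash
# Sketch: read geo-replication status output on stdin and succeed only if
# the row for the given master and slave ends in OK.
# georep_is_ok is a hypothetical helper; it assumes whitespace-separated
# MASTER, SLAVE and STATUS columns.
georep_is_ok() {
  local master="$1" slave="$2"
  awk -v m="$master" -v s="$slave" '
    # Pick the row for this session and keep its last column.
    $1 == m && $2 == s { status = $NF }
    END { exit (status == "OK" ? 0 : 1) }
  '
}
```

It would be used as `gluster volume geo-replication sync1 geo:/geo status | georep_is_ok sync1 geo:/geo || echo "geo-replication needs attention"`.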

 

And there you have it.  I have automatic syncing of data across systems, and access to all of my data from anywhere I need it.  It works well for me.

I am looking forward to GlusterFS 3.3, to see how it will help with disconnected/reconnected syncs.