Thursday, October 3, 2013

Backing-up a bunch of small files to a remote server

I have a directory, containing lots of files, and I want an off-site, secure backup.

Even though the remote server might be a dedicated server that only I know the root password for, I still don't trust it. Because of the recent NSA revelations I no longer consider myself paranoid. Thanks guys, I can look at myself in the mirror again!

As a final restriction, I don't want to have to make any temp files locally: disk space is tight, and the files can get very big.


Here we go:

cd BASE_DIR
tar cvf - MYFOLDER/ | gpg -c --passphrase XXX | ssh REMOTE_SERVER 'cat > ~/MYFOLDER.tar.gpg'


(The bits in capitals are the things you replace.)

Notes
  • The "v" in "tar cvf" means verbose. Once you are happy it is working you will want to use "tar cf" instead.
  • The passphrase has to be given on the command line because stdin is being used for the data! A better way is to put the passphrase in another file: --passphrase-file passfile.txt (see the sketch after these notes). However, note that this is only "better" on multi-user machines; on a single-user machine there is no real difference.
  • I'm using symmetric encryption. You could instead encrypt with your key pair, in which case the middle bit changes to: gpg -e -r PERSON. Then you won't need to specify a passphrase.
  • In my case REMOTE_SERVER is an alias to an entry in ~/.ssh/config. If you are not using that approach, you'll need to specify username, port number, identity file, etc. By the way, I'm not sure this method will work with password login, only keypair login, because stdin is being used for the data.
  • Any previous MYFOLDER.tar.gpg gets replaced on the remote server. So, if the connection gets lost halfway through the upload, you've lost your previous backup too. I suggest putting a datestamp in the filename, or something like that (see the sketch below).
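Putting a couple of those notes together, here is a sketch of a combined variant: the passphrase comes from a file and the remote filename gets a datestamp. (Just an illustration, not something I've battle-tested; passfile.txt is a made-up name, and depending on your gpg version you may or may not need --batch alongside --passphrase-file.)

cd BASE_DIR
STAMP=$(date +%Y-%m-%d)   # e.g. 2013-10-03
tar cf - MYFOLDER/ | gpg -c --batch --passphrase-file passfile.txt | ssh REMOTE_SERVER "cat > ~/MYFOLDER.$STAMP.tar.gpg"

Note the double quotes around the remote command, so that $STAMP is expanded locally before ssh sends it; the ~ still gets expanded on the remote side.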
What about getting the data back?

cd TMP_DIR
ssh REMOTE_SERVER 'cat ~/MYFOLDER.tar.gpg' | gpg -d --passphrase XXX | tar xf -


You should now have a directory called MYFOLDER, with all your files exactly as they were.
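If you just want to check a backup without unpacking it, the same pipe with "tar tf -" on the end will list the archive contents instead of extracting them:

ssh REMOTE_SERVER 'cat ~/MYFOLDER.tar.gpg' | gpg -d --passphrase XXX | tar tf -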


Outstanding questions

Is it possible to use this approach with Amazon S3, Google Drive, Rackspace Cloud Files, or similar storage providers? E.g. 100GB mounted as a Rackspace drive is $15/month (plus the compute instance, of course, but I already have that), whereas 100GB as Cloud Files is $10/month, or $5/month on Google Drive ($9.50/month on S3, or $1/month for Glacier storage). Up to 15x cheaper: that is quite an incentive.

2013-10-08 Update: The implicit first half of that question is: is there a way to stream stdout to the remote drive (whether using scp or a provider-specific command-line tool)?
For Amazon S3 the answer is a clear "no": http://stackoverflow.com/q/11747703/841830 (the size has to be known in advance).
For Google Drive the answer is maybe. There is a way to mount Google Drive with FUSE: https://github.com/jcline/fuse-google-drive   It looks very complicated, describes itself as alpha, and the URL for the tutorial is a 404.
For Rackspace Cloud Files (and this should cover other OpenStack Swift providers), you can use curl to stream data! See "4.3.2.3. Chunked Transfer Encoding" in the Cloud Files developer guide. HOWEVER, note that there is a 5GB limit per file, which is a show-stopper for me. (Though by replacing "ssh REMOTE_SERVER 'cat > ~/MYFOLDER.tar.gpg'" with a custom script, I could track bytes transferred and start a new connection and file name at the 5GB point, so there is still hope. Probably only 10 lines of PHP would do it; see the sketch below for one possible shape. But if I'm going to do that, I could just as easily buffer, say, 512MB in memory at a time and use S3.)
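For what it's worth, here is an untested sketch of how the chunking could be done without any PHP, using GNU split (8.13 or later, for the --filter option) to cut the stream into pieces and curl to PUT each piece with chunked transfer encoding. TOKEN and STORAGE_URL would come from a separate Cloud Files authentication call, and MY_CONTAINER is a placeholder:

export TOKEN=XXX                                   # auth token from the identity API
export STORAGE_URL=https://STORAGE_HOST/v1/MY_ACCOUNT
# 4GB pieces keep each object comfortably under the 5GB limit
tar cf - MYFOLDER/ | gpg -c --batch --passphrase-file passfile.txt | split --bytes=4G --filter='curl -s -T - -H "X-Auth-Token: $TOKEN" "$STORAGE_URL/MY_CONTAINER/$FILE"' - MYFOLDER.tar.gpg.

split names the pieces MYFOLDER.tar.gpg.aa, .ab, and so on, and exports each name as $FILE to the filter command; because curl reads from stdin (-T -) it does not need to know the size in advance.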

NOTE: Because I've not found an ideal solution yet, I never even got to the implicit second part of the question, which is whether the need to "cat" on the remote server side will cause problems. I think not, but I need to try it to be sure.

