I'm new to system/database adminstration. I'm currently running a web application that recieves around ~55,000 hits/day, and whose database grows by about 60 MiB of plain-text per day.
I came up with a simple, intuitive database backup solution using Git to track revisions. (And I wasn't the first.) If you've got a small- to medium-sized database, this may work for you too.
Hand-versioning: really inefficient
Backing up your database is important, but one must also consider keeping revisions of those backups, since if you keep only the latest backup, but you risk overwriting a good backup with bad ones.
bad-dump.sql -> good-dump-from-yesterday.sql
You might consider writing each backup to its own file, but that means each backup will consume the entire size of the database each backup, in my case, ~10 GiB per day.
dump1.sql 10 GiB
dump2.sql 10 GiB
dump3.sql 10 GiB
If most of the data in your database is static, that's a lot of wasted space.x
Enter: Git
Git is a revision control system used primarily in software development to track changes in source code over time. A database backup, a.k.a a database dump, is just a bunch of SQL statements in a text file, so why not use Git to track changes in your database backups as well?
The backup process is simple: Dump your database over the previous backup, and commit. Here's an example backup script using MySQL:
#! /bin/sh # Settings BACKUP_DIR="/path/to/your/backup/directory" DB_USER="[your database user]" DB_PASS="[your database user's password]" DB="[database name]" DB_DUMP="$BACKUP_DIR/$DB.sql" # Create your backup directory if it doesn't exist mkdir -p $BACKUP_DIR # Dump the database mysqldump -u $DB_USER -p$DB_PASS --skip-extended-insert $DB > $DB_DUMP # Change to the backup directory and initialize a new Git repo if necessary cd $BACKUP_DIR git init # Add the database to the repo and commit git add $DB_DUMP git commit -m "Update database dump"
A couple things to note about this script:
--skip-extended-insertmakes mysqldump create an insert statement on a new line for each row. This means a larger initial commit, but all future commits will be smaller due to the fact that Git tracks line changes.- Running
git initon an existing repository is safe, and will not overwrite things that are already there.
Now, just throw this script in /etc/cron.daily/ for instant, daily versioned
database backup.
Now what?
Anything you can do with a normal Git repository, including:
- Push to a remote repository on a backup server
- Pull the latest database changes to your development server
- View changes to your database with
git whatchanged
Optimizations
As suggested by Bram Schoenmakers, here are some Git configurations that should decrease disk usage of the repository:
- core.compression = 9 - Flag for gzip to specify the compression level for blobs and packs. Level 1 is fast with larger file sizes, level 9 takes more time but results in better compression.
- repack.usedeltabaseoffset = true - Defaults to false for compatibility reasons, but is supported with Git >=1.4.4.
- pack.windowMemory = 100m - (Re)packing objects may consume lots of memory. To prevent all your resources [from going] down the drain it's useful to put some limits on that. There is also pack.deltaCacheSize.
- pack.window = 15 - Defaults to 10. With a higher value, Git tries harder to find similar blobs.
- gc.auto = 1000 - Defaults to 6700. As indicated in the article, it is recommended to run
git gcevery once in a while. Personally, I rungit gc --auto everyday, so [it] only packs things when there's enough garbage.git gc --autonormally only triggers the packing mechanism when there are 6700 loose objects around. This flag lowers this amount.- gc.autopacklimit 10 - Defaults to 50. Every time you run
git gc, a new pack is generated [of] the loose objects. Over time you get too many packs which wastes space. It is a good idea to combine all packs once in a while into a single pack, so all objects can be combined and deltified. By defaultgit gcdoes this when there are 50 packs around. But for this situation a lower number may be better.
References
- "Backup your Database in Git" by David Eisinger
- git-init man page