Ben Ward has a great post up on YDN discussing the massive
addition of microformats to the latest refresh of Kelkoo. ~27
million hListings can’t be wrong! It’s a great read, and a real
validation of the concept.
Moreover, it’s an excellent introduction to microformats as API. There’s
no official Kelkoo API to grab and process listings or reviews or companies,
but the judicious use of microformats makes it easy for you to grab and
parse the information out of the page on your own. The article’s very
much worth paging through; I suggest you take a look.
One of the supreme pleasures of my job is the simple chance to work with real experts in the field on a daily basis. Mike Davies, for example, knows an incredible amount about building accessible websites, and never hesitates to share his opinions with the rest of us. He’s an innovative developer in many other ways, but this is a particular area of expertise.
Just this week, he’s started a new site, Accessibility Tips, that is absolutely worth sticking in your RSS reader. It looks like it’s going to be a brilliant resource, codifying best practices for building clean and accessible websites in an easy-to-understand and well-justified manner.
He’s got 4 articles up so far, and I’m already finding myself filing bugs against the sites I work on to put in some quick fixes. “Providing Link Text” might well have been custom-written as a quick reminder that my baby needs work. :)
Nicely done, Mike.
So, the disk on which I keep the main copy of my Aperture library started
making strange clicking noises when plugged into my powerbook. It makes these
noises instead of the expected whirring and humming and actual reading of
data. This, as you may suspect, is a Bad Thing.
Thankfully, it works perfectly when plugged into my work laptop, so I’ve spent
the majority of the day descending into full-blown backup paranoia. I’ve
consolidated all my important files and cloned them onto three separate hard drives.
Now I’m beginning the process of burning a million DVDs. Of course, the
aforementioned paranoia forces me to recognize that DVDs degrade; I can’t
trust them, you see. What to do?
This is thankfully a solved problem.
Usenet posters have used something called Parchive for years now to post
binary files with some guarantee of completeness in the intrinsically lossy
world of globally mirrored newsgroups. Along with the actual data that’s
written to a newsgroup, the poster will upload a number of PAR files
containing parity information that allows you to regenerate any lost data.
Without going into the details of Reed-Solomon error correction, this
means that if a few pieces of the data you’re downloading are missing, you can
generate them yourself, ensuring that the original signal gets through.
The same theory applies to DVDs. I don’t particularly trust the medium to
guarantee successful backups over time, but I do trust that they’ll
probably retain 99% of the bits I care about. Parchive, therefore, looks like
a great solution.
Installing parchive
Installing Parchive is trivial. You can grab binaries off SourceForge, or
check out the CVS tree and compile it yourself, like so:
cvs -d:pserver:anonymous@parchive.cvs.sourceforge.net:/cvsroot/parchive login
cvs -z3 -d:pserver:anonymous@parchive.cvs.sourceforge.net:/cvsroot/parchive co -P par2cmdline
cd par2cmdline/
./configure --prefix=/usr/local
make
sudo make install
Using parchive
The main thing I want to do with Parchive is create PAR files that contain
vital parity information about my data. When doing so, there are two options
that are important to consider: % redundancy, and block size.
The former option controls how much parity information is generated, that is,
how much data you can lose while still being able to regenerate the whole.
I’ve settled on 10% as being stupendously beyond the number of bad sectors
I’d expect to see on a DVD within a reasonable amount of time, and therefore
“safe”.
The latter option only makes sense if you know a little about how parchive
works: in a nutshell, it breaks your files up into smaller pieces, and
generates parity information for each block separately. If you lose one bit
of a block, the whole thing is invalid, and has to be regenerated. In an
ideal world, then, you’d set the block size equal to the sector size of
whatever medium you’re using for backup. In that case, you’ve got the best
protection against a single sector dying; you don’t waste any space. This
efficiency, however, is impractical for two reasons: first, the smaller the
block size, the longer it takes to process the data, and second, parchive’s
algorithm is limited to a maximum of 32,768 (coincidentally, that’s
2^15) blocks. If you set a 2k block size to maximize efficiency,
you’d only be able to process ~64MB before parchive fell over and died. I need
to write parity information for up to a whole DVD’s worth of data, ~4GB
(~4.7GB total capacity, less 10% for the redundancy data). 4 millionish kilobytes / 32768
blocks = about 122KB per block. I’ll double that (and round to the nearest
power of two, because I suspect that makes things easier internally), and
end up with a block size of 262,144 bytes.
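If you'd rather not do that arithmetic by hand, here's a rough sketch of it as a shell function. The name par_block_size and the MAX_BLOCKS variable are my own inventions for illustration, not part of parchive; the only thing taken from above is the 32,768 block limit and the "double it, then round to the nearest power of two" rule of thumb.

```shell
#!/bin/sh
# Hypothetical helper, not part of parchive: pick a block size for par2create.
# MAX_BLOCKS is the 32,768-block limit described above.
MAX_BLOCKS=32768

# par_block_size BYTES
# Prints a block size in bytes: the smallest size that fits BYTES into
# MAX_BLOCKS blocks, doubled, then rounded to the nearest power of two.
par_block_size() (
    bytes=$1
    # Smallest block size that stays within the block limit (ceiling division).
    min=$(( (bytes + MAX_BLOCKS - 1) / MAX_BLOCKS ))
    # Double it for some headroom, as in the text.
    target=$(( min * 2 ))
    # Find the largest power of two that is <= target.
    p=1
    while [ $(( p * 2 )) -le "$target" ]; do
        p=$(( p * 2 ))
    done
    # Pick whichever neighbouring power of two is closer to the target.
    if [ $(( target - p )) -lt $(( p * 2 - target )) ]; then
        echo "$p"
    else
        echo $(( p * 2 ))
    fi
)

# Roughly a DVD's worth of data, 4 billion bytes:
par_block_size 4000000000   # prints 262144
```

Feeding it ~4 billion bytes lands on the same 262,144 figure, which is reassuring. It assumes your shell does 64-bit arithmetic, which any reasonably modern one should.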
So, creating PAR files for a directory is just a matter of plugging these
values into the par2create command:
par2create -s262144 -r10 [NameOfParFile.par2] [FilesToRead]
Easy, eh?
Verifying the files in a directory is equally trivial, using par2verify:
par2verify [NameOfParFile.par2]
If par2verify tells you that you’ve got corruption in your data, you can
repair it with par2repair:
par2repair [NameOfParFile.par2]
When that’s finished, parchive will have magically regenerated your data from
the parity files you have at hand. Brilliant.
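If you end up checking a lot of discs, the verify-then-repair dance can be wrapped up in a tiny function. This is only a sketch: verify_or_repair is a name I made up, and it assumes par2verify exits with a non-zero status when it finds damaged or missing files, which is worth confirming against your own build before trusting it.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of parchive.
# verify_or_repair PARFILE
# Runs par2verify and only falls back to par2repair if verification fails.
# Assumption: par2verify exits non-zero when files are damaged or missing.
verify_or_repair() {
    if par2verify "$1"; then
        echo "ok: $1"
    else
        echo "damaged, attempting repair: $1"
        par2repair "$1"
    fi
}
```

You'd call it as verify_or_repair [NameOfParFile.par2], and its exit status is whatever the last command it ran returned, so a failed repair still shows up as a failure.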
Carlo just launched his latest project, Escaloop. It’s a nicely structured way of creating a lifestream that pulls in your content from external sites (Flickr, Last.fm, etc), binding it all together into a well-presented, embeddable “badge.”
He’s done a good job with it, and it couldn’t be easier to try out for yourself… You don’t even have to create a user account to get going (in fact, you can’t create a user account: they don’t exist!). It’s lightweight, and easy to play around with. Go try it out, you’ll like it!