How to repair corrupt tar archives


Author: Udo Rader

creation date: 2003-08-11, last revision: 2005-10-17

Every sysadmin's nightmare: You made a backup of important files using tar and for whatever reason you need to restore the files - but find the tar archive broken.

This happened to me once (and hopefully never again), and it took me quite a long time to get the data back (or at least the usable part of it).

Before we start, some assumptions to make things clear:


(GNU) tar itself has some options that claim to be suitable for recovering data from damaged archives (you'll understand the sarcasm here if you read on ...). So let's first check what the problem is:

% tar xjf the_bad_bad_backup.tar.bz2
bzip2: Data integrity error when decompressing.
Input file = (stdin), output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

You can use the `bzip2recover' program to *attempt* to recover
data from undamaged sections of corrupted files.

tar: 56 garbage bytes ignored at end of archive
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now

Now this indicates that I should use bzip2recover "to *attempt* to recover data from undamaged sections". Well, that doesn't sound too bad, does it?

So I used bzip2recover:
% bzip2recover the_bad_bad_backup.tar.bz2

That way at least something happened. Depending on the size of the archive, bzip2recover produces a nice number of small 'rec*' files, one per compressed block; each block holds up to 900K of uncompressed data, which is the block size bzip2 uses by default. The "nice number of small files", however, is likely to become a "huge number of small files" if your archive is big, like mine was.

The archive I had to deal with was more than 200MB in size, leaving me with several hundred(!) of those "small files". But I was still optimistic that I could retrieve the data from the small files by finding the corrupted ones. So I tried to find out which of the small files were corrupted and which ones were good:
% bunzip2 rec*bz2

bunzip2 stops when it finds the first (and hopefully last) corrupted file, which is exactly what I wanted to know. Crush, kill and destroy: there is no use for a corrupt file, so I deleted it and repeated the above command plus the deletion for all further bad files. The only important thing is to remember the numbers of the deleted files.
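Deleting bad pieces one at a time gets tedious with hundreds of files. As a minimal sketch (in Python, not part of the procedure I actually used), the following tests every piece without changing or deleting anything and reports all bad ones in one pass. The function names and the `rec*.bz2` pattern are assumptions made for illustration.

```python
import bz2
import glob


def is_good_bz2(path):
    """Return True if the whole file decompresses without an error."""
    try:
        with bz2.open(path, "rb") as f:
            # Read in 1 MiB chunks so a piece never has to fit in memory at once.
            while f.read(1 << 20):
                pass
        return True
    except (OSError, EOFError):
        # OSError: invalid or damaged data; EOFError: the stream is cut short.
        return False


def find_bad_pieces(pattern="rec*.bz2"):
    """Return the sorted names of all pieces that fail to decompress."""
    return [p for p in sorted(glob.glob(pattern)) if not is_good_bz2(p)]
```

Calling `find_bad_pieces()` in the directory that holds the pieces returns the names to set aside, which also gives you the numbers to remember.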

So now I thought it would be easy: use tar on the bunzip'ed files, but I was taught otherwise. Say that rec00199 was the first (and last) corrupted file, so starting at rec00200:
% tar xf rec00200
tar: this doesn't look like a »tar«-archive
tar: jumping to the next header
tar: error upon exit caused by previous errors

Headache time ... I could try any of the more than 200 remaining, allegedly "fixed" files, but I always got the same error. Searches on Google and postings to some mailing lists did not provide me with any useful results, and my headache grew.

Tar claims to be able to scan even corrupt files for tar headers, but this feature has one major blemish: it only works if no bytes are missing from the file, because tar expects headers to be 512 bytes in size and looks for them only in 512-byte steps. If a single byte is lost in such a header (or in a data block following it), every later header is shifted off that 512-byte grid, and this "recovery feature" fails and becomes an annoyance.

Luck returned a couple of weeks later when I received an email from a nice guy who had written a nice Perl script that really searches a file for tar headers byte by byte, and not in 512-byte steps as tar itself does. You can download it here.
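To illustrate the idea of a byte-wise search, here is a minimal sketch in Python. It is not the Perl script mentioned above, only one way such a search could work: it looks for the "ustar" magic string that POSIX and GNU tar headers carry at offset 257, and accepts a candidate only if the header checksum matches. Old v7-style headers without that magic are not found. All function names are made up for this example, and the reported offsets are 0-based.

```python
BLOCK = 512
MAGIC = b"ustar"      # found in POSIX and GNU headers, absent in old v7 headers
MAGIC_OFFSET = 257    # position of the magic inside a 512-byte header


def parse_octal(field):
    """Parse a NUL/space padded octal field; None if it is not plain octal."""
    text = field.strip(b"\0 ")
    if not text:
        return 0
    try:
        return int(text, 8)
    except ValueError:
        return None


def header_checksum_ok(block):
    """The checksum is the byte sum of the header with bytes 148..155 as spaces."""
    stored = parse_octal(block[148:156])
    if stored is None:
        return False
    computed = sum(block[:148]) + 8 * 0x20 + sum(block[156:BLOCK])
    return stored == computed


def find_tar_headers(data):
    """Yield (offset, name, size) for each plausible header at any byte offset."""
    pos = data.find(MAGIC)
    while pos != -1:
        start = pos - MAGIC_OFFSET
        block = data[start:start + BLOCK] if start >= 0 else b""
        if len(block) == BLOCK and header_checksum_ok(block):
            name = block[:100].split(b"\0", 1)[0].decode("utf-8", "replace")
            yield start, name, parse_octal(block[124:136])
        pos = data.find(MAGIC, pos + 1)


def scan_file(path):
    """Print one 'file:offset:name:size' line per header found in path."""
    with open(path, "rb") as f:
        data = f.read()  # reads the whole file into memory; fine for a sketch
    for offset, name, size in find_tar_headers(data):
        print(f"{path}:{offset}:{name}:{size}")
```

Because the search does not care about the 512-byte grid, it still finds headers after bytes have gone missing in front of them. The checksum test is there to weed out places where the string "ustar" merely occurs inside file data.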

In order to get things working, I joined the second part of the bunzip'ed files (the ones after the bad rec00199):
% cat rec00[2-4][0-9][0-9] > good_tail.tar

The command above joins all files from rec00200 up to rec00499 together in good_tail.tar.

And now the only thing I had to do was to use the script to find the position of the first good tar header in good_tail.tar:
% perl find_tar_headers.pl good_tail.tar
good_tail.tar:17185:top/secret/warp_reactor.so:157106
good_tail.tar:75041:top/secret/kernel_injectors.so:153125
good_tail.tar:130849:top/secret/dampening_fields.so:145746
good_tail.tar:183585:top/secret/plasma_controls.so:157035
[...]

The only thing that matters is the first line of the output: it says that the first good tar header in good_tail.tar is at position 17185. What remained was to extract the content starting at this position and then untar it:
% tail -c +17185 good_tail.tar > extracted_tail.tar
% tar xf extracted_tail.tar
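One detail worth knowing: `tail -c +N` counts bytes starting from 1, so `+17185` skips the first 17184 bytes. A tool that reports 0-based offsets would need N to be the offset plus one, so it is worth checking which convention your header finder uses. The following Python sketch (helper names are assumptions, not part of any tool used here) does the same cut with an explicit 1-based byte number and then lists what can be read from the result.

```python
import shutil
import tarfile


def copy_tail(src_path, dst_path, first_byte):
    """Copy src to dst starting at the 1-based byte number first_byte,
    which mirrors what `tail -c +first_byte` does."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        src.seek(first_byte - 1)
        shutil.copyfileobj(src, dst)


def list_members(tar_path):
    """Return the member names that can be read from the start of the archive."""
    names = []
    try:
        # "r:" means: read an uncompressed tar, do not try any decompression.
        with tarfile.open(tar_path, mode="r:") as archive:
            for member in archive:
                names.append(member.name)
    except (tarfile.TarError, EOFError):
        pass  # keep whatever was read before the archive broke off
    return names
```

If `list_members` returns an empty list for the extracted file, the cut was most likely off by a byte or started at a false hit.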


Happy end of the story!