[SGVLUG] diff command and binary
Emerson, Tom
Tom.Emerson at wbconsultant.com
Thu Mar 2 14:25:18 PST 2006
> -----Original Message-----
> Behalf Of matti
>
> Ok - I've encountered a frustrating problem
>
> pulled up a backup file and compared it
> the same file... they should be the same,
> but - perhaps the user did a slight modification
> this morning on one.
>
> SO I got 2 binary excel files which differ.
>
> I want to determine HOW much they differ
> and in what ways.
The simple act of OPENING the file will change it -- to test, I did the following:
* started Excel
* immediately saved the (blank) worksheet
* exited Excel
* copied the file from the command prompt
* started Excel
* opened the COPY
* exited Excel
FC shows differences between the files -- not a lot, but "different" at a binary level:
C:\>copy book1.xls book2.xls
1 file(s) copied.
C:\>fc book1.xls book2.xls
Comparing files Book1.xls and BOOK2.XLS
***** Book1.xls
***** BOOK2.XLS
G???>???
*****
***** Book1.xls
?
***** BOOK2.XLS
W
o
r
k
b
o
o
k
*****
***** Book1.xls
W
o
r
k
b
o
o
k
>
> is there a way to do this.
> (the file sizes are the same.. )
Note that "fc" [the DOS equivalent of "diff"] also has a binary comparison mode:
C:\>help fc
Compares two files or sets of files and displays the differences between
them
FC [/A] [/C] [/L] [/LBn] [/N] [/T] [/U] [/W] [/nnnn] [drive1:][path1]filename1
[drive2:][path2]filename2
FC /B [drive1:][path1]filename1 [drive2:][path2]filename2
/A Displays only first and last lines for each set of differences.
/B Performs a binary comparison.
/C Disregards the case of letters.
/L Compares files as ASCII text.
/LBn Sets the maximum consecutive mismatches to the specified number of
lines.
/N Displays the line numbers on an ASCII comparison.
/T Does not expand tabs to spaces.
/U Compare files as UNICODE text files.
/W Compresses white space (tabs and spaces) for comparison.
/nnnn Specifies the number of consecutive lines that must match after a
mismatch.
C:\>fc /b book1.xls book2.xls
Comparing files Book1.xls and BOOK2.XLS
0000346C: 00 20
0000346D: 00 47
0000346E: 00 C6
0000346F: 00 05
00003470: 00 3F
00003471: 00 3E
00003472: 00 C6
00003473: 00 01
Considering that what was zeroes before is now binary data, I'd imagine this is some form of timestamp -- sure enough, a second copy-and-compare operation yields:
C:\>copy book2.xls book3.xls
1 file(s) copied.
C:\>fc /b book2.xls book3.xls
Comparing files book2.xls and BOOK3.XLS
FC: no differences encountered
[open book3 in Excel and immediately close]
C:\>fc /b book2.xls book3.xls
Comparing files book2.xls and BOOK3.XLS
0000346C: 20 30
0000346D: 47 6A
0000346E: C6 FA
0000346F: 05 2E
00003470: 3F 40
C:\>
The last three bytes (offsets 3471 through 3473) are still the same (perhaps they represent the day?). Note that marking the file "read only" (attrib +r) will keep Excel from updating this value.
I'd suggest copying the "current" file to a workfile, then opening the workfile and closing it as I have done here. Then compare backup-->current and current-->workfile; if the differences are similar to what I've shown here, I'd feel reasonably confident there is no actual change other than "having opened the file".
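One way to test the timestamp guess is to decode the eight changed bytes. The sketch below assumes (nothing in this thread confirms it) that they form a little-endian Windows FILETIME, i.e. a 64-bit count of 100-nanosecond ticks since 1601-01-01 UTC. The helper name decode_filetime is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the eight bytes at 0x346C-0x3473 are a little-endian Windows
# FILETIME (100 ns ticks since 1601-01-01 UTC). The post only guesses
# "some form of timestamp"; this checks whether that reading is plausible.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def decode_filetime(raw: bytes) -> datetime:
    """Interpret 8 bytes as a little-endian FILETIME; return a UTC datetime."""
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes")
    ticks = int.from_bytes(raw, "little")
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


# Second column of the first "fc /b" run (book2.xls after Excel opened it).
first_open = bytes.fromhex("2047C6053F3EC601")
# The same offsets after book3.xls was opened: five bytes changed, three did not.
second_open = bytes.fromhex("306AFA2E403EC601")

if __name__ == "__main__":
    print(decode_filetime(first_open))
    print(decode_filetime(second_open))
```

Under that assumption both values fall on 2 March 2006 (UTC), a little over eight minutes apart, which fits the timestamp theory. The three unchanged bytes would then simply be the most significant ones, which roll over only about once every 30 hours.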
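The comparison procedure described above can be sketched roughly as follows. The function names and the 16-byte slack are illustrative assumptions, and the files are assumed to be the same size (as the original question says they are).

```python
from pathlib import Path


def byte_diffs(a: bytes, b: bytes) -> list[tuple[int, int, int]]:
    """Return (offset, byte_in_a, byte_in_b) for each differing position,
    much like the listing printed by "fc /b". Requires equal lengths."""
    if len(a) != len(b):
        raise ValueError("inputs differ in length; offsets would not line up")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


def looks_like_open_noise(backup: bytes, current: bytes, workfile: bytes,
                          slack: int = 16) -> bool:
    """Heuristic: True when every backup-vs-current difference lies within
    `slack` bytes of a difference caused merely by opening the workfile."""
    noise = [off for off, _, _ in byte_diffs(current, workfile)]
    real = [off for off, _, _ in byte_diffs(backup, current)]
    return all(any(abs(off - n) <= slack for n in noise) for off in real)


def report(backup_path: str, current_path: str, workfile_path: str) -> None:
    """Print an fc-style listing plus the heuristic verdict (paths are examples)."""
    backup = Path(backup_path).read_bytes()
    current = Path(current_path).read_bytes()
    workfile = Path(workfile_path).read_bytes()
    for off, x, y in byte_diffs(backup, current):
        print(f"{off:08X}: {x:02X} {y:02X}")
    if looks_like_open_noise(backup, current, workfile):
        print("differences look like open-only changes")
    else:
        print("differences go beyond the open timestamp")
```

A "True" verdict is only a hint, not proof: it says the changed offsets cluster where opening the file is known to write, nothing more.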
BUT... to get at what /actual/ changes may have occurred, how about OpenOffice? I believe the 2.0 version saves files in XML format (which is then compressed to reclaim the space taken by the XML formatting). Open the backup and "current" files in OpenOffice (marking them read-only beforehand, "just to be safe") and save them in the "native" OpenOffice format. [There may even be an option to not compress them, but if not I think gunzip will work against the file.] You can then use XML tools to determine the difference (or even an ASCII FC/diff against the file should work, as the file is essentially "human readable" at this point). [Note: it may take some finessing to get the output in the first place -- you might need the -c and -S switches for gunzip...]
tom at osnut:/srv/nethome/tom2/Documents/wireless> less wfinder_data.sxc
Archive: wfinder_data.sxc
Length Method Size Ratio Date Time CRC-32 Name
-------- ------ ------- ----- ---- ---- ------ ----
10191 Defl:N 1949 81% 05-18-03 17:12 90241a31 content.xml
5433 Defl:N 1301 76% 05-18-03 17:12 1404ddb8 styles.xml
1076 Stored 1076 0% 05-18-03 17:12 f5381c62 meta.xml
7478 Defl:N 1355 82% 05-18-03 17:12 f15b0f40 settings.xml
750 Defl:N 252 66% 05-18-03 17:12 5313cb53 META-INF/manifest.xml
-------- ------- --- -------
24928 5933 76% 5 files
gunzip complains that there are "multiple entries in the file" and stops after the first one. Using the "-c" flag and redirecting the result to a newly named file shows that the first entry (content.xml) looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE office:document-content PUBLIC "-//OpenOffice.org//DTD OfficeDocument1.0//EN" "office.dtd">
<office:document-content
xmlns:office="http://openoffice.org/2000/office"
[...snipped a dozen similar lines...]
office:class="spreadsheet"
office:version="1.0">
<office:script/>
<office:font-decls><style:font-decl style:name="Arial Unicode MS"
fo:font-family="'Arial Unicode
[etc.]
<office:body>
<table:table table:name="Sheet1" table:style-name="ta1">
<table:table-column table:style-name="co1" table:default-cell-style-name="Default"/>
<table:table-column table:style-name="co2" table:default-cell-style-name="Default"/>
<table:table-column table:style-name="co3" table:default-cell-style-name="Default"/>
[ditto above]
<table:table-row table:style-name="ro1">
<table:table-cell>
<text:p>Location</text:p>
</table:table-cell>
<table:table-cell>
<text:p>Operator</text:p>
</table:table-cell>
(actually, I pretty-printed this a bit -- the actual file has no line breaks or extraneous whitespace...)
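Since the saved file is really a ZIP archive (that is what the listing above shows), an alternative to coaxing gunzip is to pull content.xml out directly and diff a pretty-printed copy. This is only a sketch: it relies on Python's standard zipfile, xml.dom.minidom and difflib modules, the function names are invented here, and the file names in the closing comment are placeholders.

```python
import difflib
import zipfile
from xml.dom import minidom


def pretty_content(path_or_file, member: str = "content.xml") -> list[str]:
    """Read one XML member from the archive and return it as indented lines."""
    with zipfile.ZipFile(path_or_file) as archive:
        raw = archive.read(member)
    pretty = minidom.parseString(raw).toprettyxml(indent="  ")
    # Drop whitespace-only lines so the diff is not cluttered.
    return [line for line in pretty.splitlines() if line.strip()]


def diff_spreadsheets(old, new) -> list[str]:
    """Unified diff of the pretty-printed content.xml of two archives."""
    return list(difflib.unified_diff(
        pretty_content(old), pretty_content(new),
        fromfile="backup/content.xml", tofile="current/content.xml",
        lineterm=""))


# Example call (placeholder names; save both workbooks from OpenOffice first):
#   for line in diff_spreadsheets("backup.sxc", "current.sxc"):
#       print(line)
```

Pretty-printing first matters because the stored XML has no line breaks, so a plain line-based diff would report the whole file as one changed line.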