[SGVLUG] How to find files/messages that are "almost" the same
Jeremy Leader
jleader at alumni.caltech.edu
Mon Apr 20 11:52:10 PDT 2009
If you're really ambitious, there's a technique called
"shingleprinting" (also known as shingling), which computes hashes for
small chunks of the text and counts how many chunks are identical
between two files (or email messages, or whatever you're trying to
de-dupe). The theory is that the proportion of chunks shared between
two files gives a measure of their similarity. A quick search turned up
two implementations, which I haven't investigated further:
http://research.microsoft.com/en-us/downloads/4e0d0535-ff4c-4259-99fa-ab34f3f57d67/default.aspx
http://wiki.cs.pdx.edu/forge/simhash.html
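
To make the idea concrete, here is a rough sketch of the chunk-hashing
approach in Python. It is not taken from either implementation above;
the function names, the 4-word chunk size, and the choice of SHA-1 are
all my own assumptions, picked only for illustration.

```python
import hashlib

def shingles(text, k=4):
    """Return the set of hashes of every run of k consecutive words."""
    words = text.split()  # splitting on whitespace ignores spacing differences
    if len(words) < k:
        # Short text: treat the whole thing as a single chunk.
        return {hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()}
    return {
        hashlib.sha1(" ".join(words[i:i + k]).encode("utf-8")).hexdigest()
        for i in range(len(words) - k + 1)
    }

def similarity(a, b, k=4):
    """Proportion of distinct chunks the two texts share (Jaccard index)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)
```

Two messages that differ only in whitespace should score 1.0, and
unrelated messages should score close to 0.0; where to draw the
"duplicate" line in between is something you'd have to tune by hand.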
--
Jeremy Leader
jleader at alumni.caltech.edu
matti wrote:
>
> Hmmm... interesting problem
>
> let the world of command line save you!
>
> lol! ;-)
>
> ok!
>
> this IS how I would try to solve the problem
>
> use "diff" - maybe piping to wc or something
> and seeing which files are minimally different
>
> I would also only use diff on files close to the
> same size, as obviously files of significantly
> different sizes are very different.
>
> http://en.wikipedia.org/wiki/Diff
>
> hmm, so probably use "ls -lta" and pipe to a file,
> then sort the file based on size of the files,
> extract the nearest files based on size, diff
> those -> use wc to determine how big the difference
> is, and then manually look and see if it worked ;)
>
> best
> matti
>
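
The size-then-diff idea above could be sketched in Python roughly as
follows. This swaps in the standard difflib module for the "diff | wc"
pipeline, so it is an approximation of the approach rather than the
same commands; near_duplicates, size_slack and min_ratio are
hypothetical names and thresholds chosen for the example.

```python
import difflib
import os

def near_duplicates(paths, size_slack=0.1, min_ratio=0.9):
    """Yield (path_a, path_b, ratio) for files of similar size and content."""
    by_size = sorted(paths, key=os.path.getsize)
    for i, a in enumerate(by_size):
        size_a = os.path.getsize(a)
        for b in by_size[i + 1:]:
            # Sorted by size, so once b is too big every later file is too.
            if os.path.getsize(b) > size_a * (1 + size_slack):
                break
            with open(a, errors="replace") as fa, open(b, errors="replace") as fb:
                lines_a, lines_b = fa.readlines(), fb.readlines()
            # ratio() is 1.0 for identical line lists and drops as lines differ.
            ratio = difflib.SequenceMatcher(None, lines_a, lines_b).ratio()
            if ratio >= min_ratio:
                yield a, b, ratio
```

Note that this compares whole lines, so a line with an extra trailing
space still counts as different; you would still want to eyeball the
reported pairs.
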
> --- On Mon, 4/20/09, Emerson, Tom (*IC) <Tom.Emerson at wbconsultant.com> wrote:
>
>> From: Emerson, Tom (*IC) <Tom.Emerson at wbconsultant.com>
>> Subject: [SGVLUG] How to find files/messages that are "almost" the same
>> To: "'SGVLUG Discussion List.'" <sgvlug at sgvlug.net>
>> Date: Monday, April 20, 2009, 8:50 AM
>> One more reason to dislike certain
>> email clients: using automation to sort e-mails can end up
>> with "duplicates" in multiple folders, however these are
>> not-quite-perfect duplicates, so a binary comparison will
>> see them as distinct messages when in fact the /content/ is
>> the same.
>>
>> Does anyone know of a product or program that would ignore
>> small differences (such as an extra space at the end of a
>> line) when comparing the body/text of a message?
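
For the specific case of an extra space at the end of a line, one
simple option is to normalize the body before comparing. The sketch
below assumes the message body has already been extracted as a string;
body_fingerprint is a made-up helper name, not part of any mail tool.

```python
import hashlib

def body_fingerprint(body):
    """Hash a message body after ignoring trivial whitespace differences."""
    # rstrip() removes trailing spaces, tabs and stray carriage returns.
    lines = [line.rstrip() for line in body.replace("\r\n", "\n").split("\n")]
    # Drop trailing blank lines so an extra newline at the end does not matter.
    while lines and not lines[-1]:
        lines.pop()
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()
```

Messages whose fingerprints match have the same text apart from
trailing whitespace and line endings; anything fuzzier than that needs
a similarity measure rather than an exact hash.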