Image hashing

Post by darkspork » Fri Jun 12, 2009 8:47 am UTC

I'm working in Python on an image downloading/storage program. One of the main goals is to remove redundant images. (I estimate that at least half of my saved images are duplicated somewhere else on my HD.) I need some sort of algorithm that gives me a simple hash of an image, so that two extremely similar images (say, one pixel off) that look the same to the human eye get the same number. I tried just taking an md5 hash of the files, but that doesn't work. I even copied a file and it failed to recognize the copy as the same image.
darkspork
 
Posts: 532
Joined: Tue Sep 23, 2008 12:43 am UTC
Location: Land of Trains and Suburbs

Re: Image hashing

Post by mrbaggins » Fri Jun 12, 2009 9:32 am UTC

MD5 isn't what you want at all... one bit of difference will completely change the entire result.

I'm really not sure if what you want exists. It's tricky to do reliably, and you have to go on confidence levels because two things might be similar but not the same, or the same picture but different enough that the program gets confused...
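
To make the "one bit of difference" point concrete, here is a minimal sketch using Python's standard hashlib. The two byte strings are made up for illustration; they stand in for two image files that differ in a single bit. With a cryptographic hash you would typically expect roughly half of the 128 digest bits to flip.

```python
import hashlib

# Two made-up inputs that differ in exactly one bit (last byte 0x00 vs 0x01).
a = bytes([10, 20, 30, 0])
b = bytes([10, 20, 30, 1])

digest_a = hashlib.md5(a).digest()
digest_b = hashlib.md5(b).digest()


def bit_difference(x, y):
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(p ^ q).count("1") for p, q in zip(x, y))


input_bits_changed = bit_difference(a, b)
digest_bits_changed = bit_difference(digest_a, digest_b)
print(input_bits_changed, "input bit changed,",
      digest_bits_changed, "of 128 digest bits changed")
```

So the MD5 values of two near-identical images tell you nothing about how similar the images are, only that the bytes are not identical.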
mrbaggins
 
Posts: 1611
Joined: Tue Jan 15, 2008 3:23 am UTC
Location: Wagga, Australia

Re: Image hashing

Post by darkspork » Fri Jun 12, 2009 9:46 am UTC

mrbaggins wrote: MD5 isn't what you want at all... one bit of difference will completely change the entire result.

I'm really not sure if what you want exists. It's tricky to do reliably, and you have to go on confidence levels because two things might be similar but not the same, or the same picture but different enough that the program gets confused...

I started with the general assumption that two copies of the EXACT SAME FILE should generate the same md5 hash, but that seems to have failed.
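
For reference, a byte-for-byte copy should always produce the same MD5 digest, so a mismatch points at how the hash is being computed rather than at MD5. It isn't clear from the above what went wrong, but two things worth checking are that the file is opened in binary mode and that the file contents (not, say, the path string) are what gets hashed. A minimal sketch, where file_md5 and the temporary file names are hypothetical and exist only for the demonstration:

```python
import hashlib
import os
import shutil
import tempfile


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 of a file's bytes, read in binary mode in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration with a throwaway file and a byte-for-byte copy of it.
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "original.bin")
duplicate = os.path.join(workdir, "duplicate.bin")
with open(original, "wb") as handle:
    handle.write(bytes(range(256)) * 4)
shutil.copyfile(original, duplicate)

original_digest = file_md5(original)
duplicate_digest = file_md5(duplicate)
same = original_digest == duplicate_digest
print(same)

shutil.rmtree(workdir)
```

This only catches exact duplicates, which is still a cheap first pass before any fuzzy comparison.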

I wonder if I can somehow resize the image to a smaller size, reduce the color quality, and then generate a hash.

Re: Image hashing

Post by Ephphatha » Fri Jun 12, 2009 10:05 am UTC

What you want to do is downsize the images until they're all the same resolution, then compare only the decoded pixel data, regardless of the format they're saved in, to see how close they are. There is a tool out there somewhere that does this exact thing (try asking on 4chan's /r/; it's used by those guys to detect duplicate porn images...).
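
A rough sketch of the downsize-and-compare idea, under some assumptions: the 32x32 grayscale thumbnail, the mean-absolute-difference score, and the threshold of 5.0 are arbitrary choices for illustration, not what any particular tool does. load_gray_thumbnail needs Pillow (the maintained fork of the Python Imaging Library); the comparison itself is plain Python and is shown here on synthetic pixel lists.

```python
def load_gray_thumbnail(path, size=(32, 32)):
    """Decode an image with Pillow and return its grayscale pixels at a fixed size."""
    from PIL import Image  # third-party (Pillow); only needed for real files
    with Image.open(path) as img:
        small = img.convert("L").resize(size)
    return list(small.tobytes())  # mode "L" stores one byte per pixel


def mean_abs_difference(pixels_a, pixels_b):
    """Average per-pixel difference between two equal-length pixel lists."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("thumbnails must have the same number of pixels")
    return sum(abs(p - q) for p, q in zip(pixels_a, pixels_b)) / len(pixels_a)


def looks_like_duplicate(pixels_a, pixels_b, threshold=5.0):
    """Treat two thumbnails as duplicates when their score is under the threshold."""
    return mean_abs_difference(pixels_a, pixels_b) <= threshold


# Synthetic stand-ins for three 32x32 thumbnails.
base = [(i * 7) % 256 for i in range(32 * 32)]
nudged = list(base)
nudged[0] += 40                      # one pixel changed
inverted = [255 - p for p in base]   # a clearly different picture

close_score = mean_abs_difference(base, nudged)
far_score = mean_abs_difference(base, inverted)
print(close_score, far_score)
```

The score is a distance rather than a hash, so finding duplicates this way means comparing pairs of thumbnails and picking a threshold that suits your collection.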
Ephphatha
 
Posts: 625
Joined: Sat Sep 02, 2006 9:03 am UTC
Location: Bathurst, NSW, Australia

Re: Image hashing

Post by jaap » Fri Jun 12, 2009 10:12 am UTC

Ephphatha wrote: There is a tool out there somewhere that does this exact thing

DupDetector is one I've used, and it's not bad.
jaap
 
Posts: 1815
Joined: Fri Jul 06, 2007 7:06 am UTC

Re: Image hashing

Post by MHD » Fri Jun 12, 2009 3:58 pm UTC

I'd probably go with the resizing too, or maybe construct a hashing algorithm whose output changes only a little when the picture changes a little (i.e. little avalanche effect) and then compare the outputs as integers.
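
One scheme of that flavour is the so-called average hash; this is a swapped-in example, not something proposed in this thread. The sketch assumes the image has already been shrunk to 8x8 grayscale (64 values), and it compares hashes by counting differing bits (Hamming distance) instead of subtracting the integers, since a flipped high bit would otherwise look like a huge difference. The function names and the synthetic pixel data are made up for illustration.

```python
def average_hash(pixels):
    """Pack one bit per pixel (1 if brighter than the mean) into an integer."""
    mean = sum(pixels) / len(pixels)
    value = 0
    for p in pixels:
        value = (value << 1) | (1 if p > mean else 0)
    return value


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


# Synthetic 8x8 grayscale "images" as flat lists of 64 values.
gradient = [i * 4 for i in range(64)]
tweaked = list(gradient)
tweaked[10] += 3                          # one pixel slightly off
inverted = [252 - p for p in gradient]    # light and dark swapped

near = hamming_distance(average_hash(gradient), average_hash(tweaked))
far = hamming_distance(average_hash(gradient), average_hash(inverted))
print(near, far)
```

A small distance suggests the same picture and a large one suggests different pictures; where to draw the line is something you would have to tune.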
MHD
 
Posts: 631
Joined: Fri Mar 20, 2009 8:21 pm UTC
Location: Denmark

Re: Image hashing

Post by darkspork » Sat Jun 13, 2009 7:20 am UTC

MHD wrote: I'd probably go with the resizing too, or maybe construct a hashing algorithm whose output changes only a little when the picture changes a little (i.e. little avalanche effect) and then compare the outputs as integers.

I think I've got it then. I can resize all images to 64x64 pixels, and reduce the color quality to 8 levels per channel (8x8x8 = 512 colors). This should prove unique enough, and should also result in visually identical images receiving identical hashes.
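
A minimal sketch of that plan, with assumptions spelled out: load_rgb_pixels relies on Pillow for the 64x64 resize, the packing of each pixel into a 2-byte colour index is an arbitrary choice, and all function names are hypothetical. One caveat the demo makes visible: a pixel sitting next to a bucket boundary can land in a different bucket after a change of a single level, and then the exact hash no longer matches.

```python
import hashlib


def quantize_channel(value, levels=8):
    """Map a 0-255 channel value to one of `levels` buckets."""
    return value * levels // 256


def reduced_color_hash(rgb_pixels, levels=8):
    """MD5 over the quantized colour index of every pixel (512 colours by default)."""
    indices = bytearray()
    for r, g, b in rgb_pixels:
        index = (quantize_channel(r, levels) * levels
                 + quantize_channel(g, levels)) * levels + quantize_channel(b, levels)
        indices += index.to_bytes(2, "big")
    return hashlib.md5(bytes(indices)).hexdigest()


def load_rgb_pixels(path, size=(64, 64)):
    """Decode an image with Pillow and return (r, g, b) tuples at a fixed size."""
    from PIL import Image  # third-party (Pillow); only needed for real files
    with Image.open(path) as img:
        small = img.convert("RGB").resize(size)
    data = small.tobytes()
    return [tuple(data[i:i + 3]) for i in range(0, len(data), 3)]


# Synthetic 64x64 image: red ramps across each row, green and blue are fixed.
base = [((i % 64) * 4, 100, 200) for i in range(64 * 64)]

inside_bucket = list(base)
inside_bucket[0] = (3, 100, 200)     # red 0 -> 3 stays in bucket 0

across_boundary = list(base)
across_boundary[8] = (31, 100, 200)  # red 32 -> 31 drops from bucket 1 to bucket 0

same_hash = reduced_color_hash(base) == reduced_color_hash(inside_bucket)
changed_hash = reduced_color_hash(base) != reduced_color_hash(across_boundary)
print(same_hash, changed_hash)
```

So exact matching of reduced images will catch many duplicates, but not every pair that looks identical.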

Now I just have to figure out how to get the Python Imaging Library to work on a Mac...

Re: Image hashing

Post by Baxter » Sat Jun 13, 2009 6:39 pm UTC

I've done quite a bit of research into this, and I found the approach of averaging out squares to be of little real use: being off by a pixel, or a few too many artefacts in a dark area, can throw off your results. I ended up using Haar wavelet decomposition, helpfully handled and demonstrated by isk-daemon: http://server.imgseek.net/
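
For anyone unfamiliar with the term, here is a small sketch of a Haar decomposition in plain Python. It is a generic textbook version (pairwise averages and differences, unnormalized) and not isk-daemon's code; the function names are made up, and the input lengths are assumed to be powers of two.

```python
def haar_1d(values):
    """Full 1-D Haar decomposition via repeated pairwise averages and differences."""
    output = list(values)
    length = len(output)
    while length > 1:
        half = length // 2
        averages = [(output[2 * i] + output[2 * i + 1]) / 2 for i in range(half)]
        details = [(output[2 * i] - output[2 * i + 1]) / 2 for i in range(half)]
        output[:length] = averages + details
        length = half
    return output


def haar_2d(grid):
    """Standard 2-D decomposition: transform every row, then every column."""
    rows = [haar_1d(row) for row in grid]
    columns = [haar_1d(column) for column in zip(*rows)]
    return [list(row) for row in zip(*columns)]


signal = haar_1d([9, 7, 3, 5])
flat = haar_2d([[5, 5, 5, 5] for _ in range(4)])
print(signal)
print(flat[0])
```

The first coefficient is the overall average and the rest describe detail at progressively finer scales, so a flat image ends up with a single non-zero coefficient.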
Baxter
 
Posts: 46
Joined: Sun Oct 12, 2008 3:30 am UTC

