
jpeg - Efficient way to fingerprint an image (jpg, png, etc)?

Is there an efficient way to get a fingerprint of an image for duplicate detection?

That is, given an image file, say a JPG or PNG, I'd like to be able to quickly calculate a value that identifies the image content and is fairly resilient to other aspects of the image changing (e.g. the image metadata). If it also handles resizing, even better.

[Update] Regarding the metadata in JPG files, does anyone know if it's stored in a specific part of the file? I'm looking for an easy way to ignore it - e.g. can I skip the first x bytes of the file, or take x bytes from the end of the file, to be sure I'm not including metadata?
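One way I can think of to ignore the metadata without parsing the file layout is to hash the decoded pixel data rather than the raw file bytes. A rough sketch, assuming the Pillow library (this would handle metadata-only changes, but not re-encoding or resizing):

    import hashlib
    from PIL import Image

    def content_hash(path):
        # Decode the file and normalize to RGB so the hash covers only pixel
        # data, not EXIF/metadata or container details.
        with Image.open(path) as img:
            pixels = img.convert("RGB").tobytes()
        return hashlib.sha256(pixels).hexdigest()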


1 Reply


A stab in the dark, if you are looking to get around metadata and size-related differences:

  1. Edge detection and scale-independent comparison
  2. Sampling and statistical analysis of grayscale/RGB values (average luminance, averaged color map) - see the sketch after this list
  3. FFT and other transforms (a good article: Classification of Fingerprints using FFT)

And numerous others.
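As a rough illustration of option 2, assuming Pillow and NumPy (the 8x8 grid and the cosine-similarity comparison are arbitrary choices, not the only way to do it):

    import numpy as np
    from PIL import Image

    def gray_stats_fingerprint(path, grid=8):
        # Downscale to a small grayscale grid and keep the per-cell averages.
        with Image.open(path) as img:
            small = img.convert("L").resize((grid, grid))
        cells = np.asarray(small, dtype=np.float32)
        # Subtract the mean so uniform brightness shifts matter less.
        return (cells - cells.mean()).flatten()

    def similarity(fp_a, fp_b):
        # Cosine similarity: values near 1.0 suggest "probably the same image".
        denom = np.linalg.norm(fp_a) * np.linalg.norm(fp_b)
        return float(np.dot(fp_a, fp_b) / denom) if denom else 0.0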

Basically:

  1. Convert the JPG/PNG/GIF (whatever) into an RGB byte array that is independent of the encoding
  2. Use a fuzzy pattern-classification method to generate a 'hash of the pattern' in the image ... not a hash of the raw RGB array, as some suggest
  3. Then you want a distributed method of fast hash comparison based on a matching threshold over the encapsulated hash or encoding of the pattern (a sketch of all three steps follows this list). Erlang would be good for this :)
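A minimal sketch of those three steps, assuming Pillow. The 'fuzzy pattern hash' here is a simple difference hash (dHash), which is just one concrete choice, and the 5-bit threshold is a guess you would tune against your own data:

    from PIL import Image

    def dhash(path, size=8):
        # Step 1: decode to pixels, independent of the original encoding,
        # then reduce to a tiny grayscale grid.
        with Image.open(path) as img:
            small = img.convert("L").resize((size + 1, size))
        px = list(small.getdata())
        # Step 2: hash the pattern of horizontal brightness gradients.
        bits = 0
        for row in range(size):
            for col in range(size):
                left = px[row * (size + 1) + col]
                right = px[row * (size + 1) + col + 1]
                bits = (bits << 1) | (1 if left > right else 0)
        return bits

    def is_probable_duplicate(hash_a, hash_b, threshold=5):
        # Step 3: compare hashes by Hamming distance against a matching threshold.
        return bin(hash_a ^ hash_b).count("1") <= threshold

Because each fingerprint is just a 64-bit integer, distributing the comparison (Erlang, MapReduce, whatever) is mostly a matter of sharding the hashes.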

Advantages are:

  1. Will, if you use any AI/training, spot duplicates regardless of encoding, size, aspect ratio, hue and luminance modification, dynamic-range/subsampling differences, and in some cases perspective

Disadvantages:

  1. Can be hard to code ... something like OpenCV might help
  2. Probabilistic ... false positives are likely, but can be reduced with neural networks and other AI
  3. Slow unless you can encapsulate the pattern qualities and distribute the search (MapReduce style)

Check out image-analysis books such as:

  1. Pattern Classification 2ed
  2. Image Processing Fundamentals
  3. Image Processing - Principles and Applications

And others

If you are scaling the images, things are simpler. If not, you have to contend with the fact that scaling is lossy in more ways than just sample reduction.


...