Storing and searching the Signatures
In addition to generating image signatures, image_match also facilitates
storing images and looking them up efficiently, even for collections of up to
(at least) a billion images. Instagram account only has a few million images?
Don’t worry, you can get 80M images here to play with.
A signature database wraps an Elasticsearch index, so you’ll need Elasticsearch up and running. Once that’s done, you can set it up like so:
from elasticsearch import Elasticsearch
from image_match.elasticsearch_driver import SignatureES
es = Elasticsearch()
ses = SignatureES(es)
By default, the Elasticsearch index name is 'images' and the document type is
'image', but you can change these via the index and doc_type parameters.
Now, let’s store those pictures from before in the database:
ses.add_image('https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg/687px-Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg')
ses.add_image('https://pixabay.com/static/uploads/photo/2012/11/28/08/56/mona-lisa-67506_960_720.jpg')
ses.add_image('https://upload.wikimedia.org/wikipedia/commons/e/e0/Caravaggio_-_Cena_in_Emmaus.jpg')
ses.add_image('https://c2.staticflickr.com/8/7158/6814444991_08d82de57e_z.jpg')
Now let’s search for one of those Mona Lisas:
ses.search_image('https://pixabay.com/static/uploads/photo/2012/11/28/08/56/mona-lisa-67506_960_720.jpg')
The result is a list of hits:
[
{'dist': 0.0,
'id': u'AVM37oZq0osmmAxpPvx7',
'metadata': None,
'path': u'https://pixabay.com/static/uploads/photo/2012/11/28/08/56/mona-lisa-67506_960_720.jpg',
'score': 7.937254},
{'dist': 0.22095170140933634,
'id': u'AVM37nMg0osmmAxpPvx6',
'metadata': None,
'path': u'https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg/687px-Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg',
'score': 0.28797293},
{'dist': 0.42557196987336648,
'id': u'AVM37p530osmmAxpPvx9',
'metadata': None,
'path': u'https://c2.staticflickr.com/8/7158/6814444991_08d82de57e_z.jpg',
'score': 0.0499953}
]
dist is the normalized distance, like we computed above. Hence, lower numbers
are better, with 0.0 being a perfect match. id is an identifier assigned by
the database. score is computed by Elasticsearch, and higher numbers are
better here. path is the original path (URL or file path). metadata is
an optional field used for storing extra information about the image (see below).
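To make the roles of these fields concrete, here is a minimal sketch that post-processes a hit list in plain Python. The `dist` and `score` values are copied from the output above; the `path` values are shortened to bare file names for readability, and `best_match` is a hypothetical helper, not part of image_match.

```python
# Hits shaped like the search_image output above. The ids are omitted and the
# paths are shortened to file names (an illustrative simplification).
hits = [
    {'dist': 0.42557196987336648, 'metadata': None,
     'path': '6814444991_08d82de57e_z.jpg', 'score': 0.0499953},
    {'dist': 0.0, 'metadata': None,
     'path': 'mona-lisa-67506_960_720.jpg', 'score': 7.937254},
    {'dist': 0.22095170140933634, 'metadata': None,
     'path': '687px-Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg',
     'score': 0.28797293},
]


def best_match(hits):
    """Return the hit with the smallest normalized distance, or None if empty."""
    return min(hits, key=lambda hit: hit['dist']) if hits else None


# Lower dist is better, so sorting ascending puts the closest match first.
ranked = sorted(hits, key=lambda hit: hit['dist'])
for hit in ranked:
    print('{:.3f}  {}'.format(hit['dist'], hit['path']))
```

Ranking on `dist` rather than `score` keeps the ordering tied to the signature distance itself, which is the quantity the cutoff discussed later operates on.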
Notice all three Mona Lisa images appear in the results, with the identical
image being a perfect ('dist': 0.0) match. If we search instead for the
Caravaggio,
ses.search_image('https://upload.wikimedia.org/wikipedia/commons/e/e0/Caravaggio_-_Cena_in_Emmaus.jpg')
we get:
[
{'dist': 0.0,
'id': u'AVMyXQFw0osmmAxpPvxz',
'metadata': None,
'path': u'https://upload.wikimedia.org/wikipedia/commons/e/e0/Caravaggio_-_Cena_in_Emmaus.jpg',
'score': 7.937254}
]
It only finds the Caravaggio, which makes sense! But what if we wanted an even
more restrictive search? For instance, maybe we only want unmodified Mona Lisas,
that is, just photographs of the original. We can restrict our search with a hard
cutoff using the distance_cutoff keyword argument:
ses = SignatureES(es, distance_cutoff=0.3)
ses.search_image('https://pixabay.com/static/uploads/photo/2012/11/28/08/56/mona-lisa-67506_960_720.jpg')
This now returns only the unmodified, catless Mona Lisas:
[
{'dist': 0.0,
'id': u'AVMyXOz30osmmAxpPvxy',
'metadata': None,
'path': u'https://pixabay.com/static/uploads/photo/2012/11/28/08/56/mona-lisa-67506_960_720.jpg',
'score': 7.937254},
{'dist': 0.23889600350807427,
'id': u'AVMyXMpV0osmmAxpPvxx',
'metadata': None,
'path': u'https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg/687px-Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg',
'score': 0.28797293}
]
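One way to picture the hard cutoff is as a filter on `dist`. The sketch below applies such a filter client-side to the three distances from the first search. `within_cutoff` is a hypothetical helper, and the strict `<` comparison is an assumption: how the library treats a distance exactly equal to the cutoff is not stated here.

```python
def within_cutoff(hits, cutoff):
    """Keep only hits whose normalized distance is below the cutoff."""
    # Assumption: strictly-less-than comparison; the library may differ at
    # the boundary.
    return [hit for hit in hits if hit['dist'] < cutoff]


# Distances taken from the first Mona Lisa search above.
hits = [
    {'dist': 0.0},
    {'dist': 0.22095170140933634},
    {'dist': 0.42557196987336648},
]

kept = within_cutoff(hits, 0.3)
print(len(kept))
```

In practice, passing `distance_cutoff` to `SignatureES` as shown above makes this post-processing unnecessary; the sketch only illustrates which hits survive a given threshold.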
Distorted and transformed images
image_match is also robust against basic image transforms. Take this
squashed Mona Lisa:
No problem, just search as usual:
ses.search_image('http://i.imgur.com/CVYBCCy.jpg')
returns
[
{'dist': 0.15454905655638429,
'id': u'AVM37oZq0osmmAxpPvx7',
'metadata': None,
'path': u'https://pixabay.com/static/uploads/photo/2012/11/28/08/56/mona-lisa-67506_960_720.jpg',
'score': 1.6818419},
{'dist': 0.24980626832071956,
'id': u'AVM37nMg0osmmAxpPvx6',
'metadata': None,
'path': u'https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg/687px-Mona_Lisa,_by_Leonardo_da_Vinci,_from_C2RMF_retouched.jpg',
'score': 0.16198477},
{'dist': 0.43387141782958921,
'id': u'AVM37p530osmmAxpPvx9',
'metadata': None,
'path': u'https://c2.staticflickr.com/8/7158/6814444991_08d82de57e_z.jpg',
'score': 0.031996995}
]
as expected. Now, consider this rotated version:
image_match doesn’t search for rotations and mirror images by default.
Searching for this image will return no results, unless you search with
all_orientations=True:
ses.search_image('http://i.imgur.com/T5AusYd.jpg', all_orientations=True)
Then you get the expected matches.
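To illustrate what searching over rotations and mirror images involves, here is a minimal sketch that enumerates the orientations of a tiny image stored as a list of pixel rows. It assumes the candidate set is the four right-angle rotations plus the mirror image of each; the exact transforms image_match tries are not stated here, and the helper names are hypothetical.

```python
def rotate90(grid):
    """Rotate a list of pixel rows 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def mirror(grid):
    """Flip a list of pixel rows left to right."""
    return [row[::-1] for row in grid]


def candidate_orientations(grid):
    """Return the 4 right-angle rotations of grid, each with its mirror image."""
    orientations = []
    current = [list(row) for row in grid]
    for _ in range(4):
        orientations.append(current)
        orientations.append(mirror(current))
        current = rotate90(current)
    return orientations


# A small asymmetric "image", so every orientation is distinct.
image = [[1, 2, 3],
         [4, 5, 6]]

variants = candidate_orientations(image)
print(len(variants))
```

Each orientation needs its own signature lookup, which suggests why a search over all of them would cost more than the default and is left opt-in.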
Adding metadata
Sometimes you want to store information with your images independent of the
reverse image search functionality. You can do that with the metadata=
keyword argument of the add_image method.
Let’s add one of the images again, with some extra data:
ses.add_image('https://c2.staticflickr.com/8/7158/6814444991_08d82de57e_z.jpg', metadata={'things': 'stuff!'})
In general, any JSON-like data should work with metadata=. Now we can search for the image:
ses.search_image('https://c2.staticflickr.com/8/7158/6814444991_08d82de57e_z.jpg')
This returns our previous result along with a new one:
[
{'dist': 0.0,
'id': u'AVYhQYhEDpLcdyATKuy-',
'metadata': None,
'path': u'https://c2.staticflickr.com/8/7158/6814444991_08d82de57e_z.jpg',
'score': 7.64685},
{'dist': 0.0,
'id': u'AVYhRvoWDpLcdyATKuzE',
'metadata': {u'things': u'stuff!'},
'path': u'https://c2.staticflickr.com/8/7158/6814444991_08d82de57e_z.jpg',
'score': 2.435569},
...
]
Here we can see a little extra info. image_match doesn’t provide any way to query
the metadata directly, but you can use Elasticsearch’s query DSL, for example with:
ses.es.search('images', body={'query': {'match': {'metadata.things': 'stuff!'}}})
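If you query several metadata fields, it may help to build the request body with a small function. The sketch below is plain Python and does not contact Elasticsearch; `metadata_match_query` is a hypothetical helper that reproduces the body used in the call above.

```python
def metadata_match_query(field, value):
    """Build an Elasticsearch match query body for one metadata sub-field."""
    # Metadata is stored under the 'metadata' key, so sub-fields are
    # addressed with dotted paths such as 'metadata.things'.
    return {'query': {'match': {'metadata.{}'.format(field): value}}}


body = metadata_match_query('things', 'stuff!')
print(body)

# The body can then be passed on exactly as in the call above, e.g.:
#   ses.es.search('images', body=body)
```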