TL;DR: This article questions the current practices in image labeling for computer vision and proposes storing image annotations as EXIF metadata, eliminating the sidecar text file.

So, I want to get to down to work to understand firsthand what it would be like to deal with annotation data in this paradigm: the image states a truth about itself [EXIF], rather than being described by another entity [JSON]. That seems a pretty reasonable proposition.
The general idea is to experiment with simplifying the dataset file system — by eliminating the sidecar text files, and check if there are any important gains that justify adopting this approach, at least for small and/or proprietary datasets and fine-tuning tasks. Sidecar files by definition store data (often metadata) which is not supported by the format of a source file. That obviously isn’t true with modern day digital image files.
I’m also trying to understand the technical and conceptual – and why not say, ethical – problems related to inserting/writing/reading data in these structures/environments as well as check if there is anything to gain in the training process, least for small datasets and/or fine-tuning tasks.
It goes in the EXIF
Annotation is a fundamental step in Machine Learning. For the casual reader, computer vision annotations are stored in text files connected, or paired, with the correspondent image files within the dataset, e.g. cow0001.jpg → cow0001.json. These files are created and written by the annotation tool in a work session, as the annotator selects the region of the image containing the item to be labeled. They contain the coordinates that allow the AI system to superimpose the “bounding boxes” – those already familiar ‘squares’ that delimit the target detection items on the image – structured in a certain text format.
Some frameworks use json, others xml, and others csv [there will definitely be other formats]. Some require the presence of the annotated text file in the same directory [same level] as the image. Other frameworks require the text file to be located in its own directory within the image directory, which may [must] be called json, xml, label, and a few other requirements. This text file lives in an indissoluble marriage with the image file, and for computer vision purposes they are always referenced together. Is it worth the hassle?
A common argument is that this image/text separation scheme allows greater flexibility in annotations, because they are atomized, etc. But I counter the argument that nothing is very different when both are metadata. In a proper label tag, the text data remains compartmentalized and manipulating them will be no more difficult than manipulating a text file. It is still perfectly possible to keep tag content synchronized with text files kept outside the dataset. The dataset no longer needs a file system.
XMP comes into play
According to Wikipedia “The XMP standard was designed to be extensible, allowing users to add their own custom types of metadata.”
In a perfect world, such user-defined tag would have its own data type. For this exercise we will use the natural inclination that tags have for dealing with strings. We then need to create a tag [or rename an already defined one – there are many] to contain our label; our own EXIF tag[0].
Doing away with the text files
So let’s get rid of the text file and write our labels as an EXIF tag of the image file. There a few modules available in Python for this task, but little diversity. Many are out of date. A search of the Anaconda (conda, conda-forge) and PyPi (pip) channels consistently returns the three modules I plan t test for this project: pyexiv2; piexif and PyExifTool. The latter is a Python wrapper for ExitFool, which is a Pearl application. We will not detail the peculiarities of each one here. I’m also exploring PIL .
With ExifTool it is possible to perform advanced manipulations on tags. Let’s use it to create [or rename an already defined one] a new tag called ‘Label‘:
The process involves editing the exif.config file containing the tags we want to define, as stipulated in the documentation.
# The %Image::ExifTool::UserDefined hash defines new tags to be added
# to existing tables.
%Image::ExifTool::UserDefined = (
# All EXIF tags are added to the Main table, and WriteGroup is used to
# specify where the tag is written (default is ExifIFD if not specified):
'Image::ExifTool::Exif::Main' => {
# Example 1. EXIF:NewEXIFTag
0xd000 => {
Name => 'Label',
Writable => 'int16u',
WriteGroup => 'IFD0',
},
# add more user-defined EXIF tags here...
},
What I want is
def writeToEXIFtag (data)
#some pseudocode for now
Image.Exif.Label = data
Instead of
def writeToJSONFile(path, fileName, data):
fileName = fileName.split(".")[0]
filePathNameWExt = path + '/' + fileName + '.json'
with open(filePathNameWExt, 'w') as fp:
json.dump(data, fp)
as in what has become [by custom] the canonical process.
In this project, for greater practicality [integration with other modules, etc.], the best way to go seems to be using virtual environments, such as virtualenv and conda. It is hardly possible to bring together the exact same packages on both platforms. At the moment I’m using environments I configured with modules that I put together through the not very clean practice of mixing conda+pip. I still have things to figure out – don’t know much about Pearl and I’m having a hard time having everything [namely exiftool + pyexiftool] working together.
Pros
- More efficient processing [to be verified].
- Dataset image files can be renamed and can be used in any other dataset without additional work.
- No problems with different formats. These Xlabels [EXIF Labels] can coexist with the paired annotation files.
- Simplicity brings pedagogic gains; a less steep [human] learning curve.
- Cameras can automatically pre-annotate images – at least universal categories, like COCO’s [is that really a ‘Pro’?].
Cons
- Increased dataset size [to be verified]
- Less control over datasets and annotations [to be verified]
- The usual surveillance stuff [cameras detecting, identifying, classifying…]
- <Insert your con here>
Epilogue
I would really like to know if there is any proposal similar to this one out there, as I’d also like to know if I’m actually arriving late to an already rejected solution. I’m still in the very early stages and receiving any feedback is an integral part of the process.
I hope the idea is worthy of receiving inputs from the readers. I will be reporting any progress. I have the skeleton of the repository on GitHub[1], and will be polishing and finalizing the initial version in the next few days. It’s a modest project, as the idea is really simple as everyone can probably see. Everyone’s invited to participate.
* * *
[0] The question of creating new tags, or renaming existing ones [e.g. UserComment → Label], or both, or yet another option with another datatype, is open at this moment, as is the question of whether to use single or combined tags.



