Machine Learning – Voxle One

TL;DR: This article questions the current practices in image labeling for computer vision and proposes storing image annotations as EXIF metadata, eliminating the sidecar text file.

Zebu cows in a pasture — *One of the images in Caprichoso, our zebu cattle dataset (soon to be released) – Image: **Cownt** CC BY-NC-SA 4.0 Deed*

So, I want to get to down to work to understand firsthand what it would be like to deal with annotation data in this paradigm: the image states a truth about itself [EXIF], rather than being described by another entity [JSON]. That seems a pretty reasonable proposition.

The general idea is to experiment with simplifying the dataset file system — by eliminating the sidecar text files, and check if there are any important gains that justify adopting this approach, at least for small and/or proprietary datasets and fine-tuning tasks. Sidecar files by definition store data (often metadata) which is not supported by the format of a source file. That obviously isn’t true with modern day digital image files.

I’m also trying to understand the technical and conceptual – and why not say, ethical – problems related to inserting/writing/reading data in these structures/environments as well as check if there is anything to gain in the training process, least for small datasets and/or fine-tuning tasks.

It goes in the EXIF

Annotation is a fundamental step in Machine Learning. For the casual reader, computer vision annotations are stored in text files connected, or paired, with the correspondent image files within the dataset, e.g. cow0001.jpg → cow0001.json. These files are created and written by the annotation tool in a work session, as the annotator selects the region of the image containing the item to be labeled. They contain the coordinates that allow the AI system to superimpose the “bounding boxes” – those already familiar ‘squares’ that delimit the target detection items on the image – structured in a certain text format.

Some frameworks use json, others xml, and others csv [there will definitely be other formats]. Some require the presence of the annotated text file in the same directory [same level] as the image. Other frameworks require the text file to be located in its own directory within the image directory, which may [must] be called json, xml, label, and a few other requirements. This text file lives in an indissoluble marriage with the image file, and for computer vision purposes they are always referenced together. Is it worth the hassle?

A common argument is that this image/text separation scheme allows greater flexibility in annotations, because they are atomized, etc. But I counter the argument that nothing is very different when both are metadata. In a proper label tag, the text data remains compartmentalized and manipulating them will be no more difficult than manipulating a text file. It is still perfectly possible to keep tag content synchronized with text files kept outside the dataset. The dataset no longer needs a file system.

XMP comes into play

According to Wikipedia “The XMP standard was designed to be extensible, allowing users to add their own custom types of metadata.”

In a perfect world, such user-defined tag would have its own data type. For this exercise we will use the natural inclination that tags have for dealing with strings. We then need to create a tag [or rename an already defined one – there are many] to contain our label; our own EXIF tag[0].

Doing away with the text files

So let’s get rid of the text file and write our labels as an EXIF tag of the image file. There a few modules available in Python for this task, but little diversity. Many are out of date. A search of the Anaconda (conda, conda-forge) and PyPi (pip) channels consistently returns the three modules I plan t test for this project: pyexiv2; piexif and PyExifTool. The latter is a Python wrapper for ExitFool, which is a Pearl application. We will not detail the peculiarities of each one here. I’m also exploring PIL .

With ExifTool it is possible to perform advanced manipulations on tags. Let’s use it to create [or rename an already defined one] a new tag called ‘Label‘:

The process involves editing the exif.config file containing the tags we want to define, as stipulated in the documentation.

# The %Image::ExifTool::UserDefined hash defines new tags to be added
# to existing tables.
%Image::ExifTool::UserDefined = (
    # All EXIF tags are added to the Main table, and WriteGroup is used to
    # specify where the tag is written (default is ExifIFD if not specified):
    'Image::ExifTool::Exif::Main' => {
        # Example 1.  EXIF:NewEXIFTag
        0xd000 => {
            Name => 'Label',
            Writable => 'int16u',
            WriteGroup => 'IFD0',
        },
        # add more user-defined EXIF tags here...
    },

What I want is

def writeToEXIFtag (data)
    #some pseudocode for now
    Image.Exif.Label = data

Instead of

def writeToJSONFile(path, fileName, data):
    fileName = fileName.split(".")[0]
    filePathNameWExt =  path + '/' + fileName + '.json'
    with open(filePathNameWExt, 'w') as fp:
        json.dump(data, fp)

as in what has become [by custom] the canonical process.

In this project, for greater practicality [integration with other modules, etc.], the best way to go seems to be using virtual environments, such as virtualenv and conda. It is hardly possible to bring together the exact same packages on both platforms. At the moment I’m using environments I configured with modules that I put together through the not very clean practice of mixing conda+pip. I still have things to figure out – don’t know much about Pearl and I’m having a hard time having everything [namely exiftool + pyexiftool] working together.

Pros

More efficient processing [to be verified].
Dataset image files can be renamed and can be used in any other dataset without additional work.
No problems with different formats. These Xlabels [EXIF Labels] can coexist with the paired annotation files.
Simplicity brings pedagogic gains; a less steep [human] learning curve.
Cameras can automatically pre-annotate images – at least universal categories, like COCO’s [is that really a ‘Pro’?].

Cons

Increased dataset size [to be verified]
Less control over datasets and annotations [to be verified]
The usual surveillance stuff [cameras detecting, identifying, classifying…]
<Insert your con here>

Epilogue

I would really like to know if there is any proposal similar to this one out there, as I’d also like to know if I’m actually arriving late to an already rejected solution. I’m still in the very early stages and receiving any feedback is an integral part of the process.

I hope the idea is worthy of receiving inputs from the readers. I will be reporting any progress. I have the skeleton of the repository on GitHub[1], and will be polishing and finalizing the initial version in the next few days. It’s a modest project, as the idea is really simple as everyone can probably see. Everyone’s invited to participate.

* * *

[0] The question of creating new tags, or renaming existing ones [e.g. UserComment → Label], or both, or yet another option with another datatype, is open at this moment, as is the question of whether to use single or combined tags.

[1] https://github.com/VoxLeone/XLabel

O poder de computação ao alcance das pessoas começou a crescer rapidamente, aos trancos e barrancos, na virada do milênio, quando as unidades de processamento gráfico (GPUs) começaram a ser aproveitadas para cálculos não gráficos, uma tendência que se tornou cada vez mais difundida na última década.

Ainda não temos uma Teoria da Mente, que possa nos dar uma base para a construção de uma verdadeira inteligência senciente. Aqui a distinção entre as disciplinas que formam o campo da Inteligência Artificial

Mas as demandas da computação de “Aprendizado Profundo” [Deep Learning] têm aumentado ainda mais rápido. Essa dinâmica estimulou os engenheiros a desenvolver aceleradores de hardware voltados especificamente para o aprendizado profundo [o que se conhece popularmente como ‘Inteligência Artificial’], sendo a Unidade de Processamento de Tensor (TPU) do Google um excelente exemplo.

Aqui, descreverei resumidamente o processo geral do aprendizado de máquina. Em meio a reportagens cataclísmicas anunciando o iminente desabamento do Céu, precisamos saber um pouco sobre como os computadores realmente executam cálculos de redes neurais.

Visão geral

Quase invariavelmente, os neurônios artificiais são ‘construídos’ [na verdade eles são virtuais] usando um software especial executado em algum tipo de computador eletrônico digital.

Esse software fornece a um determinado neurônio da rede várias entradas e uma saída. O estado de cada neurônio depende da soma ponderada de suas entradas, à qual uma função não linear, chamada função de ativação, é aplicada. O resultado, a saída desse neurônio, torna-se então uma entrada para vários outros neurônios, em um processo em cascata.

As camadas de neurônios interagem entre si. Cada círculo representa um neurônio, em uma visão muito esquemática. À esquerda (em amarelo) a camada de entrada. Ao centro, em azul e verde, as camadas ocultas, que refinam os dados, aplicando pesos variados a cada neurônio. À direita, em vermelho, a camada de saída, com o resultado final.

Por questões de eficiência computacional, esses neurônios são agrupados em camadas, com neurônios conectados apenas a neurônios em camadas adjacentes. A vantagem de organizar as coisas dessa maneira, ao invés de permitir conexões entre quaisquer dois neurônios, é que isso permite que certos truques matemáticos de álgebra linear sejam usados para acelerar os cálculos.

Embora os cálculos de álgebra linear não sejam toda a história, eles são a parte mais exigente do aprendizado profundo em termos de computação, principalmente à medida que o tamanho das redes aumenta. Isso é verdadeiro para ambas as fases do aprendizado de máquina:

O treinamento – processo de determinar quais pesos aplicar às entradas de cada neurônio.

A inferência – processo deflagrado quando a rede neural está fornecendo os resultados desejados.

*Concepção do processo de treinamento de máquina, dos dados brutos, à esquerda, ao modelo completo.*

Matrizes

O que são esses misteriosos cálculos de álgebra linear? Na verdade eles não são tão complicados. Eles envolvem operações com matrizes, que são apenas arranjos retangulares de números – planilhas, se preferir, menos os cabeçalhos de coluna descritivos que você encontra em um arquivo Excel típico.

É bom que as coisas sejam assim, porque o hardware de um computador moderno é otimizado exatamente para operações com matriz, que sempre foram o pão com manteiga da computação de alto desempenho – muito antes de o aprendizado de máquina se tornar popular. Os cálculos matriciais relevantes para o aprendizado profundo se resumem essencialmente a um grande número de operações de multiplicação e acumulação, em que pares de números são multiplicados entre si e seus produtos somados.

Ao longo dos anos, o aprendizado profundo foi exigindo um número cada vez maior dessas operações de multiplicação e acumulação. Considere LeNet, uma rede neural pioneira, projetada para fazer classificação de imagens. Em 1998, demonstrou superar o desempenho de outras técnicas de máquina para reconhecer letras e numerais manuscritos. Mas em 2012 o AlexNet, uma rede neural que processava cerca de 1.600 vezes mais operações de multiplicação e acumulação do que o LeNet, foi capaz de reconhecer milhares de diferentes tipos de objetos em imagens.

Gráfico tridimensional ilustrando o processo de inferência, partindo de dados brutos dispersos (embaixo à direita) até o refinamento final (após muitas iterações de inferência), onde o resultado (ou predição) é obtido.

Aliviar a pegada de CO2

Avançar do sucesso inicial do LeNet para o AlexNet exigiu quase 11 duplicações do desempenho de computação. Durante os 14 anos que se passaram, a lei de Moore ditava grande parte desse aumento. O desafio tem sido manter essa tendência agora que a lei de Moore dá sinais de que está perdendo força. A solução de sempre é simplesmente injetar mais recursos – tempo, dinheiro e energia – no problema.

Como resultado, o treinamento das grandes redes neurais tem deixado uma pegada ambiental significativa. Um estudo de 2019 descobriu, por exemplo, que o treinamento de um determinado tipo de rede neural profunda para o processamento de linguagem natural emite cinco vezes mais CO2 do que um automóvel durante toda a sua vida útil.

Os aprimoramentos nos computadores eletrônicos digitais com certeza permitiram que o aprendizado profundo florescesse. Mas isso não significa que a única maneira de realizar cálculos de redes neurais seja necessariamente através dessas máquinas. Décadas atrás, quando os computadores digitais ainda eram relativamente primitivos, os engenheiros lidavam com cálculos difíceis como esses usando computadores analógicos.

À medida que a eletrônica digital evoluiu, esses computadores analógicos foram sendo deixados de lado. Mas pode ser hora voltar a essa estratégia mais uma vez, em particular nestes tempos em que cálculos analógicos podem ser feitos oticamente de forma natural.

Nos próximas postagens vou trazer os mais recentes desenvolvimentos em fotônica aplicada ao aprendizado de máquina – em uma arquitetura analógica! Estamos, sem dúvida, vivendo tempos interessantes neste campo promissor.

Fonte de pesquisa: spectrum.ieee.org

Voxle One

Sistemas Inteligentes

Tag: Machine Learning

CV: Labels are Metadata – let’s treat them as such

Uma (muito) Rápida Introdução à ‘Inteligência Artificial’