The Rationale Behind Deep Neural Network Decisions

By: Oladayo Ogunnoiki, Statistics Canada

Introduction

In March 2016, Microsoft introduced Tay to the Twittersphere. Tay was an experimental artificial intelligence (AI) chatbot designed for "conversational understanding": the more you chatted with Tay, the smarter it was supposed to become. However, it didn't take long for the experiment to go awry. Tay was meant to engage people in playful conversation, but that playful banter quickly turned into misogynistic and racist commentary.

Of course, the public was perplexed by this turn of events. If this bot was inherently rude, why wouldn't other AI models also go off course? Most Twitter users felt that this bleak event was only a glimpse of what was to come if our future was indeed rich in AI models. However, most data scientists understood the real reason for Tay's negative commentary: the bot was simply repeating what it had learned from the users themselves (Vincent, 2016).

The world of AI continues to grow exponentially, and with stories like this happening all the time, there's a strong need to increase the public's trust in AI products. To gain that trust, transparency and explainability are of the utmost importance.

One of the primary questions for anyone interacting with an AI model like Tay is: "why did the model make that decision?" Multiple tools have been designed to explain the rationale behind these models and answer that question. It may be to no one's surprise that visual explanations are an efficient way of doing so. In their work, Selvaraju et al. (2017) outline the requirements of a good visual explanation: it must be class discriminative and have high resolution. These criteria serve as guidelines for the challenge to be addressed: creating a solution that provides a high-resolution, class-discriminative visual explanation for the decisions of a neural network.

Some of the techniques that provide visual explanations include deconvolution, guided backpropagation, class activation mapping (CAM), Gradient-weighted CAM (Grad-CAM), Grad-CAM++, Hi-Res-CAM, Score-CAM, Ablation-CAM, X-Grad-CAM, Eigen-CAM, Full-Grad, and deep feature factorization. For this article, we'll focus on Grad-CAM.

Grad-CAM is an open-source tool that produces visual explanations for decisions from a large class of convolutional neural networks. It works by highlighting the regions of the image that have the highest influence on the final prediction of the deep neural network, thereby providing insight into the decision-making process of the model.

Grad-CAM builds on CAM, which uses the activations of the feature maps with respect to the target class but is limited to particular architectures, such as the Visual Geometry Group (VGG) network and the residual network (ResNet). Grad-CAM instead uses the gradient of the target class with respect to the feature maps of the final convolutional layer, making it a generic method that can be applied to many types of neural networks. This combination of features makes Grad-CAM a reliable and accurate tool for understanding the decision-making process of deep neural networks. Guided Grad-CAM goes further by incorporating the gradients of the guided backpropagation process to produce a more refined heatmap. One limitation is that Grad-CAM only visualizes the regions of the image that are most important for the final prediction, rather than the entire decision-making process of the deep neural network, so it may not provide a complete understanding of how the model makes its predictions.

The advantages of Grad-CAM include:

  • No trade-off between model complexity or performance and model transparency.
  • It's applicable to a broad range of convolutional neural networks (CNNs).
  • It's highly class discriminative.
  • Useful for diagnosing failure modes by uncovering biases in datasets.
  • Helps untrained users distinguish a stronger network from a weaker one, even when their predictions are identical.

Methodology

Grad-CAM can be used in multiple computer vision projects, such as image classification, semantic segmentation, object detection, image captioning and visual question answering. It can be applied to CNNs and has recently been made available for transformer architectures.

Highlighted below is how Grad-CAM works in image classification, where the objective is to discriminate between different classes:

Figure 1: The process flow of Gradient-weighted class activation mapping (Grad-CAM)

An image is passed through a CNN and a task-specific network to obtain a raw score for the image's class. Next, the gradients are set to zero for all classes except the desired class, which is set to one. This signal is then backpropagated to the rectified convolutional feature maps of interest, which are combined to compute a coarse heatmap (shown in blue) that indicates where the model looks to decide on the class. Finally, the heatmap is pointwise multiplied with the guided backpropagation output, resulting in guided Grad-CAM visualizations that are both high resolution and concept specific.

In an image classification task, the Grad-CAM class-discriminative localization map, $L^{c}_{\text{Grad-CAM}}$, for a model and a specific class $c$ is obtained through the steps below (a minimal code sketch follows this list):

  • For the class $c$, the partial derivative of its score $y^{c}$ with respect to the feature maps $A^{k}$ of a convolutional layer is calculated using backpropagation:
    $\dfrac{\partial y^{c}}{\partial A^{k}_{ij}}$
  • The gradients flowing back are pooled via global average pooling, producing a set of scalar weights known as the neuron importance weights:
    $\alpha^{c}_{k} = \dfrac{1}{Z} \sum_{i} \sum_{j} \dfrac{\partial y^{c}}{\partial A^{k}_{ij}}$
  • The derived scalar weights are applied to the feature maps as a linear combination, and the result is passed through a Rectified Linear Unit (ReLU) activation function:
    $L^{c}_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left(\sum_{k} \alpha^{c}_{k} A^{k}\right)$
  • The result is upscaled and overlaid on the image, highlighting the focus of the neural network. The ReLU is applied to the linear combination of maps because only the pixels or features that have a positive influence on the class score $y^{c}$ are of interest.
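
The steps above translate almost directly into code. The following is a minimal sketch, assuming the final convolutional activations have already been captured (for example, with a forward hook) as a tensor `feature_maps` of shape [1, K, H, W] and that `class_score` holds the raw score $y^{c}$ for the target class; none of these names come from a specific library.

```python
import torch
import torch.nn.functional as F

def grad_cam_map(feature_maps: torch.Tensor, class_score: torch.Tensor) -> torch.Tensor:
    """Compute a Grad-CAM localization map from the final convolutional
    feature maps A^k and the raw class score y^c."""
    # Step 1: backpropagate the class score to obtain dy^c / dA^k_ij
    grads = torch.autograd.grad(class_score, feature_maps)[0]      # shape [1, K, H, W]

    # Step 2: global average pooling of the gradients -> neuron importance weights alpha_k^c
    alphas = grads.mean(dim=(2, 3), keepdim=True)                  # shape [1, K, 1, 1]

    # Step 3: weighted linear combination of the feature maps, followed by ReLU
    cam = F.relu((alphas * feature_maps).sum(dim=1))               # shape [1, H, W]

    # Rescale to [0, 1] so the map can be overlaid on the input image
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam
```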

Demonstration of Grad-CAM

Figure 2: A pair of cats and a pair of remote controls

Image consisting of two Egyptian cats lying down on a pink sofa with remote controls on the left-hand side of each cat.

Figure 2 is an image of two Egyptian cats and two remote controls. The image comes from Hugging Face's cat image dataset, accessed through their Python library. The objective is to identify the items within the image using different pretrained deep learning models. A PyTorch package called pytorch-grad-cam is used. The Grad-CAM feature identifies the aspects of the image that activate the feature map for the Egyptian cat class and the remote-control class. Following the pytorch-grad-cam tutorial, the Grad-CAM results are replicated for different deep neural networks.
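
The tutorial code itself is not reproduced here, but the call pattern of the pytorch-grad-cam package looks roughly like the sketch below. The file name is a placeholder, the class indices follow the standard ImageNet-1k labels, and minor details (such as the preprocessing helper) may differ between package versions.

```python
import cv2
import numpy as np
import torchvision
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import preprocess_image, show_cam_on_image

# Placeholder file name standing in for the figure 2 photo
rgb_image = np.float32(cv2.imread("cats_and_remotes.jpg")[:, :, ::-1]) / 255
input_tensor = preprocess_image(rgb_image,
                                mean=[0.485, 0.456, 0.406],
                                std=[0.229, 0.224, 0.225])

model = torchvision.models.resnet50(pretrained=True).eval()
target_layers = [model.layer4[-1]]            # last convolutional block of ResNet-50

cam = GradCAM(model=model, target_layers=target_layers)

# 285 is the ImageNet-1k index for "Egyptian cat"; 761 is "remote control"
grayscale_cam = cam(input_tensor=input_tensor,
                    targets=[ClassifierOutputTarget(285)])[0]

# Overlay the heatmap on the original image, producing a figure 3-style visualization
overlay = show_cam_on_image(rgb_image, grayscale_cam, use_rgb=True)
cv2.imwrite("grad_cam_egyptian_cat.png", overlay[:, :, ::-1])
```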

Figure 3: Grad-CAM results of a pretrained ResNet-50 architecture classifying the image in figure 2. This image was generated by applying Grad-CAM to figure 2 in a Jupyter Notebook.

Heatmap images generated from a Resnet-50 architecture using Grad-CAM for the Egyptian cat class (left) and Remote-control class (right). The intensity of the red colour shows the regions that contribute the most to the model decision. There are few intense regions for the cat, while the remotes are almost fully captured, but not highly intense.

Figure 2 is passed through a pretrained residual neural network (ResNet-50) as per the pytorch-grad-cam tutorial. Figure 3 shows the image generated using Grad-CAM. For the Egyptian cat class, the legs, stripes, and faces of the cats activated the feature map. For the remote controls, the buttons and profile activated the feature map. The top five predicted classes, in order of logit, are remote control, tiger cat, Egyptian cat, tabby cat, and pillow. This model appears more confident that the image contains remote controls and cats. Though less confident, the pillow category still made the top five, possibly because the model was trained on images of pillows with cat prints.

Figure 4: Grad-CAM results of a pretrained shifted window transformer classifying the image in figure 2. This image was generated by applying Grad-CAM to figure 2 in a Jupyter Notebook.

Heatmap images generated from a shifted window transformer using Grad-CAM for the Egyptian cat class (left) and remote-control class (right). The intensity of the red colour shows the regions that contribute the most to the model's decision. The cats show more intense regions, while the remote controls are almost fully captured with high-intensity.

As with the ResNet-50 architecture, the same image is passed through a pretrained shifted window (Swin) transformer. Figure 4 shows the cats' fur, stripes, faces, and legs as activated regions of the feature map with respect to the Egyptian cat category. The same occurs for the feature map with respect to the remote controls. The top predicted classes, in order of logit, are tabby cat, tiger cat, domestic cat, and Egyptian cat. This model is more confident that cats are in the image than remote controls.

Figure 5: Grad-CAM results of a pretrained vision transformer classifying the image in figure 2. This image was generated by applying Grad-CAM to figure 2 in a Jupyter Notebook.

Heatmap images generated from a vision transformer using Grad-CAM for the Egyptian cat class (left) and remote-control class (right). The intensity of the red colour shows the regions that contribute the most to the model's decision. The cats are fully captured with high intensity. The remotes are also captured, but with lower intensity. In addition, other regions of the image are highlighted despite not being part of either class.

As seen above, more regions of the feature map are activated, including sections of the image that don't contain cat features. The same occurs for regions of the feature map with respect to the remote-control class. The top five predicted classes, in order of logit, are Egyptian cat, tiger cat, tabby cat, remote control, and lynx.

The Grad-CAM results and the top predicted categories for the different architectures can be used to justify selecting the vision transformer (ViT) architecture for tasks related to identifying Egyptian cats and remote controls.

Conclusion

Some of the challenges in the field of AI include increasing people's trust in the developed models and understanding the rationale behind the models' decisions during development. Visualization tools like Grad-CAM provide insight into these rationales and help highlight different failure modes of AI models for specific tasks. They can be used to identify errors in the models and improve their performance. Beyond Grad-CAM, other visualization tools have been developed, such as Score-CAM, which can perform even better at interpreting the decision-making process of deep neural networks. Grad-CAM may still be preferred over Score-CAM because of its simplicity and its agnosticism to model architecture. The use of tools such as Grad-CAM should be encouraged for visually explaining the reasoning behind the decisions of AI models.

Meet the Data Scientist

Register for the Data Science Network's Meet the Data Scientist Presentation

If you have any questions about my article or would like to discuss this further, I invite you to Meet the Data Scientist, an event where authors meet the readers, present their topic and discuss their findings.

Thursday, June 15
1:00 to 4:00 p.m. ET
MS Teams – link will be provided to the registrants by email

Register for the Data Science Network's Meet the Data Scientist Presentation. We hope to see you there!

Subscribe to the Data Science Network for the Federal Public Service newsletter to keep up with the latest data science news.

References

  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh and D. Batra, "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization," in ICCV, IEEE Computer Society, 2017, pp. 618-626.
  • B. Zhou, A. Khosla, A. Lapedriza, A. Oliva and A. Torralba, "Learning Deep Features for Discriminative Localization," CoRR, 2015.
  • J. Vincent, "Twitter taught Microsoft's AI chatbot to be racist in less than a day," in The Verge, 2016.

Introduction to Privacy-Enhancing Cryptographic Techniques

Zero knowledge proof – Proving something without exchanging evidence

By: Betty Ann Bryanton, Canada Revenue Agency

Introduction

Enormous amounts of data are collected by government agencies, search engines, social networking systems, hospitals, financial institutions, and other organizations. This data, centrally stored, is at risk of security breaches. Additionally, individuals browse the internet, accept cookies, and share personally identifiable information (PII) in exchange for services, benefits, recommendations, etc. To facilitate e-commerce and access services, individuals need to authenticate, which means providing 'evidence' to prove they are who they say they are. This may mean providing a password, a driver's license, a passport number, or another personal identifier. These could potentially be stolen, and sharing this data may compromise related PII, such as age and home address. Zero knowledge proofs can assist in these scenarios.

What is Zero Knowledge Proof?

A Zero-Knowledge Proof (ZKP) is one of the cryptographic privacy-enhancing computation (PEC) techniques and may be used to implement granular, least-access privacy controls and privacy-by-design principles.

Typically, a proof that some assertion X is true also reveals some information about why X is true. ZKPs, however, prove that a statement is true without revealing any additional knowledge. It's important to note that ZKPs do not guarantee 100% proof, but they do provide a very high degree of probability.

ZKPs use algorithms that take data as input and return either 'true' or 'false' as output. This allows two parties to verify the truth of information without exposing the information or how the truth was determined. For example, an individual can prove the statement "I am an adult at least 21 years old" without providing data for verification to a central server.

ZKP was introduced by researchers at MIT in 1985 and is now being used in many real-world applications.

ZKP vs other concepts

ZKP is distinct from several related concepts. In particular, ZKP should not be confused with the Advanced Encryption Standard (AES), in which the parties share a secret number. In ZKP, the prover demonstrates possession of a secret number without divulging that number: the goal is to make claims without revealing any extraneous information.

How does ZKP work?

To understand how ZKP works, consider the scenario of a prover (Peggy) and a verifier (Victor). The goal of the ZKP is to prove a statement with very high probability without revealing any additional information.

Peggy (the prover) wants to prove to Victor (the verifier, who is colour-blind and does not trust her) that two balls are of different colours (e.g., green and red). Peggy asks Victor to show her one of the balls and then put both balls behind his back. Victor then either switches the balls or leaves them as they are, and reveals one of them to her. Peggy states whether it is the same colour as the previous ball or a different one. Of course, she could be guessing, lying, or even colour-blind herself. To convince Victor that she is telling the truth, the process must be repeated many, many times; by doing so, Peggy can eventually convince Victor of her ability to correctly distinguish the two colours.

This scenario satisfies the three criteria of a ZKP:

  1. Soundness (a false statement cannot convince the verifier): if Peggy were not telling the truth, or were colour-blind herself, she could only guess correctly 50% of the time in each round, so the chance of her passing many rounds in a row is vanishingly small (see the short simulation after this list).
  2. Completeness (a true statement will convince the verifier): if the balls really are different colours, Peggy answers correctly every round, so after many, many repetitions Victor is convinced that the balls are of different colours.
  3. Zero-knowledge: Victor does not learn anything additional; he never even learns which ball is green and which is red.
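
The short simulation below (illustrative only, not part of any standard protocol) makes the soundness argument concrete: a prover who cannot actually tell the balls apart is reduced to guessing, so her chance of surviving n rounds is (1/2)^n.

```python
import random

def cheating_prover_passes(rounds: int) -> bool:
    """Simulate a colour-blind 'Peggy' who must guess whether Victor switched
    the balls; a single wrong answer ends the protocol."""
    for _ in range(rounds):
        victor_switched = random.choice([True, False])
        peggy_guess = random.choice([True, False])   # she can only guess
        if peggy_guess != victor_switched:
            return False
    return True

trials = 100_000
for rounds in (1, 5, 10, 20):
    passed = sum(cheating_prover_passes(rounds) for _ in range(trials))
    print(f"{rounds:2d} rounds: cheater succeeds {passed / trials:.5%} "
          f"of the time (theory: {0.5 ** rounds:.5%})")
```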

What is explained above is interactive proving, requiring a back-and-forth communication between two parties. Today's ZKPs employ non-interactive proving, where two parties have a shared key to transmit and receive information. For example, a government-issued key as part of a passport could be used to demonstrate citizenship without revealing the passport number or the citizen's name.

Why is it important?

ZKPs assure a secure and invisible flow of data, protecting user information from potential leaks and identity theft. This enhances e-commerce, by allowing more private and secure transactions.

The use of ZKPs not only helps combat data security risk; this minimum viable verification technique also helps prevent the disclosure of more PII than necessary. This benefits both individuals and organizations: individuals do not have to share their PII, and organizations, which face an increase in security breaches along with the associated costs, reputational harm, and loss of trust, never receive the PII that could be breached.

Another benefit for both individuals and organizations is more efficient verification, reducing the bottlenecks created by manual or otherwise inefficient burdens of proof.

Having positive and efficient verification between parties (even untrusted ones) opens up a variety of avenues for collaboration and enquiry.

Applications and Use Cases

ZKPs can protect data privacy in a diverse set of applications and use cases, including:

  • Finance: A mortgage or leasing applicant can prove their income falls within a certain range without revealing their salary. (Financial institution ING is already using this technology, according to Dilmegani, 2022.)
  • Online voting: ZKP can enable anonymous and verifiable voting and help prevent voting fraud or manipulation.
  • Machine Learning: A machine learning algorithm owner can convince others about the model's results without revealing any information about the model.
  • Blockchain Security: Transactions can be verified without sharing information such as wallet addresses and amounts with third party systems.
  • Identity and credential management: Identity-free verification could apply to authentication, end-to-end encrypted messaging, digital signatures, or any application requiring passwords, passports, birth certificates, driving licences, or other forms of identity verification. Fraud prevention systems could validate user credentials and PII could be anonymized to comply with regulations or for decentralized identity.
  • International security: ZKPs enable the verification of the origin of a piece of information without revealing its source. This means cyber-attacks can be attributed to a specific entity or nation without revealing how the information was obtained. This is already being used by the United States' Department of Defense  (Zero-knowledge proof: how it works and why it's important, n.d.).
  • Nuclear disarmament: Countries could securely exchange proof of disarmament without requiring physical inspection of classified nuclear facilities.
  • COVID-19 vaccine passports and travel: As currently done in Denmark, individuals could prove their vaccination status without revealing their PII (Shilo, 2022).
  • Auditing or compliance applications: Any process that requires verification of compliance could use ZKP. This could include verifying that taxes are filed, an airplane was maintained, or data is retained by a record keeper.
  • Anonymous payments: Credit card payments could be made without being visible to multiple parties such as payments providers, banks, and government authorities.

Challenges

While there are many benefits, there are also challenges that need to be taken into consideration if an organization wants to use ZKPs.

  • Computation intensity: ZKP algorithms are computationally intense. For interactive ZKPs, many interactions between the verifier and the prover are required, and for non-interactive ZKPs, significant computational capabilities are required. This makes ZKPs unsuitable for slow or mobile devices and may cause scalability issues for large enterprises.
  • Hardware costs: Applications that want to use ZKPs must factor in hardware costs which may increase costs for end-users.
  • Trust assumptions: While some ZKP public parameters are available for reuse and participants in the trusted setup are assumed to be honest, recipients must still rely on the honesty of the developers (What are zero-knowledge proofs?, 2023).
  • Quantum computing threats: While ZKP cryptographic algorithms are currently secure, the development of quantum computers could eventually break the security model.
  • Costs of using the technology: The costs of ZKPs can vary based on setup requirements, efficiency, interactive requirements, proof succinctness and the hardness assumptions required (Big Data UN Global Working Group, 2019).
  • Lack of standards: Despite ongoing initiatives to standardize zero-knowledge techniques and constructions, there is still an absence of standards, systems, and homogeneous languages.
  • No 100% guarantee: Though the probability of verification while the prover is lying can be significantly low, ZKPs do not guarantee the claim is 100% valid.
  • Skills: ZKP developers should have expertise in ZKP cryptography and be aware of the subtleties and differences between the guarantees provided by ZKP algorithms.

What's next?

In recent years there has been a strong push for adopting zero knowledge in software applications. Several organizations have built applications using ZK capabilities, and ZKPs are widely used to safeguard blockchains. For example, the city of Zug in Switzerland has registered all its citizen IDs on a blockchain (Anwar, 2018).

Though improvements are needed in ZK education, standardization, and privacy certifications to improve trust in ZK products and services, ZKPs have great potential for saving organizations the costs of security breaches, preserving users' privacy, and reducing the use of PII as a product for sale. ZKPs help an organization move from reacting to security breaches to preventing them.

Meet the Data Scientist

Register for the Data Science Network's Meet the Data Scientist Presentation

If you have any questions about my article or would like to discuss this further, I invite you to Meet the Data Scientist, an event where authors meet the readers, present their topic and discuss their findings.

Thursday, June 15
1:00 to 4:00 p.m. ET
MS Teams – link will be provided to the registrants by email

Register for the Data Science Network's Meet the Data Scientist Presentation. We hope to see you there!

Subscribe to the Data Science Network for the Federal Public Service newsletter to keep up with the latest data science news.

Related Topics

Authentication, Blockchain, Web 3.0, Privacy-Enhancing Computation (PEC) techniques: Differential Privacy, Homomorphic Encryption, Secure Multiparty Computation, Trusted Execution Environment

References


Ottawa to hold World Statistics Congress in July 2023

By: Bridget Duquette, Statistics Canada

This summer, Ottawa will be the backdrop for the 64th World Statistics Congress (WSC), hosted by the International Statistical Institute (ISI) from July 16 to 20 at the Shaw Centre. This event will feature a variety of panels, presentations and social events, as well as networking and recruitment opportunities. It will offer a great opportunity for knowledge sharing and collaboration between data scientists, statisticians and methodologists at the international level.

The WSC has been held every two years since 1887 and is attended by statisticians, academics and business leaders. This event helps to shape the landscape of statistics and data science worldwide. Canada has hosted this prestigious event only once before, in 1963, in Ottawa.

It’s traditional for the host country of the WSC to plan social events for attendees. This year, international guests will be offered a tour of local sites in Ottawa’s downtown core, guided by Statistics Canada’s Eric Rancourt, Assistant Chief Statistician, and Claude Girard, Senior Methodologist.

A sneak peek of the congress programme is available and includes information on presentations on dozens of topics of interest to data scientists. This year, the illustrious keynote speaker will be Professor Robert M. Groves, former Director of the United States Census Bureau.

Figure 1: Ottawa’s Shaw Centre.

Kenza Sallier, Senior Methodologist at StatCan and co-author of the recently published article entitled Unlocking the power of data synthesis with the starter guide on synthetic data for official statistics, is looking forward to participating once again—though this will be her first time attending in person.

“I attended the 2021 WSC, in the middle of the pandemic (and census collection),” Kenza says. “I had the great opportunity to present Statistics Canada’s achievements related to data synthesis and to also be invited to take part in a panel session to share my experience as a young female statistician in the world of official statistics. Even though it was virtual, the event supported meeting and networking with many interesting people. I am looking forward to attending the 2023 WSC as it is taking place in person. My colleague Craig Hilborn and I will be presenting our work and I hope to get feedback from our peers.”

Shirin Roshanafshar, Chief of Text Analytics and Digitalization at Statistics Canada will also be attending the conference and speaking at the session about the challenges of Natural Language Processing techniques in official statistics.

For all participants—whether they are attending for the first time or the fifth time—WSC 2023 is sure to be an exciting experience. In the words of ISI President Stephen Penneck: “The congress encourages collaboration, growth, discovery, and advancement in the field of data science. I am excited to have the 64th World Statistics Congress visit Canada and look forward to the impact it will have on the industry.”

Check back for a review of the conference and the exciting developments from this global event.


National Indigenous History Month 2023... by the numbers


Demography

  • The 2021 Census counted 1.8 million Indigenous people, accounting for 5.0% of the total population in Canada.
  • For the first time, the 2021 Census counted more than 1 million First Nations people living in Canada (1,048,405).
  • There were 624,220 Métis living in Canada in 2021, up 6.3% from 2016.
  • The Inuit population in Canada numbered 70,545, with just over two-thirds (69.0%) living in Inuit Nunangat—the homeland of Inuit in Canada.
  • The average age of Indigenous people was 33.6 years in 2021, compared with 41.8 years for the non-Indigenous population.
  • The Indigenous population grew by 9.4% from 2016 to 2021, almost twice the pace of growth of the non-Indigenous population over the same period (+5.3%).
  • Population projections for First Nations people, Métis and Inuit suggest that the Indigenous population could reach between 2.5 million and 3.2 million in 2041.

Sources:

Children and Youth

  • There were 459,215 Indigenous children under 15 years old in 2021, accounting for one-quarter (25.4%) of the total Indigenous population. By comparison, 16.0% of the non-Indigenous population was under 15 years old.
  • For First Nations, Métis and Inuit families, grandparents often play an important role in raising children and passing down values, traditions and cultural knowledge to younger generations. In 2021, 14.2% of Indigenous children lived with at least one grandparent, compared with 8.9% of non-Indigenous children.
  • The majority (56.0%) of Indigenous children lived in a two-parent household in 2021. However, more than one-third (35.8%) of Indigenous children lived in a one-parent household.
  • Altogether, Indigenous children accounted for over half (53.8%) of all children in foster care, while nationally, Indigenous children accounted for 7.7% of all children 14 years of age and younger.

Sources:

Education

  • According to the 2021 Census, nearly three-quarters (73.9%) of Indigenous people aged 25 to 64 had completed high school, and 12.9% had a bachelor's degree or higher.
  • Gaps between the Indigenous and the non-Indigenous population are narrowing when it comes to high school completion. The share of Métis who had completed high school rose 4.6 percentage points from 2016, to 82.0% in 2021, while for First Nations people, the share increased 5.5 percentage points to 69.9%. For the first time, in 2021, over half (50.1%) of Inuit aged 25 to 64 had completed a high school diploma or equivalency certificate, up 4.7 percentage points from 2016. In all these cases, these increases in the share of Indigenous populations with a high school diploma were larger than among the Canadian-born non-Indigenous population (+2.0 percentage points), 88.9% of whom had completed high school.
  • However, gaps between the share of the Indigenous and the non-Indigenous population completing a bachelor's degree or higher are widening. Although the share of Indigenous people with a bachelor's degree or higher rose by 1.9 percentage points (to 12.9%) in 2021, this was less than the increase among the Canadian-born non-Indigenous population (+2.9 percentage points, to 27.8%). This trend held true among Métis (+2.5 percentage points, to 15.7%), First Nations people (+1.6 percentage points, to 11.3%) and Inuit (+0.9 percentage points, to 6.2%).

Sources:

Indigenous Languages

  • Over 70 Indigenous languages were reported in the 2021 Census, with 237,420 Indigenous people reporting that they could speak an Indigenous language well enough to conduct a conversation. This represents 13.1% of the Indigenous population.
  • The most commonly spoken Indigenous languages among Indigenous people were Cree languages, with 86,480 speakers, followed by Inuktitut and Ojibway languages.
  • The majority (72.3%) of Indigenous people who could speak an Indigenous language reported that they had an Indigenous language as their mother tongue.
  • However, the number of Indigenous people who could speak an Indigenous language but did not have an Indigenous mother tongue grew by 7.0% over the same period. This change reflects a growing share of Indigenous people who are learning an Indigenous language as a second language.

Sources:

Health and well-being

  • During the first year of the pandemic, mental health conditions such as depression and anxiety were the leading chronic conditions reported with about one in five Indigenous adults reporting these, compared with one in ten non-Indigenous adults. Mental health challenges among Indigenous people have been linked to the effects of intergenerational trauma stemming from colonialism as well as social determinants of health such as poverty, unemployment, housing and food security.
  • During the pandemic, in 2021, First Nations people (11%), Métis (8%) and Inuit (19%) were about two to five times more likely than non-Indigenous people (4%) to report exposure to some form of discrimination in health care.
  • Rates of disability among First Nations people living off reserve and Métis were higher than for non-Indigenous people in 2017. Almost one-third of both First Nations people living off reserve (32%) and Métis (30%) aged 15 and older reported having at least one disability in 2017, compared with 22% among the non-Indigenous population. Among Inuit, that proportion was lower (19%), largely because Inuit are a younger population.

Sources:

By the numbers

By the numbers features statistical information on various themes and special occasions.

Data are current as of their publishing date.

Journalists are advised to contact Media Relations for available updates.



Indigenous Communities Food Receipts Crowdsourcing with Optical Character Recognition

By: Shannon Lo, Joanne Yoon, Kimberley Flak, Statistics Canada

Everyone deserves access to healthy and affordable food, no matter where they live. However, many Canadians living in northern and isolated communities face increased costs related to shipping rates and supply chain challenges. In response to food security concerns in the North, the Government of Canada established the Nutrition North Canada (NNC) subsidy program. Administered by Crown-Indigenous Relations and Northern Affairs Canada (CIRNAC), this program helps make nutritious foods like meat, milk, cereals, fruit, and vegetables more affordable and accessible. To better understand the challenges impacting food security, improved price data is needed.

On behalf of CIRNAC, and in collaboration with the Centre for Special Business Projects (CSBP), Statistics Canada’s Data Science Division conducted a proof-of-concept project to investigate crowdsourcing as a potential solution to the data gap. This project evaluated the feasibility of using optical character recognition (OCR) and natural language processing (NLP) to extract and tabulate pricing information from images of grocery receipts as well as developing a web application for uploading and processing receipt images. This article focuses on the text identification and extraction algorithm, while the web application component is not covered.

Data

The input data for the project consisted of images of grocery receipts from purchases made in isolated Indigenous regions, including photos taken with a camera and scanned images. The layout and content of the receipts varied across retailers. From these receipts, we aimed to extract both product-level pricing information as well as receipt-level information, such as the date and location of purchase, which provide important context for downstream analysis. The extracted data were compiled into a database to support validation, analysis, and search functions.

High level design

Figure 1 illustrates the data flow, from receipt submission to digitization, storage, and display. This article focuses on the digitization process.

Figure 1: Data flow

This is a process diagram depicting the flow of data between the various processes in the project. It highlights the three digitization processes that this article focuses on: extracting text using OCR, correcting spelling and classifying text, and packaging the data.

  1. Receipt: Take a photo of the receipt; forward to the web app.
  2. Web App: Upload the photo using a web application; forward for text extraction.
  3. Text: Extract text using OCR; forward for classification.
  4. Classified Text: Correct spelling and classify the text; forward for packaging.
  5. Record: Package the data; forward to the protected database.
  6. Protected Database: Save the data; forward to the web dashboard.
  7. Web Dashboard: Display the data in the web dashboard.

Text extraction using OCR

We extracted text from receipts by first detecting text regions using Character-Region Awareness For Text detection (CRAFT) and then recognizing characters using Google's OCR engine, Tesseract. CRAFT was chosen over other text detection models because it effectively detected text even in blurred, low-resolution areas or areas with missing ink. For more information on CRAFT and Tesseract, refer to the Data Science Network's article, Comparing Optical Character Recognition Tools for Text-Dense Documents vs. Scene Text.

Tesseract recognized text from detected text boxes. Generally, Tesseract looked for English and French alphabets, digits and punctuation. However, for text boxes that started on the far right (i.e., those with a left x coordinate at least three-quarters of the way toward the maximum x coordinate in the current block), Tesseract only looked for digits, punctuation, and certain single characters used to indicate product tax type, assuming the text box contained price information. By limiting the characters to recognize, we prevented cases such as a zero from being recognized as the character “O.”

If Tesseract did not recognize any text in a text box, or if the confidence of the recognition was less than 50%, we first tried cleaning the image. Text with uneven darkness or missing ink was patched using Contrast Limited Adaptive Histogram Equalization (CLAHE). This method improves an image's local contrast by equalizing the histogram of pixel intensities within small tiles while limiting how much any single intensity level can dominate. The image's brightness and contrast were also adjusted to make the black text stand out more. These cleaning steps allowed Tesseract to better recognize the text. However, applying this preprocessing to all text boxes was not recommended, because it hindered Tesseract on some images taken under different conditions; the recognition after preprocessing was kept only when the recognition probability increased. When Tesseract failed even after image preprocessing, the program used EasyOCR's Scene Text Recognition (STR) model instead. This alternative text recognition model performed better on noisier images, where the text was printed with spotty amounts of ink or the image was blurry.
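
A simplified sketch of this fallback logic, using pytesseract and OpenCV, is shown below. The 50% confidence threshold and the restricted character set mirror the description above, but the exact whitelist and thresholds used in the project are assumptions.

```python
import cv2
import pytesseract

def recognize_price_box(gray_crop):
    """Recognize a right-aligned price box: try Tesseract first, then retry
    after CLAHE contrast enhancement if the confidence is low."""
    # Restrict the character set so that, for example, a zero is never read as "O"
    config = "--psm 7 -c tessedit_char_whitelist=0123456789.,-$"

    data = pytesseract.image_to_data(gray_crop, config=config,
                                     output_type=pytesseract.Output.DICT)
    confs = [float(c) for c in data["conf"] if float(c) >= 0]
    text = pytesseract.image_to_string(gray_crop, config=config)

    if not confs or min(confs) < 50:
        # CLAHE evens out uneven ink and lighting before a second attempt
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        enhanced = clahe.apply(gray_crop)
        text = pytesseract.image_to_string(enhanced, config=config)

    return text.strip()
```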

Spell check

SymSpell was trained using individual product names from the 2019 Survey of Household Spending (SHS) database. To improve the quality of the correction, the spell corrector selected the most common word based on nearby word information. For example, if the recognized line was “suo dried tomatoes,” the spelling corrector could correct the first word to “sub,” “sun,” and “sum,” but it would choose “sun” since it recognizes the bigram “sun dried,” but not “sub dried.” On the other hand, if the OCR predicted the line to be “sub dried tomatoes,” no words were corrected since each word was a valid entry in the database. We aimed to avoid false corrections as much as possible. If a character was not detected due to vertical lines of no ink, the missing character was also recovered using spell correction. For example, if the recognized line was “sun dri d tomatoes,” the spelling corrector corrected the line to “sun dried tomatoes.”
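
A sketch of this bigram-aware correction with the symspellpy package is shown below. The dictionary file names are placeholders standing in for frequency lists built from the SHS product names.

```python
from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)

# Placeholder file names: unigram and bigram frequency lists built from SHS product names
sym_spell.load_dictionary("shs_product_unigrams.txt", term_index=0, count_index=1)
sym_spell.load_bigram_dictionary("shs_product_bigrams.txt", term_index=0, count_index=2)

# lookup_compound corrects the whole line and uses bigram context, so
# "suo dried tomatoes" becomes "sun dried tomatoes" rather than "sub dried tomatoes"
suggestions = sym_spell.lookup_compound("suo dried tomatoes", max_edit_distance=2)
print(suggestions[0].term)
```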

A separate spell checker corrected the spelling of store names and community names.

Text classification

To identify what each line of extracted text described, a receipt-level and a product-level entity classifier were built. The following sections describe the relevant entities, the sources of training data, the models explored, and their performance.

Entities

Each extracted row of text was classified into one of the 11 groups shown in Table 1. This step enables sensitive information to be redacted, and the remainder of the information to be used meaningfully.

Table 1: Entities extracted from receipts

  • Date
  • Store name
  • Store location
  • Sale summary
  • Product
  • Price per quantity
  • Subsidy
  • Discount
  • Deposit
  • Sensitive information (includes the customer's address, phone number, and name)
  • Other

Entity classifier training data

Training data was gathered from labelled receipts, the SHS database, as well as public sources such as data available on GitHub. Refer to Table 2 for details about each training data source.

Table 2: Training data sources

Data | Records | Source | Additional details
Labelled receipts | 1,803 | CIRNAC | Used OCR to extract information from receipt images, which were then labelled by analysts.
Products | 76,392 | SHS database | 2 or more occurrences.
Store names | 8,804 | SHS database | 2 or more occurrences.
Canadian cities | 3,427 | GitHub |
Canadian provinces | 26 | GitHub | Full names and the abbreviated forms of the 13 provinces and territories.
Communities | 131 | Nutrition North Canada | Communities eligible for the NNC program.
Last names | 87,960 | GitHub | Categorized as sensitive information.

Model selection and hyperparameter tuning

Two multiclass classifiers were used, one to classify receipt-level entities (i.e., store name and location) and the other to classify product-level entities (i.e., product description, subsidy, price per quantity, discount, and deposit). Table 3 describes the various models used in the experiment to classify receipt-level and product-level entities. The corresponding F1 macro scores for the two different classifiers are also displayed.

Table 3: Models experimented with for the receipt and product classifiers

Model | Description | Receipts classifier F1 macro score | Products classifier F1 macro score
Multinomial Naïve Bayes (MNB) | The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). [1] | 0.602 | 0.188
Linear Support Vector Machine with SGD training | This estimator implements regularized linear models (SVM, logistic regression, etc.) with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated each sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate). [2] | 0.828 | 0.899
Linear Support Vector Classification | Similar to SVC with parameter kernel='linear', but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples. This class supports both dense and sparse input, and the multiclass support is handled according to a one-vs-the-rest scheme. [3] | 0.834 | 0.900
Decision Tree | Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. [4] | 0.634 | 0.398
Random Forest | A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. [5] | 0.269 | 0.206
XGBoost | XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. [6] | 0.812 | 0.841

Before selecting the best models, hyperparameter tuning was conducted using grid search. Stratified K-Folds cross-validation was then used to train and test the models, addressing the class imbalance in the training dataset, which was composed mostly of sensitive information (49%) and product names and/or prices (44%). The remaining 7% of the dataset included information such as store name, location, subsidy, discount, date, and price per quantity. After testing and training, the best-performing models for receipt-level and product-level entities were selected based on the F1 macro score. The F1 macro score was used as the measure of performance because it weighs each class equally: even if a class has very few examples in the training data, the quality of predictions for that class counts as much as for a class with many examples. This matters for an imbalanced training dataset like this one, where some classes have few examples while others have many.

A rule-based approach was used to identify dates because the standard formats for dates make this a more robust method.

The Linear Support Vector Classification (SVC) classifier was chosen as the best model for both the receipt and product classifiers based on its F1 macro scores of 0.834 (receipts) and 0.900 (products), which were higher than those of all the other models tested. Despite being the top-performing model, it is worth noting that SVC classifiers generally take more time to train than MNB classifiers.
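
A condensed sketch of this selection process with scikit-learn is shown below. The article does not specify the text features used, so the TF-IDF representation and the parameter grid here are assumptions; only the F1 macro scoring, stratified cross-validation and the LinearSVC model come from the description above.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# `lines` is a list of extracted receipt lines and `labels` their entity classes
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),   # assumed feature representation
    ("clf", LinearSVC()),
])

param_grid = {"clf__C": [0.1, 1.0, 10.0]}             # illustrative grid only

# Stratified folds preserve the class proportions despite the imbalance,
# and F1 macro weighs every entity class equally
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=cv)
search.fit(lines, labels)
print(search.best_params_, round(search.best_score_, 3))
```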

Packaging OCR’d text into a receipt record

The trained receipt-level and product-level entity classifiers were used on different parts of the receipt. Assuming the receipt was laid out as seen in figure 2, the receipt-level entity classifier predicted the class of all extracted receipt lines except for section 3: Products, and the product-level entity classifier was only used on section 3: Products. This layout worked for all receipts in our dataset. If a component, such as store names, was cropped out of the photo, that field was left empty in the final output.

Figure 2: Receipt component layout

This is an image of a receipt showing an example of its different sections:

  • Store name and address
  • Transaction record
  • Products (description, SKU, price, discount, subsidy and price per quantity)
  • Subtotal, taxes and total
  • Transaction record

The beginning of the receipt, including 1) store name and address and 2) transaction record, consisted of the text lines found before the first line that the products classifier predicted to be a product and that had a dollar value. No store name or location was returned if this part was empty, that is, if the first line directly described a product. Of all the text recognized in this section, the line the receipt classifier predicted to be the store name with the highest probability was assigned as the store name. A valid community name was extracted from lines predicted to be a location. Lines that the receipt classifier predicted to be sensitive information in this section were redacted.

The main body of a receipt included 3) the products list. Each line that the products classifier predicted as a product and that had a dollar value was considered a new product. Any following lines of text predicted to be a subsidy, discount, deposit, or price per quantity were added as auxiliary information for that product. Subsidies were further broken down into the Nutrition North Canada (NNC) subsidy and the Nunavik cost of living subsidy, depending on the text description.

The end of the receipt included 4) subtotal, taxes and total and 5) transaction record. Nothing needed to be extracted from these two sections, but lines that the receipt classifier predicted to be sensitive information were redacted.

The date of purchase appeared either at the beginning or the end of the receipt. Dates were thus parsed by searching for known date format regular expression patterns in those sections of the receipt.
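
A rule-based date search of this kind can be sketched as follows; the patterns and formats listed are illustrative, not the project's actual list.

```python
import re
from datetime import datetime

# Illustrative date patterns and formats only
DATE_REGEX = re.compile(r"\b(\d{4}[-/]\d{2}[-/]\d{2}|\d{2}[-/]\d{2}[-/]\d{4})\b")
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y", "%d/%m/%Y", "%m/%d/%Y")

def find_purchase_date(lines):
    """Scan receipt lines (header and footer sections) for the first parsable date."""
    for line in lines:
        for candidate in DATE_REGEX.findall(line):
            for fmt in DATE_FORMATS:
                try:
                    return datetime.strptime(candidate, fmt).date()
                except ValueError:
                    continue
    return None

print(find_purchase_date(["NORTHERN STORE", "2022-03-14 13:52", "MILK 2L  7.99"]))
```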

Results

The algorithm was evaluated using photos of grocery receipts from northern food retailers in remote Indigenous communities. Analysts from Statistics Canada’s Centre for Special Business Projects labeled products and sales information in each image.

Extracted text fields, including store names, community names, product descriptions and dates, were evaluated using a similarity score. The similarity between two texts was calculated as two times the number of matching characters divided by the total number of characters in both texts. Extracted numbers, such as the product price, subsidy, discount and deposit, were each evaluated as either a match (1) or not (0).
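
This character-level measure is exactly what Python's difflib ratio computes (2M/T, where M is the number of matching characters and T the total length of both strings), as the small example below illustrates.

```python
from difflib import SequenceMatcher

def text_similarity(predicted: str, actual: str) -> float:
    """2 * (matching characters) / (total characters in both strings)."""
    return SequenceMatcher(None, predicted, actual).ratio()

# 17 of 18 characters line up in each string: 2 * 17 / 36 ≈ 0.94
print(text_similarity("sun dri d tomatoes", "sun dried tomatoes"))
```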

For singular fields such as store names, it was easy to compare the predicted value with the actual value. Nevertheless, a simple one-to-one comparison was not possible for comparing the multiple items captured manually with the multiple items predicted by the OCR algorithm. Consequently, each manually captured item was first matched to the most similar item extracted by the OCR algorithm. The matched items from two sources needed to be at least 50% similar to each other. Items captured manually but not by the algorithm were called missed items. Items captured by the algorithm but not manually were called extra items. Once the common items were matched together, the similarity scores for all pairs were averaged to produce an overall similarity score for all common items found on the receipts.

The OCR algorithm excelled at identifying products from the receipts. Of the 1,637 total products listed on the receipts, 1,633 (99.76%) were captured (Table 4), with an average product description similarity of 96.85% (Table 5). The algorithm failed when text in the image was cut off, blurred, creased, or had areas with no ink. As a result, we recommended that OCR extractions be followed by human verification via the web interface. For the products in common, prices were correctly extracted 95.47% of the time, NNC subsidies were correct 99.14% of the time, Nunavik COL subsidies were correct 99.76% of the time, discounts were correct 100.0% of the time, deposits were correct 99.76% of the time, price per quantities were correct 95.71% of the time, and SKUs were correct 95.22% of the time (Table 5).

Even though product descriptions and prices were always present, other fields such as NNC subsidy were only present when applicable. For this reason, Table 5 also reports accuracies restricted to non-missing fields, in order to evaluate the OCR performance exclusively. No discount entries were included in this batch of receipts, so another batch was used to evaluate that discounts were correctly extracted 98.53% of the time. The text similarity score for fields observed and OCR’d was 87.1%.

Table 4: Products extracted from receipts from CIRNAC

Number of receipts | Number of items | Number of items extracted | Number of items in common | Percentage of items missed | Percentage of extra items
182 | 1,637 | 1,634 | 1,633 | 0.24% (4/1,637) | 0.06% (1/1,630)

Table 5: Accuracy of OCR on product-level information

 | Product description (average text similarity score) | Price (% correct) | NNC subsidy (% correct) | Nunavik COL subsidy (% correct) | Discount (% correct) | Deposit (% correct) | Price per quantity (% correct) | SKU (% correct)
Accuracy on items in common | 96.85% | 95.47% (1,559/1,633) | 99.14% (1,619/1,633) | 99.76% (1,629/1,633) | 100.0% (1,633/1,633) | 99.76% (1,629/1,633) | 95.71% (1,563/1,633) | 95.22% (1,555/1,633)
Accuracy on items when fields were present | 96.85% | 95.47% (1,559/1,633) | 99.08% (647/653) | 100.0% (282/282) | Not available (no actual occurrences) | 97.56% (160/164) | 72.97% (81/111) | 95.52% (1,555/1,628)

Receipt information was extracted effectively with no communities, store names, or dates being completely missed or falsely identified. The average text similarity score was consistently high: 99.12% for community, 98.23% for store names, and 99.89% for dates. Using the OCR algorithm and the receipt entity classifier to process receipts appears promising.

Additionally, 88.00% of sensitive texts were correctly redacted. Of the texts that failed to be redacted, most were cashier IDs. These were not redacted because the entity classifier had not seen this type of sensitive information before. Retraining the entity classifier with examples of cashier IDs will improve the results, much like how the classifier recognizes cashier names to be sensitive because of examples like "Your cashier today was <cashier name>" in its training data.

Table 6: Accuracy of OCR on receipt-level information

 | Number of receipts | Store name (average text similarity score) | Community (average text similarity score) | Date (average text similarity score) | Sensitive info (recall %)
Evaluation | 182 | 98.23% | 99.12% | 99.89% | 88.80
Evaluation when fields were present | 164 | 99.02% | 99.88% | 98.03% | Not applicable

Conclusion

This project demonstrated that an entity classification and OCR algorithm can accurately capture various components of grocery receipts from northern retailers. Automating this process makes it easier to collect data on the cost of living in the North. If and when this solution goes into production, the automation should be followed with a human-in-the-loop validation process through a web interface to ensure that the receipt is correctly digitized, and corrections are iteratively used for retraining. This validation feature has been implemented but is not discussed in this article.

Aggregate anonymized data collected through crowdsourcing has the potential to provide better insight into issues associated with the high cost of food in isolated Indigenous communities and could improve the transparency and accountability of Nutrition North Canada subsidy recipients to residents in these communities. If you are interested in learning more about the web application component, please reach out to datascience@statcan.gc.ca.

Meet the Data Scientist

Register for the Data Science Network's Meet the Data Scientist Presentation

If you have any questions about my article or would like to discuss this further, I invite you to Meet the Data Scientist, an event where authors meet the readers, present their topic and discuss their findings.

Thursday, June 15
1:00 to 4:00 p.m. ET
MS Teams – link will be provided to the registrants by email

Register for the Meet the Data Scientist event. We hope to see you there!

Subscribe to the Data Science Network for the Federal Public Service newsletter to keep up with the latest data science news.

References

  1. sklearn.naive_bayes.MultinomialNB — scikit-learn 1.2.0 documentation
  2. sklearn.linear_model.SGDClassifier — scikit-learn 1.2.0 documentation
  3. sklearn.svm.LinearSVC — scikit-learn 1.2.2 documentation
  4. 1.10. Decision Trees — scikit-learn 1.2.0 documentation
  5. sklearn.ensemble.RandomForestClassifier — scikit-learn 1.2.0 documentation
  6. XGBoost Documentation — xgboost 1.7.2 documentation

Tackling Information Overload: How Global Affairs Canada’s “Document Cracker” AI Application Streamlines Crisis Response Efforts

Prepared by the data science team at Global Affairs Canada

Introduction

When a global crisis hits, government officials often face the challenge of sifting through a flood of new information to find key insights that will help them to manage Canada’s response effectively. For example, following the Russian invasion of Ukraine in February of 2022, a substantial proportion of Canada’s diplomatic missions began filing situation reports (or SitReps) on local developments related to the conflict. With the sheer number of these SitReps, along with related meeting readouts, statements from international meetings and news media reports, it quickly became infeasible for individual decision makers to manually read all the relevant information made available to them.

To help address this challenge, the data science team at Global Affairs Canada (GAC) developed a document search and analysis tool called Document Cracker (hereafter “DocCracker”) that helps officials quickly find the information they need. At its core, DocCracker provides two key features: (1) the ability to search across a large volume of documents using a sophisticated indexing platform; and (2) the ability to automatically monitor new documents for specific topics, emerging trends, and mentions of key people, locations, or organizations. In the context of the Russian invasion, these features of the application are intended to allow Canadian officials to quickly identify pressing issues, formulate a preferred stance on these issues, and track the evolving stances of other countries. By providing such insights, the application can play a key role in helping officials to both design and measure the ongoing impacts of Canada’s response to the crisis.

Importantly, while DocCracker was developed specifically in response to events in Ukraine, it was also designed as a multi-tenant application that can provide separate search and monitoring interfaces for numerous global issues at the same time. For example, the application is currently being extended to support the analysis of geopolitical developments in the Middle East.

Application Overview

From a user’s perspective, the DocCracker interface is comprised of a landing page with a search bar and a variety of content cards that track recent updates involving specific geographical regions and persons of interest. The user can either drill down on these recent updates or submit a search query, which returns a ranked list of documents. Selecting a document provides access to the underlying transcript, along with the set of links to related documents. Users can also access the metadata associated with each document, which includes automatically extracted lists of topics, organizations, persons, locations, and key phrases. At all times, a banner at the top of the application page allows users to access a series of dashboards that highlight global and mission-specific trends concerning a predefined list of ten important topics (e.g., food security, war crimes, energy crisis, etc.).

To enable these user experiences, DocCracker implements a software pipeline that (a) loads newly available documents from a range of internal and external data sources, (b) “cracks” these documents by applying a variety of natural language processing tools to extract structured data, and (c) uses this structured data to create a search index that supports querying and dashboard creation. Figure 1 below provides a visual overview of the pipeline.

Figure 1: DocCracker processing pipeline

During the “load” stage of the pipeline, internal and external data sources are ingested and preprocessed to extract basic forms of metadata such as report type, report date, source location, title, and web URL. During the “crack” stage of the pipeline, the loaded documents are run through a suite of natural language processing tools to provide topic labels, identify named entities, extract summaries, and translate any non-English text to English. During the final “index” stage of the pipeline, the cracked documents are used to create a search index that supports flexible document queries and the creation of dashboards that provide aggregated snapshots of the document attributes used to populate this search index.

Implementation Details

DocCracker is hosted as a web application in Microsoft Azure’s cloud computing environment, and makes use of Azure services to support each stage of processing.

Data Ingestion

During the “load” stage, documents are collected into an Azure storage container either via automated pulls from external sources (e.g., the Factiva news feed, non-protected GAC databases) or manual uploads. Next, a series of Python scripts are executed to eliminate duplicate or erroneous documents and perform some preliminary text cleaning and metadata extraction. Because the documents span a variety of file formats (e.g., .pdf, .txt, .msg, .docx, etc.), different cleaning and extraction methods are applied to different document types. In all cases, however, Python’s regular expression library is used to strip out irrelevant text (e.g., email signatures, BCC lists) and extract relevant metadata (e.g., title or email subject line, date of submission).

Regular expressions provide a powerful syntax for specifying sets of strings to search for within a body of text. In formal terms, a given regular expression defines a set of strings that can all be recognized by a finite state machine that undergoes state transitions upon receiving each character in a span of input text; if these state transitions result in the machine entering an “acceptance” state, then the input span is a member of the set of strings being searched for. Upon detection, such a string can either be deleted (i.e., to perform data cleaning) or extracted (i.e., to collect metadata). Almost all programming languages provide support for regular expressions, and they are often a tool of first resort in data cleaning and data engineering projects.
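
As a brief illustration of this kind of regex-based cleaning and extraction (the field names, signature format, and patterns below are purely hypothetical and not the actual DocCracker scripts), a Python sketch might look like this:

import re

raw = """Subject: SitRep 42 - Mission update
Date: 2022-03-01
Body text of the report...
--
Jane Doe | Political Officer | Sent from my phone"""

# Extract metadata fields (illustrative patterns only)
subject = re.search(r"^Subject:\s*(.+)$", raw, flags=re.MULTILINE)
report_date = re.search(r"^Date:\s*(\d{4}-\d{2}-\d{2})\s*$", raw, flags=re.MULTILINE)

# Strip an email-signature block that begins with a line containing only "--"
cleaned = re.sub(r"\n--\n.*\Z", "", raw, flags=re.DOTALL)

print(subject.group(1))       # "SitRep 42 - Mission update"
print(report_date.group(1))   # "2022-03-01"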

Natural Language Processing

Once the documents have been preprocessed, they are split into chunks of text with at most 5120 characters to satisfy the input length requirements of many of Azure’s natural language processing services. Each text chunk is processed to remove non-linguistic information such as web URLs, empty white space, and bullet points. The chunks are then moved to a new storage container to undergo further processing using a variety of machine learning models.
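
A minimal sketch of this chunking step is shown below. The 5120-character limit reflects the service constraint described above; the whitespace-aware splitting logic is one plausible implementation, not the actual DocCracker code.

def chunk_text(text, max_chars=5120):
    """Split text into chunks of at most max_chars, breaking on whitespace where possible."""
    chunks = []
    while text:
        if len(text) <= max_chars:
            chunks.append(text)
            break
        split_at = text.rfind(" ", 0, max_chars)
        if split_at <= 0:            # no convenient space found; split at the hard limit
            split_at = max_chars
        chunks.append(text[:split_at])
        text = text[split_at:].lstrip()
    return chunks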

To identify mentions of persons, organizations, and locations, each text chunk is processed using an Azure service that performs named entity recognition (NER). This service functions by mapping spans of text onto a predefined set of entity types. Next, similar services are used to extract key phrases and a few summary sentences from each document, while also performing inline translations of non-English text. Finally, a sentiment analysis service is used to provide sentiment ratings on specific organizations for display in the application’s landing page. The outputs of each Azure service are saved back to a SQL database as metadata attributes associated with the underlying documents that have been processed.
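
To give a sense of what calling such a service looks like, here is a rough sketch using the azure-ai-textanalytics client library (the endpoint, key, and sample text are placeholders, and the actual DocCracker pipeline may invoke the service differently):

from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

client = TextAnalyticsClient(
    endpoint="https://<your-language-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>"),
)

chunks = ["Officials met in Warsaw to discuss grain exports with the World Food Programme."]

# Named entity recognition on a batch of text chunks
for result in client.recognize_entities(documents=chunks):
    if not result.is_error:
        for entity in result.entities:
            print(entity.text, entity.category, round(entity.confidence_score, 2))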

To augment the results obtained with Azure, GAC’s data science team also developed a customized topic labelling model that identifies the presence of any of ten specific topics of interest in each text chunk. This model uses a technique called bidirectional encoder representations from transformers (BERT) to analyze chunks of text and determine which of the predefined topics are present in the text. The model provides a list of topics found, which can range from zero to ten topic labels.

As shown in Figure 2 below, the model was developed iteratively with increasing amounts of labelled training data. By the third round of model training, highly accurate classification results were obtained for eight of the ten topics, while moderately accurate results were obtained for two of the ten topics. Model testing was carried out using 30% of the labelled data samples, while model training was performed using the other 70% of samples. In total, roughly 2000 labelled samples were used to develop the model.

While this is a small amount of data in the context of typical approaches to developing supervised machine learning systems, one of the key advantages of using a BERT architecture is that the model is first pre-trained on a large amount of unlabelled text before being fine-tuned to perform some task of interest. During pre-training, the model simply learns to predict the identities of missing words that have been randomly blanked out of a corpus of text. By performing this task, the model develops highly accurate internal representations of the statistical properties of human language. These representations can then be efficiently repurposed during the fine-tuning stage to learn effective classification decisions from a small number of labelled examples.
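
As a rough sketch of the multi-label setup described above (using the Hugging Face transformers library; the base checkpoint, topic names, and 0.5 decision threshold are illustrative assumptions, and the model below would still need to be fine-tuned on the labelled samples before its predictions mean anything):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

TOPICS = ["food security", "war crimes", "energy crisis"]  # illustrative subset of the ten topics

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(TOPICS),
    problem_type="multi_label_classification",  # sigmoid per topic rather than softmax over topics
)

chunk = "Grain shipments remain blocked, raising concerns about food security in the region."
inputs = tokenizer(chunk, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    probabilities = torch.sigmoid(model(**inputs).logits)[0]

# A chunk can receive anywhere from zero to all of the topic labels
predicted_topics = [topic for topic, p in zip(TOPICS, probabilities) if p > 0.5]
print(predicted_topics)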

Figure 2: DocCracker AI Model Training Results

Evaluation results are shown following three rounds of training for a customized topic identification model that performs multi-label classification to identify up to ten predefined topics in a chunk of input text. With progressive increases in the amount of training data, the transformer-based neural network model is shown to achieve strong accuracy results for almost all the topics.

Finally, the outputs of the topic model get saved back to the SQL database as additional metadata attributes for each underlying document. This database now contains all the documents that have been ingested along with a rich collection of metadata derived using the natural language processing techniques just described. With this combination of documents and metadata, it is possible to create a search index that allows users to perform flexible document searches and create informative dashboard visualizations.

Indexing

In its simplest form, a search index is a collection of one or more tables that provide links between search terms and sets of documents that match these terms. When a user provides a search query, the query is broken apart into a collection of terms that are used to look up documents in the index. A ranking algorithm is then used to prioritize the documents that are matched by each term so as to return an ordered list of the documents that are most relevant to the search query.
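
In Python, a toy version of such an index and a term-overlap ranking might look like the sketch below (real search services, including the one used by DocCracker, add tokenization, stemming, and far more sophisticated scoring):

from collections import defaultdict

def build_index(documents):
    """Map each lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Rank documents by the number of query terms they match."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

docs = {1: "grain exports blocked at the port", 2: "energy prices rise sharply"}
index = build_index(docs)
print(search(index, "grain exports"))   # -> [1]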

In DocCracker, Azure’s cognitive search service is used to automatically create an index from the SQL database produced during earlier stages of the processing pipeline. Once this index is created, it is straightforward to create a landing page that allows users to enter search queries and obtain relevant documents. The metadata used to create the index can also be exported to CSV files to create dashboards that track a range of time-varying measures of how the situation in Ukraine is unfolding. For example, by selecting the metadata fields for topic labels and dates, it is possible to display the frequency with which different topics have been mentioned over time. Similarly, by selecting on named entities, it is possible to visualize which persons or organizations have been mentioned most often over a given time range. The volume of reporting coming out of different missions can also be easily tracked using a similar method of selection.

Overall, the search index provides a structured representation of the many unstructured SitReps, reports, and news articles that have been ingested into DocCracker. With this structured representation in hand, it becomes possible to enable search and monitoring capabilities that aid the important analytical work being done by the officials at Global Affairs tasked with managing Canada’s response to the Russian invasion.

Next Steps

Given the ever-increasing speed with which international crises are being reported on, it is essential to develop tools like DocCracker that help analysts draw insights from large volumes of text data. To build on the current version of this tool, the GAC data science team is working on several enhancements simultaneously. First, Latent Dirichlet Allocation (LDA) is being assessed to automatically identify novel topics as they emerge amongst incoming documents, thereby alerting analysts to new issues that might require their attention. Second, generative pre-trained transformer (GPT) models are being used to automatically summarize multiple documents, thereby helping analysts produce briefing notes more quickly for senior decision makers. Finally, stance detection models are being developed to automatically identify the positions that individual countries are adopting with respect to specific diplomatic issues (e.g., the issue of providing advanced weapons systems to Ukraine). With such models in hand, analysts should be able to track how countries are adapting their positions on a given issue in response to both diplomatic inducements and changing geopolitical conditions.

Overall, as tools like DocCracker become more widely used, we expect to see a range of new applications for the underlying technology to emerge. To discuss such applications or to learn more about the GAC data science team’s ongoing efforts in this area, please contact datascience.sciencedesdonnees@international.gc.ca.

Meet the Data Scientist

Register for the Data Science Network's Meet the Data Scientist Presentation

If you have any questions about my article or would like to discuss this further, I invite you to Meet the Data Scientist, an event where authors meet the readers, present their topic and discuss their findings.

Thursday, June 15
1:00 to 4:00 p.m. ET
MS Teams – link will be provided to the registrants by email

Register for the Data Science Network's Meet the Data Scientist Presentation. We hope to see you there!

Subscribe to the Data Science Network for the Federal Public Service newsletter to keep up with the latest data science news.


Low Code UI with Plotly Dash

By: Jeffery Zhang, Statistics Canada

Introduction

Often with data science work, we build models that are implemented in R or Python. If these models are intended for production, they'll need to be accessible to non-technical users.

A major problem with making data models accessible to non-technical users in production is the friction of creating accessible user interfaces. While it is acceptable for a research prototype to be run via a command line, this type of interface, with all its complexities, is very daunting to a non-technical audience.

Most data scientists are not experienced user interface (UI) developers, and most projects don't have the budget for a dedicated UI developer. In this article, we introduce a tool that allows non-UI specialists to quickly create good enough data UI using Python.

What is Plotly Dash?

Plotly is an open-source data visualization library. Dash is an open-source low-code data application development framework that is built on top of Plotly. Plotly Dash offers a solution to the data UI problem. A non-UI specialist data scientist can develop a good enough UI for a data app in just a few days with Plotly Dash in Python. In most projects, the 2-5 extra work-days needed to develop an interactive graphical UI are well worth the investment.

How does Plotly Dash work?

Plotly and Dash can be thought of as domain specific languages (DSL). Plotly is a DSL for describing graphs. The central object of Plotly is a Figure, which describes every aspect of a graph such as the axes, as well as graphical components such as bars, lines, or pie slices. We use Plotly to construct Figure objects and then use one of the available renderers to render it to the target output device such as a web browser.

Figure 1 - An example of a Plotly figure.
Description - Figure 1: Example of a Plotly figure

This is an example of a figure generated by Plotly. It is an interactive bar chart that allows the user to hover over the individual bars with the mouse and see the data values associated with each bar.

Dash provides two DSLs and a web renderer for Plotly Figure objects.

The first Dash DSL is for describing the structure of a web UI. It includes components for HTML elements such as div and p, as well as UI controls such as Slider and Dropdown. One of the key components of the Dash web DSL is the Graph component, which allows us to integrate a Plotly Figure into the Dash web UI.

Here's an example of a minimal Dash application.

from dash import Dash, html, dcc, callback, Output, Input
import plotly.express as px
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/gapminder_unfiltered.csv')

app = Dash(__name__)

app.layout = html.Div([
    html.H1(children='Title of Dash App', style={'textAlign':'center'}),
    dcc.Dropdown(df.country.unique(), 'Canada', id='dropdown-selection'),
    dcc.Graph(id='graph-content')
])

if __name__ == '__main__':
    app.run_server(debug=True)

This is what it looks like in a web browser.

Figure 2 - Minimal Dash application displayed in a web browser.
Description - Figure 2: Minimal app displayed in a browser.

This is an example of a minimal application created with Plotly Dash. It is a sample application that visualizes the growth of the Canadian population from 1950 to present using a line chart. The visualization is interactive and the user can hover the mouse over points on the blue line to see the data values associated with that point.

The second Dash DSL is for describing reactive data flows. This allows us to add interactivity to the data app by describing how data flows from user input components to the data model, and then back out to the UI.

Adding the following code to the above example creates a reactive data flow between the input component dropdown-selection, the function update_graph, and the output graph. Whenever the value of the input component dropdown-selection changes, the function update_graph is called with the new value of dropdown-selection and the return value of update_graph is output to the figure property of the graph-content component. This updates the graph based on the user's selection in the drop-down component.

@callback(
    Output('graph-content', 'figure'),
    Input('dropdown-selection', 'value')
)
def update_graph(value):
    dff = df[df.country==value]
    return px.line(dff, x='year', y='pop')

Useful features of Dash

Below are some common data app scenarios and how Dash features support those scenarios.

Waiting for long computations

Sometimes a data model will take a long time to run. It makes sense to give the user some feedback during this process so they know the data model is running and the application hasn't crashed. It would be even more useful to give a progress update so the user knows roughly how much work has been completed and how much is remaining.

We may also realize we made a mistake when setting the parameters of a long running job, and we'd like to cancel the running job and start over after making corrections. The Dash feature for implementing these scenarios is called Background callbacks.

Here's an example of a simple Dash application that features a long running job with the progress bar and cancellation.

Figure 3 - Example of simple Dash application with progress and cancellation.
Description - Figure 3: Long running job with progress bar and cancellation

This is an example of a Plotly Dash application involving a long running task with a progress bar to display the progress of the task. It has two buttons. The "Run Job!" button is initially enabled, and clicking it starts the task and the progress bar. While the task is running, the "Run Job!" button is disabled and the "Cancel Running Job!" button is enabled; clicking the latter before the task is complete cancels the running task.
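
A minimal sketch of how such a job might be wired up with background callbacks is shown below (assuming Dash 2.6 or later with the diskcache backend; the component ids, step count, and sleep call are all illustrative):

import time
import diskcache
from dash import Dash, DiskcacheManager, Input, Output, html

background_callback_manager = DiskcacheManager(diskcache.Cache("./cache"))
app = Dash(__name__, background_callback_manager=background_callback_manager)

app.layout = html.Div([
    html.Button("Run Job!", id="run-job"),
    html.Button("Cancel Running Job!", id="cancel-job"),
    html.Progress(id="progress-bar"),
    html.P(id="job-status"),
])

@app.callback(
    Output("job-status", "children"),
    Input("run-job", "n_clicks"),
    background=True,
    running=[
        (Output("run-job", "disabled"), True, False),
        (Output("cancel-job", "disabled"), False, True),
    ],
    cancel=[Input("cancel-job", "n_clicks")],
    progress=[Output("progress-bar", "value"), Output("progress-bar", "max")],
    prevent_initial_call=True,
)
def run_long_job(set_progress, n_clicks):
    total = 10
    for step in range(total):
        time.sleep(1)                                # stand-in for real computation
        set_progress((str(step + 1), str(total)))    # update the progress bar
    return f"Job completed after {total} steps."

if __name__ == "__main__":
    app.run_server(debug=True)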

Duplicate callbacks

Normally, the value of an output is uniquely determined by one callback. If multiple callbacks update the same output, we would face a scenario in which the output has multiple values at the same time, with no way of knowing which one is correct.

However, sometimes we might want to take the risk of binding multiple callbacks to the same output to make things simpler. Dash allows us to do this by explicitly specifying that we're willing to allow duplicate outputs. This feature is enabled by setting the allow_duplicate parameter on Output to True. Here's an example:

from dash import Dash, html, dcc, Output, Input
import plotly.express as px
import plotly.graph_objects as go

app = Dash(__name__)

app.layout = html.Div([
    html.Button('Draw Graph', id='draw-2'),
    html.Button('Reset Graph', id='reset-2'),
    dcc.Graph(id='duplicate-output-graph')
])

# allow_duplicate=True lets a second callback target the same figure output.
@app.callback(
    Output('duplicate-output-graph', 'figure', allow_duplicate=True),
    Input('draw-2', 'n_clicks'),
    prevent_initial_call=True
)
def draw_graph(n_clicks):
    df = px.data.iris()
    return px.scatter(df, x=df.columns[0], y=df.columns[1])

@app.callback(
    Output('duplicate-output-graph', 'figure'),
    Input('reset-2', 'n_clicks'),
)
def reset_graph(n_clicks):
    return go.Figure()

if __name__ == '__main__':
    app.run_server(debug=True)

Figure 4 - Example of a Dash application that uses duplicate callbacks.
Description - Figure 4: Graph that is updated by two different buttons.

This is an example of a Plotly Dash application that uses duplicate callbacks. It has 2 buttons that both target the same output, which is the graph below. Clicking the "Draw Graph" button renders the graph, while clicking the "Reset Graph" button clears the graph. Since both buttons target the same output, this scenario requires the duplicate callback feature of Dash.

In this case, we have two buttons for updating a graph: Draw and Reset. The graph will be updated by the last button that was pressed. While this is convenient, there's a risk to designing UI this way. On the desktop with one mouse pointer, button clicks can be assumed to be unique in time. However, on a multi-touch screen such as a smartphone or tablet, two buttons can be clicked at the same time. In general, once we allow duplicate callbacks, the output becomes potentially indeterminate. This can lead to some bugs that are very difficult to replicate.

This feature is both convenient and potentially dangerous. So use at your own risk!

Custom components

Sometimes the set of components that come with Dash are not enough. The web UI of Dash is built with React, and Dash provides a convenient tool for integrating custom React components into Dash. It's beyond the scope of this article to go into the details of React and Dash-React integration. However, you can read more about this – see: Build your own components.

Error display

Sometimes an error occurs during computation that is due to problems with the data, the code, or user error. Instead of crashing the application, we might want to display the error to the user and provide some feedback on what they can do to rectify it.

There are two Dash features that are used for this scenario: multiple outputs and dash.no_update.

multiple outputs is a Dash feature that allows callbacks to return multiple outputs in the form of a tuple.

dash.no_update is a value that can be returned in an output slot to represent no change in that output.

Here's an example that uses both of these features to implement an error display:

import dash
from dash import Output, Input

@app.callback(
    Output('out', 'children'),   # html components expose their content via 'children'
    Output('err', 'children'),
    Input('num', 'value')
)
def validate_num(num):
    if validate(num):            # validate() is an application-specific check
        return "OK", ""
    else:
        # Leave 'out' unchanged and report the problem in the error component
        return dash.no_update, "Error"

Partial Updates

Since Dash callback computations occur on the server, to display the results on the client, all the return values from the callback have to be sent to the client on each update.

Sometimes these updates involve very large Figure objects, which consume a lot of bandwidth and slow the update process, negatively impacting the user experience. The simple way to implement callback updates is to perform monolithic updates on large data structures such as Figure, even if only a small part of it has changed, such as the title.

To optimize bandwidth usage and improve the user experience, Dash has a feature called Partial Update. This feature introduces a new type of return value to callbacks called a Patch. A Patch describes which subcomponents of a larger data structure should be updated. This allows us to avoid sending an entire data structure across the network when only a portion of it needs to be updated.

Here is an example of Partial Updates that updates only the title font colour of the figure instead of the whole figure:

from dash import Dash, html, dcc, Input, Output, Patch
import plotly.express as px
import random

app = Dash(__name__)

df = px.data.iris()
fig = px.scatter(
    df, x="sepal_length", y="sepal_width", color="species", title="Updating Title Color"
)

app.layout = html.Div(
    [
        html.Button("Update Graph Color", id="update-color-button-2"),
        dcc.Graph(figure=fig, id="my-fig"),
    ]
)

@app.callback(Output("my-fig", "figure"), Input("update-color-button-2", "n_clicks"))
def my_callback(n_clicks):
    # Defining a new random color
    red = random.randint(0, 255)
    green = random.randint(0, 255)
    blue = random.randint(0, 255)
    new_color = f"rgb({red}, {green}, {blue})"

    # Creating a Patch object
    patched_figure = Patch()
    patched_figure["layout"]["title"]["font"]["color"] = new_color
    return patched_figure

if __name__ == "__main__":
    app.run_server(debug=True)

Dynamic UI and pattern matching callbacks

Sometimes, we can't define the data flow statically. For example, if we want to implement a filter stack that allows the user to flexibly add filters, the specific filters that the user will add won't be known ahead of time. If we want to define data flows involving the input components that the user adds at runtime, we can't do it statically.

Here's an example of a dynamic filter stack where the user can add new filters by clicking the ADD FILTER button. The user can then select the value of the filter via the drop down that is dynamically added.

Figure 5 - Example of dynamic UI in Dash.
Description - Figure 5: Dynamic filter stack

This is an example of a Plotly Dash application that uses dynamic UI and pattern matching callbacks. Clicking the "Add Filter" button adds an additional dropdown box. Since the dropdown boxes are added dynamically, we cannot bind them to callbacks ahead of time. Using the pattern matching callback feature of Dash allows us to bind dynamically created UI elements to callbacks by using a pattern predicate.

Dash supports this scenario by allowing us to bind callbacks to data sources dynamically via a pattern matching mechanism.

The following code implements the above UI:

from dash import Dash, dcc, html, Input, Output, ALL, Patch

app = Dash(__name__)

app.layout = html.Div(
    [
        html.Button("Add Filter", id="add-filter-btn", n_clicks=0),
        html.Div(id="dropdown-container-div", children=[]),
        html.Div(id="dropdown-container-output-div"),
    ]
)


@app.callback(
    Output("dropdown-container-div", "children"), Input("add-filter-btn", "n_clicks")
)
def display_dropdowns(n_clicks):
    patched_children = Patch()
    new_dropdown = dcc.Dropdown(
        ["NYC", "MTL", "LA", "TOKYO"],
        id={"type": "city-filter-dropdown", "index": n_clicks},
    )
    patched_children.append(new_dropdown)
    return patched_children


@app.callback(
    Output("dropdown-container-output-div", "children"),
    Input({"type": "city-filter-dropdown", "index": ALL}, "value"),
)
def display_output(values):
    return html.Div(
        [html.Div(f"Dropdown {i + 1} = {value}") for (i, value) in enumerate(values)]
    )


if __name__ == "__main__":
    app.run_server(debug=True)

Instead of defining the DropDown components statically, we create a dropdown-container-div which serves as a container for all the DropDown components that the user will create. When we create the DropDown components in display_dropdowns, each new DropDown component is created with an id. Normally this id value would be a string, but in order to enable pattern matching callbacks, Dash also allows the id to be a dictionary. This could be an arbitrary dictionary, so the specific keys in the above example are not special values. Having a dictionary id allows us to define very fine-grained patterns to be matched over each key of the dictionary.

In the above example, as the user adds new DropDown components, the dynamically created DropDown components are tagged in sequence with ids that look like this:

  1. {"type": "city-filter-dropdown", "index": 1}
  2. {"type": "city-filter-dropdown", "index": 2}
  3. {"type": "city-filter-dropdown", "index": 3}

Then, in the metadata for the display_output callback, we define its input as Input({"type": "city-filter-dropdown", "index": ALL}, "value"), which then matches all components whose id has type equal to city-filter-dropdown. Specifying "index": ALL means that we match any index value.

In addition to ALL, Dash also supports additional pattern matching criteria such as MATCH and ALLSMALLER. To learn more about this feature, visit Pattern Matching Callbacks.

Examples

Here are some examples of apps built with Dash:

Figure 6 - Dash application for object detection.
Description - Figure 6: Object detection

This is an example of a Plotly Dash application that is used for Object Detection. It visualizes the bounding boxes of the detected objects in a scene.

Figure 7 - Dash built dashboard for wind data.
Description - Figure 7: Dashboard

This is an example of a Plotly Dash dashboard application. It visualizes wind speed and direction data.

Figure 8 - Dash application for visualizing Uber rides in New York City.
Description - Figure 8: Uber Rides

This is an example of a Plotly Dash dashboard application. It visualizes the temporal and spatial distribution of Uber rides in Manhattan.

Figure 9 - Dash dashboard for US opioid data.
Description - Figure 9: Opioid map

This is an example of a Plotly Dash dashboard application. It visualizes the spatial distribution of opioid deaths in the US at the county level.

Figure 10 - Dash UI for visualizing point clouds.
Description - Figure 10: Point Cloud

This is an example of a 3D visualization application developed using Plotly Dash. It visualizes 3D point cloud data collected by a LIDAR sensor from the perspective of a car.

Figure 11 - Dash UI with component for visualizing 3D meshes.
Description - Figure 11: 3D Mesh

This is an example of a 3D mesh visualization application developed using Plotly Dash. It visualizes the reconstruction of the brain using MRI data.

For more examples, visit the Dash Enterprise App Gallery.

Conclusion

Good UI has the potential to add value to projects by making the project deliverables more presentable and usable. For production systems that will be used for a long time, the upfront investment in UI can pay dividends over time through a gentler learning curve, reduced user confusion, and improved user productivity. Plotly Dash significantly lowers the cost of UI development for data apps, which in turn increases the return on that investment.


Reference

  1. Plotly: Low-Code Data App Development
  2. Background callbacks: Plotly - Background Callbacks
  3. Custom Components: Plotly - Build Your Own Components
  4. Pattern matching callbacks: Plotly - Pattern-Matching Callbacks

Self-Supervised Learning in Computer Vision: Image Classification

By: Johan Fernandes, Statistics Canada

Introduction

Computer Vision (CV) comprises tasks such as Image Classification, Object Detection and Image Segmentation (Footnote 1). Image Classification involves assigning an entire image to one of several finite classes. For example, if a "Dog" occupies 90% of an image, the image is labeled as a "Dog". Multiple Deep Learning (DL) models using Neural Networks (NN) have been developed to classify images with high accuracy. The state-of-the-art models for this task utilize NNs of various depths and widths.

These DL models are trained on multiple images of various classes to develop their classification capabilities. Like training a human child to distinguish between images of a "Car" and a "Bike", these models need to be shown multiple images of classes such as "Car" and "Bike" to generate this knowledge. However, humans have the additional advantage of developing context through observing our surroundings. Our minds can pick up sensory signals (audio and visual) that help us develop this knowledge for all types of objects (Footnote 2). For instance, when we observe a car on the road, our minds can generate contextual knowledge about the object (car) through visual features such as location, color, shape, the lighting surrounding the object, and the shadow it creates.

A DL model for CV, by contrast, must be trained to develop such knowledge, which is stored in the weights and biases of its architecture. These weights and biases are updated by training the model. The most popular training process, called Supervised Learning, involves training the model with each image and its corresponding label to improve its classification capability. However, generating labels for all images is time consuming and costly, as it involves human annotators manually labelling each image. Self-Supervised Learning (SSL), on the other hand, is a newer training paradigm that can be used to train DL models to classify images without the bottleneck of having well-defined labels for each image during training. In this work, I will describe the current state of SSL and its impact on image classification.

Significance of Self-Supervised Learning (SSL)

SSL aims to set up an environment to train the DL model to extract the maximum number of features or signals from the image. Recent studies have shown that the feature extraction capability of DL models is restricted when they are trained with labels, as they must pick the signals that help them develop a pattern to associate similar images with that label (Footnote 2, Footnote 3). With SSL, the model is trained to understand the sensory signals (e.g., shape and outline of objects) from the input images without being shown the associated labels.

Additionally, since SSL does not limit the model to develop a discrete representation (label) of an image, it can learn to extract much richer features from an image than its supervised counterpart. It has more freedom to improve how it represents an image, as it no longer needs to be trained to associate a label with an imageFootnote 3. Instead, the model can focus on developing a representation of the images through the enhanced features it extracts and identifying a pattern so that images from the same class can be grouped together.

SSL uses more feedback signals to improve its knowledge of an image than supervised learning (Footnote 2). As a result, the term self-supervised is increasingly used in place of unsupervised learning: an argument can be made that DL models receive input signals from the data rather than from labels, but they still have some form of supervision and are not completely unsupervised during training. In the next section I will describe the components needed for self-supervised learning.

These signals are enhanced through a technique known as data augmentation, in which the image is cropped, certain sections of the image are hidden, or the color scheme of the image is modified. With each augmentation, the DL model receives a different image of the same class or category as the original image. By exposing the model to such augmented images, it can be trained to extract rich features based on the visible sections of the image (Footnote 4). Furthermore, this training method removes the overhead of generating labels for all images, opening up the possibility of applying image classification in fields where labels are not readily available.

Components of self-supervised learning methods: 

Encoder / Feature Extractor:

As humans, when we look at an image, we can automatically identify features such as the outline and colour of objects to determine the type of object in the image. For a machine to perform such a task, we utilize a DL model, which we refer to as an encoder or a feature extractor since it can automatically encode and extract features of an image. The encoder consists of sequentially ordered NN layers, as shown in Fig 1.

Figure 1: Components of a Deep Learning encoder / feature extractor

The image describes the structure of an encoder or feature extractor, along with an example of the input it receives and the output it provides. The input to the encoder is an image, shown here as an image of a dog, and the output is a vector that represents that image in a higher-dimensional space. The encoder consists of multiple neural network layers stacked in sequence, as shown in this image. Each layer consists of multiple convolutional neurons. These layers pick out essential features that help the encoder represent the image as a vector, which is the final output of the encoder. The resulting vector has n dimensions, where each dimension is reserved for a feature. This vector can be projected into n-dimensional space and can be used for clustering vectors of the same class, such as a dog or a cat.

An image contains multiple features. The encoder's job is to extract only the essential features, ignore the noise, and convert these features into a vector representation. This encoded representation of the image can be projected into n-dimensional or latent space, depending on the size of the vector. As a result, for each image, the encoder generates a vector to represent the image in that latent space. The underlying principle is to ensure that vectors of images from the same class can be grouped together in that latent space. Consequently, vectors of "Cats" will be clustered together while vectors of "Dogs" will form a separate group, with both groups of vectors distinctly separated from each other.

The encoders are trained to improve their representation of images so that they can encode richer features of the images into vectors that help distinguish these vectors in latent space. The vectors generated by encoders can be used to address multiple CV tasks, such as image classification and object detection. The NN layers in the encoder have traditionally been convolutional neural network (CNN) layers, as shown in Fig 1; however, the latest DL models utilize Attention Network (AN) layers in their architecture. These encoders are called Transformers, and recent works have begun to use them for image classification due to the impact they have had in the field of natural language processing. The vectors can be fed to classification models, which can be a series of NN layers or a simple non-parametric model such as a K-Nearest Neighbor (KNN) classifier. Current literature on self-supervised learning often uses KNN classifiers to evaluate the learned representations, since they require little more than the number of neighbours as an argument and no additional end-to-end training.
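
As a small illustration of the encoder-to-vector step, here is a sketch assuming torchvision 0.13 or later, in which a pretrained ResNet-50 with its classification head removed stands in for an SSL-trained encoder:

import torch
from torchvision import models, transforms

# A ResNet-50 backbone with the final classification layer replaced by an identity
# mapping acts as an encoder that outputs 2048-dimensional vectors.
encoder = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
encoder.fc = torch.nn.Identity()
encoder.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(pil_images):
    """Encode a list of PIL images into vectors in latent space."""
    batch = torch.stack([preprocess(img) for img in pil_images])
    return encoder(batch)

# The resulting vectors can then be grouped or classified downstream,
# for example with a clustering model or a KNN classifier from scikit-learn.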

Data Augmentation:

Labels of images are not provided to encoders trained in a self-supervised format. Consequently, the representation capability of encoders has to be improved solely from the images they receive. As humans, we can look at objects from different angles and perspectives to understand the shape and outline of objects. Similarly, augmented images assist encoders by providing different perspectives of the original training images. These image perspectives can be developed by applying strategies such as Resized Crop and Color Jitter to the image, as shown in Fig 2. Augmented images enhance the encoder's ability to extract rich features from an image by learning from one section or patch of the image and applying that knowledge to predict other sections of the image (Footnote 4).

Figure 2: Augmentation strategies that can be used to train encoders in a self-supervised format. These augmentation strategies are randomly applied to the image when the encoders are trained.

The image contains four ways to represent an image for SSL training. An image of a Corgi dog is used as a sample in this case. The first way is the original image by itself, with no additional filters applied. The second way is to horizontally flip the image; the Corgi dog, which was originally looking to its left, is now looking to its right. The third way is to resize the image and crop a section that contains the object of interest. In this case the Corgi dog is in the center of the image, so a crop of the dog's head and part of its body is used as an augmented image. The last way is to change the color scale of the image through color jitter augmentation; the dog, which was golden in the original image, appears blue under this augmentation strategy.
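
A typical augmentation pipeline of this kind can be sketched with torchvision (the specific parameter values below are illustrative rather than taken from any particular SSL method):

from torchvision import transforms

ssl_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),   # resized crop of a random region
    transforms.RandomHorizontalFlip(p=0.5),                # random horizontal flip
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.4, hue=0.1),       # color jitter
    transforms.ToTensor(),
])

# Applying the pipeline twice to the same PIL image produces two different
# "views" of that image: view_1, view_2 = ssl_augment(image), ssl_augment(image)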

Siamese Network architecture:

Many self-supervised learning methods use the Siamese Network architecture to train encoders. As shown in Fig 3, a Siamese Network consists of two encoders that can share the same architecture (for example, ResNet-50 for both encoders) (Footnote 3). Both encoders receive batches of images during training (training batches). From each batch, both encoders receive an image, but with different augmentation strategies applied to the images they receive. As shown in Fig 3, we consider the two encoders E1 and E2. In this network, an image (x) is augmented by two different strategies to generate x1 and x2, which are fed to E1 and E2, respectively. Each encoder then provides a vector representation of the image, which can be used to measure similarity and calculate loss.

During the training phase, the weights between the two encoders are updated through a process known as knowledge distillation. This involves a student-teacher training format. The student encoder is trained in an online format where it undergoes forward and backward propagation, while the weights of the teacher encoder are updated at regular intervals using stable weights from the student with techniques such as Exponential Moving Average (EMA) (Footnote 3).
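
A minimal sketch of the EMA-style teacher update is shown below; the momentum value and the tiny linear "encoders" are placeholders for real networks:

import copy
import torch

def ema_update(student: torch.nn.Module, teacher: torch.nn.Module, momentum: float = 0.996):
    """Move each teacher weight toward the corresponding student weight."""
    with torch.no_grad():
        for s_param, t_param in zip(student.parameters(), teacher.parameters()):
            t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

student = torch.nn.Linear(128, 64)     # stand-in for the student encoder
teacher = copy.deepcopy(student)       # teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)            # the teacher never receives gradients

# ... after each forward/backward pass and optimizer step on the student:
ema_update(student, teacher)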

Figure 3: A Siamese Network consisting of two encoders trained in parallel to generate representations of images, ensuring that representations of images from the same class are similar to each other.

The image describes the layout of a Siamese network, which is a popular technique for training self-supervised encoders. The Siamese network consists of two encoders with the same neural network architecture, trained in parallel. The image shows that an image of a Corgi dog is sent to both encoders. One encoder behaves as a student, called E1, while the other behaves as a teacher, called E2. E1 receives an image of the Corgi dog with resize crop and horizontal flip augmentation. E2 receives an image of the same Corgi dog with resize crop and color jitter augmentation. The image also shows that both encoders share their knowledge through weights at regular intervals during the training phase. Both encoders provide vector representations as their final output. A similarity score is calculated to measure whether E1 was able to learn from the stable weights of E2 and improve its representational knowledge.

Contrastive vs Non-contrastive SSL methods:

All available SSL methods utilize these components, with some additional changes that improve performance. These learning methods can be grouped into two categories: contrastive and non-contrastive.

Figure 4: Positive and negative pair of image patches.

The image shows how to create positive and negative pairs of images. The image is split into two parts. In the first part there are two different images of Corgi dogs. Resize crop augmentation is used to extract the important sections, such as the face and body of the dogs, and to create two new images. The new augmented cropped images from both Corgi dog images can now be considered a positive pair, as shown in the image. In the second part an example of a negative pair of images is shown. Unlike the first part, there is one original image of a Corgi dog and another of a cat. After resize crop augmentation is performed on these images, we get two new images of the originals: one has the cat's face while the other has the Corgi dog's face. These new images are considered a negative pair of images.

Contrastive learning methods

These methods require positive and negative pairs of each image to train and improve the representation capability of encoders. They utilize contrastive loss to train the encoders in a Siamese network with knowledge distillation. As shown in Fig 4, a positive pair would be an augmented image or patch from the same class as the original image. A negative pair would be an image or patch from another image that belongs to a different class. The underlying function of all contrastive learning methods is to help encoders generate vectors so that vectors of positive pairs are closer to each other, while those of negative pairs are further away from each other in latent space.

Many popular methods, such as SimCLR (Footnote 4) and MoCo (Footnote 5), are based on this principle and work efficiently on large natural object datasets like ImageNet. Positive and negative pairs of images are provided in each training batch to prevent the encoders from collapsing into a state where they produce vectors of only a single class. However, to train the encoders with negative pairs of images, these methods rely on large batch sizes (upwards of 4096 images in a training batch). Furthermore, many datasets, unlike ImageNet, do not have multiple images per class, making the generation of negative pairs in each batch a difficult, if not impossible, task. Consequently, recent research is leaning towards non-contrastive methods.
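
The core idea can be sketched with a simplified InfoNCE-style loss (this is a pared-down variant for illustration; methods such as SimCLR also treat other augmented views within the batch as negatives and use much larger batches):

import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1[i] and z2[i] are embeddings of two augmentations of the same image (a positive
    pair); every other image in the batch serves as a negative for image i."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature        # pairwise cosine similarities
    targets = torch.arange(z1.size(0))      # the matching index is the positive pair
    return F.cross_entropy(logits, targets)

# Example: a batch of 8 images with 128-dimensional embeddings
loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128))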

Non-Contrastive learning methods

Methods such as DINO (Footnote 3), BYOL (Footnote 6) and BarlowTwins (Footnote 7) train encoders in a self-supervised format without the need to distinguish images as positive and negative pairs in their training batches. Methods like DINO continue to use the Siamese Network in a student-teacher format and rely on heavy data augmentation. However, they improve on contrastive methods with a few enhancements:

  • Patches of images provide a local view of the image to the student encoder and a global view of the image to the teacher encoder (Footnote 3).
  • A prediction layer is added to the student encoder to generate a probability-based output (Footnote 3). This layer is only used during training.
  • Instead of calculating a contrastive loss between pairs of images, the output from the encoders is used to calculate a classification-type loss, such as cross-entropy or L2 loss, to determine whether the output vectors from the student and teacher encoders are similar (Footnote 3, Footnote 6, Footnote 7, Footnote 8).
  • EMA or another moving-average method is employed to update the teacher network's weights from the online weights of the student network, while avoiding backpropagation on the teacher network (Footnote 3, Footnote 6, Footnote 7, Footnote 8).

Unlike contrastive methods, these methods do not require a large batch size for training and do not need additional overhead to ensure negative pairs in each training batch. Additionally, DL models such as the Vision Transformer, which can learn from a local view of an image and predict other similar local views while also considering the global view, have replaced conventional CNN encoders. These models have pushed non-contrastive methods to surpass the image classification accuracies of supervised learning techniques.

Conclusion

Self-supervised learning is a training process that can help DL models train more efficiently than popular supervised learning methods without the use of labels. This efficiency is evident in the higher accuracy that DL models have achieved on popular datasets such as ImageNet when trained in a self-supervised setup compared to a supervised setup. Furthermore, self-supervised learning eliminates the need for labeling images before training, providing an additional advantage. The future looks bright for solutions that adopt this type of learning for image classification tasks as more research is being conducted on its applications in fields that do not involve natural objects, such as medical and document images.



Infosheet - By the numbers: Asian Heritage Month 2023

By the numbers: Asian Heritage Month 2023 (PDF, 15.93 MB)
Description: By the numbers: Asian Heritage Month 2023

Population

According to the 2021 Census of Population, 7,013,835 people in Canada have Asian origins, representing 19.3% of Canada’s population.

Mother tongue

In 2021, the top four most reported mother tongues, after English and French, were Punjabi (Panjabi), Mandarin, Arabic and Yue (Cantonese).

Employment

In 2021, the highest employment rates among Asian Canadians were seen among Filipino people (70.1%), followed by South Asian people (62.3%) and Southeast Asian people (56.7%).

Immigration

For decades, Asia (including the Middle East) has accounted for the largest share of recent immigrants in Canada. This proportion has grown, with Asian-born immigrants making up a record 62.0% of recent immigrants admitted from 2016 to 2021.

Source: Statistics Canada, Census of Population, 2021.