Developments in Machine Learning Series: Issue two

By: Nicholas Denis, Statistics Canada

Editor's Note: This series showcases new and interesting research developments in machine learning (ML) from around the world. Hopefully you can find something that will help you in your own or your colleagues' work.

This month's topics:

Generating realistic images from user-text input

Figure 1: Four images generated from user-text input.

Four images side by side as examples of realistic researcher-generated images.

  • The first image is of a black cat touching round pegs on a checker board. The caption reads "a surrealist dream-like oil painting by Salvador Dali of a cat playing checkers".
  • The second image is of a sun setting behind desert canyons. Caption: "a professional photo of a sunset behind the grand canyon".
  • The third image is a portrait painting of a green and blue hamster with dragon wings on a red background. Caption: "a high-quality oil painting of a psychedelic hamster dragon".
  • The fourth image is a drawing of Albert Einstein in a Superman costume. Caption: "an illustration of Albert Einstein wearing a superhero costume".

Researchers generate photorealistic images from user-text input using a 3.5-billion-parameter model.

What's new?: OpenAI has leveraged the recent success of their popular CLIP (Contrastive Language-Image Pre-training) model to train a Gaussian diffusion model that generates realistic and nuanced images conditioned solely on a text input describing the image to be generated. The model, GLIDE (Guided Language to Image Diffusion for Generation and Editing), can be accessed via Google Colab.

How it works: Given a text input and an initial sample of pure Gaussian noise, xT, the model sequentially de-noises the sample: at each step t, it produces xt-1 conditioned on the text input and the current, noisier sample xt. The final de-noised sample, x0, is the generated image, which attempts to capture the semantics of the user-provided text input.
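
To illustrate the de-noising loop, here is a minimal sketch of plain (unguided) ancestral sampling for a text-conditioned diffusion model. It is not GLIDE's exact procedure (GLIDE additionally uses guidance), and eps_model is a hypothetical network that predicts the noise present in a sample at a given step.

import torch

def sample(eps_model, text_emb, shape, alphas):
    # alphas is the per-step noise schedule; alpha_bars are their cumulative products.
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                          # start from pure Gaussian noise (x_T)
    for t in reversed(range(len(alphas))):
        eps = eps_model(x, t, text_emb)             # predicted noise at this step
        # Estimate the de-noised mean by removing the predicted noise and rescaling.
        mean = (x - (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            sigma = torch.sqrt(1 - alphas[t])       # a common (simplified) choice of step variance
            x = mean + sigma * torch.randn_like(x)  # add fresh noise except at the final step
        else:
            x = mean                                # x_0, the final generated image
    return x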

Why does it work?: Gaussian diffusion models define a noise-adding process (see Denoising Diffusion Implicit Models) that begins with an image and produces a Markov chain of increasingly noisy images, where the image at time t is sampled from

q(xt | xt-1) = N(xt; √αt · xt-1, (1 − αt) I)

which is to say, the distribution of the next image, conditioned on the previous image, is a Gaussian, where αt is a noise parameter. The end result is an image of pure noise. Under mild conditions, the posterior q(xt-1 | xt) is well defined and can be approximated with deep neural networks. Briefly,

  • the reverse process represents a way of sequentially removing noise from an image to arrive at a natural, photorealistic image,
  • by starting with a naturally occurring image and adding Gaussian noise to it, the authors are able to train a model to estimate the noise that was added (sketched after this list), and
  • the authors use other techniques and tricks from guided diffusion, steering generation with text semantics from a CLIP model.
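
To make these steps concrete, here is a minimal sketch, not the authors' code, of the forward noising step and the noise-prediction training objective described above. It uses the closed-form expression for q(xt | x0) obtained by composing the per-step Gaussians, and eps_model is again a hypothetical noise-prediction network conditioned on a text embedding.

import torch

def forward_noise(x0, t, alpha_bars):
    # Sample x_t ~ q(x_t | x_0) by adding Gaussian noise to a clean image x0.
    # alpha_bars[t] is the cumulative product of the per-step alphas up to step t.
    eps = torch.randn_like(x0)
    xt = torch.sqrt(alpha_bars[t]) * x0 + torch.sqrt(1 - alpha_bars[t]) * eps
    return xt, eps

def training_loss(eps_model, x0, text_emb, alpha_bars):
    # Train the model to estimate the noise that was added (a simple mean-squared error).
    t = torch.randint(0, len(alpha_bars), (1,)).item()    # random diffusion step
    xt, eps = forward_noise(x0, t, alpha_bars)
    eps_pred = eps_model(xt, t, text_emb)                  # predicted noise, conditioned on the text
    return torch.mean((eps - eps_pred) ** 2)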

Results: Quantitative evaluation of generative models is difficult and remains an open problem; however, the paper does report some state-of-the-art zero-shot performance metrics. Qualitatively, the model is capable of producing incredibly nuanced and specific images such as "a crayon drawing of a space elevator" and "a stained glass window of a panda eating bamboo". Moreover, the authors had human evaluators compare images generated by GLIDE against those of other state-of-the-art generative models. Humans judged images produced by GLIDE to be more photorealistic between 66% and 91% of the time.

But…: In publications on generative models it is ubiquitous for authors to cherry-pick the generated samples they present; however, it is also quite common for authors to include a reasonably large sample of randomly selected outputs as well. This paper could have shared a much larger gallery of randomly selected images. Also, the model has 3.5 billion parameters and requires a significant amount of time (20 seconds) to generate a single image, making this approach unlikely to scale.

Our opinion: Generative models are becoming increasingly powerful, producing high quality and seemingly authentic images that fool humans – and the quality will only increase over time. See if you can tell which face is real. As techniques such as GLIDE take specific input and direction from humans to produce high quality images (and soon videos), legal, ethical and evidential issues will need to be addressed immediately.

Move over principal component analysis, make way for learned dimensionality reduction

Figure 2: Higher and lower-dimensional representation

Given a set of feature vectors in a generic input space, we use nearest neighbours to define a set of feature pairs whose proximity we want to preserve. We then learn a dimensionality-reduction function (the encoder) by encouraging neighbours in the input space to have similar representations. We learn it jointly with auxiliary projectors that produce high dimensional representations, where we compute the Barlow Twins loss over the (d' × d') cross-correlation matrix averaged over the batch. (source: Twin Learning for Dimensionality Reduction)

TLDR (Twin Learning for Dimensionality Reduction) beats principal component analysis (PCA) for small to mid-size outputs (8 to 128 dimensions).

What's new?: Naver Labs released TLDR, a general technique that trains a linear dimensionality-reduction encoder to keep nearest neighbours in the input space close together in the smaller embedding space.

How it works:

  • During training, a given data-instance in some high-dimensional input space is sampled and its k-nearest neighbours are computed.
  • A linear embedding matrix maps the data instance and its neighbours to the lower-dimensional space.
  • The lower-dimensional embeddings are projected (via a projector network) to a much higher dimensional space where their cross-correlation matrix is computed.
  • They used the recently-proposed Barlow Twins loss, which encourages the cross correlation matrix to be the identity matrix.
  • After training, the projector network is discarded and the linear encoder is used for dimensionality reduction (a minimal sketch follows this list).
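
The following is a minimal sketch of a TLDR-style training step; it is not Naver Labs' implementation, and the dimensions, batch size and hyper-parameters are illustrative only. It builds nearest-neighbour pairs, passes them through a linear encoder and a small projector, and applies a Barlow Twins-style loss that pushes the cross-correlation matrix of the projected pairs toward the identity.

import torch
import torch.nn as nn

def knn_pairs(X, k=3):
    # Return index pairs (i, j) where j is one of the k nearest neighbours of i in input space.
    d = torch.cdist(X, X)                          # pairwise Euclidean distances
    d.fill_diagonal_(float("inf"))                 # exclude self-matches
    nbrs = d.topk(k, largest=False).indices        # k nearest neighbours per point
    i = torch.arange(len(X)).repeat_interleave(k)
    return i, nbrs.reshape(-1)

def barlow_twins_loss(za, zb, lam=5e-3):
    # Cross-correlation of the two standardized batches, pushed toward the identity matrix.
    za = (za - za.mean(0)) / (za.std(0) + 1e-6)
    zb = (zb - zb.mean(0)) / (zb.std(0) + 1e-6)
    c = (za.T @ zb) / za.shape[0]                              # (d' x d') cross-correlation matrix
    on_diag = ((torch.diagonal(c) - 1) ** 2).sum()             # diagonal terms should equal 1
    off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum() # off-diagonal terms should be 0
    return on_diag + lam * off_diag

# Illustrative dimensions: 2048-d inputs reduced to 64, projected to 1024 for the loss.
X = torch.randn(512, 2048)                         # a batch of high-dimensional feature vectors
encoder = nn.Linear(2048, 64)                      # the linear dimensionality-reduction encoder
projector = nn.Sequential(nn.Linear(64, 1024), nn.ReLU(), nn.Linear(1024, 1024))
opt = torch.optim.Adam(list(encoder.parameters()) + list(projector.parameters()), lr=1e-3)

i, j = knn_pairs(X)                                # neighbour pairs whose proximity we preserve
za, zb = projector(encoder(X[i])), projector(encoder(X[j]))
loss = barlow_twins_loss(za, zb)
opt.zero_grad()
loss.backward()
opt.step()
# After training, the projector is discarded and encoder(x) is the reduced representation.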

Why it works:

  • The Barlow Twins loss is minimized when the cross-correlation matrix between the paired representations is the identity matrix. This amounts to a factorized representation in which the dimensions of the embedding are decorrelated.
  • By minimizing the cross-correlation across dimensions, the amount of redundant information encoded is minimized, which is a desirable property for dimensionality reduction (see the toy example after this list).
  • By computing this loss on pairs of inputs that are close neighbours in the input space, the linear embedding function learns something akin to manifold learning, where points that are nearby in the input space remain close in the lower-dimensional embedding space.
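
As a toy illustration of the decorrelation argument (our example, not the paper's), compare the cross-correlation matrix of a representation with a duplicated, redundant dimension against one whose dimensions are decorrelated: the redundancy shows up as off-diagonal mass, which is exactly what the Barlow Twins loss penalizes. For simplicity the same batch is used for both sides of the correlation, whereas TLDR correlates a point with its neighbour.

import numpy as np

def cross_correlation(za, zb):
    # Standardize each dimension across the batch, then compute the (d x d) cross-correlation.
    za = (za - za.mean(0)) / za.std(0)
    zb = (zb - zb.mean(0)) / zb.std(0)
    return za.T @ zb / len(za)

rng = np.random.default_rng(0)
z = rng.normal(size=(10000, 3))                              # three decorrelated dimensions
redundant = np.column_stack([z[:, 0], z[:, 0], z[:, 1]])     # dimension 0 is duplicated

print(np.round(cross_correlation(z, z), 2))                  # close to the identity matrix
print(np.round(cross_correlation(redundant, redundant), 2))  # off-diagonal entry near 1 for the duplicate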

Results: The authors focus on retrieval tasks. Given an input image or text document, the retrieval task aims to find the most similar instance(s) to the input within a given dataset. Note that for images, TLDR was applied to outputs produced from a pre-trained vision model, and for text, TLDR was applied to outputs produced from a pre-trained BERT language model.

  • On image retrieval datasets, TLDR improved over PCA in terms of mean average precision by 6 to 10%, across different output dimensionalities.
  • On text retrieval tasks, TLDR improved over PCA in recall by as much as 27%, and the improvement grew dramatically as the size of the output dimension decreased.
  • Compared to other leading dimensionality reduction techniques, including manifold-based techniques, TLDR consistently outperformed all other approaches at output dimension sizes of eight and higher; however, it underperformed at dimension sizes of two and four.
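
For context, retrieval with a reduced representation is straightforward. The sketch below is our illustration rather than the authors' evaluation code: encode is any fitted dimensionality-reduction function (a trained TLDR encoder, PCA, or here a random projection standing in for one), and database items are ranked by cosine similarity in the reduced space.

import numpy as np

def retrieve(encode, queries, database, k=5):
    # Rank database items for each query by cosine similarity in the reduced space.
    q = encode(queries)
    d = encode(database)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)   # unit-normalize so a dot product is a cosine
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sims = q @ d.T                                     # (n_queries x n_database) similarity matrix
    return np.argsort(-sims, axis=1)[:, :k]            # indices of the top-k most similar items

# Toy usage with a random linear projection standing in for a trained encoder.
rng = np.random.default_rng(0)
W = rng.normal(size=(2048, 64))
top5 = retrieve(lambda x: x @ W, rng.normal(size=(10, 2048)), rng.normal(size=(1000, 2048)))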

Our opinion: Dimensionality reduction techniques are typically discussed for tabular datasets and the use of classical machine learning techniques, but often fail to be useful for extremely high-dimensional data, such as images and text. TLDR is a linear dimensionality reduction technique that can be applied to tabular or complex data. This technique could be useful for:

  • retrieval tasks,
  • label selection strategies in active learning,
  • cluster and data exploration, and
  • explainable machine learning models.

Help! Machines are learning to code!

Google DeepMind introduces AlphaCode, which performs above the 50th percentile in coding competitions.

What's new: DeepMind built a new competitive-programming dataset and used it to train a 41-billion-parameter model that can take a natural-language description of a coding challenge (see Figure 3 below) and produce functional code that solves the challenge. This is bananas!

How it works: Coding competitions are quite common. DeepMind built multiple datasets to iteratively train a deep neural network that takes a natural-language description of a programming challenge, such as the one in Figure 3 below, and produces code as output, character by character. They did this by first sampling a large number of potential solutions, then applying a filtering and clustering process to remove weak candidates, and finally submitting 10 diverse solutions to a real-world competition. Below, we break down some of the relevant steps:

  • They built a pre-training dataset based on a public snapshot of GitHub repositories, using code from 12 popular programming languages and resulting in 715 GB of data.
  • They use a 41-billion-parameter transformer model with an encoder-decoder architecture which, given a sequence of text, outputs a sequence of text.
  • Pre-training the model involves predicting the next token (a character or word of code) conditioned on the input and the output produced so far.
  • A standard masked language model loss (as in BERT) was used, where given a sequence of inputs (natural language text description of the problem), 15% of the text is randomly erased and the model must infer what the missing text is.
  • The pre-trained model is fine-tuned on their competitive programming dataset. Similar training objectives to the pre-training are used here.
  • Their model is able to sample millions of possible solutions for each input challenge.
  • Reinforcement learning was used to increase the diversity of the solutions sampled from the model.
  • Since only 10 solutions can be submitted per challenge, the authors filter their millions of candidate solutions using the example tests provided with each challenge, then cluster the surviving solutions based on program behaviour. One solution from each of the 10 largest clusters was sampled, and these were submitted as the answers to the challenge (a minimal sketch of this step follows this list).
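
Here is a minimal sketch of that filter-and-cluster step, our illustration of the idea rather than DeepMind's code. Candidate programs are represented as hypothetical callables mapping an input string to an output string; they are filtered with the example tests that ship with the problem, grouped by their behaviour on extra inputs, and one candidate is drawn from each of the largest clusters.

from collections import defaultdict

def run(candidate, inp):
    # Run a candidate program; treat any exception as an empty (failing) output.
    try:
        return candidate(inp)
    except Exception:
        return ""

def select_submissions(candidates, example_tests, extra_inputs, max_submissions=10):
    # candidates: callables mapping an input string to an output string (hypothetical representation).
    # example_tests: (input, expected_output) pairs provided with the problem statement.
    # extra_inputs: additional inputs used only to group candidates by behaviour.

    # 1. Filter: keep only candidates that pass every example test.
    passing = [c for c in candidates
               if all(run(c, inp) == out for inp, out in example_tests)]

    # 2. Cluster: group the survivors by their outputs on the extra inputs.
    clusters = defaultdict(list)
    for c in passing:
        behaviour = tuple(run(c, inp) for inp in extra_inputs)
        clusters[behaviour].append(c)

    # 3. Submit one candidate from each of the largest clusters.
    largest = sorted(clusters.values(), key=len, reverse=True)[:max_submissions]
    return [cluster[0] for cluster in largest]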

Results: The model was applied to 10 separate coding challenges.

  • The authors found that when they sample one million possible solutions and submit the (estimated) best 10, the code successfully solves the coding challenge over 30% of the time. The success rate scales positively with both the number of sampled solutions and the model size. Performance also scales roughly linearly with training time, with no plateau observed, suggesting that training the model for longer would improve it further.
  • The authors estimate that the model, on average, performed around the 54th percentile on the coding problems, ranking just above the median programmer.

What's the context?: Though this task is incredibly difficult and it is astounding that a model can solve even a single problem, these results should not come as a surprise. GitHub's Copilot can automatically suggest lines or even full functions of code. Copilot is powered by a model from OpenAI, which does plenty of work similar to this paper. Anyone can use this service today.

Just as some fear that data science and machine learning may automate away other jobs, automatically generated code and AutoML may leave data scientists themselves fearing that they will be automated away!

Our opinion: It seems like just yesterday that the world marvelled at the ability of ML models to recognize objects such as cats and dogs in a given image. Today, we are witnessing the continued advancement of ML models that can extract and abstract incredibly complex semantic content from long blocks of text describing complicated tasks, and then output long, precisely structured, functioning code that solves the given task.

Though the results are promising and still leave room for improvement, this field is in its infancy, and you can expect it to improve rapidly. This raises ethical concerns on many different levels. On one level, what is the responsibility of a developer who uses this technology to write code that they would be unable to write themselves? Would such a developer be able to debug the code, or verify that it does, in fact, do what it is intended to do?

Figure 3: Code automatically generated from a natural-language problem description

Backspace
You are given two strings s and t, both consisting of lowercase English letters. You are going to type the string s character by character, from the first character to the last one.
When typing a character, instead of pressing the button corresponding to it, you can press the "Backspace" button. It deletes the last character you have typed among those that aren't deleted yet (or does nothing if there are no characters in the current string). For example, if s is "abcbd" and you press Backspace instead of typing the first and the fourth characters, you will get the string "bd" (the first press of Backspace deletes no character, and the second press deletes the character 'c'). Another example: if s is "abcaa" and you press Backspace instead of the last two letters, then the resulting text is "a".
Your task is to determine whether you can obtain the string t, if you type the string s and press "Backspace" instead of typing several (maybe zero) characters of s.

Input
The first line contains a single integer q (1 ≤ q ≤ 10^5), the number of test cases. The first line of each test case contains the string s (1 ≤ |s| ≤ 10^5). Each character of s is a lowercase English letter.
The second line of each test case contains the string t (1 ≤ |t| ≤ 10^5). Each character of t is a lowercase English letter.
It is guaranteed that the total number of characters in the strings over all test cases does not exceed 2·10^5.

Output
For each test case, print "YES" if you can obtain the string t by typing the string s and replacing some characters with presses of "Backspace" button, or "NO" if you cannot.
You may print each letter in any case (YES, yes, Yes will all be recognized as positive answer, NO, no, nO will all be recognized as negative answer).


Generated Python solution:

t=int(input())
for i in range(t):
    s=input()
    t=input()
    a=[]
    b=[]
    for j in s:
        a.append(j)
    for j in t:
        b.append(j)
    a.reverse()
    b.reverse()
    c=[]
    while len(b)!=0 and len(a)!=0:
        if a[0]==b[0]:
            c.append(b.pop(0))
            a.pop(0)
        elif a[0]!=b[0] and len(a)!=1:
            a.pop(0)
            a.pop(0)
        elif a[0]!=b[0] and len(a)==1:
            a.pop(0)
    if len(b)==0:
        print("YES")
    else:
        print("NO")
  