Contextual stories about Learners

From Algolit

Revision as of 16:20, 1 March 2019 by Cristina

Naive Bayes & Viagra

Naive Bayes is a famous learner that performs well with little data. We apply it all the time. Christian & Griffiths state in their book 'Algorithms to Live By' that 'our days are full of small data'. Imagine, for example, that you're standing at a bus stop in a foreign city. The other person standing there has been waiting for 7 minutes. What do you do? Do you decide to wait? And if so, for how long? When will you start considering other options? Another example: imagine a friend asking for advice on a relationship. He's been together with his new partner for 1 month. Should he invite the partner to join him at a family wedding?

Pre-existing beliefs are crucial for Naive Bayes to work. The basic idea is that you calculate the probability of an outcome by combining prior knowledge with the evidence of the specific situation at hand.
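To make the idea concrete, here is a minimal sketch of Bayes' rule applied to the bus stop example. The numbers are invented for illustration and the function name `posterior` is hypothetical: we assume a prior belief of 50% that the bus is still running, and we guess how likely it is to see someone waiting in either case.

```python
def posterior(prior, likelihood, likelihood_if_false):
    """Bayes' rule for a yes/no hypothesis H and a single observation E.

    prior               -- P(H), the belief before looking at the evidence
    likelihood          -- P(E | H)
    likelihood_if_false -- P(E | not H)
    Returns P(H | E), the updated belief.
    """
    evidence = likelihood * prior + likelihood_if_false * (1.0 - prior)
    return likelihood * prior / evidence


# Assumed, made-up numbers for the bus stop:
# H = "the bus is still running", E = "someone else is waiting at the stop".
belief = posterior(prior=0.5, likelihood=0.9, likelihood_if_false=0.3)
print(round(belief, 2))  # 0.75
```

With these assumptions, seeing another person waiting raises the belief that the bus will come from 50% to 75%. Different priors would give a different answer, which is exactly why the prior matters.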

The theorem was formulated during the 1740s by the reverend and amateur mathematician Thomas Bayes. He dedicated his life to solving the question of how to win the lottery. But Bayes' rule was only made famous, and given the form known today, by the French mathematician Pierre-Simon Laplace a bit later in the same century. For a long time after Laplace's death, the theory sank into oblivion, until it was dug out again during the Second World War in an effort to break the Enigma code.

Most people today have come into contact with Naive Bayes through their email spam folders. Naive Bayes is a widely used algorithm for spam detection. It is by coincidence that Viagra, the erectile dysfunction drug, was approved by the US Food & Drug Administration in 1998, around the same time that about 10 million users worldwide had created free webmail accounts. The companies selling it were among the first to use email as a medium for advertising: it was an intimate space, at the time reserved for private communication, for an intimate product. In 2001, the first SpamAssassin programme relying on Naive Bayes was uploaded to SourceForge, cutting down on guerrilla email marketing.
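How such a filter works can be sketched in a few lines. The following is a toy illustration, not SpamAssassin's actual code: the four training messages are invented, the function names `train` and `classify` are hypothetical, and add-one smoothing is an assumption made here so that a word seen in only one class does not zero out the other. The 'naive' part is that every word is treated as independent of the others.

```python
import math
from collections import Counter


def train(messages):
    """Count words per label. messages: list of (text, label), label 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Prior: how common is this label among the training messages?
        score = math.log(label_counts[label] / total_messages)
        words_in_label = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training carry no information
            # Likelihood of the word given the label, with add-one smoothing.
            count = word_counts[label][word] + 1
            score += math.log(count / (words_in_label + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy data, for illustration only.
TRAINING = [
    ("cheap viagra buy now", "spam"),
    ("buy cheap pills online now", "spam"),
    ("are you coming to the wedding", "ham"),
    ("see you at the bus stop", "ham"),
]

model = train(TRAINING)
print(classify("buy viagra online", model))       # spam
print(classify("coming to the bus stop", model))  # ham
```

A real filter would be trained on many thousands of messages and would use more careful tokenisation, but the principle of multiplying a prior by per-word likelihoods stays the same.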


Machine Learners, by Adrian Mackenzie, The MIT Press, Cambridge, US, November 2017.

Naive Bayes & Enigma

This story about Naive Bayes is taken from the book 'The Theory That Would Not Die', written by Sharon Bertsch McGrayne. Amongst other things, she describes how Naive Bayes was soon forgotten after the death of Pierre-Simon Laplace, who had developed it into its modern form. The mathematician was said to have failed to credit the works of others, and as a result he suffered widely circulated attacks on his reputation. Only 150 years later was the accusation refuted.

Fast forward to 1939, when Bayes' rule was still virtually taboo, dead and buried in the field of statistics. When France was occupied in 1940 by Germany, which controlled Europe's factories and farms, Winston Churchill's biggest worry was the U-boat peril. The U-boat operations were tightly controlled by German headquarters in France. Each submarine went to sea without orders and received them as coded radio messages after it was well out into the Atlantic. The messages were encrypted by word-scrambling machines, called Enigma machines. Enigma looked like a complicated typewriter. It was invented by the German firm Scherbius & Ritter after the First World War, when the need for message-encoding machines had become painfully obvious.

Interestingly, and luckily for Naive Bayes and the world, at that time the British government and educational systems saw applied mathematics and statistics as largely irrelevant to practical problem solving. So the British agency charged with cracking German military codes mainly hired men with linguistic skills. Statistical data was seen as bothersome because of its detail-oriented nature. As a result, wartime data was often analyzed not by statisticians, but by biologists, physicists, and theoretical mathematicians. None of them knew that, as far as sophisticated statistics was concerned, Bayes' rule was considered to be unscientific. Their ignorance proved fortunate.

It was the now famous Alan Turing, a mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist, who used Bayes' rule and its system of probabilities to design the 'bombe'. This was a high-speed electromechanical machine for testing every possible arrangement that an Enigma machine could produce. In order to crack the naval codes of the U-boats, Turing simplified the 'bombe' system using Bayesian methods. It turned the UK headquarters into a code-breaking factory. The story is well illustrated in a non-technical way in 'The Imitation Game', a 2014 film directed by Morten Tyldum.

A story on sweet peas

Throughout history, some models were invented by people with ideologies that are not to our liking. The idea of regression stems from Sir Francis Galton, an influential 19th-century scientist. He spent his life studying the problem of heredity: understanding how strongly the characteristics of one generation of living beings manifested in the following generation. He established the field of eugenics, and defined it as ‘the study of agencies under social control that may improve or impair the racial qualities of future generations, either physically or mentally.’ On Wikipedia, Galton is a prime example of scientific racism.

Galton initially approached the problem of heredity by examining characteristics of the sweet pea plant. He chose this plant because the species can self-fertilize. Daughter plants inherit genetic variations from mother plants without a contribution from a second parent. This characteristic eliminates having to deal with multiple sources.

Galton's research was appreciated by many intellectuals of his time. In 1869, in 'Hereditary Genius', Galton claimed that genius is mainly a matter of ancestry, and he believed that there was a biological explanation for social inequality across races. Galton even convinced his half-cousin Charles Darwin of his ideas. After reading Galton's paper, Darwin stated, "You have made a convert of an opponent in one sense for I have always maintained that, excepting fools, men did not differ much in intellect, only in zeal and hard work." Luckily, the modern study of heredity managed to eliminate the myth of racially-based genetic difference, something Galton tried so hard to maintain.

Galton's major contribution to the field was linear regression analysis, laying the groundwork for much of modern statistics. While we engage with the field of machine learning, Algolit tries not to forget that ordering systems hold power, and that this power has not always been used to the benefit of everyone. Machine learning has inherited many aspects of statistical research, some less agreeable than others. We need to be attentive, because these world views do seep into the algorithmic models that create new orders.

Perceptron

We find ourselves in a moment in time in which neural networks are sparking a lot of attention. But they have been in the spotlight before. The study of neural networks goes back to the 1940s, when the first neuron metaphor emerged. The neuron is not the only biological reference in the field of machine learning - think of the word corpus or training. The artificial neuron was constructed in strong connection to its biological counterpart.

Psychologist Frank Rosenblatt was inspired by fellow psychologist Donald Hebb's work on the role of neurons in human learning. Hebb stated that "cells that fire together wire together." His theory now lies at the basis of associative human learning, but also unsupervised neural network learning. It moved Rosenblatt to expand on the idea of the artificial neuron.

In 1962, he created the Perceptron. The perceptron is a model that learns through the weighting of inputs. It was set aside by the next generation of researchers, because it can only handle binary classification of linearly separable data. This means that the data has to be clearly separable into two groups, as for example, men and women, black and white. It is clear that this type of data is very rare in the real world. When the so-called first AI winter arrived in the 70s and funding decreased, the Perceptron was also neglected. For 10 years it stayed dormant. When Spring settled in at the end of the 80s, a new generation of researchers picked it up again and used it to construct neural networks. These contain multiple layers of perceptrons. That is how neural networks saw the light. One could say that the current machine learning season is particularly warm, but it takes another Winter to know a Summer.
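What 'learning through the weighting of inputs' means can be shown with a minimal sketch of the classic perceptron learning rule. This is an illustration under assumptions, not Rosenblatt's original implementation: the function names, the learning rate and the choice of the logical AND as training data are all made up for the example.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs exceeds zero, else stay silent (0)."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Adjust the weights after every mistake; stop once an epoch has no errors."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            if error != 0:
                errors += 1
                # Nudge each weight in the direction that reduces the error.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if errors == 0:
            break
    return weights, bias


# Logical AND: the two classes can be separated by a straight line.
AND_GATE = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_GATE)
print([predict(weights, bias, inputs) for inputs, _ in AND_GATE])  # [0, 0, 0, 1]
```

Replacing the training data with the logical XOR, `[((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]`, shows the limitation described above: no single straight line separates the two classes, so a single perceptron never classifies all four cases correctly, however long it trains.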

BERT

Some online articles say the year 2018 marked a turning point for the field of Natural Language Processing. A series of deep-learning models achieved state-of-the-art results on tasks like question answering or sentiment classification. Google’s BERT algorithm entered the machine learning competitions of last year as a sort of “one model to rule them all.” It showed a superior performance over a wide variety of tasks.

BERT is pre-trained; its weights are learned in advance through two unsupervised tasks. This means BERT doesn’t need to be trained from scratch for each new task. You only have to fine-tune its weights. This also means that a programmer wanting to use BERT no longer knows what parameters BERT is tuned to, nor what data it has seen to learn its performance.

BERT stands for Bidirectional Encoder Representations from Transformers. This means that BERT allows for bidirectional training. The model learns the context of a word based on all of its surroundings, left and right of a word. As such, it can differentiate between 'I accessed the bank account' and 'I accessed the bank of the river'.

Some facts:

  • BERT_large, with 345 million parameters, is the largest model of its kind. It is demonstrably superior on small-scale tasks to BERT_base, which uses the same architecture with “only” 110 million parameters.
  • To run BERT you need to use TPUs. These are Google's processors, engineered especially for TensorFlow, its deep learning platform. TPU rental rates range from $8/h to $394/h. Algolit doesn't want to work with off-the-shelf packages; we are interested in opening the black box. In that case, BERT requires considerable savings in order to be used.