# Contextual stories about Learners

### From Algolit


## Naive Bayes & Viagra

Naive Bayes is a famous learner that performs well with little data. We apply it all the time. Christian & Griffiths state in their book, 'Algorithms to Live By', that 'our days are full of small data'. Imagine, for example, that you're standing at a bus stop in a foreign city. The other person standing there has been waiting for 7 minutes. What do you do? Do you decide to wait? And if so, for how long? When will you consider other options? Another example: imagine a friend asking for advice on a relationship. He's been together with his new partner for one month. Should he invite the partner to join him at a family wedding?

Having preexisting beliefs is crucial for Naive Bayes to work. The basic idea is that you calculate the probability of an outcome by combining prior knowledge with what you observe in a specific situation.
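To make the idea concrete, here is a minimal sketch of Bayes' rule for a single yes/no question, taking a spam filter as the situation. The function name `posterior` and all the numbers are assumptions made up for illustration; they do not come from any real mailbox.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """Bayes' rule for a yes/no hypothesis H and one observation E.

    prior             -- P(H), the belief held before seeing E
    likelihood        -- P(E | H)
    likelihood_if_not -- P(E | not H)
    """
    # Total probability of observing E, whether H is true or not.
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 20% of incoming mail is spam, the word "viagra"
# appears in 40% of spam messages and in 1% of legitimate ones.
p_spam = posterior(prior=0.2, likelihood=0.4, likelihood_if_not=0.01)
print(round(p_spam, 3))  # 0.909
```

With these invented figures, a prior belief of 20% rises to roughly 91% once the word is observed. Change the prior and the conclusion changes with it, which is why the preexisting belief matters so much.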

The theorem was formulated during the 1740s by the reverend and amateur mathematician Thomas Bayes. He dedicated his life to solving the question of how to win the lottery. But Bayes' rule only became famous, and known in its present form, through the French mathematician Pierre-Simon Laplace a bit later in the same century. For a long time after Laplace's death, the theory sank into oblivion, until it was dug out again during the Second World War in an effort to break the Enigma code.

Most people today have come into contact with Naive Bayes through their email spam folders. Naive Bayes is a widely used algorithm for spam detection. It is a coincidence that Viagra, the erectile dysfunction drug, was approved by the US Food & Drug Administration in 1998, around the same time that about 10 million users worldwide had created free webmail accounts. The companies selling it were among the first to use email as a medium for advertising: it was an intimate space, at the time reserved for private communication, for an intimate product. In 2001, the first SpamAssassin programme relying on Naive Bayes was uploaded to SourceForge, cutting down on guerrilla email marketing.
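The sketch below shows how a Naive Bayes spam filter could work in principle. It is not the SpamAssassin implementation: the four training messages, the labels `spam` and `ham`, and the function names are all invented for illustration. The 'naive' part is the assumption that each word contributes to the score independently of the others.

```python
import math
from collections import Counter

# Invented toy messages, for illustration only.
TRAINING = [
    ("buy viagra now", "spam"),
    ("cheap viagra offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch offer at noon", "ham"),
]


def train(examples):
    """Count messages per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = {}
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocabulary


def classify(text, label_counts, word_counts, vocabulary):
    """Return the label with the highest log-probability score."""
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        # Prior belief: how common is this label in the training data?
        score = math.log(n / total)
        n_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocabulary:
                continue  # skip words never seen during training
            # Naive step: every word is treated as independent evidence.
            # Add-one smoothing keeps a zero count from ruling a label out.
            count = word_counts[label][word] + 1
            score += math.log(count / (n_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
print(classify("viagra offer", *model))      # spam
print(classify("meeting at lunch", *model))  # ham
```

Working with logarithms avoids multiplying many small probabilities together, a common design choice in this kind of filter.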

##### Reference

Machine Learners, by Adrian Mackenzie, The MIT Press, Cambridge, US, November 2017.

## Naive Bayes & Enigma

This story about Naive Bayes is taken from the book 'The Theory That Would Not Die', written by Sharon Bertsch McGrayne. Among other things, she describes how Naive Bayes was soon forgotten after the death of Pierre-Simon Laplace, its inventor. The mathematician was said to have failed to credit the works of others, and the widely circulated charges damaged his reputation. Only after 150 years was the accusation refuted.

Fast forward to 1939, when Bayes' rule was still virtually taboo, dead and buried in the field of statistics. When France was occupied in 1940 by Germany, which controlled Europe's factories and farms, Winston Churchill's biggest worry was the U-boat peril. The U-boat operations were tightly controlled by German headquarters in France. Each submarine went to sea without orders and received them as coded radio messages once it was well out into the Atlantic. The messages were encrypted by word-scrambling machines called Enigma machines. Enigma looked like a complicated typewriter. It was invented by the German firm Scherbius & Ritter after the First World War, when the need for message-encoding machines had become painfully obvious.

Interestingly, and luckily for Naive Bayes and the world, the British government and educational systems at that time saw applied mathematics and statistics as largely irrelevant to practical problem solving. So the British agency charged with cracking German military codes mainly hired men with linguistic skills. Statistical data was seen as bothersome because of its detail-oriented nature. As a result, wartime data was often analyzed not by statisticians, but by biologists, physicists, and theoretical mathematicians. None of them knew that, as far as sophisticated statistics was concerned, Bayes' rule was considered to be unscientific. Their ignorance proved fortunate.

It was the now famous Alan Turing, a mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist, who used the probability system of Bayes' rule to design the 'bombe'. This was a high-speed electromechanical machine for testing every possible arrangement that an Enigma machine could produce. In order to crack the naval codes of the U-boats, Turing simplified the 'bombe' system using Bayesian methods. It turned the UK headquarters into a code-breaking factory. The story is well illustrated in a non-technical way in 'The Imitation Game', a 2014 film by Morten Tyldum.

## A story on sweet peas

In statistics, linear regression is a supervised learning method. After training with labeled data, the model tries to predict values for new, unknown data. Linear regression allows us to summarize and study the relationship between two elements, to see whether a correlation exists between them. If there is a correlation, knowledge of one element helps to predict the other. For example, given a movie review, we can predict the average number of stars assigned to it, rather than just saying whether the review is positive or negative.
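As a minimal sketch of what 'training' means here, the code below fits a straight line to a handful of points using ordinary least squares. The data is invented: we assume, purely for illustration, that each review has been reduced to a count of positive words (`positive_words`) and paired with its star rating (`stars`).

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How much x and y vary together, and how much x varies on its own.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Invented labeled data: positive words per review and stars given.
positive_words = [0, 1, 2, 3, 4]
stars = [1.0, 2.0, 2.5, 4.0, 4.5]

slope, intercept = fit_line(positive_words, stars)

# Predict the rating of a new, unseen review with 2.5 positive words
# on average (a hypothetical input).
prediction = intercept + slope * 2.5
print(round(slope, 2), round(intercept, 2), round(prediction, 2))
```

On this toy data the fitted line has a slope of about 0.9 stars per positive word. The positive slope is what lets one element help to predict the other.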

Sometimes the figures we encounter when scratching beneath the surface of this area of study are not to our liking. The idea of regression stems from Sir Francis Galton, an influential 19th-century scientist. He spent his life studying the problem of heredity: understanding how strongly the characteristics of one generation of living beings manifested in the following generation. He established the field of eugenics, and defined it as ‘the study of agencies under social control that may improve or impair the racial qualities of future generations, either physically or mentally.’ His name has forever marked the history and legacy of scientific racism.

Galton initially approached the problem of heredity by examining characteristics of the sweet pea plant. He chose the sweet pea because the species can self-fertilize. Daughter plants inherit genetic variations from mother plants without a contribution from a second parent. This characteristic eliminates having to deal with multiple sources.

In 1875, Galton distributed packets of sweet pea seeds to seven friends. Each friend received seeds of uniform weight, but there was substantial variation across different packets. Galton’s friends harvested seeds from the new generations of plants and returned them to him. He then plotted the weights of the daughter seeds against the weights of the mother seeds. He discovered that the median weights of daughter seeds from a particular size of mother seed approximately described a straight line with positive slope less than 1.0. Galton’s first insights about regression sprang from this two-dimensional diagram plotting the sizes of daughter peas against the sizes of mother peas. He used this representation of his data to illustrate basic foundations of what statisticians still call regression today. For Galton, it was also a way to describe the benefits of eugenics studies.

Galton's research was appreciated by many intellectuals of his time. In 1869, in 'Hereditary Genius', Galton claimed that genius is mainly a matter of ancestry. He falsely believed that there was a biological explanation for social inequality across races. Galton even persuaded his half-cousin Charles Darwin of his ideas. After reading Galton's paper, Darwin stated, *"You have made a convert of an opponent in one sense for I have always maintained that, excepting fools, men did not differ much in intellect, only in zeal and hard work."* Luckily, the modern study of heredity managed to eliminate the myth of racially based genetic difference, something Galton tried so hard to maintain.

The reason we bring him up in this series is that he was among the first scientists to use statistical methods in his research. His major contribution to the field was linear regression analysis, which laid the groundwork for much of modern statistical modelling. While we engage with the field of machine learning, Algolit tries not to forget that ordering systems hold power, and that this power has not always been wielded for the benefit of everyone. Machine learning has inherited many aspects of statistical research, some less agreeable than others. We need to be wary, because these worldviews do seep into the algorithmic models that create new order and orders.

##### References

http://galton.org/letters/darwin/correspondence.htm

https://www.tandfonline.com/doi/full/10.1080/10691898.2001.11910537

http://www.paramoulipist.be/?p=1693

## Perceptron

We find ourselves in a decade in which neural networks are attracting a lot of attention. This was not always the case. The study of neural networks goes back to the 1940s, when the first neuron metaphor emerged. The neuron is not the only biological reference in the field of machine learning: think of the words corpus or training. The artificial neuron was constructed in close connection to its biological counterpart.

Psychologist Frank Rosenblatt was inspired by fellow psychologist Donald Hebb's work on the role of neurons in human learning. Hebb stated that "cells that fire together wire together." His theory now lies at the basis of associative human learning, but also unsupervised neural network learning. It moved Rosenblatt to expand on the idea of the artificial neuron.

In 1962 he created the Perceptron, a model that learns through the weighting of inputs. Later researchers set it aside, because a single perceptron can only handle binary classification, and only when the data is linearly separable: the two classes, as for example men and women, or black and white, must be divisible by a straight line. It is clear that this type of data is very rare in the real world. When the so-called first AI winter arrived (1974–1980) and funding for this research decreased, the Perceptron was also neglected. For ten years it stayed dormant. When spring settled in, new generations of researchers picked it up again and used it to construct neural networks, which contain multiple layers of perceptrons. That is how neural networks saw the light. One could say that this machine learning season is particularly warm, but it takes another winter to know a summer.
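The limitation can be shown with a small sketch of the classic perceptron learning rule. This is an illustrative reconstruction, not Rosenblatt's original machine: the function names, the learning rate and the number of epochs are assumptions. The logical AND of two inputs is linearly separable, while XOR is not.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs exceeds zero."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Nudge the weights after every wrong answer."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # No change when the prediction was right (error == 0).
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

for name, data in (("AND", AND), ("XOR", XOR)):
    w, b = train_perceptron(data)
    correct = sum(predict(w, b, x) == y for x, y in data)
    print(name, correct, "of", len(data), "correct")
```

For AND the weights settle and all four cases come out right. For XOR no straight line separates the two classes, so at least one case is always wrong, however long the training runs. Stacking perceptrons in layers is what removes this restriction.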

## BERT

Some online articles say the year 2018 marked a turning point for the field of Natural Language Processing. A series of deep-learning models achieved state-of-the-art results on tasks like question answering or sentiment classification. Google’s BERT algorithm entered the machine learning competitions of last year as a sort of “one model to rule them all.” It shows superior performance over a wide variety of tasks. BERT is pre-trained; its weights are learned in advance through two unsupervised tasks. This means BERT doesn’t need to be trained from scratch for each new task. You only have to fine-tune its weights. This also means that a programmer wanting to use BERT no longer knows what parameters BERT is tuned to, nor what data it has seen to learn its performance.

BERT stands for Bidirectional Encoder Representations from Transformers. This means that BERT allows for bidirectional training. The model learns the context of a word based on all of its surroundings, to the left and to the right of the word. As such, it can differentiate between 'I accessed the bank account' and 'I accessed the bank of the river'.

Some facts:

- BERT_large, with 345 million parameters, is the largest model of its kind. It is demonstrably superior on small-scale tasks to BERT_base, which uses the same architecture with “only” 110 million parameters.
- To run BERT you need to use TPUs. These are Google's processors, engineered especially for TensorFlow, the deep learning platform. TPU rental rates range from $8/h to $394/h. If you don't want to work with off-the-shelf packages, as we do with Algolit, but are interested in opening the black box, BERT asks for quite some savings in order to be used.