Contextual stories about Readers

From Algolit

Revision as of 08:31, 1 March 2019 by Cristina

Naive Bayes, Support Vector Machines and Linear Regression are called classical machine learning algorithms. They perform well when learning from small datasets, but they often require complex Readers. The task the Readers perform is also called feature engineering. This means that a human needs to spend time on a deep exploratory analysis of the dataset.

Features can be the frequency of words or letters, but also syntactical elements like nouns, adjectives, or verbs. The most significant features for the task at hand must be carefully selected and passed on to the classical machine learning algorithm. This process marks the difference with Neural Networks. When using a neural network, there is no need for feature engineering. Humans can pass the data directly to the network and usually achieve good performance right off the bat. This saves a lot of time, energy, and money.

The downside of collaborating with Neural Networks is that you need a lot more data to train your prediction model. Think of 1 GB or more of pure text files. To give you a reference, one A4 page, a text file of 5,000 characters, weighs only 5 KB. You would need about 200,000 such pages to reach 1 GB. More data also means you need access to more useful datasets and much, much more processing power.

Character n-gram for authorship recognition

Imagine... you've been working for a company for more than ten years. You have been writing tons of emails, papers, internal notes and reports on very different topics and in very different genres. All your writings, as well as those of your colleagues, are safely backed up on the servers of the company.

One day, you fall in love with a colleague. After some time you realize this human is rather mad and hysterical and also very dependent on you. The day you decide to break up, your now-ex creates a plan to kill you. They succeed. This is unfortunate. A suicide letter in your name is left next to your corpse. Because of emotional problems, it says, you decided to end your life. Your best friends don't believe it. They decide to take the case to court. And there, based on the texts you and others have produced, a machine learning model reveals that the suicide letter was written by someone else.

How does a machine analyse texts in order to identify you? The most robust feature for authorship recognition is delivered by the character n-gram technique. It is used when the writings cover a variety of topics and genres. When using character n-grams, texts are considered as sequences of characters. Let's consider the character trigram. All the overlapping sequences of three characters are isolated. For example, the character 3-grams of 'Suicide' would be “Sui”, “uic”, “ici”, “cid” and “ide”. Character n-gram features are very simple, language-independent and tolerant to noise. Furthermore, spelling mistakes do not jeopardize the technique.
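The extraction step described above can be sketched in a few lines of Python. This is only a minimal illustration: the function names `char_ngrams` and `ngram_profile` are made up for this example, and a real authorship model would compare such profiles across many texts.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return every overlapping sequence of n characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_profile(text, n=3):
    """Count how often each character n-gram occurs in text."""
    return Counter(char_ngrams(text, n))


print(char_ngrams("Suicide"))
# ['Sui', 'uic', 'ici', 'cid', 'ide']
```

Because the window slides one character at a time, a single spelling mistake only disturbs the few n-grams that overlap it, which is one way to read the claim that the technique is tolerant to noise.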

Patterns found with character n-grams focus on stylistic choices that are unconsciously made by the author. The patterns remain stable over the full length of the text, which is important for authorship recognition. Other types of experiments could include measuring the length of words or sentences, the vocabulary richness, the frequencies of function words; even syntax or semantics-related measurements.
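Some of the other measurements mentioned here could be computed roughly as follows. This is a sketch under simplifying assumptions: the tokenisation is a plain regular expression, the `FUNCTION_WORDS` list is a tiny hypothetical sample, and vocabulary richness is approximated by the type-token ratio, which is only one of several possible definitions.

```python
import re
from collections import Counter

# A tiny, hypothetical sample of English function words.
# Real stylometric studies use much longer lists.
FUNCTION_WORDS = {"the", "a", "an", "and", "or", "but", "of", "in", "on", "to", "it", "is"}


def style_measures(text):
    """Compute a few simple stylistic measurements for a text."""
    # Words are runs of letters, optionally with one internal apostrophe.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # Sentences are split naively on ., ! and ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    counts = Counter(words)
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        # Type-token ratio: distinct words divided by total words.
        "type_token_ratio": len(counts) / len(words),
        "function_word_rate": sum(counts[w] for w in FUNCTION_WORDS) / len(words),
    }


print(style_measures("The cat sat. The dog ran."))
```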

This means not only your physical fingerprint is unique, but also the way you compose your thoughts!

The same n-gram technique discovered that The Cuckoo’s Calling, a novel by Robert Galbraith, was actually written by... J. K. Rowling!


A history of n-grams

The n-gram algorithm can be traced back to the work of Claude Shannon in information theory. In his paper 'A Mathematical Theory of Communication', published in 1948, Shannon presented the first instance of an n-gram-based model for natural language. He posed the question: given a sequence of letters, what is the likelihood of the next letter?
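Shannon's question can be approximated by counting. The sketch below is not Shannon's own procedure but a minimal modern illustration: it records which character follows each context of n-1 characters in a training text and turns the counts into relative frequencies. The function names and the toy training sentence are invented for the example.

```python
from collections import Counter, defaultdict


def train_char_model(text, n=3):
    """Map each context of n-1 characters to a Counter of the characters that follow it."""
    model = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context = text[i:i + n - 1]
        next_char = text[i + n - 1]
        model[context][next_char] += 1
    return model


def next_char_probabilities(model, context):
    """Estimate the likelihood of each next character as a relative frequency."""
    counts = model.get(context, Counter())
    total = sum(counts.values())
    if total == 0:
        return {}  # context never seen in the training text
    return {char: count / total for char, count in counts.items()}


model = train_char_model("the theory of the thing", n=3)
print(next_char_probabilities(model, "th"))
# {'e': 0.75, 'i': 0.25}
```

In the toy sentence, "th" is followed three times by "e" and once by "i", hence the estimates of 0.75 and 0.25.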

If you listen to the following excerpt, can you tell who it was written by? Shakespeare or an n-gram robot?


Do I stand till the break off.


Hide thy head.


He purposeth to Athens: whither, with the vow

I made to handle you.


My good knave.

You may have guessed, considering the topic of this podcast, that an n-gram algorithm generated this text. The model is trained on the compiled works of Shakespeare. While more recent algorithms, such as the recurrent neural networks of the CharRNN, are becoming famous for their performance, n-grams still execute a lot of NLP tasks. They are used in statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, ...
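How such a generator might work can be sketched as follows. This is an assumption-laden illustration, not the model used for the excerpt above: it works on words rather than characters, uses a single line of Hamlet as a stand-in corpus, and picks each next word at random from the words that followed the current context in the training text.

```python
import random
from collections import defaultdict


def train_word_model(text, n=2):
    """Map each context of n-1 words to the list of words that followed it."""
    if n < 2:
        raise ValueError("n must be at least 2")
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        model[context].append(words[i + n - 1])
    return model


def generate(model, seed, length=10, rng=None):
    """Extend the seed (a non-empty tuple of n-1 words) by up to `length` words."""
    rng = rng or random.Random()
    output = list(seed)
    context_size = len(seed)
    for _ in range(length):
        followers = model.get(tuple(output[-context_size:]))
        if not followers:
            break  # dead end: this context was never followed by anything
        # Frequent followers appear more often in the list, so they are picked more often.
        output.append(rng.choice(followers))
    return " ".join(output)


corpus = "to be or not to be that is the question"
model = train_word_model(corpus, n=2)
print(generate(model, ("to",), length=8, rng=random.Random(1)))
```

Every pair of neighbouring words in the output occurs somewhere in the corpus, which is why the result sounds locally plausible while drifting as a whole.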

God in Google Books

In 2006, Google created a dataset of n-grams from their digitised book collection and released it online. In 2010, they also launched the Ngram Viewer.

This allowed for many socio-linguistic investigations of debatable reliability. For example, in October 2018, the New York Times Magazine published an opinion piece titled 'It’s Getting Harder to Talk About God'. The author, Jonathan Merritt, had analysed the mentions of the word 'God' in Google's dataset using the Ngram Viewer. He concluded that the word's usage had been declining since the 20th Century. Google's corpus contains texts from the 16th Century up to the 21st. However, what the author overlooked was the growing popularity of scientific journals around the beginning of the 20th Century. This genre skews the dataset considerably. When searching only through the English fiction corpus, the line describing the word frequency for 'God' is closer to a wave recuperating from a ripple.

Grammatical features taken from Twitter influence the stock market

The boundaries between academic disciplines are becoming blurred. Economics research mixed with psychology, social science, and cognitive and emotional concepts gives rise to a new economics subfield called 'behavioral economics'. This means that researchers start to explain economic behavior based on factors other than the economy alone. The economy and public opinion can influence each other. A lot of research is done on how to use public opinion to predict financial changes, like stock price changes.

Public opinion is estimated from sources of large amounts of public data, like tweets or news. To some extent, Twitter can be more accurate than news in terms of representing public opinion because most accounts are personal: the source of a tweet could be an ordinary person, rather than a journalist who works for a certain organization. And there are around 6,000 tweets authored per second, so a lot of opinions to sift through.

Experimental studies using machinic data analysis show that changes in stock prices can to some degree be predicted by looking at public opinion. Multiple papers analyze news sentiment to predict stock trends, labeling them as “Down” or “Up”. Most of the researchers used neural networks or pretrained word embeddings.

A paper by Haikuan Liu of the Australian National University states that the tense of verbs used in tweets can be an indicator of intensive financial behaviors. His idea was inspired by the fact that the tense of text data is used as part of feature engineering to detect early stages of depression.


Paper: Grammatical Feature Extraction and Analysis of Tweet Text: An Application towards Predicting Stock Trends, Haikuan Liu, Research School of Computer Science (RSCS), College of Engineering and Computer Science (CECS), The Australian National University (ANU)

Bag of words

In natural language processing, 'bag of words' is considered to be an unsophisticated model. It strips a text of its context and dismantles it into its collection of unique words. These words are then counted. In the previous sentences, for example, 'words' is mentioned three times, but this is not necessarily an indicator of the text's focus.
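The counting described above can be reproduced with a short Python sketch. The tokenisation is a simplifying assumption (lowercased runs of letters, with an optional internal apostrophe), and the function name `bag_of_words` is chosen for this example only.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into words and count them, ignoring their order."""
    return Counter(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))


text = (
    "In natural language processing, 'bag of words' is considered to be an "
    "unsophisticated model. It strips a text of its context and dismantles it "
    "into its collection of unique words. These words are then counted."
)
bag = bag_of_words(text)
print(bag["words"])  # 3
print(bag["of"])     # 3
```

Note that 'of' scores just as high as 'words' here, which illustrates why raw counts alone say little about what a text is about.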

The first appearance of the expression 'bag of words' seems to be in 1954. Zellig Harris published a paper in the context of linguistic studies, called "Distributional Structure". In the section called "Meaning as a function of distribution", he says "for language is not merely a bag of words but a tool with particular properties which have been fashioned in the course of its use. The linguist's work is precisely to discover these properties, whether for descriptive analysis or for the synthesis of quasi-linguistic system."