Contextual stories about Readers
From Algolit
Naive Bayes, Support Vector Machines and Linear Regression are called classical machine learning algorithms. They perform well when learning from small datasets, but they often require complex Readers. The task the Readers perform is also called feature engineering. This means that a human needs to spend time on a deep exploratory analysis of the dataset.
Features can be the frequency of words or letters, but also syntactical elements like nouns, adjectives, or verbs. The most significant features for the task at hand must be carefully selected and passed on to the classical machine learning algorithm. This process marks the difference with Neural Networks. When using a neural network, there is typically no need for feature engineering. Humans can pass the data directly to the network and usually achieve good performance right off the bat. This saves a lot of time, energy, and money.
The downside of collaborating with Neural Networks is that you need a lot more data to train your prediction model. Think of 1 GB or more of pure text files. To give you a reference, one A4 page, a text file of 5,000 characters, weighs only about 5 KB. You would need roughly 200,000 such pages to reach 1 GB. More data also means that you need access to more useful datasets and more, much more processing power.
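The page count can be checked with a few lines of arithmetic. This is a minimal sketch that assumes one byte per character (plain ASCII text); the constant and function names are chosen here for illustration only.

```python
# Rough size arithmetic: how many 5,000-character pages fit in 1 GB of text?
# Assumption: one byte per character (plain ASCII text).
CHARS_PER_PAGE = 5000
BYTES_PER_CHAR = 1

GIGABYTE_DECIMAL = 10**9  # 1 GB counted in powers of ten
GIBIBYTE_BINARY = 2**30   # 1 GiB counted in powers of two


def pages_needed(total_bytes, chars_per_page=CHARS_PER_PAGE):
    """Number of full pages that fit into total_bytes of plain text."""
    return total_bytes // (chars_per_page * BYTES_PER_CHAR)


pages_decimal = pages_needed(GIGABYTE_DECIMAL)
pages_binary = pages_needed(GIBIBYTE_BINARY)
print(pages_decimal)  # 200000
print(pages_binary)   # 214748
```

Either way of counting a gigabyte lands at a few hundred thousand pages.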
Character n-gram for authorship recognition
Imagine... you've been working for a company for more than ten years. You have been writing tons of emails, papers, internal notes and reports on very different topics and in very different genres. All your writings, as well as those of your colleagues, are safely backed-up on the servers of the company.
One day, you fall in love with a colleague. After some time you realize this human is rather mad and hysterical and also very dependent on you. The day you decide to break up, your now-ex creates a plan to kill you. They succeed. This is unfortunate. A suicide letter in your name is left next to your corpse. Because of emotional problems, it says, you decided to end your life. Your best friends don't believe it. They decide to take the case to court. And there, based on the texts you and others have produced over ten years, a machine learning model reveals that the suicide letter was written by someone else.
How does a machine analyse texts in order to identify you? The most robust feature for authorship recognition is delivered by the character n-gram technique. It is used in cases where the writing spans a variety of topics and genres. When using character n-grams, texts are considered as sequences of characters. Let's consider the character trigram. All the overlapping sequences of three characters are isolated. For example, the character 3-grams of 'Suicide' would be 'Sui', 'uic', 'ici', 'cid' and 'ide'. Character n-gram features are very simple, language-independent and tolerant to noise. Furthermore, spelling mistakes do not jeopardize the technique.
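Extracting character n-grams takes only a few lines. The sketch below isolates the overlapping trigrams of a text and turns them into a relative-frequency profile; the function names are illustrative, and the profile is only one possible way to summarise the counts, not necessarily the one used in the cited paper.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_profile(text, n=3):
    """Relative frequency of each character n-gram in text."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


print(char_ngrams("Suicide"))  # ['Sui', 'uic', 'ici', 'cid', 'ide']
```

Profiles built from known writings of several authors could then be compared with the profile of a disputed text.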
Patterns found with character n-grams focus on stylistic choices that are unconsciously made by the author. The patterns remain stable over the full length of the text, which is important for authorship recognition. Other types of experiments could include measuring the length of words or sentences, the vocabulary richness, the frequencies of function words; even syntax or semantics-related measurements.
This means not only your physical fingerprint is unique, but also the way you compose your thoughts!
The same n-gram technique discovered that The Cuckoo’s Calling, a novel by Robert Galbraith, was actually written by... J. K. Rowling!
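The other measurements mentioned above (word and sentence length, vocabulary richness, function-word frequency) can be sketched in the same spirit. This is a rough illustration: the tokenisation is naive, the function-word list is a tiny hand-picked sample, and 'type-token ratio' is used here as one simple stand-in for vocabulary richness.

```python
import re

# Tiny illustrative list; an assumption, real studies use far longer lists.
FUNCTION_WORDS = {"the", "a", "an", "and", "or", "but", "of", "in", "on", "to", "it", "that"}


def style_features(text):
    """Compute a few simple stylometric measurements for text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not words or not sentences:
        return {}
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        "type_token_ratio": len(set(words)) / len(words),
        "function_word_rate": sum(w in FUNCTION_WORDS for w in words) / len(words),
    }


features = style_features("The cat sat. The cat ran.")
print(features["avg_word_length"])      # 3.0
print(features["avg_sentence_length"])  # 3.0
```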
Reference
- Paper: On the Robustness of Authorship Attribution Based on Character N-gram Features, Efstathios Stamatatos, in Journal of Law & Policy, Volume 21, Issue 2, 2013.
- News article: https://www.scientificamerican.com/article/how-a-computer-program-helped-show-jk-rowling-write-a-cuckoos-calling/
A history of n-grams
The n-gram algorithm can be traced back to the work of Claude Shannon in information theory. In the paper 'A Mathematical Theory of Communication', published in 1948, Shannon presented the first n-gram-based model of natural language. He posed the question: given a sequence of letters, what is the likelihood of the next letter?
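Shannon's question can be answered by counting. The sketch below estimates, for a given context of n-1 letters, how likely each next letter is, based on a training text. The function names and the toy training string are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def next_letter_model(text, n=3):
    """Map each (n-1)-character context to a Counter of the letters that follow it."""
    model = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context = text[i:i + n - 1]
        following = text[i + n - 1]
        model[context][following] += 1
    return model


def next_letter_probabilities(model, context):
    """Relative frequency of each letter seen after context (empty dict if unseen)."""
    counts = model.get(context, Counter())
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()} if total else {}


model = next_letter_model("the theme then", n=3)
print(next_letter_probabilities(model, "th"))  # {'e': 1.0}
probs = next_letter_probabilities(model, "he")  # ' ', 'm' and 'n' each get 1/3
```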
If you listen to the following excerpt, can you tell who it was written by? Shakespeare or an n-gram piece of code?
SEBASTIAN:
Do I stand till the break off.
BIRON:
Hide thy head.
VENTIDIUS:
He purposeth to Athens: whither, with the vow
I made to handle you.
FALSTAFF:
My good knave.
You may have guessed, considering the topic of this story, that an n-gram algorithm generated this text. The model is trained on the compiled works of Shakespeare. While more recent algorithms, such as the recurrent neural networks of the CharNN, are becoming famous for their performance, n-grams still carry out a lot of NLP tasks. They are used in statistical machine translation, speech recognition, spelling correction, entity detection, information extraction and more.
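A generator of this kind can be sketched in a few lines. The source does not say whether the excerpt above came from a word-level or character-level model, or which n was used; the version below assumes word-level bigrams and a one-line toy corpus purely for illustration.

```python
import random
from collections import defaultdict


def train_word_ngrams(text, n=2):
    """Map each (n-1)-word context to the list of words observed after it."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        model[context].append(words[i + n - 1])
    return model


def generate(model, seed_words, length=10, rng=None):
    """Extend seed_words (which must hold n-1 words) by sampling observed followers."""
    rng = rng or random.Random()
    output = list(seed_words)
    k = len(seed_words)
    for _ in range(length):
        followers = model.get(tuple(output[-k:]))
        if not followers:
            break  # context never seen in training, or end of the corpus reached
        output.append(rng.choice(followers))
    return " ".join(output)


corpus = "to be or not to be that is the question"  # toy corpus, an assumption
model = train_word_ngrams(corpus, n=2)
line = generate(model, ["to"], length=8, rng=random.Random(0))
print(line)
```

Because followers are stored with repetition, frequent continuations are sampled more often, which is what makes the output resemble its training text.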
God in Google Books
In 2006, Google created a dataset of n-grams from their digitized book collection and released it online. Recently they also created an N-gram viewer.
This allowed for many socio-linguistic investigations of questionable reliability. For example, in October 2018, the New York Times Magazine published an opinion article titled 'It’s Getting Harder to Talk About God'. The author, Jonathan Merritt, had analysed the mentions of the word 'God' in Google's dataset using the N-gram viewer. He concluded that the word's usage had declined since the 20th century. Google's corpus contains texts from the 16th century up to the 21st. However, what the author missed was the growing popularity of scientific journals around the beginning of the 20th century. This new genre, which did not mention the word 'God', shifted the composition of the dataset. If the scientific literature were taken out of the corpus, the frequency of the word 'God' would again flow like a gentle ripple from a distant wave.
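The effect of a shifting corpus can be made concrete with a small calculation. The numbers below are invented toy values, not taken from Google's dataset; they only show that a word's relative frequency drops when a genre that never uses it is added, even though the other texts mention it exactly as often as before.

```python
def relative_frequency(word_count, total_tokens):
    """Share of all tokens in a corpus that are the word of interest."""
    return word_count / total_tokens


# Invented toy numbers, NOT measurements from Google's corpus.
general_books = {"god_mentions": 500, "tokens": 1_000_000}
scientific_journals = {"god_mentions": 0, "tokens": 1_000_000}

before = relative_frequency(general_books["god_mentions"], general_books["tokens"])
after = relative_frequency(
    general_books["god_mentions"] + scientific_journals["god_mentions"],
    general_books["tokens"] + scientific_journals["tokens"],
)
print(before)  # 0.0005
print(after)   # 0.00025
```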
Grammatical features taken from Twitter influence the stock market
The boundaries between academic disciplines are becoming blurred. Economics research mixed with psychology, social science, and cognitive and emotional concepts has given rise to a new subfield of economics, called 'behavioral economics'. This means that researchers start to explain economic behavior based on factors other than the economy alone. The economy and public opinion can each influence, and be influenced by, the other. A lot of research is done on how to use public opinion to predict financial changes, like changes in stock prices.
Public opinion is estimated from sources of large amounts of public data, like tweets or news. To some extent, Twitter can be more accurate than news in terms of representing public opinion because most accounts are personal: the source of a tweet could be an ordinary person, rather than a journalist who works for a certain organization. And there are around 6,000 tweets authored per second, so a lot of opinions to sift through.
Experimental studies using machinic data analysis show that the changes in stock prices can be predicted by looking at public opinion, to some degree. There are multiple papers that analyze sentiments in news to predict stock trends by labeling them as either “Down” or “Up”. Most of the researchers used neural networks or pretrained word embeddings.
A paper by Haikuan Liu of the Australian National University states that the tense of verbs used in tweets can be an indicator of intensive financial behaviors. His idea was inspired by the fact that the tense of text data is used as part of feature engineering to detect early stages of depression.
Reference
- Paper: Grammatical Feature Extraction and Analysis of Tweet Text: An Application towards Predicting Stock Trends, Haikuan Liu, Research School of Computer Science (RSCS), College of Engineering and Computer Science (CECS), The Australian National University (ANU).
Bag of words
In natural language processing, 'bag of words' is considered to be an unsophisticated model. It strips text of its context and dismantles it into a collection of unique words. These words are then counted. In the previous sentences, for example, 'words' is mentioned three times, but this is not necessarily an indicator of the text's focus.
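As a minimal sketch, the count mentioned above can be reproduced by applying a bag of words to those very sentences. The tokenisation (lowercasing and keeping only runs of letters) is a simplifying assumption.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each lowercased word occurs, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


text = (
    "In natural language processing, 'bag of words' is considered to be an "
    "unsophisticated model. It strips text of its context and dismantles it "
    "into a collection of unique words. These words are then counted."
)
bag = bag_of_words(text)
print(bag["words"])  # 3
print(bag["of"])     # 3
```

The word 'of' is counted just as often as 'words', which illustrates why raw counts alone say little about a text's focus.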
The first appearance of the expression 'bag of words' seems to go back to 1954. Zellig Harris, an influential linguist, published a paper called "Distributional Structure". In the section called "Meaning as a function of distribution", he says "for language is not merely a bag of words but a tool with particular properties which have been fashioned in the course of its use. The linguist's work is precisely to discover these properties, whether for descriptive analysis or for the synthesis of quasi-linguistic system."