# Document analysis with machine learning

Cookbook recipes!

For people doing digital humanities work, the possibilities in the document embeddings corner of the machine learning world look especially promising.

I’ve been thinking about which machine learning tools can contribute the most to the field of digital humanities, and an obvious candidate is document embeddings. I’ll describe what these are below, but I’ll start with the fun part: after using some document embedding Python scripts to compare the 546 Wikibooks recipes to each other, I created an If you liked… web page that shows, for each recipe, which other recipes were calculated to be most similar to it.

In Semantic web semantics vs. vector embedding machine learning semantics I wrote about how neural networks can assign vectors of values to words based on the relationships among words in a given text corpus. Once these word vectors are “embedded” in a common vector space, relationships between those vectors can reflect the semantics of the words. The classic examples are asking a system that has done this for a decent-sized corpus of English text “king is to queen as man is to what” or “London is to England as Berlin is to what”. By comparing the calculated vectors, it’s relatively easy for a system to answer “woman” to the first question and “Germany” to the second.
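The analogy trick can be sketched with a toy example. These are invented three-dimensional vectors, used only to illustrate the vector arithmetic; real trained embeddings have hundreds of dimensions learned from a corpus:

```python
import numpy as np

# Invented 3-dimensional "word vectors" for illustration only; real
# embeddings are learned from a corpus and are much longer.
vectors = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'queen': np.array([0.9, 0.1, 0.8]),
    'man':   np.array([0.5, 0.9, 0.0]),
    'woman': np.array([0.5, 0.1, 0.9]),
    'apple': np.array([0.1, 0.5, 0.5]),
}

def cos_sim(a, b):
    return np.inner(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# "king is to queen as man is to what?" becomes queen - king + man; the
# answer is the remaining word whose vector is closest to that result.
target = vectors['queen'] - vectors['king'] + vectors['man']
answer = max((w for w in vectors if w not in ('queen', 'king', 'man')),
             key=lambda w: cos_sim(vectors[w], target))
print(answer)   # woman
```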

That post also mentioned how we can assign vectors to other things besides words. Plenty of code is available to generate and work with document embeddings, so I tried this with the flair Python NLP framework available on github. For an introduction to flair, I recommend the github page’s tutorial and the article Text Classification with State of the Art NLP Library — Flair.

To generate document embedding vectors for the Wikibooks recipes and then compare them all with each other, I based my demo script below on the flair example in the cosine_similarity_using_embeddings git repo. The demo shown here processes just a few recipes, for reasons explained further down, and outputs RDF about the similarity scores it calculated so that I could perform SPARQL queries about those similarities.

```python
#!/usr/bin/env python

# Read Wikibook recipes, calculate document vectors for each, calculate
# all cosine similarity pairings, and output RDF about the result. Recipes
# were stripped, and then <title></title> and <url></url> tags added to each.

import glob
import re
import time

import numpy as np
from flair.embeddings import Sentence, StackedEmbeddings, FlairEmbeddings, WordEmbeddings
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Most of this code is based on
# https://github.com/swapnilg915/cosine_similarity_using_embeddings/blob/master/flair_embeddings.py

# initialize embeddings
glove_embedding = WordEmbeddings('glove')
flair_embedding_forward = FlairEmbeddings('news-forward')
flair_embedding_backward = FlairEmbeddings('news-backward')

# Note: this class shadows the flair.embeddings.FlairEmbeddings import, but
# the instances above were already created, so the script still runs.
class FlairEmbeddings(object):

    def __init__(self):
        self.stop_words = list(stopwords.words('english'))
        self.lemmatizer = WordNetLemmatizer()
        self.stacked_embeddings = StackedEmbeddings(
            embeddings=[flair_embedding_forward, flair_embedding_backward])

    def word_token(self, tokens, lemma=False):
        tokens = str(tokens)
        tokens = re.sub(r"([\w].)([\~\!\@\#\$\%\^\&\*\-\+\{\}\/\"\'\:\;])([\s\w].)",
                        "\\1 \\2 \\3", tokens)
        tokens = re.sub(r"\s+", " ", tokens)
        if lemma:
            return " ".join([self.lemmatizer.lemmatize(token, 'v')
                             for token in word_tokenize(tokens.lower())
                             if token not in self.stop_words and token.isalpha()])
        else:
            return " ".join([token for token in word_tokenize(tokens.lower())
                             if token not in self.stop_words and token.isalpha()])

    def cos_sim(self, a, b):
        return np.inner(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def getFlairEmbedding(self, text):
        sentence = Sentence(text)
        self.stacked_embeddings.embed(sentence)
        return np.mean([np.array(token.embedding) for token in sentence], axis=0)

#################

if __name__ == '__main__':

    # For this demo, just get the recipes whose titles begin with "J".
    filenameArray = glob.glob('/home/bob/temp/wprecipes/data/g-p/Cookbook:J*')

    print('# start: ' + time.strftime('%H:%M:%S'))

    recipeDataArray = []   # Each entry will be an array with the following fields
    # so that they can be referenced like this: recipeDataArray[3][recipeTitleField]
    recipeTitleField = 0
    urlField = 1
    recipeField = 2
    recipeEmbeddingField = 3

    obj = FlairEmbeddings()

    for file in filenameArray:
        recipeContent = ''
        input = open(file, "r")
        for line in input:
            if "<title>" in line:
                title = re.sub(r'^\s*<title>', '', line)   # Remove title tags.
                title = re.sub(r'\s*</title>\s*', '', title)
                recipeContent = recipeContent + line
            if "<url>" in line:
                url = re.sub(r'^\s*<url>', '', line)       # Remove url tags.
                url = re.sub(r'\s*</url>\s*', '', url)
                ##print(file + ': ' + url)
                recipeContent = recipeContent + line
        input.close()
        recipeDataArray.append([title, url, recipeContent])

    print('# starting to calculate embeddings: ' + time.strftime('%H:%M:%S'))

    # Calculate and save embeddings
    for r in recipeDataArray:
        recipeEmbedding = obj.getFlairEmbedding(r[recipeField])
        r.append(recipeEmbedding)

    print('# starting comparisons: ' + time.strftime('%H:%M:%S'))

    print("@prefix d: <http://learningsparql.com/data#> .")
    print("@prefix m: <http://learningsparql.com/model#> .")
    print("@prefix dc: <http://purl.org/dc/elements/1.1/> .\n")

    # Find the cosine similarity of all the combinations
    recipesToCompare = len(recipeDataArray)  # or some small number for tests
    i1 = 0
    while i1 < recipesToCompare:
        title = recipeDataArray[i1][recipeTitleField].replace('\"', '\\"')
        # Output a triple with the recipe's title.
        print('<' + recipeDataArray[i1][urlField] + '>  dc:title "' + title + '" .')
        i2 = i1 + 1
        while i2 < recipesToCompare:
            # Output triples like
            # [ m:doc recipeN, recipeN+1 ; m:recipeCosineSim 0.8249611 ] .
            recipeCosineSim = \
                obj.cos_sim(recipeDataArray[i1][recipeEmbeddingField],
                            recipeDataArray[i2][recipeEmbeddingField])
            print('[ m:doc <' + recipeDataArray[i1][urlField] +
                  '>, <' + recipeDataArray[i2][urlField] +
                  '> ; m:recipeCosineSim ' + str(recipeCosineSim) + ' ] . ')
            i2 += 1
        i1 += 1

    print('# finished: ' + time.strftime('%H:%M:%S'))
```


On my Dell XPS 13 9350 laptop it took about 30 seconds to calculate each embedding. For 546 recipes, that adds up to several hours, and my laptop was running very hot after half an hour of that. (This got Start Me Up completely stuck in my head throughout the experiment.) The script above demonstrates the steps of what I did at a small scale, but to create the full “If you liked…” recipe comparison page I did the following. (You can find the scripts and query results for this in my own github repository.)

Instead of reading all the recipes, calculating their embeddings, and calculating their similarities in one run, I split the script above in half. The first half performed three steps:

1. Read a third of the recipes. Without the “J” in data/g-p/Cookbook:J* above, that’s the middle third of the recipe collection; the other two thirds were in data/a-f and data/q-z.

2. Calculate embeddings for each recipe.

3. Store the resulting array in a Python pickle file.
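The hand-off between steps 2 and 3 can be sketched like this, with invented data standing in for the real [title, url, content, embedding] entries (the pickle filename is also made up):

```python
import pickle
import numpy as np

# A stand-in for the real array of [title, url, content, embedding] entries;
# the title, URL, text, and embedding values here are all invented.
recipeDataArray = [
    ['Cookbook:Jambalaya', 'https://en.wikibooks.org/wiki/Cookbook:Jambalaya',
     'recipe text...', np.array([0.1, 0.2, 0.3])],
]

# Step 3: store the array in a pickle file (hypothetical filename).
with open('recipes-g-p.pkl', 'wb') as f:
    pickle.dump(recipeDataArray, f)

# The second script can later load it back unchanged:
with open('recipes-g-p.pkl', 'rb') as f:
    loaded = pickle.load(f)
print(loaded[0][0])   # Cookbook:Jambalaya
```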

(If I pursue this more I plan to do all that in one batch on an Amazon AWS EC2 instance. Machine learning in the cloud is a topic you hear about often, and when you could fry an egg on your own laptop it starts to look especially appealing.)

After running that first script on the three batches of recipes, my second script read the pickle files that the three runs created into one big recipeDataArray array and then ran the “Find the cosine similarity of all the combinations” part of the script above on that array. Even with 546 recipes, that only took two seconds. It’s nice to know that although a linear increase in the number of documents to compare means a quadratic increase in the number of comparisons to make (n documents need n(n-1)/2 pairwise comparisons), the calculation of each pair’s similarity is so quick that the quadratic increase is not a big deal, at least at this scale. (Some of the embedding vectors never got calculated because, according to the error messages, the input was apparently an empty string. This resulted in about 1% of the cosine similarity figures being “nan”, or Not a Number, values. If I were doing this for a paying client I would track down the input that caused these problems and do something about it, but for a fun personal demo I just removed the offending vectors before moving on to the next step.)
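The merge-and-filter step can be sketched like this, with dummy pickle files standing in for the real three batches (all filenames and data here are invented):

```python
import pickle
import numpy as np

# Create three dummy batch files standing in for the real pickled batches.
# (Filenames, titles, URLs, and embedding values are invented.)
batches = {
    'recipes-a-f.pkl': [['Cookbook:Adobo', 'url1', 'text',
                         np.array([0.1, 0.2])]],
    'recipes-g-p.pkl': [['Cookbook:Jambalaya', 'url2', 'text',
                         np.array([0.3, 0.4])],
                        # An entry whose empty input produced a NaN embedding:
                        ['Cookbook:Oops', 'url3', '',
                         np.array([np.nan, np.nan])]],
    'recipes-q-z.pkl': [['Cookbook:Wonton_Soup', 'url4', 'text',
                         np.array([0.5, 0.6])]],
}
for filename, data in batches.items():
    with open(filename, 'wb') as f:
        pickle.dump(data, f)

# Read the three pickle files into one big recipeDataArray array.
recipeDataArray = []
for filename in batches:
    with open(filename, 'rb') as f:
        recipeDataArray.extend(pickle.load(f))

# Remove the offending entries whose embeddings contain NaN values.
recipeEmbeddingField = 3
recipeDataArray = [r for r in recipeDataArray
                   if not np.any(np.isnan(r[recipeEmbeddingField]))]
print(len(recipeDataArray))   # prints 3
```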

After this script output RDF about the recipe similarities, I could explore the results. The following excerpt from that RDF gives you the flavor of what the queries had to work with. Each comparison is a blank node that connects the two compared documents with their comparison score, and the dc:title triples show the actual titles of the recipes:

```turtle
[ m:doc <https://en.wikibooks.org/wiki/Cookbook:Apple_Raisin_Oat_Muffins>,
        <https://en.wikibooks.org/wiki/Cookbook:Adobo> ;
  m:recipeCosineSim 0.8590696 ] .

<https://en.wikibooks.org/wiki/Cookbook:Apple_Raisin_Oat_Muffins>
    dc:title "Cookbook:Apple Raisin Oat Muffins" .
```


The following SPARQL query lists all the pairings in ascending order of cosine similarity:

```sparql
PREFIX m: <http://learningsparql.com/model#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?score ?title1 ?title2 WHERE {
  ?comparison m:doc ?recipe1, ?recipe2 ;
              m:recipeCosineSim ?score .
  ?recipe1 dc:title ?title1 .
  ?recipe2 dc:title ?title2 .
  FILTER (?recipe1 != ?recipe2)
}
ORDER BY ?score
```


It turns out that the two most similar recipes are Wonton Soup and the Egg Roll recipe, with a cosine similarity score of 0.9928804. The pairing of Pork Pot Pie and Chicken Pot Pie II came in second. (I was relieved to see that there was no Chicken Pot Pie I, because if there had been and it wasn’t more similar to its sequel than the Pork Pot Pie, then the whole model’s ability to determine similarities would be much more questionable. As you’ll see below, it’s questionable anyway, but there are actions I can take to try to improve it.)

A slight variation on the above query created the basis of my “If you liked…” page. It sorts the results by the recipe titles and then by descending order of what is most similar to each recipe.

```sparql
PREFIX m: <http://learningsparql.com/model#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# A comparison object looks like this:
# [ m:doc <https://en.wikibooks.org/wiki/Cookbook:Apple_Raisin_Oat_Muffins>,
#   <https://en.wikibooks.org/wiki/Cookbook:Adobo> ; m:recipeCosineSim 0.8590696 ] .

SELECT ?score ?doc1URL ?doc1title ?doc2URL ?doc2title WHERE {
  ?comparison m:doc ?doc1URL, ?doc2URL ;
              m:recipeCosineSim ?score .
  ?doc1URL dc:title ?doc1title .
  ?doc2URL dc:title ?doc2title .

  FILTER(?doc1URL != ?doc2URL)

  # Experimenting with the cutoff that led to a figure of .975:
  # .92: 189970 result lines; .95: 76866; .97: 8562; .98: 544; .975: 2604
  FILTER(?score > .975)
}
ORDER BY ?doc1title DESC(?score)
```


When I ran this with arq I asked it for tab-separated value output, and I wrote a perl script to convert that to HTML. I also did a little hand-editing at the top of the HTML file to add the introduction, and that’s what you see at the If you liked… page. This web page is hopefully useful because you can easily look up a given recipe and find out which others are most similar to it. All recipe names there are links to the original recipes; their URLs were easy to store throughout the various steps because I treated each recipe’s URL as the URI for that document. This whole Linked Data thing is pretty useful sometimes!
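My conversion script was in perl, but the idea can be sketched in Python. The sample lines below are invented; the five tab-separated columns match the SELECT list of the query above (score, first URL and title, second URL and title):

```python
import html
from itertools import groupby

# Invented sample lines in the shape of arq's tab-separated-value output,
# already sorted by first title and then descending score as in the query.
tsv_lines = [
    '0.9928804\thttp://example.org/WontonSoup\tWonton Soup\t'
    'http://example.org/EggRoll\tEgg Roll',
    '0.98\thttp://example.org/WontonSoup\tWonton Soup\t'
    'http://example.org/Jiaozi\tJiaozi',
]

# Group the rows by the first recipe and emit one linked list per recipe.
html_parts = []
rows = [line.split('\t') for line in tsv_lines]
for (url1, title1), group in groupby(rows, key=lambda r: (r[1], r[2])):
    html_parts.append('<h3>If you liked <a href="%s">%s</a>…</h3>\n<ul>' %
                      (url1, html.escape(title1)))
    for score, _, _, url2, title2 in group:
        html_parts.append('<li><a href="%s">%s</a> (%s)</li>' %
                          (url2, html.escape(title2), score))
    html_parts.append('</ul>')
print('\n'.join(html_parts))
```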

You can find the perl script and queries, along with the scripts I used to pull down and prepare the recipe data, in the git repository I created.

A great feature of the Wikibooks recipe collection is how many different cuisines are represented, so I was hoping for some interesting cross-cultural pairings, but so many of the pairings make so little sense that it’s difficult to take the non-obvious ones seriously. This starts with the very first one on the list: I have no idea why it rates Macaroni and Cheese as the closest thing to Bánh mì. Ranking Guacamole as very similar to A Nice Cup of Tea is even worse.

This reminds us that, as VCs throw their money at AI startups that promise easy, plug-and-play machine learning, we must remember what machine learning people call the “no free lunch” principle: no single model is going to do everything well. Getting good results means tweaking the parameters that control how these tools do their work, and knowing what tweaks to make requires some study.

I have ideas for experiments to get the cosine similarity scores to make more intuitive sense. Many of the values set in the “initialize embeddings” and “class FlairEmbeddings” sections of the script above were choices from a broader menu of options. (I didn’t choose them myself, but just copied them from other examples.) For example, instead of using news-forward and news-backward as the character-level language models, I could have selected from other choices described in the source code. For the word embeddings that the flair document embeddings build on, the same source code shows other alternatives to GloVe.

But the time it takes to calculate all of those document embeddings makes it difficult to churn quickly through different combinations of initialization settings. I started up a few small, cheap AWS Amazon Machine Images and was unable to install flair on them, so my next line of research is to keep looking for a good one for this. (I’d appreciate any suggestions…)

Still, the fact that I could take someone else’s 63-line script, modify it a bit, and use machine learning to create an HTML index of recipe similarity in a good-sized yet diverse cookbook means that getting and then tweaking such tools is not a super difficult thing to do with a collection of documents. For people doing digital humanities work, the possibilities in the document embeddings corner of the machine learning world look especially promising.