“Embed, embed! There’s knocking at the gate.”
Detecting Intertextuality with Embeddings and the Vectorian
Bernhard Liebl & Manuel Burghardt
Computational Humanities Group, Leipzig University
1. Introduction
The detection of intertextual references in text corpora is a topic in digital humanities that has gained a lot of attention in recent years (for instance Bamman & Crane, 2008; Burghardt et al., 2019; Büchler et al., 2013; Forstall et al., 2015; Scheirer et al., 2014). While intertextuality – from a literary studies perspective – describes the phenomenon of one text being present in another text (cf. Genette, 1993), the computational problem at hand is the task of text similarity detection (Bär et al., 2012), and more concretely, semantic similarity detection.
In the following example of Shakespearean intertextuality, the words bleed and leak are semantically (and phonetically) similar, demonstrating that Star Trek here is quoting Shakespeare without any doubt:
Shylock: If you prick us, do we not bleed?
(Shakespeare; The Merchant of Venice)
Data: If you prick me, do I not leak?
(Star Trek: The Next Generation; The Measure of a Man)
1.1 Enter: word embeddings
Over the years, there have been various attempts at measuring semantic similarity, some of them knowledge-based (e.g. based on WordNet), others corpus-based, like LDA (Chandrasekaran & Mago, 2021). The advent of word embeddings (Mikolov et al., 2013) has changed the field considerably by introducing a new and fast way to tackle the notion of word meaning. On the one hand, word embeddings are building blocks that can be combined with a number of other methods, such as alignments, soft cosine or Word Mover's Distance, to implement some kind of sentence similarity (Manjavacas et al., 2019). On the other hand, the concept of embeddings can be extended to work on the sentence level as well, which is a conceptually different approach (Wieting et al., 2016).
We introduce the Vectorian as a framework that allows researchers to try out different embedding-based methods for intertextuality detection. In contrast to previous versions of the Vectorian (Liebl & Burghardt, 2020a/b), which were mere web interfaces with a limited set of static parameters, we now present a clean and redesigned API that is showcased in this interactive Jupyter notebook.
We will first use the Vectorian to build queries where we plug in pre-trained static word embeddings such as fastText (Mikolov et al., 2018) and GloVe (Pennington et al., 2014). We evaluate the influence of computing similarity through alignments such as Waterman-Smith-Beyer (WSB; Waterman et al., 1976) and two variants of Word Mover's Distance (WMD; Kusner et al., 2015). We also investigate the performance of state-of-the-art sentence embeddings like Siamese BERT networks (Reimers & Gurevych, 2019) for the task - both on a document level (as document embeddings) and as contextual token embeddings. Overall, we find that WSB with fastText offers highly competitive performance. We also find some slight indication that POS tag-weighted WSB might offer further benefits in some scenarios. Readers can upload their own data for performing search queries and try out additional vector space metrics such as p-norms or improved sqrt-cosine similarity (Sohangir & Wang, 2017).
1.2 Outline of the notebook
In the notebook, we will go through different examples of intertextuality to demonstrate and explain the implications of different embeddings and similarity measures. To achieve this, we provide a small ground truth corpus of intertextual Shakespeare references that can be used for some controlled evaluation experiments. Our main goal is to provide an interactive environment, where researchers can test out different methods for text reuse and intertextuality detection. This notebook thus adds to a critical reflection of digital methods and can help to shed some light on their epistemological implications for the field of computational intertextuality detection. At the end of the notebook, researchers can also easily import their own data and investigate all the showcased methods for their specific texts.
1.3 Technical setup
We import a couple of helper functions for visualizations and various computations (nbutils), a wrapper to load our gold standard data (gold), and finally the Vectorian library (vectorian), through which we will perform searches and evaluations later on.
In nbutils.initialize we check whether a bokeh server is available. This is typically the case for local Jupyter installations, but not for notebooks running on mybinder. In the latter case, the notebook has some limitations regarding interactivity.
import sys; sys.path.append("code") # make importable
import nbutils, gold, vectorian
import ipywidgets as widgets
from ipywidgets import interact
nbutils.initialize("auto", export=True)
2. Data and Tools
2.1 Introducing the gold standard dataset
In the following we use a collection of 100 short text snippets (= documents) that quote a total of 20 different Shakespeare phrases. All of these documents were derived from the WordWeb IDEM portal, where literary scholars collect intertextual references in a freely accessible database (Hohl-Trillini et al., 2020). Each document quotes exactly one of the 20 phrases. For some phrases, e.g. "to be or not to be", there are more quoting documents than for others (see the interactive overview of documents below). If there are multiple documents that quote the same phrase, we selected them such that each quotes the phrase in a different way. There are no verbatim quotes in the documents, but always more or less complex variations of the original phrase.
We use this collection of documents containing quotes as a gold standard in order to assess how well different embeddings and search algorithms are able to detect rephrasings of different types of quotes.
In technical terms, the gold standard data is represented as a directed graph, where nodes are phrases - e.g. "to be or not to be" - and edges model intertextuality - i.e. one phrase re-occurring in a different context. For example, Shakespeare's "to be or not to be" will have several outgoing edges that reference other phrases from other works that we consider intertextually related. Edges are directed, and start from the work containing query phrases (which is always by William Shakespeare in this notebook's gold standard data) and go to the work that contains dependent rephrasings. Note that this relationship is purely conceptual and does not imply a chronological timeline of text reuse. For example, "The rest is silence" occurs in Hamlet (1623 for the First Folio), whereas the rephrasing "the rest is all but wind" occurs in A Fig for Fortune (1596).
Nodes contain additional information on a phrase's context (i.e. surrounding text) and the containing work (and author), both of which allow us to understand where the phrase comes from and where it is used.
The visualization below shows the full gold data graph. Nodes are represented as circles. The 20 larger red nodes are source nodes, i.e. those nodes by Shakespeare that serve as queries for our investigations. The 100 smaller orange nodes are phrases that are related to the original Shakespeare phrase. By hovering over nodes, you are able to see the phrase itself, the work it occurs in, and the full context in which it is embedded. Re-occurrences of phrases are highlighted in bold.
gold_data = gold.load_data("data/raw_data/gold.json")
nbutils.plot_gold(gold_data, title=f"The gold data is a {gold_data}")
The browser widget below lets the reader explore the same graph data through a different UI. The specific example shown by default is the rephrasing of the Shakespeare phrase "to be or not to be" in a non-Shakespeare work titled "The Phoenix" by Thomas Middleton. For a deeper discussion of the intertextual provenance of this particular phrase, see Trillini (2020).
The phrase in Middleton's work is "to be named or not be named". The context in which this rephrasing is embedded is the whole line spoken by "Fidelio".
nbutils.Browser(gold_data, "to be or not to be", "The Phoenix");
While the structure of the gold data has been geared towards our specific use case in this notebook, the graph-based format of gold.json should be straightforward to understand and easy to replace with custom datasets. Note that the loader inside gold.py is very simple and essentially just builds a graph. Also note that only nodes with in-degree 0 are considered base nodes, which are converted into queries later on in the notebook.
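To illustrate this format, here is a minimal sketch (using networkx) of how such a graph could be assembled; the node ids and attribute names below are hypothetical stand-ins and do not reproduce the exact schema of gold.json or the loader in gold.py.
# hypothetical sketch of the gold data graph structure described above
import networkx as nx
g = nx.DiGraph()
# a source node: a Shakespeare phrase together with its work and context
g.add_node("n1", phrase="the rest is silence", work="Hamlet",
           author="William Shakespeare", context="... the rest is silence ...")
# a dependent node: a rephrasing of that phrase in another work
g.add_node("n2", phrase="the rest is all but wind", work="A Fig for Fortune",
           context="... the rest is all but wind ...")
# a directed edge models the intertextual relation (source -> rephrasing)
g.add_edge("n1", "n2")
# only nodes with in-degree 0 are later turned into queries
queries = [n for n, deg in g.in_degree() if deg == 0]
print(queries)  # ['n1']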
2.2 Overview of different types of embeddings
Word embeddings take up the linguistic concept of collocations: for each word, the other words with which it occurs in a corpus are recorded. These collocation profiles are then represented as vectors. If two words (e.g. "car" and "truck") occur with very similar words (e.g. "wheels", "drive", "street", etc.), then they also have very similar word vectors, i.e. they are semantically - or at least structurally - very similar.
There are various established ways to compute embeddings for word similarity tasks. A first important distinction to be made is between token / word embeddings and document embeddings. While token embeddings model one embedding per token, document embeddings try to map an entire document (i.e. an ordered sequence of tokens) into one single embedding. There are two common ways to compute document embeddings. One way is to derive them from token embeddings - for instance by averaging token embedding vectors. More complex approaches train dedicated models that are optimized to produce good document embeddings.
All in all, we can distinguish three types of embeddings:
- original token embeddings (these can be either static or contextual)
- document embeddings derived from token embeddings (e.g. through averaging; see the sketch after this list)
- document embeddings from dedicated models, such as Sentence-BERT (Reimers & Gurevych, 2019).
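As a minimal illustration of the averaging strategy mentioned above (a sketch only; average_doc_embedding and get_token_vec are hypothetical helpers and not part of the Vectorian API):
import numpy as np
def average_doc_embedding(tokens, get_token_vec):
    # get_token_vec: hypothetical lookup from a token to its 1-D numpy vector
    vecs = np.stack([get_token_vec(t) for t in tokens])
    # the document embedding is simply the mean over all token embeddings
    return vecs.mean(axis=0)
# usage sketch: average_doc_embedding("to be or not to be".split(), get_token_vec)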
The diagram below shows this taxonomy. Orange arrows indicate specific embeddings used in this notebook.
nbutils.plot_dot("miscellaneous/diagram_embeddings_1.dot")
The following diagram showcases various options for token embeddings. The most recent option is using contextual token embeddings (also sometimes called dynamic embeddings), which incorporate a specific token's context and can be obtained from architectures like ELMo or BERT. Another option is using static token embeddings, which map one token to one embedding, independent of its specific occurrence in a text. For an overview of static and contextual embeddings, and their differences, see Wang et al. (2020).
We have a variety of established options for static embeddings like fastText or GloVe. We can also combine several embeddings into one single embedding - a common mechanism used for this is stacking, i.e. concatenating embedding vectors.
nbutils.plot_dot("miscellaneous/diagram_embeddings_2.dot")
In this notebook, we showcase the following four variations of embeddings:
- Static token embeddings: these operate on the token level. We experiment with GloVe (Pennington et al., 2014), fastText (Mikolov et al., 2018) and Numberbatch (Speer et al., 2017). We use these three embeddings to compute token similarity and combine them with alignment algorithms (such as Waterman-Smith-Beyer) to compute document similarity. We also investigate the effect of stacking two static embeddings (fastText and Numberbatch) into a single new embedding.
- Contextual token embeddings: these also operate on the token level, but embeddings can change according to a specific token instance's context. In this notebook we experiment with using such token embeddings from the Sentence-BERT model (Reimers & Gurevych, 2019). Note that this model is usually used to produce document embeddings. For our experiments in this variant, we ignore this layer and access its underlying token embeddings.
- Document embeddings derived from specially trained models: document embeddings represent one document via one single embedding. Again, we use Sentence-BERT (Reimers & Gurevych, 2019), but this time we extract document embeddings. More specifically, we will use two Sentence-BERT models trained specifically for the semantic textual similarity (STS) task (Reimers & Gurevych, 2019).
- Document embeddings derived from token embeddings: We also experiment with averaging different kinds of token embeddings (static and contextual) to derive document embeddings.
2.3 "Shakespeare in the Vectorian Age" – Meet the Vectorian framework
To conduct our actual investigations, we rely on a framework called the Vectorian, which we first introduced in 2020 (Liebl & Burghardt, 2020a/b). Using highly optimized algorithms and data structures, the Vectorian enables interactive real-time searches over text corpora using a variety of approaches and strategies.
In order to use the Vectorian, we need to map the gold standard data to Vectorian API concepts (which we highlight like this). As a first step, we take all contexts from the 100 gold standard phrases and use these as Documents in the Vectorian.
A Document in Vectorian terminology is something we can perform a search on. Documents in the Vectorian are created using different kinds of Importers that perform the necessary natural language processing tasks using an additional NLP class. Since this processing can be time-consuming, we pre-computed it and use the Corpus class to quickly load the pre-processed Documents into the notebook. For details about the pre-processing, see code/prepare_corpus.ipynb.
Note that using the phrase contexts as Documents is a simplification of the search process that is necessary for a clean evaluation. Using a full book or work as a Document and searching over its parts (e.g. over all sentences or over a sliding window of its tokens) would be a more realistic setting, but we would then have to manually re-check all results classified as false positives, since the automatic search might reveal correct text reuse references that we were previously unaware of. In contrast, our gold standard has been manually curated such that there is one and only one text reuse reference per context. By searching over contexts that carry exactly one correct text reuse reference, we can ensure that our performance evaluation of a search strategy is sound.
Using the loaded Documents and a set of Embeddings, we then create a Session that allows us to perform searches for instances of intertextuality. More details about the technical architecture we build on in this notebook can be found in the source code and the API documentation for the Vectorian.
2.3.1 Loading word embeddings
In terms of static embeddings, we will work with pre-trained versions of GloVe, fastText and Numberbatch. GloVe uses a form of matrix factorization on a global co-occurrence matrix to compute embeddings for a finite set of predefined tokens (Pennington et al., 2014). fastText training, in contrast, operates on local context windows (Mikolov et al., 2018). Unlike GloVe and the earlier word2vec, fastText additionally computes embeddings from character n-grams instead of whole tokens only, which means there are no out-of-vocabulary tokens (Mikolov et al., 2018). GloVe and fastText only use data from a corpus, whereas Numberbatch embeddings additionally incorporate information from a knowledge graph (Speer et al., 2017).
Due to the limited RAM of the interactive Binder environment (and to limit download times), we use small or compressed versions of the official pre-trained embeddings:
- for GloVe, we use the official 50-dimensional version of the 6B variant
- for fastText we use a version that was trained on Common Crawl and Wikipedia using CBOW, and then compressed using the standard settings in https://github.com/avidale/compress-fasttext
- for Numberbatch, we use version 19.08, reduced to 50 dimensions using a standard PCA
We also use one stacked embedding, in which we combine fastText and Numberbatch. We will call this embedding fasttext_numberbatch.
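Conceptually, stacking amounts to concatenating the two token vectors. A minimal numpy sketch with hypothetical dimensionalities (this is not the Vectorian's internal code):
import numpy as np
v_fasttext = np.random.rand(300)    # hypothetical 300-dim fastText vector for one token
v_numberbatch = np.random.rand(50)  # hypothetical 50-dim Numberbatch vector for the same token
v_stacked = np.concatenate([v_fasttext, v_numberbatch])
print(v_stacked.shape)  # (350,)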
Finally we will use contextual embeddings based on the Sentence-BERT architecture (Reimers & Gurevych, 2019). We use two models, with the second one being the newer one and - as it has been trained for asymmetric search - more suitable to the task at hand:
- the pre-trained English paraphrase_distilroberta_base_v1 model, which is trained for symmetric semantic search
- the pre-trained English msmarco-distilbert-base-v4 model, which is trained for asymmetric semantic search.
We refer to both models as sbert variants. Note that many other models can be trained with the Sentence-BERT architecture and might perform differently on the task at hand.
Also note that all embeddings we use in this notebook were trained from large generic corpora, i.e. no embedding was trained from the documents we search over.
We first need to instantiate an NLP parser that is capable of performing standard NLP tasks such as tokenization and POS tagging. Internally, we use spaCy to construct a suitable parser. nlp.pipeline will return the fully constructed NLP pipeline in case the reader is interested.
nlp = nbutils.make_nlp()
We now create the desired sbert embeddings (more specifically, suitable class instances compatible with the Vectorian) as well as the other static embeddings we described earlier. The embeddings.yml file referenced below contains a detailed technical description of what exactly is loaded.
the_embeddings = nbutils.load_embeddings("data/raw_data/embeddings.yml")
print("loaded:", ", ".join(the_embeddings.keys()))
loaded: glove, fasttext, numberbatch, sbert_paraphrase, sbert_msmarco, fasttext_numberbatch
2.3.2 Creating the session
The following code creates a Session in the Vectorian framework that will allow us to perform searches over the gold standard corpus using the desired embeddings:
session = vectorian.session.LabSession(
vectorian.corpus.Corpus("data/processed_data/corpus", mutable=False),
embeddings=the_embeddings.values())
Finally, the following code will speed up searches later in the notebook by loading all contextual embedding vectors into RAM.
session.cache_contextual_embeddings()
3. Embeddings as a tool for intertextuality research
3.1 Exploring word embeddings
3.1.1 An introduction to word embeddings and token similarity
Before we dive into the actual analyses (of the instances of intertextuality), we first take a brief look at the inner workings of embeddings. Mathematically speaking, a word embedding is a vector x of dimension n, i.e. a vector consisting of n scalars.
$x = (x_1, x_2, \ldots, x_{n-1}, x_n)$
For example, the compressed numberbatch embedding we use has n=50 and thus represents the word "coffee" with the following 50 scalar values:
widgets.GridBox(
[
widgets.Label(f"{x:.2f}")
for x in session.word_vec(the_embeddings["numberbatch"], "coffee")
],
layout=widgets.Layout(grid_template_columns="repeat(10, 50px)"),
)
Since the above representation is difficult to understand, we visualize the values of $x_1, x_2, \ldots, x_{n-1}, x_n$ through different colors. By default, all values are normalized by $\lVert x \rVert_2$, i.e. the dot product of two such normalized vectors gives their cosine similarity.
@interact(
embedding=widgets.Dropdown(
options=[(k, v) for k, v in the_embeddings.items() if not v.is_contextual],
value=the_embeddings["numberbatch"],
),
normalize=True,
)
def plot(embedding, normalize):
nbutils.plot_embedding_vectors_val(
["sail", "boat", "coffee", "tea", "guitar", "piano"],
get_vec=lambda w: session.word_vec(embedding, w),
normalize=normalize,
)
By looking at these color patterns, we can gain some intuitive understanding of why and how word embeddings are appropriate for word similarity calculations. For example, sail and boat both show a strong activation for dimension 27. Similarly, guitar and piano share similar values for dimension 24. The words coffee and tea also share similar values in dimensions 1 and 2, which slightly set them apart from the other four words.
A common approach to compute the similarity between two word vectors $u$ and $v$ in this kind of high-dimensional vector space is to compute the cosine of the angle $\theta$ between the vectors, which is called cosine similarity:

$$\cos \theta = \frac{u \cdot v}{\lVert u \rVert_2 \, \lVert v \rVert_2}$$
A large positive value (i.e. a small θ between u and v) indicates higher similarity, whereas a small or even negative value (i.e. a large θ) indicates lower similarity. For a discussion of issues with this notion of similarity, see Faruqui et al. (2016).
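The same quantity can also be computed directly from the raw vectors. The sketch below assumes that session.word_vec (used for the plots above) returns plain numpy arrays; cosine_similarity is our own illustrative helper, not part of the Vectorian API.
import numpy as np
def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u||_2 * ||v||_2)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
u = session.word_vec(the_embeddings["numberbatch"], "coffee")
v = session.word_vec(the_embeddings["numberbatch"], "tea")
cosine_similarity(u, v)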
The visualization below encodes the per-dimension products $u_i \cdot v_i$ (of the normalized vectors) for different $i$, $1 \le i \le n$, through colors to illustrate how different vector components (i.e. which values of $i$) contribute to the cosine similarity of two words. Brighter colors (orange/yellow) indicate dimensions with higher contribution.
@interact(
embedding=widgets.Dropdown(
options=[(k, v) for k, v in the_embeddings.items() if not v.is_contextual],
value=the_embeddings["numberbatch"],
)
)
def plot(embedding):
nbutils.plot_embedding_vectors_mul(
[("sail", "boat"), ("coffee", "tea"), ("guitar", "piano")],
get_vec=lambda w: session.word_vec(embedding, w),
)
As in the earlier plot, dimension 27 pops out as a strong link between sail and boat.
A comparable investigation of fastText shows similar spots of strong contributions. The plot here is somewhat more complex due to the higher number of dimensions (n = 300).
@interact(
embedding=widgets.Dropdown(
options=[(k, v) for k, v in the_embeddings.items() if not v.is_contextual],
value=the_embeddings["fasttext"],
)
)
def plot(embedding):
nbutils.plot_embedding_vectors_mul(
[("sail", "boat"), ("coffee", "tea"), ("guitar", "piano")],
get_vec=lambda w: session.word_vec(embedding, w),
)
Computing the overall cosine similarity for two words is mathematically equivalent to summing up the terms in the diagram above. The overall similarity between guitar and piano is approx. 68% with the fastText embedding we use. For guitar and coffee it is significantly lower with a similarity of approx. 20%.
from vectorian.sim.token import EmbeddingTokenSim
from vectorian.sim.vector import CosineSim
token_sim = EmbeddingTokenSim(the_embeddings["fasttext"], CosineSim())
[session.similarity(token_sim, "guitar", x) for x in ["piano", "coffee"]]
[0.68097234, 0.19857687]
Note that for contextual embeddings, we need to compute the similarity between the actual instances of tokens within a text document.
token_sim = EmbeddingTokenSim(the_embeddings["sbert_paraphrase"], CosineSim())
a = list(session.documents[0].spans(session.partition("document")))[0][2]
b = list(session.documents[6].spans(session.partition("document")))[0][10]
[a.text, b.text, session.similarity(token_sim, a, b)]
['dare', 'bear', 0.42446673]
3.1.2 Detecting Shakespearean intertextuality through word embeddings
We now explore the usefulness of embeddings and token similarity with the gold standard dataset that was introduced earlier. In the following example, the phrase "the rest is silence" is quoted as "the rest is all but wind". While the syntactic structure is mirrored between the original phrase and its re-occurrence, the term "silence" is replaced with "all but wind".
vis = nbutils.TokenSimPlotterFactory(session, nlp, gold_data)
plotter1 = vis.make("rest is silence", "Fig for Fortune")
Intuitively, we expect "silence" and "wind" to be related to a certain degree. To investigate how well this intuition transfers to our measurements through embeddings, we inspect the cosine similarity of the token "silence" with other tokens in the context of the quoting document ("A Fig for Fortune", 1596) for three different embedding models.
It becomes clear that for all three embeddings there is a strong connection between "silence" and "wind". The cosine similarity is particularly high with the Numberbatch model. Nevertheless, the absolute value of 0.3 for Numberbatch is still in a rather low range. Interestingly, GloVe associates "silence" with "action", which can be understood as quite the opposite of silence. The phenomenon that embeddings sometimes cluster opposites is a common observation and can be a problem when trying to distinguish between synonyms and antonyms.
plotter1("silence")
Another quote example involving the phrase "sea of troubles" is shown below. We see that the word "sea" is paraphrased as "waves", whereas "troubles" gets substituted by "troublesome". If we take a closer look at the cosine similarities of the tokens "sea" and "troubles" with all the other tokens in the document's context, we see that they are – expectedly – rather high, which means we should be able to detect such kinds of rephrasing.
plotter2 = vis.make("sea of troubles", "Book of Common Prayer")
plotter2("sea")
plotter2("troubles")
It is also interesting to investigate how out-of-vocabulary words like "troublesomest" produce zero similarities with standard key-value embeddings, whereas fastText is still able to produce a vector thanks to subword information.
plotter2("troublesomest")
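Assuming the Vectorian returns an all-zero vector for tokens that a key-value embedding does not know (which the zero similarities above suggest), a quick way to check this for the static embeddings is to compare vector norms:
import numpy as np
for name in ["glove", "numberbatch", "fasttext"]:
    vec = session.word_vec(the_embeddings[name], "troublesomest")
    # a norm of 0 indicates an out-of-vocabulary token; fastText should yield a non-zero vector
    print(name, np.linalg.norm(vec))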
3.2 Exploring document embeddings
Next, we consider the representation of each document with a single embedding to gain an understanding of how different embedding strategies relate to document similarity. We will later return to individual token embeddings.
For this purpose, we will use the two strategies already mentioned for computing document embeddings:
- averaging over token embeddings
- computing document embeddings through a dedicated model
In order to achieve the latter, we compute document embeddings through Sentence-BERT encoders.
doc_encoders = nbutils.make_doc_encoders(the_embeddings, session)
embedder = nbutils.DocEmbedder(
session=session,
nlp=nlp,
doc_encoders=doc_encoders,
encoder="paraphrase [doc]",
)
embedder.display()
Similar to the investigation of token embedding values, we now look at the feature dimensions of the document embeddings. In the following plot we observe that the phrase "an old man is twice a child" and the corresponding text reuses from the gold standard (i.e. the true positives) show some salient contributions around dimensions 25 and 300 (see the 5 upper rows). When comparing the same pattern with non-matching text reuse occurrences from the "go, by Saint Hieronimo" phrase (see the 5 lower rows), there is less activation in these areas. These areas therefore seem to offer good features for differentiating correct occurrences of a pattern from incorrect ones.
bars = nbutils.DocEmbeddingBars(embedder, session, gold_data)
bars.plot("an old man is twice a child", "Saint Hieronimo")
Instead of focusing on only one phrase, we now look at a plot of the embeddings of all documents in our gold standard data. The plot uses a dimensionality reduction technique known as t-Distributed Stochastic Neighbor Embedding (t-SNE), which allows us to reduce the high-dimensional embeddings to just two dimensions.
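Conceptually, the reduction works as in the following scikit-learn sketch, with a hypothetical random matrix standing in for the actual document embeddings (the notebook's own plot is produced by nbutils):
import numpy as np
from sklearn.manifold import TSNE
doc_vecs = np.random.rand(100, 768)  # hypothetical (documents x dimensions) embedding matrix
# reduce to two dimensions for plotting; perplexity must be smaller than the number of samples
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(doc_vecs)
print(xy.shape)  # (100, 2)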
doc_embedding_explorer = nbutils.DocEmbeddingExplorer(
session=session,
nlp=nlp,
gold=gold_data,
doc_encoders=doc_encoders,
)
doc_embedding_explorer.plot(
[
{
"encoder": "paraphrase [doc]",
"locator": ("fixed", "carry coals")
},
{
"encoder": "paraphrase [doc]",
"locator": ("fixed", "an old man is twice"),
},
]
)
pass
In the t-SNE visualization above, the dots represent documents and the colors represent the phrase that is linked to this document in our gold standard (more details on the underlying documents are shown when hovering the mouse cursor over the nodes). Dots that are close to each other indicate that the underlying documents share a certain similarity. Nearby dots of the same color indicate that the embedding tends to cluster documents in a way that mirrors the ground truth in our gold standard.
In the left plot, we searched for the phrase "we will not carry coals" (visualized as large yellow circle with a cross). The plot shows that the query is in fact part of a document cluster (smaller yellow circles) that contains a variation of that phrase. Similarly, on the right we see that the phrase "an old man is twice a child" loosely clusters with the actual (green) documents we associate with it in our gold standard.
In summary, for these phrases and documents, the paraphrase_distilroberta model automatically produces document embeddings that replicate some of the structure of our gold standard ground truth.
In the following plot we look at token-based embeddings, document embeddings and how the two are related. The document embeddings on the left are averaged from token embeddings. On the right side, we see a t-SNE plot of the token embeddings that make up the document embeddings that are currently selected on the left. The colors differentiate which token embedding belongs to which document embedding. By showing the constituents of the document embeddings, this visualization makes more transparent how such document embeddings come to be and why certain documents on the left are clustered.
doc_embedding_explorer.plot(
[
{
"encoder": "numberbatch",
"selection": [
"ww_32c26a7909c83bda",
"ww_b5b8083a6a1282bc",
"ww_9a6cb20b0b157545",
"ww_a6f4b0e3428ad510",
"ww_8e68a517bc3ecceb",
],
}
]
)
pass
In the specific example shown above, the red circles on the left represent contexts that our gold standard lists as containing rephrasings of the phrase "a horse, a horse, my kingdom for a horse". We included two other unrelated documents that are color-coded as lilac and light rose.
To understand the document clustering on the left, we might expect that the term "horse" from the investigated phrase plays a central role. Indeed, the red token embeddings in the right plot show a cluster around "horse" in the lower left. However, it seems that this is not the main ingredient of the cluster of these documents on the left. Rather, we find that - for example - the documents represented by the two close red dots in the upper left corner of the document embeddings view do not refer to "horse", but instead to a topic of water and ships. Looking at all three documents shown in red, we observe these terms:
- The term "boat" from "A boat, a boat" in the document "Eastward Ho!"
- The term "boat" in "muscle boat" in the document "The Poor Man's Comfort"
- The term "swim" from "To swim the river villain" in the document "The Battle of Alcazar"
To reiterate: the three red documents do not seem to be clustered around a concept of "horse" or "kingdom", as might be expected from their grouping in our gold standard. Instead, all three red documents seem to get clustered through some common notion of ship or water, which is not useful for recovering them when querying for the phrase 'a horse, a horse, my kingdom for a horse'.
Note that there is a token cluster of sailing and water (e.g. "boat", "swim", "sail" and "river") on the left side of the right plot, which shows these terms are considered similar.
Especially concerning for the resulting document clustering is the fact that "The Battle of Alcazar" contains the term "swim" rather incidentally and not as an annotated part of the rephrased quote. Still, it is this term "swim", and not the term "horse", that seems to make the document cluster with the other two, turning this into a "swim" cluster rather than a "horse" cluster.
This short investigation serves as a caveat regarding the effects of unsupervised document clustering. We do see groups that form due to inherent qualities, but these qualities (e.g. "horse" vs. "water") might not at all mirror what we expect.
Note that the plot above is interactive and can be customized (simply drag the mouse to lasso different documents).
3.3 Exploring word mappings: WSB vs. WMD
So far, we have experimented with different token embeddings and seen how similarity comparison can be implemented for single tokens. We have also looked at document embeddings to compare documents. We now return to token embeddings, but instead of comparing single tokens, we now turn to the detection of intertextual references by comparing longer token sequences with each other. In contrast to document embeddings, we will work with one embedding per token.
The problem when comparing token sequences for this task is to identify the relevant parts or segments in a sequence. For example, a quotation like "to be or not to be" will occur as a local phenomenon, i.e. only at a certain position in a document. The rest of the document will likely consist of sentences that do not match the quoted phrase at all. Furthermore, the phrase might be changed through the insertion, deletion or mutation of tokens.
In order to compute document similarity based on token embeddings, we turn to two kinds of approaches.
One popular class of techniques are sequence alignment algorithms, as well as adjacent approaches like Dynamic Time Warping (see Kruskal, 1983). In this section, we introduce the Waterman-Smith-Beyer (WSB) algorithm, which produces optimal local alignments and supports a general (e.g. non-affine) gap cost function (Waterman, Smith & Beyer, 1976). Other commonly used alignment algorithms - such as Smith-Waterman and Gotoh - can be regarded as special cases of WSB. Unlike the popular Needleman-Wunsch global alignment algorithm, WSB produces local alignments. In contrast to classic formulations of WSB - which often use a fixed substitution cost - we use the word distance from word embeddings to compute the substitution penalty for specific pairs of words.
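To convey the core idea, here is a strongly simplified local-alignment sketch in the spirit of Smith-Waterman, using an embedding-based substitution score and a constant linear gap cost. The Vectorian's WSB implementation is more general (arbitrary gap cost functions) and heavily optimized, so treat this only as an illustration.
import numpy as np
def local_alignment_score(query_tokens, doc_tokens, sim, gap_cost=0.5):
    # sim(a, b): embedding-based token similarity; in practice one shifts it
    # (e.g. sim - 0.5) so that unrelated token pairs contribute negative scores
    n, m = len(query_tokens), len(doc_tokens)
    h = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h[i, j] = max(
                0.0,  # start a new local alignment
                h[i - 1, j - 1] + sim(query_tokens[i - 1], doc_tokens[j - 1]),  # substitution
                h[i - 1, j] - gap_cost,  # gap in the document
                h[i, j - 1] - gap_cost,  # gap in the query
            )
    return h.max()  # score of the best locally aligned segment anywhere in the document
# usage sketch (hypothetical similarity function):
# local_alignment_score("rest is silence".split(), doc_tokens, lambda a, b: fasttext_sim(a, b) - 0.5)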
Another approach to compute a measure of similarity between documents - more specifically their bag of words (bow) representation - is the so-called Word Mover's Distance introduced by Kusner et al. (2015). The main idea is computing similarity through finding the optimal solution of a transportation problem between words.
In the following, we will experiment with two variants of the WMD. In addition to the classic WMD, where a transportation problem is solved over the normalized bag of words (nbow) vector, we also introduce a new variant of WMD that keeps the bag of words (bow) unnormalized, i.e. we pose the transportation problem on absolute word occurrence counts.
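The difference between the two variants only concerns the token weights that enter the transportation problem. A minimal sketch of the two weightings (not the Vectorian's implementation):
from collections import Counter
tokens = "to be or not to be".split()
bow = Counter(tokens)  # absolute counts, e.g. {'to': 2, 'be': 2, 'or': 1, 'not': 1}
total = sum(bow.values())
nbow = {t: c / total for t, c in bow.items()}  # normalized weights summing to 1
print(bow, nbow)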
Note: document embeddings do not need any of the above techniques, as they embed documents into a vector space in a way that queries and target documents that share similar features are close to each other in that space.
3.3.1 Mapping quote queries to longer text documents
def make_index_builder(nlp, **kwargs):
return nbutils.InteractiveIndexBuilder(
session, nlp,
partition_encoders=dict((k, v.encoder) for k, v in doc_encoders.items()),
**kwargs)
index_builder = make_index_builder(nlp)
index_builder
What can be seen above is the description of a search strategy that we will employ in the following sections of this notebook. By switching to the "Edit" part, it is possible to explore the settings in more detail and even change them to something completely different. Note that various parameters in the "Edit" GUI - e.g. mixing of embeddings - are beyond the scope of this notebook. For more details, see Liebl & Burghardt (2020a/b).
Example: For the phrase "old men's crotchets" we find the following top match ("We old men have our crotchets") with a similarity score of 77.7%. By increasing the n value, we can display more ranked results.
index_builder.build_index().find("old men's crotchets", n=3)
3.3.2 Evaluation: Plotting the nDCG over the corpus
In the following we will systematically evaluate different strategies for identifying intertextuality in our gold standard data. We investigate WSB and the two variants of WMD (bow vs. nbow). To compute token embeddings, we use (compressed) fastText. As another variant, we evaluate the performance of Sentence-BERT when computing one embedding per document. The evaluation metric is the normalized discounted cumulative gain (nDCG), which we already used in earlier similar studies (also see Liebl & Burghardt, 2020b). It is computed as follows.
Each specific query (using a specific phrase) operates on a corpus consisting of a set of documents $D = \{d_1, \ldots, d_n\}$; in our case $n = 100$. We call the set of relevant documents for this query $R = \{r_1, \ldots, r_k\}$. $R$ models the ground truth encoded in the gold standard, i.e. the results we regard as optimal for a specific query. In terms of the graph description of our gold standard, $R$ is the set of nodes directly connected to the query node.
If the documents we actually retrieve through a search algorithm are numbered $x_1, \ldots, x_n$ in order of their score (highest first), then the nDCG for that specific retrieval is defined as

$$\mathrm{nDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}, \qquad \mathrm{DCG} = \sum_{i=1}^{n} \frac{\mathrm{rel}(x_i)}{\log_2(i+1)}, \qquad \mathrm{IDCG} = \sum_{i=1}^{k} \frac{1}{\log_2(i+1)},$$

where $\mathrm{rel}(x_i) = 1$ if $x_i \in R$ and $0$ otherwise, and IDCG is the DCG of the ideal ranking that lists all $k$ relevant documents first.
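For reference, here is a minimal sketch of this computation with binary relevance (an illustrative helper of our own; the actual evaluation in this notebook is performed via nbutils.plot_ndcgs):
import math
def ndcg(retrieved, relevant):
    # retrieved: document ids in descending score order; relevant: the set R from the gold standard
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(retrieved) if doc in relevant)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(len(relevant)))  # ideal ranking: all of R first
    return dcg / idcg
print(ndcg(["d3", "d7", "d1"], {"d3", "d1", "d9"}))  # ~0.70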
In the summary below you will find more detailed descriptions of the search strategies that will be evaluated in the following. By using "Edit", it is possible to change these settings to something else - a rerun of the following sections of the notebook would then be necessary.
import collections
def strategy_evaluation(embedding, show_ui=True, doc=True):
index_builders = collections.OrderedDict(
{
"wsb": make_index_builder(
nlp,
strategy="Alignment",
strategy_options={
"alignment": vectorian.alignment.LocalAlignment(
gap={
"s": vectorian.alignment.smooth_gap_cost(5),
"t": vectorian.alignment.smooth_gap_cost(5)
}
),
"similarity": {"embedding": embedding}
},
),
"wmd nbow": make_index_builder(
nlp,
strategy="Alignment",
strategy_options={
"alignment": vectorian.alignment.WordMoversDistance.wmd("nbow"),
"similarity": {"embedding": embedding}
},
),
"wmd bow": make_index_builder(
nlp,
strategy="Alignment",
strategy_options={
"alignment": vectorian.alignment.WordMoversDistance.wmd("bow"),
"similarity": {"embedding": embedding}
},
)
})
if doc:
index_builders["doc sbert paraphrase"] = make_index_builder(
the_embeddings["sbert_paraphrase"].nlp,
strategy="Partition Embedding",
strategy_options={"encoder_index": 0})
index_builders["doc sbert msmarco"] = make_index_builder(
the_embeddings["sbert_msmarco"].nlp,
strategy="Partition Embedding",
strategy_options={"encoder_index": 1})
if show_ui:
# present UI of various options that allows for editing
accordion = widgets.Accordion(children=[x.displayable for x in index_builders.values()])
for i, k in enumerate(index_builders.keys()):
accordion.set_title(i, k)
display(accordion)
def make_plotter():
return nbutils.plot_ndcgs(
gold_data, dict((k, v.build_index()) for k, v in index_builders.items()))
return make_plotter
make_p1 = strategy_evaluation(the_embeddings["fasttext"])
With the following command we get an overview of the quality of the results we obtain with the search strategies configured above, by computing the nDCG over the 20 queries in our gold standard with regard to the known optimal results (this may take a few seconds).
p1 = make_p1()
p1.plot()
p1.plot_hist()
In terms of overall performance (mean and median) Waterman-Smith-Beyer (WSB) performs better than all other tested approaches.
As the histogram above shows, WSB especially outperforms the other approaches in terms of the number of fully correct (nDCG = 100%) queries. On the other hand, WSB also produces one low outlier (around 30% nDCG) that the other approaches lack.
One advantage of WSB over the WMD variants is how easy it is to interpret the results. WSB produces an alignment that relates one document token to at most one query token. For WMD, this one-to-one correspondence does not hold in general, which makes the results harder to understand. We use this characteristic of WSB in the following section to illustrate which mappings actually occur.
The experiments above were performed with fastText. Note that running the same evaluation on the GloVe embeddings shows different results: WMD now outperforms WSB in mean performance. The absolute performance is worse than with fastText, though.
make_p2 = strategy_evaluation(the_embeddings["glove"], show_ui=False, doc=False)
make_p2().plot()
For the sake of completeness, here are the results for Numberbatch, which are somewhat similar to those of GloVe.
make_p3 = strategy_evaluation(the_embeddings["numberbatch"], show_ui=False, doc=False)
make_p3().plot()
3.3.3 Focussing on single queries
We now investigate some queries for which WSB performs poorly, in order to better understand why our search fails to place the optimal (true positive) results at the top of the result list.
index_builder = make_index_builder(nlp)
index_builder
We turn to the query that scored lowest in the previous evaluation ("though this be madness, yet there is a method in it"), and look at its results in some more detail.
plot_a = nbutils.plot_results(
gold_data, index_builder.build_index(), "though this be madness", rank=7
)
The best match obtained here (red bar on rank 7) is anchored on two word matches, namely madness (a 100% match) and methods (a 72% match). The other words are quite different and there is no good alignment.
plot_b = nbutils.plot_results(
gold_data, index_builder.build_index(), "though this be madness", rank=3
)
Above we see the rank 3 result from the same query, which is a false positive - i.e. our search claims it is a better result than the one we saw before, but in fact this result is not relevant according to our gold standard. If we analyze why this result nevertheless gets such a high score, we see that "is" and "in" both contribute 100% scores. Compared to the scores before (100% for "madness" and 72% for "methods"), this partially explains the higher overall score (if we assume for now that the contributions from the other tokens are roughly similar).
We will now try to understand why the true positive results are ranked rather low. The following plot breaks down how the overall scores are composed from single token scores:
nbutils.vis_token_scores(
plot_b.matches[:50],
highlight={"token": ["madness", "method"], "rank": [7, 21, 35, 46]},
)
{"model_id":"e73eb8cd5e2e4bf5a9a119c5f9435504","version_major":2,"version_minor":0}
<function nbutils.vis_token_scores.
The true positive results are marked with black triangles. We see that our current search strategy isn't doing a very good job of ranking them highly. Looking at the score composition of the relevant results, we can identify two distinct features: all relevant results show a rather large contribution of either "madness" (look at ranks 7 and 35, for example) and/or a rather large contribution of "method" (ranks 7 and 46). However, these contributions do not lead to higher ranks necessarily, since other words such as "is", "this" and "though" score higher for other results: for example, look at the contribution of words like "in" and "is" in ranks 1, 3 and 5.
In the plot below, we visualize this observation using ranks 1, 7 and 35. Comparing the rank 1 result on the left - which is a false positive - with the two relevant results on the right, we see that "in", "though" and "is" make up large parts of the score for rank 1, whereas "madness" is a considerable factor for the two relevant matches. Unfortunately, this contribution is not sufficient to bring these results to higher ranks.
@widgets.interact(plot_as=widgets.ToggleButtons(options=["bar", "pie"], value="bar"))
def plot(plot_as):
nbutils.vis_token_scores(
plot_b.matches, kind=plot_as, ranks=[1, 7, 35], plot_width=800
)
The distributions of score contributions we just observed are the motivation for our approach of tag-weighted alignments as described in Liebl & Burghardt (2020a/b). Nagoudi and Schwab (2017) used a similar idea of POS tag-weighting for computing sentence similarity, but did not combine it with alignments.
We now demonstrate POS tag-weighted alignments by using an alignment that weights nouns like "madness" and "method" 3 times higher than other word types. "NN" is a Penn Treebank tag and identifies singular nouns.
tag_weighted_index_builder = make_index_builder(
nlp, strategy="Tag-Weighted Alignment", strategy_options={"tag_weights": {"NN": 3}}
)
tag_weighted_index_builder
nbutils.plot_results(
gold_data, tag_weighted_index_builder.build_index(), "though this be madness"
)
Tag-weighting moves the correct results far to the top, namely to ranks 1, 2, 4 and 6. By increasing the NN weight to 5, it is possible to promote the true positive on rank 67 to rank 11. This is a rather extreme measure, though, and we will not investigate it further here. Instead, we investigate how the weighting affects the other queries. Therefore, we re-run the nDCG computation and compare it against unweighted WSB.
index_builder_unweighted = make_index_builder(nlp)
index_builder_unweighted
tw_plot = nbutils.plot_ndcgs(
gold_data,
{
"wsb_unweighted": index_builder_unweighted.build_index(),
"wsb_weighted": tag_weighted_index_builder.build_index(),
})
tw_plot.plot()
tw_plot.plot_hist()
The evaluation shows that even though there is an improvement in the mean nDCG when employing POS tag weighting for the corpus, the effect vanishes when computing the median. The histogram above shows that POS tag weighting manages to shift two low-quality queries that are present in the unweighted version (in the 30% and 60% bands) into higher bands, which considerably improves the mean. Apart from these two outliers, however, POS tag weighting shows no improvement in the higher bands: it accumulates more queries in the 90% band, but loses some in the 100% band compared to unweighted alignment. Therefore, the median does not improve.
The two improved low-nDCG queries are listed in the column called better below. The first query is the one we used as the rationale for the whole design. More research is needed to understand whether tag-weighted alignment might in general benefit some larger subset of queries.
nbutils.eval_strategies(tw_plot.data, gold_data)
3.4 The influence of different embeddings
While we have experimented with different strategies like WSB and WMD in the previous sections, we will now use tag-weighted alignments only, and take a look at the effect that different embeddings might have on the results.
In contrast to earlier experiments we only use token-based embeddings. For example, for Sentence-BERT we extract one embedding per token, instead of computing one embedding per document as in the experiments before. Thus, these embeddings only serve as input to a local alignment computation here.
One caveat here is that we are using compressed embeddings, i.e. we would need to verify these results with uncompressed embeddings. Still, the performance even of compressed fastText seems quite solid.
index_builders = {}
# for each embedding, define a search strategy based on tag-weighted alignments
for e in the_embeddings.values():
index_builders[e.name] = make_index_builder(
nlp=e.nlp if e.is_contextual else nlp,
strategy="Tag-Weighted Alignment",
strategy_options={"tag_weights": {"NN": 3}, "similarity": {"embedding": e}},
)
# present an UI to interactively edit these search strategies
accordion = widgets.Accordion(children=[x.displayable for x in index_builders.values()])
for i, k in enumerate(index_builders.keys()):
accordion.set_title(i, k)
accordion
p2 = nbutils.plot_ndcgs(
gold_data, dict((k, v.build_index()) for k, v in index_builders.items()))
p2.plot()
p2.plot_hist()
Some observations to be made here (based on the detailed plots that are available in the interactive version):
- In a few queries ("Illo, ho, ho, my lord", "frailty, thy name is woman", "hell itself should gape"), GloVe gives slightly better results than fastText, but this cannot be generalized to the overall performance.
- For some queries ("I do bear a brain.", "O all you host of heaven!") the embedding does not seem to matter at all.
- A real competitor for fastText are the contextual token embeddings from the Sentence-BERT msmarco model, which achieves a stunning 99.8% median performance when used as token embeddings with tag-weighted alignments, and thus outperforms its own document embedding median of 91.6% that we obtained earlier. Note that the mean performance does not show this effect, though. As the histogram above explains, this can be attributed to the fact that the msmarco model achieves a 100% nDCG for over half of the performed queries (which dominates the median), while the other half has low outliers (e.g. in the range of 60%), which in turn affect the mean.
4. Conclusion
In this interactive notebook we have demonstrated how different types of word embeddings can be used to detect intertextual phenomena in a semi-automatic way. We also provide a basic ground truth dataset of 100 short documents that contain 20 different quotes from Shakespeare's plays. This setup enables us to investigate the inner workings of different embeddings and to evaluate their suitability for the case of Shakespearean intertextuality.
The following main findings – that also open up perspectives for future research – were obtained from this evaluation study:
- POS tag-weighted alignments achieve the highest overall mean nDCG for our specific data - however, this is not true for the median nDCG. Since tag-weighting seems to improve the results of only a small subset of queries considerably, our future work will focus on better understanding the structure of queries and the role of outliers.
- Document embeddings show a strong performance. A special appeal of these models lies in their ease of use (assuming a pre-trained model), as they do not rely on additional WSB or WMD mappings.
- In terms of static embeddings, compressed fastText embeddings clearly outperform compressed GloVe and Numberbatch embeddings in our evaluations. Since we used low-dimensional embeddings for this notebook, these results are only a first hint and need to be verified through the use of full (high-dimensional) embeddings.
- Contextual token embeddings extracted from Sentence-BERT, when used as an input to tag-weighted alignments, seem to offer the best overall performance. The msmarco model's result of 98.8% seems to indicate that future research might want to focus on hybrid methods combining high-quality contextual token embeddings with other approaches such as alignments.
While we were able to gather some interesting insights for the specific use case of Shakespearean intertextuality, the main goal of this notebook is to provide an interactive platform that enables other researchers to investigate other parameter combinations as well. The code blocks and widgets above offer many ways to play around with the settings and explore their effects on the ground truth data.
More importantly, researchers can also import their own (English language) data to the notebook and experiment with all of the functions and parameters that are part of the Vectorian API. We hope this notebook provides a low-threshold access to the toolbox of embeddings for other researchers in the field of intertextuality studies and thus adds to a critical reflection of these new methods.
5. Interactive searches with your own data
This section allows you to upload your own text corpora and search in them.
First, specify the text documents you want to search via the upload widget seen below. Note that it expects plain text files with a .txt extension. The upload widget is provided through a helper class called CustomSearch that knows about the embeddings used for searching.
A good source for obtaining some public domain plain text files to search in for this demo is wikisource.org. For example, you can download Charles Darwin's "The Origin of Species" via wikisource. To download other titles, you need to enter the exact name identifier in wikisource. Always make sure that you are getting the plain text format.
search = nbutils.CustomSearch(
[the_embeddings[x] for x in ["numberbatch", "fasttext"]])
search
From the file or files stored in the upload widget above, we now build a Vectorian Session. For this, we need an nlp instance for importing the text documents. Depending on the size and number of documents, the initial processing can take some time.
Once processing has finished, you are presented with the full interactive search interface the Vectorian offers (we have hidden it so far and focused on a subset). Note that in contrast to our earlier experiments, we do not search on the document level by default, but rather on the sentence level - i.e. we split each document into sentences and then check each sentence for whether it contains an occurrence of the given query phrase. You can change this behavior in the "Partition" dropdown.
Use the "Query" edit field and the "Search" button to perform searches.
search.interact(nlp)
6. References
Bamman, David & Crane, Gregory (2008). The logic and discovery of textual allusion. In Proceedings of the 2008 LREC Workshop on Language Technology for Cultural Heritage Data.
Bär, Daniel, Zesch, Torsten & Gurevych, Iryna (2012). Text reuse detection using a composition of text similarity measures. In Proceedings of COLING 2012, p. 167–184.
Büchler, Marco, Geßner, Annette, Berti, Monica & Eckart, Thomas (2013). Measuring the influence of a work by text re-use. Bulletin of the Institute of Classical Studies. Supplement, p. 63–79.
Burghardt, Manuel, Meyer, Selina, Schmidtbauer, Stephanie & Molz, Johannes (2019). “The Bard meets the Doctor” – Computergestützte Identifikation intertextueller Shakespearebezüge in der Science Fiction-Serie Dr. Who. Book of Abstracts, DHd.
Chandrasekaran, Dhivya & Mago, Vijay (2021). Evolution of Semantic Similarity – A Survey. ACM Computing Surveys (CSUR), 54(2), p. 1-37.
Faruqui, Manaal, Tsvetkov, Yulia, Rastogi, Pushpendre & Dyer, Chris (2016). Problems With Evaluation of Word Embeddings Using Word Similarity Tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, p. 30-35.
Forstall, Christopher, Coffee, Neil, Buck, Thomas, Roache, Katherine & Jacobson, Sarah (2015). Modeling the scholars: Detecting intertextuality through enhanced word-level n-gram matching. Digital Scholarship in the Humanities, 30(4), p. 503–515.
Genette, Gérard (1993). Palimpseste. Die Literatur auf zweiter Stufe. Suhrkamp.
Hohl-Trillini, Regula, Burghardt, Manuel, Molz, Johannes, Pichler, Alex, Reiter, Nils, Sulzbacher, Ben & Nantke, Julia (2020). Intertextualität in literarischen Texten und darüber hinaus. Book of Abstracts, DHd 2020, Paderborn.
Kusner, Matt, Sun, Yu, Kolkin, Nicholas & Weinberger, Kilian (2015). From word embeddings to document distances. In International conference on machine learning, p. 957-966.
Liebl, Bernhard & Burghardt, Manuel (2020a). „The Vectorian” – Eine parametrisierbare Suchmaschine für intertextuelle Referenzen. Book of Abstracts, DHd 2020, Paderborn.
Liebl, Bernhard & Burghardt, Manuel (2020b). "Shakespeare in The Vectorian Age" – An Evaluation of Different Word Embeddings and NLP Parameters for the Detection of Shakespeare Quotes. In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH), co-located with COLING 2020.
Manjavacas, Enrique, Long, Brian & Kestemont, Mike (2019). On the feasibility of automated detection of allusive text reuse. Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature.
Mikolov, Tomas, Chen, Kai, Corrado, Greg & Dean, Jeffrey (2013). Efficient estimation of word representations in vector space. In Proceedings of International Conference on Learning Representations (ICLR 2013). arXiv preprint arXiv:1301.3781.
Mikolov, Tomas, Grave, Edouard, Bojanowski, Piotr, Puhrsch, Christian & Joulin, Armand (2018). Advances in pretraining distributed word representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). arXiv preprint arXiv:1712.09405.
Nagoudi, El Moatez Billah & Schwab, Didier (2017). Semantic Similarity of Arabic Sentences with Word Embeddings. In Proceedings of the 3rd Arabic Natural Language Processing Workshop, Association for Computational Linguistics, 2017, p. 18–24.
Pennington, Jeffrey, Socher, Richard & Manning, Christopher D. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), p. 1532-1543.
Reimers, Nils & Gurevych, Iryna (2019). Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
Scheirer, Walter, Forstall, Christopher & Coffee, Neil (2014). The sense of a connection: Automatic tracing of intertextuality by meaning. Digital Scholarship in the Humanities, 31(1), p. 204–217.
Sohangir, Sahar & Wang, Dingding (2017). Document Understanding Using Improved Sqrt-Cosine Similarity. In Proceedings of the 2017 IEEE 11th International Conference on Semantic Computing (ICSC), p. 278-279.
Speer, Robyn, Chin, Joshua & Havasi, Catherine (2017). Conceptnet 5.5: An open multilingual graph of general knowledge. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), p. 4444–4451.
Trillini, Regula (2020). Casual Shakespeare: Three Centuries of Verbal Echoes. Routledge.
Wang, Yuxuan, Hou, Yutai, Che, Wanxiang & Liu, Ting (2020). From static to dynamic word representations: A survey. International Journal of Machine Learning and Cybernetics 11, p. 1611–1630.
Waterman, Michael S., Smith, Temple F. & Beyer, William A. (1976). Some biological sequence metrics. Advances in Mathematics 20(3), p. 367-387.
Werner, Matheus & Laber, Eduardo (2020). Speeding up Word Mover's Distance and its variants via properties of distances between embeddings. In Proceedings of the 24th European Conference on Artificial Intelligence (ECAI 2020), p. 2204-2211.
Wieting, John, Bansal, Mohit, Gimpel, Kevin & Livescu, Karen (2016). Towards universal paraphrastic sentence embeddings. In Proceedings of the 4th International Conference on Learning Representations (ICLR 2016), San Juan, Puerto Rico.