This post describes a simple, practical method for extracting word alignment from off-the-shelf, unmodified transformer encoder/decoder models. It recovers both the word alignment and the segmentation the model has already learned, with virtually no additional training or inference cost.
While many works claim state-of-the-art results on word alignment, such as those described in Li (2022), several factors limit their practical utility:
To see this method in action, visit LLMReader, an AI-powered multilingual reading tool with texts such as the Bible and Le Petit Prince, the two most translated works in the world.
Here are the cross-attention weights of an NLLB-200 translation model by Meta AI, visualized with BertViz. In this example, some attention heads in layer 0 have clearly learned to align the English word "light" with the French word "lumière". Note that the model is run in teacher-forcing mode: it is fed the reference target tokens rather than being asked to translate, although we could also use it to translate if desired. This phenomenon is not limited to NLLB-200 and can be seen in other models such as M2M-100 and MarianMT, which is not surprising given that the development of attention mechanisms was closely related to word alignment.
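If you want to inspect these weights yourself, here is a minimal sketch using Hugging Face transformers; the checkpoint name, language codes, and example sentences are only illustrative, and any encoder/decoder checkpoint that exposes cross-attention should behave similarly.

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="eng_Latn", tgt_lang="fra_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

src = "Let there be light."
tgt = "Que la lumière soit."
batch = tokenizer(src, text_target=tgt, return_tensors="pt")

with torch.no_grad():
    # Passing the reference target tokens as labels makes the decoder consume
    # them directly (teacher forcing); no generation is performed.
    out = model(**batch, output_attentions=True)

# Tuple with one tensor per decoder layer, each shaped
# (batch, num_heads, target_len, source_len).
cross_attn = out.cross_attentions
print(len(cross_attn), cross_attn[0].shape)

Passing labels instead of calling generate is what gives the teacher-forcing behavior described above.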
How can we extract a single alignment matrix from all of the attention layers and heads? Manually selecting attention heads works decently, but an automated approach that generalizes across models would be better; moreover, Kobayashi et al. (2020) point out that the magnitudes of the transformed vectors should be considered when analyzing attention weights. Here, I use logistic regression with a very small dataset of about 10 Chinese/English sentence pairs and Lasso (L1) regularization so that only a limited number of attention heads are used. These particular choices are not critical, and many other setups work as well. The training has negligible cost and takes a few seconds. Although only Chinese/English examples were used for training, the learned combination of heads also seems to work well for French/English and may generalize to other languages among the 200 supported by NLLB-200. A training example looks like the following:
[
    "Then the Lord rained on Sodom and Gomorrah sulfur and fire from the Lord out of heaven.",
    "当时,耶和华就使硫磺与火,从天上耶和华那里降与所多玛和蛾摩拉;",
    [
        ("Lord", 0, "耶和华", 0),
        ("Lord", 1, "耶和华", 1),
        ("Sodom", "所多玛"),
        ("Gomorrah", "蛾摩拉"),
        ("sulfur", "硫磺"),
        ("fire", "火"),
        ("heaven", "天上"),
    ],
]
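For concreteness, here is a sketch of how such a classifier could be set up with scikit-learn; the feature construction, function names, and hyperparameters are illustrative rather than the exact code behind LLMReader.

import numpy as np
from sklearn.linear_model import LogisticRegression

def head_features(cross_attn):
    # Flatten every (layer, head) attention weight into one feature vector per
    # (target token, source token) pair.
    layers = [a[0].detach().numpy() for a in cross_attn]   # each: (heads, tgt, src)
    stacked = np.concatenate(layers, axis=0)               # (layers*heads, tgt, src)
    return stacked.reshape(stacked.shape[0], -1).T         # (tgt*src, layers*heads)

def fit_head_combiner(X, y):
    # The Lasso (L1) penalty drives most head coefficients to zero, so the
    # combined score effectively uses only a handful of heads.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    clf.fit(X, y)
    return clf

# X is built by stacking head_features(...) over the ~10 labelled sentence
# pairs, and y marks whether each token pair belongs to a gold alignment.
# The fitted model then scores any new sentence pair:
#   probs = clf.predict_proba(head_features(attn))[:, 1].reshape(tgt_len, src_len)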
A visualization of the resulting alignment matrix shows that it captures both the alignment and the segmentation of the word 所多玛 (Sodom). Words with unknown tokens, such as 蛾摩拉 (Gomorrah) shown here, usually still seem to be aligned and segmented correctly, though with reduced accuracy.
Since this method can use the alignment to improve the segmentation and vice versa, it may have an advantage over pipelines that treat them as separate steps. Despite high accuracy claims, none of the Chinese models provided by the popular spaCy library, including the transformer model, can correctly segment sentences like the ones above with unusual names. CKIP Transformers does better, but it is Chinese-specific and still not always correct. This method, on the other hand, does not require a separate segmenter at all.
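If you want to check how a standalone segmenter handles such a sentence, a quick way is to run spaCy's Chinese transformer pipeline on it (installed separately with python -m spacy download zh_core_web_trf):

import spacy

nlp = spacy.load("zh_core_web_trf")
doc = nlp("当时,耶和华就使硫磺与火,从天上耶和华那里降与所多玛和蛾摩拉;")
# Check whether names like 所多玛 and 蛾摩拉 come out as single tokens.
print([token.text for token in doc])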
Now that we have an alignment matrix, we can use a greedy algorithm to generate word alignments and segmentations. An obvious approach would be to align the encoder/decoder tokens in descending order of weight until all tokens are aligned. This works decently, but it can be improved with a few heuristics that help produce smoother alignments/segmentations and fix errors in the matrix:
When viewing an alignment in LLMReader in debug mode, the letters M, A, and G are used to indicate the heuristic used for each alignment. You can also play with the confidence level threshold using a slider.
Here is a description of the algorithm. It can be implemented efficiently, in a manner similar to Kruskal's algorithm for minimum spanning forests, using a disjoint-set data structure together with a heap or binary search tree. The cost of this algorithm is negligible compared to model inference.
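Below is a minimal sketch of the basic greedy pass without the extra heuristics; the names and structure are illustrative, not the exact LLMReader implementation. Like Kruskal's algorithm, it processes token pairs in descending weight order, and the disjoint-set clusters double as the segmentation.

import heapq

class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def greedy_align(matrix, threshold=0.0, ds=None):
    # matrix: (target_len, source_len) alignment weights.
    tgt_len, src_len = len(matrix), len(matrix[0])
    if ds is None:
        # Source tokens occupy ids 0..src_len-1; target tokens follow.
        ds = DisjointSet(src_len + tgt_len)
    heap = [(-matrix[t][s], t, s) for t in range(tgt_len) for s in range(src_len)]
    heapq.heapify(heap)
    links = []
    uncovered = set(range(src_len + tgt_len))
    while heap and uncovered:
        neg_w, t, s = heapq.heappop(heap)
        if -neg_w < threshold:
            break
        links.append((t, s, -neg_w))       # align target token t with source token s
        ds.union(s, src_len + t)           # merge their clusters (segmentation)
        uncovered.discard(s)
        uncovered.discard(src_len + t)
    return links, ds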
While this algorithm produces both alignments and segmentations on its own, it can also incorporate segmentation information, if known, by restricting which alignments are allowed. The alignments shown in LLMReader were generated from raw, unsegmented text.
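One illustrative way to fold a known source-side segmentation into the sketch above (this is an assumption, not necessarily how LLMReader enforces it) is to pre-union the tokens of each known word before the greedy pass, so they always end up in the same cluster:

def seed_known_words(ds, word_spans):
    # word_spans: list of (start, end) source-token index ranges, end exclusive.
    for start, end in word_spans:
        for i in range(start + 1, end):
            ds.union(start, i)

# Usage with the sketch above:
#   ds = DisjointSet(src_len + tgt_len)
#   seed_known_words(ds, known_spans)
#   links, ds = greedy_align(matrix, ds=ds)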