Practical word alignment using off-the-shelf translation LLMs
Sep 29, 2024

This post describes a simple, practical method for extracting word alignments from off-the-shelf, unmodified transformer encoder/decoder translation models. It recovers both the word alignment and the word segmentation that the model has already learned, at virtually no additional training or inference cost.

While many works claim state-of-the-art results on word alignment, such as those surveyed in Li (2022), several factors limit their practical utility:

  • They generally assume the text is already perfectly segmented into words, which is non-trivial for languages written without spaces, such as Chinese.
  • They are usually specific to particular models and languages and require expensive training or fine-tuning of a dedicated alignment model.
  • As with translation, high evaluation scores do not necessarily imply useful alignments in practice.

To see this method in action, visit LLMReader, an AI-powered multilingual reading tool with texts such as the Bible and Le Petit Prince, the two most translated works in the world.

Extracting an alignment matrix from transformer cross-attention

Here are the cross-attention weights of Meta AI's NLLB-200 translation model, visualized with BertViz. In this example, some attention heads in layer 0 have clearly learned to align the English word "light" with the French word "lumière". Note that the model is run with teacher forcing: the reference target tokens are fed to the decoder, so we are not using the model to translate, although we could if desired. This phenomenon is not limited to NLLB-200 and can also be seen in other models such as M2M-100 and MarianMT, which is not surprising given that attention mechanisms were originally developed in close connection with word alignment.
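
For concreteness, here is a minimal sketch of how these weights can be read out with the Hugging Face transformers library, assuming the facebook/nllb-200-distilled-600M checkpoint; the sentence pair is only illustrative:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn", tgt_lang="fra_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Tokenize the source and the reference target. Passing the target as labels
# makes the decoder consume the reference tokens (teacher forcing) instead of
# generating its own translation.
batch = tokenizer("Let there be light.", text_target="Que la lumière soit.", return_tensors="pt")
with torch.no_grad():
    out = model(**batch, output_attentions=True)

# out.cross_attentions holds one tensor per decoder layer,
# each of shape (batch, num_heads, target_len, source_len).
attn = torch.stack(out.cross_attentions).squeeze(1)  # (layers, heads, tgt_len, src_len)
print(attn.shape)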

How can we extract a single alignment matrix from all of the attention layers and heads? Manual selection of attention heads works decently, but an automated approach that generalizes across models would be better, and Kobayashi et al. (2020) point out that the magnitudes of the transformed vectors should also be considered when analyzing attention weights. Here, I use logistic regression with Lasso (L1) regularization, trained on a very small dataset of about 10 Chinese/English sentence pairs, so that only a limited number of attention heads receive nonzero weight. These particular choices are not critical, and many alternatives also work. Training has negligible cost and takes a few seconds. Although only Chinese/English examples were used for training, the resulting alignment matrix also seems to work well for French/English and may generalize to other languages among the 200 supported by NLLB-200. A training example looks like the following:

[
    "Then the Lord rained on Sodom and Gomorrah sulfur and fire from the Lord out of heaven.",
    "当时,耶和华就使硫磺与火,从天上耶和华那里降与所多玛和蛾摩拉;",
    [
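        # 4-tuples include occurrence indices to distinguish repeated words.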
        ("Lord", 0, "耶和华", 0),
        ("Lord", 1, "耶和华", 1),
        ("Sodom", "所多玛"),
        ("Gomorrah", "蛾摩拉"),
        ("sulfur", "硫磺"),
        ("fire", ""),
        ("heaven", "天上"),
    ],
]
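
As a rough illustration of this step (not the exact training code), the following sketch fits an L1-regularized logistic regression over per-head attention weights. It assumes attn is a (layers, heads, tgt_len, src_len) array like the one extracted above and gold_pairs a set of aligned (target_index, source_index) token pairs derived from examples like the one shown; toy random data stands in for both so the snippet runs on its own:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
attn = rng.random((4, 16, 12, 10))     # toy stand-in for (layers, heads, tgt_len, src_len)
gold_pairs = {(0, 0), (3, 2), (7, 5)}  # toy gold alignments as (tgt_idx, src_idx)

layers, heads, tgt_len, src_len = attn.shape

# One training row per (target token, source token) pair,
# one feature per attention head (layer x head).
X = attn.reshape(layers * heads, tgt_len * src_len).T
y = np.array([(t, s) in gold_pairs for t in range(tgt_len) for s in range(src_len)], dtype=int)

# Lasso (L1) regularization drives most head coefficients to zero,
# so only a handful of heads end up contributing.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)

# The predicted probabilities, reshaped, form the combined alignment matrix.
align_matrix = clf.predict_proba(X)[:, 1].reshape(tgt_len, src_len)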

A visualization of the resulting alignment matrix shows that it captures both the alignment and the segmentation of the word 所多玛 (Sodom). Words containing unknown tokens, such as 蛾摩拉 (Gomorrah) shown here, usually seem to be aligned and segmented correctly, but with reduced accuracy.
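
A minimal way to produce such a visualization, assuming the align_matrix from the sketch above (the axis labels are illustrative):

import matplotlib.pyplot as plt

plt.imshow(align_matrix, cmap="Blues", aspect="auto")
plt.xlabel("source (encoder) tokens")
plt.ylabel("target (decoder) tokens")
plt.colorbar(label="alignment weight")
plt.show()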

Since this method can use the alignment to improve the segmentation and vice versa, it may have an advantage over treating the two as separate steps. Despite high accuracy claims, none of the Chinese models provided by the popular spaCy library, including the transformer model, correctly segment sentences like the ones above, which contain unusual names. CKIP Transformers does better, but it is Chinese-specific and still not always correct. This method, on the other hand, requires no separate segmenter at all.
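
For reference, this is how one can inspect a spaCy segmentation of the example sentence, assuming the zh_core_web_trf model is installed; CKIP Transformers can be checked in a similar way:

import spacy

nlp = spacy.load("zh_core_web_trf")
doc = nlp("当时,耶和华就使硫磺与火,从天上耶和华那里降与所多玛和蛾摩拉;")
print([token.text for token in doc])  # see how the unusual names are split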

Generating alignment and segmentation from the matrix

Now that we have an alignment matrix, we can use a greedy algorithm to generate word alignments and segmentations. An obvious approach would be to align the encoder/decoder tokens in descending order of weight until all tokens are aligned. This works decently, but it can be improved with a few heuristics that help produce smoother alignments/segmentations and fix errors in the matrix:

  • Max heuristic: Initially, we only consider the maximum weight edge for each encoder and decoder token.
  • Adjacent heuristic: After selecting an edge, we add adjacent edges to the set under consideration, optionally increasing their weight by raising it to the power of α. Since edge weights lie between 0 and 1, lower α values increase the preference for adjacent tokens; for example, with α = 0.5 a weight of 0.25 becomes 0.5.
  • Gap heuristic: Given tokens A B C, if A and C are already aligned, align B as well.

When viewing an alignment in LLMReader in debug mode, the letters M, A, and G are used to indicate the heuristic used for each alignment. You can also play with the confidence level threshold using a slider.

Here is a description of the algorithm; a simplified Python sketch follows the list below. It can be implemented efficiently, much like Kruskal's algorithm for minimum spanning forests, using a disjoint-set data structure together with a heap or binary search tree. Its cost is negligible compared to model inference.

  1. Add the maximum-weight edge for each encoder/decoder token to a candidate set.
  2. While there are still unaligned tokens:
    1. Select the highest-weight edge from the set whose encoder or decoder token is not yet aligned, and add it to the alignment.
    2. Add its adjacent edges to the set, optionally boosting their weight.
    3. Apply the gap heuristic and add alignments where it applies.
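
The following is a simplified, illustrative sketch of this procedure (not the exact implementation used by LLMReader): a plain heap stands in for the candidate set, the gap heuristic is applied only on the target side, and the disjoint-set bookkeeping for grouping tokens into words is omitted for brevity:

import heapq

def greedy_align(matrix, alpha=0.7):
    """matrix[t][s] is the alignment weight between decoder token t and encoder token s."""
    tgt_len, src_len = len(matrix), len(matrix[0])
    alignment = set()                      # chosen (t, s) edges
    tgt_done = [False] * tgt_len
    src_done = [False] * src_len
    heap = []                              # max-heap via negated weights

    def push(weight, t, s):
        heapq.heappush(heap, (-weight, t, s))

    # 1. Max heuristic: seed the candidate set with the best edge for every token.
    for t in range(tgt_len):
        s_best = max(range(src_len), key=lambda s: matrix[t][s])
        push(matrix[t][s_best], t, s_best)
    for s in range(src_len):
        t_best = max(range(tgt_len), key=lambda t: matrix[t][s])
        push(matrix[t_best][s], t_best, s)

    # 2. Greedily take the strongest candidate that still covers an unaligned token.
    while heap and not (all(tgt_done) and all(src_done)):
        _, t, s = heapq.heappop(heap)
        if tgt_done[t] and src_done[s]:
            continue
        alignment.add((t, s))
        tgt_done[t], src_done[s] = True, True
        # Adjacent heuristic: neighbours of a chosen edge become candidates,
        # boosted by raising their weight to the power alpha (alpha < 1 boosts,
        # since weights lie between 0 and 1).
        for dt, ds in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nt, ns = t + dt, s + ds
            if 0 <= nt < tgt_len and 0 <= ns < src_len:
                push(matrix[nt][ns] ** alpha, nt, ns)

    # 3. Gap heuristic (target side only here): if the tokens on both sides of
    #    an unaligned token are aligned to the same source token, align it too.
    for t in range(1, tgt_len - 1):
        if not tgt_done[t]:
            left = {s for (tt, s) in alignment if tt == t - 1}
            right = {s for (tt, s) in alignment if tt == t + 1}
            for s in left & right:
                alignment.add((t, s))
                tgt_done[t] = True
    return alignment

# Tiny hand-made example matrix (3 decoder tokens x 3 encoder tokens).
example = [
    [0.9, 0.1, 0.0],
    [0.1, 0.2, 0.3],
    [0.0, 0.1, 0.8],
]
print(sorted(greedy_align(example)))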

While this algorithm produces both alignments and segmentations on its own, it can also incorporate known segmentation information by enforcing restrictions on the alignment. The alignments shown in LLMReader were generated from raw, unsegmented text.

References

Kobayashi, G., Kuribayashi, T., Yokoi, S., & Inui, K. (2020). Attention is Not Only a Weight: Analyzing Transformers with Vector Norms. In B. Webber, T. Cohn, Y. He, & Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 7057–7075). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.574
Li, B. (2022). Word Alignment in the Era of Deep Learning: A Tutorial. https://arxiv.org/abs/2212.00138