Multilingual Text Representation

Summary

This survey explores the evolution of multilingual text representation in NLP, tracing progress from early word-frequency models to modern transformer-based systems. Initially, models like n-grams and neural networks (e.g., RNNs, CNNs) focused on monolingual tasks. The transformer architecture, with its attention mechanism, enabled shared representation spaces across languages, leading to models like mBERT and XLM-R, which support over 100 languages. These models use pre-training on monolingual and parallel corpora, often with objectives like masked language modeling (MLM) and translation language modeling (TLM). Challenges remain, including capacity dilution when adding more languages, uneven performance across low- and high-resource languages, and the need for equitable representation. Tokenization is a key issue, with subword methods like BPE and SentencePiece widely used but struggling with low-resource languages. Open problems include developing tokenization-free models, improving adaptability to new languages, and enhancing model interpretability and fairness. The survey highlights the need for research to address these gaps and achieve truly inclusive multilingual NLP.

PDF viewer

Chunks(40)

Chunk 0 · 1,990 chars

arXiv:2309.00949v1 [cs.CL] 2 Sep 2023
1
Multilingual Text Representation
Fahim Faisal
Department of Computer Science
George Mason University
ffaisal@gmu.edu
Abstract—Modern NLP breakthrough includes large multi-
lingual models capable of performing tasks across more than
100 languages. State-of-the-art language models came a long
way, starting from the simple one-hot representation of words
capable of performing tasks like natural language understanding,
common-sense reasoning, or question-answering, thus capturing
both the syntax and semantics of texts. At the same time, language
models are expanding beyond our known language boundary,
even competitively performing over very low-resource dialects
of endangered languages. However, there are still problems to
solve to ensure an equitable representation of texts through a
unified modeling space across language and speakers. In this
survey, we shed light on this iterative progression of multilingual
text representation and discuss the driving factors that ultimately
led to the current state-of-the-art. Subsequently, we discuss how
the full potential of language democratization could be obtained,
reaching beyond the known limits and what is the scope of
improvement in that space.
Index Terms—NLP, Multilinguality, Language Model
I. INTRODUCTION
Natural language processing (NLP) primarily involves mak-
ing linguistic-specific applications for machines to understand
language. Earlier days of NLP development mainly focused
on the idea of distributional hypothesis [1] that is, "words
occurring in the same context tend to have a similar meaning
or closely related meaning". From there, NLP has come a
long way in the modeling language. The tasks NLP tries
to solve are complex and multidimensional if we put them
into the perspective of the machine and numerical mapping.
For example, there are multiple dynamics to deal with here,
like various languages and dialects to consider and tasks
to solve with different levels of

Chunk 1 · 1,984 chars

g way in the modeling language. The tasks NLP tries
to solve are complex and multidimensional if we put them
into the perspective of the machine and numerical mapping.
For example, there are multiple dynamics to deal with here,
like various languages and dialects to consider and tasks
to solve with different levels of granularity. Combining all
these dynamics in a unified representation space is a complex
problem. The base starting point could be just words, as the
word is one such unit of language that is quite universal.
Following this thought, the most straightforward idea could
be to compute the word frequency, thus constructing a count-
based numerical mapping that we can think of as a compu-
tationally feasible language representation. N-gram language
models are representations that have led the domain of NLP for
a substantial time. Later on, advancing over the distributional
scheme of language representation coupled with the availabil-
ity of huge computation resources, neural models became
the go-to approach for all kinds of NLP tasks. Researchers
investigated different theories and directions in this domain
of neural language modeling before transformer-based neural
networks revolutionized NLP. The transformer-based model
provides exceptional text representation utilizing the multi-
layered encoder blocks [2]. This is useful as text gets different
meanings based on how it is used in a context. In addition,
multiple languages can share a single representation space
using transformers. This effectively led to the path for multilin-
gual text representation, where data collected from languages
existing all over the world can be accumulated in a single
setting, and models can learn and perform actionable inference
on a wide variety of tasks comprising language and dialectal
varieties. Though a monolingual or region-focused transformer
still vastly outperforms a more generalized multilingual model
on most tasks, it is not always feasible to train

Chunk 2 · 1,985 chars

umulated in a single
setting, and models can learn and perform actionable inference
on a wide variety of tasks comprising language and dialectal
varieties. Though a monolingual or region-focused transformer
still vastly outperforms a more generalized multilingual model
on most tasks, it is not always feasible to train multiple
versions of the domain-focused model. The idea is to make
a shared representation space that effectively works for many
languages, while the resource scarcity of specific languages
should benefit from other high-resource languages. mBERT [3]
and xlm-r [4] trained with multiple objective functions on more
than hundreds of languages came a long way in achieving this
vision. However, the full potential of a unified multilingual
text representation is still a significant research problem to
solve. Because, quite regularly, the inclusion of new tasks and
languages in the modeling paradigm points out the fact that,
when these models move beyond the monolingual scheme, the
total capacity of the model gets distributed across languages,
thus often resulting in capacity dilution [5]. An ideal scenario
would be to have no amount of negative interference, such that
we get an equitable performance across languages. Another
important direction for multilingual models is to ensure the
easy-expand-ability to new languages and adaptability to new
tasks.
Keeping all these advanced development of multilingual text
representation in context, in this survey, we provide insight
into the open problems and questions to look for. In addition,
we discuss how the text representation model starting from
the count-based vector representation of words, came to the
point of a multilingual text representation model capable of
performing across more than 100 languages. We structure the
contents based on the following contexts: (1) How did the text
representation model make the iterative progress, and which
were the driving factors in each step? (2) What are the

Chunk 3 · 1,989 chars

words, came to the
point of a multilingual text representation model capable of
performing across more than 100 languages. We structure the
contents based on the following contexts: (1) How did the text
representation model make the iterative progress, and which
were the driving factors in each step? (2) What are the primary
building blocks of a unified multilingual text representation
model, and how do they vary given the difference in scenarios?
(3) What are the current barriers that limit the progression
of full-scale multilingual text representation? (4) The fairness
and interpretability of currently available models and how
equitable they serve the intended user’s utility.

-- 1 of 11 --

2
II. PRELIMINARIES
A. Terminologies
a) Tokenization: : A process of transforming the text into
tokens of words is generally known as tokenization [6]. Not
necessarily; it needs to be in words, but it can be divided into
words, symbols, phrases, or sub-word tokens.
b) Word segmentation: Separating the phrase, content,
and keywords. Other steps include stemming, lemmatizing,
handling negation, and separating punctuation.
c) Embedding: Embedding is a relatively low-
dimensional space where we can place the transformed
high-dimensional vectors. Word embedding generally means
a type of embedding space where the close-meaning words
would be grouped, maintaining a close distance.
d) Attention: We can interpret this mechanism as the
vector of important weights [7]. An attention vector provides
insight into how words in a sentence might be correlated with
other words in that sentence. In addition, it determines the
most useful blocks of the input in terms of transferring the
contextual insights to actionable information for the target
output. The attention mechanism mitigates the long-lasting
disadvantage of fixed-length vectors in sequence-to-sequence
models: the incapability of remembering long sentences.
B. Early days of Multilingual Word Representation
Multilingual Word

Chunk 4 · 1,999 chars

ansferring the
contextual insights to actionable information for the target
output. The attention mechanism mitigates the long-lasting
disadvantage of fixed-length vectors in sequence-to-sequence
models: the incapability of remembering long sentences.
B. Early days of Multilingual Word Representation
Multilingual Word representation involves representing
words from different languages through a shared space. The
bilingual word embedding model [8] is one of the earlier works
that uses machine translation word alignment and embeds
words from the source and target language into a single
vector space. During training, it imposes constraints over
the distance based on the translation pairs of two languages.
Later on, [9] proposed an approach that can learn bilingual
word embedding given the lexicon of bilingual word pairs.
It performs a linear transformation to transform the source
language vectors to the target language vector space. However,
when provided a lexicon of small size, this model performs
poorly. To tackle this issue, [10] introduced a matching score
mechanism along with the original bilingual lexicon pairs to
bring the embeddings closer.
C. Neural Language Model Basics
In neural language modeling (NLM), we train a probabilistic
classifier to predict a probability distribution where the condi-
tional probability of selecting word wi is learned using various
kinds of neural networks (e.g., feed-forward, recurrent, etc.):
P (s) =
l	∏
i=1
P (wi|wi−1
1 )
a) Feed-forward NLM:: A feed-forward NLM
(FFN) [11] adopts the notion of an n-gram language
model by assuming each word in a sequence depends
on the words closer to it statistically though it fails to
consider the long-term dependency. Instead of considering
the dependence of the whole previous sequence, a context
window (i − n + 1 to i − 1) is used for better approximation:
P (wi|W i−1
1 = P (wi|W i−1
i−n+1). The context word sequence
x = [wi−n+1..., wi−1] is fed into a FFN, and later, a softmax
layer over the

Chunk 5 · 1,998 chars

to
consider the long-term dependency. Instead of considering
the dependence of the whole previous sequence, a context
window (i − n + 1 to i − 1) is used for better approximation:
P (wi|W i−1
1 = P (wi|W i−1
i−n+1). The context word sequence
x = [wi−n+1..., wi−1] is fed into a FFN, and later, a softmax
layer over the final output matrix is used to get the output
probability of P (wi|W i−1
i−n ).
b) Convolutional NLM:: This one enhances the ffn by
injecting a CNN layer on top of the input representation [12],
which involves a sliding window of the input vectors centered
on each word vector and later on, performing max-pooling on
it.
c) Recurrent NLM: Recurrent Neural Network (RNN)
based LM [13] addresses the issue of long-term dependency
problem. At every time step of RNN, the input is just the
previous word vector instead of the concatenated vectors of
n previous words. Meanwhile, the information of all previous
words is preserved by the internal state of RNN. The most
common RNN types are LSTM [14] and GRU [15]. The key
problem with RNN-based language models is falling into the
vanishing gradient (taking the multiplication of a large number
of derivatives eventually results in a value close to zero, which
can further not be used in error function calculation). This
failure leads to the problem of not capturing the dependencies
among words in a long sentence as the amount of computation
using RNN increases when the distance among words increase
in a sentence.
d) Transformer Language Models: This is a non-
recurrent encoder-decoder architecture with a series of
attention-based blocks. An encoder prepares a contextual
representation of the given input, whereas the decoder can
generate output based on the output segments that are already
generated. The attention mechanism is at the core of the Trans-
former architecture [2]. An encoder-decoder-based transformer
contains three types of attention units facilitated with the help
of queries, keys, and values.
Key K: This

Chunk 6 · 1,996 chars

whereas the decoder can
generate output based on the output segments that are already
generated. The attention mechanism is at the core of the Trans-
former architecture [2]. An encoder-decoder-based transformer
contains three types of attention units facilitated with the help
of queries, keys, and values.
Key K: This is a label of a word and is used to distinguish
between different words.
Query Q: It represents an active request that checks all the
keys and selects the one that matches the request.
Value V : A value is always paired with a key, and when
the query selects a key, its value is the one that propagates. A
value is an information a word contains.
When k, Q, &V all get generated from the same source
sequence, it is self-attention (exist one in the encoder, one
in the decoder). When they come from different sequences,
they form a cross-attention mechanism (happens in between
encoder-decoder interactions). The other type is the scaled-dot
product attention mechanism that performs dot product calcu-
lation between Q and K matrices. Before that, the k, Q, &V
are calculated by the matrix multiplication with learned word
vectors. The scaled-dot product attention then ensures that we
select more information from the values where the key and
the query are more similar. On top of these mechanisms, we
have multi-head attention where multiple scaled dot product
attentions run in parallel, thus helping the network to attend
to multiple pieces of information simultaneously. In addition,
there is no recurrent element in transformers. So, to learn
the position information of each word, a combination of sine
and cosine waves of different frequencies is used, as each
position would have a unique combination of values. There

-- 2 of 11 --

3
are other essential segments like residual connections and layer
normalization. Residual or skip connection adds the input to
the output after a layer, allowing the gradients to flow directly
through the network. Whereas the layer

Chunk 7 · 1,999 chars

sed, as each
position would have a unique combination of values. There

-- 2 of 11 --

3
are other essential segments like residual connections and layer
normalization. Residual or skip connection adds the input to
the output after a layer, allowing the gradients to flow directly
through the network. Whereas the layer normalization keeps
the mean of each training sample close to 0 and the standard
deviation close to 1, thus stabilizing the training as well as
reducing the training time.
In practice, a layer from the encoder contains multi-head at-
tention sub-layers and a position-wise feed-forward sub-layer.
The sub-layers are connected using the residual connections.
The multi-head attention sub-layer contains several attention
heads. A head is a scaled dot product attention structure taking
the query matrix Q, the key matrix k, and the value matrix V .
So the output,
Attention(Q, K, V ) = Sof tmax( QKT
√dk
)V
M ultilhead(Q, K, V ) = [head1, head2, ..headh]W 0
where, headi = Attention(QW Q
i , KW K
i , V W V
i )
and W Q
i = linear projections,
and dk = is the dimension of the query matrix.
The fully connected position-wise feed-forward sub-layers
contain two linear transformations with RELU:
F F N (x) = W2max(0, W1x + b1) + b2
Transformers are better for dealing with long-term dependency
than RNNs. Originally it was proposed to solve the problem
of machine translation, but later on, it became the backbone
of all kinds of NLP applications.
D. Multilingual Language Modeling
A general Multilingual Language Model (MLLM) archi-
tecture contains an input layer, transformer layer, and output
layer.
• input layer: sequence of tokens from a representation of
one-hot subword token vocabulary (concatenated from all
languages).
• transformer layer: each layer contains k attention heads,
followed by a feed-forward neural network.
• output layer: contextual representation for each token. It
contains a simple linear transformation followed by a
Softmax that takes the last layer

Chunk 8 · 1,999 chars

one-hot subword token vocabulary (concatenated from all
languages).
• transformer layer: each layer contains k attention heads,
followed by a feed-forward neural network.
• output layer: contextual representation for each token. It
contains a simple linear transformation followed by a
Softmax that takes the last layer token representation and
produces the probability distribution.
III. AREA TAXONOMY
The multilingual text representation area taxonomy is pre-
sented in Figure 1. Here we divide the taxonomy into certain
primary parts, thus representing the building blocks of text
representation from different perspectives. For example, Repre-
sentation Type defines ways of representing text starting from
the classical models to recent Transformer breakthroughs. Fur-
thermore, we can divide a standard NLP training paradigm into
parts like pre-training and fine-tuning. We believe tokenization,
being a step from text preprocessing, deserves a separate
branch of discussion because of its impact on multilingual
settings. In the next section, we report a detailed discussion
of the existing research in every direction of this taxonomy.
IV. TAXONOMY-BASED SURVEY
A. Representation Type
The simplest form of text representation would be the word-
count-based one that fully relies on statistical information.
From there, text representation has advanced to a label, where
we can represent texts from different languages in a shared
space. The progression is described in detail here.
a) Classical Models: Word frequency is the basis of the
classical text representation model used in earlier days. We
can divide these models into two parts: (1) Categorical and
(2) Weighted. One-hot-encoding and Bag-of-Words (BoW) are
the categorical models. In one-hot encoding, the dimension to
represent texts is equal to the terms present in the vocabulary,
where binary values are used to define the presence or absence
of a particular term. Whereas, BoW is just an updated one-
hot-encoding where we sum

Chunk 9 · 1,995 chars

One-hot-encoding and Bag-of-Words (BoW) are
the categorical models. In one-hot encoding, the dimension to
represent texts is equal to the terms present in the vocabulary,
where binary values are used to define the presence or absence
of a particular term. Whereas, BoW is just an updated one-
hot-encoding where we sum all the one-hot-representation
of words in a sentence. However, these categorical models
fail to capture semantic relations and the order of the words.
A weighted text representation model known as Tf-Idf was
introduced to solve this problem. A Tf (i.e., term-frequency)
matrix just divides the word count by the length of a document,
thus identifying how often a word occurs in a document.
Whereas, Idf (i.e., inverse document frequency) matrix tries to
reduce the effect of common words by putting more weights
on the critical words (words that are not equally frequent in
all documents, like stop words).
b) Continuous Representation: One popular approach to
represent text is to present it as vectors where each dimension
corresponds to the frequency of words, thus resulting in a
word vector. The pitfall is that the vector length might be
substantial depending on the vocabulary size. In that case,
adopting a dimensionality reduction procedure becomes a de-
fault choice. Though these reduced vectors might be compact
and efficient to compute, they contain less of the original
information. Moreover, the individual dimensions no longer
preserve the interpretable features that could be mapped back
to the original textual building blocks. In one way, the context
gets distributed throughout the vector length, thus making it
a distributed representation of continuous values. In this dis-
tributed/continuous representation, each dimension of a word
type vector becomes a parameter to be learned and optimized
based on the observable patterns in the data. We can see these
parameters as continuous values that can be learned using a
continuous objective function using

Chunk 10 · 1,997 chars

tion of continuous values. In this dis-
tributed/continuous representation, each dimension of a word
type vector becomes a parameter to be learned and optimized
based on the observable patterns in the data. We can see these
parameters as continuous values that can be learned using a
continuous objective function using iterative algorithms like
gradient descent. Word2vec [35] is one such distributed vector
representation of text. Word2vec considers similar meaning
words like "small" and "smaller" comparatively closer in the
vector space. There are two types of word2vec algorithms
in practice: (1) Continuous bag of words (CBOW) (2) Skip-
gram. In CBOW, context is considered the input. The neural
network tries to correlate the weight matrix with each word,
thus improving the representation of words through backprop-
agation of the error gradient. In skip-gram, the context is
estimated based on the given the word. However, both CBOW
and Skip-gram were very time-consuming in practice. To
solve this issue, Hierarchical softmax and negative sampling
approaches were introduced. Negative sampling restricts the

-- 3 of 11 --

4
Taxonomy
Evaluation xtreme [16]
Tokenization
Subword
Segmentation
wordpiece [17]
bpe [18]
unigram [19]
sentencepiece [20]
Character-
Based
[21], [22], [23]
Tasks
Semantic multi-nli [24]
xnli [25]
Syntactic pos [26] dep [26]
ner [27]
Data and
Language Pretraining Data
Parallel [4]
Monolingual [3], [28]
Language
LM Objectives
Monolingual mlm [3]
clm [28]
nsp [3]
Parallel tlm [29]
Training
Fine-tuning
Multi-Task [30], [28]
Parameter-
Effieient
adapters [31]
Gradient-Based bert [3]
Pre-training
Encoder-Decoder T5 [30]
Decoder Only gpt [28]
bloom [32]
mgpt [33]
Encoder Only bert [3]
xlm-r [4]
Representation
Type
Contextual elmo [34]
mbert [3] gpt [28]
Continuous word2vec [35]
fasttext [36]
glove [37]
Classical one-hot
tf-idf [38]
Fig. 1. Area Taxonomy of Multilingual Text Representation
output sum so that only a subset of the vectors get updated in
each

Chunk 11 · 1,996 chars

bloom [32]
mgpt [33]
Encoder Only bert [3]
xlm-r [4]
Representation
Type
Contextual elmo [34]
mbert [3] gpt [28]
Continuous word2vec [35]
fasttext [36]
glove [37]
Classical one-hot
tf-idf [38]
Fig. 1. Area Taxonomy of Multilingual Text Representation
output sum so that only a subset of the vectors get updated in
each step, whereas, in hierarchical softmax, words are chosen
based on their count-based conditional probabilities. Later on,
GLOVE [37] improves significantly on word2vec by using
the global contextual information by constructing the global
co-occurrence matrix and factorizing it later on. However,
these distributed representation methods failed to consider the
out-of-vocabulary (OOV) words. This is when Fasttext [36]
was proposed. Fasttext breaks the words to n-gram instead of
using the full-word representation at once, thus solving the
OOV issue. These models can extract syntactic and semantic
information while dealing with specific corner cases. However,
there is this existing issue of keeping the full context-specific
representation of a document because understanding the actual
context is required for most downstream tasks in NLP.
c) Contextual Word Vectors: The weakness of the con-
tinuous word representation model was to failing to capture
the global context information on a low-dimensional scale. To
solve this issue, contextual word representation models like
Context2vec [39], Cove [40] and ELMO [41] were proposed
later on. These advanced models solved many of the existing
issues but still face catastrophic forgetting. Context2vec model
is based on Word2vec’s CBOW model but replaces its average
word representation within a fixed window with a better and

-- 4 of 11 --

5
more powerful Bi-directional LSTM neural network. Whereas
Cove uses machine translation instead of the approach used
in Word2vec (skip-gram or CBOW) or GloVe (Matrix fac-
torization). They pre-train a two-layer BiLSTM for attending
sequence-to-sequence translation, starting from

Chunk 12 · 1,996 chars

dow with a better and

-- 4 of 11 --

5
more powerful Bi-directional LSTM neural network. Whereas
Cove uses machine translation instead of the approach used
in Word2vec (skip-gram or CBOW) or GloVe (Matrix fac-
torization). They pre-train a two-layer BiLSTM for attending
sequence-to-sequence translation, starting from GloVe word
vectors. Then they took the output of the sequence encoder and
called it a CoVe, followed by combining it with GloVe vectors,
and used it in a downstream task-specific mode using transfer
learning. On top of these models, ELMO (embeddings from
language model) [34] was the one to successfully advance the
contextualization of word vectors. The key idea was: while a
word token will have its vector, this vector needs to depend on
the nearby word contexts. This is similar to the distributional
hypothesis [1]. However, unlike word-type vectors, these word
token vectors combine the word-level vectors with neural
network parameters going beyond the lookup table of word-
type vectors. ELMO contains two neural networks: One for the
left context (start of the sentence to the token) and another for
the right context (from the token to the end of the sentence).
It uses a recurrent neural network that was the most advanced
neural network at that time and was a novel thing to introduce
in language modelling. Later on, ULMFiT [42] successfully
adopt the concept of transfer-learning in the contextual text
representation. It segments the end-to-end process into three
steps then: (i) General LM pre-training, (ii) Target task LM
fine-tuning, and (iii) Target task classifier fine-tuning. Then
finally, the transformer-based fine-tuning appeared in the pic-
ture which proved to be more efficient and faster than LSTM
or CNN for language modelling.
d) Transformer Based Language Models: Transformer-
based language models are more efficient than the ones with
LSTM or CNN for language modeling. The first model to
efficiently use the transformer architecture was GPT

Chunk 13 · 1,987 chars

n the pic-
ture which proved to be more efficient and faster than LSTM
or CNN for language modelling.
d) Transformer Based Language Models: Transformer-
based language models are more efficient than the ones with
LSTM or CNN for language modeling. The first model to
efficiently use the transformer architecture was GPT [28]
which is a decoder-based model. GPT was trained with Causal
Language Modeling. The next one to efficiently represent the
semantics and context was BERT [3]: Bidirectional Encoder
Representations from Transformers. BERT uses a parallel
attention layer instead of using sequential recurrent layers. It
is an encoder model that uses masked language modelling
(MLM) and next-sentence prediction (nsp) objectives in a self-
supervised manner for being trained on a large pool of textual
data. Here, the next sentence prediction is used to collect long-
term or pragmatic information.
B. Training
We can largely divide the current practice of transformer-
based model training into pre-training and fine-tuning. Pre-
training is the training part of preparing a base language model
that can further be transferred and adapted to any downstream
task through fine-tuning.
1) Pre-training: Pre-training is a common step in making
large language models in the current NLP paradigm. Pre-
training means training a language model on extensive textual
data in a self-supervised manner. Self-supervised because the
pre-training objective looks for the data labels to predict,
which are automatically contracted from the data itself (e.g.,
masked language modeling, next sentence prediction). Here
we discuss some common types of pretrained transformer-
based models and their specific approaches to performing the
training.
a) Decoder only: GPT [28] is the most common exam-
ple of the decoder-only model. It uses the causal language
modeling objective. Other advanced decoder only models
are OPT [43], BLOOM [32], mGPT [33]. Recently, these
decoder-only models are also going

Chunk 14 · 1,995 chars

d models and their specific approaches to performing the
training.
a) Decoder only: GPT [28] is the most common exam-
ple of the decoder-only model. It uses the causal language
modeling objective. Other advanced decoder only models
are OPT [43], BLOOM [32], mGPT [33]. Recently, these
decoder-only models are also going multilingual (e.g., mGPT,
BLOOM) and increasing the span of their parameters.
b) Encoder Only: The most common example of encoder
only models are bert [3]. It uses masked language modeling
(MLM) and next sentence prediction (NSP) objectives to
train on large monolingual texts. Several interesting works
investigate improving the self-supervised pre-training objective
of encoder-only models. For example, in [44], the authors
find out that using dynamic masking during training(randomly
masking 15% of token each step) time instead of static
masking improves the performance. In addition, they show that
NSP can be dropped completely from the training objective to
get better performance.
c) Encoder-Decoder: T5 [30] is a encoder-decoder
model. The objective of T5 is closely related to the MLM
and word dropout techniques. The difference with the original
MLM is the consecutive span of corrupted or dropped tokens
is replaced with one single sentinel token. The output sequence
contains these dropped-out spans delimited by the sentinel
tokens that replace the original text. Using this technique
of denoising sequence-to-sequence objective, the decoder can
predict the span of tokens in the masked position instead of
just a single token.
2) Fine-tuning: Fine-tuning means adopting a pre-trained
model for any downstream task. Previously, fine-tuning was
only used to indicate the gradient-based training of the com-
plete model. Here, we will consider any updates that make
a base model further usable for any downstream task as fine-
tuning.
a) Gradient-based: The gradient-based finetuning means
producing a whole set of new parameters by training the entire
model and

Chunk 15 · 1,998 chars

as
only used to indicate the gradient-based training of the com-
plete model. Here, we will consider any updates that make
a base model further usable for any downstream task as fine-
tuning.
a) Gradient-based: The gradient-based finetuning means
producing a whole set of new parameters by training the entire
model and updating all parts for a downstream task. In the
early days of transformer and monolingual language modeling,
this was the go-to approach to try any model on a new task
or language [3].
b) Parameter efficient: The gradient-based finetuning is a
massive bottleneck if we consider the number of downstream
tasks. Then there is parameter-efficient finetuning named as
adapters which include updating only parts of model pa-
rameters [31]. Adapters proved to help perform effective
cross-lingual transfer while reducing the amount of negative
interference [45].
c) Multitask and zero-shot learning: Multitask learning
includes training the model on a wide variety of tasks simulta-
neously instead of just one task [30]. Following this scheme,
there would be less necessity for further finetuning. However,
till now, the utilization of this scheme is still confined to only
computationally intensive models and not in a general-purpose
setting.

-- 5 of 11 --

6
C. Language Modeling Objectives
We can think of the learning objectives of language mod-
eling in a data-driven manner. One is monolingual objectives
which generally work on monolingual data trying to figure out
the representation of a missing token. Whereas in bilingual
objectives, data comes from a parallel corpus. The aim is to
represent similar-meaning words as close as possible.
a) Monolingual Objectives: Masked language model-
ing(MLM) is monolingual but surprisingly helps learn mul-
tilingual models. Consider, a sentence X = (x1, x2, ..xs). In
masked language modeling, token xi is replaced with ¯xi. So
the input to the model becomes X = (x1, ... ¯	xi, ..xs). Now
the prediction task ins Y = (xi) where xi

Chunk 16 · 1,988 chars

l Objectives: Masked language model-
ing(MLM) is monolingual but surprisingly helps learn mul-
tilingual models. Consider, a sentence X = (x1, x2, ..xs). In
masked language modeling, token xi is replaced with ¯xi. So
the input to the model becomes X = (x1, ... ¯	xi, ..xs). Now
the prediction task ins Y = (xi) where xi can be predicted
from the final representation of ¯	xi [3]. Previously it was
thought that MLM works well because of its ability to discover
syntactic and semantic mechanisms in the pre-training stage.
However, recent findings suggest MLM learns the higher order
distributional statistics, thus making it a very useful prior
for further fine-tuning [46]. Another common one is Causal
Language Modeling (CLM). CLM is the traditional learning
objective in a language model that includes estimating the
probability of a word given the previous sequence of words
in a sentence, thus P (xt|x1, x2.., xt−1, θ). In other words,
a model can be trained with inputs X = (x1, , , xt−1) and
outputs Y = (xt). Here xt is the output label which can be
predicted from the final layer representation of token xt−1.
b) Parallel-corpora based Objectives: This is mainly a
bilingual objective, where the model learns to reduce the
distance of similar meaning text given a parallel corpus. As the
amount of parallel corpus is much less than monolingual one,
a joint objective utilizing both monolingual and parallel corpus
is used. Translation language modeling (TLM) [47] is the most
common one, where xA
1 , xA
2 , ...xA
n is a sentence in language A
and xB
1 , xB
2 , ...xB
m is a sentence in language B. Now the MLM
masks tokens from both sentences A and B. It tries to predict
the missing token by utilizing either the surrounding tokens in
A or the translation in B. Another useful objective is to use the
contextual representation from the base language model given
word alignment [48], [49], [50] between translations and ask
the model explicitly to reduce the distance of similar

Chunk 17 · 1,998 chars

s to predict
the missing token by utilizing either the surrounding tokens in
A or the translation in B. Another useful objective is to use the
contextual representation from the base language model given
word alignment [48], [49], [50] between translations and ask
the model explicitly to reduce the distance of similar meaning
word representation [51].
D. Data and Language
a) Pre-training Data: Multilingual Language Models
explore two data types primarily for pre-training: (1) Large
Monolingual Data and (2) Parallel Corpus. Bert [3] uses mono-
lingual data from Wikipedia, whereas XLM-R [4] uses a much
larger common-crawl corpus to train the model. Language
family or region-specific models [52] have their focused source
of data to explore.
b) Languages: BERT [3] and XLM-R [4] are two multi-
lingual models having trained in more than 100 languages. One
problem is that the data availability is not evenly distributed
across languages. Usually, models use exponential smoothing
to make the data ratio fair across high-resource and low-
resource languages. The main idea is if a language i contains
m% of total pre-training data, then the probability of that
language is pi = k/100 where pi is exponentiated by a
factor α. Then the resulting values are normalized to give
a probability value to all the languages. α < 1 means the
high-resourced ones will be under-sampled, whereas the low-
resourced ones will be over-sampled.
E. Tasks
The range of mainstream NLP tasks requires models to
perform transfer learning at different difficulty levels, thus
requiring word, phrase, or sentence-level understanding. Es-
sentially, we can frame the tasks into two main groups:
a) Syntactic Task: These tasks focus on the sentence-
level or word-level structure of the languages. The most
common example of this type of task includes dependency
parsing (DEP) [53], named entity recognition (NER) [27] and
parts of speech tagging (POS) [26]. Universal dependency
project [26] contains a dataset of DEP

Chunk 18 · 1,999 chars

Syntactic Task: These tasks focus on the sentence-
level or word-level structure of the languages. The most
common example of this type of task includes dependency
parsing (DEP) [53], named entity recognition (NER) [27] and
parts of speech tagging (POS) [26]. Universal dependency
project [26] contains a dataset of DEP and POS tasks covering
more than 100 languages as part of the evaluation.
b) Semantic Task: Some tasks require models to perform
language-level understanding or inference that can not be
answered by just following the sentence structure. Natural
language inference [24], [25] that tries to predict whether a
premise sentence entails, contradicts, or is neutral toward a
hypothesis sentence is one such task.
F. Tokenization
Tokenization is splitting a sequence of characters given a
document unit into a piece of singular units (e.g., tokens)
based on some level of heuristics. The simplest form of
heuristics would be to cut into words based on white space,
thus preserving the meaning at the unit level. However, this
comes with issues like phrase-level segmentation and differ-
ent characterization of the similar-meaning word because of
minor spelling variations. Moreover, words form differently in
different languages. For example, in french, the use of apos-
trophe sometimes works like the mention of a definite article
(l’ensemble). In German, compound words are written with-
out white-space (computational modeling → Computermodel-
lierung). Thus, researchers focus on sub-word and character-
based splitting instead of white-space tokenization to perform
a more usable segmentation. Tokenization does the text-to-
numerical mapping of input, making it one of the primary steps
before feeding a text distribution to a computation model. As
languages can be of different scripts with different vocabulary,
multilingual models accumulate all happening subwords as
part of the vocabulary for the supporting languages. This
becomes a limiting factor when we want to extend

Chunk 19 · 1,994 chars

it one of the primary steps
before feeding a text distribution to a computation model. As
languages can be of different scripts with different vocabulary,
multilingual models accumulate all happening subwords as
part of the vocabulary for the supporting languages. This
becomes a limiting factor when we want to extend the model
capacity to a thousand more languages. Training monolingual
tokenizers each time is not a viable option, and the original
multilingual tokenizer does not have vocabulary distributional
knowledge about the unseen languages.
a) Subword Segmentation: Subword-based tokenizations
are the most widely used tokenization for multilingual
transformer-based models. Because simple whitespace-based
tokenization suffers from the dimensionality bottleneck prob-
lem, simple character-based splitting results in losing all

-- 6 of 11 --

7
context signals. The main difference in common sub-word-
based tokenizers lies in the choice of character pairs to merge.
For example, BPE [18] makes a frequency-based merging,
whereas the Unigram [19] model uses a probability based
merging (computing the likelihood of each subword instead of
using the most frequent ones). Word-Piece [17] tries to utilize
the advantage of both unigram and BPE. It merges based on
likelihood instead of frequency, but the choice of words to join
is based on frequency. SentencePiece [20] is another subword-
based tokenizer that rules out the initial whitespace-based
splitting (useful for languages like Chinese and Japanese) by
considering space as just another character and then employing
BPE or Ingram on top of that. In the multilingual scenario,
languages can be of different scripts with different vocabulary,
and multilingual models accumulate all happening subwords
as part of the vocabulary from the supporting languages. Now
subword-based tokenizer with a fixed vocabulary size performs
unfairly for low-resource languages due to the data imbal-
ance among languages resulting in excessive

Chunk 20 · 1,991 chars

scripts with different vocabulary,
and multilingual models accumulate all happening subwords
as part of the vocabulary from the supporting languages. Now
subword-based tokenizer with a fixed vocabulary size performs
unfairly for low-resource languages due to the data imbal-
ance among languages resulting in excessive fragmentation
of subwords [54]. In addition, the fixed vocabulary becomes
a limiting factor when we want to extend the model capacity
to a thousand more languages.
b) Character-based Model: Subword tokenizers elimi-
nate the out-of-vocabulary problem to a large extent, though
the reliance on static vocabulary prevents end-to-end learning
across all languages. One alternative would be to use a
character-based approach [21], [22]. Though this is more
adaptable to the code-switched language and noisy text, they
may not capture the token-level representation and convert
the longer sequence to character-level representation [22], thus
increasing the task complexity [23].
G. Evaluation
Language models like mBERT [3] and XLM-R [4] helped
in a way to shift the overall focus towards a unified gen-
eral multilingual framework of language modeling. With this
advancement, researchers identified the need for a unified
evaluation framework that can provide insight into the overall
transfer-learning capability of a multilingual model. With this
aim, in [16], the authors designed an evaluation setting that
ranges from syntactic to semantic tasks and structural predic-
tion tasks. In each task evaluation, the training is performed
in English, whereas the evaluation comprises 40 languages
from 12 families, thus ensuring the typological diversity in
language selection. Later, this practice of expanding over far-
reach languages continued through subsequent studies [55].
However, these unified evaluation frameworks still do not en-
tirely comprise languages from highly low-resource languages
and dialects.
V. OPEN PROBLEMS
a) Tokenization free approach: In short, a

Chunk 21 · 1,999 chars

language selection. Later, this practice of expanding over far-
reach languages continued through subsequent studies [55].
However, these unified evaluation frameworks still do not en-
tirely comprise languages from highly low-resource languages
and dialects.
V. OPEN PROBLEMS
a) Tokenization free approach: In short, a silver-lined
multilingual tokenizer is yet to be found as a monolingual
subword tokenizer still outperforms a multilingual tokenizer
in almost every task [56], [57]. Now monolingual Training
tokenizer every time is not a feasible option and the original
multilingual tokenizer does not have vocabulary distributional
knowledge about the unseen languages. One option could
be the effort to rule out the tokenization step completely.
Several recent studies are exploring this direction. Models
like CAINNE [23] are tokenization-free modules that en-
code texts differently. Instead of using a vocabulary and
tokenization step, CANINE encodes texts at the character level,
produces character-level output, and uses a soft-max layer-
based smoothing for subword projection. Recently another
model PIXEL [58], treats text as an image and by doing that,
bypasses the text encoding step. These methods are getting
inspiring results but the full potentiality of this direction is yet
to unfold. For example, pixel receives impressive results for
syntactic tasks, but this is not true for semantic tasks. Further
studies need to be done on the applicability of tokenization-
free models in the case of zero-shot and few-shot transfers.
b) Extending to new languages: In the current paradigm
of MLLM, the curse-of-multilinguality [45] is an issue where
the per-language performance drops unequally when the model
is trained on multiple languages. In a general setting, the
usual scenario is: (1) A monolingual model performs better
than a multilingual model on specific tasks and language (2)
The low-resourced ones perform poorly compared to the high-
resourced ones in MLLM (3) Languages

Chunk 22 · 1,995 chars

performance drops unequally when the model
is trained on multiple languages. In a general setting, the
usual scenario is: (1) A monolingual model performs better
than a multilingual model on specific tasks and language (2)
The low-resourced ones perform poorly compared to the high-
resourced ones in MLLM (3) Languages unseen during the
pre-training stage perform the worst. To tackle this issue, the
straightforward approach would be to finetune the model to the
specific target language. However, then, the model becomes
specific to only one language though it does not increase
the catastrophic forgetting in general. The authors in [59]
have identified then, while dealing with unseen language and
scripts, having pretrained on any closely related language
usually helps. Another approach to improve the capacity
of MLLM would be to augment the vocabulary with new
tokens, which works better for unseen languages and improve
performance for the languages already seen during pre-training
[60]. Another simple approach includes mitigating the cross-
lingual transfer gap by just performing training on a bit of
target data amount[61]. Another helpful approach to improve
the model capacity is to use adapters, modular parameter units
that can be injected in every layer of a base language model.
These adapter layers can then be trained on language or task-
specific data, thus enabling the cross-lingual transfer without
changing the base model parameters[31]. This approach can
be further used for unseen language and optimized zero-shot
transfer by using these adapters at the pre-training stage, which
lifts the curse-of-multilingualism to a greater extent[62].
c) Model Interpretability and Shared Representation: It
is hypothesized that most of the large neural models are over-
parameterized [63], thus, resulting in unnecessary computation
and storage costs. Effective model pruning could be a feasible
approach to try in this regard. [64] is one of the earlier
studies on pretrained

Chunk 23 · 1,998 chars

el Interpretability and Shared Representation: It
is hypothesized that most of the large neural models are over-
parameterized [63], thus, resulting in unnecessary computation
and storage costs. Effective model pruning could be a feasible
approach to try in this regard. [64] is one of the earlier
studies on pretrained bert with model pruning. The results
demonstrate that the idea of lottery ticket observations [63]
(i.e., if we select multiple small networks, we will end up
getting the similar performance of the larger model at one
time) remain relevant in this context of language modeling
also. We find matching subnetworks for a range of downstream
tasks at 40% to 90% sparsity. Subsequently, [65] has done LTH
based empirical study on English bert. The findings suggest

-- 7 of 11 --

8
“random” subnetworks are still almost as good as the “good”
ones, and even the “worst”(sampled elements which didn’t
survive the pruning by LTH [63]) ones perform on par with a
substantial baseline. [66] is one of the studies which explore
LTH-based pruning in multilingual scenarios. The authors use
the lottery ticket hypothesis in mBERT and conclude that
the sub-networks found for different languages are similar. In
addition, mBERT comprises a language-neutral sub-network
shared among many languages, which is the most useful one
while performing cross-lingual transfer. Another finding is for
MLM tasks, the similar language & task specific sub-network
are primarily identical in the lower to the middle layer. In con-
trast, network similarity is visible mainly in the higher layer for
NER & XNLI tasks. The prospect of different pruning-based
methods is further explored in [67]. The authors study two
types of model pruning strategies: regularization-based and
gradient-based pruning. They also propose a method called
Dynamic Sparsification to allow training the model once and
then adapting to different model sizes at inference. They also
used a diversity loss to prune language-specific

Chunk 24 · 1,994 chars

n [67]. The authors study two
types of model pruning strategies: regularization-based and
gradient-based pruning. They also propose a method called
Dynamic Sparsification to allow training the model once and
then adapting to different model sizes at inference. They also
used a diversity loss to prune language-specific subnetworks.
Results show that subnetworks of different languages are
indeed different. The most straightforward pruning algorithm
performs the best, and A fast model does not mean it should
be small. Another finding is for large multilingual models like
XLM-R sharing the subnetwork for a universal representation
is preferable; the language-neutral part contributes the most
in cross-lingual transfer, which also aligns with the finding
of [66]. [68] did one study on the Effect of Dropping Layers
of Pre-trained Transformer Models where they pruned BERT,
Roberta, and XLNet models up to 40% while maintaining up
to 98% of their actual performance. The findings are: (i) the
lower layers are most critical to maintaining downstream task
performance, (ii) some tasks, such as paraphrase detection and
sentence similarity, are more robust to the dropping of layers,
and (iii) models trained using a different objective function
exhibit different learning patterns and w.r.t the layer dropping.
Another question is whether MLLM learns universal pat-
terns or not. Learning universal patterns is essential for effec-
tive cross-lingual transfer. From the discussion above, it can
be inferred that MLLM contains particular language-specific
and shared representation space. The shared representation
space helps in probing-based tasks like POS tagging. However,
performing complex tasks like MT is still beyond scope just
by using this shared representation space.
Analysis of traditional classifier-based probing methods
is another heavily investigated direction that is still being
explored to ensure model interpretability. For example, [69] ex-
amines whether good networks

Chunk 25 · 1,995 chars

er,
performing complex tasks like MT is still beyond scope just
by using this shared representation space.
Analysis of traditional classifier-based probing methods
is another heavily investigated direction that is still being
explored to ensure model interpretability. For example, [69] ex-
amines whether good networks include any superior linguistic
knowledge or not. It results in not finding any interpretable
patterns. [70] investigated Language Relationships in Multi-
lingual Sentence Encoders Through the Lens of Linguistic
Typology; This study looks into how languages are placed in
multilingual subspace in mBERT and XLM-R. At the same
time, how language specific sub-spaces within multilingual
sentence encoders (LASER [71], m-BERT [3], XLM [72],
and XLM-R [4]) concerning a range of typological properties
pertaining to the lexical, morphological, and syntactic structure
can be separated. Their results show interesting differences
in encoding linguistic variation associated with different pre-
training strategies. [69] is another recent study on linguistic
probing using universal dependency tasks and data points.
d) Fairness in Language Models: A substantial amount
of work has investigated existing social bias (e.g., gender,
racial, ethnic, occupational) identification and mitigation ap-
proach in PLMs, including reducing token sensitivity dur-
ing text generation [73], investigating model sensitivity [74],
prompting using natural sentences [75] and probing via embed-
ding lookup [76]. However, the state-of-the-art NLP models
and datasets are still biased toward certain attributes [77], and
the overall utility provided by the models is still skewed [78].
VI. CONCLUSION
This survey provides insights into the current state of the art
of multilingual text representation and the extent of language
modeling. While many languages are already being covered
under the current paradigm of language modeling, the full
potential is not explored yet. There are languages left

Chunk 26 · 1,990 chars

[78].
VI. CONCLUSION
This survey provides insights into the current state of the art
of multilingual text representation and the extent of language
modeling. While many languages are already being covered
under the current paradigm of language modeling, the full
potential is not explored yet. There are languages left to
cover. At the same time, the currently covered low-resourced
ones are still not free from the impact of capacity dilution.
Keeping this in focus, we did a review on the potential scope
of improvement and research directions.
REFERENCES
[1] Z. Harris, “Distributional structure,” Word, vol. 10,
no. 2-3, pp. 146–162, 1954. [Online]. Available:
https://link.springer.com/chapter/10.1007/978-94-009-8467-7_1
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2017.
[Online]. Available: https://arxiv.org/abs/1706.03762
[3] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT:
Pre-training of deep bidirectional transformers for language
understanding,” in Proceedings of the 2019 Conference of the
North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long
and Short Papers). Minneapolis, Minnesota: Association for
Computational Linguistics, Jun. 2019, pp. 4171–4186. [Online].
Available: https://www.aclweb.org/anthology/N19-1423
[4] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek,
F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov,
“Unsupervised cross-lingual representation learning at scale,” in
Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics. Online: Association for Computational
Linguistics, Jul. 2020, pp. 8440–8451. [Online]. Available:
https://www.aclweb.org/anthology/2020.acl-main.747
[5] Z. Wang, Z. C. Lipton, and Y. Tsvetkov, “On negative interference
in multilingual models: Findings and a meta-learning treatment,”
in Proceedings of the 2020

Chunk 27 · 1,991 chars

uistics. Online: Association for Computational
Linguistics, Jul. 2020, pp. 8440–8451. [Online]. Available:
https://www.aclweb.org/anthology/2020.acl-main.747
[5] Z. Wang, Z. C. Lipton, and Y. Tsvetkov, “On negative interference
in multilingual models: Findings and a meta-learning treatment,”
in Proceedings of the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP). Online: Association
for Computational Linguistics, Nov. 2020, pp. 4438–4450. [Online].
Available: https://aclanthology.org/2020.emnlp-main.359
[6] J. A. Balazs and J. D. Velásquez, “Opinion min-
ing and information fusion: A survey,” Information Fu-
sion, vol. 27, pp. 95–110, 2016. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S1566253515000536
[7] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation
by jointly learning to align and translate,” 2014. [Online]. Available:
https://arxiv.org/abs/1409.0473
[8] W. Y. Zou, R. Socher, D. Cer, and C. D. Manning, “Bilingual word
embeddings for phrase-based machine translation,” in Proceedings
of the 2013 Conference on Empirical Methods in Natural
Language Processing. Seattle, Washington, USA: Association for
Computational Linguistics, Oct. 2013, pp. 1393–1398. [Online].
Available: https://aclanthology.org/D13-1141
[9] T. Mikolov, Q. V. Le, and I. Sutskever, “Exploiting similarities
among languages for machine translation,” 2013. [Online]. Available:
https://arxiv.org/abs/1309.4168

-- 8 of 11 --

9
[10] M. Zhang, H. Peng, Y. Liu, H. Luan, and M. Sun, “Bilingual lexicon
induction from non-parallel data with minimal supervision,” in Proceed-
ings of the Thirty-First AAAI Conference on Artificial Intelligence, ser.
AAAI’17. AAAI Press, 2017, p. 3379–3385.
[11] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilis-
tic language model,” J. Mach. Learn. Res., vol. 3, no. null, p. 1137–1155,
mar 2003.
[12] N.-Q. Pham, G. Kruszewski, and G. Boleda, “Convolutional neural
network language

Chunk 28 · 1,993 chars

Artificial Intelligence, ser.
AAAI’17. AAAI Press, 2017, p. 3379–3385.
[11] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilis-
tic language model,” J. Mach. Learn. Res., vol. 3, no. null, p. 1137–1155,
mar 2003.
[12] N.-Q. Pham, G. Kruszewski, and G. Boleda, “Convolutional neural
network language models,” in Proceedings of the 2016 Conference on
Empirical Methods in Natural Language Processing. Austin, Texas:
Association for Computational Linguistics, Nov. 2016, pp. 1153–1162.
[Online]. Available: https://aclanthology.org/D16-1123
[13] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and
S. Khudanpur, “Recurrent neural network based language
model.” in INTERSPEECH, T. Kobayashi, K. Hirose, and
S. Nakamura, Eds. ISCA, 2010, pp. 1045–1048. [Online]. Available:
http://dblp.uni-trier.de/db/conf/interspeech/interspeech2010.html#MikolovKBCK10
[14] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
Comput., vol. 9, no. 8, p. 1735–1780, nov 1997. [Online]. Available:
https://doi.org/10.1162/neco.1997.9.8.1735
[15] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, “On the
properties of neural machine translation: Encoder–decoder approaches,”
in Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics
and Structure in Statistical Translation. Doha, Qatar: Association for
Computational Linguistics, Oct. 2014, pp. 103–111. [Online]. Available:
https://aclanthology.org/W14-4012
[16] J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat, and M. Johnson,
“Xtreme: A massively multilingual multi-task benchmark for evaluating
cross-lingual generalization,” CoRR, vol. abs/2003.11080, 2020.
[17] M. Schuster and K. Nakajima, “Japanese and korean voice search,” in
2012 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2012, pp. 5149–5152.
[18] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation
of rare words with subword units,” in Proceedings of the 54th
Annual Meeting of the Association for

Chunk 29 · 1,953 chars

“Japanese and korean voice search,” in
2012 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), 2012, pp. 5149–5152.
[18] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation
of rare words with subword units,” in Proceedings of the 54th
Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers). Berlin, Germany: Association for
Computational Linguistics, Aug. 2016, pp. 1715–1725. [Online].
Available: https://aclanthology.org/P16-1162
[19] T. Kudo, “Subword regularization: Improving neural network translation
models with multiple subword candidates,” 2018. [Online]. Available:
https://arxiv.org/abs/1804.10959
[20] T. Kudo and J. Richardson, “SentencePiece: A simple and
language independent subword tokenizer and detokenizer for
neural text processing,” in Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing:
System Demonstrations. Brussels, Belgium: Association for
Computational Linguistics, Nov. 2018, pp. 66–71. [Online]. Available:
https://www.aclweb.org/anthology/D18-2012
[21] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck,
“Learning deep structured semantic models for web search using
clickthrough data.” ACM International Conference on Information and
Knowledge Management (CIKM), October 2013. [Online]. Available:
https://www.microsoft.com/en-us/research/publication/learning-deep-structured-semantic-models-for-web-search-using-clickthrough-data/
[22] K. Hwang and W. Sung, “Character-level language modeling with hierar-
chical recurrent neural networks,” 2017 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pp. 5720–5724,
2017.
[23] J. H. Clark, D. Garrette, I. Turc, and J. Wieting, “Canine:
Pre-training an efficient tokenization-free encoder for language
representation,” Transactions of the Association for Computational
Linguistics, vol. 10, pp. 73–91, 2022. [Online].

Chunk 30 · 1,990 chars

on Acoustics, Speech and Signal Processing (ICASSP), pp. 5720–5724,
2017.
[23] J. H. Clark, D. Garrette, I. Turc, and J. Wieting, “Canine:
Pre-training an efficient tokenization-free encoder for language
representation,” Transactions of the Association for Computational
Linguistics, vol. 10, pp. 73–91, 2022. [Online]. Available:
https://aclanthology.org/2022.tacl-1.5
[24] A. Williams, N. Nangia, and S. Bowman, “A broad-coverage challenge
corpus for sentence understanding through inference,” in Proceedings of
the 2018 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Volume
1 (Long Papers). Association for Computational Linguistics, 2018, pp.
1112–1122. [Online]. Available: http://aclweb.org/anthology/N18-1101
[25] A. Conneau, R. Rinott, G. Lample, A. Williams, S. Bowman,
H. Schwenk, and V. Stoyanov, “XNLI: Evaluating cross-lingual
sentence representations,” in Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Processing. Brussels,
Belgium: Association for Computational Linguistics, Oct.-Nov. 2018,
pp. 2475–2485. [Online]. Available: https://aclanthology.org/D18-1269
[26] D. Zeman, J. Nivre, M. Abrams, E. Ackermann, N. Aepli, H. Aghaei,
Ž. Agi´c, A. Ahmadi, L. Ahrenberg, Ajede, and et al., “Universal
dependencies 2.9,” 2021, LINDAT/CLARIAH-CZ digital library at
the Institute of Formal and Applied Linguistics (ÚFAL), Faculty
of Mathematics and Physics, Charles University. [Online]. Available:
http://hdl.handle.net/11234/1-4611
[27] A. Rahimi, Y. Li, and T. Cohn, “Massively multilingual transfer for
NER,” in Proceedings of the 57th Annual Meeting of the Association
for Computational Linguistics. Florence, Italy: Association for
Computational Linguistics, Jul. 2019, pp. 151–164. [Online]. Available:
https://www.aclweb.org/anthology/P19-1015
[28] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal,
A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al.,

Chunk 31 · 1,990 chars

e Association
for Computational Linguistics. Florence, Italy: Association for
Computational Linguistics, Jul. 2019, pp. 151–164. [Online]. Available:
https://www.aclweb.org/anthology/P19-1015
[28] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal,
A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models
are few-shot learners,” 2020, arXiv:2005.14165.
[29] D. Zeman and J. Hajiˇc, Eds., Proceedings of the CoNLL 2018 Shared
Task: Multilingual Parsing from Raw Text to Universal Dependencies.
Brussels, Belgium: Association for Computational Linguistics, Oct.
2018. [Online]. Available: https://www.aclweb.org/anthology/K18-2000
[30] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena,
Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning
with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21,
no. 1, jun 2022.
[31] J. Pfeiffer, I. Vuli´c, I. Gurevych, and S. Ruder, “MAD-X: An
Adapter-Based Framework for Multi-Task Cross-Lingual Transfer,”
in Proceedings of the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP). Online: Association
for Computational Linguistics, Nov. 2020, pp. 7654–7673. [Online].
Available: https://aclanthology.org/2020.emnlp-main.617
[32] “Bigscience large open-science open-access multilingual lan-
guage model,” 2022, online resource. [Online]. Available:
https://huggingface.co/bigscience/bloom-560m
[33] O. Shliazhko, A. Fenogenova, M. Tikhonova, V. Mikhailov, A. Kozlova,
and T. Shavrina, “mgpt: Few-shot learners go multilingual,” 2022.
[Online]. Available: https://arxiv.org/abs/2204.07580
[34] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee,
and L. Zettlemoyer, “Deep contextualized word representations,” in
Proceedings of the 2018 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long Papers). New Orleans, Louisiana:
Association for Computational

Chunk 32 · 1,998 chars

M. Gardner, C. Clark, K. Lee,
and L. Zettlemoyer, “Deep contextualized word representations,” in
Proceedings of the 2018 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long Papers). New Orleans, Louisiana:
Association for Computational Linguistics, Jun. 2018, pp. 2227–2237.
[Online]. Available: https://aclanthology.org/N18-1202
[35] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation
of word representations in vector space,” 2013. [Online]. Available:
https://arxiv.org/abs/1301.3781
[36] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word
vectors with subword information,” Transactions of the Association
for Computational Linguistics, vol. 5, pp. 135–146, 2017. [Online].
Available: https://aclanthology.org/Q17-1010
[37] J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors
for word representation,” in Proceedings of the 2014 Conference
on Empirical Methods in Natural Language Processing (EMNLP).
Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp.
1532–1543. [Online]. Available: https://aclanthology.org/D14-1162
[38] C. Sammut and G. I. Webb, Eds., TF–IDF. Boston,
MA: Springer US, 2010, pp. 986–987. [Online]. Available:
https://doi.org/10.1007/978-0-387-30164-8_832
[39] O. Melamud, J. Goldberger, and I. Dagan, “context2vec:
Learning generic context embedding with bidirectional LSTM,”
in Proceedings of the 20th SIGNLL Conference on Computational
Natural Language Learning. Berlin, Germany: Association for
Computational Linguistics, Aug. 2016, pp. 51–61. [Online]. Available:
https://aclanthology.org/K16-1006
[40] B. McCann, J. Bradbury, C. Xiong, and R. Socher, “Learned in
translation: Contextualized word vectors,” in Proceedings of the 31st
International Conference on Neural Information Processing Systems,
ser. NIPS’17. Red Hook, NY, USA: Curran Associates Inc., 2017,
p. 6297–6308.
[41] M. E. Peters, M. Neumann, M. Iyyer, M.

Chunk 33 · 1,993 chars

B. McCann, J. Bradbury, C. Xiong, and R. Socher, “Learned in
translation: Contextualized word vectors,” in Proceedings of the 31st
International Conference on Neural Information Processing Systems,
ser. NIPS’17. Red Hook, NY, USA: Curran Associates Inc., 2017,
p. 6297–6308.
[41] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and
L. Zettlemoyer, “Deep contextualized word representations,” in Proc. of
NAACL, 2018.
[42] J. Howard and S. Ruder, “Universal language model fine-
tuning for text classification,” in Proceedings of the 56th
Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers). Melbourne, Australia: Association for
Computational Linguistics, Jul. 2018, pp. 328–339. [Online]. Available:
https://aclanthology.org/P18-1031
[43] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan,
M. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster,
D. Simig, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer, “Opt:
Open pre-trained transformer language models,” 2022.

-- 9 of 11 --

10
[44] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy,
M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly
optimized bert pretraining approach,” 2019. [Online]. Available:
https://arxiv.org/abs/1907.11692
[45] J. Pfeiffer, N. Goyal, X. V. Lin, X. Li, J. Cross, S. Riedel, and
M. Artetxe, “Lifting the curse of multilinguality by pre-training modular
transformers,” 2022, arXiv:2205.06266.
[46] K. Sinha, R. Jia, D. Hupkes, J. Pineau, A. Williams, and D. Kiela,
“Masked language modeling and the distributional hypothesis: Order
word matters pre-training for little,” in Proceedings of the 2021
Conference on Empirical Methods in Natural Language Processing.
Online and Punta Cana, Dominican Republic: Association for
Computational Linguistics, Nov. 2021, pp. 2888–2913. [Online].
Available: https://aclanthology.org/2021.emnlp-main.230
[47] A. CONNEAU and G. Lample, “Cross-lingual language

Chunk 34 · 1,995 chars

roceedings of the 2021
Conference on Empirical Methods in Natural Language Processing.
Online and Punta Cana, Dominican Republic: Association for
Computational Linguistics, Nov. 2021, pp. 2888–2913. [Online].
Available: https://aclanthology.org/2021.emnlp-main.230
[47] A. CONNEAU and G. Lample, “Cross-lingual language model
pretraining,” in Advances in Neural Information Processing
Systems, H. Wallach, H. Larochelle, A. Beygelzimer,
F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., vol. 32.
Curran Associates, Inc., 2019. [Online]. Available:
https://proceedings.neurips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf
[48] Z.-Y. Dou and G. Neubig, “Word alignment by fine-tuning
embeddings on parallel corpora,” in Proceedings of the 16th
Conference of the European Chapter of the Association for
Computational Linguistics: Main Volume. Online: Association for
Computational Linguistics, Apr. 2021, pp. 2112–2128. [Online].
Available: https://aclanthology.org/2021.eacl-main.181
[49] D. Yarowsky, G. Ngai, and R. Wicentowski, “Inducing multilingual
text analysis tools via robust projection across aligned corpora,”
in Proceedings of the First International Conference on Human
Language Technology Research, ser. HLT ’01. USA: Association
for Computational Linguistics, 2001, p. 1–8. [Online]. Available:
https://doi.org/10.3115/1072133.1072187
[50] C. Dyer, V. Chahuneau, and N. A. Smith, “A simple, fast, and
effective reparameterization of IBM model 2,” in Proceedings of the
2013 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies. Atlanta,
Georgia: Association for Computational Linguistics, Jun. 2013, pp.
644–648. [Online]. Available: https://aclanthology.org/N13-1073
[51] S. Cao, N. Kitaev, and D. Klein, “Multilingual alignment
of contextual word representations,” 2020. [Online]. Available:
https://arxiv.org/abs/2002.03518
[52] D. Kakwani, A. Kunchukuttan, S. Golla, G. N.C., A. Bhattacharyya,
M. M.

Chunk 35 · 1,996 chars

ics, Jun. 2013, pp.
644–648. [Online]. Available: https://aclanthology.org/N13-1073
[51] S. Cao, N. Kitaev, and D. Klein, “Multilingual alignment
of contextual word representations,” 2020. [Online]. Available:
https://arxiv.org/abs/2002.03518
[52] D. Kakwani, A. Kunchukuttan, S. Golla, G. N.C., A. Bhattacharyya,
M. M. Khapra, and P. Kumar, “IndicNLPSuite: Monolingual Corpora,
Evaluation Benchmarks and Pre-trained Multilingual Language Models
for Indian Languages,” in Findings of EMNLP, 2020.
[53] T. Dozat and C. D. Manning, “Deep biaffine attention
for neural dependency parsing,” 2016. [Online]. Available:
https://arxiv.org/abs/1611.01734
[54] Y. Tay, V. Q. Tran, S. Ruder, J. Gupta, H. W. Chung, D. Bahri, Z. Qin,
S. Baumgartner, C. Yu, and D. Metzler, “Charformer: Fast character
transformers via gradient-based subword tokenization,” 2021. [Online].
Available: https://arxiv.org/abs/2106.12672
[55] S. Ruder, N. Constant, J. Botha, A. Siddhant, O. Firat, J. Fu, P. Liu,
J. Hu, D. Garrette, G. Neubig, and M. Johnson, “XTREME-R: Towards
more challenging and nuanced multilingual evaluation,” in Proceedings
of the 2021 Conference on Empirical Methods in Natural Language
Processing. Online and Punta Cana, Dominican Republic: Association
for Computational Linguistics, Nov. 2021, pp. 10 215–10 245. [Online].
Available: https://aclanthology.org/2021.emnlp-main.802
[56] P. Rust, J. Pfeiffer, I. Vuli´c, S. Ruder, and I. Gurevych, “How
good is your tokenizer? on the monolingual performance of
multilingual language models,” in Proceedings of the 59th Annual
Meeting of the Association for Computational Linguistics and
the 11th International Joint Conference on Natural Language
Processing (Volume 1: Long Papers). Online: Association for
Computational Linguistics, Aug. 2021, pp. 3118–3135. [Online].
Available: https://aclanthology.org/2021.acl-long.243
[57] S. J. Mielke, Z. Alyafeai, E. Salesky, C. Raffel, M. Dey,
M. Gallé, A. Raja, C. Si, W. Y. Lee, B. Sagot, and S. Tan,
“Between

Chunk 36 · 1,996 chars

al Language
Processing (Volume 1: Long Papers). Online: Association for
Computational Linguistics, Aug. 2021, pp. 3118–3135. [Online].
Available: https://aclanthology.org/2021.acl-long.243
[57] S. J. Mielke, Z. Alyafeai, E. Salesky, C. Raffel, M. Dey,
M. Gallé, A. Raja, C. Si, W. Y. Lee, B. Sagot, and S. Tan,
“Between words and characters: A brief history of open-vocabulary
modeling and tokenization in nlp,” 2021. [Online]. Available:
https://arxiv.org/abs/2112.10508
[58] P. Rust, J. F. Lotz, E. Bugliarello, E. Salesky, M. de Lhoneux, and
D. Elliott, “Language modelling with pixels,” 2022. [Online]. Available:
https://arxiv.org/abs/2207.06991
[59] B. Muller, A. Anastasopoulos, B. Sagot, and D. Seddah, “When
being unseen from mBERT is just the beginning: Handling new
languages with multilingual language models,” in Proceedings of the
2021 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies. Online:
Association for Computational Linguistics, Jun. 2021, pp. 448–462.
[Online]. Available: https://aclanthology.org/2021.naacl-main.38
[60] Z. Wang, K. K, S. Mayhew, and D. Roth, “Extending multilingual
BERT to low-resource languages,” in Findings of the Association
for Computational Linguistics: EMNLP 2020. Online: Association
for Computational Linguistics, Nov. 2020, pp. 2649–2656. [Online].
Available: https://aclanthology.org/2020.findings-emnlp.240
[61] A. Lauscher, V. Ravishankar, I. Vuli´c, and G. Glavaš, “From zero
to hero: On the limitations of zero-shot language transfer with
multilingual Transformers,” in Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing (EMNLP). Online:
Association for Computational Linguistics, Nov. 2020, pp. 4483–4499.
[Online]. Available: https://aclanthology.org/2020.emnlp-main.363
[62] J. Pfeiffer, N. Goyal, X. Lin, X. Li, J. Cross, S. Riedel, and
M. Artetxe, “Lifting the curse of multilinguality by pre-training
modular transformers,” in

Chunk 37 · 1,994 chars

cessing (EMNLP). Online:
Association for Computational Linguistics, Nov. 2020, pp. 4483–4499.
[Online]. Available: https://aclanthology.org/2020.emnlp-main.363
[62] J. Pfeiffer, N. Goyal, X. Lin, X. Li, J. Cross, S. Riedel, and
M. Artetxe, “Lifting the curse of multilinguality by pre-training
modular transformers,” in Proceedings of the 2022 Conference of
the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies. Seattle, United States:
Association for Computational Linguistics, Jul. 2022, pp. 3479–3495.
[Online]. Available: https://aclanthology.org/2022.naacl-main.255
[63] J. Frankle and M. Carbin, “The lottery ticket hypothesis: Finding
sparse, trainable neural networks,” 2018. [Online]. Available:
https://arxiv.org/abs/1803.03635
[64] T. Chen, J. Frankle, S. Chang, S. Liu, Y. Zhang, Z. Wang, and
M. Carbin, “The lottery ticket hypothesis for pre-trained bert networks,”
2020. [Online]. Available: https://arxiv.org/abs/2007.12223
[65] S. Prasanna, A. Rogers, and A. Rumshisky, “When BERT Plays
the Lottery, All Tickets Are Winning,” in Proceedings of the
2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP). Online: Association for Computational
Linguistics, Nov. 2020, pp. 3208–3229. [Online]. Available:
https://aclanthology.org/2020.emnlp-main.259
[66] N. Foroutan, M. Banaei, R. Lebret, A. Bosselut, and K. Aberer,
“Discovering language-neutral sub-networks in multilingual language
models,” 2022. [Online]. Available: https://arxiv.org/abs/2205.12672
[67] Y. Li, F. Luo, R. Xu, S. Huang, F. Huang, and L. Wang,
“Probing structured pruning on multilingual pre-trained models:
Settings, algorithms, and efficiency,” 2022. [Online]. Available:
https://arxiv.org/abs/2204.02601
[68] H. Sajjad, F. Dalvi, N. Durrani, and P. Nakov, “On the effect of dropping
layers of pre-trained transformer models,” 2021.
[69] K. Sta´nczak, E. Ponti, L. T. Hennigen, R. Cotterell, and
I. Augenstein, “Same neurons, different

Chunk 38 · 1,985 chars

algorithms, and efficiency,” 2022. [Online]. Available:
https://arxiv.org/abs/2204.02601
[68] H. Sajjad, F. Dalvi, N. Durrani, and P. Nakov, “On the effect of dropping
layers of pre-trained transformer models,” 2021.
[69] K. Sta´nczak, E. Ponti, L. T. Hennigen, R. Cotterell, and
I. Augenstein, “Same neurons, different languages: Probing
morphosyntax in multilingual pre-trained models,” 2022. [Online].
Available: https://arxiv.org/abs/2205.02023
[70] R. Choenni and E. Shutova, “Investigating Language Relationships
in Multilingual Sentence Encoders Through the Lens of Linguistic
Typology,” Computational Linguistics, pp. 1–38, 07 2022. [Online].
Available: https://doi.org/10.1162/coli_a_00444
[71] M. Artetxe and H. Schwenk, “Massively multilingual sentence embed-
dings for zero-shot cross-lingual transfer and beyond,” Transactions of
the Association for Computational Linguistics, vol. 7, pp. 597–610, nov
2019. [Online]. Available: https://doi.org/10.1162%2Ftacl_a_00288
[72] G. Lample and A. Conneau, “Cross-lingual language model pretraining,”
2019. [Online]. Available: https://arxiv.org/abs/1901.07291
[73] P. P. Liang, C. Wu, L.-P. Morency, and R. Salakhutdinov, “Towards
understanding and mitigating social biases in language models,” 2021.
[Online]. Available: https://arxiv.org/abs/2106.13219
[74] A. Immer, L. Torroba Hennigen, V. Fortuin, and R. Cotterell,
“Probing as quantifying inductive bias,” in Proceedings of the 60th
Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers). Dublin, Ireland: Association for
Computational Linguistics, May 2022, pp. 1839–1851. [Online].
Available: https://aclanthology.org/2022.acl-long.129
[75] S. Alnegheimish, A. Guo, and Y. Sun, “Using natural sentence
prompts for understanding biases in language models,” in
Proceedings of the 2022 Conference of the North American
Chapter of the Association for Computational Linguistics: Human
Language Technologies. Seattle, United States: Association

Chunk 39 · 1,601 chars

022.acl-long.129
[75] S. Alnegheimish, A. Guo, and Y. Sun, “Using natural sentence
prompts for understanding biases in language models,” in
Proceedings of the 2022 Conference of the North American
Chapter of the Association for Computational Linguistics: Human
Language Technologies. Seattle, United States: Association for
Computational Linguistics, Jul. 2022, pp. 2824–2830. [Online].
Available: https://aclanthology.org/2022.naacl-main.203
[76] J. Ahn and A. Oh, “Mitigating language-dependent ethnic bias in BERT,”
in Proceedings of the 2021 Conference on Empirical Methods in Natural
Language Processing. Online and Punta Cana, Dominican Republic:

-- 10 of 11 --

11
Association for Computational Linguistics, Nov. 2021, pp. 533–549.
[Online]. Available: https://aclanthology.org/2021.emnlp-main.42
[77] P. Joshi, S. Santy, A. Budhiraja, K. Bali, and M. Choudhury,
“The state and fate of linguistic diversity and inclusion in the
NLP world,” in Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics. Online: Association
for Computational Linguistics, Jul. 2020, pp. 6282–6293. [Online].
Available: https://aclanthology.org/2020.acl-main.560
[78] D. Blasi, A. Anastasopoulos, and G. Neubig, “Systematic inequalities
in language technology performance across the world’s languages,”
in Proceedings of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland:
Association for Computational Linguistics, May 2022, pp. 5486–5505.
[Online]. Available: https://aclanthology.org/2022.acl-long.376

-- 11 of 11 --