Multilingual Text Representation
Summary
This survey explores the evolution of multilingual text representation in NLP, tracing progress from early word-frequency models to modern transformer-based systems. Initially, models like n-grams and neural networks (e.g., RNNs, CNNs) focused on monolingual tasks. The transformer architecture, with its attention mechanism, enabled shared representation spaces across languages, leading to models like mBERT and XLM-R, which support over 100 languages. These models use pre-training on monolingual and parallel corpora, often with objectives like masked language modeling (MLM) and translation language modeling (TLM). Challenges remain, including capacity dilution when adding more languages, uneven performance across low- and high-resource languages, and the need for equitable representation. Tokenization is a key issue, with subword methods like BPE and SentencePiece widely used but struggling with low-resource languages. Open problems include developing tokenization-free models, improving adaptability to new languages, and enhancing model interpretability and fairness. The survey highlights the need for research to address these gaps and achieve truly inclusive multilingual NLP.
PDF viewer
Chunks(40)
Chunk 0 ¡ 1,990 chars
arXiv:2309.00949v1 [cs.CL] 2 Sep 2023 1 Multilingual Text Representation Fahim Faisal Department of Computer Science George Mason University ffaisal@gmu.edu AbstractâModern NLP breakthrough includes large multi- lingual models capable of performing tasks across more than 100 languages. State-of-the-art language models came a long way, starting from the simple one-hot representation of words capable of performing tasks like natural language understanding, common-sense reasoning, or question-answering, thus capturing both the syntax and semantics of texts. At the same time, language models are expanding beyond our known language boundary, even competitively performing over very low-resource dialects of endangered languages. However, there are still problems to solve to ensure an equitable representation of texts through a unified modeling space across language and speakers. In this survey, we shed light on this iterative progression of multilingual text representation and discuss the driving factors that ultimately led to the current state-of-the-art. Subsequently, we discuss how the full potential of language democratization could be obtained, reaching beyond the known limits and what is the scope of improvement in that space. Index TermsâNLP, Multilinguality, Language Model I. INTRODUCTION Natural language processing (NLP) primarily involves mak- ing linguistic-specific applications for machines to understand language. Earlier days of NLP development mainly focused on the idea of distributional hypothesis [1] that is, "words occurring in the same context tend to have a similar meaning or closely related meaning". From there, NLP has come a long way in the modeling language. The tasks NLP tries to solve are complex and multidimensional if we put them into the perspective of the machine and numerical mapping. For example, there are multiple dynamics to deal with here, like various languages and dialects to consider and tasks to solve with different levels of
Chunk 1 ¡ 1,984 chars
g way in the modeling language. The tasks NLP tries to solve are complex and multidimensional if we put them into the perspective of the machine and numerical mapping. For example, there are multiple dynamics to deal with here, like various languages and dialects to consider and tasks to solve with different levels of granularity. Combining all these dynamics in a unified representation space is a complex problem. The base starting point could be just words, as the word is one such unit of language that is quite universal. Following this thought, the most straightforward idea could be to compute the word frequency, thus constructing a count- based numerical mapping that we can think of as a compu- tationally feasible language representation. N-gram language models are representations that have led the domain of NLP for a substantial time. Later on, advancing over the distributional scheme of language representation coupled with the availabil- ity of huge computation resources, neural models became the go-to approach for all kinds of NLP tasks. Researchers investigated different theories and directions in this domain of neural language modeling before transformer-based neural networks revolutionized NLP. The transformer-based model provides exceptional text representation utilizing the multi- layered encoder blocks [2]. This is useful as text gets different meanings based on how it is used in a context. In addition, multiple languages can share a single representation space using transformers. This effectively led to the path for multilin- gual text representation, where data collected from languages existing all over the world can be accumulated in a single setting, and models can learn and perform actionable inference on a wide variety of tasks comprising language and dialectal varieties. Though a monolingual or region-focused transformer still vastly outperforms a more generalized multilingual model on most tasks, it is not always feasible to train
Chunk 2 ¡ 1,985 chars
umulated in a single setting, and models can learn and perform actionable inference on a wide variety of tasks comprising language and dialectal varieties. Though a monolingual or region-focused transformer still vastly outperforms a more generalized multilingual model on most tasks, it is not always feasible to train multiple versions of the domain-focused model. The idea is to make a shared representation space that effectively works for many languages, while the resource scarcity of specific languages should benefit from other high-resource languages. mBERT [3] and xlm-r [4] trained with multiple objective functions on more than hundreds of languages came a long way in achieving this vision. However, the full potential of a unified multilingual text representation is still a significant research problem to solve. Because, quite regularly, the inclusion of new tasks and languages in the modeling paradigm points out the fact that, when these models move beyond the monolingual scheme, the total capacity of the model gets distributed across languages, thus often resulting in capacity dilution [5]. An ideal scenario would be to have no amount of negative interference, such that we get an equitable performance across languages. Another important direction for multilingual models is to ensure the easy-expand-ability to new languages and adaptability to new tasks. Keeping all these advanced development of multilingual text representation in context, in this survey, we provide insight into the open problems and questions to look for. In addition, we discuss how the text representation model starting from the count-based vector representation of words, came to the point of a multilingual text representation model capable of performing across more than 100 languages. We structure the contents based on the following contexts: (1) How did the text representation model make the iterative progress, and which were the driving factors in each step? (2) What are the
Chunk 3 ¡ 1,989 chars
words, came to the point of a multilingual text representation model capable of performing across more than 100 languages. We structure the contents based on the following contexts: (1) How did the text representation model make the iterative progress, and which were the driving factors in each step? (2) What are the primary building blocks of a unified multilingual text representation model, and how do they vary given the difference in scenarios? (3) What are the current barriers that limit the progression of full-scale multilingual text representation? (4) The fairness and interpretability of currently available models and how equitable they serve the intended userâs utility. -- 1 of 11 -- 2 II. PRELIMINARIES A. Terminologies a) Tokenization: : A process of transforming the text into tokens of words is generally known as tokenization [6]. Not necessarily; it needs to be in words, but it can be divided into words, symbols, phrases, or sub-word tokens. b) Word segmentation: Separating the phrase, content, and keywords. Other steps include stemming, lemmatizing, handling negation, and separating punctuation. c) Embedding: Embedding is a relatively low- dimensional space where we can place the transformed high-dimensional vectors. Word embedding generally means a type of embedding space where the close-meaning words would be grouped, maintaining a close distance. d) Attention: We can interpret this mechanism as the vector of important weights [7]. An attention vector provides insight into how words in a sentence might be correlated with other words in that sentence. In addition, it determines the most useful blocks of the input in terms of transferring the contextual insights to actionable information for the target output. The attention mechanism mitigates the long-lasting disadvantage of fixed-length vectors in sequence-to-sequence models: the incapability of remembering long sentences. B. Early days of Multilingual Word Representation Multilingual Word
Chunk 4 ¡ 1,999 chars
ansferring the contextual insights to actionable information for the target output. The attention mechanism mitigates the long-lasting disadvantage of fixed-length vectors in sequence-to-sequence models: the incapability of remembering long sentences. B. Early days of Multilingual Word Representation Multilingual Word representation involves representing words from different languages through a shared space. The bilingual word embedding model [8] is one of the earlier works that uses machine translation word alignment and embeds words from the source and target language into a single vector space. During training, it imposes constraints over the distance based on the translation pairs of two languages. Later on, [9] proposed an approach that can learn bilingual word embedding given the lexicon of bilingual word pairs. It performs a linear transformation to transform the source language vectors to the target language vector space. However, when provided a lexicon of small size, this model performs poorly. To tackle this issue, [10] introduced a matching score mechanism along with the original bilingual lexicon pairs to bring the embeddings closer. C. Neural Language Model Basics In neural language modeling (NLM), we train a probabilistic classifier to predict a probability distribution where the condi- tional probability of selecting word wi is learned using various kinds of neural networks (e.g., feed-forward, recurrent, etc.): P (s) = l â i=1 P (wi|wiâ1 1 ) a) Feed-forward NLM:: A feed-forward NLM (FFN) [11] adopts the notion of an n-gram language model by assuming each word in a sequence depends on the words closer to it statistically though it fails to consider the long-term dependency. Instead of considering the dependence of the whole previous sequence, a context window (i â n + 1 to i â 1) is used for better approximation: P (wi|W iâ1 1 = P (wi|W iâ1 iân+1). The context word sequence x = [wiân+1..., wiâ1] is fed into a FFN, and later, a softmax layer over the
Chunk 5 ¡ 1,998 chars
to consider the long-term dependency. Instead of considering the dependence of the whole previous sequence, a context window (i â n + 1 to i â 1) is used for better approximation: P (wi|W iâ1 1 = P (wi|W iâ1 iân+1). The context word sequence x = [wiân+1..., wiâ1] is fed into a FFN, and later, a softmax layer over the final output matrix is used to get the output probability of P (wi|W iâ1 iân ). b) Convolutional NLM:: This one enhances the ffn by injecting a CNN layer on top of the input representation [12], which involves a sliding window of the input vectors centered on each word vector and later on, performing max-pooling on it. c) Recurrent NLM: Recurrent Neural Network (RNN) based LM [13] addresses the issue of long-term dependency problem. At every time step of RNN, the input is just the previous word vector instead of the concatenated vectors of n previous words. Meanwhile, the information of all previous words is preserved by the internal state of RNN. The most common RNN types are LSTM [14] and GRU [15]. The key problem with RNN-based language models is falling into the vanishing gradient (taking the multiplication of a large number of derivatives eventually results in a value close to zero, which can further not be used in error function calculation). This failure leads to the problem of not capturing the dependencies among words in a long sentence as the amount of computation using RNN increases when the distance among words increase in a sentence. d) Transformer Language Models: This is a non- recurrent encoder-decoder architecture with a series of attention-based blocks. An encoder prepares a contextual representation of the given input, whereas the decoder can generate output based on the output segments that are already generated. The attention mechanism is at the core of the Trans- former architecture [2]. An encoder-decoder-based transformer contains three types of attention units facilitated with the help of queries, keys, and values. Key K: This
Chunk 6 ¡ 1,996 chars
whereas the decoder can generate output based on the output segments that are already generated. The attention mechanism is at the core of the Trans- former architecture [2]. An encoder-decoder-based transformer contains three types of attention units facilitated with the help of queries, keys, and values. Key K: This is a label of a word and is used to distinguish between different words. Query Q: It represents an active request that checks all the keys and selects the one that matches the request. Value V : A value is always paired with a key, and when the query selects a key, its value is the one that propagates. A value is an information a word contains. When k, Q, &V all get generated from the same source sequence, it is self-attention (exist one in the encoder, one in the decoder). When they come from different sequences, they form a cross-attention mechanism (happens in between encoder-decoder interactions). The other type is the scaled-dot product attention mechanism that performs dot product calcu- lation between Q and K matrices. Before that, the k, Q, &V are calculated by the matrix multiplication with learned word vectors. The scaled-dot product attention then ensures that we select more information from the values where the key and the query are more similar. On top of these mechanisms, we have multi-head attention where multiple scaled dot product attentions run in parallel, thus helping the network to attend to multiple pieces of information simultaneously. In addition, there is no recurrent element in transformers. So, to learn the position information of each word, a combination of sine and cosine waves of different frequencies is used, as each position would have a unique combination of values. There -- 2 of 11 -- 3 are other essential segments like residual connections and layer normalization. Residual or skip connection adds the input to the output after a layer, allowing the gradients to flow directly through the network. Whereas the layer
Chunk 7 ¡ 1,999 chars
sed, as each position would have a unique combination of values. There -- 2 of 11 -- 3 are other essential segments like residual connections and layer normalization. Residual or skip connection adds the input to the output after a layer, allowing the gradients to flow directly through the network. Whereas the layer normalization keeps the mean of each training sample close to 0 and the standard deviation close to 1, thus stabilizing the training as well as reducing the training time. In practice, a layer from the encoder contains multi-head at- tention sub-layers and a position-wise feed-forward sub-layer. The sub-layers are connected using the residual connections. The multi-head attention sub-layer contains several attention heads. A head is a scaled dot product attention structure taking the query matrix Q, the key matrix k, and the value matrix V . So the output, Attention(Q, K, V ) = Sof tmax( QKT âdk )V M ultilhead(Q, K, V ) = [head1, head2, ..headh]W 0 where, headi = Attention(QW Q i , KW K i , V W V i ) and W Q i = linear projections, and dk = is the dimension of the query matrix. The fully connected position-wise feed-forward sub-layers contain two linear transformations with RELU: F F N (x) = W2max(0, W1x + b1) + b2 Transformers are better for dealing with long-term dependency than RNNs. Originally it was proposed to solve the problem of machine translation, but later on, it became the backbone of all kinds of NLP applications. D. Multilingual Language Modeling A general Multilingual Language Model (MLLM) archi- tecture contains an input layer, transformer layer, and output layer. ⢠input layer: sequence of tokens from a representation of one-hot subword token vocabulary (concatenated from all languages). ⢠transformer layer: each layer contains k attention heads, followed by a feed-forward neural network. ⢠output layer: contextual representation for each token. It contains a simple linear transformation followed by a Softmax that takes the last layer
Chunk 8 ¡ 1,999 chars
one-hot subword token vocabulary (concatenated from all languages). ⢠transformer layer: each layer contains k attention heads, followed by a feed-forward neural network. ⢠output layer: contextual representation for each token. It contains a simple linear transformation followed by a Softmax that takes the last layer token representation and produces the probability distribution. III. AREA TAXONOMY The multilingual text representation area taxonomy is pre- sented in Figure 1. Here we divide the taxonomy into certain primary parts, thus representing the building blocks of text representation from different perspectives. For example, Repre- sentation Type defines ways of representing text starting from the classical models to recent Transformer breakthroughs. Fur- thermore, we can divide a standard NLP training paradigm into parts like pre-training and fine-tuning. We believe tokenization, being a step from text preprocessing, deserves a separate branch of discussion because of its impact on multilingual settings. In the next section, we report a detailed discussion of the existing research in every direction of this taxonomy. IV. TAXONOMY-BASED SURVEY A. Representation Type The simplest form of text representation would be the word- count-based one that fully relies on statistical information. From there, text representation has advanced to a label, where we can represent texts from different languages in a shared space. The progression is described in detail here. a) Classical Models: Word frequency is the basis of the classical text representation model used in earlier days. We can divide these models into two parts: (1) Categorical and (2) Weighted. One-hot-encoding and Bag-of-Words (BoW) are the categorical models. In one-hot encoding, the dimension to represent texts is equal to the terms present in the vocabulary, where binary values are used to define the presence or absence of a particular term. Whereas, BoW is just an updated one- hot-encoding where we sum
Chunk 9 ¡ 1,995 chars
One-hot-encoding and Bag-of-Words (BoW) are the categorical models. In one-hot encoding, the dimension to represent texts is equal to the terms present in the vocabulary, where binary values are used to define the presence or absence of a particular term. Whereas, BoW is just an updated one- hot-encoding where we sum all the one-hot-representation of words in a sentence. However, these categorical models fail to capture semantic relations and the order of the words. A weighted text representation model known as Tf-Idf was introduced to solve this problem. A Tf (i.e., term-frequency) matrix just divides the word count by the length of a document, thus identifying how often a word occurs in a document. Whereas, Idf (i.e., inverse document frequency) matrix tries to reduce the effect of common words by putting more weights on the critical words (words that are not equally frequent in all documents, like stop words). b) Continuous Representation: One popular approach to represent text is to present it as vectors where each dimension corresponds to the frequency of words, thus resulting in a word vector. The pitfall is that the vector length might be substantial depending on the vocabulary size. In that case, adopting a dimensionality reduction procedure becomes a de- fault choice. Though these reduced vectors might be compact and efficient to compute, they contain less of the original information. Moreover, the individual dimensions no longer preserve the interpretable features that could be mapped back to the original textual building blocks. In one way, the context gets distributed throughout the vector length, thus making it a distributed representation of continuous values. In this dis- tributed/continuous representation, each dimension of a word type vector becomes a parameter to be learned and optimized based on the observable patterns in the data. We can see these parameters as continuous values that can be learned using a continuous objective function using
Chunk 10 ¡ 1,997 chars
tion of continuous values. In this dis- tributed/continuous representation, each dimension of a word type vector becomes a parameter to be learned and optimized based on the observable patterns in the data. We can see these parameters as continuous values that can be learned using a continuous objective function using iterative algorithms like gradient descent. Word2vec [35] is one such distributed vector representation of text. Word2vec considers similar meaning words like "small" and "smaller" comparatively closer in the vector space. There are two types of word2vec algorithms in practice: (1) Continuous bag of words (CBOW) (2) Skip- gram. In CBOW, context is considered the input. The neural network tries to correlate the weight matrix with each word, thus improving the representation of words through backprop- agation of the error gradient. In skip-gram, the context is estimated based on the given the word. However, both CBOW and Skip-gram were very time-consuming in practice. To solve this issue, Hierarchical softmax and negative sampling approaches were introduced. Negative sampling restricts the -- 3 of 11 -- 4 Taxonomy Evaluation xtreme [16] Tokenization Subword Segmentation wordpiece [17] bpe [18] unigram [19] sentencepiece [20] Character- Based [21], [22], [23] Tasks Semantic multi-nli [24] xnli [25] Syntactic pos [26] dep [26] ner [27] Data and Language Pretraining Data Parallel [4] Monolingual [3], [28] Language LM Objectives Monolingual mlm [3] clm [28] nsp [3] Parallel tlm [29] Training Fine-tuning Multi-Task [30], [28] Parameter- Effieient adapters [31] Gradient-Based bert [3] Pre-training Encoder-Decoder T5 [30] Decoder Only gpt [28] bloom [32] mgpt [33] Encoder Only bert [3] xlm-r [4] Representation Type Contextual elmo [34] mbert [3] gpt [28] Continuous word2vec [35] fasttext [36] glove [37] Classical one-hot tf-idf [38] Fig. 1. Area Taxonomy of Multilingual Text Representation output sum so that only a subset of the vectors get updated in each
Chunk 11 ¡ 1,996 chars
bloom [32] mgpt [33] Encoder Only bert [3] xlm-r [4] Representation Type Contextual elmo [34] mbert [3] gpt [28] Continuous word2vec [35] fasttext [36] glove [37] Classical one-hot tf-idf [38] Fig. 1. Area Taxonomy of Multilingual Text Representation output sum so that only a subset of the vectors get updated in each step, whereas, in hierarchical softmax, words are chosen based on their count-based conditional probabilities. Later on, GLOVE [37] improves significantly on word2vec by using the global contextual information by constructing the global co-occurrence matrix and factorizing it later on. However, these distributed representation methods failed to consider the out-of-vocabulary (OOV) words. This is when Fasttext [36] was proposed. Fasttext breaks the words to n-gram instead of using the full-word representation at once, thus solving the OOV issue. These models can extract syntactic and semantic information while dealing with specific corner cases. However, there is this existing issue of keeping the full context-specific representation of a document because understanding the actual context is required for most downstream tasks in NLP. c) Contextual Word Vectors: The weakness of the con- tinuous word representation model was to failing to capture the global context information on a low-dimensional scale. To solve this issue, contextual word representation models like Context2vec [39], Cove [40] and ELMO [41] were proposed later on. These advanced models solved many of the existing issues but still face catastrophic forgetting. Context2vec model is based on Word2vecâs CBOW model but replaces its average word representation within a fixed window with a better and -- 4 of 11 -- 5 more powerful Bi-directional LSTM neural network. Whereas Cove uses machine translation instead of the approach used in Word2vec (skip-gram or CBOW) or GloVe (Matrix fac- torization). They pre-train a two-layer BiLSTM for attending sequence-to-sequence translation, starting from
Chunk 12 ¡ 1,996 chars
dow with a better and -- 4 of 11 -- 5 more powerful Bi-directional LSTM neural network. Whereas Cove uses machine translation instead of the approach used in Word2vec (skip-gram or CBOW) or GloVe (Matrix fac- torization). They pre-train a two-layer BiLSTM for attending sequence-to-sequence translation, starting from GloVe word vectors. Then they took the output of the sequence encoder and called it a CoVe, followed by combining it with GloVe vectors, and used it in a downstream task-specific mode using transfer learning. On top of these models, ELMO (embeddings from language model) [34] was the one to successfully advance the contextualization of word vectors. The key idea was: while a word token will have its vector, this vector needs to depend on the nearby word contexts. This is similar to the distributional hypothesis [1]. However, unlike word-type vectors, these word token vectors combine the word-level vectors with neural network parameters going beyond the lookup table of word- type vectors. ELMO contains two neural networks: One for the left context (start of the sentence to the token) and another for the right context (from the token to the end of the sentence). It uses a recurrent neural network that was the most advanced neural network at that time and was a novel thing to introduce in language modelling. Later on, ULMFiT [42] successfully adopt the concept of transfer-learning in the contextual text representation. It segments the end-to-end process into three steps then: (i) General LM pre-training, (ii) Target task LM fine-tuning, and (iii) Target task classifier fine-tuning. Then finally, the transformer-based fine-tuning appeared in the pic- ture which proved to be more efficient and faster than LSTM or CNN for language modelling. d) Transformer Based Language Models: Transformer- based language models are more efficient than the ones with LSTM or CNN for language modeling. The first model to efficiently use the transformer architecture was GPT
Chunk 13 ¡ 1,987 chars
n the pic- ture which proved to be more efficient and faster than LSTM or CNN for language modelling. d) Transformer Based Language Models: Transformer- based language models are more efficient than the ones with LSTM or CNN for language modeling. The first model to efficiently use the transformer architecture was GPT [28] which is a decoder-based model. GPT was trained with Causal Language Modeling. The next one to efficiently represent the semantics and context was BERT [3]: Bidirectional Encoder Representations from Transformers. BERT uses a parallel attention layer instead of using sequential recurrent layers. It is an encoder model that uses masked language modelling (MLM) and next-sentence prediction (nsp) objectives in a self- supervised manner for being trained on a large pool of textual data. Here, the next sentence prediction is used to collect long- term or pragmatic information. B. Training We can largely divide the current practice of transformer- based model training into pre-training and fine-tuning. Pre- training is the training part of preparing a base language model that can further be transferred and adapted to any downstream task through fine-tuning. 1) Pre-training: Pre-training is a common step in making large language models in the current NLP paradigm. Pre- training means training a language model on extensive textual data in a self-supervised manner. Self-supervised because the pre-training objective looks for the data labels to predict, which are automatically contracted from the data itself (e.g., masked language modeling, next sentence prediction). Here we discuss some common types of pretrained transformer- based models and their specific approaches to performing the training. a) Decoder only: GPT [28] is the most common exam- ple of the decoder-only model. It uses the causal language modeling objective. Other advanced decoder only models are OPT [43], BLOOM [32], mGPT [33]. Recently, these decoder-only models are also going
Chunk 14 ¡ 1,995 chars
d models and their specific approaches to performing the training. a) Decoder only: GPT [28] is the most common exam- ple of the decoder-only model. It uses the causal language modeling objective. Other advanced decoder only models are OPT [43], BLOOM [32], mGPT [33]. Recently, these decoder-only models are also going multilingual (e.g., mGPT, BLOOM) and increasing the span of their parameters. b) Encoder Only: The most common example of encoder only models are bert [3]. It uses masked language modeling (MLM) and next sentence prediction (NSP) objectives to train on large monolingual texts. Several interesting works investigate improving the self-supervised pre-training objective of encoder-only models. For example, in [44], the authors find out that using dynamic masking during training(randomly masking 15% of token each step) time instead of static masking improves the performance. In addition, they show that NSP can be dropped completely from the training objective to get better performance. c) Encoder-Decoder: T5 [30] is a encoder-decoder model. The objective of T5 is closely related to the MLM and word dropout techniques. The difference with the original MLM is the consecutive span of corrupted or dropped tokens is replaced with one single sentinel token. The output sequence contains these dropped-out spans delimited by the sentinel tokens that replace the original text. Using this technique of denoising sequence-to-sequence objective, the decoder can predict the span of tokens in the masked position instead of just a single token. 2) Fine-tuning: Fine-tuning means adopting a pre-trained model for any downstream task. Previously, fine-tuning was only used to indicate the gradient-based training of the com- plete model. Here, we will consider any updates that make a base model further usable for any downstream task as fine- tuning. a) Gradient-based: The gradient-based finetuning means producing a whole set of new parameters by training the entire model and
Chunk 15 ¡ 1,998 chars
as only used to indicate the gradient-based training of the com- plete model. Here, we will consider any updates that make a base model further usable for any downstream task as fine- tuning. a) Gradient-based: The gradient-based finetuning means producing a whole set of new parameters by training the entire model and updating all parts for a downstream task. In the early days of transformer and monolingual language modeling, this was the go-to approach to try any model on a new task or language [3]. b) Parameter efficient: The gradient-based finetuning is a massive bottleneck if we consider the number of downstream tasks. Then there is parameter-efficient finetuning named as adapters which include updating only parts of model pa- rameters [31]. Adapters proved to help perform effective cross-lingual transfer while reducing the amount of negative interference [45]. c) Multitask and zero-shot learning: Multitask learning includes training the model on a wide variety of tasks simulta- neously instead of just one task [30]. Following this scheme, there would be less necessity for further finetuning. However, till now, the utilization of this scheme is still confined to only computationally intensive models and not in a general-purpose setting. -- 5 of 11 -- 6 C. Language Modeling Objectives We can think of the learning objectives of language mod- eling in a data-driven manner. One is monolingual objectives which generally work on monolingual data trying to figure out the representation of a missing token. Whereas in bilingual objectives, data comes from a parallel corpus. The aim is to represent similar-meaning words as close as possible. a) Monolingual Objectives: Masked language model- ing(MLM) is monolingual but surprisingly helps learn mul- tilingual models. Consider, a sentence X = (x1, x2, ..xs). In masked language modeling, token xi is replaced with ÂŻxi. So the input to the model becomes X = (x1, ... ÂŻ xi, ..xs). Now the prediction task ins Y = (xi) where xi
Chunk 16 ¡ 1,988 chars
l Objectives: Masked language model- ing(MLM) is monolingual but surprisingly helps learn mul- tilingual models. Consider, a sentence X = (x1, x2, ..xs). In masked language modeling, token xi is replaced with ÂŻxi. So the input to the model becomes X = (x1, ... ÂŻ xi, ..xs). Now the prediction task ins Y = (xi) where xi can be predicted from the final representation of ÂŻ xi [3]. Previously it was thought that MLM works well because of its ability to discover syntactic and semantic mechanisms in the pre-training stage. However, recent findings suggest MLM learns the higher order distributional statistics, thus making it a very useful prior for further fine-tuning [46]. Another common one is Causal Language Modeling (CLM). CLM is the traditional learning objective in a language model that includes estimating the probability of a word given the previous sequence of words in a sentence, thus P (xt|x1, x2.., xtâ1, θ). In other words, a model can be trained with inputs X = (x1, , , xtâ1) and outputs Y = (xt). Here xt is the output label which can be predicted from the final layer representation of token xtâ1. b) Parallel-corpora based Objectives: This is mainly a bilingual objective, where the model learns to reduce the distance of similar meaning text given a parallel corpus. As the amount of parallel corpus is much less than monolingual one, a joint objective utilizing both monolingual and parallel corpus is used. Translation language modeling (TLM) [47] is the most common one, where xA 1 , xA 2 , ...xA n is a sentence in language A and xB 1 , xB 2 , ...xB m is a sentence in language B. Now the MLM masks tokens from both sentences A and B. It tries to predict the missing token by utilizing either the surrounding tokens in A or the translation in B. Another useful objective is to use the contextual representation from the base language model given word alignment [48], [49], [50] between translations and ask the model explicitly to reduce the distance of similar
Chunk 17 ¡ 1,998 chars
s to predict the missing token by utilizing either the surrounding tokens in A or the translation in B. Another useful objective is to use the contextual representation from the base language model given word alignment [48], [49], [50] between translations and ask the model explicitly to reduce the distance of similar meaning word representation [51]. D. Data and Language a) Pre-training Data: Multilingual Language Models explore two data types primarily for pre-training: (1) Large Monolingual Data and (2) Parallel Corpus. Bert [3] uses mono- lingual data from Wikipedia, whereas XLM-R [4] uses a much larger common-crawl corpus to train the model. Language family or region-specific models [52] have their focused source of data to explore. b) Languages: BERT [3] and XLM-R [4] are two multi- lingual models having trained in more than 100 languages. One problem is that the data availability is not evenly distributed across languages. Usually, models use exponential smoothing to make the data ratio fair across high-resource and low- resource languages. The main idea is if a language i contains m% of total pre-training data, then the probability of that language is pi = k/100 where pi is exponentiated by a factor Îą. Then the resulting values are normalized to give a probability value to all the languages. Îą < 1 means the high-resourced ones will be under-sampled, whereas the low- resourced ones will be over-sampled. E. Tasks The range of mainstream NLP tasks requires models to perform transfer learning at different difficulty levels, thus requiring word, phrase, or sentence-level understanding. Es- sentially, we can frame the tasks into two main groups: a) Syntactic Task: These tasks focus on the sentence- level or word-level structure of the languages. The most common example of this type of task includes dependency parsing (DEP) [53], named entity recognition (NER) [27] and parts of speech tagging (POS) [26]. Universal dependency project [26] contains a dataset of DEP
Chunk 18 ¡ 1,999 chars
Syntactic Task: These tasks focus on the sentence- level or word-level structure of the languages. The most common example of this type of task includes dependency parsing (DEP) [53], named entity recognition (NER) [27] and parts of speech tagging (POS) [26]. Universal dependency project [26] contains a dataset of DEP and POS tasks covering more than 100 languages as part of the evaluation. b) Semantic Task: Some tasks require models to perform language-level understanding or inference that can not be answered by just following the sentence structure. Natural language inference [24], [25] that tries to predict whether a premise sentence entails, contradicts, or is neutral toward a hypothesis sentence is one such task. F. Tokenization Tokenization is splitting a sequence of characters given a document unit into a piece of singular units (e.g., tokens) based on some level of heuristics. The simplest form of heuristics would be to cut into words based on white space, thus preserving the meaning at the unit level. However, this comes with issues like phrase-level segmentation and differ- ent characterization of the similar-meaning word because of minor spelling variations. Moreover, words form differently in different languages. For example, in french, the use of apos- trophe sometimes works like the mention of a definite article (lâensemble). In German, compound words are written with- out white-space (computational modeling â Computermodel- lierung). Thus, researchers focus on sub-word and character- based splitting instead of white-space tokenization to perform a more usable segmentation. Tokenization does the text-to- numerical mapping of input, making it one of the primary steps before feeding a text distribution to a computation model. As languages can be of different scripts with different vocabulary, multilingual models accumulate all happening subwords as part of the vocabulary for the supporting languages. This becomes a limiting factor when we want to extend
Chunk 19 ¡ 1,994 chars
it one of the primary steps before feeding a text distribution to a computation model. As languages can be of different scripts with different vocabulary, multilingual models accumulate all happening subwords as part of the vocabulary for the supporting languages. This becomes a limiting factor when we want to extend the model capacity to a thousand more languages. Training monolingual tokenizers each time is not a viable option, and the original multilingual tokenizer does not have vocabulary distributional knowledge about the unseen languages. a) Subword Segmentation: Subword-based tokenizations are the most widely used tokenization for multilingual transformer-based models. Because simple whitespace-based tokenization suffers from the dimensionality bottleneck prob- lem, simple character-based splitting results in losing all -- 6 of 11 -- 7 context signals. The main difference in common sub-word- based tokenizers lies in the choice of character pairs to merge. For example, BPE [18] makes a frequency-based merging, whereas the Unigram [19] model uses a probability based merging (computing the likelihood of each subword instead of using the most frequent ones). Word-Piece [17] tries to utilize the advantage of both unigram and BPE. It merges based on likelihood instead of frequency, but the choice of words to join is based on frequency. SentencePiece [20] is another subword- based tokenizer that rules out the initial whitespace-based splitting (useful for languages like Chinese and Japanese) by considering space as just another character and then employing BPE or Ingram on top of that. In the multilingual scenario, languages can be of different scripts with different vocabulary, and multilingual models accumulate all happening subwords as part of the vocabulary from the supporting languages. Now subword-based tokenizer with a fixed vocabulary size performs unfairly for low-resource languages due to the data imbal- ance among languages resulting in excessive
Chunk 20 ¡ 1,991 chars
scripts with different vocabulary, and multilingual models accumulate all happening subwords as part of the vocabulary from the supporting languages. Now subword-based tokenizer with a fixed vocabulary size performs unfairly for low-resource languages due to the data imbal- ance among languages resulting in excessive fragmentation of subwords [54]. In addition, the fixed vocabulary becomes a limiting factor when we want to extend the model capacity to a thousand more languages. b) Character-based Model: Subword tokenizers elimi- nate the out-of-vocabulary problem to a large extent, though the reliance on static vocabulary prevents end-to-end learning across all languages. One alternative would be to use a character-based approach [21], [22]. Though this is more adaptable to the code-switched language and noisy text, they may not capture the token-level representation and convert the longer sequence to character-level representation [22], thus increasing the task complexity [23]. G. Evaluation Language models like mBERT [3] and XLM-R [4] helped in a way to shift the overall focus towards a unified gen- eral multilingual framework of language modeling. With this advancement, researchers identified the need for a unified evaluation framework that can provide insight into the overall transfer-learning capability of a multilingual model. With this aim, in [16], the authors designed an evaluation setting that ranges from syntactic to semantic tasks and structural predic- tion tasks. In each task evaluation, the training is performed in English, whereas the evaluation comprises 40 languages from 12 families, thus ensuring the typological diversity in language selection. Later, this practice of expanding over far- reach languages continued through subsequent studies [55]. However, these unified evaluation frameworks still do not en- tirely comprise languages from highly low-resource languages and dialects. V. OPEN PROBLEMS a) Tokenization free approach: In short, a
Chunk 21 ¡ 1,999 chars
language selection. Later, this practice of expanding over far- reach languages continued through subsequent studies [55]. However, these unified evaluation frameworks still do not en- tirely comprise languages from highly low-resource languages and dialects. V. OPEN PROBLEMS a) Tokenization free approach: In short, a silver-lined multilingual tokenizer is yet to be found as a monolingual subword tokenizer still outperforms a multilingual tokenizer in almost every task [56], [57]. Now monolingual Training tokenizer every time is not a feasible option and the original multilingual tokenizer does not have vocabulary distributional knowledge about the unseen languages. One option could be the effort to rule out the tokenization step completely. Several recent studies are exploring this direction. Models like CAINNE [23] are tokenization-free modules that en- code texts differently. Instead of using a vocabulary and tokenization step, CANINE encodes texts at the character level, produces character-level output, and uses a soft-max layer- based smoothing for subword projection. Recently another model PIXEL [58], treats text as an image and by doing that, bypasses the text encoding step. These methods are getting inspiring results but the full potentiality of this direction is yet to unfold. For example, pixel receives impressive results for syntactic tasks, but this is not true for semantic tasks. Further studies need to be done on the applicability of tokenization- free models in the case of zero-shot and few-shot transfers. b) Extending to new languages: In the current paradigm of MLLM, the curse-of-multilinguality [45] is an issue where the per-language performance drops unequally when the model is trained on multiple languages. In a general setting, the usual scenario is: (1) A monolingual model performs better than a multilingual model on specific tasks and language (2) The low-resourced ones perform poorly compared to the high- resourced ones in MLLM (3) Languages
Chunk 22 ¡ 1,995 chars
performance drops unequally when the model is trained on multiple languages. In a general setting, the usual scenario is: (1) A monolingual model performs better than a multilingual model on specific tasks and language (2) The low-resourced ones perform poorly compared to the high- resourced ones in MLLM (3) Languages unseen during the pre-training stage perform the worst. To tackle this issue, the straightforward approach would be to finetune the model to the specific target language. However, then, the model becomes specific to only one language though it does not increase the catastrophic forgetting in general. The authors in [59] have identified then, while dealing with unseen language and scripts, having pretrained on any closely related language usually helps. Another approach to improve the capacity of MLLM would be to augment the vocabulary with new tokens, which works better for unseen languages and improve performance for the languages already seen during pre-training [60]. Another simple approach includes mitigating the cross- lingual transfer gap by just performing training on a bit of target data amount[61]. Another helpful approach to improve the model capacity is to use adapters, modular parameter units that can be injected in every layer of a base language model. These adapter layers can then be trained on language or task- specific data, thus enabling the cross-lingual transfer without changing the base model parameters[31]. This approach can be further used for unseen language and optimized zero-shot transfer by using these adapters at the pre-training stage, which lifts the curse-of-multilingualism to a greater extent[62]. c) Model Interpretability and Shared Representation: It is hypothesized that most of the large neural models are over- parameterized [63], thus, resulting in unnecessary computation and storage costs. Effective model pruning could be a feasible approach to try in this regard. [64] is one of the earlier studies on pretrained
Chunk 23 ¡ 1,998 chars
el Interpretability and Shared Representation: It is hypothesized that most of the large neural models are over- parameterized [63], thus, resulting in unnecessary computation and storage costs. Effective model pruning could be a feasible approach to try in this regard. [64] is one of the earlier studies on pretrained bert with model pruning. The results demonstrate that the idea of lottery ticket observations [63] (i.e., if we select multiple small networks, we will end up getting the similar performance of the larger model at one time) remain relevant in this context of language modeling also. We find matching subnetworks for a range of downstream tasks at 40% to 90% sparsity. Subsequently, [65] has done LTH based empirical study on English bert. The findings suggest -- 7 of 11 -- 8 ârandomâ subnetworks are still almost as good as the âgoodâ ones, and even the âworstâ(sampled elements which didnât survive the pruning by LTH [63]) ones perform on par with a substantial baseline. [66] is one of the studies which explore LTH-based pruning in multilingual scenarios. The authors use the lottery ticket hypothesis in mBERT and conclude that the sub-networks found for different languages are similar. In addition, mBERT comprises a language-neutral sub-network shared among many languages, which is the most useful one while performing cross-lingual transfer. Another finding is for MLM tasks, the similar language & task specific sub-network are primarily identical in the lower to the middle layer. In con- trast, network similarity is visible mainly in the higher layer for NER & XNLI tasks. The prospect of different pruning-based methods is further explored in [67]. The authors study two types of model pruning strategies: regularization-based and gradient-based pruning. They also propose a method called Dynamic Sparsification to allow training the model once and then adapting to different model sizes at inference. They also used a diversity loss to prune language-specific
Chunk 24 ¡ 1,994 chars
n [67]. The authors study two types of model pruning strategies: regularization-based and gradient-based pruning. They also propose a method called Dynamic Sparsification to allow training the model once and then adapting to different model sizes at inference. They also used a diversity loss to prune language-specific subnetworks. Results show that subnetworks of different languages are indeed different. The most straightforward pruning algorithm performs the best, and A fast model does not mean it should be small. Another finding is for large multilingual models like XLM-R sharing the subnetwork for a universal representation is preferable; the language-neutral part contributes the most in cross-lingual transfer, which also aligns with the finding of [66]. [68] did one study on the Effect of Dropping Layers of Pre-trained Transformer Models where they pruned BERT, Roberta, and XLNet models up to 40% while maintaining up to 98% of their actual performance. The findings are: (i) the lower layers are most critical to maintaining downstream task performance, (ii) some tasks, such as paraphrase detection and sentence similarity, are more robust to the dropping of layers, and (iii) models trained using a different objective function exhibit different learning patterns and w.r.t the layer dropping. Another question is whether MLLM learns universal pat- terns or not. Learning universal patterns is essential for effec- tive cross-lingual transfer. From the discussion above, it can be inferred that MLLM contains particular language-specific and shared representation space. The shared representation space helps in probing-based tasks like POS tagging. However, performing complex tasks like MT is still beyond scope just by using this shared representation space. Analysis of traditional classifier-based probing methods is another heavily investigated direction that is still being explored to ensure model interpretability. For example, [69] ex- amines whether good networks
Chunk 25 ¡ 1,995 chars
er, performing complex tasks like MT is still beyond scope just by using this shared representation space. Analysis of traditional classifier-based probing methods is another heavily investigated direction that is still being explored to ensure model interpretability. For example, [69] ex- amines whether good networks include any superior linguistic knowledge or not. It results in not finding any interpretable patterns. [70] investigated Language Relationships in Multi- lingual Sentence Encoders Through the Lens of Linguistic Typology; This study looks into how languages are placed in multilingual subspace in mBERT and XLM-R. At the same time, how language specific sub-spaces within multilingual sentence encoders (LASER [71], m-BERT [3], XLM [72], and XLM-R [4]) concerning a range of typological properties pertaining to the lexical, morphological, and syntactic structure can be separated. Their results show interesting differences in encoding linguistic variation associated with different pre- training strategies. [69] is another recent study on linguistic probing using universal dependency tasks and data points. d) Fairness in Language Models: A substantial amount of work has investigated existing social bias (e.g., gender, racial, ethnic, occupational) identification and mitigation ap- proach in PLMs, including reducing token sensitivity dur- ing text generation [73], investigating model sensitivity [74], prompting using natural sentences [75] and probing via embed- ding lookup [76]. However, the state-of-the-art NLP models and datasets are still biased toward certain attributes [77], and the overall utility provided by the models is still skewed [78]. VI. CONCLUSION This survey provides insights into the current state of the art of multilingual text representation and the extent of language modeling. While many languages are already being covered under the current paradigm of language modeling, the full potential is not explored yet. There are languages left
Chunk 26 ¡ 1,990 chars
[78]. VI. CONCLUSION This survey provides insights into the current state of the art of multilingual text representation and the extent of language modeling. While many languages are already being covered under the current paradigm of language modeling, the full potential is not explored yet. There are languages left to cover. At the same time, the currently covered low-resourced ones are still not free from the impact of capacity dilution. Keeping this in focus, we did a review on the potential scope of improvement and research directions. REFERENCES [1] Z. Harris, âDistributional structure,â Word, vol. 10, no. 2-3, pp. 146â162, 1954. [Online]. Available: https://link.springer.com/chapter/10.1007/978-94-009-8467-7_1 [2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, âAttention is all you need,â 2017. [Online]. Available: https://arxiv.org/abs/1706.03762 [3] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, âBERT: Pre-training of deep bidirectional transformers for language understanding,â in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171â4186. [Online]. Available: https://www.aclweb.org/anthology/N19-1423 [4] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. GuzmĂĄn, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, âUnsupervised cross-lingual representation learning at scale,â in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, Jul. 2020, pp. 8440â8451. [Online]. Available: https://www.aclweb.org/anthology/2020.acl-main.747 [5] Z. Wang, Z. C. Lipton, and Y. Tsvetkov, âOn negative interference in multilingual models: Findings and a meta-learning treatment,â in Proceedings of the 2020
Chunk 27 ¡ 1,991 chars
uistics. Online: Association for Computational Linguistics, Jul. 2020, pp. 8440â8451. [Online]. Available: https://www.aclweb.org/anthology/2020.acl-main.747 [5] Z. Wang, Z. C. Lipton, and Y. Tsvetkov, âOn negative interference in multilingual models: Findings and a meta-learning treatment,â in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, Nov. 2020, pp. 4438â4450. [Online]. Available: https://aclanthology.org/2020.emnlp-main.359 [6] J. A. Balazs and J. D. VelĂĄsquez, âOpinion min- ing and information fusion: A survey,â Information Fu- sion, vol. 27, pp. 95â110, 2016. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1566253515000536 [7] D. Bahdanau, K. Cho, and Y. Bengio, âNeural machine translation by jointly learning to align and translate,â 2014. [Online]. Available: https://arxiv.org/abs/1409.0473 [8] W. Y. Zou, R. Socher, D. Cer, and C. D. Manning, âBilingual word embeddings for phrase-based machine translation,â in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Seattle, Washington, USA: Association for Computational Linguistics, Oct. 2013, pp. 1393â1398. [Online]. Available: https://aclanthology.org/D13-1141 [9] T. Mikolov, Q. V. Le, and I. Sutskever, âExploiting similarities among languages for machine translation,â 2013. [Online]. Available: https://arxiv.org/abs/1309.4168 -- 8 of 11 -- 9 [10] M. Zhang, H. Peng, Y. Liu, H. Luan, and M. Sun, âBilingual lexicon induction from non-parallel data with minimal supervision,â in Proceed- ings of the Thirty-First AAAI Conference on Artificial Intelligence, ser. AAAIâ17. AAAI Press, 2017, p. 3379â3385. [11] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, âA neural probabilis- tic language model,â J. Mach. Learn. Res., vol. 3, no. null, p. 1137â1155, mar 2003. [12] N.-Q. Pham, G. Kruszewski, and G. Boleda, âConvolutional neural network language
Chunk 28 ¡ 1,993 chars
Artificial Intelligence, ser. AAAIâ17. AAAI Press, 2017, p. 3379â3385. [11] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, âA neural probabilis- tic language model,â J. Mach. Learn. Res., vol. 3, no. null, p. 1137â1155, mar 2003. [12] N.-Q. Pham, G. Kruszewski, and G. Boleda, âConvolutional neural network language models,â in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, Nov. 2016, pp. 1153â1162. [Online]. Available: https://aclanthology.org/D16-1123 [13] T. Mikolov, M. KarafiĂĄt, L. Burget, J. CernockĂ˝, and S. Khudanpur, âRecurrent neural network based language model.â in INTERSPEECH, T. Kobayashi, K. Hirose, and S. Nakamura, Eds. ISCA, 2010, pp. 1045â1048. [Online]. Available: http://dblp.uni-trier.de/db/conf/interspeech/interspeech2010.html#MikolovKBCK10 [14] S. Hochreiter and J. Schmidhuber, âLong short-term memory,â Neural Comput., vol. 9, no. 8, p. 1735â1780, nov 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735 [15] K. Cho, B. van MerriĂŤnboer, D. Bahdanau, and Y. Bengio, âOn the properties of neural machine translation: Encoderâdecoder approaches,â in Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 103â111. [Online]. Available: https://aclanthology.org/W14-4012 [16] J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat, and M. Johnson, âXtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization,â CoRR, vol. abs/2003.11080, 2020. [17] M. Schuster and K. Nakajima, âJapanese and korean voice search,â in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 5149â5152. [18] R. Sennrich, B. Haddow, and A. Birch, âNeural machine translation of rare words with subword units,â in Proceedings of the 54th Annual Meeting of the Association for
Chunk 29 ¡ 1,953 chars
âJapanese and korean voice search,â in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 5149â5152. [18] R. Sennrich, B. Haddow, and A. Birch, âNeural machine translation of rare words with subword units,â in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 1715â1725. [Online]. Available: https://aclanthology.org/P16-1162 [19] T. Kudo, âSubword regularization: Improving neural network translation models with multiple subword candidates,â 2018. [Online]. Available: https://arxiv.org/abs/1804.10959 [20] T. Kudo and J. Richardson, âSentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,â in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Brussels, Belgium: Association for Computational Linguistics, Nov. 2018, pp. 66â71. [Online]. Available: https://www.aclweb.org/anthology/D18-2012 [21] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck, âLearning deep structured semantic models for web search using clickthrough data.â ACM International Conference on Information and Knowledge Management (CIKM), October 2013. [Online]. Available: https://www.microsoft.com/en-us/research/publication/learning-deep-structured-semantic-models-for-web-search-using-clickthrough-data/ [22] K. Hwang and W. Sung, âCharacter-level language modeling with hierar- chical recurrent neural networks,â 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5720â5724, 2017. [23] J. H. Clark, D. Garrette, I. Turc, and J. Wieting, âCanine: Pre-training an efficient tokenization-free encoder for language representation,â Transactions of the Association for Computational Linguistics, vol. 10, pp. 73â91, 2022. [Online].
Chunk 30 ¡ 1,990 chars
on Acoustics, Speech and Signal Processing (ICASSP), pp. 5720â5724, 2017. [23] J. H. Clark, D. Garrette, I. Turc, and J. Wieting, âCanine: Pre-training an efficient tokenization-free encoder for language representation,â Transactions of the Association for Computational Linguistics, vol. 10, pp. 73â91, 2022. [Online]. Available: https://aclanthology.org/2022.tacl-1.5 [24] A. Williams, N. Nangia, and S. Bowman, âA broad-coverage challenge corpus for sentence understanding through inference,â in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, 2018, pp. 1112â1122. [Online]. Available: http://aclweb.org/anthology/N18-1101 [25] A. Conneau, R. Rinott, G. Lample, A. Williams, S. Bowman, H. Schwenk, and V. Stoyanov, âXNLI: Evaluating cross-lingual sentence representations,â in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, Oct.-Nov. 2018, pp. 2475â2485. [Online]. Available: https://aclanthology.org/D18-1269 [26] D. Zeman, J. Nivre, M. Abrams, E. Ackermann, N. Aepli, H. Aghaei, Ĺ˝. Agi´c, A. Ahmadi, L. Ahrenberg, Ajede, and et al., âUniversal dependencies 2.9,â 2021, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ĂFAL), Faculty of Mathematics and Physics, Charles University. [Online]. Available: http://hdl.handle.net/11234/1-4611 [27] A. Rahimi, Y. Li, and T. Cohn, âMassively multilingual transfer for NER,â in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 151â164. [Online]. Available: https://www.aclweb.org/anthology/P19-1015 [28] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al.,
Chunk 31 ¡ 1,990 chars
e Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 151â164. [Online]. Available: https://www.aclweb.org/anthology/P19-1015 [28] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., âLanguage models are few-shot learners,â 2020, arXiv:2005.14165. [29] D. Zeman and J. HajiËc, Eds., Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Brussels, Belgium: Association for Computational Linguistics, Oct. 2018. [Online]. Available: https://www.aclweb.org/anthology/K18-2000 [30] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, âExploring the limits of transfer learning with a unified text-to-text transformer,â J. Mach. Learn. Res., vol. 21, no. 1, jun 2022. [31] J. Pfeiffer, I. Vuli´c, I. Gurevych, and S. Ruder, âMAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer,â in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, Nov. 2020, pp. 7654â7673. [Online]. Available: https://aclanthology.org/2020.emnlp-main.617 [32] âBigscience large open-science open-access multilingual lan- guage model,â 2022, online resource. [Online]. Available: https://huggingface.co/bigscience/bloom-560m [33] O. Shliazhko, A. Fenogenova, M. Tikhonova, V. Mikhailov, A. Kozlova, and T. Shavrina, âmgpt: Few-shot learners go multilingual,â 2022. [Online]. Available: https://arxiv.org/abs/2204.07580 [34] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, âDeep contextualized word representations,â in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational
Chunk 32 ¡ 1,998 chars
M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, âDeep contextualized word representations,â in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 2227â2237. [Online]. Available: https://aclanthology.org/N18-1202 [35] T. Mikolov, K. Chen, G. Corrado, and J. Dean, âEfficient estimation of word representations in vector space,â 2013. [Online]. Available: https://arxiv.org/abs/1301.3781 [36] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, âEnriching word vectors with subword information,â Transactions of the Association for Computational Linguistics, vol. 5, pp. 135â146, 2017. [Online]. Available: https://aclanthology.org/Q17-1010 [37] J. Pennington, R. Socher, and C. Manning, âGloVe: Global vectors for word representation,â in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1532â1543. [Online]. Available: https://aclanthology.org/D14-1162 [38] C. Sammut and G. I. Webb, Eds., TFâIDF. Boston, MA: Springer US, 2010, pp. 986â987. [Online]. Available: https://doi.org/10.1007/978-0-387-30164-8_832 [39] O. Melamud, J. Goldberger, and I. Dagan, âcontext2vec: Learning generic context embedding with bidirectional LSTM,â in Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning. Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 51â61. [Online]. Available: https://aclanthology.org/K16-1006 [40] B. McCann, J. Bradbury, C. Xiong, and R. Socher, âLearned in translation: Contextualized word vectors,â in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPSâ17. Red Hook, NY, USA: Curran Associates Inc., 2017, p. 6297â6308. [41] M. E. Peters, M. Neumann, M. Iyyer, M.
Chunk 33 ¡ 1,993 chars
B. McCann, J. Bradbury, C. Xiong, and R. Socher, âLearned in translation: Contextualized word vectors,â in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPSâ17. Red Hook, NY, USA: Curran Associates Inc., 2017, p. 6297â6308. [41] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, âDeep contextualized word representations,â in Proc. of NAACL, 2018. [42] J. Howard and S. Ruder, âUniversal language model fine- tuning for text classification,â in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 328â339. [Online]. Available: https://aclanthology.org/P18-1031 [43] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer, âOpt: Open pre-trained transformer language models,â 2022. -- 9 of 11 -- 10 [44] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, âRoberta: A robustly optimized bert pretraining approach,â 2019. [Online]. Available: https://arxiv.org/abs/1907.11692 [45] J. Pfeiffer, N. Goyal, X. V. Lin, X. Li, J. Cross, S. Riedel, and M. Artetxe, âLifting the curse of multilinguality by pre-training modular transformers,â 2022, arXiv:2205.06266. [46] K. Sinha, R. Jia, D. Hupkes, J. Pineau, A. Williams, and D. Kiela, âMasked language modeling and the distributional hypothesis: Order word matters pre-training for little,â in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 2888â2913. [Online]. Available: https://aclanthology.org/2021.emnlp-main.230 [47] A. CONNEAU and G. Lample, âCross-lingual language
Chunk 34 ¡ 1,995 chars
roceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 2888â2913. [Online]. Available: https://aclanthology.org/2021.emnlp-main.230 [47] A. CONNEAU and G. Lample, âCross-lingual language model pretraining,â in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'AlchĂŠ-Buc, E. Fox, and R. Garnett, Eds., vol. 32. Curran Associates, Inc., 2019. [Online]. Available: https://proceedings.neurips.cc/paper/2019/file/c04c19c2c2474dbf5f7ac4372c5b9af1-Paper.pdf [48] Z.-Y. Dou and G. Neubig, âWord alignment by fine-tuning embeddings on parallel corpora,â in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Online: Association for Computational Linguistics, Apr. 2021, pp. 2112â2128. [Online]. Available: https://aclanthology.org/2021.eacl-main.181 [49] D. Yarowsky, G. Ngai, and R. Wicentowski, âInducing multilingual text analysis tools via robust projection across aligned corpora,â in Proceedings of the First International Conference on Human Language Technology Research, ser. HLT â01. USA: Association for Computational Linguistics, 2001, p. 1â8. [Online]. Available: https://doi.org/10.3115/1072133.1072187 [50] C. Dyer, V. Chahuneau, and N. A. Smith, âA simple, fast, and effective reparameterization of IBM model 2,â in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Atlanta, Georgia: Association for Computational Linguistics, Jun. 2013, pp. 644â648. [Online]. Available: https://aclanthology.org/N13-1073 [51] S. Cao, N. Kitaev, and D. Klein, âMultilingual alignment of contextual word representations,â 2020. [Online]. Available: https://arxiv.org/abs/2002.03518 [52] D. Kakwani, A. Kunchukuttan, S. Golla, G. N.C., A. Bhattacharyya, M. M.
Chunk 35 ¡ 1,996 chars
ics, Jun. 2013, pp. 644â648. [Online]. Available: https://aclanthology.org/N13-1073 [51] S. Cao, N. Kitaev, and D. Klein, âMultilingual alignment of contextual word representations,â 2020. [Online]. Available: https://arxiv.org/abs/2002.03518 [52] D. Kakwani, A. Kunchukuttan, S. Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, and P. Kumar, âIndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages,â in Findings of EMNLP, 2020. [53] T. Dozat and C. D. Manning, âDeep biaffine attention for neural dependency parsing,â 2016. [Online]. Available: https://arxiv.org/abs/1611.01734 [54] Y. Tay, V. Q. Tran, S. Ruder, J. Gupta, H. W. Chung, D. Bahri, Z. Qin, S. Baumgartner, C. Yu, and D. Metzler, âCharformer: Fast character transformers via gradient-based subword tokenization,â 2021. [Online]. Available: https://arxiv.org/abs/2106.12672 [55] S. Ruder, N. Constant, J. Botha, A. Siddhant, O. Firat, J. Fu, P. Liu, J. Hu, D. Garrette, G. Neubig, and M. Johnson, âXTREME-R: Towards more challenging and nuanced multilingual evaluation,â in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 10 215â10 245. [Online]. Available: https://aclanthology.org/2021.emnlp-main.802 [56] P. Rust, J. Pfeiffer, I. Vuli´c, S. Ruder, and I. Gurevych, âHow good is your tokenizer? on the monolingual performance of multilingual language models,â in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, Aug. 2021, pp. 3118â3135. [Online]. Available: https://aclanthology.org/2021.acl-long.243 [57] S. J. Mielke, Z. Alyafeai, E. Salesky, C. Raffel, M. Dey, M. GallĂŠ, A. Raja, C. Si, W. Y. Lee, B. Sagot, and S. Tan, âBetween
Chunk 36 ¡ 1,996 chars
al Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, Aug. 2021, pp. 3118â3135. [Online]. Available: https://aclanthology.org/2021.acl-long.243 [57] S. J. Mielke, Z. Alyafeai, E. Salesky, C. Raffel, M. Dey, M. GallĂŠ, A. Raja, C. Si, W. Y. Lee, B. Sagot, and S. Tan, âBetween words and characters: A brief history of open-vocabulary modeling and tokenization in nlp,â 2021. [Online]. Available: https://arxiv.org/abs/2112.10508 [58] P. Rust, J. F. Lotz, E. Bugliarello, E. Salesky, M. de Lhoneux, and D. Elliott, âLanguage modelling with pixels,â 2022. [Online]. Available: https://arxiv.org/abs/2207.06991 [59] B. Muller, A. Anastasopoulos, B. Sagot, and D. Seddah, âWhen being unseen from mBERT is just the beginning: Handling new languages with multilingual language models,â in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, Jun. 2021, pp. 448â462. [Online]. Available: https://aclanthology.org/2021.naacl-main.38 [60] Z. Wang, K. K, S. Mayhew, and D. Roth, âExtending multilingual BERT to low-resource languages,â in Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics, Nov. 2020, pp. 2649â2656. [Online]. Available: https://aclanthology.org/2020.findings-emnlp.240 [61] A. Lauscher, V. Ravishankar, I. Vuli´c, and G. GlavaĹĄ, âFrom zero to hero: On the limitations of zero-shot language transfer with multilingual Transformers,â in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, Nov. 2020, pp. 4483â4499. [Online]. Available: https://aclanthology.org/2020.emnlp-main.363 [62] J. Pfeiffer, N. Goyal, X. Lin, X. Li, J. Cross, S. Riedel, and M. Artetxe, âLifting the curse of multilinguality by pre-training modular transformers,â in
Chunk 37 ¡ 1,994 chars
cessing (EMNLP). Online: Association for Computational Linguistics, Nov. 2020, pp. 4483â4499. [Online]. Available: https://aclanthology.org/2020.emnlp-main.363 [62] J. Pfeiffer, N. Goyal, X. Lin, X. Li, J. Cross, S. Riedel, and M. Artetxe, âLifting the curse of multilinguality by pre-training modular transformers,â in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Seattle, United States: Association for Computational Linguistics, Jul. 2022, pp. 3479â3495. [Online]. Available: https://aclanthology.org/2022.naacl-main.255 [63] J. Frankle and M. Carbin, âThe lottery ticket hypothesis: Finding sparse, trainable neural networks,â 2018. [Online]. Available: https://arxiv.org/abs/1803.03635 [64] T. Chen, J. Frankle, S. Chang, S. Liu, Y. Zhang, Z. Wang, and M. Carbin, âThe lottery ticket hypothesis for pre-trained bert networks,â 2020. [Online]. Available: https://arxiv.org/abs/2007.12223 [65] S. Prasanna, A. Rogers, and A. Rumshisky, âWhen BERT Plays the Lottery, All Tickets Are Winning,â in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, Nov. 2020, pp. 3208â3229. [Online]. Available: https://aclanthology.org/2020.emnlp-main.259 [66] N. Foroutan, M. Banaei, R. Lebret, A. Bosselut, and K. Aberer, âDiscovering language-neutral sub-networks in multilingual language models,â 2022. [Online]. Available: https://arxiv.org/abs/2205.12672 [67] Y. Li, F. Luo, R. Xu, S. Huang, F. Huang, and L. Wang, âProbing structured pruning on multilingual pre-trained models: Settings, algorithms, and efficiency,â 2022. [Online]. Available: https://arxiv.org/abs/2204.02601 [68] H. Sajjad, F. Dalvi, N. Durrani, and P. Nakov, âOn the effect of dropping layers of pre-trained transformer models,â 2021. [69] K. Sta´nczak, E. Ponti, L. T. Hennigen, R. Cotterell, and I. Augenstein, âSame neurons, different
Chunk 38 ¡ 1,985 chars
algorithms, and efficiency,â 2022. [Online]. Available: https://arxiv.org/abs/2204.02601 [68] H. Sajjad, F. Dalvi, N. Durrani, and P. Nakov, âOn the effect of dropping layers of pre-trained transformer models,â 2021. [69] K. Sta´nczak, E. Ponti, L. T. Hennigen, R. Cotterell, and I. Augenstein, âSame neurons, different languages: Probing morphosyntax in multilingual pre-trained models,â 2022. [Online]. Available: https://arxiv.org/abs/2205.02023 [70] R. Choenni and E. Shutova, âInvestigating Language Relationships in Multilingual Sentence Encoders Through the Lens of Linguistic Typology,â Computational Linguistics, pp. 1â38, 07 2022. [Online]. Available: https://doi.org/10.1162/coli_a_00444 [71] M. Artetxe and H. Schwenk, âMassively multilingual sentence embed- dings for zero-shot cross-lingual transfer and beyond,â Transactions of the Association for Computational Linguistics, vol. 7, pp. 597â610, nov 2019. [Online]. Available: https://doi.org/10.1162%2Ftacl_a_00288 [72] G. Lample and A. Conneau, âCross-lingual language model pretraining,â 2019. [Online]. Available: https://arxiv.org/abs/1901.07291 [73] P. P. Liang, C. Wu, L.-P. Morency, and R. Salakhutdinov, âTowards understanding and mitigating social biases in language models,â 2021. [Online]. Available: https://arxiv.org/abs/2106.13219 [74] A. Immer, L. Torroba Hennigen, V. Fortuin, and R. Cotterell, âProbing as quantifying inductive bias,â in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 1839â1851. [Online]. Available: https://aclanthology.org/2022.acl-long.129 [75] S. Alnegheimish, A. Guo, and Y. Sun, âUsing natural sentence prompts for understanding biases in language models,â in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Seattle, United States: Association
Chunk 39 ¡ 1,601 chars
022.acl-long.129 [75] S. Alnegheimish, A. Guo, and Y. Sun, âUsing natural sentence prompts for understanding biases in language models,â in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Seattle, United States: Association for Computational Linguistics, Jul. 2022, pp. 2824â2830. [Online]. Available: https://aclanthology.org/2022.naacl-main.203 [76] J. Ahn and A. Oh, âMitigating language-dependent ethnic bias in BERT,â in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: -- 10 of 11 -- 11 Association for Computational Linguistics, Nov. 2021, pp. 533â549. [Online]. Available: https://aclanthology.org/2021.emnlp-main.42 [77] P. Joshi, S. Santy, A. Budhiraja, K. Bali, and M. Choudhury, âThe state and fate of linguistic diversity and inclusion in the NLP world,â in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, Jul. 2020, pp. 6282â6293. [Online]. Available: https://aclanthology.org/2020.acl-main.560 [78] D. Blasi, A. Anastasopoulos, and G. Neubig, âSystematic inequalities in language technology performance across the worldâs languages,â in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 5486â5505. [Online]. Available: https://aclanthology.org/2022.acl-long.376 -- 11 of 11 --