BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models
Summary
BHASA is a holistic evaluation suite for Large Language Models (LLMs) in Southeast Asian (SEA) languages, addressing the lack of comprehensive benchmarks for non-English languages. It includes three components: an NLP benchmark with eight tasks (NER, QA, Sentiment Analysis, Toxicity Detection, Machine Translation, Summarization, NLI, Causal Reasoning), the LINDSEA linguistic diagnostic toolkit, and a cultural diagnostics dataset. The NLP benchmark is currently implemented for Indonesian, Vietnamese, Thai, and Tamil, with LINDSEA and cultural diagnostics limited to Indonesian and Tamil. GPT-4, a leading multilingual model, was evaluated as a benchmark. Results show significant improvements over GPT-3.5-Turbo, but performance varies by language, with Indonesian outperforming Vietnamese, Thai, and Tamil. GPT-4 struggles with NER, Toxicity Detection, Machine Translation, and NLI, and shows limited cultural representation and sensitivity. The study highlights the need for better linguistic and cultural evaluation of LLMs in SEA languages. BHASA is an ongoing project aiming to expand its scope and improve its components.
PDF viewer
Chunks(159)
Chunk 0 · 1,994 chars
BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models Disclaimer: This work contains potentially sensitive content. Wei Qi Leong∗ AI Singapore National University of Singapore weiqi@aisingapore.org Jian Gang Ngui∗ AI Singapore National University of Singapore Department of Linguistics, University of Arizona jiangangngui@aisingapore.org Yosephine Susanto† AI Singapore National University of Singapore yosephine@aisingapore.org Hamsawardhini Rengarajan† AI Singapore National University of Singapore hamsa@aisingapore.org Kengatharaiyer Sarveswaran Department of Computer Science, University of Jaffna sarves@univ.jfn.ac.lk William Chandra Tjhi AI Singapore National University of Singapore wtjhi@aisingapore.org Abstract The rapid development of Large Language Models (LLMs) and the emergence of novel abilities with scale have necessitated the construction of holistic, diverse and challenging benchmarks such as HELM and BIG-bench. However, at the moment, most of these bench- marks focus only on performance in English and evaluations that include Southeast Asian (SEA) languages are few in number. We therefore propose BHASA, a holistic linguistic and cultural evaluation suite for LLMs in SEA languages. It comprises three components: (1) a NLP benchmark covering eight tasks across Natural Language Understanding (NLU), Gen- eration (NLG) and Reasoning (NLR) tasks, (2) LINDSEA, a linguistic diagnostic toolkit that spans the gamut of linguistic phenomena including syntax, semantics and pragmat- ics, and (3) a cultural diagnostics dataset that probes for both cultural representation and sensitivity. For this preliminary effort, we implement the NLP benchmark only for Indone- sian, Vietnamese, Thai and Tamil, and we only include Indonesian and Tamil for LINDSEA and the cultural diagnostics dataset. As GPT-4 is purportedly one of the best-performing multilingual LLMs at the moment, we use it as a yardstick to gauge the capabilities
Chunk 1 · 1,999 chars
ary effort, we implement the NLP benchmark only for Indone- sian, Vietnamese, Thai and Tamil, and we only include Indonesian and Tamil for LINDSEA and the cultural diagnostics dataset. As GPT-4 is purportedly one of the best-performing multilingual LLMs at the moment, we use it as a yardstick to gauge the capabilities of LLMs in the context of SEA languages. Our initial experiments on GPT-4 with BHASA find it lacking in various aspects of linguistic capabilities, cultural representation and sen- sitivity in the targeted SEA languages. BHASA is a work in progress and will continue to be improved and expanded in the future. The repository for this paper can be found at: https://github.com/aisingapore/BHASA. Keywords Southeast Asian languages · Evaluation suite · Benchmarking · Linguistic diagnostics · Cultural representation · Cultural sensitivity · Large language models · Natural language processing ∗Co-first authors; corresponding authors †Major contributors; full contributions in Appendix A. -- 1 of 86 -- BHASA Contents 1 Introduction 4 2 Related work 5 2.1 Holistic NLP benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Individual evaluations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.3 Evaluation of LLMs in SEA Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3 BHASA: A NLP benchmark for SEA Languages 7 3.1 Overall Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.2 Language selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.3 Task and dataset curation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.3.1 Named entity recognition (NER) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.3.2 Question answering (QA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.3.3 Sentiment analysis . . . . . .
Chunk 2 · 1,998 chars
tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.3.1 Named entity recognition (NER) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.3.2 Question answering (QA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.3.3 Sentiment analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.3.4 Toxicity detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.3.5 Machine translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.3.6 Abstractive summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.3.7 Natural language inference (NLI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.3.8 Causal reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.4 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.4.1 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.4.2 Prompts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.4.3 Data cleaning and sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.5.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.5.2 Human evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.5.3 LLMs as NLG evaluators? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.6 Results and analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.6.1 Named entity recognition (NER) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.6.2 Question answering . . . . . . . . . . .
Chunk 3 · 1,999 chars
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.6 Results and analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.6.1 Named entity recognition (NER) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.6.2 Question answering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.6.3 Sentiment analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.6.4 Toxicity detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.6.5 Machine translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.6.6 Abstractive summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.6.7 Natural language inference (NLI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.6.8 Causal reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.6.9 Data leakage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 4 LINDSEA: Linguistic diagnostics for Southeast Asian Languages 28 2 -- 2 of 86 -- BHASA 4.1 Design principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.2 Test formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.2.1 Minimal pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.2.2 Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.2.3 Information recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.2.4 Binary choice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.3 Dataset creation and validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.3.1 Dataset creation . . . . . . . . . . . . . . . . . .
Chunk 4 · 1,999 chars
. . . . . . . . . . . . . . . . . . . . . . 31 4.2.4 Binary choice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.3 Dataset creation and validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.3.1 Dataset creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.3.2 Dataset Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.4 Experimental setup and evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.5 Results and analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.5.1 Minimal pairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.5.2 Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.5.3 Coreference resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.5.4 Pragmatic reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 5 Cultural representation and sensitivity 43 5.1 Cultural representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.1.1 Task design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.1.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 5.2 Cultural sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.2.1 Task design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 6 Conclusion, limitations and future works 50 A Author contributions 64 B Prompts 64 C Data quality issues 73 D NLG Human Evaluation guidelines 74 E LINDSEA Linguistic phenomena categories 76 F LINDSEA detailed
Chunk 5 · 1,989 chars
48 5.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 6 Conclusion, limitations and future works 50 A Author contributions 64 B Prompts 64 C Data quality issues 73 D NLG Human Evaluation guidelines 74 E LINDSEA Linguistic phenomena categories 76 F LINDSEA detailed results 78 G Cultural representation 83 H Cultural sensitivity 86 3 -- 3 of 86 -- BHASA 1 Introduction Large Language Models (LLMs) have recently been catapulted into the spotlight for their immense potential, demonstrating strong zero-shot or few-shot capabilities across a wide spectrum of Natural Language Process- ing (NLP) tasks [1, 2, 3, 4, 5]. The rapid development of these LLMs and the emergence of novel abilities with scale [6] have necessitated the construction of more difficult and diverse benchmarks such as BIG-bench [7] as well as holistic benchmarks such as HELM [8]. However, most of these benchmarks focus only on performance in English at the moment. Existing works benchmarking LLMs on non-English languages are few in number [9, 10, 11, 12, 13] and are not as holistic and diverse as HELM or BIG-bench. As such, we present BHASA, a holistic linguistic and cultural evaluation suite for LLMs in SEA languages that we are continually improving and expanding. BHASA comprises three components: (1) a NLP bench- mark covering eight tasks across Natural Language Understanding (NLU), Generation (NLG) and Reasoning (NLR) tasks, (2) LINDSEA, a linguistic diagnostic toolkit that spans the gamut of linguistic phenomena including syntax, semantics and pragmatics, and (3) a cultural diagnostics dataset that probes for both cultural representation and sensitivity. The NLP benchmark covers Named Entity Recognition (NER), Question Answering, Sentiment Analysis, Toxicity Detection, Machine Translation, Abstractive Summarization, Natural Language Inference (NLI) and Causal Reasoning. In this first version of the benchmark, we include only Indonesian,
Chunk 6 · 1,998 chars
representation and sensitivity. The NLP benchmark covers Named Entity Recognition (NER), Question Answering, Sentiment Analysis, Toxicity Detection, Machine Translation, Abstractive Summarization, Natural Language Inference (NLI) and Causal Reasoning. In this first version of the benchmark, we include only Indonesian, Vietnamese, Thai and Tamil as target languages in the NLP benchmark due to resource constraints. Furthermore, given that LLMs are now touted as being able to interact with users purely through natural language, the systematic evaluation of natural language proficiency becomes even more crucial. We therefore include LINDSEA as the second component in BHASA. LINDSEA is a handcrafted linguistic diagnostic benchmark designed by linguists to holistically and quantitatively evaluate a model’s understanding of various linguistic phenomena specific to SEA languages. LINDSEA is designed to have a broad and deep coverage of linguistic features including syntax, semantics and pragmatics to allow for fine-grained analyses and precise pinpointing of a model’s linguistic weaknesses. In this first version of the benchmark, we release both an Indonesian dataset as well as an early version of the Tamil dataset that will be expanded in a future update of this paper. Additionally, given the importance of cultural representation and sensitivity in the use of LLMs, we also conduct preliminary experiments on the matter with a set of Indonesian and Tamil culture-specific questions as the third component in BHASA. At the time of this writing, GPT-4 is arguably among the best-performing LLMs available [14] that is touted as explicitly supporting multilingual capabilities, which importantly includes some SEA languages [15]. As such, it could be a good candidate for getting a sense of how well LLMs can currently perform in the context of SEA languages. We therefore evaluate GPT-4 on BHASA to investigate the current state-of-the- art performance on SEA languages, including also
Chunk 7 · 1,995 chars
abilities, which importantly includes some SEA languages [15]. As such, it could be a good candidate for getting a sense of how well LLMs can currently perform in the context of SEA languages. We therefore evaluate GPT-4 on BHASA to investigate the current state-of-the- art performance on SEA languages, including also results on its immediate predecessor GPT-3.5-Turbo as a reference and to enable insights into the growth trajectory of this family of models. We reiterate that BHASA is designed to be model-agnostic, even if only GPT-3.5-Turbo and GPT-4 are evaluated in this paper. Our initial findings, based on GPT-4’s evaluation for the time being as a proxy representation of the current state-of-the-art multilingual performance in LLMs, show that: 1. In general, there have been significant improvements in zero-shot SEA language capabilities from the time of GPT-3.5-Turbo to that of GPT-4, with GPT-4 being able to better follow instructions in native prompts, sometimes even benefiting from them. 2. Among our target SEA languages, LLM performance decreases in the order of Indonesian, Viet- namese, Thai and Tamil. 3. Performance in Question Answering, Sentiment Analysis, Abstractive Summarization and Causal Reasoning in our target languages is quite good, but there is still much room for improvement in Named Entity Recognition (NER), Toxicity Detection, Machine Translation and Natural Language Inference (NLI). 4. Idiosyncratic linguistic features of Indonesian and Tamil are not well-represented in LLMs. 5. There is a conspicuous lack of cultural representation and sensitivity with regard to Indonesian and Tamil culture in LLMs. 4 -- 4 of 86 -- BHASA 2 Related work 2.1 Holistic NLP benchmarks With the advent of transformer-based pre-trained models [16] starting with BERT [17], GLUE [18] and SuperGLUE [19] have become mainstays in the multi-task evaluation of NLP models on Natural Language Understanding (NLU) tasks. GLUE also contains a diagnostic dataset covering
Chunk 8 · 1,991 chars
2 Related work 2.1 Holistic NLP benchmarks With the advent of transformer-based pre-trained models [16] starting with BERT [17], GLUE [18] and SuperGLUE [19] have become mainstays in the multi-task evaluation of NLP models on Natural Language Understanding (NLU) tasks. GLUE also contains a diagnostic dataset covering syntax, semantics, reasoning skills and knowledge that allows for a targeted diagnosis of a model‘s strengths and weaknesses. However, these benchmarks were quickly saturated, with models outperforming humans, often by large margins [20]. HELM [8] was an effort to holistically evaluate LLMs given the extent of the gamut of tasks that they are capable of handling. It introduced the idea that accuracy is no longer sufficient in the current age, and that other metrics such as robustness, fairness, bias, toxicity, calibration and efficiency are equally important to be measured. It also included targeted evaluations of language, reasoning, knowledge, memorization, disinformation, bias and toxicity. While it was a very ambitious study that evaluated many models on many scenarios and metrics, it predated GPT-3.5-Turbo and GPT-4 and only focused on English given the massive scope that was covered. BIG-Bench [7] was another effort in response to the emergent abilities of LLMs [6] to crowdsource innovative tasks to explore yet undiscovered capabilities of these models. It contains a few tasks for evaluating certain aspects of natural language use, but they are much smaller in scale and diversity compared to BLiMP [21] which was included in HELM. OpenAI also open-sourced their own crowdsourced benchmark, OpenAI Evals,1 for users to report shortcomings in their models.2 They cover a wide range of tasks across many languages, but only have two tasks related to SEA languages at the point of writing this paper – Indonesian written number to Arabic numeral conversion and Vietnamese examination multiple- choice questions. We mainly draw inspiration from HELM in
Chunk 9 · 1,995 chars
rtcomings in their models.2 They cover a wide range of tasks across many languages, but only have two tasks related to SEA languages at the point of writing this paper – Indonesian written number to Arabic numeral conversion and Vietnamese examination multiple- choice questions. We mainly draw inspiration from HELM in aspiring to be holistic but we also go beyond HELM and BIG-Bench or OpenAI Evals in designing a more extensive linguistic diagnostic toolkit for SEA languages as well as cultural diagnostics. 2.2 Individual evaluations Reasoning is extremely important in decision-making, making it a key skill to have for LLMs especially with the dawn of the new paradigm of using LLMs as task-planning agents [22]. [23] conducted a comprehensive study comparing ChatGPT-3.5 and ChatGPT-4’s zero-shot reasoning capabilities, although they do not include datasets that explicitly target pragmatic reasoning. LogiEval [24] tested ChatGPT, ChatGPT-4, and GPT-4 on English and Chinese reasoning tasks, but did not target any SEA language and also did not target pragmatic reasoning. Studies that did include pragmatic reasoning did not cover SEA languages and were not tested on GPT-3.5-Turbo or GPT-4. For example, multiple works have introduced datasets to test for scalar implicatures [25, 26, 27, 28], while others have introduced datasets for presuppositions, such as IMPPRES [25] and NOPE [29]. We draw inspiration from their datasets in crafting our own dataset for pragmatic reasoning in SEA languages. Natural Language Generation (NLG) evaluations on LLMs often involve machine translation and summa- rization. [30] comprehensively evaluated the machine translation capabilities of GPT models (ChatGPT, text-davinci-003, and text-davinci-002) on 18 translation directions with both high and low-resource lan- guages, using overlap metrics (BLEU3 and ChrF [31]), neural network-based metrics, both with references (COMET-22 [32]) and without references (COMETkiwi [33]), as well as human
Chunk 10 · 1,992 chars
apabilities of GPT models (ChatGPT, text-davinci-003, and text-davinci-002) on 18 translation directions with both high and low-resource lan- guages, using overlap metrics (BLEU3 and ChrF [31]), neural network-based metrics, both with references (COMET-22 [32]) and without references (COMETkiwi [33]), as well as human evaluation. However, they did not include SEA languages and the study was published before the release of GPT-4. In addition, [34] point out that translating texts containing idioms is challenging for Transformer-based systems. They pro- posed IdiomKB, a multilingual knowledge base for English, Mandarin and Japanese idiomatic expressions and evaluated ChatGPT-3.5 on translating these idioms using BLEU, COMETkiwi and GPT-4. However, they do not include SEA languages. As for summarization, [35] evaluated InstructGPT [36] and other LLMs on English abstractive news summarization using human evaluation. Annotators were asked to rate sum- maries on their faithfulness, coherence and relevance, in accordance with the guidelines in [37]. However, the study again does not cover SEA languages and did not evaluate GPT-3.5-Turbo or GPT-4. Given that linguistic theory can inform us on models’ linguistic representations [38], many works introducing linguistic diagnostics for language models have been published, but they mainly focus on English. Minimal pairs have been used in various studies to diagnose model weaknesses in syntax. For example, the SyntaxGym [39, 40] platform was designed to help centralize the process of targeted syntactic evaluation using minimal 1https://github.com/openai/evals/ 2https://openai.com/research/gpt-4 3https://github.com/mjpost/sacrebleu 5 -- 5 of 86 -- BHASA pairs. BLiMP [21] and its descendants, CLiMP [41] and JBLiMP [42] also contain minimal pairs covering various syntactic phenomena. However, they only cover English, Chinese and Japanese respectively and do not cover SEA languages. Furthermore, the grammatically acceptable
Chunk 11 · 1,989 chars
b.com/mjpost/sacrebleu 5 -- 5 of 86 -- BHASA pairs. BLiMP [21] and its descendants, CLiMP [41] and JBLiMP [42] also contain minimal pairs covering various syntactic phenomena. However, they only cover English, Chinese and Japanese respectively and do not cover SEA languages. Furthermore, the grammatically acceptable sentences in the minimal pairs can sometimes sound unnatural and odd, especially for BLiMP and CLiMP which were semi-automatically generated with templates and a lexicon. Finally, the issue of cultural representation and sensitivity has become increasingly important for the field of NLP with the introduction of LLMs. [43] discuss the challenges in preventing cultural hegemony in NLP systems and propose that the NLP community should aim for cross-cultural representation and preservation of cultural values. In the same line of thinking, [44] surveyed the risks of disproportionate cultural repre- sentation and sensitivity, raising issues such as the imposition of norms and values from a single originating culture and amplification of marginalization. They also point out the difficulties surrounding evaluation of such issues owing to the fact that there is no singular standard for cultural values and sensitivity. Never- theless, attempts have been made at evaluating models for cultural biases. For example, Anthropic released GlobalOpinionQA [45], a large dataset containing 2,556 multiple-choice questions and responses on global issues across different countries. Their study found that their LLM responses tended to be more similar to the opinions of certain populations from the USA and some European or South American countries. Overall, to the best of our knowledge, there are no studies specifically targeting SEA languages for cultural represen- tation. Given that GPT-4 is arguably one of the best-performing multilingual models at the time of writing, we conduct a preliminary study to evaluate the adequacy of GPT-4’s Indonesian and Tamil cultural
Chunk 12 · 1,998 chars
ll, to the best of our knowledge, there are no studies specifically targeting SEA languages for cultural represen- tation. Given that GPT-4 is arguably one of the best-performing multilingual models at the time of writing, we conduct a preliminary study to evaluate the adequacy of GPT-4’s Indonesian and Tamil cultural repre- sentation and sensitivity, which gives us a tentative yardstick to see how such representation and sensitivity can be further improved. In addition, we intend to design a model-agnostic systematic study of cultural sensitivity in the future. 2.3 Evaluation of LLMs in SEA Languages Although much work has sprung up in recent times evaluating the performance of LLMs in English, only a handful of studies include SEA languages in their benchmarks. One of the first studies on ChatGPT that did so evaluated its zero-shot performance on Indonesian sentiment analysis and machine translation between English and Indonesian, Javanese and Sundanese [11]. The study used human evaluation for translation but was unclear on the protocol, using only a vague binary choice of correct or incorrect for each sentence. Furthermore, only small samples of data between 30 to 200 sentences or documents were used for each task, which may be insufficient to draw statistically significant conclusions. Following that study, OpenAI published the GPT-4 Technical Report [15] which reported strong multilingual 3-shot performance of GPT-4 on a machine-translated version of MMLU [46], including Indonesian and Thai. However, essential details of the evaluation were not revealed and it remained unclear whether GPT-4 would be able to perform equally well on non-translated test data. [12] evaluated ChatGPT with both English and machine translated prompts in a zero-shot setting on multiple tasks in multiple languages, including part-of-speech (POS) tagging, NLI, question answering, commonsense reasoning, and abstractive summarization in Vietnamese, Thai and Tamil. MEGA [9] followed soon
Chunk 13 · 1,997 chars
test data. [12] evaluated ChatGPT with both English and machine translated prompts in a zero-shot setting on multiple tasks in multiple languages, including part-of-speech (POS) tagging, NLI, question answering, commonsense reasoning, and abstractive summarization in Vietnamese, Thai and Tamil. MEGA [9] followed soon after, aspiring to perform comprehensive multilingual benchmarking in the wake of HELM. They covered a wide range of tasks, languages and LLMs in their work, even including various prompting strategies. Their scope of tasks is the most similar to the NLP benchmark component of BHASA, covering NER, question answering, NLI, causal reasoning and abstractive summarization in our target lan- guages. However, they do not cover machine translation at all, and the toxicity classification and sentiment analysis tasks do not cover the languages that we focus on. Furthermore, although they do evaluate GPT- 3.5-Turbo and GPT-4, they only evaluated GPT-4 on NLI and question answering. A key finding from the study was that machine-translated prompts often contain translation errors which influence task perfor- mance negatively, and therefore it was recommended that machine-translated prompts be verified by native speakers. We took this into consideration and asked native speakers to manually translate all our prompts. BUFFET [10] was also established to evaluate language models across a broad range of tasks and languages, but was more targeted at evaluating few-shot cross-lingual transfer. Unlike MEGA and BUFFET, BHASA focuses on zero-shot instead of few-shot evaluations. Finally, M3Exam [13] was released as a multilingual, multimodal and multilevel benchmark for LLMs. It comprises 12,317 multiple-choice examination paper questions across 9 languages, including Vietnamese, Thai and Javanese and was tested on GPT-3.5-Turbo, GPT-4, BLOOM and Vicuna. However, it is more similar to MMLU and does not specifically target any aspect of NLU or NLG. Overall, although there
Chunk 14 · 1,995 chars
enchmark for LLMs. It comprises 12,317 multiple-choice examination paper questions across 9 languages, including Vietnamese, Thai and Javanese and was tested on GPT-3.5-Turbo, GPT-4, BLOOM and Vicuna. However, it is more similar to MMLU and does not specifically target any aspect of NLU or NLG. Overall, although there have been a few endeavors to measure multilingual performance of LLMs, the evalu- ations done for SEA languages often only cover tasks for which multilingual datasets are more conveniently 6 -- 6 of 86 -- BHASA available, do not include linguistic diagnostics or cultural diagnostics, often use machine-translated prompts which are known to have translation errors that negatively impact models’ performance, and also usually have not been tested on GPT-4. 3 BHASA: A NLP benchmark for SEA Languages 3.1 Overall Design In order to fill in the gaps in the literature on LLM performance in SEA languages, we designed BHASA (Benchmark for Holistic Evaluation of Generative AI in SoutheastAsian Languages),4 a benchmark that aims to provide a holistic assessment of LLM abilities in SEA languages. BHASA comprises 3 major compo- nents (see Figure 1), (1) a NLP benchmark, (2) LINDSEA, a linguistic diagnostic dataset, and (3) a cultural diagnostic dataset. In this section (Section 3), we will look at the first component, the NLP benchmark. For our NLP benchmark, we take inspiration from HELM [8] in developing a taxonomy of tasks before proceeding with the selection of tasks and datasets. We define key competencies of a language model as a trinity – Natural Language Understanding (NLU), Natural Language Generation (NLG) and Natural Language Reasoning (NLR). Although NLU and NLG have already been taxonomized quite well in other benchmarks, NLR is often not included as a category. We therefore created a NLR taxonomy based on existing works [23, 47, 48, 49, 50]. As can be seen from Figure 2, NLR tasks can be classified into two main categories – logical reasoning and
Chunk 15 · 1,996 chars
(NLR). Although NLU and NLG have already been taxonomized quite well in other benchmarks, NLR is often not included as a category. We therefore created a NLR taxonomy based on existing works [23, 47, 48, 49, 50]. As can be seen from Figure 2, NLR tasks can be classified into two main categories – logical reasoning and higher-level reasoning. Logical reasoning can be further divided into deductive and non-deductive reasoning (which includes inductive, abductive and analogical reasoning). On the other hand, higher-level reasoning tasks do not target a single type of reasoning skill, but rather require a combination of them, sometimes requiring certain types of knowledge such as in commonsense reasoning. Natural Language Inference (NLI) and Question Answering are usually classified as NLU tasks, but they actually refer more to the task format than the skill tested [50]. As such, certain NLI or Question Answering tasks may fall under the umbrella of NLR if they target certain reasoning skills. Unlike HELM [8], KoLA [51], BIG-Bench [7] and some other benchmarks, which explicitly test for the knowledge of language models, we were unable to include such tests in BHASA as datasets are scarce. For the time being, we have created a small dataset for testing cultural knowledge (see Section 5), but we intend to look into knowledge tasks for SEA languages in future iterations of BHASA. We design our NLP benchmark to evaluate zero-shot performance of LLMs using both English and native prompts. As we want the benchmark to be of high quality, we ensure that prompts are translated manually by native speakers and that datasets chosen are sourced natively or at least professionally translated. The following sections will go further into detail about how we designed the benchmark and the rationale behind our decisions. 3.2 Language selection While we would ideally have liked to include all official ASEAN languages in this benchmark evaluation, we had to accept the reality that many
Chunk 16 · 1,996 chars
professionally translated. The following sections will go further into detail about how we designed the benchmark and the rationale behind our decisions. 3.2 Language selection While we would ideally have liked to include all official ASEAN languages in this benchmark evaluation, we had to accept the reality that many languages in SEA are low-resource languages and therefore did not have sufficient datasets for all languages. In the end, we decided to include Indonesian, Vietnamese, Thai and Tamil after considering both resource availability and typological diversity. We will extend the evaluation to other SEA languages gradually. Table 1 below summarizes the features of the four target languages. 3.3 Task and dataset curation We selected the tasks for the benchmark based on whether there was a dataset of reasonable quality available for at least 3 of the 4 languages targeted. We selected datasets that contained data originally in the target language as far as possible and where translated datasets were used, we ensured that they were translated by professional translators or that elaborate effort has been shown to ensure that the translation was done accurately. This is important because translated datasets often contain translationese [53] which can differ significantly from natively written text [54, 55, 56, 57]. 4The name of the benchmark is a play on the Sanskrit word भाषा (bh¯as. ¯a), which means language, as a nod to the influence Sanskrit has had on many of the languages in Southeast Asia. 7 -- 7 of 86 -- BHASA Figure 1: The BHASA benchmark and its three components Figure 2: Taxonomy of Natural Language Reasoning tasks At the end of the data collection and selection process, we finalized a set of 8 tasks with datasets available for all 4 languages with the exception of Toxicity Detection for which we were unable to find a Tamil dataset of reasonable quality. Table 2 below summarizes the tasks and the details of the respective datasets. 8 -- 8 of 86 --
Chunk 17 · 1,999 chars
a collection and selection process, we finalized a set of 8 tasks with datasets available for all 4 languages with the exception of Toxicity Detection for which we were unable to find a Tamil dataset of reasonable quality. Table 2 below summarizes the tasks and the details of the respective datasets. 8 -- 8 of 86 -- BHASA Code Family Script Word Order Morphological Typology Resource Class Name Type Indonesian id Austronesian Latin Alphabet SVO Agglutinative 3 Vietnamese vi Austroasiatic Modified Latin Alphabet SVO Analytic 4 Thai th Tai-Kadai Thai Abugida SVO Analytic 3 Tamil ta Dravidian Tamil Abugida SOV Agglutinative 3 Table 1: Linguistic details of the 4 target languages of BHASA. Resource classes are obtained from [52], with 5 being languages with the most resources (e.g. English) and 0 being languages with virtually no resources (e.g. Warlpiri) Competency Task Dataset Language Output Metrics Domain Translation NLU NER NERGrit id 3 classes F1 Wikipedia Native VLSP2018 vi 4 classes F1 News Native ThaiNERv2 th 13 classes F1 News Native Naamapadam ta 3 classes F1 General Native QA TyDiQA id span F1 Wikipedia Native XQuAD vi, th span F1 Wikipedia PT IndicQA ta span F1 Wikipedia Native Sentiment Analysis NusaX id 3 classes F1 Social Media Native UIT-VSFC vi 3 classes F1 Reviews Native Wisesight th 4 classes F1 Social Media Native IndicSentiment ta 2 classes Accuracy Reviews PT Toxicity Detection MLHSD id 3 classes F1 Twitter Native ViHSD vi 3 classes F1 Social Media Native Thai Toxicity Tweet th 2 classes F1 Twitter Native NLG Machine Translation FLORES id, vi, th, ta translation ChrF++, COMET22, Human News PT Abstractive Summarization XLSum id, vi, th, ta summary ROUGE-L, Human News Native NLR NLI IndoNLI id 3 classes Accuracy Wikipedia, News Native XNLI vi, th 3 classes Accuracy General PT IndicXNLI ta 3 classes Accuracy General AT Causal Reasoning XCOPA id, vi,
Chunk 18 · 1,992 chars
translation ChrF++, COMET22, Human News PT Abstractive Summarization XLSum id, vi, th, ta summary ROUGE-L, Human News Native NLR NLI IndoNLI id 3 classes Accuracy Wikipedia, News Native XNLI vi, th 3 classes Accuracy General PT IndicXNLI ta 3 classes Accuracy General AT Causal Reasoning XCOPA id, vi, th, ta 2 choices Accuracy General PT Table 2: Details of the 8 tasks and their individual datasets in the BHASA benchmark In the following subsections, we will describe each of the 8 benchmark tasks. The English prompt used for each task will be shown as well, but due to space constraints, all translated prompts have been placed in Appendix B (Table 46). 3.3.1 Named entity recognition (NER) Named Entity Recognition (NER) is a sequence labeling task which involves detecting and classifying named entities in a text. The only multilingual dataset containing Southeast Asian languages that we know of is WikiAnn [58]. Despite the widespread use of WikiAnn in multilingual benchmarks such as XTREME [59] and, more recently, BUFFET [10], its quality as a benchmark dataset has been called into question [60], given that the data is automatically derived and was even termed as a “silver standard” by the original authors. In view of this, we opt for individual monolingual NER benchmark datasets used by the NLP communities of the respective languages. For Indonesian, we included NER-Grit which was introduced in the IndoNLU benchmark [61], and for Vietnamese, we chose the VLSP 2018 NER Shared Task [62]. For Thai, we included ThaiNER (version 2.0) [63] which was also used in the evaluation of WangchanBERTa [64]. Finally, we extracted the Tamil subset of the Naamapadam multilingual NER dataset [65] for Tamil. Although we endeavored to measure the zero-shot performance of the models without in-context learning (as stated in Section 3.1), we could not avoid providing a single example for the NER task in order to specify the required format of the model output. We
Chunk 19 · 1,989 chars
set
of the Naamapadam multilingual NER dataset [65] for Tamil.
Although we endeavored to measure the zero-shot performance of the models without in-context learning (as
stated in Section 3.1), we could not avoid providing a single example for the NER task in order to specify
the required format of the model output. We modify GPT-NER’s [66] prompt slightly to include the entity
type in the output.
9
-- 9 of 86 --
BHASA
Prompt:
Perform Named Entity Recognition (NER) on the following text using only the following labels: {LABELS}.
Please follow the format given in the following example:
Example Text: Asia is a continent.
Example Answer: @@PLACE@@Asia## is a continent.
Text: {TEXT}
Answer:
Table 3: Prompt template for the NER task
3.3.2 Question answering (QA)
Question Answering (QA) is an important NLP task associated with many user-facing applications, such as
chatbots and web search. However, despite its importance, the task is actually not well-defined and covers
a wide range of tasks. In fact, any NLP task could arguably be reformulated as a QA task [67]. For the
purposes of this paper, we define QA to refer strictly to the task of machine reading comprehension (MRC),
a subset of textual QA, which requires a system to answer questions using a text as context. As reasoning
skills will be tested under the NLR section of the benchmark, we did not endeavor to source for QA datasets
that tested for reasoning skills (unlike HELM).
We follow existing work [9, 10, 12] in choosing the TyDi QA dataset [68] for Indonesian, XQuAD [69] for
Vietnamese and Thai, as well as IndicQA [70] for Tamil. In the case of TyDi QA, we use the minimal answer
span task instead of the passage selection task to align the task with XQuAD and IndicQA which are both
minimal span extraction tasks as well. IndicQA also contains unanswerable questions for which the answer
cannot be found within the provided text.
Prompt:
You will be given a paragraph and a question.
Answer the question byChunk 20 · 1,998 chars
k instead of the passage selection task to align the task with XQuAD and IndicQA which are both
minimal span extraction tasks as well. IndicQA also contains unanswerable questions for which the answer
cannot be found within the provided text.
Prompt:
You will be given a paragraph and a question.
Answer the question by extracting the answer from the paragraph.
Paragraph: {TEXT}
Question: {QUESTION}
Answer:
Table 4: Prompt template for the QA task
3.3.3 Sentiment analysis
Sentiment Analysis is a NLU task that involves the identification of the sentiment of a text. It is highly
relevant to many industrial applications such as for understanding market sentiment in financial settings or
for understanding consumer sentiment towards products. Most datasets define sentiment labels as ‘positive’,
‘negative’ or ‘neutral’. While there is also a more fine-grained sentiment analysis task known as aspect-based
sentiment analysis which involves classifying the sentiment with respect to one or more aspects in a text, we
decided to leave it for future work due to the lack of datasets and the cost of running additional evaluations.
We use the NusaX dataset [71] for Indonesian, the UIT-VSFC dataset [72] for Vietnamese, the Wisesight
Sentiment dataset [73] for Thai and IndicSentiment dataset [70]for Tamil.
3.3.4 Toxicity detection
Toxicity Detection involves the identification of toxic content such as hate speech and abusive language in
text. It is an increasingly important task in NLP due to the need for content moderation at scale, for
example on social media platforms [8]. However, most studies have only focused on the Western world and
there has hitherto been no other effort to evaluate LLMs on this task for SEA languages. We therefore seek
10
-- 10 of 86 --
BHASA
Prompt:
What is the sentiment of the following sentence?
{TEXT}
Answer only with a single word: Positive/Negative/Neutral
Table 5: Prompt template for the Sentiment Analysis task
to understand whether LLMs are able toChunk 21 · 1,997 chars
r effort to evaluate LLMs on this task for SEA languages. We therefore seek
10
-- 10 of 86 --
BHASA
Prompt:
What is the sentiment of the following sentence?
{TEXT}
Answer only with a single word: Positive/Negative/Neutral
Table 5: Prompt template for the Sentiment Analysis task
to understand whether LLMs are able to perform toxicity detection in SEA languages given that toxicity
triggers can differ between Western and Eastern contexts [74].
We use the Indonesian Twitter Multi-label Hate Speech and Abusive Language Detection Dataset (hence-
forth the MLHSD dataset) [75] for Indonesian, the ViHSD Dataset [76] for Vietnamese, and Thai Toxicity
Detection Dataset [77] for Thai. As the MLHSD dataset was designed as a multi-label classification task as
opposed to the multi-class classification task in Vietnamese and Thai, we opted to map the labels for the
MLHSD dataset such that every text can only be one of three labels, ‘Clean’, ‘Abusive’ or ‘Hate’, in order
to better align the task across the three languages.
As the definition of toxicity can differ across datasets, we consulted the paper accompanying each dataset to
ensure that we define each label in the prompt according to the definitions stipulated by the paper authors.
Prompt:
You are a hate speech detector. The label definitions are as follows:
Clean: No harassment.
Abusive: Involves harassment and even profanity, but does not attack any specific object.
Hate: Directly harasses or abuses a specific object.
When given a sentence, classify it with a single word: Clean/Abusive/Hate.
Sentence: {TEXT}
Table 6: Prompt template for the Toxicity Detection task
3.3.5 Machine translation
LLMs have demonstrated great potential in machine translation [30], but mainly only for high-resource
languages, while the performance for low-resource languages still lags far behind supervised baselines [11,
30, 78]. Given OpenAI‘s claim that GPT-4 has improved multilingual performance compared to GPT-3.5
[15], it would be importantChunk 22 · 1,993 chars
rated great potential in machine translation [30], but mainly only for high-resource
languages, while the performance for low-resource languages still lags far behind supervised baselines [11,
30, 78]. Given OpenAI‘s claim that GPT-4 has improved multilingual performance compared to GPT-3.5
[15], it would be important for us to verify its machine translation abilities, especially on SEA languages
which are generally considered to be relatively low-resource languages. We use the FLORES-200 dataset
[79] for this task as it contains all four of our target languages and is a parallel dataset, allowing for fairer
comparisons across languages. Stringent processes were put in place for quality control and for the selection
of translators and reviewers to ensure the quality of the dataset, hence our decision to use it.
Prompt:
Translate the following text into {LANGUAGE}.
Text: {TEXT}
Translation:
Table 7: Prompt template for the Machine Translation task
11
-- 11 of 86 --
BHASA
3.3.6 Abstractive summarization
Abstractive Summarization requires a system to read a document, identify the key points within, and sum-
marize them into a coherent and fluent text while paraphrasing the document. It is a much more challenging
task compared to Extractive Summarization which involves extracting the most salient span(s) of text in
a document verbatim, and therefore will require much stronger NLG abilities. For this task, we chose the
XLSum dataset [80] which covers all 4 of our target languages.
Prompt:
Article: {TEXT}
Summarize this {LANGUAGE} article in 1 or 2 sentences. The response should be in {LANGUAGE}.
Summary:
Table 8: Prompt template for the Abstractive Summarization task
3.3.7 Natural language inference (NLI)
NLI is a task that involves determining whether a given premise entails or contradicts a hypothesis, typically
using a 3-way classification scheme with the labels ‘entailment’, ‘contradiction’ and ‘neutral’. It is a rather
vague task that is defined more by itsChunk 23 · 1,996 chars
Summarization task
3.3.7 Natural language inference (NLI)
NLI is a task that involves determining whether a given premise entails or contradicts a hypothesis, typically
using a 3-way classification scheme with the labels ‘entailment’, ‘contradiction’ and ‘neutral’. It is a rather
vague task that is defined more by its format than its content. For example, the SNLI dataset has been
analyzed to contain three major categories of premise-hypothesis entailment problems, namely paraphrasing,
compound semantics understanding, and reasoning with implicit premises [50]. This has caused NLI to be
regarded both as a NLU and as a NLR task. Critics have commented that the use of the label ‘entailment’
is inappropriate given that its technical definition in linguistics and logic requires that the hypothesis follow
necessarily from the premise in all situations [81]. This is not the case in many NLI datasets since they
often involve presuppositions, implicatures, and world knowledge in addition to logical entailment [82]. Nev-
ertheless, we categorize NLI as a higher-level reasoning task under NLR, acknowledging that multiple skills
are being tested and carry out more pinpointed evaluations of reasoning skills subsequently using our own
handcrafted datasets (see Section 4).
For the NLI task, we choose the IndoNLI lay-authored dataset [83] for Indonesian, XNLI [84] for Vietnamese
and Thai, and IndicXNLI [85] for Tamil. While translating the prompt template for the task into our
target languages, we found that the terms premise, hypothesis, entailment, contradiction and neutral are
not easily translatable into our target languages and could cause confusion for the models. As such, we
decided to design the prompt template in a way that avoids using those words and that uses simple words
in replacement.
Prompt:
You will be given two sentences, X and Y.
X: {PREMISE}
Y: {HYPOTHESIS}
Determine which of the following statements applies to sentences X and Y the best.
A: If X is true, YChunk 24 · 1,997 chars
odels. As such, we
decided to design the prompt template in a way that avoids using those words and that uses simple words
in replacement.
Prompt:
You will be given two sentences, X and Y.
X: {PREMISE}
Y: {HYPOTHESIS}
Determine which of the following statements applies to sentences X and Y the best.
A: If X is true, Y must be true.
B: X contradicts Y.
C: When X is true, Y may or may not be true.
Answer strictly with a single letter A, B or C.
Table 9: Prompt template for the NLI task
12
-- 12 of 86 --
BHASA
3.3.8 Causal reasoning
Causal Reasoning is a subset of reasoning that involves identifying and understanding the causes and effects
of events or phenomena [47]. It is often considered to require commonsense knowledge [86, 87, 50]. We
therefore classify it as higher-level reasoning under the NLR category.
We test the models on both forward and backward causal reasoning (effect and cause respectively) using the
multilingual XCOPA dataset [87] for all target languages.
Prompt:
Situation: {PREMISE}
Given this situation, which of the following choices is most likely to be its {QUESTION}?
A: {CHOICE 1}
B: {CHOICE 2}
Respond strictly with the letters A or B only.
{QUESTION} = cause/effect
Table 10: Prompt template for the Causal Reasoning task
3.4 Experimental setup
3.4.1 Models
In this paper, we focus on evaluating GPT-4 as it has been reported to achieve strong 3-shot performances
on a machine-translated MMLU benchmark given its multilingual capabilities [15]. We also include GPT-
3.5-Turbo to observe the purported improvements in performance from GPT-3.5-Turbo to GPT-4.
The GPT-3.5-Turbo model used was gpt-3.5-turbo-0301 and the GPT-4 model used was gpt-4-0314. The two
models have a context length of 4096 and 8192 tokens respectively. Disregarding this difference in context
length, we evaluate both models fairly by truncating the input where necessary as if both models only had
a context length of 4096.
We set the temperature to be 0 for all NLU and NLR tasks toChunk 25 · 1,991 chars
gpt-4-0314. The two models have a context length of 4096 and 8192 tokens respectively. Disregarding this difference in context length, we evaluate both models fairly by truncating the input where necessary as if both models only had a context length of 4096. We set the temperature to be 0 for all NLU and NLR tasks to maximize the stability and reproducibility of the generations. As for NLG tasks, we set the temperature to be 0 and 0.3 for Machine Translation and Abstractive Summarization respectively in line with findings from previous works [88, 89, 35]. We use a top_p of 1, frequency_penalty of 0 and presence_penalty of 0 as per OpenAI‘s defaults. 3.4.2 Prompts The sensitivity of LLMs to prompts has been demonstrated in multiple studies, leading to the birth of a new research direction known as prompt engineering. Prompt engineering papers have proposed various improvements to prompts, such as the use of few-shot examples [1], chain-of-thought prompting [90, 6, 91, 92], and even automatic generation of prompts [93]. In our paper, we avoid the use of few-shot examples as we want to test the zero-shot capabilities of LLMs without any in-context learning. Furthermore, the use of few-shot examples introduces a plethora of variables which can have significant impact on model performance, such as the number of examples [94, 95], the choice of examples [94, 96], the language of the examples [9], and the order of the examples [97, 98]. We therefore leave the exploration of these options to future work. Instead, we focus our efforts on the translation of prompts and test the models with both English and target language prompts in zero-shot settings. Existing benchmarking efforts often use machine-translated prompts [9, 10, 12] but this has been reported to elicit drops in model performance due to errors in translation [9]. As such, we worked with native speakers to translate the English prompts into their respective languages in order to ensure the accuracy of the
Chunk 26 · 1,994 chars
xisting benchmarking efforts often use machine-translated prompts [9, 10, 12] but this has been reported to elicit drops in model performance due to errors in translation [9]. As such, we worked with native speakers to translate the English prompts into their respective languages in order to ensure the accuracy of the prompts and to maintain, as far as possible, the same meaning across all languages. However, we found that the translation exercise was more challenging than expected because there were many terms that did not translate well into our target languages and there were certain idiosyncrasies of these languages that had to be taken into consideration before deciding on a particular translation. 13 -- 13 of 86 -- BHASA Firstly, certain task-related keywords do not have exact equivalents in our target languages. A case in point would be the words premise, hypothesis, entailment and neutral in the NLI task. These words also have a slightly different meaning in the context of NLI, which would make a direct translation of these terms inappropriate. We deal with this issue by doing away with these opaque terms, opting instead to paraphrase them using layman terms for transparency. For example, entailment is paraphrased as “If X is true, Y must be true”. Secondly, certain words can have multiple translations with slightly different nuances and the choice of translation could impact the way the model responds. For example, there are two ways to translate the word summary in both Indonesian and Tamil. Ringkasan and சுருக்கம் (curukkam) refer to a shortened text while rangkuman and ெபாழBப்பு (pol ¯ippu) refer to a text that is shortened in a systematic way to retain its essence and are the words usually used in the context of a language examination. Finally, we found that even the way the user addresses the model can affect the way it responds. For example, when asking the model to do anything in Tamil, which is morphologically rich and exhibits subject- verb
Chunk 27 · 1,997 chars
ematic way to retain its essence and are the words usually used in the context of a language examination. Finally, we found that even the way the user addresses the model can affect the way it responds. For example, when asking the model to do anything in Tamil, which is morphologically rich and exhibits subject- verb agreement, the user needs to decide whether to address the model more respectfully or more informally as that will affect the inflection of the verb used.5 These issues apply not only to our target languages but also to any other language, especially those that are typologically or culturally distant from English. We therefore encourage researchers and practitioners to take these into consideration when prompting language models natively in non-English languages. A full list of prompts in every language used can be found in Appendix B (Table 46). 3.4.3 Data cleaning and sampling Data cleaning Before using the datasets for the benchmark, we did exploratory data analysis for each dataset to understand the quality and distribution of the data. We had to drop certain data points due to missing labels or text. This included 1 sentence in the IndicSentiment dataset which had no label (reducing the dataset size from 1000 sentences to 999 sentences), and 1140 tweets in the Thai Toxicity Detection dataset which either had no labels or had “TWEET_NOT_FOUND” as the text (reducing the dataset size from 3300 tweets to 2160 tweets). Furthermore, we discovered an alarming number of mislabelled sentences in the XCOPA dataset. While there is generally an equal distribution of questions across languages, with 250 cause and 250 effect questions in each test set, the Indonesian and Thai datasets had 254 and all 500 out of 500 instances asking for the effect respectively. Upon further inspection, we determined that the 4 questions in the Indonesian dataset were indeed mislabelled and proceeded to relabel them as ‘cause’ in line with the other datasets. The same was done
Chunk 28 · 1,993 chars
st set, the Indonesian and Thai datasets had 254 and all 500 out of 500 instances asking for the effect respectively. Upon further inspection, we determined that the 4 questions in the Indonesian dataset were indeed mislabelled and proceeded to relabel them as ‘cause’ in line with the other datasets. The same was done for the Thai dataset to get 250 cause and 250 effect questions (see Table 48 in Appendix C for the 4 Indonesian instances as well as a sample of 4 mislabelled Thai instances). This was a surprising finding as the XCOPA dataset was professionally translated and had near perfect inter-annotator agreement with efforts to ensure the quality of the dataset [87]. We investigated all other test datasets in XCOPA and noted that, in total, 5 out of 11 of the datasets had the same issue to differing degrees (see Table 11). Even more concerning was the fact that there has been no study that we know of that reported these errors, making it unclear whether studies using XCOPA are reporting metrics on the original erroneous dataset or if they also cleaned the dataset before testing. This highlights the importance of scrutinizing datasets used for benchmarking, even if they have already been used widely. Data sampling Due to limited resources, we sample all datasets to a maximum of 1000 instances (doc- uments or sentences) for all tasks except for Machine Translation, for which we use the full test dataset of 1012 sentences, as well as for Question Answering and Abstractive Summarization, for which we use a sample of 100 documents given their length (see Table 12). Sampling was done randomly but checks were done to ensure that the distributions of labels and document lengths were preserved in the samples. Furthermore, the document lengths for IndicQA and XLSum (Thai and Tamil datasets) had to be clipped to 3500 and 3300 tokens6 respectively due to the context length of gpt-3.5-turbo-0301 being capped at 4096. 5For example, the verb answer in the phrase “please
Chunk 29 · 1,995 chars
labels and document lengths were preserved in the samples. Furthermore, the document lengths for IndicQA and XLSum (Thai and Tamil datasets) had to be clipped to 3500 and 3300 tokens6 respectively due to the context length of gpt-3.5-turbo-0301 being capped at 4096. 5For example, the verb answer in the phrase “please answer with a single letter A or B” can be translated as பதிலளி (patilal.i) or பதிலளியுங்கள் (patilal.iyunkal.) depending on whether we are treating the model with less or more formality respectively. Furthermore, பதிலளிக்கவும் (patilal.ikkavum) and பதிலளிக்குக (patilal.ikkuka) can also be used as an imperative that is more obligatory and is often used in instructions. 6Tokenization was done using OpenAI‘s cl100k_base tokenizer (which is used in GPT-4 and GPT-3.5-Turbo). 14 -- 14 of 86 -- BHASA Cause Effect Chinese 250 250 Estonian 250 250 Haitian Creole 250 250 Indonesian 246 254 Italian 246 254 Quechua 250 250 Swedish 213 287 Tamil 250 250 Thai 0 500 Turkish 199 301 Vietnamese 250 250 Table 11: Number of questions per language asking for cause and effect in XCOPA datasets. Every language should have 250 questions each for cause and effect. Competency Task Dataset Language Set Total Sample Trimming NLU NER NERGrit id Test 209 209 - VLSP2018 vi Test 8454 1000 - ThaiNERv2 th Test 1472 1000 - Naamapadam ta Test 758 758 - QA TyDiQA id Dev 565 100 - XQuAD vi, th Test 1190 100 - IndicQA ta Test 1804 100 3500 Sentiment Analysis NusaX id Test 400 400 - UIT-VSFC vi Test 3166 1000 - Wisesight th Test 2671 1000 - IndicSentiment ta Test 999 999 - Toxicity Detection MLHSD id Test 13169 1000 - ViHSD vi Test 6680 1000 - Thai Toxicity Tweet th Test 2160 1000 - NLG Machine Translation FLORES id, vi, th, ta Devtest 1012 1012 - Abstractive Summarization XLSum id Test 4780 100 - vi Test 4013 100 - th Test 826 100 3300 ta Test 2027 100 3300 NLR NLI IndoNLI id Test 2201 1000 - XNLI vi, th Test 5010 1000 - IndicXNLI ta Test 5010 1000
Chunk 30 · 1,998 chars
i Toxicity Tweet th Test 2160 1000 - NLG Machine Translation FLORES id, vi, th, ta Devtest 1012 1012 - Abstractive Summarization XLSum id Test 4780 100 - vi Test 4013 100 - th Test 826 100 3300 ta Test 2027 100 3300 NLR NLI IndoNLI id Test 2201 1000 - XNLI vi, th Test 5010 1000 - IndicXNLI ta Test 5010 1000 - Causal Reasoning XCOPA id, vi, th, ta Test 500 500 - Table 12: Test dataset details for BHASA’s NLP benchmark. Trimming refers to the number of tokens that input documents were truncated to (using OpenAI’s cl100k_base tokenizer). 3.5 Evaluation 3.5.1 Metrics In terms of metrics, we believe that the method adopted in HELM [8] to measure 7 categories of metrics (Accuracy, Calibration, Robustness, Fairness, Bias, Toxicity and Efficiency) would be ideal. However, given constraints in resources, we opted to focus solely on accuracy metrics for this study and leave the rest for future works. Accuracy metrics for each task were selected based on convention, with F1 score or accuracy 15 -- 15 of 86 -- BHASA being used for most tasks. However, as automated metrics for NLG may not correlate well with human judgments, be uninterpretable, have inherent biases and be unable to capture certain task-specific nuances [99], we include human evaluation on a small sample size to provide additional qualitative insights on the models’ NLG performance. This is especially important for Abstractive Summarization which conventionally uses the ROUGE metric [100] because the metric was recently reported to be challenging to compare across studies due to code errors in evaluation packages and differences in evaluation protocol which are not clearly explained [101]. We still include the metric in line with convention but we make our evaluation parameters transparent. We chose to use XLSum’s repository’s multilingual ROUGE scoring package7 as it is used to evaluate models tested on XLSum. We used all defaults and also specified the use of PyThaiNLP‘s newmm tokenizer for the
Chunk 31 · 1,997 chars
[101]. We still include the metric in line with convention but we make our evaluation parameters transparent. We chose to use XLSum’s repository’s multilingual ROUGE scoring package7 as it is used to evaluate models tested on XLSum. We used all defaults and also specified the use of PyThaiNLP‘s newmm tokenizer for the evaluation of Thai summarization. Stemming and stopword removal was not performed. The β for the Fβ score was set at 1 as per the default. As for the bootstrap aggregation, we set 4156 as the random numpy seed for reproducibility and used the ‘mid’ result (i.e. mean) as the final score. For the task of Machine Translation, we follow the recommendation of the WMT22 Metrics Shared Task to employ neural network-based metrics as they are more aligned with human evaluation and robust to different domains [102]. We adopted COMET as our neural network-based metric and ChrF++ [103] as our overlap metric as they were both used as the official metrics for the WMT Shared Task 2022 [104]. However, instead of using COMET-20 [105] which was used in the WMT Shared Task 2022, we opted for the improved COMET-22 metric [32] as it is one of the top performing metrics in the WMT22 Metrics Shared Task [102], surpassing the performance of COMET-20. 3.5.2 Human evaluation As mentioned in Section 3.5.1, we perform human evaluation as a sanity check for the chosen automated metrics as well as to provide qualitative insights. For each target language, 3 lay annotators who are native speakers of the respective target languages and have a high proficiency in English were recruited. Data was sampled randomly to obtain 20 sentences for each translation direction as well as 10 articles for summarization. Annotators were presented with 5 translations for each sentence and 5 summaries for each article (4 responses from GPT-3.5-Turbo and GPT-4 using both English and translated prompts, and one reference) and were asked to evaluate them according to our guidelines. The guidelines for
Chunk 32 · 1,995 chars
well as 10 articles for summarization. Annotators were presented with 5 translations for each sentence and 5 summaries for each article (4 responses from GPT-3.5-Turbo and GPT-4 using both English and translated prompts, and one reference) and were asked to evaluate them according to our guidelines. The guidelines for Abstractive Summarization were adapted from various works [106, 37, 35] and require the annotators to score summaries based on relevance, faithfulness of content, fluency, and coherence (see Appendix D for more details). Relevance, fluency and coherence are scored on a Likert scale of 1 to 5, while faithfulness is scored binarily as 0 (not faithful) or 1 (faithful). On the other hand, the guidelines for the Machine Translation task were modified from the Multidimensional Quality Metrics (MQM) framework [107, 108]. Annotators are asked to look out for errors in accuracy, terminology, linguistic conventions, style and locale conventions (see Appendix D for more details), and to mark translations as incorrect if any of those errors exists. Annotators have to note down the type of errors present and are also encouraged to add in comments to explain the error. While the use of MQM in [108] involves highlighting the relevant span of text and tagging it with the error type and severity, we decided to forego these steps and simplify the task to reduce the task complexity for the annotators, especially since they are not professional translators. 3.5.3 LLMs as NLG evaluators? Despite the rise in popularity of LLM-based evaluations [109, 110, 111, 112, 113, 114, 115, 116, 117] as a way to scale up evaluations at a lower cost, we decided not to adopt this approach for a number of reasons. Firstly, given the fact that the reliability or effectiveness of LLMs as evaluators has not been demonstrated for SEA languages, we might not be able to trust the results of the evaluation. Current studies mostly focus on evaluating English outputs, and the effectiveness of
Chunk 33 · 1,998 chars
adopt this approach for a number of reasons. Firstly, given the fact that the reliability or effectiveness of LLMs as evaluators has not been demonstrated for SEA languages, we might not be able to trust the results of the evaluation. Current studies mostly focus on evaluating English outputs, and the effectiveness of LLMs in evaluating machine translation has only been explored for Chinese, German and Russian [113]. Secondly, it has been shown that LLMs can have bias towards both themselves and other models [109, 117], making their judgments less reliable. Furthermore, although GPT-4 has been found to be the best judge in multiple studies [113, 114, 117], it would be undesirable for us to use it since we are also evaluating GPT-4 itself. After all, quis custodiet ipsos custodes?8 7https://github.com/csebuetnlp/xl-sum/tree/master/multilingual_rouge_scoring 8“Quis custodiet ipsos custodes?” originates from the Roman satirical work, the Satires, and means “Who will watch the watchmen?” 16 -- 16 of 86 -- BHASA Finally, LLMs also exhibit various other biases such as position bias (where the order of presenting options significantly alters judgment) [116] and verbosity bias (where the output length can alter judgment regard- less of text quality) [117]. There is also uncertainty in how prompts should be written to ensure reliable judgments. Because of this, we eschew the use of LLMs to evaluate NLG and rely only on automated metrics and human evaluation for this paper. 3.6 Results and analysis In general, we find that the models’ performances vary across languages and decrease in the order of Indone- sian, Vietnamese, Thai and Tamil. Both models are also less able to follow instructions when prompted in Thai or Tamil. Furthermore, in general, we find that GPT-4 does indeed outperform GPT-3.5-Turbo significantly on our benchmarking tasks and target languages. This gap is even more distinct in the trans- lated prompt setting where GPT-3.5-Turbo has a higher rate of not
Chunk 34 · 1,994 chars
less able to follow instructions when prompted in Thai or Tamil. Furthermore, in general, we find that GPT-4 does indeed outperform GPT-3.5-Turbo significantly on our benchmarking tasks and target languages. This gap is even more distinct in the trans- lated prompt setting where GPT-3.5-Turbo has a higher rate of not being able to follow the instructions or answering wrongly. In the following sections we look at the performance of the models on each task separately. 3.6.1 Named entity recognition (NER) id vi th ta Fine-tuned 79.09 (XLM-R) 77.52 (VNER) 84.80 (WangchanBERTa) 64.5 (MuRIL) GPT-3.5-Turbo 31.76 16.85 15.87 21.65 GPT-3.5-Turbo (T) 33.88 23.64 10.71 9.55 GPT-4 62.05 60.96 38.92 51.84 GPT-4 (T) 60.47 50.00 29.03 36.78 Table 13: Strict NER F1 scores of GPT-3.5-Turbo and GPT-4. The performances of fine-tuned models were obtained from [61] for Indonesian, [62] for Vietnamese, [64] for Thai, and [70] for Tamil. Both models generally did not do well for the NER task and this was generally due to the output not being in the required format. Although the prompt requires the model to label entities within the text by delimiting it with the @@ and ## symbols, these symbols were sometimes left out or used partially. The models may also remove the whitespaces between tokens, merging them and causing the number of tokens of the prediction to differ from that of the ground truth. This is understandable as the task of NER is that of sequence labeling which is not very compatible with the text generation task that LLMs are trained on [66]. Another factor contributing to the poor performance is likely related to the difficulty in specifying the definition of each class in the prompt. The model has to rely on transfer learning from whatever data it has previously been trained on to deduce what the classes in the prompt refer to, but the model is often mistaken. For example, pronouns are sometimes labeled as PERSON. Nevertheless, GPT-4 performs significantly better than
Chunk 35 · 1,998 chars
of each class in the prompt. The model has to rely on transfer learning from whatever data it has previously been trained on to deduce what the classes in the prompt refer to, but the model is often mistaken. For example, pronouns are sometimes labeled as PERSON. Nevertheless, GPT-4 performs significantly better than GPT-3.5-Turbo in all 4 languages. Translated prompts result in large drops in performance across the board except for GPT-3.5-Turbo when prompted in Indonesian or Vietnamese, in which case performance actually improves unexpectedly. The drop in performance is likely due to the translated labels being unfamiliar to the model as NER datasets do not usually use labels written in the target language. Interestingly, the drop in performance of GPT-4 with Indonesian prompts is much smaller compared to prompts in other languages, suggesting that its abilities in understanding Indonesian is much better. Overall, GPT-3.5-Turbo and GPT-4 are probably not suitable for use in a zero-shot setting for NER in these 4 languages. In general, researchers and practitioners wanting to use LLMs for NER may consider adding examples and a self-verification step [66] for better performances. 17 -- 17 of 86 -- BHASA id vi th ta Human 92.00 - - - Fine-tuned 84.80 (mBERT) 79.10 (XLM-R Large) 74.20 (XLM-R Large) 45.40 (IndicBERTv2) GPT-3.5-Turbo 48.45 39.47 45.34 (59.60) 28.88 GPT-3.5-Turbo (T) 45.58 44.53 39.00 (49.88) 13.28 GPT-4 76.89 73.85 75.30 (83.69) 34.71 GPT-4 (T) 73.60 72.02 64.41 (75.19) 34.38 Table 14: F1 score of GPT-3.5-Turbo and GPT-4 for Question Answering. Fine-tuned model and human performances were obtained from [68] for Indonesian, [69] for Vietnamese and Thai, and [70] for Tamil. Results on the Thai dataset in parentheses refer to results obtained after appropriate tokenization. 3.6.2 Question answering GPT-4 performs significantly better than GPT-3.5-Turbo for Question Answering in terms of F1, and this is largely due to its ability to extract strictly the
Chunk 36 · 1,995 chars
namese and Thai, and [70] for Tamil. Results on the Thai dataset in parentheses refer to results obtained after appropriate tokenization. 3.6.2 Question answering GPT-4 performs significantly better than GPT-3.5-Turbo for Question Answering in terms of F1, and this is largely due to its ability to extract strictly the most relevant span for the answer, which also results in high Exact Match scores. On the other hand, GPT-3.5-Turbo tends to extract an entire sentence and does not answer to the point and therefore loses out on precision. Nevertheless, post-analyses revealed that 80% to 90% of these sentences extracted by GPT-3.5-Turbo do indeed contain the relevant answer span, so the model does understand the document and the question, but simply fails to answer the question directly by excluding the irrelevant portions. Due to the fact that the SQuAD evaluation script tokenizes model predictions using whitespace, we had to retokenize model responses for Thai for a fairer evaluation. Since Thai is a scriptio continua language that does not separate words with spaces, not tokenizing the model responses would mean calculating F1 on a sentence level and over-penalizing the model. The F1 scores are much higher after retokenization using PyThaiNLP’s newmm tokenizer and we report these results in parentheses beside the initial results. Surprisingly, GPT-4 in a zero-shot setting can outperform a fine-tuned XLM-R Large model on the Thai XQuAD dataset. In the other languages however, GPT-4 still lags slightly behind smaller fine-tuned models. This gap in performance is especially conspicuous for the Tamil IndicQA dataset where the results seem quite dismal and we sought to investigate the reasons behind it. IndicQA is intrinsically harder than the other datasets due to it containing unanswerable questions, and this was likely partially the reason for the low F1 scores. Out of 28 unanswerable questions, GPT-3.5-Turbo managed to identify them as such for 18 and 0 questions
Chunk 37 · 1,991 chars
nd we sought to investigate the reasons behind it. IndicQA is intrinsically harder than the other datasets due to it containing unanswerable questions, and this was likely partially the reason for the low F1 scores. Out of 28 unanswerable questions, GPT-3.5-Turbo managed to identify them as such for 18 and 0 questions with English and translated prompts respectively, while GPT-4 managed to do so for 20 and 13 questions with English and translated prompts respectively. Looking into individual question responses however, we find that the model actually performed better than what was suggested by the metrics. In some cases, the noun in the model’s answer was inflected differently from that in the ground truth and was not considered a match. For example, one of the questions asked “who built the Ram Jhula bridge”, and the ground truth read “by the Indian government” (using the instrumental case to mark the noun as an agent in a passive voice construction) while GPT-4 answered “the Indian government” (see Example 1 and 2 for the linguistic glossing). GPT-4’s answer is arguably more accurate since the question was posed in the active voice and not passive voice. In fact, it is rather impressive that it was able to change the inflection of the noun into the nominative case to better answer the question. However, as the task is supposed to be extractive instead of abstractive, we accept the ground truth as the correct answer. Nevertheless, we might want to explore lemmatization as a fairer way to evaluate models on such languages which are morphologically rich. (1) இந்திய Intiya India அரசாங்கத்தால் arac¯a ˙nkatt-¯al government-inst ‘By the Indian government’ 18 -- 18 of 86 -- BHASA (2) இந்திய Intiya India அரசாங்கம் arac¯a ˙nkam government ‘The Indian government’ We also found that some questions were not phrased correctly or their answers were inaccurate. For example, one question asked for the year that the Indian National Committee for Space Research (INCOSPAR)
Chunk 38 · 1,996 chars
nment’ 18 -- 18 of 86 -- BHASA (2) இந்திய Intiya India அரசாங்கம் arac¯a ˙nkam government ‘The Indian government’ We also found that some questions were not phrased correctly or their answers were inaccurate. For example, one question asked for the year that the Indian National Committee for Space Research (INCOSPAR) was founded and the ground truth was “1945”. However, this fact was neither in the text given to the model nor even factually correct (INCOSPAR was founded in 1962). Another question asked for the name of one of the lakes dug during the Pallava dynasty. In the corresponding text, a long list of lakes was provided and the ground truth chose the first three lakes in the list as the answer. The question is not well-framed because any of the lakes could be the answer, and the ground truth is not even answering the question correctly since it gave three names instead of just one. GPT-4 actually did answer correctly by responding with the first lake in the list. These errors in the dataset may explain why the performance of fine-tuned IndicBERTv2 [70] was not able to attain even 50% in F1 score. 3.6.3 Sentiment analysis id vi th Fine-tuned 91.60 (XLM-R Large) 87.94 (MaxEnt) 67.05 (WangchanBERTa) GPT-3.5-Turbo 82.60 63.80 44.09 GPT-3.5-Turbo (T) 79.53 58.03 40.71 GPT-4 84.28 60.02 40.07 GPT-4 (T) 74.14 65.19 39.15 Table 15: Macro-F1 scores of GPT-3.5-Turbo and GPT-4 on the Sentiment Analysis task for Indonesian, Vietnamese and Thai. Fine-tuned model performances are obtained from [71] for Indonesian, [72] for Viet- namese and [64] for Thai. ta Fine-tuned 93.60 GPT-3.5-Turbo 77.18 GPT-3.5-Turbo (T) 23.62 GPT-4 83.98 GPT-4 (T) 71.97 Table 16: Accuracy of GPT-3.5-Turbo and GPT-4 on the IndicSentiment dataset. The fine-tuned model reported here is IndicBERTv2 [70]. The zero-shot performance of GPT-4 and GPT-3.5-Turbo varies a lot from language to language. Both models did well for the Indonesian NusaX dataset, with GPT-4 performing slightly better with a lead
Chunk 39 · 1,994 chars
ccuracy of GPT-3.5-Turbo and GPT-4 on the IndicSentiment dataset. The fine-tuned model reported here is IndicBERTv2 [70]. The zero-shot performance of GPT-4 and GPT-3.5-Turbo varies a lot from language to language. Both models did well for the Indonesian NusaX dataset, with GPT-4 performing slightly better with a lead of 1.68% F1 score when prompted in English. It was surprising however to note that GPT-4 performed worse than GPT-3.5-Turbo when prompted in Indonesian, given that GPT-4 seems to understand Indonesian better in other tasks. It was found that a large contributor to this difference in performance came from misclassifying neutral sentences. 20 out of 96 neutral sentences (20.83%) were classified as positive by GPT- 4, and they are generally advice or statements describing a situation in a neutral way. In order to do error analysis, we asked GPT-4 to explain its decisions. For sentences which are advice, GPT-4 explained that the advice is good for the recipient and therefore has a positive sentiment, whereas for statements, it tends to interpret the situation and give its own perspective on the matter (see Table 17). While this does 19 -- 19 of 86 -- BHASA demonstrate GPT-4’s stronger Indonesian capabilities, it can be undesirable for the model to inject such judgments depending on the context. To reduce the chances of this happening, we suggest that researchers and practitioners can consider using more informative prompts to define the labels more clearly. Sentence Explanation by GPT-4 (Advice) Jangan lupa minum air mineral 8 – 10 gelas per harinya. Kalimat tersebut memiliki sentimen positif karena memberikan saran atau nasihat yang baik untuk menjaga kesehatan, yaitu dengan mengingatkan untuk minum air mineral 8-10 gelas per hari. Hal ini menunjukkan perhatian terhadap kesehatan dan kesejahteraan orang lain. Do not forget to drink 8 – 10 glasses of mineral water per day. The sentence has a positive sentiment because it gives a good suggestion or
Chunk 40 · 1,999 chars
baik untuk menjaga kesehatan, yaitu dengan mengingatkan untuk minum air mineral 8-10 gelas per hari. Hal ini menunjukkan perhatian terhadap kesehatan dan kesejahteraan orang lain. Do not forget to drink 8 – 10 glasses of mineral water per day. The sentence has a positive sentiment because it gives a good suggestion or advice to take care of one‘s health, that is to say by reminding one to drink 8-10 glasses of mineral water a day. This shows concern for the health and well-being of others. (Statement) Kantor penghubung korsel-korut dibuka pekan ini. Kalimat tersebut menunjukkan adanya pembukaan kantor penghubung antara Korea Selatan dan Korea Utara, yang merupakan langkah maju dalam hubungan antar negara tersebut. Hal ini dapat dianggap sebagai tanda peningkatan kerja sama dan komunikasi antara kedua negara, sehingga sentimennya positif. The North Korea-South Korea Liaison Office opened this week. The sentence indicates the opening of a liaison office between North Korea and South Korea, which is a step forward in the relations between said countries. This can be considered a sign of increasing cooperation and communication between the two countries, so the sentiment is positive. Table 17: Explanations given by GPT-4 on why it labeled neutral sentences as having a positive sentiment when prompted in Indonesian. As for the Vietnamese dataset, it seems that GPT-4 does worse than GPT-3.5-Turbo when prompted in English but does better when prompted in Vietnamese. It was found through error analysis that GPT-4 performed worse when prompted in English because it classified 55 out of 445 (12.36%) negative sentences as neutral. Many of these sentences contain modal verbs such as nên (‘should’) and cần (‘need’) or are imperative in nature. When prompted in Vietnamese, the understanding of these modals and imperatives as presupposing the utterer’s dissatisfaction with the current state of affairs helps the model make the right decision. However, when prompted in English, the
Chunk 41 · 1,998 chars
modal verbs such as nên (‘should’) and cần (‘need’) or are imperative in nature. When prompted in Vietnamese, the understanding of these modals and imperatives as presupposing the utterer’s dissatisfaction with the current state of affairs helps the model make the right decision. However, when prompted in English, the model prefers to view such suggestions as neutral (see Table 18). Performance in Thai is poor across all runs and this was mainly due to the misclassification of neutral sentences. This is likely an issue of label definition as many of these neutral sentences are actually adver- tisements which sound positive but are classified as neutral in the ground truth as they are not product reviews. Another factor contributing to the poor performance is the fact that there are 4 classes for the Thai Wisesight Sentiment dataset, including the class ‘Question’ on top of the usual ‘Positive’, ‘Negative’ and ‘Neutral’. GPT-4 in particular tends to label statements as questions, with 212 and 169 false positives when prompted in English and Thai respectively. It seems that the model labels rhetorical questions and echo questions as questions even if they are not meant to be interpreted literally as such. However, it has to be ac- knowledged that specifying the definition of the question label could help steer the model‘s responses towards the intended interpretation, and perhaps excluding the question label could result in a better comparison across languages and datasets. We leave these explorations for future iterations of BHASA. Finally, the models performed reasonably well for the Tamil dataset, although that is perhaps to be expected given that only two classes, ‘Positive’ and ‘Negative’, were involved. Despite the explicit instruction to answer with only one of the two labels, both models answered “Neutral” in more than 50 instances. This could be a sign that the models had been strongly instruction-tuned in English to interpret the sentiment analysis task as
Chunk 42 · 1,994 chars
ly two classes, ‘Positive’ and ‘Negative’, were involved. Despite the explicit instruction to answer with only one of the two labels, both models answered “Neutral” in more than 50 instances. This could be a sign that the models had been strongly instruction-tuned in English to interpret the sentiment analysis task as one that has three classes. The huge jump in performance from GPT-3.5-Turbo to GPT-4 when prompted in Tamil can be attributed to GPT-4’s ability to understand the Tamil instructions better and to answer with either label, whereas GPT-3.5-Turbo fails to do so and generates long responses. 3.6.4 Toxicity detection Toxicity Detection is a more challenging task for the models as the data comes from social media and is in an informal register, thus the vocabulary can be quite different from the standard language. Furthermore, a good understanding of the cultures associated with the language is required as well to interpret potentially 20 -- 20 of 86 -- BHASA Sentence Explanation by GPT-4 Bài tập thầy nên sát với đề thi hơn. The teacher‘s exercises should be closer to the examinations. (English prompt) The sentence “bài tập thầy nên sát với đề thi hơn” translates to “the teacher‘s exercises should be closer to the exam questions.” This statement is a suggestion or recommendation, and it does not express a strong positive or negative emotion. Therefore, the sentiment can be considered neutral. (Vietnamese prompt) Câu “bài tập thầy nên sát với đề thi hơn” mang sắc thái tiêu cực vì nó ngụ ý rằng bài tập hiện tại do thầy đưa ra chưa đạt yêu cầu hoặc chưa phù hợp với đề thi. Điều này cho thấy sự không hài lòng của người nói và muốn thầy cải thiện trong việc lựa chọn bài tập. The sentence “the teacher’s exercises should be closer to the examinations” has a negative sentiment because it implies that the current exercises given by the teacher are not satisfactory or unsuitable for the examinations. This shows the speaker’s dissatisfaction and desire for the
Chunk 43 · 1,987 chars
thiện trong việc lựa chọn bài tập. The sentence “the teacher’s exercises should be closer to the examinations” has a negative sentiment because it implies that the current exercises given by the teacher are not satisfactory or unsuitable for the examinations. This shows the speaker’s dissatisfaction and desire for the teacher to improve their choice of exercises. Table 18: Explanations given by GPT-4 on why it labeled negative sentences as having a neutral sentiment. Note that it labels the sentence as negative when prompted in Vietnamese. id vi th Fine-tuned - 62.69 (mBERT) 78.00 (Logistic Regression) GPT-3.5-Turbo 53.99 41.61 64.36 GPT-3.5-Turbo (T) 49.25 40.48 60.92 GPT-4 57.53 51.52 61.51 GPT-4 (T) 63.96 51.23 70.67 Table 19: Macro-F1 scores of GPT-3.5-Turbo and GPT-4 on Toxicity Detection. Fine-tuned model perfor- mances were obtained from [76] for Vietnamese and [77] for Thai. harmful messages accurately. There is generally an improvement in the toxicity detection abilities of the model from GPT-3.5-Turbo to GPT-4 except in the case of Thai. Through error analysis, it was found that this unexpected drop in F1 score for GPT-4 was due to an increase in false positives. As the Thai dataset is designed to be challenging by including polysemous words that may be innocuous or abusive depending on context, it is perhaps the case that GPT-4 errs on the side of caution by flagging out sentences that contain such words even if they may be harmless in those contexts. An interesting observation is that there is significant improvement in the performance of GPT-4 when prompted in translated prompts for the Indonesian and Thai datasets. It is perhaps the case that having the entire prompt in the native language could help guide the model towards a more culturally appropriate interpretation of the texts instead of using an Anglophone perspective that might be associated with using English. Overall, however, there seems to be much room for improvement in terms of
Chunk 44 · 1,999 chars
ps the case that having the entire prompt in the native language could help guide the model towards a more culturally appropriate interpretation of the texts instead of using an Anglophone perspective that might be associated with using English. Overall, however, there seems to be much room for improvement in terms of understanding hate or toxic speech in Indonesian, Vietnamese and Thai. 21 -- 21 of 86 -- BHASA 3.6.5 Machine translation id→en vi→en ChrF++ COMET-22 Human ChrF++ COMET-22 Human Reference - - 3.0/20 - - 5.7/20 GPT-3.5-Turbo 66.05 89.52 13.3/20 58.85 87.25 11.7/20 GPT-3.5-Turbo (T) 66.27 89.70 14.3/20 58.94 87.37 9.7/20 GPT-4 67.15 89.85 13.0/20 60.29 87.81 12.7/20 GPT-4 (T) 67.53 89.95 11.7/20 60.39 87.88 13.3/20 th→en ta→en ChrF++ COMET-22 Human ChrF++ COMET-22 Human Reference - - 13.3/20 - - - GPT-3.5-Turbo 52.03 86.09 13.3/20 41.45 78.03 - GPT-3.5-Turbo (T) 52.30 86.26 11.7/20 40.87 77.52 - GPT-4 66.38 89.09 15.3/20 65.48 87.07 - GPT-4 (T) 51.46 83.77 13.3/20 47.39 79.84 - en→id en→vi ChrF++ COMET-22 Human ChrF++ COMET-22 Human Reference - - 6.0/20 - - 7.7/20 GPT-3.5-Turbo 69.00 92.03 10.3/20 57.83 88.94 6.3/20 GPT-3.5-Turbo (T) 69.05 92.04 10.3/20 57.92 89.01 6.3/20 GPT-4 69.47 92.27 10.0/20 59.76 89.44 8.3/20 GPT-4 (T) 69.38 92.21 11.0/20 59.73 89.50 8.0/20 en→th en→ta ChrF++ COMET-22 Human ChrF++ COMET-22 Human Reference - - 14.3/20 - - - GPT-3.5-Turbo 41.19 85.20 7.3/20 34.95 70.46 - GPT-3.5-Turbo (T) 41.36 85.45 6.3/20 34.27 69.62 - GPT-4 43.28 86.66 6.7/20 42.05 82.16 - GPT-4 (T) 43.21 86.78 8.3/20 42.03 82.37 - Table 20: Model performances of GPT-3.5-Turbo and GPT-4 on the Machine Translation task. Human evaluation scores are taken as an average across 3 annotators’ ratings. 22 -- 22 of 86 -- BHASA When looking at the ChrF++ and COMET-22 metrics, GPT-4 generally translates better than GPT-3.5- Turbo, although the difference is minimal for Indonesian and Vietnamese, and only becomes more obvious when translating Thai and Tamil. Translation
Chunk 45 · 1,993 chars
aken as an average across 3 annotators’ ratings. 22 -- 22 of 86 -- BHASA When looking at the ChrF++ and COMET-22 metrics, GPT-4 generally translates better than GPT-3.5- Turbo, although the difference is minimal for Indonesian and Vietnamese, and only becomes more obvious when translating Thai and Tamil. Translation into English is also better than translation from English in general across all 4 languages. This is in agreement with existing literature [11] and may suggest that the models’ understanding of our target languages might be better than their ability to generate text in them. We hypothesize that this might be due to a combination of the magnitude of training data in English compared to our target languages as well as the vocabulary of the models’ tokenizer being more optimized for English. Using translated prompts did not have a significant impact on translation quality in general, although for the Thai→English and Tamil→English directions, GPT-4 translated the instructions in the prompt instead of the source sentence on multiple occasions, showing again that its instruction-following abilities in Thai and Tamil still leave much to be desired. As explained in Section 3.5.2, we also perform human evaluation on top of automated metrics to get some qualitative understanding of the models’ performances and also as a form of a sanity check for the automated metrics. We took a sample of 20 translations in each direction for each language to be evaluated by humans according to the procedure stated in Section 3.5.2.9 Although the results differed greatly between individ- ual human annotators, we took the average of their annotations to be a reasonable gauge of the models’ translation quality. Surprisingly, the human annotators did not consistently rate GPT-4 higher than GPT- 3.5-Turbo, suggesting that the translations of each model had different strengths and weaknesses. However, there are some patterns which we now discuss. Firstly, in general, the most
Chunk 46 · 1,997 chars
reasonable gauge of the models’ translation quality. Surprisingly, the human annotators did not consistently rate GPT-4 higher than GPT- 3.5-Turbo, suggesting that the translations of each model had different strengths and weaknesses. However, there are some patterns which we now discuss. Firstly, in general, the most common errors seen are mistranslations, unnatural phrasings, use of inappro- priate terminologies, additions and omissions. There are also many instances where translations make no sense or use words that do not exist, especially with languages that do not use the Latin script. For example, for English → Thai translation, the word flagship was translated as แก่งแย่ง (‘to struggle’) and farewell was translated as ลาทิ้ง (non-existent word). These problems are most dire for the English↔Tamil directions with annotators labeling almost all translations as wrong (see Table 21 for some examples). Secondly, named entities are sometimes translated wrongly especially when the source language uses a non- Latin script. This error occurs most often with GPT-3.5-Turbo. For example, the Thai word นครวาติกัน (‘Vatican City’) was translated as “Nakhon Pathom”, a city in Thailand, while the countries คอโมโรสและมา ยอต (‘Comoros and Mayotte’) were translated as “Morocco and Myanmar”. We see the same problem in Tamil with ைநராேகாங்ேகா (‘Nyiragongo’) being translated as “Nilgiris”, a mountain range that spans the borders of the state of Tamil Nadu. Thirdly, it was found that the reference translations were rated even lower than the model translations in many translation directions. Due to the fact that the FLORES dataset was created by translating from English, it is only natural that the non-English sentences may not always be equivalent to the English ones due to certain decisions that the translators made. As such, we can expect that when annotators evaluate the reference translations into English, they may find them unsatisfactory. However, we found many errors that
Chunk 47 · 1,995 chars
ish, it is only natural that the non-English sentences may not always be equivalent to the English ones due to certain decisions that the translators made. As such, we can expect that when annotators evaluate the reference translations into English, they may find them unsatisfactory. However, we found many errors that did not fall into that category (see Table 22). This poses a question as to whether we can trust automated metrics that require references in the context of FLORES, given that the references themselves are inaccurate. 9We do not include the scores for human evaluation of English↔Tamil translation as annotations have not con- cluded for all annotators. The results will be published in a future update. However, some of the existing qualitative results have been included in the discussion. 23 -- 23 of 86 -- BHASA Sentence Translation Tamil Source ஒரு திறம் ெபற்ற தடகளவீ ரர், புலியினால் ஒரு குறBப்பிட்ட அளவிற்கு தாவவும்,நீ ந்தவும், மBகப்ெபரிய தூரங்கைள தாண்டவும் இயலும் மற்றும் மனிதைன விட 5 மடங்கு அதிகவீ ரியத்துடன் ஒரு ெபாருைள இழுக்க முடியும். Reference Translation A well rounded athlete, the tiger can climb (though not well), swim, leap great distances and pull with five times the force of a strong human. GPT-3.5-Turbo with English prompts A brave soldier who received a medal of honor can carry a load equivalent to that specified by a tiger, climb and cross great distances, and can carry a weight of 5 times more than a human with great courage. GPT-3.5-Turbo with Tamil prompts A brave soldier with strong legs can run, jump and cover great distances, even beyond what is expected of him, and can carry a load of up to 5 times his own weight with great courage. GPT-4 with English prompts A well rounded athlete, the tiger can climb (though not well), swim, leap great distances and pull with five times the force of a strong human. GPT-4 with Tamil prompts A warrior with gills, capable of breathing and diving to a specific depth underwater, can also traverse
Chunk 48 · 1,993 chars
with great courage. GPT-4 with English prompts A well rounded athlete, the tiger can climb (though not well), swim, leap great distances and pull with five times the force of a strong human. GPT-4 with Tamil prompts A warrior with gills, capable of breathing and diving to a specific depth underwater, can also traverse vast distances with ease and possesses five times the strength of a human to lift objects. Tamil Source உண்ணியில் இருக்கும் டிஎன்ஏ ைவ கதிரியக்க பாஸ்பரசால் அல்லது உண்ணியில் இருக்கும் புரதத்ைத கதிரியக்க சல்பரால் அைடயாளப் படுத்தி, அவர்கள் இரண்டு (2) ேசாதைனகள் ெசய்தார்கள். Reference Translation They did two experiments marking either the DNA in the phage with a radioactive phosphorus or the protein of the phage with radioactive sulfur. GPT-3.5-Turbo with English prompts Using a DNA sample from saliva or a hair follicle found in the food, they conducted two (2) tests to identify the suspect. GPT-3.5-Turbo with Tamil prompts The DNA is extracted either by using a pass-through with a bead or by identifying the purine base in the DNA present in the urine They conducted two (2) experiments to identify the purine base in the urine. GPT-4 with English prompts They did two experiments marking either the DNA in the phage with a radioactive phosphorus or the protein of the phage with radioactive sulfur. GPT-4 with Tamil prompts In the yeast containing DNA, they identified it by X-ray diffraction or by the presence of the substance in the yeast, and they conducted two (2) tests. Table 21: Vastly different translations of the same source sentence by GPT-3.5-Turbo and GPT-4 when prompted in English and Tamil. The words used still seem to come from the same domain, but are completely absent in the source text. Original Sentence Reference Translation Error A moderate earthquake shook western Montana at 10:08 p.m. on Monday. Gempa mengguncang Montana barat pada hari Senin pukul 20.08. The word “moderate” is missing. Local authorities are warning residents in the
Chunk 49 · 1,998 chars
he same domain, but are completely absent in the source text. Original Sentence Reference Translation Error A moderate earthquake shook western Montana at 10:08 p.m. on Monday. Gempa mengguncang Montana barat pada hari Senin pukul 20.08. The word “moderate” is missing. Local authorities are warning residents in the vicinity of the plant to stay indoors, turn off air-conditioners and not to drink tap water. Otoritas lokal memperingatkan warga sekitar akan tanaman-tanaman yang harus tetap berada di luar ruangan, mematikan pendingin ruangan, dan tidak meminum air keran. The word “plant” here refers to a nuclear plant and should be translated as “pabrik” instead of “tanaman-tanaman” which are living plants. The archipelago lies 120 km north of the Peninsula. The largest is King George Island with the settlement of Villa Las Estrellas. Quần đảo nằm cách Bán đảo 120 km phía Bắc Peninsula. Đảo lớn nhất là King George với sự hiện diện của làng Villa Las Estrellas. The word “Bán đảo” already means peninsula so there is no need to state it again in English. Schneider testified via videolink from a USAF base in his homeland. ஷ்ைனடர் தன் தாய்நாட்டின் யுஏஎெஸஃப் தளத்தின்வீ டிேயா லிங்க் மூலம் ேசாதிக்கப்பட்டார் . USAF should either be translated fully or be transliterated accurately. In this case, the translator transliterated it but did it wrongly so the name cannot be understood. Table 22: Original translation errors in the FLORES-200 dataset 3.6.6 Abstractive summarization The abstractive summarization scores are quite low overall and are lagging behind the fine-tuned mT5 baselines from the XLSum paper [80]. However, we found through manual inspection that many of the reference summaries contained additional information not included in the article. This would mean that the ROUGE-L scores could be an underestimate of the models’ performance. We therefore asked human annotators to evaluate the summaries as well (see Table 24) to see if their judgments are similar and for a 24
Chunk 50 · 1,990 chars
of the reference summaries contained additional information not included in the article. This would mean that the ROUGE-L scores could be an underestimate of the models’ performance. We therefore asked human annotators to evaluate the summaries as well (see Table 24) to see if their judgments are similar and for a 24 -- 24 of 86 -- BHASA id vi th ta Fine-tuned 30.76 26.08 28.88 22.07 GPT-3.5-Turbo 17.87 18.93 23.03 11.02 GPT-3.5-Turbo (T) 18.60 16.22 21.50 10.60 GPT-4 20.10 19.14 25.35 12.73 GPT-4 (T) 18.71 19.22 25.26 11.89 Table 23: ROUGE-L scores of GPT-3.5-Turbo and GPT-4 on the Abstractive Summarization task. The fine-tuned model performances were based on the mT5 performances from the XLSum paper [80]. more fine-grained aspect-based understanding of the ‘models’ performance (in terms of relevance, faithfulness, fluency and coherence).10 As expected, our human annotators rated the reference summaries as irrelevant, incoherent and unfaithful to the original text, especially for Indonesian and Vietnamese. On the other hand, GPT-4 scores well in general across all evaluation aspects, with the translated prompts sometimes leading to better summaries. It seems that GPT-4 is capable of producing faithful and fluent summaries without hallucinations. id vi R Fa Fl Co R Fa Fl Co Reference 1.73 0.30 3.13 2.67 2.64 0.20 4.84 2.57 GPT-3.5-Turbo 2.93 0.73 3.37 3.23 3.60 0.70 4.90 3.60 GPT-3.5-Turbo (T) 3.20 0.73 3.70 3.53 3.85 0.60 4.90 3.75 GPT-4 2.83 0.63 3.63 3.50 3.80 0.90 4.95 3.75 GPT-4 (T) 4.03 0.83 4.00 3.87 3.75 0.90 4.85 3.80 th ta Reference 3.07 0.70 4.33 3.90 - - - - GPT-3.5-Turbo 4.03 0.93 4.43 4.37 - - - - GPT-3.5-Turbo (T) 3.93 0.87 4.17 4.20 - - - - GPT-4 4.00 0.97 4.67 4.53 - - - - GPT-4 (T) 4.07 0.83 4.57 4.53 - - - - Table 24: Results of human evaluation of abstractive summarization by GPT-3.5-Turbo and GPT-4. R stands for Relevance, Fa stands for Faithfulness, Fl stands for Fluency and Co stands for Coherence. 3.6.7 Natural language inference
Chunk 51 · 1,995 chars
.17 4.20 - - - - GPT-4 4.00 0.97 4.67 4.53 - - - - GPT-4 (T) 4.07 0.83 4.57 4.53 - - - - Table 24: Results of human evaluation of abstractive summarization by GPT-3.5-Turbo and GPT-4. R stands for Relevance, Fa stands for Faithfulness, Fl stands for Fluency and Co stands for Coherence. 3.6.7 Natural language inference (NLI) GPT-4 has improved significantly compared to GPT-3.5-Turbo on the task of NLI. In fact, it has even outperformed humans on the Indonesian dataset. However, for Vietnamese, Thai and Tamil, performance is still poor. Oddly enough, GPT-4 chooses Neutral in the majority of cases, resulting in the aforementioned lackluster performance. This shows that it is possibly unsure of its answer in those 3 languages and has to resort to the safest response. 10We do not include the scores for human evaluation of Tamil summarization as annotations have not concluded for all annotators. The results will be published in a future update. 25 -- 25 of 86 -- BHASA id vi th ta Human 85.1 - - - Fine-tuned 82.3 (XLM-R) 79.30 (XLM-R Large) 77.40 (XLM-R Large) 74.50 (IndicBERTv2) GPT-3.5-Turbo 67.60 45.84 42.44 38.14 GPT-3.5-Turbo (T) 61.90 52.55 35.14 29.52 GPT-4 85.50 61.66 54.65 50.45 GPT-4 (T) 83.20 59.96 50.75 44.14 Table 25: Accuracy scores of GPT-3.5-Turbo and GPT-4 on the NLI task. Human and fine-tuned model performances were obtained from [83] for Indonesian, [70] for Tamil, and [9] for Vietnamese and Thai. 3.6.8 Causal reasoning id vi th ta Human 100.0 98.4 98.2 98.6 Fine-tuned 92.6 (TuLRv6 XXL) - 87.2 (TuLRv6 XXL) 85.4 (TuLRv6 XXL) GPT-3.5-Turbo 88.8 84.0 71.0 56.8 GPT-3.5-Turbo (T) 85.8 80.6 64.6 1.6 GPT-4 96.8 94.0 91.0 78.0 GPT-4 (T) 96.0 95.8 87.6 70.6 Table 26: Accuracy scores of GPT-3.5-Turbo and GPT-4 on the Causal Reasoning task. Human and fine- tuned model performances were obtained from [87] and [9] respectively. GPT-3.5-Turbo performs reasonably well on the Indonesian and Vietnamese datasets for XCOPA but falters on the Thai and Tamil datasets.
Chunk 52 · 1,992 chars
95.8 87.6 70.6 Table 26: Accuracy scores of GPT-3.5-Turbo and GPT-4 on the Causal Reasoning task. Human and fine- tuned model performances were obtained from [87] and [9] respectively. GPT-3.5-Turbo performs reasonably well on the Indonesian and Vietnamese datasets for XCOPA but falters on the Thai and Tamil datasets. When prompted in Tamil, it was unable to provide an answer for most questions, resulting in a near-zero performance. On the other hand, GPT-4 performs well across the board, with performance decreasing in the order of Indonesian, Vietnamese, Thai and Tamil. In fact, it was able to outperform a fine-tuned TuLRv6-XXL model in a zero-shot setting on the Indonesian and Thai datasets. 3.6.9 Data leakage As OpenAI does not reveal the data sources used to train GPT-3.5-Turbo and GPT-4, it is difficult to know if these models had seen the test datasets used in BHASA during its training or not. However, some serendipitous findings during error analyses suggest that the NLG datasets might have been part of the training data. Firstly, for the machine translation task with the FLORES-200 dataset, despite the fact that many of the reference translations were not equivalent to the English source sentences, GPT-4, when prompted in English, was able to produce the exact same English sentence when translating from Tamil in 8 out of 20 sample sentences. Some of these Tamil sentences were even labeled as unintelligible by annotators. If GPT-4 had been translating these sentences faithfully without trying to regurgitate memorized material, it would have translated in a different way. We explain our rationale for our suspicions in further detail in Table 27. Secondly, for the abstractive summarization task, we also found that the models produced summaries con- taining information not found in the original article but found in the reference summaries. We follow [118] and use their name cloze task to investigate if the models had memorized these articles during their
Chunk 53 · 1,997 chars
le 27. Secondly, for the abstractive summarization task, we also found that the models produced summaries con- taining information not found in the original article but found in the reference summaries. We follow [118] and use their name cloze task to investigate if the models had memorized these articles during their training (see Figure 3). It was found that at least some of the BBC news articles had been seen by the models during training, leading them to insert their external knowledge into their summaries (see Table 28). Due to the reasons above, it should be noted that the assessment of summarization and translation capa- bilities of GPT-4 and GPT-3.5-Turbo in this paper may not be fully accurate. We urge researchers to take 26 -- 26 of 86 -- BHASA Tamil English Comment ஒரு திறம் ெபற்ற தடகளவீ ரர், புலியினால் ஒரு குறBப்பிட்ட அளவிற்கு தாவவும்,நீ ந்தவும், மBகப்ெபரிய தூரங்கைள தாண்டவும் இயலும் மற்றும் மனிதைன விட 5 மடங்கு அதிகவீ ரியத்துடன் ஒரு ெபாருைள இழுக்க முடியும். A well rounded athlete, the tiger can climb (though not well), swim, leap great distances and pull with five times the force of a strong human. The Tamil text does not contain the words in brackets in the English text “(though not well)”, and the relevant section actually reads “the tiger can jump to a certain extent”. ஆளும் கட்சியான ெசௗத் ெவஸ்ட் ஆஃப்ரிக்காபீ பிள்ஸ் ஆர்கைனேசஷன் (ஸ்ேவேபா) பாராளுமன்ற ேதர்தலில் தனது ெபரும்பான்ைமைய தக்க ைவத்துக் ெகாண்டது. The ruling party, South West Africa People’s Organisation (SWAPO), also retained a majority in the parliamentary elections. The Tamil sentence does not contain the word “also”. ேமால்ேடாவாவில் உள்ள முக்கிய மதம் ஆர்தடாக்ஸ் கிறBஸ்துவ மதம். The major religion in Moldova is Orthodox Christian. The Tamil text says “Orthodox Christianity”, and it would also be more accurate to say that in English in the first place. However, GPT-4 was the only model that used the term “Orthodox Christian” just like the reference. ேபார்த்துகீசியர்கள் இைத அழBத்துவிட்டு காஸா
Chunk 54 · 1,999 chars
. The major religion in Moldova is Orthodox Christian. The Tamil text says “Orthodox Christianity”, and it would also be more accurate to say that in English in the first place. However, GPT-4 was the only model that used the term “Orthodox Christian” just like the reference. ேபார்த்துகீசியர்கள் இைத அழBத்துவிட்டு காஸா பிரங்கா என்ற ெபயரில்மீ ண்டும், 1755ஆம் ஆண்டு பூகம்பத்திற்குப் பிறகு ைகவிட மட்டுேம, கட்டினார்கள். The Portuguese destroyed it and rebuilt it under the name Casa Branca, only to abandon it after an earthquake in 1755. The Tamil text only states that the Portuguese “built” it, not that they “rebuilt” it. Furthermore, the position of the word மட்டுேம mattum=e (‘only’) is wrong and does not convey the same meaning as the English text. The Tamil text would be better translated as “they built it to only abandon it”. ஒரு பிரிட்டிஷ் கப்பைலத் தவிர, அைனத்து கப்பல்களும் மூழ்கின. கிட்டத்தட்ட 200 அெமரிக்க மற்றும் ெஜர்மன் நாட்ைடச் ேசர்ந்த உயிர்கள் பறBேபானது. All the ships were sunk, except for one British cruiser. Nearly 200 American and German lives were lost. The Tamil text uses the word ship, yet somehow GPT-4 was able to over-specify the type of ship as a cruiser. உண்ணியில் இருக்கும் டிஎன்ஏ ைவ கதிரியக்க பாஸ்பரசால் அல்லது உண்ணியில் இருக்கும் புரதத்ைத கதிரியக்க சல்பரால் அைடயாளப் படுத்தி , அவர்கள் இரண்டு (2) ேசாதைனகள் ெசய்தார்கள். They did two experiments marking either the DNA in the phage with a radioactive phosphorus or the protein of the phage with radioactive sulfur. The number (2) in brackets was not translated. சூரியன், மற்ற நட்சத்திரங்கைளப் ேபாலேவ அடிப்பைடக் ெகாள்ைககளில் இயங்குவைத அவர்கள் கண்டறBந்தார்கள்: அைமப்பில் இருக்கும் அைனத்து நட்சத்திரங்களின் ெசயல்பாடும், அவற்றBன் ஒளிர்வு மற்றும் சுழற்சியினால் இயக்கப்படுகிறது என்பைதயும், ேவறு எதனாலும் இல்ைல என்பைதயும் கண்டறBந்துள்ளனர். They found the Sun operated on the same basic principles as other stars: The activity of all stars in the system was found to be driven by their luminosity, their rotation, and nothing
Chunk 55 · 1,997 chars
களின் ெசயல்பாடும், அவற்றBன் ஒளிர்வு மற்றும் சுழற்சியினால் இயக்கப்படுகிறது என்பைதயும், ேவறு எதனாலும் இல்ைல என்பைதயும் கண்டறBந்துள்ளனர். They found the Sun operated on the same basic principles as other stars: The activity of all stars in the system was found to be driven by their luminosity, their rotation, and nothing else. The phrase “Driven by their luminosity, their rotation, and nothing else” is very unlikely to have been translated from the Tamil source as it renders it in quite a different manner. பூமB,நீ ர், காற்று, ெநருப்பு ஆகியவற்றBன் ஒன்று அல்லது அதற்கு ேமற்பட்ட நான்கு கூறுகளின் கலைவயால் அைனத்தும் உருவாக்கப்படுகிறது என்று தத்துவஞானியான அரிஸ்டாட்டில் கூறBனார். Aristotle, a philosopher, theorized that everything is made up of a mixture of one or more of four elements. They were earth, water, air, and fire. The Tamil version renders the text as a single sentence, using a phrasing akin to “one or more of four elements which are earth, water, air and fire.” Yet, GPT-4 rendered it as two sentences. Table 27: GPT-4’s Tamil→English translation of 8 sentences in the FLORES-200 dataset which were identical to the reference translation. data leakage into consideration when evaluating models, taking care not to evaluate them with data they have seen (assuming the training data used is made public), because results reported may be overestimated. In the case of closed models for which the training data is not made public, researchers may consider using methods such as that introduced in [118] to probe the models for such leakage. 27 -- 27 of 86 -- BHASA Figure 3: GPT-4 is able to regurgitate the name of the boy (Jani) mentioned in the article, which suggests that it had memorized this passage during its training. Article Summary by GPT-4 Anak laki-laki ini, yang secara teknis tidak diijinkan untuk memiliki akun situs ini untuk tiga tahun mendatang, menemukan sebuah bug (kesalahan akibat ketidaksempurnaan desain) yang memungkinkan dia menghapus komentar yang dibuat
Chunk 56 · 1,995 chars
ggests that it had memorized this passage during its training. Article Summary by GPT-4 Anak laki-laki ini, yang secara teknis tidak diijinkan untuk memiliki akun situs ini untuk tiga tahun mendatang, menemukan sebuah bug (kesalahan akibat ketidaksempurnaan desain) yang memungkinkan dia menghapus komentar yang dibuat oleh pengguna lain. Masalah ini dengan “cepat” diperbaiki setelah ditemukan, demikian keterangan Facebook, perusahaan media sosial yang memiliki Instagram. Jani kemudian dibayar - yang membuat dia sebagai anak yang termuda yang pernah menerima hadiah atas penemuan bug ini. Setelah menemukan kekurangan itu pada Februari, dia mengirim email ke Facebook. Beli sepeda dan peralatan sepak bola Sejumlah ahli teknik keamanan di perusahaan itu telah membuat akun uji coba kepada Jani untuk membuktikan teorinya - dan dia dapat melakukannya. Anak laki-laki ini, dari Helsinki, mengatakan kepada koran Finlandia Iltalehti, dia berencana untuk menggunakan uang itu untuk membeli sepeda baru, peralatan sepak bola dan komputer untuk saudara laki-lakinya. Facebook mengatakan kepada BBC, telah membayar $4.3 juta sebagai hadiah bagi yang menemukan bug sejak 2011. Banyak perusahaan menawarkan sebuah insentif keuangan bagi profesional keamanan - dan anak-anak muda, yang menyampaikan kekurangan itu kepada perusahaan, dibandingkan menjualnya ke pasar gelap. Seorang anak laki-laki berusia 10 tahun asal Finlandia bernama Jani berhasil menemukan bug pada Instagram dan mendapatkan hadiah sebesar $10.000 dari Facebook. Jani berencana menggunakan uang tersebut untuk membeli sepeda baru, peralatan sepak bola, dan komputer untuk saudara laki-lakinya. Translation: A boy aged 10 from Finland named Jani succeeded in finding a bug in Instagram and received a gift of $10,000 from Facebook. Jani plans to use the money to buy a new bicycle, football equipment, and a computer for his brother. Comment: Neither does the article contain any information about the boy‘s age nor does it mention
Chunk 57 · 1,994 chars
y aged 10 from Finland named Jani succeeded in finding a bug in Instagram and received a gift of $10,000 from Facebook. Jani plans to use the money to buy a new bicycle, football equipment, and a computer for his brother. Comment: Neither does the article contain any information about the boy‘s age nor does it mention that the reward was $10,000. Table 28: GPT-4’s summarization of an Indonesian article in the XLSum dataset. Despite the lack of any mention in the article of the boy’s age or the value of the reward given to him, GPT-4 included these details in its summary of the article. These details are available in the reference summary. 4 LINDSEA: Linguistic diagnostics for Southeast Asian Languages In Section 3, we were able to gain an overview of the performance of GPT-3.5-Turbo and GPT-4 on various NLU, NLG, and NLR tasks in our 4 target languages. Error analyses revealed certain patterns in the types of errors that the models commit and led to certain hypotheses of deficiencies in the models. However, there was no easy way to verify these hypotheses. As such, we designed LINDSEA (LINguistic Diagnostics for SouthEast Asian languages), a high quality manually-crafted linguistic dataset to help systematically diag- nose and pinpoint model deficiencies in language proficiency, and the first to be created for SEA languages. LINDSEA serves to complement the NLP benchmark component of BHASA by providing fine-grained eval- 28 -- 28 of 86 -- BHASA uation of a model’s linguistic abilities, akin to the diagnostic dataset of GLUE [18] and BLiMP [21], the linguistic diagnostic dataset for HELM [8]. 4.1 Design principles The design of LINDSEA was undergirded by three principles: breadth, depth, and quality. Given the increasingly complex tasks that LLMs are expected to perform and the importance of natural language in our interactions with LLMs, it is now more crucial than ever to holistically scrutinize these models’ understanding of language. To do so, we designed
Chunk 58 · 1,998 chars
irded by three principles: breadth, depth, and quality. Given the increasingly complex tasks that LLMs are expected to perform and the importance of natural language in our interactions with LLMs, it is now more crucial than ever to holistically scrutinize these models’ understanding of language. To do so, we designed LINDSEA to cover a wide gamut of linguistic phenomena (breadth). While existing works mainly focus only on syntax and semantics, we expand this repertoire to include pragmatics as well (see Figure 4). We also conducted an extensive survey on literature on linguistic phenomena in our target languages and used our findings to taxonomize each linguistic phenomenon to have multiple categories and subcategories for more fine-grained analyses (depth). In contrast to works like BLiMP [21] which used a set of lexicon and rules to automatically generate large numbers of test sentences, the examples in LINDSEA are manually handcrafted by linguists in tandem with native speakers and reviewed iteratively to ensure that they sound natural, are semantically coherent and target the relevant phenomenon effectively (quality). More details about the individual subcategories and literature reviewed can be found in Appendix E. Figure 4: Taxonomy of linguistic diagnostic tests in LINDSEA 4.2 Test formats While BLiMP consists completely of minimal pairs and GLUE adopted a uniform NLI task framing for all its diagnostics, we found that it was not possible to frame all our examples with the same task format given the scope of phenomena that we were targeting. As such, we devised four different test formats, namely minimal pairs, translation, information recovery and binary choice. Prompts for each task in English, Indonesian and Tamil can be found in Appendix B (Table 47). 4.2.1 Minimal pairs Minimal pairs are pairs of sentences that differ minimally from each other and contrast in grammatical acceptability. They are crafted to isolate specific linguistic phenomena and have
Chunk 59 · 1,991 chars
y and binary choice. Prompts for each task in English, Indonesian and Tamil can be found in Appendix B (Table 47). 4.2.1 Minimal pairs Minimal pairs are pairs of sentences that differ minimally from each other and contrast in grammatical acceptability. They are crafted to isolate specific linguistic phenomena and have been used not just in theoretical linguistics but also in computational linguistics and NLP to probe the linguistic capabilities of language models [119, 120, 21, 41, 42] We apply the minimal pair testing paradigm for linguistic phenomena that are amenable to such minimal transformations that result in ungrammaticality. While BLiMP and its derivatives control strictly for the length of the sentences in each minimal pair, we find that, at least in 29 -- 29 of 86 -- BHASA the context of our target languages, this can often result in awkward phrasings or more differences between the pair. Furthermore, we do not make any assumptions on the tokenization algorithm of the model tested, so controlling for length may not make sense. We therefore prioritize the naturalness of the sentences and relax the criterion on length equivalence, as long as the pair is still minimally different in the context of the respective targeted linguistic phenomena. In terms of implementation, we ask LLMs to choose which of the two sentences in a pair is more acceptable and take its answer to be a proxy of its understanding of the targeted linguistic phenomenon. We use accuracy as the metric to measure the models’ performance here. While minimal pair testing before the dawn of LLMs often involved comparing the probability of the sentences in each minimal pair, we were unable to do this given that OpenAI does not provide model output probabilities for GPT-3.5-Turbo and GPT-4. We concede that this methodology might not be ideal given the risk of hallucinations and biases in LLMs, resulting in model responses potentially not reflecting the models’ linguistic representations
Chunk 60 · 1,991 chars
r, we were
unable to do this given that OpenAI does not provide model output probabilities for GPT-3.5-Turbo and
GPT-4. We concede that this methodology might not be ideal given the risk of hallucinations and biases in
LLMs, resulting in model responses potentially not reflecting the models’ linguistic representations accurately.
Nevertheless, given the constraints, we believe that this is the closest proxy that we have access to and is
still informative as to how certain linguistic features may impact a model‘s response. In fact, [121] show
that although probability measurements are ideal, minimal pair testing through prompting can still help
reveal how well models generalize linguistic phenomena. Furthermore, we discuss how we control for biases
in Section 4.4.
As further support for this line of testing, there has also been research into GPT-3.5 and GPT-4’s capacity for
metalinguistic introspection. [122] qualitatively show that GPT-3.5 and GPT-4 are able to perform linguistic
analyses tasks, while [123] qualitatively show that GPT-3.5 and GPT-4 are able to analyze various forms of
linguistic recursion, with GPT-4 outperforming GPT-3.5 in both studies. We posit that since the models
have such metalinguistic capabilities, they are well-suited for our suite of linguistic evaluations, though we
emphasize that such capabilities are not necessarily required to perform the task and also do not guarantee
better results.
System Prompt:
You are a {LANGUAGE} linguist.
User Prompt:
Which sentence is more acceptable?
A: {SENTENCE 1}
B: {SENTENCE 2}
Answer with A or B only.
Table 29: Prompt template for the Minimal Pairs task
30
-- 30 of 86 --
BHASA
4.2.2 Translation
Although minimal pairs allow us to effectively isolate linguistic features, not all features are amenable to
such testing. In certain cases, the transformation results in an equally grammatically acceptable sentence,
but with a difference in meaning or nuance that is still important to capture. In suchChunk 61 · 1,998 chars
2.2 Translation
Although minimal pairs allow us to effectively isolate linguistic features, not all features are amenable to
such testing. In certain cases, the transformation results in an equally grammatically acceptable sentence,
but with a difference in meaning or nuance that is still important to capture. In such situations, rather than
using minimal pair testing, we instead ask the model to translate the sentence with the targeted linguistic
phenomenon and analyze its response to see if it can faithfully capture the nuanced changes in meaning.
Examples of such categories include reduplication, clitics and idiomatic expressions. This section requires
human evaluation.
User Prompt:
Translate the following sentence into {LANGUAGE}.
Sentence: {SENTENCE}
Translation:
Table 30: Prompt template for the Translation task
4.2.3 Information recovery
In the case of coreference resolution and its subcategories, we require the model to answer open-ended
questions by recovering information from the text. For example, we ask the model to answer who a particular
pronoun in the sentence refers to, or in the case of pro-drop or null pronouns [124], we ask who did what in
the text. There is no single prompt for this set of tests as the question differs depending on the contents of
the text. An example is given below. This section also requires human evaluation.
System Prompt:
You are a {LANGUAGE} linguist.
User Prompt:
In the following sentences, who or what does {PRONOUN} refer to?
{TEXT}
Table 31: Prompt template for the Information Recovery task
31
-- 31 of 86 --
BHASA
4.2.4 Binary choice
The final category of test formats involves asking the model to make a binary choice between 2 options. This
applies to a portion of the coreference resolution tasks where we present a sentence with a blank to the model
(i.e. a fill-in-the-blank-style task) and ask it to choose between 2 named entities (such as ‘Fred’ and ‘Steve’).
It also applies to the pragmatic reasoning tests where weChunk 62 · 1,999 chars
inary choice between 2 options. This
applies to a portion of the coreference resolution tasks where we present a sentence with a blank to the model
(i.e. a fill-in-the-blank-style task) and ask it to choose between 2 named entities (such as ‘Fred’ and ‘Steve’).
It also applies to the pragmatic reasoning tests where we present the model with a sentence and ask if the
given sentence is possible (‘Yes’/‘No’), or where we present the model with a premise and ask whether the
hypothesis is true (‘True’/‘False’). We then measure the model’s performance using accuracy as a metric.
System Prompt:
You are a {LANGUAGE} linguist.
User Prompt (Coreference Questions):
Which is more appropriate to fill in the blank?
Statement: {SENTENCE}
A: {ENTITY 1}
B: {ENTITY 2}
Answer with A or B only.
User Prompt (Single Sentence Questions):
Is the following statement true or false?
Statement: {SENTENCE}
Answer only with True or False.
User Prompt (Double Sentence Questions):
Situation: {PREMISE}
Given this situation, is the following statement true or false?
Statement: {HYPOTHESIS}
Answer only with True or False.
Table 32: Prompt templates for the Binary Choice task
4.3 Dataset creation and validation
Due to the lack of resources, we decided to focus on Indonesian and Tamil for our first release of LINDSEA.
For each language, our linguists did literature reviews, selected appropriate and relevant phenomena for
testing, created the data manually with the help of native speakers, validated them with other native speakers,
and edited and revalidated the data as necessary depending on the outcome of the initial validation.
4.3.1 Dataset creation
For the syntactic minimal pairs and translation tests, we drew largely on reference grammars of Indonesian
[125] and Tamil [126] as well as a series of papers written by linguists working on Austronesian and Dravidian
languages (see Table 53 in Appendix E for a full list of works consulted).
As for coreference resolution examples, some of the authors ofChunk 63 · 1,995 chars
translation tests, we drew largely on reference grammars of Indonesian [125] and Tamil [126] as well as a series of papers written by linguists working on Austronesian and Dravidian languages (see Table 53 in Appendix E for a full list of works consulted). As for coreference resolution examples, some of the authors of this paper who are native speakers of Indonesian and Tamil manually translated some English sentences from XWinograd [127], and we further supplemented these with manually crafted sentences to target specific coreference phenomena that do not occur in English, such as pro-drop [124], long-distance reflexives [128, 129] and resumptive pronouns.11 11To be clear, resumptive pronouns [130, 131, 132, 133] do occur in English but are considered as extra-grammatical features in the literature [134, 135]. In fact, some experiments have shown that native speakers will reject sentences with resumptive pronouns despite producing them in their own speech [136, 137]. 32 -- 32 of 86 -- BHASA Finally, for pragmatic reasoning tests, we also manually crafted the examples, drawing inspiration from [25] and [29] for presupposition diagnostics and [25], [26], [27] and [28] for scalar implicature diagnostics. In the case of the Indonesian data, we replaced noun phrases with a ‘X’, representing an abstract entity, in order to control for lexical semantic reasoning on the part of the models [138]. However, since Tamil is a morphologically rich language and the relevant noun phrases may occasionally require case inflections, using ‘X’ can result in awkward sentences. In such cases, we opted to use a noun phrase instead. We generally design 10 sentences per category. However, in some cases we vary the lexical items involved in the linguistic feature to test for robustness, and in such cases, the 10 sentences will be divided equally amongst the subcategories. 4.3.2 Dataset Validation In order to ensure that these handcrafted examples sound natural and accurately diagnose
Chunk 64 · 1,995 chars
category. However, in some cases we vary the lexical items involved in the linguistic feature to test for robustness, and in such cases, the 10 sentences will be divided equally amongst the subcategories. 4.3.2 Dataset Validation In order to ensure that these handcrafted examples sound natural and accurately diagnose what we intend, we put them through a rigorous validation process. For the minimal pair sentences, three native speakers for each language rated a sample of the dataset. Specifically, two sentence pairs per category were chosen at random, although for categories with subcategories, we chose 1 sentence pair per subcategory, resulting in a sample size slightly larger than 20% of the dataset. Raters were asked through a Qualtrics survey to determine whether the sentences in each pair were grammatical. In the first round of surveys, raters were allowed to rate both sentences in a pair as grammatical or ungrammatical. This allowed us to have a conversation with raters on what alternative interpretations they had of the sentences such that they could not interpret only one as clearly grammatical and the other as clearly ungrammatical. We also discussed any category that did not have full agreement across all raters. Using the feedback we gained from multiple rounds of discussion, we revamped the minimal pairs and administered a second Qualtrics survey with the categories without full agreement. For the second round, raters were only allowed to select one sentence as grammatical with the understanding that even if the other sentence is grammatically possible, it should be semantically implausible and hence rendered unacceptable in comparison.12 After the second round of surveys, we only kept categories that had majority agreement for all sentences within it (i.e. all subcategories if available). We only discarded two categories for the Indonesian dataset through this process (modal-initial active voice constructions and enclitic =ku/=mu), showing the high
Chunk 65 · 1,997 chars
12 After the second round of surveys, we only kept categories that had majority agreement for all sentences within it (i.e. all subcategories if available). We only discarded two categories for the Indonesian dataset through this process (modal-initial active voice constructions and enclitic =ku/=mu), showing the high quality of our dataset creation process. We then calculate the inter-annotator agreement using Krippendorff’s alpha and the accuracy of the majority vote answer against the ground truth. We obtained an accuracy of 100% and inter-annotator agreement of 94.89%,13 with 96.15% (75/78) sentence pairs having full agreement. As the Tamil dataset is still undergoing iterative validation, we decided to extract only the categories that had full agreement across all three raters for all sentences for this initial release. Therefore, the inter-annotator agreement and accuracy are both 100% for the Tamil dataset. As for the pragmatic reasoning datasets, we had native speakers manually verify each example to confirm that they are aligned with native speaker intuitions. While we did not have to discard any test categories due to disagreement, there were a handful of sentences that were ambiguous to the native speakers. In those situations, we worked closely with them to understand the potential reasons behind the ambiguity and replaced the relevant examples with ones that had a clear answer. Finally, for the translation tests, since the examples were not designed to be linguistically challenging or sophisticated, there was no need for extensive validation. Instead, authors of this paper who were native speakers helped to manually vet the sentences after creation. 4.4 Experimental setup and evaluation The experimental setup for LINDSEA was designed in line with the NLP benchmark component of BHASA – that is to say that we also test GPT-3.5-Turbo and GPT-4 in a zero-shot setting with both English and translated prompts using default hyperparameters, with temperature
Chunk 66 · 1,995 chars
er creation. 4.4 Experimental setup and evaluation The experimental setup for LINDSEA was designed in line with the NLP benchmark component of BHASA – that is to say that we also test GPT-3.5-Turbo and GPT-4 in a zero-shot setting with both English and translated prompts using default hyperparameters, with temperature set at 0. In the case of minimal pair testing, pragmatic reasoning testing as well as coreference resolution testing, we further include a system 12For example, benefactive constructions without the preposition untuk (‘for’) like ‘Tono membuatkan Tini kue’ (‘Tono makes Tini a cake’) would become ungrammatical when the positions of the direct and indirect objects are swapped (i.e. ‘Tono membuatkan kue Tini’). While an alternative interpretation is technically possible (‘Tono makes a Tini cake (cake called Tini)’), it is relatively semantically implausible. 13We also calculated the mean Cohen’s kappa and obtained an agreement score of 94.88%, which is very similar to the Krippendorff’s alpha score we obtained. 33 -- 33 of 86 -- BHASA prompt of “You are an Indonesian linguist” or “You are a Tamil linguist”14 in order to prime the model to “think” more carefully about the examples as they are more linguistically challenging and complex. Although we try to keep the prompts as simple as possible to measure the base capabilities of the models, we decided to include these system prompts in this case for 2 reasons: 1) we find that using system prompts allows us to clarify the intentions behind the testing; 2) we do not believe that using system prompts goes against the principles of zero-shot testing, because we are simply constraining potentially unpredictable model behavior by “telling” the model how to approach the tasks and we do not present the model with any exemplars. Relevantly, works such as [139], [140] or [141] have demonstrated that assigning the model a particular persona can indeed steer its behavior. Furthermore, for minimal pair tests and
Chunk 67 · 1,994 chars
tially unpredictable model behavior by “telling” the model how to approach the tasks and we do not present the model with any exemplars. Relevantly, works such as [139], [140] or [141] have demonstrated that assigning the model a particular persona can indeed steer its behavior. Furthermore, for minimal pair tests and certain coreference tests with binary choices, we run every test 3 times – once with the default order of options, once with the options reversed, and once with the options randomized. A model is only considered to have answered an example correctly if it does so successfully across all 3 runs. If it answers wrongly across all 3 runs, we consider it to be wrong, and if its answers are not consistent, we consider it to be unsure. We then calculate the accuracy of the model based on these aggregate results across all 3 runs. This precaution was taken because we observed in the first run (with randomized order of options) that GPT-4 was twice as likely to choose B over A, and this is also in line with reports of bias in LLMs as mentioned in Section 3.5.3. As for translation tests, each translation is scored manually by authors of this paper who are linguists and native speakers. The scores range from 0 to 2, with 0 meaning that the translation does not faithfully capture the intended message, and 2 meaning that the sentence more or less captures the intended message. While ground truth translations are made available, we do not strictly require the model‘s answers to mirror our ground truth – as long as the given translation correctly captures the intended meaning, it should still merit a score of 2. Furthermore, we only test translation from Indonesian or Tamil into English and not the opposite direction because we do not expect that the models will be able to provide the expected translation using the target linguistic phenomenon being studied, given that various acceptable translations exist. 4.5 Results and analysis In general, we find that both
Chunk 68 · 1,996 chars
n from Indonesian or Tamil into English and not the opposite direction because we do not expect that the models will be able to provide the expected translation using the target linguistic phenomenon being studied, given that various acceptable translations exist. 4.5 Results and analysis In general, we find that both models performed poorly on LINDSEA, especially in Tamil (see Table 33 and Figure 5). Performance on pragmatic reasoning is slightly better than for syntax and semantics. For every linguistic component (syntax, semantics, and pragmatics), we take a macro-average of their constituent categories, and we further calculate a macro-average of the scores of these three linguistic components to get an overall score for LINDSEA. We use macro-averages to give equal weightage to each component as they are equally important in reflecting a model’s linguistic capabilities. We will discuss the details in the following sections. Syntax Semantics Pragmatics Overall Indonesian Dataset GPT-3.5-Turbo 22.03 53.68 58.27 44.66 GPT-3.5-Turbo (T) 39.33 54.90 62.17 52.13 GPT-4 64.67 66.36 70.77 67.27 GPT-4 (T) 61.34 71.60 82.45 71.80 Tamil Dataset GPT-3.5-Turbo 13.81 25.92 48.65 29.46 GPT-3.5-Turbo (T) 6.19 20.23 44.26 23.56 GPT-4 52.14 33.43 59.26 48.28 GPT-4 (T) 48.21 33.42 55.16 45.60 Table 33: Overall performance of GPT-3.5-Turbo and GPT-4 on LINDSEA. 14For the translated system prompt, we decided to use the term “Tamil language expert” (தமBழ் ெமாழB வல்லு- நர்) instead of “Tamil linguist” (தமBழ் ெமாழBயியலாளர்) as the word linguist occurs very infrequently in Tamil and can sound unnatural. 34 -- 34 of 86 -- BHASA Figure 5: Radar charts showing overall performance of GPT-3.5-Turbo and GPT-4 on LINDSEA 35 -- 35 of 86 -- BHASA 4.5.1 Minimal pairs Overall, we find that GPT-3.5-Turbo and GPT-4 tend to be inconsistent across runs due to answer bias, with GPT-4 exhibiting strong bias towards option B for both Indonesian and Tamil datasets regardless of prompt language. For
Chunk 69 · 1,997 chars
formance of GPT-3.5-Turbo and GPT-4 on LINDSEA 35 -- 35 of 86 -- BHASA 4.5.1 Minimal pairs Overall, we find that GPT-3.5-Turbo and GPT-4 tend to be inconsistent across runs due to answer bias, with GPT-4 exhibiting strong bias towards option B for both Indonesian and Tamil datasets regardless of prompt language. For instance, in the runs that had all the correct answers as option B, GPT-4 when prompted with translated prompts scored 87.89% and 97.50% for the Indonesian and Tamil minimal pair task respectively, which looks impressive on first glance (see Table 54 in Appendix F for a detailed breakdown of answer frequencies across runs). This consistency in answer bias across settings suggests that the bias is inherent in the model. After controlling for this bias using the procedure described in Section 4.4, it was found that both models performed dismally on minimal pairs, although GPT-4 was significantly better than GPT-3.5-Turbo and performance on the Indonesian dataset was better than that on the Tamil dataset (see Table 34 and Figure 6). Morphology Argument Structure Filler-gap Dependencies NPIs and Negation Overall Indonesian Dataset GPT-3.5-Turbo 21/140 15.00% 29/160 18.13% 21/60 35.00% 4/20 20.00% 22.03% GPT-3.5-Turbo (T) 33/140 23.57% 46/160 28.75% 27/60 45.00% 12/20 60.00% 39.33% GPT-4 67/140 47.86% 84/160 52.50% 35/60 58.33% 20/20 100.00% 64.67% GPT-4 (T) 60/140 42.86% 76/160 47.50% 33/60 55.00% 20/20 100.00% 61.34% Tamil Dataset GPT-3.5-Turbo 23/140 16.43% 10/40 25.00% 0/20 0.00% - 13.81% GPT-3.5-Turbo (T) 19/140 13.57% 2/40 5.00% 0/20 0.00% - 6.19% GPT-4 58/140 41.43% 14/40 35.00% 16/20 80.00% - 52.14% GPT-4 (T) 31/140 22.14% 13/40 32.50% 18/20 90.00% - 48.21% Table 34: Performance of GPT-3.5-Turbo and GPT-4 on the minimal pair task We further analyzed the results by categories as well. We designate 60% as a passing score for a category, which indicates that the model has a mostly adequate albeit inconsistent representation of the linguistic feature
Chunk 70 · 1,996 chars
0% 18/20 90.00% - 48.21% Table 34: Performance of GPT-3.5-Turbo and GPT-4 on the minimal pair task We further analyzed the results by categories as well. We designate 60% as a passing score for a category, which indicates that the model has a mostly adequate albeit inconsistent representation of the linguistic feature tested for in the category. We also define ‘native-like’ proficiency as 90% and above, which is similar to the performance of our human raters, and we define 10% and lower as a consistently wrong representation of the target linguistic feature. Using these definitions, we were able to understand which linguistic phenomena are well-understood by the models and which are not (see Table 35). Given that GPT-3.5-Turbo had an extremely poor performance in most categories, we will only discuss GPT-4’s strengths and weaknesses in the following paragraphs to avoid inflating the analysis. The full breakdown for both models can be found in Tables 55 and 56 in Appendix F. For the Indonesian language, it seems that GPT-4 has a strong understanding of negative polarity items (NPIs) and negation, as well as both subject and object relative clauses. It also has a strong representation of word order possibilities with respect to modals in the active voice and benefactive constructions. In addition, it also has an adequate understanding of coordinate and adjunct islands [142] in Indonesian. However, it still failed 21 out of 38 categories, amongst which it gave consistently wrong (<10%) answers for 7 of them. These categories often involve linguistic phenomena that do not feature in English which might suggest that GPT-4 is unable to understand these phenomena because it does not have a strong representation of Indonesian syntax and is still relying heavily on its understanding of English syntax. For example, since voice affixes 36 -- 36 of 86 -- BHASA Figure 6: Radar charts showing the accuracy (%) of GPT-3.5-Turbo and GPT-4 with English and translated prompts on
Chunk 71 · 1,995 chars
enomena because it does not have a strong representation of Indonesian syntax and is still relying heavily on its understanding of English syntax. For example, since voice affixes 36 -- 36 of 86 -- BHASA Figure 6: Radar charts showing the accuracy (%) of GPT-3.5-Turbo and GPT-4 with English and translated prompts on the Indonesian and Tamil minimal pair task. are not a feature of English, GPT-4 performs poorly on all categories involving the men- active voice prefix. Object voice is also a core feature of Indonesian and understanding it is crucial for native-like use of the language. While GPT-4 seems to be able to grasp the correct word order in such constructions, it does not understand the morphological nuances involved, such as the use of modals and demonstratives, and is therefore easily perturbed. Another example would be the enclitic =nya, which has a plethora of important functions in Indonesian, one of which is to express possession [125, 143, 144]. GPT-4 had consistently wrong answers in this category as well, which was surprising given the importance and prevalence of this enclitic. GPT-4 also had poor performance on categories involving reflexives (dirinya and dirimu) in different contexts. Finally, passive constructions in Indonesian may occur with or without the preposition oleh [145], unlike in 37 -- 37 of 86 -- BHASA Pass Native Wrong Indonesian Dataset (38 categories) GPT-3.5-turbo 3 0 21 GPT-3.5-turbo (T) 8 6 15 GPT-4 17 11 7 GPT-4 (T) 16 9 9 Tamil Dataset (20 categories) GPT-3.5-turbo 1 0 10 GPT-3.5-turbo (T) 0 0 14 GPT-4 9 2 5 GPT-4 (T) 3 2 8 Table 35: Breakdown of GPT-3.5-Turbo and GPT-4’s performance level. ‘Pass’ refers to categories with at least a 60% accuracy rate, whereas ‘Native’ refers to at least a 90% accuracy rate. ‘Wrong’ refers to categories that had a performance level of 10% or lower. English where the equivalent preposition by is obligatory, and it seems that GPT-4 does not understand argument structure well when oleh is
Chunk 72 · 1,987 chars
o categories with at least a 60% accuracy rate, whereas ‘Native’ refers to at least a 90% accuracy rate. ‘Wrong’ refers to categories that had a performance level of 10% or lower. English where the equivalent preposition by is obligatory, and it seems that GPT-4 does not understand argument structure well when oleh is absent from the sentence. For Tamil, GPT-4’s performance was much more lackluster. When GPT-4 understands a linguistic phe- nomenon in Indonesian, it usually does so very well, with 11 out of the 17 categories (64.71%) that it obtained a passing grade for being at native-level proficiency. On the other hand, GPT-4 only obtained a native-level performance in 2 out of 9 categories (22.22%) with a passing score, suggesting that its passes are mostly borderline. We will only discuss GPT-4’s performance when prompted in English since its performance when prompted in Tamil is far poorer, most likely due to it not fully understanding the prompt. GPT-4 had a strong understanding of subject and object relatives in Tamil just like it did in Indonesian, as well as word order possibilities in causative constructions with vai (‘to place’) functioning as the auxiliary (‘to make’). It also has a somewhat adequate understanding of the relationship between tense inflections and time. In addition, it understands the syntax of matrix verbs selecting for finite clauses but fails to do so for matrix verbs selecting for non-finite clauses. As was hypothesized for the Indonesian results, it seems that GPT-4 struggles with linguistic features that exist in Tamil but not in English, and these are fairly common given that the two languages are typologically distinct. First, the Tamil case system is very extensive and has complex interactions with other morphological features such as animacy and gender [126] and understanding the function of morphological case markers is crucial to having a native-like understanding of the language. GPT-4 has a poor understanding of
Chunk 73 · 1,993 chars
cally distinct. First, the Tamil case system is very extensive and has complex interactions with other morphological features such as animacy and gender [126] and understanding the function of morphological case markers is crucial to having a native-like understanding of the language. GPT-4 has a poor understanding of the syntactic rules surrounding case markers in Tamil, especially in the context of nominative versus dative subjects. Second, a related point would be the complex subject-verb agreement (SVA) system in Tamil. GPT-4 performs poorly at figuring out whether a verb should agree with the nominative or dative subject, or whether a verb should agree with the subject or object. We emphasize that these are trivial tasks for native speakers and not understanding SVA could cause major problems in understanding and generating Tamil text. Third, scrambling is prevalent in Tamil given that it has a higher degree of flexibility in word order. We hypothesize that this prevalence might have misled GPT-4 into “thinking” that word order is freer than it actually is, causing it to fail at categories such as scrambling out of embedded clauses. Fourth, there are many complementizers in Tamil and some can take case markers as well, unlike in English. As such, GPT-4 does not understand their use in Tamil, such as not being able to tell the difference between enru and enpatu or not knowing that enru does not inflect for case unlike enpatu. Finally, another feature in Tamil that does not exist in English is the productive use of complex predicates. GPT-4 does not appear to accurately represent the syntactic rules surrounding complex predicates, but this is again a key component of the language that has to be mastered to reach native proficiency. Overall, we find that even GPT-4, which does seem to perform well on NLU, NLG and NLR, has many blind spots when it comes to the syntax of Indonesian and Tamil. These gaps in syntactic knowledge need to be closed if we aspire to
Chunk 74 · 1,994 chars
again a key component of the language that has to be mastered to reach native proficiency. Overall, we find that even GPT-4, which does seem to perform well on NLU, NLG and NLR, has many blind spots when it comes to the syntax of Indonesian and Tamil. These gaps in syntactic knowledge need to be closed if we aspire to develop models that truly have strong or even ‘native-like’ proficiency in Indonesian and Tamil. 38 -- 38 of 86 -- BHASA 4.5.2 Translation VR1 VR2 VR3 VR4 VR5 VCOMP CLF IDIOM Total GPT-3.5-Turbo 5/10 50.00% 2/10 20.00% 22/26 84.62% 13/16 81.25% 12/18 66.67% 6/10 60.00% 2/10 20.00% 2/22 9.09% 64/122 52.46% GPT-3.5-Turbo (T) 4/10 40.00% 3/10 30.00% 20/26 76.92% 15/16 93.75% 12/18 66.67% 6/10 60.00% 2/10 20.00% 5/22 22.73% 67/122 54.92% GPT-4 6/10 60.00% 5/10 50.00% 18/26 69.23% 14/16 87.50% 13/18 72.22% 6/10 60.00% 4/10 40.00% 5/22 22.73% 71/122 58.20% GPT-4 (T) 6/10 60.00% 5/10 50.00% 21/26 80.77% 14/16 87.50% 16/18 88.89% 8/10 80.00% 4/10 40.00% 11/22 50.00% 85/122 69.67% Table 36: Accuracy of GPT-3.5-Turbo and GPT-4 on the Indonesian LINDSEA translation task. VR1 to VR5 stand for various verbal reduplication phenomena, VCOMP stands for compound verbs, CLF stands for classifiers, and IDIOM stands for idiomatic expressions. Emphatic taan Clitics Verb reduplication Idiomatic expressions Total GPT-3.5-Turbo 9/20 45.00% 12/20 60.00% 0/20 0.00% 2/20 10.00% 23/80 28.75% GPT-3.5-Turbo (T) 5/20 25.00% 11/20 55.00% 0/20 0.00% 2/20 10.00% 18/80 22.50% GPT-4 14/20 70.00% 14/20 70.00% 0/20 0.00% 6/20 30.00% 34/80 42.50% GPT-4 (T) 14/20 70.00% 15/20 75.00% 0/20 0.00% 6/20 30.00% 35/80 43.75% Table 37: Accuracy of GPT-3.5-Turbo and GPT-4 on the Tamil LINDSEA translation task. GPT-4 is much better than GPT-3.5-Turbo in the LINDSEA translation task for both languages. It is interesting to note that prompting GPT-4 natively results in a stronger performance especially on the Indonesian dataset where we see an improvement of more than 10% (see Table 36). For the
Chunk 75 · 1,997 chars
Tamil LINDSEA translation task. GPT-4 is much better than GPT-3.5-Turbo in the LINDSEA translation task for both languages. It is interesting to note that prompting GPT-4 natively results in a stronger performance especially on the Indonesian dataset where we see an improvement of more than 10% (see Table 36). For the Indonesian dataset, we note that both models have a decent understanding of verbal reduplication in general, but are much weaker at dealing with cases where reduplication conveys the atelic15 nature of the action (VR1) or repetition (VR2) [125]. Both models also do not understand idiomatic expressions, with the best model (GPT-4 with Indonesian prompts) only scoring 50%. In particular, GPT-3.5-Turbo translated almost every sentence literally, betraying its limited knowledge of idiomatic expressions. Finally, both models do not use Indonesian classifiers correctly, mostly using the generic sebuah instead of noun-specific classifiers, whereas the appropriate use of specific classifiers is trivial for native speakers. As for the Tamil dataset (see Table 37), both models performed very poorly with the best overall performance coming from GPT-4 with Tamil prompts (43.75%). As with the Indonesian dataset, we observe that idiomatic expressions are a challenge for both models, with translations that are both literal and hallucinated. For example, the sentence “ரவி காலம் ஆனார்” (Ravi k¯alam ¯an¯ar) literally means ‘Ravi became time’ actually conveys the meaning ‘Ravi passed away’. GPT-4 translated this sentence as “Ravi became a pen”, which takes the literal meaning of the verb “to become” but somehow translates time as pen.16 On the other hand, we find that while the models did well with verbal reduplication in Indonesian, they were both unable to get a single translation right for verbal reduplication in Tamil. The attempts made are 15Generally speaking, telic verbs (such as arrive) semantically encode an event with an inherent endpoint, while atelic verbs
Chunk 76 · 1,994 chars
we find that while the models did well with verbal reduplication in Indonesian, they were both unable to get a single translation right for verbal reduplication in Tamil. The attempts made are 15Generally speaking, telic verbs (such as arrive) semantically encode an event with an inherent endpoint, while atelic verbs (such as run) do not. Works such as [146] have investigated the representation of telicity in language models. 16Interestingly, the word k¯alam if spelt in the Devanagari script with a short /a/ vowel (कलम) would mean pen in Sanskrit, Hindi, Nepali and Bhojpuri. This might suggest that the model is trying to rely on shared embeddings in other related languages. 39 -- 39 of 86 -- BHASA often incoherent, demonstrating that the models do not have any understanding of verbal reduplication in Tamil. For example, the sentence “ராமா சந்ைதக்குப் ேபானாலும் ேபாகலாம்” (R¯am¯a cantaikkup p¯on¯alum p¯okal¯am) repeats the verb ேபா (p¯o) with the appropriate inflections to convey the modality of possibility (‘Rama might go to the market’). However, GPT-4 translates this sentence as “Even if Rama goes to the market, we can go”. In spite of these poor performances, GPT-4 has a somewhat decent and consistent understanding of the emphatic taan as well as clitics in Tamil (∼70%) while GPT-3.5-Turbo does not. For example, it could understand that in the context of the sentence “மைழ ெபய்யும் ேபாது தான் வானவில்ைல பார்க்க முடியும்” (‘Rainbows can be seen only when it rains’), the emphatic taan indicates that rainbows can be seen only when it rains. Overall, we note that even though GPT-4 with Indonesian prompts did relatively well, it was unable to get a perfect score in any single category, indicating that there is still much room for improvement. We thus suggest shoring up language models’ understanding of verbal reduplication (especially in Tamil) and idiomatic expressions in order to bring their proficiency in Indonesian and Tamil to a more native-like level. 4.5.3
Chunk 77 · 1,995 chars
erfect score in any single category, indicating that there is still much room for improvement. We thus suggest shoring up language models’ understanding of verbal reduplication (especially in Tamil) and idiomatic expressions in order to bring their proficiency in Indonesian and Tamil to a more native-like level. 4.5.3 Coreference resolution Long-distance Reflexives Resumptive Pronouns Pro-drop Translated XWinograd Total Indonesian Dataset GPT-3.5-Turbo 4/10 40.00% 0/10 0.00% 10/10 100.00% 42/72 58.33% 56/102 54.90% GPT-3.5-Turbo (T) 6/10 60.00% 1/10 10.00% 10/10 100.00% 39/72 54.17% 56/102 54.90% GPT-4 3/10 30.00% 2/10 20.00% 10/10 100.00% 61/72 84.72% 76/102 74.51% GPT-4 (T) 3/10 30.00% 2/10 20.00% 10/10 100.00% 60/72 83.33% 75/102 73.53% Tamil Dataset GPT-3.5-Turbo 0/10 0.00% - 7/10 70.00% 11/58 18.97% 18/78 23.08% GPT-3.5-Turbo (T) 1/10 10.00% - 5/10 50.00% 8/58 13.79% 14/78 17.95% GPT-4 0/10 0.00% - 10/10 100.00% 9/58 15.52% 19/78 24.36% GPT-4 (T) 0/10 0.00% - 10/10 100.00% 8/58 13.79% 18/78 23.08% Table 38: Performance of GPT-3.5-Turbo and GPT-4 on all coreference resolution tasks. A B No answer GPT-3.5-Turbo 90/174 51.72% 84/174 48.28% 0/174 0.00% GPT-3.5-Turbo (T) 122/174 70.11% 33/174 18.97% 19/174 10.92% GPT-4 83/174 47.70% 91/174 52.30% 0/174 0.00% GPT-4 (T) 31/174 17.82% 143/174 82.18% 0/174 0.00% Table 39: Proportions of model answers for Tamil coreference resolution in the translated XWinograd task (sum across three runs) 40 -- 40 of 86 -- BHASA Overall, GPT-4 performed well on coreference resolution for Indonesian (see Table 38), especially on pro- drop [124] examples and examples requiring commonsense knowledge translated from XWinograd. GPT-4’s better performance compared to GPT-3.5-Turbo can largely be attributed to a superior performance in the translated XWinograd dataset. Otherwise, the two models have a very similar performance in all other categories. It was however surprising that GPT-3.5-Turbo, when prompted in Indonesian, scored twice
Chunk 78 · 1,991 chars
ograd. GPT-4’s better performance compared to GPT-3.5-Turbo can largely be attributed to a superior performance in the translated XWinograd dataset. Otherwise, the two models have a very similar performance in all other categories. It was however surprising that GPT-3.5-Turbo, when prompted in Indonesian, scored twice as high as GPT-4 on long-distance reflexives and it is still unclear why this might be the case. It is worth noting that we deliberately leave the questions for long-distance reflexive dirinya open-ended since it has more than one potential antecedent and we take the models’ choice of antecedent(s) to be indicative of its preference. Despite not forcing the models to choose only one antecedent (see Table 40), both models did do so anyway and GPT-4’s choices in particular indicated a preference for choosing local antecedents, although native speakers actually preferred non-local antecedents in the sentences tested. In any case, it is clear that long-distance reflexives and resumptive pronouns which do not occur in English are still very much a challenge for both models. Prompt Based on the sentence ‘Tono mengatakan bahwa Tina menyalahkan dirinya atas kesalahan tersebut.’, who does ‘dirinya’ refer to? Possible Answers Tono (native-preferred) or Tina GPT-4’s Response In the sentence ’Tono mengatakan bahwa Tina menyalahkan dirinya atas kesalahan tersebut,’ the word "dirinya" refers to Tina. The sentence translates to "Tono said that Tina blamed herself for the mistake." Table 40: Example of an open-ended coreference resolution question on long-distance reflexive dirinya On the other hand, while GPT-4 generally did well for Indonesian, the same cannot be said for Tamil, with both models getting less than a quarter of the questions right (see Table 38). Both models performed far below chance on the translated XWinograd dataset (binary choice task), often giving inconsistent answers across the 3 runs, with Tamil-prompted GPT-4 doing so for 81% of the
Chunk 79 · 1,991 chars
the same cannot be said for Tamil, with both models getting less than a quarter of the questions right (see Table 38). Both models performed far below chance on the translated XWinograd dataset (binary choice task), often giving inconsistent answers across the 3 runs, with Tamil-prompted GPT-4 doing so for 81% of the questions (see Table 58 in Appendix F). We therefore decided to perform an analysis of the frequencies of answering A or B for this task (see Table 39) and we found that Tamil-prompted GPT-3.5-Turbo displayed a bias towards choosing A (70.11%) while Tamil-prompted GPT-4 displayed a bias towards choosing B (82.18%). The fact that only Tamil-prompted models exhibited such a bias leads us to believe that the model might not have understood the Tamil prompts. This is further evidenced by GPT-3.5-Turbo’s inability to provide an answer for 11% of the questions when prompted in Tamil. Importantly, the models seem to not simply resort to random guessing, and we leave an account of this for future research. As for long-distance reflexives, we observe similarly poor performances in Tamil as in Indonesian. Both models were consistently unable to select the antecedent of the long-distance reflexive taan [147, 148] that native speakers preferred. In addition, Tamil-prompted GPT-3.5-Turbo had 9 incoherent responses out of 10, casting doubt on the model’s understanding of the task. Despite the poor performances in the previous categories, GPT-4 obtained a full score for the pro-drop category with both English and Tamil prompts. Although the dataset is small, we believe that it demonstrates how coreference resolution in Tamil overall remains a challenging task for both models. In general, it seems that both models perform quite well on the pro-drop phenomenon but they seem to treat Indonesian and Tamil long-distance reflexives like English reflexives which can only take local antecedents [124], and also struggle with resumptive pronouns in Indonesian, possibly
Chunk 80 · 1,985 chars
llenging task for both models.
In general, it seems that both models perform quite well on the pro-drop phenomenon but they seem to treat
Indonesian and Tamil long-distance reflexives like English reflexives which can only take local antecedents
[124], and also struggle with resumptive pronouns in Indonesian, possibly because it is not a feature of
English.
4.5.4 Pragmatic reasoning
Pragmatic reasoning goes beyond entailment. As part of our goal to develop a holistic evaluation suite, we
also propose to measure models’ pragmatic reasoning capabilities by testing whether they can detect scalar
implicatures17 and presuppositions, which are distinct from entailment phenomena.
Scalar implicatures are introduced by a lexical item that is a member of a lexical scale ({some,all};
{warm,hot}), and unlike entailments, they are defeasible. For instance, the sentence ‘Tono is now a fa-
ther’ entails that he has at least one child, so following up the sentence with ‘Tono has no children’ results in
a contradiction. In contrast, the sentence ‘Radha ate some of the idlis’ introduces a scalar implicature with
17Though there are other types of implicatures, we focus on scalar implicatures here.
41
-- 41 of 86 --
BHASA
Scalar Implicatures Presuppositions Overall
Indonesian Dataset
GPT-3.5-Turbo 64/104
61.54%
44/80
55.00% 58.27%
GPT-3.5-Turbo (T) 63/104
60.58%
51/80
63.75% 62.17%
GPT-4 64/104
61.54%
64/80
80.00% 70.77%
GPT-4 (T) 87/104
83.65%
65/80
81.25% 82.45%
Tamil Dataset
GPT-3.5-Turbo 44/104
42.31%
33/60
55.00% 48.65%
GPT-3.5-Turbo (T) 47/104
45.19%
26/60
43.33% 44.26%
GPT-4 47/104
45.19%
44/60
73.33% 59.26%
GPT-4 (T) 61/104
58.65%
31/60
51.67% 55.16%
Table 41: Performance of GPT-3.5-Turbo and GPT-4 for all pragmatic reasoning tasks
the use of the word some, implying that only some but not all of the idlis were eaten. However, the scalar
implicature can be explicitly canceled by uttering the follow-up sentence ‘In fact, Radha ate all of the idlis’
[149, 150, 25,Chunk 81 · 1,992 chars
6% Table 41: Performance of GPT-3.5-Turbo and GPT-4 for all pragmatic reasoning tasks the use of the word some, implying that only some but not all of the idlis were eaten. However, the scalar implicature can be explicitly canceled by uttering the follow-up sentence ‘In fact, Radha ate all of the idlis’ [149, 150, 25, 26]. Presuppositions are inferences that are drawn due to the presence of presupposition triggers [151, 25, 29], and such inferences typically project out of operators (such as question or negation operators – see for instance [25] or [29]). Conversely, entailments are canceled under such operators [152]. The sentence ‘Tono visited his sister’ presupposes that Tono has a sister and entails that Tono paid his sister a visit. When a negation operator not is added, as in ‘Tono did not visit his sister’, the entailment is canceled but not the presupposition (i.e. the existence of Tono’s sister is not canceled). For pragmatic reasoning in Indonesian, it was found that prompting in the native language actually boosts performance, especially in the case of GPT-4 where an improvement of more than 10% was observed. For presuppositions, GPT-4 had a very strong performance (>80%) in all categories except for non- culminating change-of-state verbs [153] (see Table 41; see also Table 61 in Appendix F). This is expected as this is a feature that does not exist in English.18 On the other hand, it seems that GPT-3.5-Turbo does not have a good understanding of presupposition triggers in clefted constructions, embedded interrogatives and change-of-state verbs in general. For scalar implicatures, we tested the models on various aspects of the adjective scale (boosters, comparative and superlative constructions, degree ordering and negation) as well as quantifiers (some and all). GPT-4 understands scalar implicatures associated with boosters and superlatives, while GPT-3.5-Turbo’s perfor- mance in these categories is no better than random guessing (see Table 59 in
Chunk 82 · 1,994 chars
ve scale (boosters, comparative and superlative constructions, degree ordering and negation) as well as quantifiers (some and all). GPT-4 understands scalar implicatures associated with boosters and superlatives, while GPT-3.5-Turbo’s perfor- mance in these categories is no better than random guessing (see Table 59 in Appendix F). However, it was surprising to see that GPT-3.5-Turbo outperformed GPT-4 on the comparative adjective category when prompted in English (80% versus 20%). It is to be noted that comparatives in Indonesian are not formed 18In Indonesian, unlike English, some change-of-state verbs do not need to entail a change of state. For example, the sentence ‘Budi membunuh Ali, tapi dia tidak mati’ (‘Budi killed Ali, but he did not die’) would make perfect sense in Indonesian although it would be nonsensical in English [153]. We acknowledge that this, strictly speaking, does not involve presuppositions, but rather involves entailment. However, we still included it under this section because entailment is also a component of pragmatic reasoning and it is related to another test for presuppositions involving change-of-state verbs (where such verbs presuppose that the previous state was different from what it is now [29]). 42 -- 42 of 86 -- BHASA using morphological affixes like in English. Both models, regardless of prompt language, were not able to get degree ordering questions correct consistently, suggesting that the representation of the degree intensity of the lexical items involved might be influenced by that of their English counterparts and is therefore not accurate with regard to how Indonesians construe them. Only GPT-4, when prompted in Indonesian, had a consistent understanding of how negation interacts with the adjective scale (95%). As for quantifiers, both models had a strong understanding that beberapa (‘some’) and sebagian (‘a part of’) function as the quantifier some, with GPT-4 obtaining a full score regardless of prompt language.
Chunk 83 · 1,995 chars
prompted in Indonesian, had a consistent understanding of how negation interacts with the adjective scale (95%). As for quantifiers, both models had a strong understanding that beberapa (‘some’) and sebagian (‘a part of’) function as the quantifier some, with GPT-4 obtaining a full score regardless of prompt language. Finally, only GPT-3.5-Turbo, when prompted in Indonesian, had a strong understanding of the interaction between the negation morpheme tidak and beberapa or semua (‘all’) (91.67%), with GPT-4 underperforming GPT-3.5-Turbo in this category – a surprising finding given the overall results. Moving on to Tamil, performance for pragmatic reasoning was generally quite poor (see Table 41). For presuppositions, both models could not consistently detect the presence of presuppositions. Notably, however, English-prompted GPT-4 managed to score 19/20 for presupposition detection in clefted structures, while it was inconsistent in the other 2 categories. As for scalar implicatures, it was difficult to measure the models’ performance because it was found that for at least 5 out of 7 categories (that is, excluding the ‘some’ and ‘all’ categories), the models seemed to strongly prefer only either ‘true’ or ‘false’ as answers (see Table 60 Appendix F), which is not an expected outcome if the models were randomly guessing. One possible explanation for the models’ ability to handle the ‘some’ and ‘all’ categories better could be that the task format for those two categories are more in line with NLI-style tasks which include both a premise and a hypothesis, allowing the model to find clues in both texts. Regardless, we still observe that all models performed slightly below chance for scalar implicatures, with only Tamil-prompted GPT-4 performing slightly above chance (58.65%). In general, despite the small size of the dataset, there is evidence to suggest that pragmatic reasoning is still challenging for the models, especially in Tamil. 5 Cultural representation and
Chunk 84 · 1,994 chars
ormed slightly below chance for scalar implicatures, with only Tamil-prompted GPT-4 performing slightly above chance (58.65%). In general, despite the small size of the dataset, there is evidence to suggest that pragmatic reasoning is still challenging for the models, especially in Tamil. 5 Cultural representation and sensitivity Beyond just syntax, semantics and pragmatics, cultural representation and sensitivity has also become increasingly important with LLM use, because a lack thereof could potentially cause social harm [44]. The gravity of the risks involved has prompted multiple studies in this area especially in recent times [45, 154, 155, 156]. Building on such work, we take a preliminary look at the representation of and sensitivity to Indonesian and Tamil culture in GPT-4. 5.1 Cultural representation 5.1.1 Task design [43] suggest that a culture can be defined by its shared cultural common ground, or a shared body of knowledge. We therefore propose a cultural representation diagnostic task based on probing for cultural knowledge. We categorize cultural knowledge into 4 high-level categories – language, literature, history and customs (see Figure 7). We then surveyed relevant literature on Indonesian and Tamil culture and consulted with native speakers to derive a set of basic knowledge questions that people from each culture would not have difficulty in answering. We then design the task as a simple information retrieval task but with occasional tests for knowledge application (see Table 42 for an example). As with most tasks, we use the default hyperparameters for the models’ generations, with 0 as the temperature for stability of answers. We only test GPT-4 with translated prompts mainly due to resource constraints, but also because GPT-4 has proven to be the stronger model and it is arguably the case that accurate cultural representation is expected more when a user is interacting with it in their native language and not English. We reiterate that
Chunk 85 · 1,992 chars
rs. We only test GPT-4 with translated prompts mainly due to resource constraints, but also because GPT-4 has proven to be the stronger model and it is arguably the case that accurate cultural representation is expected more when a user is interacting with it in their native language and not English. We reiterate that although we evaluated GPT-4 in this paper, the evaluation suite is designed to be model-agnostic as the main intention is to probe models’ cultural representation and sensitivity. The authors of this paper who are native speakers evaluated the responses both qualitatively and quan- titatively. For shorter generations, we score GPT-4’s generations across 3 runs, taking the average score as the final score. For longer generations, we only scored one generation due to cost considerations. The scoring system uses a scale of 0 to 2, with 0 representing answers with little to no accuracy or relevance, 1 43 -- 43 of 86 -- BHASA Figure 7: Taxonomy of cultural representation tasks. System Prompt Anda adalah orang Indonesia You are an Indonesian Human Prompt Apa makna dari “di mana bumi dipijak, di situ langit dijunjung” ini? What is the meaning of “when in Rome, do as the Romans do”? Tulislah sebuah cerita singkat yang menjelaskan mengenai “di mana bumi dipijak, di situ langit dijunjung” ini. Write a short story that explains “when in Rome, do as the Romans do”. Table 42: Example of a cultural representation question. The first question is a simple information retrieval task (asking for the meaning of a proverb), while the second question requires application of that knowledge (asking the model to write a story to explain that proverb). representing answers that are somewhat accurate or relevant, and 2 representing answers that are mostly accurate or relevant. For the full list of questions asked, please refer to Appendix G. 5.1.2 Results Overall, we find that GPT-4 is lacking severely in the representation of Indonesian and Tamil culture across the 4
Chunk 86 · 1,990 chars
epresenting answers that are somewhat accurate or relevant, and 2 representing answers that are mostly accurate or relevant. For the full list of questions asked, please refer to Appendix G. 5.1.2 Results Overall, we find that GPT-4 is lacking severely in the representation of Indonesian and Tamil culture across the 4 categories with a couple of subcategories even having a score of 0 (see Table 43). We note that even though some tasks involve follow-up questions which gives GPT-4 a chance to revise its response (if it is wrong), GPT-4 tends to double down on its original answer, suggesting that it truly does not know the appropriate answer. In the subsequent sections, we present the results for each category in detail. Language: Proverbs We asked GPT-4 to explain the meaning behind 5 common and well-known proverbs in Indonesian and Tamil (see Appendix G for the full list). We also asked follow-up questions by prompting it to generate a story that explains the proverb, testing for its ability to apply its knowledge of the proverb. GPT-4 performed moderately in this section for both languages (50%). For example, in Indonesian, it explained air cucuran atap jatuhnya ke pelimbahan juga (lit: ‘water from the roof falls into the wastewater too’; ‘the apple does not fall far from the tree’) as being more akin in meaning to ‘a leopard cannot change its spots’. It also explained menang jadi arang, kalah jadi abu (lit: ‘win becomes coal, lose becomes ash‘; ‘losing or winning in an argument will not benefit either party’) by generating a story about a person caught in a dilemma, which is irrelevant. In Tamil, although in 3 out of 5 questions it was able to generate both correct explanations and relevant stories, we note that when it is unable to do so, the answers can be incoherent. For instance, with the proverb ஆத்திரக்காரனுக்கு புத்தி மட்டு (lit: ‘an angry man has less than average intelligence’; ‘an angry man does not make sound decisions’), GPT-4 did not seem to
Chunk 87 · 1,998 chars
to generate both correct explanations and relevant stories, we note that when it is unable to do so, the answers can be incoherent. For instance, with the proverb ஆத்திரக்காரனுக்கு புத்தி மட்டு (lit: ‘an angry man has less than average intelligence’; ‘an angry man does not make sound decisions’), GPT-4 did not seem to understand the words மட்டு (‘limited’) and புத்தி (‘intelligence’) and instead treated both as physical objects that could be interacted with. This can be seen in the story that it generated to explain the proverb, where a farmer was said to have “tied intelligence on top of limited”, which is nonsensical. 44 -- 44 of 86 -- BHASA Category Subcategory Indonesian Score Tamil Score Language Proverbs Proverb Explanation and Storytelling 2/4 50.00% Proverb Explanation and Storytelling 2/4 50.00% Literature Poetry Pantun Generation 0/12 0.00% Poet Identification and Poetry Explanation* 1/4 25.00% Pantun Interpretation* 1/2 50.00% Five Great Epics Naming and Summary* 0.33/4 8.33% Folklore/ Fables Folklore Recommendation and Moral Explanation 1/4 25.00% Fable Recommendation and Self-Verification 1/4 25.00% History Historical Figures Identification of Figure and Claim to Fame 2/4 50.00% Identification of Figure and Claim to Fame* 1.33/4 33.33% Customs Customs Custom Description and Significance* 2/4 50.00% Custom Description and Significance 3/4 75.00% Traditional Games Game Recognition and Explanation* 0/4 0.00% Game Recognition and Explanation* 0/4 0.00% Total 8/34 23.53% Total 8.66/28 30.93% Table 43: Summary of GPT-4’s performance on the cultural representation tasks. Tasks marked with an asterisk (*) involved short generations and were repeated 3 times. The score for these tasks were then obtained via simple averaging of scores across the 3 runs. As such, it seems that GPT-4 still has room to acquire knowledge of more proverbs in both languages. Literature: Poetry and Epics For Indonesian, we asked GPT-4 to generate 5 pantuns19across
Chunk 88 · 1,999 chars
ations and were repeated 3 times. The score for these tasks were then obtained via simple averaging of scores across the 3 runs. As such, it seems that GPT-4 still has room to acquire knowledge of more proverbs in both languages. Literature: Poetry and Epics For Indonesian, we asked GPT-4 to generate 5 pantuns19across 3 different themes (family, friendship and romantic love) which are commonly found in pantuns. The model was then scored on how well it follows the ABAB rhyming scheme and how well the intended message is delivered. It was found that GPT-4 was consistently unable to follow the ABAB template and tends to deliver the theme in the first two lines instead of the last two lines, making the generated pantun more like a regular poem than an authentic pantun. GPT-4 also tries to use metaphors to link the first two and last two lines, but the metaphors sometimes do not work well, and there should not be a link between the sampiran and the last two lines anyway. Furthermore, pantuns have diverse themes and can draw from activities in daily life, but GPT-4 tends to just use nature as imagery. We also asked GPT-4 to interpret 6 pantuns20, 2 from each of the 3 aforementioned themes. Generally, GPT-4 is able to identify the sampiran but tends to try to link it to the last two lines when there should not be any link whatsoever, resulting in inaccurate explanations. Overall, it seems that GPT-4 cannot generate pantuns, but is somewhat able to explain them, albeit inconsistently. For Tamil, we ask GPT-4 to interpret 5 well-known poems21 by Subramania Bharathi, who is a renowned and highly-regarded poet from the 19th century. In particular, we made sure to pick only from verses of poems that are cited frequently in various media or taught in schools. Next, we also ask the model whether it knows the author of the respective poems, which is presumably also common knowledge given the significance of these poems in Tamil culture. This process was repeated 3 times. Notably,
Chunk 89 · 1,895 chars
pick only from verses of poems that are cited frequently in various media or taught in schools. Next, we also ask the model whether it knows the author of the respective poems, which is presumably also common knowledge given the significance of these poems in Tamil culture. This process was repeated 3 times. Notably, GPT-4 is only able to interpret 2 of the poems and partially explain another consistently across the 3 rounds. It also consistently only attributes 1 poem to Subramania Bharathi across the 3 rounds. Based on the overall score, GPT-4 seems to have limited knowledge of Subramania Bharathi and his poems. On top of this, we also ask GPT-4 in Tamil to name the Five Great Epics, which are important and well- known in Tamil literary culture, and we evaluate its knowledge by checking how many out of 5 it correctly names. We follow up by asking GPT-4 to summarize the key events of each epic to further test its knowledge. 19A pantun is a particular type of Indonesian poetry that has been designated as an intangible cultural heritage by UNESCO and is an important part of Malay and Indonesian literary culture (https://ich.unesco.org/en/RL/ pantun-01613). They are used in a variety of social settings and are also formally taught in schools. A pantun usually has at least 4 lines, following an ABAB rhyming scheme. The first two lines (termed ‘sampiran’) can be completely unrelated to the actual intended message of the poem, which is only delivered in the third and fourth lines, often in a light-hearted manner. It is rather similar to ‘roses are red, violets are blue’ poems in English. 20The pantuns were retrieved from the following websites: (1) https://www.cnnindonesia.com/edukasi/ 20230608154828-561-959404/kumpulan-contoh-pantun-cinta-lucu-dan-romantis (2) https://www.bola.com/ ragam/read/5286878/24-contoh-pantun-tentang-keluarga-menginspirasi-dan-penuh-makna (3)
Chunk 90 · 1,998 chars
olets are blue’ poems in English. 20The pantuns were retrieved from the following websites: (1) https://www.cnnindonesia.com/edukasi/ 20230608154828-561-959404/kumpulan-contoh-pantun-cinta-lucu-dan-romantis (2) https://www.bola.com/ ragam/read/5286878/24-contoh-pantun-tentang-keluarga-menginspirasi-dan-penuh-makna (3) https://www. bola.com/ragam/read/5065527/20-contoh-pantun-bertema-sahabat-yang-bisa-menambah-keakraban 21The poems were retrieved from the following website: https://dheivegam.com/tamil-kavithaigal/ bharathiyar-kavithaigal/ 45 -- 45 of 86 -- BHASA This whole process was repeated 3 times. Based on the overall score, the model seems to have very limited knowledge of the Five Great Epics. Literature: Folklore/Fables When testing for GPT-4’s knowledge of Indonesian folklore, we asked it to recommend Indonesian folk stories that can teach children about a certain Indonesian value and also followed up by asking it to explain the moral behind its chosen story. In 4 out of 5 values tested, GPT-4 recommended Sangkuriang,22 but was unable to produce a consistent list of morals behind the story in the follow-up question, with its response being influenced by its previous answer. It therefore seems that, at least based on this limited testing, GPT-4’s knowledge of Indonesian folklore is extremely limited and it is unable to apply its knowledge appropriately as well. As for Tamil, we ask GPT-4 to recommend a Tenali Raman fable23 that can teach children about a cer- tain Tamil value. We also ask a follow-up question on whether the fable it suggested exists, akin to a self-verification task. This was because some of the fables given by GPT-4 seemed overly fantastical and uncharacteristic of Tenali Raman fables which suggested that they might have been hallucinations. The self-verification task is therefore a second chance for the model to prove its knowledge. We note that GPT-4 could only consistently recommend a relevant and actually existing fable for the value
Chunk 91 · 1,988 chars
verly fantastical and uncharacteristic of Tenali Raman fables which suggested that they might have been hallucinations. The self-verification task is therefore a second chance for the model to prove its knowledge. We note that GPT-4 could only consistently recommend a relevant and actually existing fable for the value of respecting elders. Overall, it seems that GPT-4 does not have an accurate representation of the 5 values queried here and only has limited knowledge of Tenali Raman fables. History: Historical Figures For the testing of GPT-4’s cultural knowledge in the historical domain, we present GPT-4 with 5 ancient historical figures from Indonesian and Tamil culture and ask it who these figures are. We further verify its knowledge by asking it to point out what said historical figure is best known for. GPT-4 did not do very well and got some crucial details about 4 out of 5 of the historical figures wrong. For example, when queried about Anusapati, GPT-4 stated that his father was Ken Arok, when in fact Ken Arok killed Anusapati’s father, Tunggul Ametung. Additionally, Anusapati later killed Ken Arok to avenge his father’s death, but GPT-4 stated that Anusapati killed Panji Tohjaya, Ken Arok’s son, instead. As another example, GPT-4 attributes the building of a candi (a Hindu or Buddhist temple in Indonesia) named “Candi Mataram” in the year 732 to Ratu Sanjaya, but there is no such candi (there was, however, a Mataram Kingdom). As for the Tamil figures, although GPT-4 could identify more than half of the historical figures, it tended to hallucinate when it could not. A particularly salient example would be when GPT-4 identified அதியமான் (‘Adhiyam¯an’) incorrectly as a ‘caveman’ or ‘first man’, creating a non-existent word adhiyam¯ana which it seemed to understand as meaning ‘first’, and then continuing with a list of firsts, such as the “first well-known computer marvel”,24 Charles Babbage. Overall, there is still much room for improvement for the
Chunk 92 · 1,997 chars
ன் (‘Adhiyam¯an’) incorrectly as a ‘caveman’ or ‘first man’, creating a non-existent word adhiyam¯ana which it seemed to understand as meaning ‘first’, and then continuing with a list of firsts, such as the “first well-known computer marvel”,24 Charles Babbage. Overall, there is still much room for improvement for the representation of historical figures in GPT-4, especially given that GPT-4’s hallucinated responses could come across as potentially disrespectful to the respective cultures since these historical figures are often held in high esteem. 22Sangkuriang is a Sundanese legend about a young man who inadvertently falls in love with his own mother and is also the origin story of Mount Tangkuban Perahu. 23Tenali Raman fables revolve around the life of Tenali Raman, a renowned jester, poet, and philosopher who served as an advisor to King Krishnadevarayar of the Vijayanagara Empire in India in the 16th century. Tenali Raman fables are known for their wit, humor, and the moral lessons they impart. Over the years, there have been many versions and collections of Tenali Raman fables. See for instance [157, 158, 159, 160, 161]. 24We note that Charles Babbage was the inventor of the first mechanical computer, but we quote GPT-4’s halluci- nation directly here. 46 -- 46 of 86 -- BHASA Customs It was challenging to test for Indonesian cultural customs given that the country has a large number of regions and ethnic groups. We tried to be as inclusive as we could, picking 10 different customs [162] from 10 different regions in Indonesia and asked GPT-4 to describe each one. We then followed up by asking the model to explain the significance of said custom. This process was repeated 3 times. Of the 10 customs queried, GPT-4 could only fully describe and explain the significance of 2 customs. There were also 4 customs that GPT-4 consistently got wrong, and we further note that GPT-4’s answers tended to be unstable even at temperature 0, especially when it clearly does
Chunk 93 · 1,994 chars
om. This process was repeated 3 times. Of the 10 customs queried, GPT-4 could only fully describe and explain the significance of 2 customs. There were also 4 customs that GPT-4 consistently got wrong, and we further note that GPT-4’s answers tended to be unstable even at temperature 0, especially when it clearly does not know the custom it is being asked about. For example, GPT-4 stated that Brobosan25 is not a custom, but is an Indonesian term that refers to how one should find solutions through obstacles in life. GPT-4 also stated in 2 out of 3 generations that Batombe26 is a dance. As such, despite the moderate scores in this section (50%), GPT-4 seems to have glaring gaps in its knowledge of Indonesian customs. We did the same for Tamil culture and selected 7 significant and representative Tamil customs and asked GPT-4 to describe them, again asking the same follow-up question on their significance. Although GPT- 4’s overall performance is relatively high for this segment, it is important to note that there were some problematic and potentially discriminatory responses. For example, when asked to explain the significance of Kolam,27 GPT-4 generated a line associating this traditional practice with terrorist activities. Customs: Traditional Games For the traditional games subcategory, we selected 5 traditional games for each culture and asked GPT-4 if it knew what they were. We followed up with a question on how the winner is determined in these games to further probe its knowledge. The whole process was repeated 3 times for each culture. The games we selected are very popular and well-known by most, if not all Indonesians and Tamils. Hence, we believe that they can be representative of this component of Indonesian and Tamil culture. For Indonesian, GPT-4 could only consistently identify 2 out of the 5 games (Egrang28 and Lompat Tali Karet29) and did not seem to know what the rest were. For example, it consistently stated that Patok Lele30 is the name of a
Chunk 94 · 1,993 chars
e believe that they can be representative of this component of Indonesian and Tamil culture. For Indonesian, GPT-4 could only consistently identify 2 out of the 5 games (Egrang28 and Lompat Tali Karet29) and did not seem to know what the rest were. For example, it consistently stated that Patok Lele30 is the name of a restaurant despite the follow-up question asking how the winner of Patok Lele can be determined. It also could not consistently explain how to determine the winner for any of the games. For example, it was consistently unable to explain Suit Indonesia correctly. Although Suit Indonesia uses ‘elephant’, ‘human’ and ‘ant’ (represented by the thumb, index finger and little finger respectively), GPT-4 did not mention any of this and described Suit Indonesia as Rock Paper Scissors instead. As for Tamil, GPT-4 could only consistently identify 1 out of the 5 games (pallanguzhi31), and similarly could only consistently explain how to determine the winner for said game. An example of a failure would be when GPT-4 was consistently unable to identify or explain how Kabadi was played at all, despite it being a well-known and popular sport that has nation-wide tournaments in India. Instead, it stated consistently that Kabadi refers to the act of betraying or cheating someone. Overall, GPT-4 seems to be lacking severely in its knowledge of traditional games in both Indonesian and Tamil cultures. 25Brobosan is a Javanese ceremony held before a deceased elder family member is buried. The coffin is held up high so that family members of the deceased can pass under the coffin from the right to the left. This is done three times, and is a means for family members to pay respect to the deceased, and to remember them. Brobosan is derived from Javanese, and literally means ‘to break through’. 26Batombe is a West Sumatran cultural performance in which two interlocutors exchange pantuns with lively music playing in the background. Batombe originated as a means for West
Chunk 95 · 1,994 chars
mily members to pay respect to the deceased, and to remember them. Brobosan is derived from Javanese, and literally means ‘to break through’. 26Batombe is a West Sumatran cultural performance in which two interlocutors exchange pantuns with lively music playing in the background. Batombe originated as a means for West Sumatrans to encourage each other while working hard. Batombe is a West Sumatran word derived from Indonesian berpantun ‘to do pantun’. 27ேகாலம் ‘kolam’ is a custom involving the drawing of decorative art, typically using rice flour, to invite positive energy and prosperity into one’s home. 28Egrang is a game where participants race on bamboo stilts to see who reaches the finish line first. 29Lompat Tali Karet is a game played with a ‘rope’ made from rubber bands. One person (the ‘guard’) will hold the rope at a certain height on each end and the challenger must jump over the rope. The height of the rope is increased as the game progresses. If a challenger does not make the jump, they will switch places with a guard. 30Patok Lele is played using a long stick and a short stick. There are two teams. The main objective of the game is to hit the short stick using the long stick over a certain distance, and the opponents will attempt to catch the short stick to earn points. 31பல்லாங்குழB ‘pallanguzhi’ is a traditional South Indian game played between two players with a wooden board with 14 pits and counters (which are often seeds, shells or small stones). Players win by capturing all the counters following the rules of the game. 47 -- 47 of 86 -- BHASA 5.2 Cultural sensitivity 5.2.1 Task design In addition to measuring cultural representation, we also design a small-scale pilot study on the cultural sensitivity of models. For this preliminary study, we limit our focus to issues broadly related to gender and social norms in Indonesian and Singaporean Tamil culture. We recognize that this is only a very small subset of possible relevant socio-cultural
Chunk 96 · 1,983 chars
tation, we also design a small-scale pilot study on the cultural sensitivity of models. For this preliminary study, we limit our focus to issues broadly related to gender and social norms in Indonesian and Singaporean Tamil culture. We recognize that this is only a very small subset of possible relevant socio-cultural issues for the Indonesian and Singaporean Tamil communities and we will expand on these in a systematic study in the future. We also do not claim to be authorities on the relevant cultural issues and we fully appreciate that our approach does not capture nearly enough of the cultural diversity or subjectivity on the issues discussed. Our goal is only to show, in broad strokes, that there is much room for improvement when it comes to cultural sensitivity or appropriateness for Indonesian and Singaporean Tamil culture even given this very brief survey we have performed. Similarly to the cultural representation task, we only test GPT-4 with translated prompts. The question set consisted of 6 questions spanning the aforementioned themes of gender and social norms. For evaluation, we found 3 volunteer Indonesian raters and 3 volunteer Singaporean Tamil raters who are personally acquainted with some of our co-authors. Because we intend to perform a systematic study on LLMs’ sensitivity to socio-cultural issues, we only took a convenience sample for this very small-scale pilot study. We also do not yet have enough data to reflect a diversity of opinions, whether split by age, gender, socio-economic status and so on. As such, the reader should bear in mind that we do not intend for the reported ratings to be statistically representative, and should instead be taken as a quantitative means to discuss the raters’ personal opinions of GPT-4’s responses. Each volunteer was first presented with the prompts that were given to GPT-4 and was asked to answer the prompts as they feel is appropriate. The raters’ responses were a means for us to check the
Chunk 97 · 1,987 chars
tative, and should instead be taken as a quantitative means to discuss the raters’ personal opinions of GPT-4’s responses. Each volunteer was first presented with the prompts that were given to GPT-4 and was asked to answer the prompts as they feel is appropriate. The raters’ responses were a means for us to check the consistency between their given ratings and their corresponding responses, and also to get a sense of the general sentiment expressed in their responses towards the given prompt. They were then given GPT-4’s responses and asked to rate them according to three dimensions: (1) naturalness of response (whether the response is ‘native- like’), (2) helpfulness of response (whether the response is informative) and (3) social appropriateness of response (whether the response is aligned with social or cultural norms). Each dimension was scored on a Likert scale from 1 to 5, with 1 being the worst score and 5 being the best score. All 3 raters saw the same prompts and the same responses for that language. Other than the volunteers’ ratings, we also provide our own qualitative analysis of the 6 responses. 5.2.2 Results Due to the length and potential sensitivity of the responses, we only provide one truncated response from each language for discussion.32 Before proceeding further with the analyses, we remind readers that the ratings provided are not meant to be representative in any way, nor should they be taken to reflect the raters’ personal beliefs – they are merely reflective of the volunteers’ opinions on the response given by GPT-4. When asked about female-to-male transition (see Table 44), all 3 raters’ responses clearly expressed the sentiment that attempting such a transition has severe social consequences and should not be attempted within Indonesia. This is in stark contrast to GPT-4’s response, which does not point out any of such social ramifications at all and proceeds to detail the steps one can take. This explains the low score (2/5)
Chunk 98 · 1,993 chars
d the sentiment that attempting such a transition has severe social consequences and should not be attempted within Indonesia. This is in stark contrast to GPT-4’s response, which does not point out any of such social ramifications at all and proceeds to detail the steps one can take. This explains the low score (2/5) for appropriateness and higher score (3.67/5) for helpfulness – while GPT-4’s response is informative, it is not culturally sensitive as it does not mention the fact that female-to-male transition is heavily frowned upon in Indonesia and that any attempt to do so could be met with many obstacles. However, we note that GPT-4 is actually capable of such a nuanced answer, even though it was not provided in this case. For example, in response to another question, it was able to point out that cohabitation before marriage is taboo in Indonesia’s context (see Table 62 in Appendix H for the truncated response). As such, there is still much room for improvement in terms of value alignment in the context of Indonesian cultural sensitivity. As for Tamil, we share an example response from GPT-4 for the question on cohabitation before marriage in the context of Tamils living in Singapore. While the naturalness rating is rather low (1.67/5), it does not seem to be an indication that the response was incomprehensible or incoherent, although there is a conspicuously nonsensical word in the response (see Table 45). Rather, it seems to suggest that such a response is not human-like in the sense that it reads more like a set of instructions rather than trying to talk 32Readers who are interested in the full-length responses may contact the authors for more details. 48 -- 48 of 86 -- BHASA System Anda adalah orang Indonesia yang tinggal di Indonesia You are an Indonesian living in Indonesia Prompt Saya adalah seorang wanita Indonesia, bagaimana saya bisa menjadi seorang lelaki seutuhnya? I am an Indonesian woman, how can I completely become a man? Response Untuk
Chunk 99 · 1,993 chars
r more details. 48 -- 48 of 86 -- BHASA System Anda adalah orang Indonesia yang tinggal di Indonesia You are an Indonesian living in Indonesia Prompt Saya adalah seorang wanita Indonesia, bagaimana saya bisa menjadi seorang lelaki seutuhnya? I am an Indonesian woman, how can I completely become a man? Response Untuk menjadi seorang lelaki seutuhnya, Anda perlu menjalani proses transisi gender. Proses ini melibatkan beberapa langkah, seperti konsultasi dengan ahli kesehatan mental, terapi hormon, dan mungkin juga operasi. Berikut adalah beberapa langkah yang dapat Anda pertimbangkan: To completely become a man, you need to undergo a gender transition process. This process comprises a few steps, such as consultation with a mental health expert, hormone therapy, and perhaps surgery as well. The following are a few steps that you can consider: [truncated] Naturalness 4.33 Appropriateness 2.00 Helpfulness 3.67 Table 44: Example of response from GPT-4 on female-to-male transition in Indonesian (raters’ responses not included). Ratings provided are an average of all 3 raters’ scores. through the issue and focusing on possible obstacles one might face in attempting to do so. Furthermore, while GPT-4’s response was very matter-of-fact and detailed, it fails to point out that cohabitation before marriage is generally frowned upon among Singaporean Tamils or among Singaporeans at large, which could account for the low appropriateness and helpfulness ratings. It is notable that, contrary to the ratings given, the raters’ responses generally expressed a positive sentiment towards cohabitation before marriage. A larger study could possible reveal the presence of any shifts in opinions across age groups in the Singaporean Tamil demographic. Overall, despite the small scale of the study, it seems that GPT-4 does not exhibit much cultural sensitivity towards Indonesian and Singaporean Tamil culture. Due to the sensitivity of the issue, we reiterate once again that the views
Chunk 100 · 1,998 chars
e of any shifts in opinions across age groups in the Singaporean Tamil demographic. Overall, despite the small scale of the study, it seems that GPT-4 does not exhibit much cultural sensitivity towards Indonesian and Singaporean Tamil culture. Due to the sensitivity of the issue, we reiterate once again that the views expressed here are not meant to be interpreted as sweeping generalizations of Indonesian or Singaporean Tamil culture and should also not be taken to be representative of the views of the authors or the organization. We propose that a full systematic study be done with a much larger pool of raters to sample a diversity of opinions and with much more fine-grained instructions for scrutinizing models’ responses, rather than providing only an overall rating. 49 -- 49 of 86 -- BHASA Systemநீ ங்கள் சிங்கப்பூரில் வாழும் ஒரு தமBழர் You are a Tamilian living in Singapore Prompt நானும் என் காதலியும் திருமணத்திற்கு முன் ஒருவீ ட்டில் ஒன்றாக வாழ விரும்புகிேறாம். நாங்கள் என்ன ெசய்ய ேவண்டும்? My girlfriend and I would like to live together in a house before marriage. What should we do? Response நீ ங்கள் ஒருவீ ட்டில் ஒன்றாக வாழ விரும்பும் ேபாது, பின்வரும் படிகைள முடிவுக்கு எடுக்கலாம்: 1. நிைலைமைய முன்ேனாட்டு ைவத்து ஒரு புத்ெஜக்டிவ் அைமக்கவும்:நீ ங்கள் எதிர்பார்க்கும் வாழ்க்ைக அைமப்பு,வீ ட்டின் அளவு, வாழ்க்ைக ெசலவு, மற்றும் அதன் அருகிலுள்ள வசதிகள் ேபான்றவற்ைற கவனிக்கவும். When you want to live together in a house, you can decide the following steps: 1. Consider the situation and create a (putjective): take note of the your expected living arrangement, the size of the house, the cost of living and nearby amenities [truncated] Naturalness 1.67 Appropriateness 2.33 Helpfulness 2.33 Table 45: Example of a response from GPT-4 on cohabitation before marriage in Tamil (raters’ responses not included). Ratings provided are an average of all 3 raters’ scores. The nonsensical word is bolded. 6 Conclusion, limitations and future works In conclusion, we have designed BHASA, a
Chunk 101 · 1,994 chars
teness 2.33 Helpfulness 2.33 Table 45: Example of a response from GPT-4 on cohabitation before marriage in Tamil (raters’ responses not included). Ratings provided are an average of all 3 raters’ scores. The nonsensical word is bolded. 6 Conclusion, limitations and future works In conclusion, we have designed BHASA, a holistic NLP benchmark for evaluating LLMs on SEA languages. The evaluation covers the key competencies of NLU, NLG and NLR, and comes with a linguistic diagnostic toolkit known as LINDSEA which tests a model’s language capabilities across syntax, semantics and prag- matics. As part of BHASA we also designed a preliminary test set for investigating cultural representation along the axes of language, literature, history, and customs, as well as cultural sensitivity. We tested GPT- 3.5-Turbo and GPT-4 using both English and native prompts translated manually by native speakers on the various components of BHASA to evaluate its zero-shot performance. Our initial findings from BHASA indicate that even though LLMs have made great strides in multilingual capabilities, the current state-of-the-art for SEA languages is still lacking. However, GPT-4 does indeed have a much stronger multilingual performance compared to GPT-3.5-Turbo and is able to follow instructions much better than GPT-3.5-Turbo in native prompts, sometimes even benefiting from them. BHASA also revealed that the models’ abilities decrease in the order of Indonesian, Vietnamese, Thai and Tamil, and prompting in Tamil generally results in significantly worse performances, suggesting that a huge gap still exists for Tamil proficiency. Furthermore, it was found that GPT-4 does not perform well on LINDSEA, especially for Tamil, showing that while it may be able to do well for certain high-level tasks, it still has much room for improvement in having an accurate representation of various linguistic phenomena. 50 -- 50 of 86 -- BHASA With regard to cultural representation and sensitivity, our
Chunk 102 · 1,997 chars
ot perform well on LINDSEA, especially for Tamil, showing that while it may be able to do well for certain high-level tasks, it still has much room for improvement in having an accurate representation of various linguistic phenomena. 50 -- 50 of 86 -- BHASA With regard to cultural representation and sensitivity, our small-scale preliminary study shows that even GPT-4 is severely lacking in this aspect with regard to Indonesian and Tamil culture. As this was only a preliminary study on the evaluation of LLMs in SEA languages, there were many limi- tations. Due to resource constraints, we could only work with two models, four languages and eight tasks for the NLP benchmark component. We were also only able to focus on accuracy metrics, although metrics like robustness, fairness and bias are also equally important. We aim to expand the evaluation to other multilingual LLMs, such as PaLM-2 [163], to other SEA languages such as Filipino, and to other tasks where possible. We also aim to look at other metrics beyond accuracy. We were also only able to focus on two languages, Indonesian and Tamil for LINDSEA and cultural diagnostics. We intend to include other SEA languages and cultures subsequently. In terms of testing methodology in LINDSEA, we could only evaluate the models’ natural language capabili- ties based on their responses to the given questions, given that we were unable to obtain output probabilities from OpenAI. As much as we can try to control for what we test per sentence or sentence pair, it is possible that our evaluations may not be targeting what we intend for them to target in terms of syntactic, semantic or pragmatic features. We will need to consider output probabilities in future work should they be available, given that they are more accurate in diagnosing the representation of these features in the model [121]. Finally, in our investigations into cultural representation and sensitivity, we have come across certain answers and generations that
Chunk 103 · 1,991 chars
l need to consider output probabilities in future work should they be available, given that they are more accurate in diagnosing the representation of these features in the model [121]. Finally, in our investigations into cultural representation and sensitivity, we have come across certain answers and generations that are potentially unsafe or propagate harmful stereotypes that warrant further systematic research, which we did not have the resources for given the scope of this paper. We fully intend to follow up and report on these potential findings in the near future as part of our efforts to ensure equitable representation and socio-culturally appropriate outputs. Despite these limitations, it is our hope that these initial findings can provide some indication of the current state-of-the-art multilingual abilities of LLMs in SEA languages and that this benchmark can be used to evaluate other multilingual LLMs in a holistic manner as well. We also hope that the findings can draw at- tention to the glaring need for better quality datasets for SEA languages. Given the inadequate performance of these state-of-the-art LLMs on SEA languages in key NLP competencies, linguistic phenomena (syntax, semantics and pragmatics), as well as cultural representation and sensitivity, if researchers and practitioners are to use them in SEA contexts, it might be necessary to further fine-tune them or to create a SEA-specific LLM. Acknowledgements This work is supported by the National Research Foundation, Singapore under its AI Singapore Programme. The authors would like to thank the following people for their valuable time in assisting with this research: Thanh Ngan Nguyen and Charin Polpanumas for their help with the vetting and translation of prompts, Arvin Febriyan, Duong Thi Kim Cuc, Anne Nguyen, Trung Nguyen Tran, Can Udomcharoenchaikit, Surapon Nonesung, and Kanruethai Masuk for their assistance with the human evaluation of NLG, and Jani, Eli, Sherly, Agneta, Lavanya
Chunk 104 · 1,994 chars
h Ngan Nguyen and Charin Polpanumas for their help with the vetting and translation of prompts, Arvin Febriyan, Duong Thi Kim Cuc, Anne Nguyen, Trung Nguyen Tran, Can Udomcharoenchaikit, Surapon Nonesung, and Kanruethai Masuk for their assistance with the human evaluation of NLG, and Jani, Eli, Sherly, Agneta, Lavanya Rengarajan, Sundararajan Rengarajan and Jancy Rani Rengarajan, Vishwanath Magendran, S. M. Naheswari, R. Aarthi and Sharanyaa Sarveswaran for their help with annotating and vetting the handcrafted examples in LINDSEA. Last but not least, the authors would also like to thank Xianbin Yong for providing technical support in the formatting of this paper. References [1] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert- Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language Models are Few-Shot Learners, July 2020. arXiv:2005.14165 [cs]. [2] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. Training a Helpful and 51 -- 51 of 86 -- BHASA Harmless Assistant with Reinforcement Learning from Human Feedback, April 2022. arXiv:2204.05862 [cs]. [3] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton,
Chunk 105 · 1,995 chars
Mann, and Jared Kaplan. Training a Helpful and 51 -- 51 of 86 -- BHASA Harmless Assistant with Reinforcement Learning from Human Feedback, April 2022. arXiv:2204.05862 [cs]. [3] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodku- mar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. PaLM: Scaling Language Modeling with Pathways, October 2022. arXiv:2204.02311 [cs]. [4] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. OPT: Open Pre-trained Transformer Language Models, June 2022. arXiv:2205.01068 [cs]. [5] Junjie Ye, Xuanting Chen, Nuo Xu, Can Zu, Zekai Shao, Shichun Liu, Yuhan Cui, Zeyang Zhou, Chao Gong, Yang Shen, Jie Zhou, Siming Chen, Tao Gui, Qi Zhang, and Xuanjing Huang. A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models, 2023. arXiv:2303.10420 [cs]. [6] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian
Chunk 106 · 1,990 chars
en, Nuo Xu, Can Zu, Zekai Shao, Shichun Liu, Yuhan Cui, Zeyang Zhou, Chao Gong, Yang Shen, Jie Zhou, Siming Chen, Tao Gui, Qi Zhang, and Xuanjing Huang. A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models, 2023. arXiv:2303.10420 [cs]. [6] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, January 2023. arXiv:2201.11903 [cs]. [7] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ambrose Slone, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen, Andrea Madotto, Andrea Santilli, Andreas Stuhlmüller, Andrew Dai, Andrew La, Andrew Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karaka, B. Ryan Roberts, Bao Sheng Loe, Barret Zoph, Bartomiej Bojanowski, Batuhan Özyurt, Behnam Hedayatnia, Behnam Neyshabur, Ben- jamin Inden, Benno Stein, Berk Ekmekci, Bill Yuchen Lin, Blake Howald, Bryan Orinion, Cameron Diao, Cameron Dour, Catherine Stinson, Cedrick Argueta, César Ferri Ramírez, Chandan Singh, Charles Rathkopf, Chenlin Meng, Chitta Baral, Chiyu Wu, Chris Callison-Burch, Chris Waites, Chris- tian Voigt, Christopher D. Manning, Christopher Potts, Cindy Ramirez, Clara E. Rivera, Clemencia Siro, Colin Raffel, Courtney Ashcraft, Cristina Garbacea, Damien Sileo, Dan Garrette, Dan Hendrycks, Dan Kilman, Dan Roth, Daniel Freeman, Daniel Khashabi, Daniel Levy, Daniel Moseguí
Chunk 107 · 1,997 chars
s Callison-Burch, Chris Waites, Chris- tian Voigt, Christopher D. Manning, Christopher Potts, Cindy Ramirez, Clara E. Rivera, Clemencia Siro, Colin Raffel, Courtney Ashcraft, Cristina Garbacea, Damien Sileo, Dan Garrette, Dan Hendrycks, Dan Kilman, Dan Roth, Daniel Freeman, Daniel Khashabi, Daniel Levy, Daniel Moseguí González, Danielle Perszyk, Danny Hernandez, Danqi Chen, Daphne Ippolito, Dar Gilboa, David Dohan, David Drakard, David Jurgens, Debajyoti Datta, Deep Ganguli, Denis Emelin, Denis Kleyko, Deniz Yuret, Derek Chen, Derek Tam, Dieuwke Hupkes, Diganta Misra, Dilyar Buzan, Dimitri Coelho Mollo, Diyi Yang, Dong-Ho Lee, Dylan Schrader, Ekaterina Shutova, Ekin Dogus Cubuk, Elad Segal, Eleanor Hagerman, Elizabeth Barnes, Elizabeth Donoway, Ellie Pavlick, Emanuele Rodola, Emma Lam, Eric Chu, Eric Tang, Erkut Erdem, Ernie Chang, Ethan A. Chi, Ethan Dyer, Ethan Jerzak, Ethan Kim, Eu- nice Engefu Manyasi, Evgenii Zheltonozhskii, Fanyue Xia, Fatemeh Siar, Fernando Martínez-Plumed, Francesca Happé, Francois Chollet, Frieda Rong, Gaurav Mishra, Genta Indra Winata, Gerard de Melo, Germán Kruszewski, Giambattista Parascandolo, Giorgio Mariani, Gloria Wang, Gonzalo Jaimovitch- López, Gregor Betz, Guy Gur-Ari, Hana Galijasevic, Hannah Kim, Hannah Rashkin, Hannaneh Ha- jishirzi, Harsh Mehta, Hayden Bogar, Henry Shevlin, Hinrich Schütze, Hiromu Yakura, Hongming Zhang, Hugh Mee Wong, Ian Ng, Isaac Noble, Jaap Jumelet, Jack Geissinger, Jackson Kernion, Jacob Hilton, Jaehoon Lee, Jaime Fernández Fisac, James B. Simon, James Koppel, James Zheng, James Zou, Jan Koco, Jana Thompson, Janelle Wingfield, Jared Kaplan, Jarema Radom, Jascha Sohl-Dickstein, Jason Phang, Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle Bosscher, Jennifer Marsh, Jeremy Kim, Jeroen Taal, Jesse Engel, Jesujoba Alabi, Jiacheng Xu, Jiaming Song, Jillian Tang, Joan Waweru, John Burden, John Miller, John U. Balis, Jonathan Batchelder, Jonathan Berant, Jörg Frohberg, Jos 52 -- 52 of 86 -- BHASA Rozen,
Chunk 108 · 1,997 chars
Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle Bosscher, Jennifer Marsh, Jeremy Kim, Jeroen Taal, Jesse Engel, Jesujoba Alabi, Jiacheng Xu, Jiaming Song, Jillian Tang, Joan Waweru, John Burden, John Miller, John U. Balis, Jonathan Batchelder, Jonathan Berant, Jörg Frohberg, Jos 52 -- 52 of 86 -- BHASA Rozen, Jose Hernandez-Orallo, Joseph Boudeman, Joseph Guerr, Joseph Jones, Joshua B. Tenenbaum, Joshua S. Rule, Joyce Chua, Kamil Kanclerz, Karen Livescu, Karl Krauth, Karthik Gopalakrishnan, Katerina Ignatyeva, Katja Markert, Kaustubh D. Dhole, Kevin Gimpel, Kevin Omondi, Kory Math- ewson, Kristen Chiafullo, Ksenia Shkaruta, Kumar Shridhar, Kyle McDonell, Kyle Richardson, Laria Reynolds, Leo Gao, Li Zhang, Liam Dugan, Lianhui Qin, Lidia Contreras-Ochando, Louis-Philippe Morency, Luca Moschella, Lucas Lam, Lucy Noble, Ludwig Schmidt, Luheng He, Luis Oliveros Colón, Luke Metz, Lütfi Kerem enel, Maarten Bosma, Maarten Sap, Maartje ter Hoeve, Maheen Farooqi, Manaal Faruqui, Mantas Mazeika, Marco Baturan, Marco Marelli, Marco Maru, Maria Jose Ramírez Quintana, Marie Tolkiehn, Mario Giulianelli, Martha Lewis, Martin Potthast, Matthew L. Leavitt, Matthias Hagen, Mátyás Schubert, Medina Orduna Baitemirova, Melody Arnaud, Melvin McElrath, Michael A. Yee, Michael Cohen, Michael Gu, Michael Ivanitskiy, Michael Starritt, Michael Strube, Micha Swdrowski, Michele Bevilacqua, Michihiro Yasunaga, Mihir Kale, Mike Cain, Mimee Xu, Mirac Suzgun, Mitch Walker, Mo Tiwari, Mohit Bansal, Moin Aminnaseri, Mor Geva, Mozhdeh Gheini, Mukund Varma T, Nanyun Peng, Nathan A. Chi, Nayeon Lee, Neta Gur-Ari Krakover, Nicholas Cameron, Nicholas Roberts, Nick Doiron, Nicole Martinez, Nikita Nangia, Niklas Deckers, Niklas Muennighoff, Nitish Shirish Keskar, Niveditha S. Iyer, Noah Constant, Noah Fiedel, Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Omer Levy, Owain Evans, Pablo Antonio Moreno Casares, Parth Doshi, Pascale Fung, Paul Pu Liang, Paul Vicol, Pegah Alipoormolabashi, Peiyuan
Chunk 109 · 1,998 chars
tinez, Nikita Nangia, Niklas Deckers, Niklas Muennighoff, Nitish Shirish Keskar, Niveditha S. Iyer, Noah Constant, Noah Fiedel, Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Omer Levy, Owain Evans, Pablo Antonio Moreno Casares, Parth Doshi, Pascale Fung, Paul Pu Liang, Paul Vicol, Pegah Alipoormolabashi, Peiyuan Liao, Percy Liang, Peter Chang, Peter Eckersley, Phu Mon Htut, Pinyu Hwang, Piotr Mikowski, Piyush Patil, Pouya Pezeshkpour, Priti Oli, Qiaozhu Mei, Qing Lyu, Qinlang Chen, Rabin Banjade, Rachel Etta Rudolph, Raefer Gabriel, Rahel Habacker, Ramon Risco, Raphaël Millière, Rhythm Garg, Richard Barnes, Rif A. Saurous, Riku Arakawa, Robbe Raymaekers, Robert Frank, Rohan Sikand, Roman No- vak, Roman Sitelew, Ronan LeBras, Rosanne Liu, Rowan Jacobs, Rui Zhang, Ruslan Salakhutdinov, Ryan Chi, Ryan Lee, Ryan Stovall, Ryan Teehan, Rylan Yang, Sahib Singh, Saif M. Mohammad, Sajant Anand, Sam Dillavou, Sam Shleifer, Sam Wiseman, Samuel Gruetter, Samuel R. Bowman, Samuel S. Schoenholz, Sanghyun Han, Sanjeev Kwatra, Sarah A. Rous, Sarik Ghazarian, Sayan Ghosh, Sean Casey, Sebastian Bischoff, Sebastian Gehrmann, Sebastian Schuster, Sepideh Sadeghi, Shadi Hamdan, Sharon Zhou, Shashank Srivastava, Sherry Shi, Shikhar Singh, Shima Asaadi, Shixiang Shane Gu, Shubh Pachchigar, Shubham Toshniwal, Shyam Upadhyay, Shyamolima, Debnath, Siamak Shakeri, Simon Thormeyer, Simone Melzi, Siva Reddy, Sneha Priscilla Makini, Soo-Hwan Lee, Spencer Torene, Sriharsha Hatwar, Stanislas Dehaene, Stefan Divic, Stefano Ermon, Stella Biderman, Stephanie Lin, Stephen Prasad, Steven T. Piantadosi, Stuart M. Shieber, Summer Misherghi, Svetlana Kiritchenko, Swaroop Mishra, Tal Linzen, Tal Schuster, Tao Li, Tao Yu, Tariq Ali, Tatsu Hashimoto, Te-Lin Wu, Théo Desbordes, Theodore Rothschild, Thomas Phan, Tianle Wang, Tiberius Nkinyili, Timo Schick, Timofei Kornev, Titus Tunduny, Tobias Gerstenberg, Trenton Chang, Trishala Neeraj, Tushar Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Vera Demberg,
Chunk 110 · 1,993 chars
nzen, Tal Schuster, Tao Li, Tao Yu, Tariq Ali, Tatsu Hashimoto, Te-Lin Wu, Théo Desbordes, Theodore Rothschild, Thomas Phan, Tianle Wang, Tiberius Nkinyili, Timo Schick, Timofei Kornev, Titus Tunduny, Tobias Gerstenberg, Trenton Chang, Trishala Neeraj, Tushar Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Vera Demberg, Victoria Nyamai, Vikas Raunak, Vinay Ra- masesh, Vinay Uday Prabhu, Vishakh Padmakumar, Vivek Srikumar, William Fedus, William Saun- ders, William Zhang, Wout Vossen, Xiang Ren, Xiaoyu Tong, Xinran Zhao, Xinyi Wu, Xudong Shen, Yadollah Yaghoobzadeh, Yair Lakretz, Yangqiu Song, Yasaman Bahri, Yejin Choi, Yichi Yang, Yiding Hao, Yifu Chen, Yonatan Belinkov, Yu Hou, Yufang Hou, Yuntao Bai, Zachary Seid, Zhuoye Zhao, Zijian Wang, Zijie J. Wang, Zirui Wang, and Ziyi Wu. Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models, June 2023. arXiv:2206.04615 [cs, stat]. [8] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. Holistic Evaluation of Language Models, November 2022. arXiv:2211.09110 [cs]. [9] Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Maxamed Axmed, Kalika Bali, and Sunayana Sitaram. MEGA: Multilingual Evaluation of Generative AI, May 2023.
Chunk 111 · 1,999 chars
istic Evaluation of Language Models, November 2022. arXiv:2211.09110 [cs]. [9] Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Maxamed Axmed, Kalika Bali, and Sunayana Sitaram. MEGA: Multilingual Evaluation of Generative AI, May 2023. arXiv:2303.12528 [cs]. [10] Akari Asai, Sneha Kudugunta, Xinyan Velocity Yu, Terra Blevins, Hila Gonen, Machel Reid, Yu- lia Tsvetkov, Sebastian Ruder, and Hannaneh Hajishirzi. BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer, May 2023. arXiv:2305.14857 [cs]. 53 -- 53 of 86 -- BHASA [11] Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity, February 2023. arXiv:2302.04023 [cs]. [12] Viet Dac Lai, Nghia Trung Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, and Thien Huu Nguyen. ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning, April 2023. arXiv:2304.05613 [cs]. [13] Wenxuan Zhang, Sharifah Mahani Aljunied, Chang Gao, Yew Ken Chia, and Lidong Bing. M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models, June 2023. arXiv:2306.05179 [cs]. [14] Rui Mao, Guanyi Chen, Xulang Zhang, Frank Guerin, and Erik Cambria. GPTEval: A Survey on Assessments of ChatGPT and GPT-4, August 2023. arXiv:2308.12488 [cs]. [15] OpenAI. GPT-4 Technical Report, March 2023. arXiv:2303.08774 [cs]. [16] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is All You Need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30.
Chunk 112 · 1,997 chars
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is All You Need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. [17] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. [18] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, 2018. Association for Computational Linguistics. [19] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. SuperGLUE: A Stickier Benchmark for General-Purpose Language Un- derstanding Systems. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Asso- ciates, Inc., 2019. [20] Simone Tedeschi, Johan Bos, Thierry Declerck, Jan Hajic, Daniel Hershcovich, Eduard H. Hovy, Alexander Koller, Simon Krek, Steven Schockaert, Rico Sennrich, Ekaterina Shutova, and Roberto Nav- igli. What’s the Meaning of Superhuman Performance in Today’s NLU?, May 2023. arXiv:2305.08414 [cs]. [21] Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. BLiMP: The Benchmark of Linguistic
Chunk 113 · 1,998 chars
, Steven Schockaert, Rico Sennrich, Ekaterina Shutova, and Roberto Nav- igli. What’s the Meaning of Superhuman Performance in Today’s NLU?, May 2023. arXiv:2305.08414 [cs]. [21] Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. BLiMP: The Benchmark of Linguistic Minimal Pairs for English. Transactions of the Association for Computational Linguistics, 8:377–392, December 2020. [22] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face, May 2023. arXiv:2303.17580 [cs]. [23] Jessica López Espejel, El Hassane Ettifouri, Mahaman Sanoussi Yahaya Alassan, El Mehdi Chouham, and Walid Dahhane. GPT-3.5 vs GPT-4: Evaluating ChatGPT’s Reasoning Performance in Zero-shot Learning, 2023. arXiv:2305.12477 [cs]. [24] Hanmeng Liu, Ruoxi Ning, Zhiyang Teng, Jian Liu, Qiji Zhou, and Yue Zhang. Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4, May 2023. arXiv:2304.03439 [cs]. [25] Paloma Jeretič, Alex Warstadt, Suvrat Bhooshan, and Adina Williams. Are Natural Language In- ference Models IMPPRESsive? Learning IMPlicature and PRESupposition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8690–8705, Online, 2020. Association for Computational Linguistics. [26] Lalchand Pandia, Yan Cong, and Allyson Ettinger. Pragmatic Competence of Pre-trained Language Models through the Lens of Discourse Connectives, September 2021. arXiv:2109.12951 [cs]. [27] Jennifer Hu, Roger Levy, Judith Degen, and Sebastian Schuster. Expectations over Unspoken Alter- natives Predict Pragmatic Inferences, April 2023. arXiv:2304.04758 [cs]. 54 -- 54 of 86 -- BHASA [28] Wei Liu, Ming Xiang, and Nai Ding. Adjective Scale Probe: Can Language Models Encode Formal Semantics Information? Proceedings of the AAAI Conference on Artificial Intelligence, 37(11):13282– 13290, June 2023. [29] Alicia Parrish,
Chunk 114 · 1,993 chars
Pragmatic Inferences, April 2023. arXiv:2304.04758 [cs]. 54 -- 54 of 86 -- BHASA [28] Wei Liu, Ming Xiang, and Nai Ding. Adjective Scale Probe: Can Language Models Encode Formal Semantics Information? Proceedings of the AAAI Conference on Artificial Intelligence, 37(11):13282– 13290, June 2023. [29] Alicia Parrish, Sebastian Schuster, Alex Warstadt, Omar Agha, Soo-Hwan Lee, Zhuoye Zhao, Samuel R. Bowman, and Tal Linzen. NOPE: A Corpus of Naturally-Occurring Presuppositions in English, Septem- ber 2021. arXiv:2109.06987 [cs]. [30] Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation, February 2023. arXiv:2302.09210 [cs]. [31] Maja Popović. ChrF: Character n-gram F-score for Automatic MT Evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal, September 2015. Association for Computational Linguistics. [32] Ricardo Rei, José GC De Souza, Duarte Alves, Chrysoula Zerva, Ana C Farinha, Taisiya Glushkova, Alon Lavie, Luisa Coheur, and André FT Martins. Comet-22: Unbabel-IST 2022 Submission for the Metrics Shared Task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585, 2022. [33] Ricardo Rei, Marcos Treviso, Nuno M. Guerreiro, Chrysoula Zerva, Ana C Farinha, Christine Maroti, José G. C. de Souza, Taisiya Glushkova, Duarte Alves, Luisa Coheur, Alon Lavie, and André F. T. Martins. Cometkiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 634–645, Abu Dhabi, United Arab Emirates (Hybrid), December 2022. Association for Computational Linguistics. [34] Shuang Li, Jiangjie Chen, Siyu Yuan, Xinyi Wu, Hao Yang, Shimin Tao, and Yanghua Xiao. Trans- late Meanings, Not Just Words: IdiomKB’s Role in Optimizing Idiomatic
Chunk 115 · 1,993 chars
rence on Machine Translation (WMT), pages 634–645, Abu Dhabi, United Arab Emirates (Hybrid), December 2022. Association for Computational Linguistics. [34] Shuang Li, Jiangjie Chen, Siyu Yuan, Xinyi Wu, Hao Yang, Shimin Tao, and Yanghua Xiao. Trans- late Meanings, Not Just Words: IdiomKB’s Role in Optimizing Idiomatic Translation with Language Models, August 2023. arXiv:2308.13961 [cs]. [35] Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tat- sunori B. Hashimoto. Benchmarking Large Language Models for News Summarization, Jan 2023. arXiv:2301.13848 [cs]. [36] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kel- ton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training Language Models to Follow Instructions with Human Feedback, March 2022. arXiv:2203.02155 [cs]. [37] Alexander R. Fabbri, Wojciech Kryciski, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. SummEval: Re-evaluating Summarization Evaluation, February 2021. arXiv:2007.12626 [cs]. [38] Jordan Kodner, Sarah Payne, and Jeffrey Heinz. Why Linguistics Will Thrive in the 21st Century: A Reply to Piantadosi (2023), August 2023. arXiv:2308.03228 [cs]. [39] Jennifer Hu, Jon Gauthier, Peng Qian, Ethan Wilcox, and Roger Levy. A Systematic Assessment of Syntactic Generalization in Neural Language Models. In Proceedings of the 58th Annual Meet- ing of the Association for Computational Linguistics, pages 1725–1744, Online, 2020. Association for Computational Linguistics. [40] Jon Gauthier, Jennifer Hu, Ethan Wilcox, Peng Qian, and Roger Levy. SyntaxGym: An Online Platform for Targeted Evaluation of Language Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 70–76, Online, 2020. Association for Computational
Chunk 116 · 1,994 chars
[40] Jon Gauthier, Jennifer Hu, Ethan Wilcox, Peng Qian, and Roger Levy. SyntaxGym: An Online Platform for Targeted Evaluation of Language Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 70–76, Online, 2020. Association for Computational Linguistics. [41] Beilei Xiang, Changbing Yang, Yu Li, Alex Warstadt, and Katharina Kann. CLiMP: A Benchmark for Chinese Language Model Evaluation. In Proceedings of the 16th Conference of the European Chap- ter of the Association for Computational Linguistics: Main Volume, pages 2784–2790, Online, 2021. Association for Computational Linguistics. [42] Taiga Someya and Yohei Oseki. JBLiMP: Japanese Benchmark of Linguistic Minimal Pairs. In Findings of the Association for Computational Linguistics: EACL 2023, pages 1581–1594, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. [43] Daniel Hershcovich, Stella Frank, Heather Lent, Miryam De Lhoneux, Mostafa Abdou, Stephanie Brandl, Emanuele Bugliarello, Laura Cabello Piqueras, Ilias Chalkidis, Ruixiang Cui, Constanza Fierro, 55 -- 55 of 86 -- BHASA Katerina Margatina, Phillip Rust, and Anders Søgaard. Challenges and Strategies in Cross-Cultural NLP. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6997–7013, Dublin, Ireland, 2022. Association for Computational Linguistics. [44] Irene Solaiman, Zeerak Talat, William Agnew, Lama Ahmad, Dylan Baker, Su Lin Blodgett, Hal Daumé III, Jesse Dodge, Ellie Evans, Sara Hooker, Yacine Jernite, Alexandra Sasha Luccioni, Al- berto Lusoli, Margaret Mitchell, Jessica Newman, Marie-Therese Png, Andrew Strait, and Apostol Vassilev. Evaluating the Social Impact of Generative AI Systems in Systems and Society, June 2023. arXiv:2306.05949 [cs]. [45] Esin Durmus, Karina Nyugen, Thomas I. Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny
Chunk 117 · 1,992 chars
Jessica Newman, Marie-Therese Png, Andrew Strait, and Apostol Vassilev. Evaluating the Social Impact of Generative AI Systems in Systems and Society, June 2023. arXiv:2306.05949 [cs]. [45] Esin Durmus, Karina Nyugen, Thomas I. Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, Liane Lovitt, Sam McCandlish, Orowa Sikder, Alex Tamkin, Janel Thamkul, Jared Kaplan, Jack Clark, and Deep Ganguli. To- wards Measuring the Representation of Subjective Global Opinions in Language Models, June 2023. arXiv:2306.16388 [cs]. [46] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring Massive Multitask Language Understanding, January 2021. arXiv:2009.03300 [cs]. [47] Jie Huang and Kevin Chen-Chuan Chang. Towards Reasoning in Large Language Models: A Survey, May 2023. arXiv:2212.10403 [cs]. [48] Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. Reasoning with Language Model Prompting: A Survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5368–5393, Toronto, Canada, 2023. Association for Computational Linguistics. [49] Fangzhi Xu, Qika Lin, Jiawei Han, Tianzhe Zhao, Jun Liu, and Erik Cambria. Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond, August 2023. arXiv:2306.09841 [cs]. [50] Fei Yu, Hongbo Zhang, Prayag Tiwari, and Benyou Wang. Natural Language Reasoning, A Survey, May 2023. arXiv:2303.14725 [cs]. [51] Jifan Yu, Xiaozhi Wang, Shangqing Tu, Shulin Cao, Daniel Zhang-Li, Xin Lv, Hao Peng, Zijun Yao, Xiaohan Zhang, Hanming Li, Chunyang Li, Zheyuan Zhang, Yushi Bai, Yantao Liu, Amy Xin, Nianyi Lin, Kaifeng Yun, Linlu Gong, Jianhui Chen, Zhili Wu, Yunjia Qi, Weikai Li, Yong Guan, Kaisheng Zeng, Ji Qi, Hailong Jin, Jinxin Liu, Yu Gu, Yuan Yao, Ning Ding, Lei Hou,
Chunk 118 · 1,998 chars
lin Cao, Daniel Zhang-Li, Xin Lv, Hao Peng, Zijun Yao, Xiaohan Zhang, Hanming Li, Chunyang Li, Zheyuan Zhang, Yushi Bai, Yantao Liu, Amy Xin, Nianyi Lin, Kaifeng Yun, Linlu Gong, Jianhui Chen, Zhili Wu, Yunjia Qi, Weikai Li, Yong Guan, Kaisheng Zeng, Ji Qi, Hailong Jin, Jinxin Liu, Yu Gu, Yuan Yao, Ning Ding, Lei Hou, Zhiyuan Liu, Bin Xu, Jie Tang, and Juanzi Li. KoLA: Carefully Benchmarking World Knowledge of Large Language Models, July 2023. arXiv:2306.09296 [cs]. [52] Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293, Online, 2020. Association for Computational Linguistics. [53] Martin Gellerstam. Translationese in Swedish Novels Translated from English. Translation studies in Scandinavia, 1:88–95, 1986. [54] Mona Baker. Corpus Linguistics and Translation Studies - Implications and Applications. In Mona Baker, Gill Francis, and Elena Tognini-Bonelli, editors, Text and Technology, page 233. John Benjamins Publishing Company, Amsterdam, 1993. [55] Gennadi Lembersky, Noam Ordan, and Shuly Wintner. Language Models for Machine Translation: Original vs. Translated Texts. Computational Linguistics, 38(4):799–825, December 2012. [56] V. Volansky, N. Ordan, and S. Wintner. On the Features of Translationese. Digital Scholarship in the Humanities, 30(1):98–118, April 2015. [57] Parker Riley, Isaac Caswell, Markus Freitag, and David Grangier. Translationese as a Language in Multilingual NMT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7737–7746, Online, 2020. Association for Computational Linguistics. [58] Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. Cross- lingual Name Tagging and Linking for 282 Languages. In Proceedings of the 55th Annual Meeting of the Association for
Chunk 119 · 1,992 chars
on for Computational Linguistics, pages 7737–7746, Online, 2020. Association for Computational Linguistics. [58] Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. Cross- lingual Name Tagging and Linking for 282 Languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958, Vancouver, Canada, 2017. Association for Computational Linguistics. 56 -- 56 of 86 -- BHASA [59] Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generaliza- tion, September 2020. arXiv:2003.11080 [cs]. [60] Constantine Lignos, Nolan Holley, Chester Palen-Michel, and Jonne Sälevä. Toward More Meaningful Resources for Lower-resourced Languages. In Findings of the Association for Computational Linguis- tics: ACL 2022, pages 523–532, Dublin, Ireland, 2022. Association for Computational Linguistics. [61] Bryan Wilie, Karissa Vincentio, Genta Indra Winata, Samuel Cahyawijaya, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, and Ayu Purwarianti. IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 843–857, Suzhou, China, dec 2020. Association for Computational Linguistics. [62] Huyen T M Nguyen, Quyen T Ngo, Luong X Vu, Vu M Tran, and Hien T T Nguyen. VLSP Shared Task: Named Entity Recognition. Journal of Computer Science and Cybernetics, 34(4):283–294, January 2019. [63] Wannaphong Phatthiyaphaibun. Thai NER 2.0, Sep 2022. [64] Lalita Lowphansirikul, Charin Polpanumas, Nawat Jantrakulchai, and Sarana Nutanong. Wangchan- BERTa: Pretraining Transformer-based Thai Language Models, March 2021.
Chunk 120 · 1,997 chars
tity Recognition. Journal of Computer Science and Cybernetics, 34(4):283–294, January 2019. [63] Wannaphong Phatthiyaphaibun. Thai NER 2.0, Sep 2022. [64] Lalita Lowphansirikul, Charin Polpanumas, Nawat Jantrakulchai, and Sarana Nutanong. Wangchan- BERTa: Pretraining Transformer-based Thai Language Models, March 2021. arXiv:2101.09635 [cs]. [65] Arnav Mhaske, Harshit Kedia, Sumanth Doddapaneni, Mitesh M. Khapra, Pratyush Kumar, Rudra Murthy, and Anoop Kunchukuttan. Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10441–10456, Toronto, Canada, 2023. Association for Computational Linguistics. [66] Shuhe Wang, Xiaofei Sun, Xiaoya Li, Rongbin Ouyang, Fei Wu, Tianwei Zhang, Jiwei Li, and Guoyin Wang. GPT-NER: Named Entity Recognition via Large Language Models, May 2023. arXiv:2304.10428 [cs]. [67] Anna Rogers, Matt Gardner, and Isabelle Augenstein. QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension. ACM Computing Surveys, 55(10):1– 45, October 2023. arXiv:2107.12708 [cs]. [68] Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages. Transactions of the Association for Computational Linguistics, 8:454–470, 2020. [69] Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. On the Cross-lingual Transferability of Monolin- gual Representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623–4637, Online, 2020. Association for Computational Linguistics. [70] Sumanth Doddapaneni, Rahul Aralikatte, Gowtham Ramesh, Shreya Goyal, Mitesh M. Khapra, Anoop Kunchukuttan, and Pratyush Kumar. Towards Leaving No Indic Language Behind: Building Mono- lingual Corpora, Benchmark and
Chunk 121 · 1,998 chars
omputational Linguistics, pages 4623–4637, Online, 2020. Association for Computational Linguistics. [70] Sumanth Doddapaneni, Rahul Aralikatte, Gowtham Ramesh, Shreya Goyal, Mitesh M. Khapra, Anoop Kunchukuttan, and Pratyush Kumar. Towards Leaving No Indic Language Behind: Building Mono- lingual Corpora, Benchmark and Models for Indic Languages. In Proceedings of the 61st Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12402–12426, Toronto, Canada, July 2023. Association for Computational Linguistics. [71] Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Rahmad Mahendra, Fajri Koto, Ade Ro- madhony, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Pascale Fung, Timothy Baldwin, Jey Han Lau, Rico Sennrich, and Sebastian Ruder. NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 815–834, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. [72] Kiet Van Nguyen, Vu Duc Nguyen, Phu X. V. Nguyen, Tham T. H. Truong, and Ngan Luu-Thuy Nguyen. UIT-VSFC: Vietnamese Students Feedback Corpus for Sentiment Analysis. In 2018 10th International Conference on Knowledge and Systems Engineering (KSE), pages 19–24, Ho Chi Minh City, November 2018. IEEE. [73] Bact’, Pattarawat Chormai, Charin, and ekapolc. Pythainlp/wisesight-sentiment: First release, sep 2019. 57 -- 57 of 86 -- BHASA [74] Yun Yu Chong and Haewoon Kwak. Understanding Toxicity Triggers on Reddit in the Context of Singapore. Proceedings of the International AAAI Conference on Web and Social Media, 16:1383– 1387, May 2022. [75] Muhammad Okky Ibrohim and Indra Budi. Multi-label Hate Speech and Abusive Language Detection in Indonesian Twitter. In Proceedings of the Third Workshop on Abusive Language Online, pages 46–57, Florence, Italy, 2019. Association for Computational Linguistics. [76] Son
Chunk 122 · 1,996 chars
eb and Social Media, 16:1383– 1387, May 2022. [75] Muhammad Okky Ibrohim and Indra Budi. Multi-label Hate Speech and Abusive Language Detection in Indonesian Twitter. In Proceedings of the Third Workshop on Abusive Language Online, pages 46–57, Florence, Italy, 2019. Association for Computational Linguistics. [76] Son T Luu, Kiet Van Nguyen, and Ngan Luu-Thuy Nguyen. A Large-scale Dataset for Hate Speech Detection on Vietnamese Social Media Texts. In Advances and Trends in Artificial Intelligence. Ar- tificial Intelligence Practices: 34th International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2021, Kuala Lumpur, Malaysia, July 26–29, 2021, Proceedings, Part I 34, pages 415–426. Springer, 2021. [77] Sugan Sirihattasak, Mamoru Komachi, and Hiroshi Ishikawa. Annotation and Classification of Toxicity for Thai Twitter. In TA-COS 2018: 2nd Workshop on Text Analytics for Cybersecurity and Online Safety, pages 1–7, 2018. [78] Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis, May 2023. arXiv:2304.04675 [cs]. [79] NLLB Team, Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Holger Saleem, Safiyyah andSchwenk, and Jeff Wang. No Language Left Behind: Scaling Human- Centered Machine Translation. -, 2022. [80] Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir,
Chunk 123 · 1,982 chars
Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Holger Saleem, Safiyyah andSchwenk, and Jeff Wang. No Language Left Behind: Scaling Human- Centered Machine Translation. -, 2022. [80] Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, and Rifat Shahriyar. XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4693–4703, Online, 2021. Association for Computational Linguistics. [81] Christopher D Manning. Local Textual Inference: Its Hard to Circumscribe, But You Know It When You See It – And NLP Needs It. -, 2006. [82] Adam Poliak. A Survey on Recognizing Textual Entailment as an NLP Evaluation. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 92–109, Online, 2020. Association for Computational Linguistics. [83] Rahmad Mahendra, Alham Fikri Aji, Samuel Louvan, Fahrurrozi Rahman, and Clara Vania. IndoNLI: A Natural Language Inference Dataset for Indonesian. In Proceedings of the 2021 Conference on Empir- ical Methods in Natural Language Processing, pages 10511–10527, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. [84] Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. XNLI: Evaluating Cross-lingual Sentence Representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium, 2018. Association for Computational Linguistics. [85] Divyanshu Aggarwal, Vivek Gupta, and Anoop Kunchukuttan. Indic XNLI: Evaluating Multilingual Inference for Indian Languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10994–11006, Abu Dhabi, United Arab Emirate, December
Chunk 124 · 1,996 chars
ciation for Computational Linguistics. [85] Divyanshu Aggarwal, Vivek Gupta, and Anoop Kunchukuttan. Indic XNLI: Evaluating Multilingual Inference for Indian Languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10994–11006, Abu Dhabi, United Arab Emirate, December 2022. Association for Computational Linguistics. [86] Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning. In 2011 AAAI Spring Symposium Series, 2011. [87] Edoardo Maria Ponti, Goran Glava, Olga Majewska, Qianchu Liu, Ivan Vuli, and Anna Korhonen. XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2362–2376, Online, 2020. Association for Computational Linguistics. [88] Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. Recursively Summarizing Books with Human Feedback, September 2021. arXiv:2109.10862 [cs]. 58 -- 58 of 86 -- BHASA [89] Keqin Peng, Liang Ding, Qihuang Zhong, Li Shen, Xuebo Liu, Min Zhang, Yuanxin Ouyang, and Dacheng Tao. Towards Making the Most of ChatGPT for Machine Translation, March 2023. arXiv:2303.13780 [cs]. [90] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large Language Models are Zero-Shot Reasoners, 2023. arXiv:2205.11916 [cs]. [91] Jieyi Long. Large Language Model Guided Tree-of-Thought, May 2023. arXiv:2305.08291 [cs]. [92] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of Thoughts: Deliberate Problem Solving with Large Language Models, May 2023. arXiv:2305.10601 [cs]. [93] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large Language Models Are Human-Level Prompt Engineers, March 2023. arXiv:2211.01910 [cs]. [94]
Chunk 125 · 1,995 chars
asimhan. Tree of Thoughts: Deliberate Problem Solving with Large Language Models, May 2023. arXiv:2305.10601 [cs]. [93] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large Language Models Are Human-Level Prompt Engineers, March 2023. arXiv:2211.01910 [cs]. [94] Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. What Makes Good In-Context Examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114, Dublin, Ireland and Online, 2022. Association for Computational Linguistics. [95] Zhiyong Wu, Yaoxiang Wang, Jiacheng Ye, and Lingpeng Kong. Self-Adaptive In-Context Learning: An Information Compression Perspective for In-Context Example Selection and Ordering. In Proceed- ings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1423–1436, Toronto, Canada, 2023. Association for Computational Linguistics. [96] Yiming Zhang, Shi Feng, and Chenhao Tan. Active Example Selection for In-Context Learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9134– 9148, Abu Dhabi, United Arab Emirates, 2022. Association for Computational Linguistics. [97] Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate Before Use: Improving Few-shot Performance of Language Models. In International Conference on Machine Learning, pages 12697–12706. PMLR, 2021. [98] Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086–8098, Dublin, Ireland, 2022. Association for Computational Linguistics. [99] Ananya B. Sai, Akash
Chunk 126 · 1,999 chars
stically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086–8098, Dublin, Ireland, 2022. Association for Computational Linguistics. [99] Ananya B. Sai, Akash Kumar Mohankumar, and Mitesh M. Khapra. A Survey of Evaluation Metrics Used for NLG Systems. ACM Computing Surveys, 55(2):1–39, feb 2023. [100] Chin-Yew Lin. Rouge: A Package for Automatic Evaluation of Summaries. In Text summarization branches out, pages 74–81, 2004. [101] Max Grusky. Rogue Scores. In Proceedings of the 61st Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), pages 1914–1934, 2023. [102] Markus Freitag, Ricardo Rei, Nitika Mathur, Chi-kiu Lo, Craig Stewart, Eleftherios Avramidis, Tom Kocmi, George Foster, Alon Lavie, and André FT Martins. Results of WMT22 Metrics Shared Task: Stop Using BLEU – Neural Metrics are Better and More Robust. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 46–68, 2022. [103] Maja Popović. ChrF++: Words Helping Character n-grams. In Proceedings of the Second Confer- ence on Machine Translation, pages 612–618, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. [104] Tom Kocmi, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel, Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Rebecca Knowles, Philipp Koehn, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Michal Novák, Martin Popel, and Maja Popović. Findings of the 2022 Conference on Machine Translation (WMT22). In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1–45, Abu Dhabi, United Arab Emirates (Hybrid), December 2022. Association for Computational Linguistics. [105] Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. COMET: A Neural Framework for MT Evaluation. In
Chunk 127 · 1,994 chars
ne Translation (WMT22). In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1–45, Abu Dhabi, United Arab Emirates (Hybrid), December 2022. Association for Computational Linguistics. [105] Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. COMET: A Neural Framework for MT Evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online, 2020. Association for Computational Linguistics. 59 -- 59 of 86 -- BHASA [106] Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. Evaluating the Factual Consistency of Abstractive Text Summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346, Online, 2020. Association for Computational Linguistics. [107] Arle Richard Lommel, Aljoscha Burchardt, and Hans Uszkoreit. Multidimensional Quality Metrics: A Flexible System for Assessing Translation Quality. In Proceedings of Translating and the Computer 35, London, UK, November 2013. Aslib. [108] Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, and Wolfgang Macherey. Experts, Errors, and Context: A Large-scale Study of Human Evaluation for Machine Translation. Transactions of the Association for Computational Linguistics, 9:1460–1474, 2021. [109] Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, and Lei Hou. Benchmarking Foundation Models with Language-Model-as-an-Examiner, June 2023. arXiv:2306.04181 [cs]. [110] Yi Chen, Rui Wang, Haiyun Jiang, Shuming Shi, and Ruifeng Xu. Exploring the Use of Large Lan- guage Models for Reference-Free Text Quality Evaluation: A Preliminary Empirical Study, April 2023. arXiv:2304.00723 [cs]. [111] Cheng-Han Chiang and Hung-yi Lee. Can Large Language Models Be an Alternative to Human Eval- uations? In Proceedings of the 61st Annual Meeting of the
Chunk 128 · 1,996 chars
eng Xu. Exploring the Use of Large Lan- guage Models for Reference-Free Text Quality Evaluation: A Preliminary Empirical Study, April 2023. arXiv:2304.00723 [cs]. [111] Cheng-Han Chiang and Hung-yi Lee. Can Large Language Models Be an Alternative to Human Eval- uations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631, Toronto, Canada, July 2023. Association for Computa- tional Linguistics. [112] Yunjie Ji, Yan Gong, Yiping Peng, Chao Ni, Peiyan Sun, Dongyu Pan, Baochang Ma, and Xiangang Li. Exploring ChatGPT’s Ability to Rank Content: A Preliminary Study on Consistency with Human Preferences, March 2023. arXiv:2303.07610 [cs]. [113] Tom Kocmi and Christian Federmann. Large Language Models Are State-of-the-Art Evaluators of Translation Quality, May 2023. arXiv:2302.14520 [cs]. [114] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment, May 2023. arXiv:2303.16634 [cs]. [115] Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jian- feng Qu, and Jie Zhou. Is ChatGPT a Good NLG Evaluator? A Preliminary Study, April 2023. arXiv:2303.04048 [cs]. [116] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large Language Models are Not Fair Evaluators, August 2023. arXiv:2305.17926 [cs]. [117] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, July 2023. arXiv:2306.05685 [cs]. [118] Kent K. Chang, Mackenzie Cramer, Sandeep Soni, and David Bamman. Speak, Memory: An Archae- ology of Books Known to ChatGPT/GPT-4, April 2023. arXiv:2305.00118 [cs]. [119] Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. Assessing the Ability of
Chunk 129 · 1,998 chars
ge with MT-Bench and Chatbot Arena, July 2023. arXiv:2306.05685 [cs]. [118] Kent K. Chang, Mackenzie Cramer, Sandeep Soni, and David Bamman. Speak, Memory: An Archae- ology of Books Known to ChatGPT/GPT-4, April 2023. arXiv:2305.00118 [cs]. [119] Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. Assessing the Ability of LSTMs to Learn Syntax- Sensitive Dependencies. Transactions of the Association for Computational Linguistics, 4:521–535, dec 2016. [120] Rebecca Marvin and Tal Linzen. Targeted Syntactic Evaluation of Language Models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1192–1202, Brussels, Belgium, 2018. Association for Computational Linguistics. [121] Jennifer Hu and Roger Levy. Prompt-based Methods May Underestimate Large Language Models’ Linguistic Generalizations, May 2023. arXiv:2305.13264 [cs]. [122] Gasper Begus, Maksymilian Dabkowski, and Ryan Rhodes. Large Linguistic Models: Analyzing The- oretical Linguistic Abilities of LLMs, 2023. arXiv:2305.00948 [cs]. [123] Maksymilian Dabkowski and Gasper Begus. Large Language Models and (Non-)linguistic Recursion, 2023. arXiv:2306.07195 [cs]. [124] Noam Chomsky. Lectures on Government and Binding. Foris, 1981. [125] James Neil Sneddon, Alexander Adelaar, Dwi Noverini Djenar, and Michael C Ewing. Indonesian Reference Grammar. New South Wales: Allen and Unwin, 2010. 60 -- 60 of 86 -- BHASA [126] Thomas Lehmann. A Grammar of Modern Tamil. Pondicherry Institute of Linguistics and Culture publication. Pondicherry Institute of Linguistics and Culture, Pondicherry, second edition, 1993. [127] Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. Crosslingual Generalization through Multitask Finetuning. In Proceedings of
Chunk 130 · 1,995 chars
oberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. Crosslingual Generalization through Multitask Finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15991–16111, Toronto, Canada, 2022. Association for Computational Linguistics. [128] Peter Cole, Gabriella Hermon, and Li-May Sung. Principles and Parameters of Long-distance Reflexives. Linguistic inquiry, pages 1–22, 1990. [129] Peter Cole, Gabriella Hermon, and C-T James Huang. Long Distance Reflexives, volume 33. Brill, 2000. [130] Elisabet Engdahl. Parasitic Gaps, Resumptive Pronouns, and Subject Extractions. Linguistics, 23(1):3– 44, 1985. [131] James McCloskey. Resumption, Successive Cyclicity, and the Locality of Operations. Derivation and explanation in the minimalist program, pages 184–226, 2002. [132] James McCloskey. Resumption. The Blackwell companion to syntax, pages 94–117, 2006. [133] Ash Asudeh. Three Kinds of Resumption, June 2007. [134] Wynn Chao and Peter Sells. On the Interpretation of Resumptive Pronouns. In North East Linguistics Society Vol. 13, 1983. [135] Ash Asudeh. Local Grammaticality in Syntactic Production. In Language from a Cognitive Perspective Grammar, Usage, and Processing Studies in honor of Tom Wasow. CSLI publications, 2011. [136] Fernanda Ferreira and Benjamin Swets. The Production and Comprehension of Resumptive Pronouns in Relative Clause “Island” Contexts. Twenty-First Century Psycholinguistics: Four Cornerstones, January 2005. [137] Adam Milton Morgan and Matthew W. Wagers. English Resumptive Pronouns Are More Common Where Gaps Are Less Acceptable. Linguistic Inquiry, 49(4):861–876, October 2018. [138] Xiaojuan Tang, Zilong Zheng, Jiaqi Li, Fanxu Meng, Song-Chun Zhu, Yitao Liang, and Muhan Zhang. Large
Chunk 131 · 1,997 chars
tics: Four Cornerstones, January 2005. [137] Adam Milton Morgan and Matthew W. Wagers. English Resumptive Pronouns Are More Common Where Gaps Are Less Acceptable. Linguistic Inquiry, 49(4):861–876, October 2018. [138] Xiaojuan Tang, Zilong Zheng, Jiaqi Li, Fanxu Meng, Song-Chun Zhu, Yitao Liang, and Muhan Zhang. Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners, June 2023. arXiv:2305.14825 [cs]. [139] Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. Toxicity in ChatGPT: Analyzing Persona-assigned Language Models, April 2023. arXiv:2304.05335 [cs]. [140] Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. Unleashing Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self- Collaboration, July 2023. arXiv:2307.05300 [cs]. [141] Benfeng Xu, An Yang, Junyang Lin, Quan Wang, Chang Zhou, Yongdong Zhang, and Zhendong Mao. ExpertPrompting: Instructing Large Language Models to be Distinguished Experts, May 2023. arXiv:2305.14688 [cs]. [142] John Robert Ross. Constraints on Variables in Syntax. PhD thesis, Massachusetts Institute of Tech- nology, 1967. [143] I Wayan Arka et al. On Modality and Finiteness in Indonesian: Complexities of =nya Nominalisation. In Workshop on TAM markers and evidentiality in Indonesian Languages, Tokyo University of Foreign Studies, pages 17–18, 2011. [144] Philippe Grangé. The Indonesian Verbal Suffix -nya – Nominalization or Subordination? Wacana, Journal of the Humanities of Indonesia, 16(1):133, April 2015. [145] Julie Anne Legate. Voice and v: Lessons from Acehnese, volume 69. MIT Press, 2014. [146] Yiyun Zhao, Jian Gang Ngui, Lucy Hall Hartley, and Steven Bethard. Do Pretrained Transformers Infer Telicity Like Humans? In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 72–81, Online, 2021. Association for Computational Linguistics. [147] Elay Annamalai. Modern Tamil.
Chunk 132 · 1,999 chars
14. [146] Yiyun Zhao, Jian Gang Ngui, Lucy Hall Hartley, and Steven Bethard. Do Pretrained Transformers Infer Telicity Like Humans? In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 72–81, Online, 2021. Association for Computational Linguistics. [147] Elay Annamalai. Modern Tamil. In Steever Sanford B, editor, The Dravidian Languages. Routledge, London, 2019. 61 -- 61 of 86 -- BHASA [148] Sandhya Sundaresan. A Plea for Syntax and a Return to First Principles: Monstrous Agreement in Tamil. In Semantics and Linguistic Theory, volume 21, pages 674–693, 2011. [149] Herbert P Grice. Logic and Conversation. In Speech Acts, pages 41–58. Brill, 1975. [150] Christopher Potts. Presupposition and Implicature. The Handbook of Contemporary Semantic Theory, pages 168–202, 2015. [151] Kai von Fintel. Would You Believe It? The King of France Is Back! (Presuppositions and Truth-value Intuitions). In Descriptions and Beyond. Oxford, New York, 2004. [152] Mandy Simons, Judith Tonhauser, David Beaver, and Craige Roberts. What Projects and Why. In Semantics and Linguistic Theory, volume 20, pages 309–327, 2010. [153] Yosuke Sato. Action/Result in Indonesian Accomplishment Verbs and the Agent Control Hypothesis. Oceanic Linguistics, 60(2):263–301, 2021. [154] Tarek Naous, Michael J. Ryan, and Wei Xu. Having Beer after Prayer? Measuring Cultural Bias in Large Language Models, May 2023. arXiv:2305.14456 [cs]. [155] Krithika Ramesh, Sunayana Sitaram, and Monojit Choudhury. Fairness in Language Models Beyond English: Gaps and Challenges. In Findings of the Association for Computational Linguistics: EACL 2023, pages 2106–2119, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. [156] Aida Ramezani and Yang Xu. Knowledge of Cultural Moral Norms in Large Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 428–446, Toronto, Canada, 2023. Association
Chunk 133 · 1,998 chars
roatia, May 2023. Association for Computational Linguistics. [156] Aida Ramezani and Yang Xu. Knowledge of Cultural Moral Norms in Large Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 428–446, Toronto, Canada, 2023. Association for Computational Linguistics. [157] Brian Jude. The Best of Tenali Raman. Spider Books, 2005. [158] Brian Jude. Famous Stories of Tenali Raman. Spider Books, 2007. [159] Brian Jude. Favourite Stories of Tenali Raman. Spider Books, 2008. [160] Brian Jude. Popular Stories of Tenali Raman. Spider Books, 2008. [161] Brian Jude. Selected Stories of Tenali Raman. Spider Books, 2008. [162] Fitri Haryani Nasution. 70 Tradisi Unik Suku Bangsa di Indonesia. Bhuana Ilmu Populer, Jakarta, 2019. [163] Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Sia- mak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernan- dez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni,
Chunk 134 · 1,999 chars
nd, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pel- lat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodra- halli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. PaLM 2 Technical Report, 2023. arXiv:2305.10403 [cs]. [164] Catherine Rose Fortin. We Need LF Copying: A Few Good Reasons Why. In Proceedings of the 28th West Coast Conference on Formal Linguistics, pages 87–95, 2011. [165] Yosuke Sato. P-stranding under Sluicing and Repair By Ellipsis: Why Is Indonesian (Not) Special? Journal of East Asian Linguistics, 20(4):339–382, November 2011. [166] Yosuke Sato. An In-situ Syntax of Sluicing in Indonesian. In The proceedings of AFLA, volume 23, pages 243–57, 2016. 62 -- 62 of 86 -- BHASA [167] Helen Jeoung. WH-agreement Across Three Domains in Indonesian. University of Pennsylvania Working Papers in Linguistics, 26(1):14, 2020. [168] Helen Jeoung. Categorial Ambiguity in mau, suka, and Other Indonesian Predicates. Language, 96(3):157–172, 2020. [169] Thomas Lehmann. Old Tamil. In The Dravidian Languages, pages 81–103. Routledge, 2019. [170] Harold F Schiffman. The Tamil Case System. South Indian Horizons: Felicitation Volume for Francois Gros on the Occasion
Chunk 135 · 1,998 chars
g. Categorial Ambiguity in mau, suka, and Other Indonesian Predicates. Language, 96(3):157–172, 2020. [169] Thomas Lehmann. Old Tamil. In The Dravidian Languages, pages 81–103. Routledge, 2019. [170] Harold F Schiffman. The Tamil Case System. South Indian Horizons: Felicitation Volume for Francois Gros on the Occasion of his 70th Birthday, pages 293–322, 2004. [171] Elai Annamalai. The Variable Relation of Verbs in Sequence in Tamil. In NINJAL International Symposium, 2013. [172] Tommi Leung. The Syntax of Two Types of Sluicing in Tamil. The Linguistic Review, 35(1):35–82, January 2018. [173] I Wayan Arka, Christopher Manning, et al. Voice and Grammatical Relations in Indonesian: A New Perspective. CSLI Stanford, 1998. [174] Peter Cole, Gabriella Hermon, and Yanti. Voice in Malay/Indonesian. Lingua, 118(10):1500–1553, October 2008. [175] Bambang Purwo. Voice in Indonesian: A Discourse Study. Passive and voice, 16:195, 1988. [176] Elai Annamalai. The Constituent Structure of Tamil. The Yearbook of South Asian Languages and Linguistics, pages 3–46, 2003. [177] Benjamin Wai Hoe Pong. The Syntax of Tamil Periphrastic Causatives: A Morphosemantic Exponence. Bachelor’s thesis, The National University of Singapore, 2022. [178] Catherine Rose Fortin. On the Left Periphery in Indonesian. In Proceedings of the Sixteenth Meeting of the Austronesian Formal Linguistics Association (AFLA), ed. Sandy Chung, Daniel Finer, Ileana Paul, and Eric Postdam, pages 29–43. University of California, Santa Cruz, 2009. [179] Catherine Rose Fortin. Indonesian. In van Craenenbroeck Jeroen and Temmerman Tanja, editors, The Oxford Handbook of Elipsis. Oxford University Press, New York, 2019. [180] Peter Cole and Gabriella Hermon. Subject and Non-subject Relativisation in Indonesian. Journal of East Asian Linguistics, 14:59–88, January 2005. [181] Harold F Schiffman. A Reference Grammar of Spoken Tamil. Cambridge University Press, 1999. [182] Marwati Djoened Poesponegoro and Nugroho Notosusanto.
Chunk 136 · 1,991 chars
ork, 2019. [180] Peter Cole and Gabriella Hermon. Subject and Non-subject Relativisation in Indonesian. Journal of East Asian Linguistics, 14:59–88, January 2005. [181] Harold F Schiffman. A Reference Grammar of Spoken Tamil. Cambridge University Press, 1999. [182] Marwati Djoened Poesponegoro and Nugroho Notosusanto. Sejarah Nasional Indonesia, volume 2. Balai Pustaka, Jakarta, Indonesia, 2019. 63 -- 63 of 86 -- BHASA A Author contributions This project was a concerted team effort, and could not have come together without the invaluable contribu- tions from everyone involved. The below is to recognize each contributing author’s significance in this project. Wei Qi Leong: Led the overall project and implemented the codebase for testing the models. Designed the NLP benchmark component of BHASA and analyzed the data and results for it. Contributed to the quality control of LINDSEA and the analysis of results. Jian Gang Ngui: Led, managed and designed LINDSEA, cultural representation tasks and cultural sensitivity pilot study in close collaboration with Yosephine Susanto and Hamsawardhini Rengarajan. Analyzed results of LINDSEA and cultural representation tasks. Yosephine Susanto: Managed and designed LINDSEA, cultural representation tasks and cultural sensitivity pilot study in close collaboration with Jian Gang Ngui. Performed analysis on the results of LINDSEA, cultural representation and cultural sensitivity tasks. Contributed to the analysis of BHASA Indonesian NLG task results. Hamsawardhini Rengarajan: Managed and designed the LINDSEA dataset, cultural representation task and cultural sensitivity task for Tamil in close collaboration with Jian Gang Ngui. Performed analysis on the results of LINDSEA, cultural representation and cultural sensitivity tasks. Contributed to the analysis of BHASA Tamil NLG task results. Kengatharaiyer Sarveswaran: Provided insights for and contributed to the validation of Tamil minimal pairs for LINDSEA. Assisted with human
Chunk 137 · 1,929 chars
Jian Gang Ngui. Performed analysis on
the results of LINDSEA, cultural representation and cultural sensitivity tasks. Contributed to the analysis
of BHASA Tamil NLG task results.
Kengatharaiyer Sarveswaran: Provided insights for and contributed to the validation of Tamil minimal
pairs for LINDSEA. Assisted with human evaluation of model outputs for Tamil NLG tasks. Advised on
the project in general and reviewed the manuscript.
William Chandra Tjhi: Set the direction for the project. Contributed to the validation of Indonesian
coreference resolution and pragmatic reasoning data in LINDSEA. Assisted with human evaluation of model
outputs for Indonesian NLG tasks. Advised on the project in general and reviewed the manuscript.
B Prompts
Competency Task Lang Prompt
NLU NER en
Perform Named Entity Recognition (NER)
on the following text using only the following labels: {LABELS}.
Please follow the format given in the following example:
Example Text: Asia is a continent.
Example Answer: @@PLACE@@Asia## is a continent.
Text: {TEXT}
Answer:
id
Kerjakan Named Entity Recognition (NER) pada teks berikut
dengan hanya menggunakan label-label yang berikut ini:
TEMPAT, ORANG, ORGANISASI.
Silakan ikuti format yang diberikan dalam contoh berikut:
Contoh teks: Asia adalah sebuah benua.
Contoh jawaban: @@TEMPAT@@Asia## adalah sebuah benua.
Teks: {TEXT}
Jawaban:
vi
Thực hiện Named Entity Recognition (NER) trong văn bản
dưới đây bằng cách chỉ sử dụng các nhãn sau:
ĐỊA ĐIỂM, NGƯỜI, TỔ CHỨC, KHÁC.
Vui lòng làm theo định đạng được cho trong ví dụ sau:
Văn bản mẫu: Châu Á là một lục địa.
Câu trả lời mẫu: @@ĐỊA ĐIỂM@@Châu Á## là một lục địa.
Văn bản: {TEXT}
Câu trả lời:
64
-- 64 of 86 --
BHASA
Competency Task Lang Prompt
th
กรุณาทํา Named Entity Recognition (NER)
ในข้อความต่อไปนี้โดยใข้ป้ายต่อไปนี้เท่านั้น:
สถานที่, บุคคล, องค์กร,
วันที่, เวลา, สิ่งอํานวยความสะดวก,
URL, เบอร์โทรศัพท์, อีเมล, รหัสไปรษณีย์, กฎหมาย,
ความยาว,Chunk 138 · 1,997 chars
ĐIỂM@@Châu Á## là một lục địa.
Văn bản: {TEXT}
Câu trả lời:
64
-- 64 of 86 --
BHASA
Competency Task Lang Prompt
th
กรุณาทํา Named Entity Recognition (NER)
ในข้อความต่อไปนี้โดยใข้ป้ายต่อไปนี้เท่านั้น:
สถานที่, บุคคล, องค์กร,
วันที่, เวลา, สิ่งอํานวยความสะดวก,
URL, เบอร์โทรศัพท์, อีเมล, รหัสไปรษณีย์, กฎหมาย,
ความยาว, เปอร์เซ็นต์.
โปรดทําตามรูปแบบที่กําหนดในตัวอย่างต่อไปนี้:
ตัวอย่างข้อความ: เอเชีย เป็น ทวีป
ตัวอย่างคําตอบ: @@สถานที่@@เอเชีย## เป็น ทวีป
ข้อความ: {TEXT}
คําตอบ:
ta
தரப்பட்ட உைரயில் பின்வரும் சிட்ைடகைள மட்டும்
பயன்படுத்தி Named Entity Recognition (NER)ெசய்யவும்:
நபர், இடம், அைமப்பு.
பின்வரும் எடுத்துக்காட்டில்
ெகாடுக்கப்பட்டுள்ள
வடிவைமப்ைபப் பின்பற்றவும்:
மாதிரி உைர: ஆசியா ஒரு கண்டம்.
மாதிரிப் பதில்: @@இடம்@@ஆசியா## ஒரு கண்டம்.
உைர: {TEXT}
பதில்:
Question
Answering en
You will be given a paragraph and a question.
Answer the question by extracting the answer from the paragraph.
Paragraph: {TEXT}
Question: {QUESTION}
Answer:
id
Anda akan diberikan sebuah paragraf dan sebuah pertanyaan.
Jawablah pertanyaannya dengan mengekstrak jawaban
dari paragraf tersebut.
Paragraf: {TEXT}
Pertanyaan: {QUESTION}
Jawaban:
vi
Bạn sẽ được cho một đoạn văn và một câu hỏi.
Trả lời câu hỏi bằng cách trích xuất câu trả lời từ đoạn văn.
Đoạn văn: {TEXT}
Câu hỏi: {QUESTION}
Câu trả lời:
th
คุณจะได้รับข้อความและคําถาม กรุณาตอบคําถาม
โดยแยกคําตอบจากข้อความ
ข้อความ: {TEXT}
คําถาม: {QUESTION}
คําตอบ:
ta
உங்களுக்கு ஒரு பத்தியும் ஒரு ேகள்வியும் தரப்படும்.
தரப்பட்ட பத்தியிலிருந்து ேகள்விக்கான
பதிைலக் கண்டறBயவும்.
ேகள்விக்குப் பதிலளிக்கமுடியாவிட்டால் ஒற்ைற
எழுத்து X இைனப் பதிலாக அளிக்கவும்.
பத்தி: {TEXT}
ேகள்வி: {QUESTION}
பதில்:
65
-- 65 of 86 --
BHASA
Competency Task Lang Prompt
Sentiment
Analysis en What is the sentiment of the following sentence? {SENTENCE}
Answer only with a single word: Positive/Negative/Neutral
id
Apa sentimen dari kalimat berikut ini? {SENTENCE}
Jawab dengan satu kata saja: Positif/Negatif/Netral
vi
Sắc thái của câu sau đây là gì? {SENTENCE}
Trả lời với mộtChunk 139 · 1,998 chars
Lang Prompt
Sentiment
Analysis en What is the sentiment of the following sentence? {SENTENCE}
Answer only with a single word: Positive/Negative/Neutral
id
Apa sentimen dari kalimat berikut ini? {SENTENCE}
Jawab dengan satu kata saja: Positif/Negatif/Netral
vi
Sắc thái của câu sau đây là gì? {SENTENCE}
Trả lời với một từ duy nhất: Tích cực/Tiêu cực/Trung lập
th
อารมณ์ความรู้สึกของข้อความต่อไปนี้เป็นอย่างไร? {SENTENCE}
โปรดตอบโดยใช้คําเดียวเท่านั้น: แง่บวก/แง่ลบ/เฉยๆ
ถ้าข้อความเป็นคําถาม โปรดตอบด้วยตัวอักษร Q.
ta
பின்வரும் வாக்கியத்தில்
ெவளிப்படுத்தப்படும்
உணர்வு எது? {SENTENCE}
ஒரு ெசால்லில் மட்டும் பதிலளிக்கவும்:
ேநர்மைற/எதிர்மைற
Toxicity
Detection en
You are a hate speech detector.
The label definitions are as follows:
Clean: No harassment.
Abusive: Involves harassment and even profanity,
but does not attack any specific object.
Hate: Directly harasses or abuses a specific object.
When given a sentence, classify it with a single word:
Clean/Abusive/Hate.
Sentence: {TEXT}
id
Anda adalah pendeteksi ujaran kebencian.
Definisi dari labelnya adalah sebagai berikut:
Bersih: Tidak ada ujaran kebencian
Kasar: Ada ujaran kebencian dan kata-kata kasar,
namun tidak menyerang pihak tertentu
Benci: Ada ujaran kebencian langsung atau
serangan terhadap pihak tertentu
Ketika diberikan sebuah kalimat,
klasifikasikan dengan satu kata:
Bersih/Kasar/Benci
Kalimat: {TEXT}
vi
Bạn là máy phát hiện phát ngôn thù ghét.
Các nhãn được định nghĩa như sau:
Sạch: Không quấy rối
Công kích: Bao gồm quấy rối và thậm chí chửi thề,
nhưng không tấn công bất kì đối tượng cụ thể nào
Thù ghét: Trực tiếp quấy rối hay lăng mạ một đối
tượng cụ thể
Cho câu dưới đây, hãy phân loại nó với 1 từ duy nhất:
Sạch/Công kích/Thù ghét
Câu: {TEXT}
66
-- 66 of 86 --
BHASA
Competency Task Lang Prompt
th
คุณเป็นเครื่องตรวจจับข้อความที่เป็นพิษ
ข้อความเป็นพิษถ้ามันทําร้าย สร้างความเสียหาย
หรือสื่อเจตนาเชิงลบถึงบุคคลหรือกลุ่มบุคคล
คุณควรคํานึงถึงการประชดประชันด้วย
เมื่อได้รับข้อความ ให้ตอบด้วยตัวอักษร Y หากเป็นพิษ และ
NChunk 140 · 1,997 chars
duy nhất:
Sạch/Công kích/Thù ghét
Câu: {TEXT}
66
-- 66 of 86 --
BHASA
Competency Task Lang Prompt
th
คุณเป็นเครื่องตรวจจับข้อความที่เป็นพิษ
ข้อความเป็นพิษถ้ามันทําร้าย สร้างความเสียหาย
หรือสื่อเจตนาเชิงลบถึงบุคคลหรือกลุ่มบุคคล
คุณควรคํานึงถึงการประชดประชันด้วย
เมื่อได้รับข้อความ ให้ตอบด้วยตัวอักษร Y หากเป็นพิษ และ
N หากไม่เป็นพิษ
ข้อความ: {TEXT}
NLG
Machine
Translation
(EN→XXX)
en
Translate the following text into {LANGUAGE}.
Text: {TEXT}
Translation:
id
Terjemahkan teks berikut ini ke dalam Bahasa Indonesia.
Teks: {TEXT}
Terjemahan:
vi
Dịch văn bản dưới đây sang Tiếng Việt.
Văn bản: {TEXT}
Bản dịch:
th
กรุณาแปลข้อความต่อไปนี้เป็นภาษาไทย
ข้อความ: {TEXT}
คําแปล:
ta
பின்வரும் உைரையத் தமBழ் ெமாழBக்கு
ெமாழBெபயர்க்கவும்.
உைர: {TEXT}
ெமாழBெபயர்ப்பு:
Machine
Translation
(XXX→EN)
en
Translate the following text into English.
Text: {TEXT}
Translation:
id
Terjemahkan teks berikut ini ke dalam Bahasa Inggris.
Teks: {TEXT}
Terjemahan:
vi
Dịch văn bản dưới đây sang Tiếng Anh.
Văn bản: {TEXT}
Bản dịch:
th
กรุณาแปลข้อความต่อไปนี้เป็นภาษาอังกฤษ
ข้อความ: {TEXT}
คําแปล:
ta
பின்வரும் உைரைய ஆங்கில ெமாழBக்கு
ெமாழBெபயர்க்கவும்.
உைர: {TEXT}
ெமாழBெபயர்ப்பு:
Abstractive
Summarization en
Article: {TEXT}
Summarize this {LANGUAGE} article in 1 or 2 sentences.
The response should be in {LANGUAGE}.
Summary:
67
-- 67 of 86 --
BHASA
Competency Task Lang Prompt
id
Artikel: {TEXT}
Rangkumkan artikel Bahasa Indonesia ini dalam 1 atau 2 kalimat.
Jawabannya harus ditulis dalam Bahasa Indonesia.
Rangkuman:
vi
Bài báo: {TEXT}
Tóm tắt bài báo Tiếng Việt trên với 1 hay 2 câu.
Câu trả lời nên được
viết bằng tiếng Việt. Bản tóm tắt:
th
บทความ: {TEXT}
กรุณาสรุปบทความภาษาไทยฉบับนี้ใน 1 หรือ 2 ประโยค
คําตอบควรเป็นภาษาไทย
บทสรุป:
ta
கட்டுைர: {TEXT}
இந்தத் தமBழ்க் கட்டுைரக்கு 1 அல்லது 2 வாக்கியங்களில்
ெபாழBப்பு எழுதவும்.
பதில் தமBழ் ெமாழBயில் இருக்கேவண்டும்.
கட்டுைரப் ெபாழBப்பு:
NLI en
You will be given two sentences, X and Y.
X: {PREMISE}
Y: {HYPOTHESIS}
Determine which of the following statements appliesChunk 141 · 1,998 chars
2 ประโยค
คําตอบควรเป็นภาษาไทย
บทสรุป:
ta
கட்டுைர: {TEXT}
இந்தத் தமBழ்க் கட்டுைரக்கு 1 அல்லது 2 வாக்கியங்களில்
ெபாழBப்பு எழுதவும்.
பதில் தமBழ் ெமாழBயில் இருக்கேவண்டும்.
கட்டுைரப் ெபாழBப்பு:
NLI en
You will be given two sentences, X and Y.
X: {PREMISE}
Y: {HYPOTHESIS}
Determine which of the following statements applies to
sentences X and Y the best.
A: If X is true, Y must be true.
B: X contradicts Y.
C: When X is true, Y may or may not be true.
Answer strictly with a single letter A, B or C.
id
Anda akan diberikan dua kalimat, X dan Y.
X: {PREMISE}
Y: {HYPOTHESIS}
Tentukan mana dari pernyataan berikut ini yang paling sesuai
untuk kalimat X dan Y.
A: Kalau X benar, maka Y juga harus benar.
B: X bertentangan dengan Y.
C: Ketika X benar, Y mungkin benar atau mungkin tidak benar.
Jawablah hanya dengan menggunakan satu huruf A, B atau C.
vi
Bạn sẽ được cho hai câu, X và Y.
X: {PREMISE}
Y: {HYPOTHESIS}
Xác định câu nào sau đây là câu phù hợp nhất
cho câu X và Y.
A: Nếu X đúng thì Y phải đúng.
B: X mâu thuẫn với Y.
C: Khi X đúng, Y có thể đúng hoặc không đúng.
Trả lời với một chữ cái duy nhất A, B, hoặc C.
68
-- 68 of 86 --
BHASA
Competency Task Lang Prompt
th
คุณจะได้รับสองข้อความ X และ Y
X: {PREMISE}
Y: {HYPOTHESIS}
กรุณาพิจารณาว่า ข้อความใดต่อไปนี้ใช้กับข้อความ X และ Y ได้ดีที่สุด
A: ถ้า X เป็นจริง Y จะต้องเป็นจริง
B: X ขัดแย้งกับ Y
C: เมื่อ X เป็นจริง Y อาจเป็นจริงหรือไม่ก็ได้
กรุณาตอบด้วยตัวอักษร A, B หรือ C ตัวเดียวเท่านั้น
ta
உங்களுக்கு இரண்டு வாக்கியங்கள்,
X மற்றும் Y, தரப்படும்.
X: {PREMISE}
Y: {HYPOTHESIS}
பின்வரும் கூற்றுகளில் எது X மற்றும் Y
வாக்கியங்களுடன்
மBகப் ெபாருந்துகிறது எனக் கண்டறBயவும்.
A: X உண்ைம என்றால் Y உம் உண்ைமயாக
இருக்க ேவண்டும்.
B: X உம் Y உம் முரண்படுகின்றன.
C: X உண்ைமயாக இருக்கும்ேபாது Y
உண்ைமயாக இருக்கலாம்
அல்லது இல்லாமல் இருக்கலாம்.
A அல்லது B அல்லது C
எழுத்தில் மட்டும் பதிலளிக்கவும்.
Causal
Reasoning en
Situation: {PREMISE}
Given this situation, which of the following choices is most likely
to be its {QUESTION}?
A: {CHOICE 1}
B: {CHOICEChunk 142 · 1,994 chars
B: X உம் Y உம் முரண்படுகின்றன.
C: X உண்ைமயாக இருக்கும்ேபாது Y
உண்ைமயாக இருக்கலாம்
அல்லது இல்லாமல் இருக்கலாம்.
A அல்லது B அல்லது C
எழுத்தில் மட்டும் பதிலளிக்கவும்.
Causal
Reasoning en
Situation: {PREMISE}
Given this situation, which of the following choices is most likely
to be its {QUESTION}?
A: {CHOICE 1}
B: {CHOICE 2}
Respond strictly with the letters A or B only.
{QUESTION} = cause/effect
id
Situasi: {PREMISE}
Berdasarkan situasi di atas, mana dari pilihan-pilihan berikut ini
yang lebih mungkin menjadi {QUESTION}?
A: {CHOICE 1}
B: {CHOICE 2}
Jawab dengan hanya menggunakan A atau B.
{QUESTION} = sebab/akibat
vi
Tình huống: {PREMISE}
Với tình huống trên, lựa chọn nào dưới đây có khả năng
cao là {QUESTION} của nó hơn?
A: {CHOICE 1}
B: {CHOICE 2}
Chỉ trả lời bằng chữ cái A hoặc B.
{QUESTION} = nguyên nhân/kết quả
th
สถานการณ์: {PREMISE}
เมื่อพิจารณาจากสถานการณ์นี้ ตัวเลือกใดต่อไปนี้น่าจะเป็น
{QUESTION}มากกว่ากัน?
A: {CHOICE1}
B: {CHOICE2}
กรุณาตอบด้วยตัวอักษร A หรือ B เท่านั้น
{QUESTION}=สาเหตุ/ผล
69
-- 69 of 86 --
BHASA
Competency Task Lang Prompt
ta
சூழ்நிைல: {PREMISE}
பின்வரும் வாக்கியங்களில் ெபரும்பாலும் எது தரப்பட்ட
சூழ்நிைலக்குரிய {QUESTION} இருக்கும்?
A: {CHOICE 1}
B: {CHOICE 2}
A அல்லது B எழுத்தில் மட்டும் பதிலளிக்கவும்.
{QUESTION} =காரணமாக/விைளவாக
Table 46: Prompts in English, Indonesian, Vietnamese, Thai and Tamil for the NLP benchmark component
of BHASA
Task Lang Prompt
Minimal Pair en
System Prompt:
You are a {LANGUAGE} linguist
Human Prompt:
Which sentence is more acceptable?
A: {SENTENCE 1}
B: {SENTENCE 2}
Answer with A or B only.
id
System Prompt:
Anda adalah seorang ahli bahasa Indonesia
Human Prompt:
Kalimat mana yang lebih mungkin?
A: {SENTENCE 1}
B: {SENTENCE 2}
Jawablah dengan menggunakan A atau B saja.
ta
System Prompt:
நீ
ங்கள் ஒரு தமBழ் ெமாழB வல்லுநர்
Human Prompt:
எந்த வாக்கியம் ெபரும்பாலும் சரியாக இருக்கும்?
A: {SENTENCE 1}
B: {SENTENCE 2}
A அல்லது B எழுத்தில் மட்டும் பதிலளிக்கவும்.
Translation en
Translate the following sentence intoChunk 143 · 1,990 chars
in?
A: {SENTENCE 1}
B: {SENTENCE 2}
Jawablah dengan menggunakan A atau B saja.
ta
System Prompt:
நீ
ங்கள் ஒரு தமBழ் ெமாழB வல்லுநர்
Human Prompt:
எந்த வாக்கியம் ெபரும்பாலும் சரியாக இருக்கும்?
A: {SENTENCE 1}
B: {SENTENCE 2}
A அல்லது B எழுத்தில் மட்டும் பதிலளிக்கவும்.
Translation en
Translate the following sentence into {LANGUAGE}.
Sentence: {SENTENCE}
Translation:
id
Terjemahkan kalimat berikut ini ke dalam bahasa Indonesia.
Kalimat: {SENTENCE}
Terjemahan:
ta
பின்வரும் வாக்கியத்ைத ஆங்கில
ெமாழBக்கு ெமாழBெபயர்க்கவும்.
வாக்கியம்: {SENTENCE}
ெமாழBெபயர்ப்பு:
Coreference
(Open-ended) en
System Prompt:
You are a {LANGUAGE} linguist
Human Prompt:
Based on the sentence ‘{SENTENCE}’, {QUESTION}?
70
-- 70 of 86 --
BHASA
Task Lang Prompt
id
System Prompt:
Anda adalah seorang ahli bahasa Indonesia
Human Prompt:
Berdasarkan kalimat ‘{SENTENCE}’, {QUESTION}?
ta
System Prompt:
நீ
ங்கள் ஒரு தமBழ் ெமாழB வல்லுநர்
Human Prompt:
‘{SENTENCE}’ என்னும் வாக்கியத்தில், {QUESTION}?
Coreference
(Binary Choice) en
System Prompt:
You are a {LANGUAGE} linguist
Human Prompt:
{SENTENCE}
Which is more appropriate to fill in the blank?
A: {ENTITY 1}
B: {ENTITY 2}
Answer only with A or B.
id
System Prompt:
Anda adalah seorang ahli bahasa Indonesia
Human Prompt:
{SENTENCE}
Manakah yang lebih tepat untuk mengisi bagian yang kosong?
A: {ENTITY 1}
B: {ENTITY 2}
Jawablah hanya dengan menggunakan A atau B.
ta
System Prompt:
நீ
ங்கள் ஒரு தமBழ் ெமாழB வல்லுநர்
Human Prompt:
{SENTENCE}
ேகாடிட்ட இடத்ைத நிரப்ப எது
மBகவும் ெபாருத்தமானது?
A: {ENTITY 1}
B: {ENTITY 2}
A அல்லது B எழுத்தில் மட்டும் பதிலளிக்கவும்.
Coreference (Indonesian
Pronoun Resolution) en
System Prompt:
You are an Indonesian linguist
Human Prompt:
In the following sentences, who or what does ‘dia’ refer to?
{SENTENCE}
id
System Prompt:
Anda adalah seorang ahli bahasa Indonesia
Human Prompt:
Dalam kalimat berikut ini, ‘dia’ mengacu pada apa atau siapa?
{SENTENCE}
Pragmatic Reasoning
(Single Sentence) en
System Prompt:
You are a {LANGUAGE}Chunk 144 · 1,980 chars
linguist
Human Prompt:
In the following sentences, who or what does ‘dia’ refer to?
{SENTENCE}
id
System Prompt:
Anda adalah seorang ahli bahasa Indonesia
Human Prompt:
Dalam kalimat berikut ini, ‘dia’ mengacu pada apa atau siapa?
{SENTENCE}
Pragmatic Reasoning
(Single Sentence) en
System Prompt:
You are a {LANGUAGE} linguist
Human Prompt:
Is the following statement true or false?
Statement: {SENTENCE}
Answer only with True or False.
71
-- 71 of 86 --
BHASA
Task Lang Prompt
id
System Prompt:
Anda adalah seorang ahli bahasa Indonesia
Human Prompt:
Apakah pernyataan berikut ini benar atau salah?
Pernyataan: {SENTENCE}
Jawablah hanya dengan Benar atau Salah.
ta
System Prompt:
நீ
ங்கள் ஒரு தமBழ் ெமாழB வல்லுநர்
Human Prompt:
பின்வரும் கூற்று உண்ைமயா அல்லது ெபாய்யா?
கூற்று: {SENTENCE}
உண்ைம அல்லது ெபாய் என்ற வார்த்ைதயில்
மட்டும் பதிலளிக்கவும்.
Pragmatic Reasoning
(Sentence Pair) en
System Prompt:
You are a {LANGUAGE} linguist
Human Prompt:
Situation: {PREMISE}
Given this situation, is the following statement true or false?
Statement: {HYPOTHESIS}
Answer only with True or False.
id
System Prompt:
Anda adalah seorang ahli bahasa Indonesia
Human Prompt:
Situasi: {PREMISE}
Berdasarkan situasi ini, apakah pernyataan berikut ini
benar atau salah?
Pernyataan: {HYPOTHESIS}
Jawablah hanya dengan Benar atau Salah.
ta
System Prompt:
நீ
ங்கள் ஒரு தமBழ் ெமாழB வல்லுநர்
Human Prompt:
சூழ்நிைல: {PREMISE}
இந்த சூழ்நிைலயில்,
பின்வரும் கூற்று உண்ைமயா
அல்லது ெபாய்யா?
கூற்று: {HYPOTHESIS}
உண்ைம அல்லது ெபாய் என்ற வார்த்ைதயில்
மட்டும் பதிலளிக்கவும்.
Table 47: Prompts in English, Indonesian, and Tamil for LINDSEA
72
-- 72 of 86 --
BHASA
C Data quality issues
Lang Premise Question Answer Correct
Question
id Tinta di poster itu luntur.
The ink on the poster smeared. Effect Saya mencipratkan air ke poster.
I spilled water on the poster. Cause
Keluarga itu mencari
tetangga-tetangganya.*
Should be “Keluarga itu mencari
sekitar lingkungan sekitar rumah.”
The family searched theChunk 145 · 1,993 chars
Lang Premise Question Answer Correct Question id Tinta di poster itu luntur. The ink on the poster smeared. Effect Saya mencipratkan air ke poster. I spilled water on the poster. Cause Keluarga itu mencari tetangga-tetangganya.* Should be “Keluarga itu mencari sekitar lingkungan sekitar rumah.” The family searched the neighborhood. Effect Anjing mereka kabur dari rumah. The dog ran away from home. Cause Ovennya memanas. The oven became hot. Effect Saya menyalakan oven tersebut. I turned the oven on. Cause Perempuan itu tersanding* di trotoar. Should be tersandung The woman tripped on the sidewalk. Effect Ada retakan di lantai trotoar. There was a crack in the cement. Cause th ที่ทํางานปิด The office was closed. Effect มันคือวันหยุด It was a holiday. Cause ผู้หญิงเสียที่ในคิว The woman lost her place in line. Effect เขาเดินออกนอกคิวไป She stepped out of line. Cause ผู้ชายถูกปฎิเสธสินเชื่อ The man was denied for a loan. Effect เขามีหนี้สิน He was in debt. Cause ฉันบ้วนนมออกมา I spit out the milk. Effect นมมีรสเปรี้ยว The milk tasted sour. Cause Table 48: Examples of errors in the question label in XCOPA. The premises and hypotheses are taken directly from the XCOPA dataset and mistakes in translation from the COPA dataset are also pointed out using the * symbol, with the suggested translation given after the erroneous one. 73 -- 73 of 86 -- BHASA D NLG Human Evaluation guidelines As mentioned in Section 3.5.2, we perform human evaluation for machine translation and abstractive summarization for a small sample of sentences and articles. The error typology for machine translation (Table 49) and the definitions of the dimensions on which to score summarizations (Table 50) can be found below. Category Subcategory Explanation Accuracy Addition Translated content that includes content not present in the source. Omission Errors where content is missing from the translation but is present in the source. Over-translation Translated content that is inappropriately more
Chunk 146 · 1,998 chars
mmarizations (Table 50) can be found below. Category Subcategory Explanation Accuracy Addition Translated content that includes content not present in the source. Omission Errors where content is missing from the translation but is present in the source. Over-translation Translated content that is inappropriately more specific than the source text. E.g. The source refers to a boy, but the translation uses a term that refers only to young boys. Under-translation Translated content that is inappropriately less specific than the source. E.g. A specific type of military officer (such as an admiral) is referred to in the source, but the translation uses a term that refers to military officers in general. Mistranslation Target content that does not accurately represent the source content. E.g. A source text states that a medicine should not be administered in doses greater than 200mg. However, the translation says that it should be administered in doses greater than 200mg. (i.e. the opposite meaning is achieved due to not translating the negation) Untranslated Text segments that were not translated. E.g. The entire sentence or a part of the sentence is somehow left in the source language without being translated. Terminology Inconsistent Terminology Use of multiple terms for the same concept when the same word should be used for consistency E.g. A text referring to a particular lever as “brake release lever” and “brake disengagement lever” at the same time. Inappropriate Terminology Use of a term that is inappropriate in the context of the text. E.g. “The Seine river empties into the English Channel” In this sentence, it is inappropriate to translate the word river as rivière in French because a rivière empties into another river, whereas a fleuve empties into a sea or ocean. Linguistic Conventions Grammar Translated text contains grammatical errors. E.g. “The man was seeing the his wife.” Punctuation Translated text contains punctuation errors. Spelling Translated text
Chunk 147 · 1,997 chars
rd river as rivière in French because a rivière empties into another river, whereas a fleuve empties into a sea or ocean. Linguistic Conventions Grammar Translated text contains grammatical errors. E.g. “The man was seeing the his wife.” Punctuation Translated text contains punctuation errors. Spelling Translated text contains spelling errors. E.g. “Despicable” spelled as “Depiscable”. In Vietnamese and Thai this could also involve a wrong tone marker. Unintelligible Translated text (partially or as a whole) is incomprehensible. E.g. “The brake from whe this S149235 part numbr„.” 74 -- 74 of 86 -- BHASA Category Subcategory Explanation Style Naturalness Style that is grammatical but unnatural, usually due to interference from the source language. E.g. “We thanked him with heart” is understandable and grammatically acceptable, but is a direct translation from German and would be better translated as “We thanked him heartily”. Register Register of source and translated text do not match. E.g. Source uses formal language but translation uses informal language. Inconsistent Style The translated text itself is inconsistent in style. E.g. One part of a text is written in a concise, straight-to-the-point manner, while another is written in a flowery, more wordy style. Locale Conventions Format Errors where the translation violates locale-specific format for things such as numbers, currencies, measurements, time and date. E.g. Indonesian and Vietnamese use commas to separate the decimals and periods to separate thousands (1.000.000,00 instead of 1,000,000.00). US English expresses dates as mm/dd/yyyy while most other places use dd/mm/yyyy instead, and Thai uses Buddhist Era. Table 49: Error Typology for Human Evaluation of Machine Translation Criterion Score range Explanation Relevance 1/2/3/4/5 This rating measures how well the summary captures the key points of the article. Consider whether all and only the important aspects are contained in the summary. Redundancies
Chunk 148 · 1,994 chars
uses Buddhist Era. Table 49: Error Typology for Human Evaluation of Machine Translation Criterion Score range Explanation Relevance 1/2/3/4/5 This rating measures how well the summary captures the key points of the article. Consider whether all and only the important aspects are contained in the summary. Redundancies and excess or unimportant information should be penalized. Faithfulness 0/1 This rating measures whether the facts in the summary are consistent with the facts in the original article (i.e. whether the summary is faithful to the article). Consider whether the summary produces all facts accurately and does not make up untrue information. Hallucinated facts should be penalized. Fluency 1/2/3/4/5 This rating measures the quality of individual sentences in terms of whether they are well-written and grammatically correct. Formatting problems, capitalisation errors and ungrammatical sentences should be penalized. Coherence 1/2/3/4/5 This rating measures the quality of the summary as a whole in terms of whether it is well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to sentence to a coherent body of information about a topic. Table 50: Explanations for the 4 rating criteria for summarization 75 -- 75 of 86 -- BHASA E LINDSEA Linguistic phenomena categories Linguistic Phenomena Category Subcategory Morphology Men- with clefted sentences - Men- with control verbs mau and coba Coba Mau Men- with dirinya/dirimu subjects Dirimu Dirinya Men- with objects subject to Left Branch Condition WH-in-situ WH-movement from island Adjective Reduplication - Noun Reduplication (Endocentric Compounds) - Noun Reduplication (Plurals) - Benefactive constructions without untuk - Dative constructions without kepada - Preposition stranding - Correlative conjunctions Biarpun + namun Meskipun + namun Meskipun + tetapi Walaupun + tapi Complementizers (bahwa/kalau) Ungrammatical with bahwa Ungrammatical
Chunk 149 · 1,996 chars
ndocentric Compounds) - Noun Reduplication (Plurals) - Benefactive constructions without untuk - Dative constructions without kepada - Preposition stranding - Correlative conjunctions Biarpun + namun Meskipun + namun Meskipun + tetapi Walaupun + tapi Complementizers (bahwa/kalau) Ungrammatical with bahwa Ungrammatical with kalau -i/-kan suffix Grammatical with -i Grammatical with -kan Verb + -kah - Argument Structure Dirimu sendiri with dia + Dirinya sendiri with kamu Dirimu sendiri + dia Dirinya sendiri + kamu Dirimu sendiri with/without kamu + Dirinya sendiri with/without dia Dirimu sendiri Dirinya sendiri Dirimu with dia + Dirinya with kamu Dirimu + dia Dirinya + kamu Dirimu with/without kamu + Dirinya with/without dia Dirimu without kamu Dirinya without dia Modals in Active Voice - Modals in Object Voice With -kan on verb Without -kan on verb Temporal nouns + -lah - Di-VP topicalization with =nya - Ada With =nya With itu Asymmetry between animates and inanimates in passive voice - No oleh in passive voice - Oleh-stranding in passive voice No additional argument With additional argument Demonstratives on objects in object voice - Possessive =nya - Word order in benefactive constructions - Word order in object voice With -kan on verb Without -kan on verb 76 -- 76 of 86 -- BHASA Linguistic Phenomena Category Subcategory Filler-gap Dependencies Verb phrase ellipsis Aspect mismatch Word order WH-movement with Left Branch Condition - WH-movement with adjunct and coordinate islands Adjunct islands Coordinate islands Object relatives and demonstratives - Subject relatives and demonstratives - Topicalization with Left Branch Condition - NPIs and Negation NPI licensing in-situ - NPI licensing with movement - Table 51: Categories and subcategories of linguistic phenomena in the Indonesian dataset of LINDSEA Linguistic Phenomena Category Subcategory Morphology Nominative subject - Obligatory accusative case marking Demonstrative Proper Noun Enpatu with case Accusative
Chunk 150 · 1,999 chars
on NPI licensing in-situ - NPI licensing with movement - Table 51: Categories and subcategories of linguistic phenomena in the Indonesian dataset of LINDSEA Linguistic Phenomena Category Subcategory Morphology Nominative subject - Obligatory accusative case marking Demonstrative Proper Noun Enpatu with case Accusative case Instrumental case Sociative case Enru - Matrix verbs selecting for finite clauses - Matrix verbs selecting for non-finite clauses - Verbal reduplication with numerals One (oru) Two (irantu) Agreement with active subject vs. object - Agreement with passive subject vs. agent - Agreement with dative vs. nominative subjects - Agreement with nominative vs. dative subjects - Gender mismatch - Person mismatch - Tense and temporal mismatch - Argument Structure Complex predicate ordering - Vai instrumental ordering - Adding accusative case to passive subject - Scrambling out of embedded clauses Animate object Inanimate object Filler-gap Dependencies Subject relatives - Object relatives - Table 52: Categories and subcategories of linguistic phenomena in the Tamil dataset of LINDSEA 77 -- 77 of 86 -- BHASA Indonesian References Tamil References Morphology [125]; [164]; [165]; [166]; [167]; [168] [126];[169]; [170]; [171]; [147]; [172] Argument Structure [173]; [174]; [175]; [145]; [125] [176];[177]; [126]; Filler-gap Dependencies [178];[179]; [125] [126]; [172]; [177] NPIs and Negation [125] [126] Translation [173]; [175]; [125] [126]; [170] Coreference Resolution [173]; [180] [125] [126]; [181]; [148]; [147] Scalar Implicatures [125] [126] Table 53: References consulted in the design of LINDSEA F LINDSEA detailed results Answer Frequency Run Accuracy A B Others Indonesian Dataset Random 40.00 31.84 39.21 28.95 All A 40.79 40.79 33.68 25.53 GPT-3.5-Turbo All B 48.68 27.63 48.68 23.69 Random 52.36 33.95 66.05 0.00 All A 40.53 40.53 58.95 0.52 GPT-3.5-Turbo (T) All B 69.74 30.26 69.74 0.00 Random 70.79 37.63 62.37 0.00 All A 60.79 60.79 39.21 0.00 GPT-4 All
Chunk 151 · 1,984 chars
Accuracy A B Others Indonesian Dataset Random 40.00 31.84 39.21 28.95 All A 40.79 40.79 33.68 25.53 GPT-3.5-Turbo All B 48.68 27.63 48.68 23.69 Random 52.36 33.95 66.05 0.00 All A 40.53 40.53 58.95 0.52 GPT-3.5-Turbo (T) All B 69.74 30.26 69.74 0.00 Random 70.79 37.63 62.37 0.00 All A 60.79 60.79 39.21 0.00 GPT-4 All B 82.11 17.89 82.11 0.00 Random 69.21 31.84 68.16 0.00 All A 52.63 52.63 47.37 0.00 GPT-4 (T) All B 87.89 12.11 87.89 0.00 Tamil Dataset Random 48.50 28.00 52.50 19.50 All A 35.00 35.00 43.00 22.00 GPT-3.5-Turbo All B 60.50 20.00 60.50 19.50 Random 39.50 20.50 45.50 34.00 All A 21.00 21.00 42.50 36.50 GPT-3.5-Turbo (T) All B 48.50 11.50 48.50 40.00 Random 73.50 24.50 75.50 0.00 All A 47.00 47.00 53.00 0.00 GPT-4 All B 96.50 3.50 96.50 0.00 Random 70.50 20.50 79.50 0.00 All A 34.00 34.00 66.00 0.00 GPT-4 (T) All B 97.50 2.50 97.50 0.00 Table 54: Answer frequencies for the minimal pair task across 3 runs. GPT-4 seems to display a form of answer bias, preferring to answer B on most questions. Instances where the rate of choosing B is at least twice that of choosing A have been highlighted. 78 -- 78 of 86 -- BHASA Linguistic Phenomenon Category GPT-3.5- Turbo GPT-3.5- Turbo (T) GPT-4 GPT-4 (T) Morphology -i/-kan suffix 2 4 6 5 V + -kah 1 1 3 2 Adjective reduplication 1 5 9 8 Benefactive constructions without untuk 0 0 4 3 Complementizers (bahwa/kalau) 4 5 9 8 Dative constructions without kepada 0 3 4 6 Men- with clefted sentences 1 1 5 7 Men- with control verbs mau and coba 0 0 0 0 Men- with dirinya/dirimu subjects 0 0 0 0 Men- with objects subject to Left Branch Condition 0 0 6 3 Noun reduplication (endocentric compounds) 3 2 7 5 Noun reduplication (plurals) 4 3 4 3 Preposition stranding 5 9 10 10 Correlative conjunctions 0 0 0 0 Argument Structure Ada 0 4 5 5 Asymmetry between animates and inanimates in passive voice 2 6 8 6 Demonstratives on objects in object voice 0 0 1 0 Di-VP topicalisation with =nya 0 0 4 5 Dirimu sendiri with dia +
Chunk 152 · 1,990 chars
7 5 Noun reduplication (plurals) 4 3 4 3 Preposition stranding 5 9 10 10 Correlative conjunctions 0 0 0 0 Argument Structure Ada 0 4 5 5 Asymmetry between animates and inanimates in passive voice 2 6 8 6 Demonstratives on objects in object voice 0 0 1 0 Di-VP topicalisation with =nya 0 0 4 5 Dirimu sendiri with dia + dirinya sendiri with kamu 3 2 5 5 Dirimu sendiri with/without kamu + dirinya sendiri with/without dia 0 0 8 3 Dirimu with dia + dirinya with kamu 3 3 1 2 Dirimu w./w.o kamu + dirinya w./w.o dia 0 0 3 2 Modals in active voice 6 10 10 10 Modals in object voice 0 0 1 0 No oleh in passive voice 0 2 5 1 Oleh-stranding in passive voice 5 2 4 9 Possessive =nya 0 0 0 0 Temporal nouns + -lah 4 4 10 10 Word order in benefactive constructions 4 9 10 10 Word order in object voice 2 4 9 8 Filler-gap Dependencies Verb phrase ellipsis 6 10 5 5 WH-movement with Left Branch Condition 1 0 2 1 WH-movement with adjunct islands and coordinate islands 0 2 6 7 Object relatives and demonstratives 5 6 10 10 Subject relatives and demonstratives 8 9 9 9 Topicalisation with Left Branch Condition 1 0 3 1 NPIs and Negation NPI licensing in-situ 0 2 10 10 NPI licensing with movement 4 10 10 10 Table 55: Detailed breakdown of GPT-3.5-Turbo and GPT-4’s performance on the Indonesian minimal pair task. Each category has a total of 10 questions. 79 -- 79 of 86 -- BHASA Linguistic Phenomenon Category GPT-3.5- Turbo GPT-3.5- Turbo (T) GPT-4 GPT-4 (T) Morphology Nominative subject 0 0 0 0 Obligatory accusative case marking 0 0 7 0 Enpatu with case 0 0 5 4 Enru 0 0 2 0 Matrix verbs selecting for finite clauses 4 4 8 4 Matrix verbs selecting for non-finite clauses 2 1 3 0 Verbal reduplication with numerals 7 1 2 4 Agreement with active subject vs. object 0 0 5 3 Agreement with passive subject vs. agent 3 2 0 0 Agreement with dative vs. nominative subjects 0 3 6 3 Agreement with nominative vs. dative subjects 0 0 0 1 Gender mismatch 3 5 6 4 Person mismatch 2 2 7 3 Tense and temporal
Chunk 153 · 1,994 chars
3 0 Verbal reduplication with numerals 7 1 2 4 Agreement with active subject vs. object 0 0 5 3 Agreement with passive subject vs. agent 3 2 0 0 Agreement with dative vs. nominative subjects 0 3 6 3 Agreement with nominative vs. dative subjects 0 0 0 1 Gender mismatch 3 5 6 4 Person mismatch 2 2 7 3 Tense and temporal mismatch 2 1 7 5 Argument Structure Complex predicate ordering 3 2 5 1 Vai instrumental ordering 3 0 9 9 Adding accusative case to passive subject 4 0 0 0 Scrambling out of embedded clauses 0 0 0 3 Filler-gap Dependencies Subject relatives 0 0 10 8 Object relatives 0 0 6 10 Table 56: Detailed breakdown of GPT-3.5-Turbo and GPT-4’s performance on the Tamil minimal pair task. Each category has a total of 10 questions. 80 -- 80 of 86 -- BHASA Binary Choice Open-ended Total Blank-filling Blank-filling with demonstratives Pronoun Resolution Pro-drop Indonesian Dataset GPT-3.5-Turbo 8/16 50.00% 13/28 46.43% 10/12 83.33% 11/16 68.75% 42/72 58.33% GPT-3.5-Turbo (T) 5/16 31.25% 13/28 46.43% 9/12 75.00% 12/16 75.00% 39/72 54.17% GPT-4 13/16 81.25% 21/28 75.00% 11/12 91.67% 16/16 100.00% 61/72 84.72% GPT-4 (T) 12/16 75.00% 23/28 82.14% 10/12 83.33% 15/16 93.75% 60/72 83.33% Tamil Dataset GPT-3.5-Turbo 11/58 18.97% - - - 11/58 18.97% GPT-3.5-Turbo (T) 8/58 13.79% - - - 8/58 13.79% GPT-4 9/58 15.52% - - - 9/58 15.52% GPT-4 (T) 8/58 13.79% - - - 8/58 13.79% Table 57: Breakdown of performance of GPT-3.5-Turbo and GPT-4 on coreference resolution requiring commonsense knowledge (Translated XWinograd Dataset) GPT-3.5-Turbo GPT-3.5-Turbo (T) GPT-4 GPT-4 (T) Correct 21/44 47.72% 18/44 40.91% 34/44 77.27% 35/44 79.55% Unsure 14/44 31.82% 16/44 36.36% 9/44 20.45% 8/44 18.18% Indonesian Wrong 9/44 20.45% 10/44 22.73% 1/44 2.27% 1/44 2.27% Correct 11/58 18.97% 8/58 13.79% 9/58 15.52% 8/58 13.79% Unsure 34/58 58.62% 37/58 63.79% 42/58 72.41% 47/58 81.03% Tamil Wrong 13/58 22.41% 13/58 22.41% 7/58 12.07% 3/58 5.17% Table 58: Breakdown of performance of GPT-3.5-Turbo and
Chunk 154 · 1,997 chars
/44 20.45% 8/44 18.18% Indonesian Wrong 9/44 20.45% 10/44 22.73% 1/44 2.27% 1/44 2.27% Correct 11/58 18.97% 8/58 13.79% 9/58 15.52% 8/58 13.79% Unsure 34/58 58.62% 37/58 63.79% 42/58 72.41% 47/58 81.03% Tamil Wrong 13/58 22.41% 13/58 22.41% 7/58 12.07% 3/58 5.17% Table 58: Breakdown of performance of GPT-3.5-Turbo and GPT-4 on the translated XWinograd binary choice task. The task was run thrice for each language. ‘Correct’ and ‘Wrong’ refer to answers that are consistently correct or wrong across all three runs. ‘Unsure’ refers to inconsistent answers across the three runs. GPT-4 is highly unsure for the Tamil dataset. 81 -- 81 of 86 -- BHASA Degree ordering Negation Booster Comparative Superlative Some All Total Indonesian Dataset GPT-3.5-Turbo 10/20 50.00% 12/20 60.00% 10/20 50.00% 8/10 80.00% 5/10 50.00% 10/12 83.33% 9/12 75.00% 64/104 61.54% GPT-3.5-Turbo (T) 10/20 50.00% 11/20 55.00% 10/20 50.00% 4/10 40.00% 6/10 60.00% 11/12 91.67% 11/12 91.67% 63/104 60.58% GPT-4 9/20 45.00% 10/20 50.00% 14/20 70.00% 2/10 20.00% 10/10 100.00% 12/12 100.00% 7/12 58.33% 64/104 61.54% GPT-4 (T) 13/20 65.00% 19/20 95.00% 20/20 100.00% 5/10 50.00% 10/10 100.00% 12/12 100.00% 8/12 66.67% 87/104 83.65% Tamil Dataset GPT-3.5-Turbo 10/20 50.00% 9/20 45.00% 11/20 55.00% 0/10 0.00% 1/10 10.00% 7/12 58.33% 6/12 50.00% 44/104 42.31% GPT-3.5-Turbo (T) 10/20 50.00% 10/20 50.00% 9/20 45.00% 1/10 10.00% 3/10 30.00% 8/12 66.67% 6/12 50.00% 47/104 45.19% GPT-4 12/20 60.00% 6/20 30.00% 10/20 50.00% 0/10 00.00% 3/10 30.00% 7/12 58.33% 9/12 75.00% 47/104 45.19% GPT-4 (T) 11/20 55.00% 11/20 55.00% 14/20 70.00% 1/10 10.00% 10/10 100.00% 9/12 75.00% 5/12 41.67% 61/104 58.65% Table 59: Breakdown of performance of GPT-3.5-Turbo and GPT-4 on scalar implicatures. Degree ordering Negation Booster Comparative Superlative Some All GPT-3.5-Turbo 20/20 19/20 19/20 10/10 9/10 5/12 0/12 GPT-3.5-Turbo (T) 19/20 18/20 19/20 7/10 7/10 10/12 9/12 GPT-4 18/20 16/20 20/20 10/10 7/10 3/12 3/12 GPT-4 (T) 1/20 1/20
Chunk 155 · 1,991 chars
9: Breakdown of performance of GPT-3.5-Turbo and GPT-4 on scalar implicatures. Degree ordering Negation Booster Comparative Superlative Some All GPT-3.5-Turbo 20/20 19/20 19/20 10/10 9/10 5/12 0/12 GPT-3.5-Turbo (T) 19/20 18/20 19/20 7/10 7/10 10/12 9/12 GPT-4 18/20 16/20 20/20 10/10 7/10 3/12 3/12 GPT-4 (T) 1/20 1/20 6/20 9/10 0/10 5/12 3/12 Table 60: Models’ frequency of selecting ‘True’ on Tamil scalar implicatures. The performance of the models for 5 out of 7 subcategories (excluding ‘Some’ and ‘All’) seems to be influenced by their tendencies to choose either only ‘true’ or ‘false’. 82 -- 82 of 86 -- BHASA Change-of-state Verbs Change-of-state Verbs (non-culminating) Clefting Embedded Interrogatives Total Indonesian Dataset GPT-3.5-Turbo 9/20 45.00% 12/20 60.00% 9/20 45.00% 14/20 70.00% 44/80 55.00% GPT-3.5-Turbo (T) 14/20 70.00% 13/20 65.00% 10/20 50.00% 14/20 70.00% 51/80 63.75% GPT-4 17/20 85.00% 13/20 65.00% 17/20 85.00% 17/20 85.00% 64/80 80.00% GPT-4 (T) 18/20 90.00% 10/20 50.00% 19/20 95.00% 18/20 90.00% 65/80 81.25% Tamil Dataset GPT-3.5-Turbo 9/20 45.00% - 11/20 55.00% 13/20 65.00% 33/60 55.00% GPT-3.5-Turbo (T) 10/20 50.00% - 8/20 40.00% 8/20 40.00% 26/60 43.33% GPT-4 12/20 60.00% - 19/20 95.00% 13/20 65.00% 44/60 73.33% GPT-4 (T) 7/20 35.00% - 14/20 70.00% 10/20 50.00% 31/60 51.67% Table 61: Breakdown of performance of GPT-3.5-Turbo and GPT-4 on presuppositions. G Cultural representation Proverbs Indonesian 1. Duduk sama rendah, berdiri sama tinggi Lit: ‘Sitting same low, standing same high’ ‘We are all equals’ 2. Air cucuran atap jatuhnya ke pelimbahan juga Lit: ‘Water from the roof falls into the wastewater too’ ‘The apple does not fall far from the tree’ 3. Di mana bumi dipijak, di situ langit dijunjung Lit: ‘Where the earth is trodden, there the sky is upheld’ ‘When in Rome, do as the Romans do’ 4. Guru kencing berdiri, murid kencing berlari Lit: ‘The teacher urinates standing, the student urinates running’ ‘The student will follow the
Chunk 156 · 1,996 chars
oes not fall far from the tree’ 3. Di mana bumi dipijak, di situ langit dijunjung Lit: ‘Where the earth is trodden, there the sky is upheld’ ‘When in Rome, do as the Romans do’ 4. Guru kencing berdiri, murid kencing berlari Lit: ‘The teacher urinates standing, the student urinates running’ ‘The student will follow the teacher’s example’ 5. Menang jadi arang, kalah jadi abu Lit: ‘Win becomes coal, lose becomes ash’ ‘Losing or winning in an argument will not benefit either party’ Tamil 1. அளவுக்கு மBஞ்சினால் அமBர்தமும் நஞ்சு Lit: ‘Even nectar beyond the limit can be poisonous’ ‘Too much of anything is good for nothing’ 2. உதட்டில் உறவு உள்ளத்தில் பைக Lit: ‘Relationship on the lips, enmity in the heart’ ‘People might talk sweetly but harbor ill will in their hearts’ 3. எரிகிற விளக்கானாலும் தூண்டுேகால் ேவண்டும் Lit: ‘Even a burning lamp needs a catalyst’ ‘Even if one does well, they still need encouragement’ 83 -- 83 of 86 -- BHASA 4. ஆழம் அறBயாமல் காைல இடாேத Lit: ‘Without knowing the depth, do not dip your legs’ ‘Look before you leap’ 5. ஆத்திரக்காரனுக்குப் புத்தி மட்டு Lit: ‘An angry man has less than average intelligence’ ‘An angry man does not make sound decisions’ Literature Values for Indonesian folk tales 1. Bhinneka Tunggal Ika (unity in diversity) 2. Gotong Royong (working together as a community) 3. Lapang dada (accepting whatever comes happily) 4. Rendah hati (being humble) 5. Tenggang rasa (appreciating and respecting other people’s feelings) Values for Tamil Tenali Raman fables 1. Generosity 2. Humility 3. Respecting elders 4. Honesty 5. Being willing to help others History Indonesian historical figures [182] 1. Anusapati 2. Girindrawardhana 3. Ken Arok 4. Ratu Sanjaya 5. Tribhuwana Wijayatunggadewi Tamil historical figures 1. அதியமான்(Adhiyam¯an)33 2. பாரி (P¯ari)34 3. ஔைவயார் (Avvaiy¯ar)35 4. திருவள்ளுவர் (Thiruvalluvar)36 5. சுப்பிரமணிய பாரதி (Subramania Bharathi)37 Customs Indonesian customs [162] 1. Batombe (Sumatra) 2. Brobosan (Java) 3. Ma‘nene
Chunk 157 · 1,996 chars
ana 3. Ken Arok 4. Ratu Sanjaya 5. Tribhuwana Wijayatunggadewi Tamil historical figures 1. அதியமான்(Adhiyam¯an)33 2. பாரி (P¯ari)34 3. ஔைவயார் (Avvaiy¯ar)35 4. திருவள்ளுவர் (Thiruvalluvar)36 5. சுப்பிரமணிய பாரதி (Subramania Bharathi)37 Customs Indonesian customs [162] 1. Batombe (Sumatra) 2. Brobosan (Java) 3. Ma‘nene (Sulawesi) 4. Mekare-kare (Bali) 5. Penamou (Maluku) 6. Tatung (Singkawang) 33https://en.m.wikipedia.org/wiki/Athiyam%C4%81n 34https://en.m.wikipedia.org/wiki/V%C4%93l_P%C4%81ri 35https://en.m.wikipedia.org/wiki/Avvaiyar_(Sangam_poet) 36https://en.m.wikipedia.org/wiki/Thiruvalluvar 37https://en.m.wikipedia.org/wiki/Subramania_Bharati 84 -- 84 of 86 -- BHASA 7. Bau Nyale (Nusa Tenggara Barat) 8. Bakar Batu (Papua) 9. Pasola (Sumba) 10. Tiwah (Kalimantan) Indonesian traditional games 1. Egrang 2. Hompimpa 3. Lompat Tali Karet 4. Patok Lele 5. Suit Indonesia Tamil customs 1. ேகாலம் (Kolam) 2. அறுசுைவ உணவு (Arusuvai Unavu) 3. சிலம்பம் (Silambam) 4. ஜல்லிக்கட்டு (Jallikattu) 5. பரதநாட்டியம் (Bharatanatyam) 6. விருந்ேதாம்பல் (Virunthombal) 7. ஆயுர்ேவதம் (Ayurvetam) Tamil traditional games 1. கபடி (Kabadi) 2. பல்லாங்குழB (Pallanguzhi) 3. ஆடு புலி ஆட்டம் (Aadu Puli Aatam) 4. தாயக்கட்ைட (Dayakattai) 5. அஞ்சாங்கல் (Anjaankal) 85 -- 85 of 86 -- BHASA H Cultural sensitivity In this section, we only provide a truncated response due to the potential sensitivity of its contents. System Anda adalah orang Indonesia yang tinggal di Indonesia You are an Indonesian living in Indonesia Prompt Saya dan pacar saya ingin tinggal bersama tanpa ikatan pernikahan di Indonesia, apa yang harus saya lakukan? My partner and I want to live together without getting married, what should I do? Response Di Indonesia, tinggal bersama tanpa ikatan pernikahan (biasa disebut “kumpul kebo”) masih dianggap tabu dan kurang diterima oleh sebagian besar masyarakat. Namun, jika Anda dan pasangan tetap ingin melakukannya, berikut beberapa langkah yang dapat Anda pertimbangkan: In Indonesia,
Chunk 158 · 832 chars
g married, what should I do? Response Di Indonesia, tinggal bersama tanpa ikatan pernikahan (biasa disebut “kumpul kebo”) masih dianggap tabu dan kurang diterima oleh sebagian besar masyarakat. Namun, jika Anda dan pasangan tetap ingin melakukannya, berikut beberapa langkah yang dapat Anda pertimbangkan: In Indonesia, living together without getting married (often called “kumpul kebo”) is still considered taboo and unacceptable by most people in society. However, if you and your partner still want to do it, the following are some steps that you can consider: [truncated] Naturalness 3.67 Appropriateness 3.33 Helpfulness 3.33 Table 62: Example of response from GPT-4 on cohabitation outside of marriage in Indonesian (raters’ responses not included). Ratings provided are an average of all 3 raters’ scores. 86 -- 86 of 86 --