MMTEB: Massive Multilingual Text Embedding Benchmark
Summary
This paper introduces MMTEB (Massive Multilingual Text Embedding Benchmark), a large-scale, community-driven expansion of the MTEB benchmark. MMTEB comprises over 500 quality-controlled tasks across more than 250 languages, covering diverse domains such as medical texts, legal documents, and code. It includes novel task categories like instruction following, long-document retrieval, and code retrieval. To address the high computational costs associated with evaluating large models, the authors developed optimization strategies including downsampling based on inter-task correlation, caching embeddings, and sampling hard negatives for retrieval tasks. These methods significantly reduce resource requirements; for instance, a new zero-shot English benchmark maintains model ranking accuracy while using only 2% of the original documents. The study evaluates a representative set of models, finding that while large language models (LLMs) with billions of parameters achieve state-of-the-art results on specific subsets, the best-performing publicly available model is multilingual-e5-large-instruct, which has only 560 million parameters. Instruction-tuned models generally outperform non-instruction-tuned counterparts, particularly on bitext mining and clustering. However, smaller models based on XLM-R Large often surpass larger Mistral-based models on low-resource languages, likely due to differences in pre-training data diversity. The benchmark aims to improve accessibility for low-resource communities by providing efficient evaluation tools and a comprehensive, open-source collection of tasks.
Chunks(136)
Chunk 0 · 1,990 chars
Published as a conference paper at ICLR 2025
MMTEB: MASSIVE MULTILINGUAL TEXT EMBEDDING BENCHMARK
Kenneth Enevoldsen1,†,‡, Isaac Chung2,‡, Imene Kerboua3,4,‡, Márton Kardos1,‡, Ashwin Mathur2, David Stap5, Jay Gala6, Wissam Siblini2, Dominik Krzemiński2, Genta Indra Winata2, Saba Sturua7, Saiteja Utpala8, Mathieu Ciancone9, Marion Schaeffer9, Gabriel Sequeira2, Diganta Misra57,58, Shreeya Dhakal2, Jonathan Rystrøm11, Roman Solomatin12,‡, Ömer Çağatan13, Akash Kundu14,15, Martin Bernstorff1, Shitao Xiao16, Akshita Sukhlecha2, Bhavish Pahwa8, Rafał Poświata17, Kranthi Kiran GV18, Shawon Ashraf19, Daniel Auras19, Björn Plüster19, Jan Philipp Harries19, Loïc Magne2, Isabelle Mohr7, Mariya Hendriksen5, Dawei Zhu20, Hippolyte Gisserot-Boukhlef21,22, Tom Aarsen23,‡, Jan Kostkan1, Konrad Wojtasik24, Taemin Lee25, Marek Šuppa27,28, Crystina Zhang29, Roberta Rocca1, Mohammed Hamdy30, Andrianos Michail31, John Yang32, Manuel Faysse21,26, Aleksei Vatolin33, Nandan Thakur29, Manan Dey34, Dipam Vasani2, Saksham Thakur2, Pranjal Chitale35, Simone Tedeschi36,37, Nguyen Tai38, Artem Snegirev39, Michael Günther7, Mengzhou Xia40, Weijia Shi41, Xing Han Lù10, Jordan Clive42, Gayatri Krishnakumar43, Anna Maksimova39, Silvan Wehrli44, Maria Tikhonova39,45, Henil Panchal46, Aleksandr Abramov39, Malte Ostendorff47, Zheng Liu16, Simon Clematide31, Lester James Miranda48, Alena Fenogenova39, Guangyu Song49, Ruqiya Bin Safi50, Wen-Ding Li51, Alessia Borghini37, Federico Cassano52, Hongjin Su53, Jimmy Lin29, Howard Yen40, Lasse Hansen1, Sara Hooker30, Chenghao Xiao54,‡, Vaibhav Adlakha10,55,‡, Orion Weller56,‡, Siva Reddy10,55,‡, Niklas Muennighoff32,48,59,‡
1Aarhus University, 2Individual Contributor, 3Esker, 4INSA Lyon, LIRIS, 5University of Amsterdam, 6MBZUAI, 7Jina AI, 8Microsoft Research, 9Wikit, 10McGill University, 11University of Oxford, 12ITMO University, 13Koç University, 14Heritage Institute of Technology, 15Apart Research, 16BAAI, 17National Information
Processing
Chunk 1 · 1,996 chars
Institute, 18New York University, 19Ellamind, 20Peking University, 21CentraleSupélec, 22Artefact Research Center, 23Hugging Face, 24Wrocław University, 25Korea University, 26Illuin Technology, 27Comenius University Bratislava, 28Cisco Systems, 29University of Waterloo, 30Cohere For AI, 31University of Zurich, 32Stanford University, 33FRC CSC RAS, 34Salesforce, 35IIT Madras, 36Sapienza University of Rome, 37Babelscape, 38University of Pennsylvania, 39SaluteDevices, 40Princeton University, 41University of Washington, 42Imperial College London, 43R. V. College of Engineering, 44Robert Koch Institute, 45HSE University, 46Nirma University, 47Occiglot, 48Allen Institute for AI, 49Tano Labs, 50The London Institute of Banking and Finance, 51Cornell University, 52Northeastern University, 53Hong Kong University, 54Durham University, 55ServiceNow Research, 56Johns Hopkins University, 57ELLIS Institute Tübingen, 58MPI-IS Tübingen, 59Contextual AI
‡ Managing Team † Correspondence: kenneth.enevoldsen@cas.au.dk
arXiv:2502.13595v4 [cs.CL] 13 Nov 2025
ABSTRACT
Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB), a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the
Chunk 2 · 1,991 chars
largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters. To facilitate accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, ensuring a diverse selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to that of the full-scale version but requires only 2% of the original documents, vastly reducing the computational cost.¹
1 INTRODUCTION
Text embeddings are used in many applications, such as semantic search (Reimers & Gurevych, 2019; Muennighoff, 2022; Hendriksen et al., 2023; Winata et al., 2023a; 2024b) and classification tasks (Wang et al., 2018; 2019). Additionally, text embeddings play a crucial role in retrieval-augmented generation (RAG; Borgeaud et al. 2022; Lewis et al. 2021), and often provide significant gains in performance on low- to mid-resource languages, enabling the incorporation of previously inaccessible information. Despite the wide range of
Chunk 3 · 1,989 chars
applications, there's a lack of benchmarks that evaluate text embeddings across multiple domains, languages, and tasks. Existing benchmarks tend to focus on specific domains, demarcated by subject (e.g., medical, legal, fiction (Thorne et al., 2018b)), particular tasks (e.g., retrieval (Thakur et al., 2021)), literary type (e.g., fiction, and non-fiction) or form (e.g., spoken and written). Embeddings also tend to focus on a subset of languages (Nørregaard & Derczynski, 2021). While recent efforts (Thakur et al., 2021; Muennighoff et al., 2023b; Zhang et al., 2022) have aimed to broaden the scope by encompassing more tasks, domains, or languages (Cohan et al., 2020a; Wrzalik & Krechel, 2021), a large gap in language coverage remains. This work bridges this gap by creating a benchmark that includes a much broader range of low- to mid-resource languages, along with broader coverage of domains and task categories.
To create such an expansive benchmark, we initiated a large-scale, open collaboration. Contributors include native speakers from diverse linguistic backgrounds, NLP practitioners, academic and industry researchers, and enthusiasts. To ensure high-quality submissions, each dataset required systematic tests, detailed metadata, and a review.
The result of this extensive collaborative effort is MMTEB, the Massive Multilingual Text Embedding Benchmark, which comprises more than 500 distinct tasks across 10 task categories, covering over 250 languages, and spans a wide array of domains such as fiction, social media, medical texts, and technical programming documentation. It also integrates recent, high-quality benchmarks that test a model's
Chunk 4 · 1,998 chars
capabilities in following instructions (Winata et al., 2021; Weller et al., 2024), embedding long documents (Zhu et al., 2024), solving reasoning tasks (Xiao et al., 2024a; Su et al., 2024), and cross-lingual retrieval (Franco-Salvador et al., 2014). For an overview see Figure 1.
¹ MMTEB comes with open-source code available at https://github.com/embeddings-benchmark/mteb and a public leaderboard available at https://huggingface.co/spaces/mteb/leaderboard.
Figure 1: An overview of MMTEB. The boxes represent the overall task categories with a sample of task categories represented within each. Blue borders represent closely-related task categories.
Given the known co-occurrence of limited computational resources and low-resource languages, often referred to as the "low-resource double bind" (Ahia et al., 2021), we made it our goal to make the MMTEB benchmark accessible to low-resource communities. Evaluating models extensively is often resource-intensive. For example, evaluating a single 7B large language model (LLM) on the HELM benchmark consumes over 4,000 GPU hours (Liang et al., 2022). Similarly, the English MTEB (henceforth referred to as MTEB(eng, v1)) benchmark requires up to two days of processing on a single A100 GPU even for moderately sized LLMs (Muennighoff et al., 2023b; BehnamGhader et al., 2024). These high resource demands pose a challenge for low-resource language communities that often lack access to powerful computing resources. MMTEB addresses these challenges by expanding its coverage and optimizing the evaluation process. It significantly reduces computational cost (3.11
Chunk 5 · 1,995 chars
hours on an H100 GPU for a 7B model) by using only 2% of the original documents (6% of the original number of characters) while maintaining sensitivity as a benchmark to rank models accurately.
2 MMTEB CONSTRUCTION
2.1 OPEN SCIENCE EFFORT
To ensure the broad applicability of MMTEB across various domains, we recruited a diverse group of contributors. We actively encouraged participation from industry professionals, low-resource language communities, and academic researchers. To clarify authorship assignment and recognize desired contributions, we implemented a point-based system, similar to Lovenia et al. (2024). To facilitate transparency, coordination was managed through GitHub. A detailed breakdown of contributors and the point system can be found in Appendix A.
2.2 ENSURING TASK QUALITY
To guarantee the quality of the added tasks,² each task was reviewed by at least one of the main contributors. In addition, we required task submissions to include metadata fields. These fields included details such as annotation source, dataset source, license, dialects, and citation information. Appendix B.4 provides a comprehensive description of each field.
Furthermore, we ensured that the performance on submitted tasks fell within a reasonable range to avoid trivially low or unrealistically high performance. Therefore, we required two multilingual models to be run on the task: multilingual-e5-small (Wang et al., 2022) and MiniLM-L12 (Reimers & Gurevych, 2019). A task was examined further if the models obtained scores close to a random baseline (within a 2% margin), a near-perfect score, or if both models obtained roughly similar scores.
² A task includes a
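The sanity check in Section 2.2 can be read as a simple flagging rule. The sketch below is one possible reading, not the authors' implementation: the function name `needs_review`, the 0-to-1 score scale, the interpretation of the 2% margin as an absolute 0.02, and the `near_perfect` and `similar_margin` thresholds are all assumptions made for illustration (the text gives no numbers for the last two).

```python
def needs_review(score_a, score_b, random_baseline,
                 baseline_margin=0.02, near_perfect=0.99, similar_margin=0.01):
    """Return the reasons a task should be examined further (empty list if none).

    score_a, score_b: scores of the two reference models on the task (0-1 scale).
    random_baseline:  expected score of a random predictor on the same task.
    All thresholds except the 2% baseline margin are hypothetical defaults.
    """
    reasons = []
    scores = (score_a, score_b)
    # Flag if either model is within the margin of the random baseline.
    if any(abs(s - random_baseline) <= baseline_margin for s in scores):
        reasons.append("close to random baseline")
    # Flag if either model is (almost) perfect, which suggests a trivial task.
    if any(s >= near_perfect for s in scores):
        reasons.append("near-perfect score")
    # Flag if the task cannot tell the two reference models apart.
    if abs(score_a - score_b) <= similar_margin:
        reasons.append("both models score similarly")
    return reasons


print(needs_review(0.51, 0.70, random_baseline=0.5))
print(needs_review(0.62, 0.81, random_baseline=0.5))
```

A flagged task is a candidate for manual inspection rather than automatic exclusion, which matches the paper's note that a task could still be included after failing these checks.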
Chunk 6 · 1,993 chars
dataset and an implementation for model evaluation.
These tasks were examined for flawed implementation or poor data quality. Afterwards, a decision was made to either exclude or include the task. We consulted with contributors who are familiar with the target language whenever possible before the final decision. A task could be included despite failing these checks. For example, scores close to the random baseline might be due to the task's inherent difficulty rather than poor data quality.
2.3 ACCESSIBILITY AND BENCHMARK OPTIMIZATION
As detailed in Section 1, extensive benchmark evaluations often require significant computational resources. This trend is also observed in MTEB(eng, v1) (Muennighoff et al., 2023b), where running moderately sized LLMs can take up to two days on a single A100 GPU. Accessibility for low-resource communities is particularly important for MMTEB, considering the common co-occurrence of computational constraints (Ahia et al., 2021). Below, we discuss three main strategies implemented to make our benchmark more efficient. We describe further code optimizations in Appendix C.2.
2.3.1 DOWNSAMPLING AND CACHING EMBEDDINGS
The first strategy involves optimizing the evaluation process by downsampling datasets and caching embeddings. Encoding a large volume of documents for tasks such as retrieval and clustering can be a significant bottleneck in evaluation. Downsampling involves selecting a representative subset of the dataset and reducing the number of documents that require processing. Caching embeddings prevents redundant encoding by using already
Chunk 7 · 1,991 chars
processed documents.
Clustering. In MTEB, clustering is evaluated by computing the v-measure score (Rosenberg & Hirschberg, 2007) on text embeddings clustered using k-means. This process is repeated over multiple distinct sets, inevitably resulting in a large number of documents being encoded. To reduce this encoding burden, we propose a bootstrapping approach that reuses encoded documents across sets. We first encode a 4% subsample of the corpus and sample 10 sets without replacement. Each set undergoes k-means clustering, and we record performance estimates. For certain tasks, this approach reduces the number of documents encoded by 100×. In Appendix B.2, we compare both approaches and find an average speedup of 16.11x across tasks, while preserving the relative ranking of models (Average Spearman correlation: 0.96).
Retrieval. A key challenge in retrieval tasks is encoding large document collections, which can contain millions of entries (Nguyen et al., 2024). To maintain performance comparable to the original datasets while reducing the collection size, we adopted the TREC pooling strategy (Buckley et al., 2007; Soboroff & Robertson, 2003), which aggregates scores from multiple models to select representative documents.³ For each dataset, we retained the top 250 ranked documents per query, a threshold determined through initial tests that showed negligible differences in absolute scores and no changes in relative rankings across representative models (see Appendix C.1.2 for details on downsampling effects). These documents are merged to form a smaller representative collection. For datasets exceeding 1,000 queries, we randomly sampled 1,000
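The retrieval downsampling described in this chunk can be illustrated with a small pooling sketch. This is not the MMTEB code: the names `pool_documents` and `sample_queries`, the input layout, and the choice to take each model's top-k per query and merge the results are assumptions; the paper only states that the top 250 ranked documents per query are retained and merged, and that large query sets are randomly sampled down to 1,000.

```python
import random


def pool_documents(rankings_by_model, top_k=250):
    """Merge the top-k documents per query from several models into one pool.

    rankings_by_model: {model_name: {query_id: [doc_id, ...]}} with each list
    ordered best-first (hypothetical layout for illustration).
    """
    pooled = set()
    for per_query in rankings_by_model.values():
        for ranked_docs in per_query.values():
            pooled.update(ranked_docs[:top_k])
    return pooled


def sample_queries(query_ids, max_queries=1000, seed=0):
    """Randomly keep at most max_queries queries, deterministically per seed."""
    ids = sorted(query_ids)
    if len(ids) <= max_queries:
        return ids
    return sorted(random.Random(seed).sample(ids, max_queries))


rankings = {
    "bm25": {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]},
    "dense": {"q1": ["d2", "d6", "d1"], "q2": ["d5", "d7"]},
}
print(sorted(pool_documents(rankings, top_k=2)))
```

In practice the documents judged relevant for each query would presumably also be kept in the pool so that the evaluation labels remain valid; that step is omitted here because the text does not spell it out.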
Chunk 8 · 1,995 chars
queries, reducing the largest datasets from over 5 million documents to a maximum of 250,000. This approach accelerated evaluation while preserving ranking performance.
Bitext Mining. We apply similar optimization to bitext mining tasks. Some datasets, such as Flores (Costa-jussà et al., 2022), share the same sentences across several language pairs (e.g., English sentences are the same in the English-Hindi pair and the English-Bosnian pair). By caching the embeddings, we reduce the number of embedding computations, making it linear in the number of languages instead of quadratic. For the English documents within Flores this results in a reduction of documents needed to be embedded from 410,000 in MTEB(eng, v1) to just 1,012 in our benchmark.
³ We utilized a range of models: BM25 for lexical hard negatives, e5-multilingual-large as a top-performing BERT-large multilingual model, and e5-Mistral-Instruct 7B, the largest model leveraging instruction-based data.
2.3.2 ENCOURAGING SMALLER DATASET SUBMISSIONS
The second strategy focused on encouraging contributors to downsample datasets before submission. To achieve this, we used a stratified split based on target categories. This helped us to ensure that the downsampled datasets could effectively differentiate between candidate models. To validate the process, we compared scores before and after downsampling. For details, we refer to Appendix C.1.
2.3.3 TASK SELECTION
To further reduce the computation overhead we seek to construct a task subset that can reliably predict task scores outside the subset. For task selection, we followed an approach
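The effect of caching embeddings for bitext mining can be shown with a toy counter. The sketch below is illustrative only: `CachingEncoder`, `toy_encode`, and the four-language example are hypothetical stand-ins, and the cache is keyed on the sentence text, which is one simple way to reuse embeddings across language pairs.

```python
from itertools import combinations


class CachingEncoder:
    """Wrap an encode function so each distinct sentence is embedded only once."""

    def __init__(self, encode_fn):
        self._encode_fn = encode_fn
        self._cache = {}
        self.encode_calls = 0  # number of sentences actually sent to the model

    def encode(self, sentences):
        # Deduplicate while keeping order, then encode only unseen sentences.
        missing = [s for s in dict.fromkeys(sentences) if s not in self._cache]
        if missing:
            self.encode_calls += len(missing)
            for sentence, vector in zip(missing, self._encode_fn(missing)):
                self._cache[sentence] = vector
        return [self._cache[s] for s in sentences]


def toy_encode(batch):
    # Stand-in for a real embedding model: a one-dimensional "embedding".
    return [[float(len(s))] for s in batch]


languages = ["eng", "hin", "bos", "dan"]
sentences = {lang: [f"{lang} sentence {i}" for i in range(3)] for lang in languages}

encoder = CachingEncoder(toy_encode)
naive_encodes = 0
for src, tgt in combinations(languages, 2):
    encoder.encode(sentences[src])
    encoder.encode(sentences[tgt])
    naive_encodes += len(sentences[src]) + len(sentences[tgt])

print("without cache:", naive_encodes, "with cache:", encoder.encode_calls)
```

With four languages there are six pairs, so the uncached count grows with the number of pairs (quadratic in the number of languages), while the cached count grows with the number of languages.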
Chunk 9 · 1,999 chars
inspired by Xia et al. (2020). We seek to estimate the scores $s_{t,m_i}$ of a model $m_i \in M$ on an unobserved task $t$ based on scores on observed tasks $s_{j,m_k} \in S$, $j \neq t$. This allows us to consider the performance of tasks as features within a prediction problem. Thus we can treat task selection as feature reduction, a well-formulated task within machine learning. Note that this formulation allows us to keep the unobserved task arbitrary, representing generalization to unseen tasks (Chollet, 2019). We used a backward selection method, where one task is left out to be predicted, an estimator⁴ is fitted on the performance of all models except one, and the score of the held-out model is predicted. This process is repeated until predicted scores are generated for all models on all tasks. The most predictable task is then removed, leaving the estimators in the task subset group. Optionally, we can add additional criteria to ensure task diversity and language representation. Spearman's rank correlation was chosen as the similarity score, as it best preserved the relative ranking when applied to MTEB(eng, v1).
2.4 BENCHMARK CONSTRUCTION
From the extensive collection of tasks in MMTEB, we developed several representative benchmarks, including a highly multilingual benchmark, MTEB(Multilingual), as well as regional geopolitical benchmarks, MTEB(Europe) and MTEB(Indic). Additionally, we introduce a faster version of MTEB(eng, v1) (Muennighoff et al., 2023b), which we refer to as MTEB(eng, v2). MMTEB also integrates domain-specific benchmarks like CoIR for code retrieval (Li et al., 2024) and LongEmbed for long document retrieval (Zhu et al., 2024). MMTEB also introduces
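The backward selection loop in Section 2.3.3 can be sketched in a few lines. The code below is a simplified illustration under stated assumptions, not the authors' procedure: the paper fits a linear regression with task scores as features, whereas this sketch swaps in a single-feature least-squares line on the mean score of the remaining tasks so that it runs on the standard library alone. The function names, the `threshold` and `min_tasks` parameters, and the score layout are hypothetical.

```python
from statistics import mean


def _ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def _pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return _pearson(_ranks(x), _ranks(y))


def _fit_line(x, y):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    var = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / var if var else 0.0
    return slope, my - slope * mx


def predictability(scores, task, subset):
    """Leave-one-model-out prediction of `task`, scored with Spearman's rho.

    scores: {model: {task: score}}. The feature is the mean score on the other
    tasks in `subset` (a simplification of the paper's linear regression).
    """
    others = [t for t in subset if t != task]
    feature = {m: mean(scores[m][t] for t in others) for m in scores}
    predicted, observed = [], []
    for held_out in scores:
        train = [m for m in scores if m != held_out]
        slope, intercept = _fit_line([feature[m] for m in train],
                                     [scores[m][task] for m in train])
        predicted.append(slope * feature[held_out] + intercept)
        observed.append(scores[held_out][task])
    return spearman(predicted, observed)


def backward_select(scores, tasks, threshold=0.8, min_tasks=2):
    """Repeatedly drop the most predictable task until none reaches threshold."""
    subset = list(tasks)
    while len(subset) > min_tasks:
        rho = {t: predictability(scores, t, subset) for t in subset}
        most_predictable = max(subset, key=rho.get)
        if rho[most_predictable] < threshold:
            break
        subset.remove(most_predictable)
    return subset
```

The additional retention rules mentioned in the paper (keeping a task whose removal would drop a language or task category, or whose prediction error is large) would be extra conditions checked before `subset.remove`; they are left out to keep the sketch short.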
Chunk 10 · 1,977 chars
language-specific benchmarks, extending the existing suite that includes Scandinavian (Enevoldsen et al., 2024), Chinese (Xiao et al., 2024b), Polish (Poświata et al., 2024), and French (Ciancone et al., 2024). For an overview of the benchmarks, we refer to Appendix H.1.
In the following section, we detail a methodology that we designed to create more targeted and concise benchmarks. This methodology includes: 1) clearly defining the initial scope of the benchmark (Initial Scope), 2) reducing the number of tasks by iteratively selecting tasks based on intertask correlation (Refined Scope), and 3) performing a thorough manual review (Task Selection and Review). We provide an overview in Table 1. In addition to these benchmarks, we provide accompanying code to facilitate the creation of new benchmarks, to allow communities and companies to create tailored benchmarks. In the following, we present MTEB(Multilingual) and MTEB(eng, v2) as two example cases. For a comprehensive overview of benchmark construction and the tasks included in each benchmark, we refer to Appendix H.2.
MTEB(Multilingual): We select all available languages within MMTEB as the initial scope of the benchmark. This results in 550 tasks. We reduce this selection by removing machine-translated datasets, datasets with under-specified licenses, and highly domain-specific datasets such as code-retrieval datasets. This results in 343 tasks covering >250 languages. Following this selection, we evaluate this subset using a representative selection of models (see Section 3.1) and apply task selection to remove the most predictable tasks. To ensure language diversity and
Chunk 11 · 1,995 chars
representation across task categories, we avoid removing a task that would eliminate a language from the respective task category. Additionally, we did not remove a task if the mean squared error between predicted and observed scores exceeded 0.5 standard deviations. This is to avoid inadvertently overindexing to easier tasks. The process of iterative task removal (Section 2.3.3) is repeated until the most predictable held-out task obtained a Spearman correlation of less than 0.8 between predicted and observed scores, or if no tasks were available for filtering. This results in a final selection of 131 diverse tasks. Finally, the selected tasks were reviewed, if possible, by contributors who spoke the target language. If needed, the selection criteria were updated, and some tasks were manually replaced with higher-quality alternatives.
⁴ We use the term "estimator" to differentiate it from the evaluated embedding model. For our estimator, we use linear regression.
Benchmark            Initial Scope  Refined Scope  Task Selection and Review
MTEB(Multilingual)   >500           343            132
MTEB(Europe)         420            228            74
MTEB(Indic)          55             44             23
MTEB(eng, v2)        56             54             41
Table 1: Number of tasks in each benchmark after each filtering step. The initial scope includes tasks relevant to the benchmark goal, notably language of interest. The refined scope further reduced the scope, e.g. removing datasets with underspecified licenses.
MTEB(eng, v2): Unlike the multilingual benchmarks which target a language group, this benchmark is designed to match MTEB(eng, v1), incorporating computational efficiencies (see Section 2.3) and reducing the
Chunk 12 · 1,991 chars
intertask correlation using task selection. To prevent overfitting, we intend it as a zero-shot benchmark, excluding tasks like MS MARCO (Nguyen et al., 2016) and Natural Questions (Kwiatkowski et al., 2019), which are frequently used in fine-tuning.
We start the construction by replacing each task with its optimized variant. This updated set obtains a Spearman correlation of 0.97, p < .0001 (Pearson 0.99, p < .0001) with MTEB(eng, v1) using mean aggregation for the selected models (see Subsection 3.1). The task selection process then proceeds similarly to MTEB(Multilingual), ensuring task diversity by retaining a task if its removal would eliminate a task category. Tasks where the mean squared error between predicted and observed performance exceeds 0.2 standard deviations are also retained. This process continues until the most predictable held-out task yields a Spearman correlation below 0.9 between predicted and observed scores. The final selection consists of 41 tasks. We compare this with MTEB(eng, v1) (Muennighoff et al., 2023b) in Section 4.
3 EXPERIMENTAL SETTINGS
3.1 MODELS
We select a representative set of models, focusing on multilingual models across various size categories. We benchmark the multilingual LaBSE (Feng et al., 2022), trained on paraphrase corpora, and English and multilingual versions of the MPNet (Song et al., 2020) and MiniLM (Wang et al., 2021b) models, trained on diverse datasets. We also evaluate the multilingual e5 series models (Wang et al., 2024; 2022) trained using a two-step approach utilizing weak supervision. Additionally, to understand the role of scale as well as instruction finetuning, we benchmark GritLM-7B
Chunk 13 · 1,997 chars
(Muennighoff et al., 2024) and e5-multilingual-7b-instruct (Wang et al., 2023), which are both based on the Mistral 7B model (Jiang et al., 2023). Revision IDs, model implementation, and prompts used are available in Appendix G. We ran the models on all the implemented tasks to encourage further analysis of the model results. Results, including multiple performance metrics, runtime, CO2 emissions, model metadata, etc., are publicly available in the versioned results repository.⁵
⁵ https://github.com/embeddings-benchmark/results.
Figure 2: Mean performance across tasks on MTEB(Multilingual) according to the number of parameters. The circle size denotes the embedding size, while the color denotes the maximum sequence length of the model. To improve readability, only certain labels are shown. We refer to the public leaderboard for interactive visualization. We see that the notably smaller model obtains comparable performance to Mistral 7B and GritLM-7B; note that these overlap in the figure due to the similarity of the two models.
3.2 EVALUATION SCORES
For our performance metrics, we report average scores across all tasks, scores per task category, and averages weighted by task category. We compute model ranks using the Borda count method (Colombo et al., 2022), derived from social choice theory. This method, which is also employed in election systems based on preference ranking, has been shown to be more robust for comparing NLP systems. To compute this score, we consider each task as a preference voter voting for each model, and scores are aggregated according to the Borda Count method. In the case
Chunk 14 · 1,994 chars
of ties, we use the tournament Borda count method.
3.3 MULTILINGUAL PERFORMANCE
While MMTEB includes multiple benchmarks (see Appendix H.1), we select three multilingual benchmarks to showcase. These constitute a fully multilingual benchmark MTEB(Multilingual) and two targeting languages with varying levels of resources: MTEB(Europe) and MTEB(Indic). The performance of our selected models on these tasks can be seen in Table 2. For performance metrics per task, across domains, etc., we refer to Appendix E.
4 ANALYSIS AND DISCUSSION
Table 2 shows the performance across the three presented multilingual benchmarks. Two trends are clearly observable. Models trained with instruction tuning perform significantly better than those without it. This is especially clear when comparing multilingual-e5-large to its instruction-tuned counterpart (multilingual-e5-large-instruct). Instruction tuning increases performance most drastically on bitext mining and clustering, though the effect remains pronounced across all task categories. Notably, this happens despite many tasks using generic prompts for the task category and no model-specific tuning of prompts per task. Surprisingly, multilingual-e5-large(-instruct) models, based on XLM-R Large (Conneau et al., 2019), generally outperform the considerably larger e5-mistral-7b-instruct and GritLM-7B, both of which are based on Mistral-7B (Jiang et al., 2023). This effect is notably pronounced for mid-to-low resource languages (<300M speakers; see Appendix E.1) and likely emerges due to differences in pre-training, with Mistral being predominantly pre-trained on English, while XLM-R targets 100 languages. All
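The Borda count aggregation from Section 3.2 can be sketched as follows. This is an illustrative reading rather than the benchmark's own code: each task acts as a voter, a model earns one point for every model it beats on that task, and ties are handled in the tournament style, assumed here to mean half a point per tied opponent. The function names and the input layout are hypothetical.

```python
def tournament_borda(scores_by_task):
    """Aggregate per-task scores into Borda points.

    scores_by_task: {task: {model: score}} (hypothetical layout). For each task,
    a model gets 1 point per model it outscores and 0.5 per model it ties with.
    """
    totals = {}
    for model_scores in scores_by_task.values():
        for model, score in model_scores.items():
            points = 0.0
            for other, other_score in model_scores.items():
                if other == model:
                    continue
                if score > other_score:
                    points += 1.0
                elif score == other_score:
                    points += 0.5
            totals[model] = totals.get(model, 0.0) + points
    return totals


def rank_models(scores_by_task):
    """Order models by Borda points, best first (name breaks exact ties)."""
    totals = tournament_borda(scores_by_task)
    return sorted(totals, key=lambda m: (-totals[m], m))


example = {
    "task_1": {"model_a": 0.9, "model_b": 0.8, "model_c": 0.7},
    "task_2": {"model_a": 0.5, "model_b": 0.5, "model_c": 0.4},
}
print(tournament_borda(example))
print(rank_models(example))
```

Because only the per-task ordering matters, a task with a wide score range cannot dominate the aggregate the way it can under mean aggregation.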
Chunk 15 · 1,997 chars
three models utilize similarly multilingual datasets for fine-tuning. However, GritLM still remains best in class for retrieval on MTEB(Multilingual); it has a higher maximum sequence length (see Figure 2) and outperforms multilingual-e5-large-instruct on MTEB(Code) and MTEB(eng, v2).
Discrepancies in Multilingual benchmarks ranking seem to stem from discrepancies in pre-training. While the multilingual benchmarks obtain seemingly similar performance rankings, we see a few notable discrepancies. These discrepancies seem to mainly stem from a narrow multilingual focus (GritLM-7B, e5-mistral-7b-instruct, multilingual-mpnet-base) during training, resulting in disproportionally higher performance on the targeted languages (typically mid-high resource or Euro-
Columns: Rank (↑) with Borda count in parentheses; average across all tasks (All) and across categories (Cat); then average per category.
Model                           Rank (Borda)  Avg (All)  Avg (Cat)  Btxt   Pr Clf Clf    STS    Rtrvl  M. Clf Clust  Rrnk
MTEB(Multilingual)
Number of datasets              (132)         (132)      (132)      (13)   (11)   (43)   (16)   (18)   (5)    (17)   (6)
multilingual-e5-large-instruct  1 (1375)      63.2       62.1       80.1   80.9   64.9   76.8   57.1   22.9   51.5   62.6
GritLM-7B                       2 (1258)      60.9       60.1       70.5   79.9   61.8   73.3   58.3   22.8   50.5   63.8
e5-mistral-7b-instruct          3 (1233)      60.3       59.9       70.6   81.1   60.3   74.0   55.8   22.2   51.4   63.8
multilingual-e5-large           4 (1109)      58.6       58.2       71.7   79.0   59.9   73.5   54.1   21.3   42.9   62.8
multilingual-e5-base            5 (944)       57.0       56.5       69.4   77.2   58.2   71.4   52.7   20.2   42.7   60.2
multilingual-mpnet-base         6 (830)       52.0       51.1       52.1   81.2   55.1   69.7   39.8   16.4   41.1   53.4
multilingual-e5-small           7 (784)       55.5       55.2       67.5   76.3   56.5   70.4   49.3   19.1   41.7   60.4
Chunk 16 ¡ 1,995 chars
54.1 21.3 42.9 62.8 multilingual-e5-base 5 (944) 57.0 56.5 69.4 77.2 58.2 71.4 52.7 20.2 42.7 60.2 multilingual-mpnet-base 6 (830) 52.0 51.1 52.1 81.2 55.1 69.7 39.8 16.4 41.1 53.4 multilingual-e5-small 7 (784) 55.5 55.2 67.5 76.3 56.5 70.4 49.3 19.1 41.7 60.4 LaBSE 8 (719) 52.1 51.9 76.4 76.0 54.6 65.3 33.2 20.1 39.2 50.2 multilingual-MiniLM-L12 9 (603) 48.8 48.0 44.6 79.0 51.7 66.6 36.6 14.9 39.3 51.0 all-mpnet-base 10 (526) 42.5 41.1 21.2 70.9 47.0 57.6 32.8 16.3 40.8 42.2 all-MiniLM-L12 11 (490) 42.2 40.9 22.9 71.7 46.8 57.2 32.5 14.6 36.8 44.3 all-MiniLM-L6 12 (418) 41.4 39.9 20.1 71.2 46.2 56.1 32.5 15.1 38.0 40.3 MTEB(Europe) Number of datasets (â) (74) (74) (74) (7) (6) (21) (9) (15) (2) (6) (3) GritLM-7B 1 (757) 63.0 62.7 90.4 89.9 64.7 76.1 57.1 17.6 45.3 60.3 multilingual-e5-large-instruct 2 (732) 62.2 62.3 90.4 90.0 63.2 77.4 54.8 17.3 46.9 58.4 e5-mistral-7b-instruct 3 (725) 61.7 61.9 89.6 91.2 62.9 76.5 53.6 15.5 46.5 59.8 multilingual-e5-large 4 (586) 58.5 58.7 84.5 88.8 60.4 75.8 50.8 15.0 38.2 55.9 multilingual-e5-base 5 (499) 57.2 57.5 84.1 87.4 57.9 73.7 50.2 14.9 38.2 53.9 multilingual-mpnet-base 6 (463) 54.4 54.7 79.5 90.7 56.6 74.3 41.2 6.9 35.8 52.3 multilingual-e5-small 7 (399) 55.0 55.7 80.9 86.4 56.1 71.6 46.1 14.0 36.5 54.1 LaBSE 8 (358) 51.8 53.5 88.8 85.2 55.1 65.7 34.4 16.3 34.3 48.7 multilingual-MiniLM-L12 9 (328) 51.7 52.4 77.0 88.9 52.7 72.5 37.6 5.7 34.4 50.2 all-mpnet-base 10 (310) 44.7 44.7 29.8 80.5 49.2 63.9 37.3 10.9 36.2 49.6 all-MiniLM-L12 11 (292) 44.4 44.1 32.1 81.5 49.2 64.2 36.2 7.6 32.5 49.2 all-MiniLM-L6 12 (237) 43.4 43.2 27.2 80.2 47.8 62.7 37.3 8.8 33.6 47.7 MTEB(Indic) Number of datasets (â) (23) (23) (23) (4) (1) (13) (1) (2) (0) (1) (1) multilingual-e5-large-instruct 1 (209) 70.2 71.6 80.4 76.3 67.0 53.7 84.9 51.7 87.5 multilingual-e5-large 2 (188) 66.4 65.1 77.7 75.1 64.7 43.9 82.6 25.6 86.0 multilingual-e5-base 3 (173) 64.6 62.6 74.2 72.8 63.8 41.1 77.8 24.6 83.8 multilingual-e5-small 4 (164) 64.7 63.2 73.7
Chunk 17 ¡ 1,996 chars
23) (23) (4) (1) (13) (1) (2) (0) (1) (1) multilingual-e5-large-instruct 1 (209) 70.2 71.6 80.4 76.3 67.0 53.7 84.9 51.7 87.5 multilingual-e5-large 2 (188) 66.4 65.1 77.7 75.1 64.7 43.9 82.6 25.6 86.0 multilingual-e5-base 3 (173) 64.6 62.6 74.2 72.8 63.8 41.1 77.8 24.6 83.8 multilingual-e5-small 4 (164) 64.7 63.2 73.7 73.8 63.8 40.8 76.8 29.1 84.4 GritLM-7B 5 (151) 60.2 58.0 58.4 67.8 60.0 27.2 79.5 28.0 84.7 e5-mistral-7b-instruct 6 (144) 60.0 58.4 59.1 73.0 59.6 23.0 77.3 32.7 84.4 LaBSE 7 (139) 61.9 59.7 74.1 64.6 61.9 52.8 64.3 21.1 79.0 multilingual-mpnet-base 8 (137) 58.5 55.2 44.2 82.0 61.9 34.1 57.9 32.1 74.3 multilingual-MiniLM-L12 9 (98) 49.7 42.2 15.3 77.8 57.6 19.8 48.8 16.7 59.3 all-mpnet-base 10 (68) 33.6 22.6 3.7 52.6 45.2 -2.5 12.9 4.0 42.6 all-MiniLM-L12 11 (49) 33.1 23.2 3.5 55.0 43.9 -5.3 13.9 3.7 47.6 all-MiniLM-L6 12 (40) 31.8 20.4 2.5 53.7 44.1 -6.3 6.2 3.1 39.2 Table 2: The results for three multilingual benchmarks are ranked using Borda count. We provide averages across all tasks, per task category, and weighted by task category. The task categories are shortened as follows: Bitext Mining (Btxt), Pair Classification (Pr Clf), Classification (Clf), Semantic text similarity (STS), Retrieval (Rtrvl), Multilabel Classification (M. Clf), Clustering and Hierarchical Clustering (Clust) and Reranking (Rrnk). We highlight the best score in bold. Note that while Instruction retrieval (Weller et al., 2024) is included in MTEB(Europe) and MTEB(Multilingual), but is excluded from the average by task category due to limited model support. For a broader model evaluation, refer to the public leaderboard. pean ones). These are typically outperformed by the multilingually pre trained XLM-Roberta-based multilingual-e5-large-instruct on lower-resource languages in MTEB(Europe) and all languages in 8 -- 8 of 57 -- Published as a conference paper at ICLR 2025 Figure 3: Performance rank of top 3 multilingual models on languages in MTEB(Europe) and MTEB(Indic)
Chunk 18 ¡ 1,997 chars
rformed by the multilingually pre trained XLM-Roberta-based multilingual-e5-large-instruct on lower-resource languages in MTEB(Europe) and all languages in 8 -- 8 of 57 -- Published as a conference paper at ICLR 2025 Figure 3: Performance rank of top 3 multilingual models on languages in MTEB(Europe) and MTEB(Indic) and by the number of native speakers. We see that Mistral-based models are outper- formed by multilingual-e5-large-instruct on lower-resource languages, despite it having substantially fewer parameters. Figure 4: Performance difference on MTEB(eng, v1) (flag) and MTEB(Multilingual) (globe). MTEB(Indic) (see Figure 3), despite being substantially smaller than Mistral models, the performance of which steadily decreases and becomes more volatile for languages with increasingly lower number of native speakers and this trade-off is well-known (Xue et al., 2020). Besides these, we observe the expected detrimental performance of English models (all-MiniLM- L12, all-MiniLM-L6, all-mpnet-base) applied to non-English languages and a relatively high bitext performance of LaBSE (see Figure 4). MTEB(eng, v1) vs. zero-shot MTEB(eng, v2) We compare the performance of MTEB(eng, v1) and MTEB(eng, v2) in Figure 5 obtaining a Spearman correlation of 0.90, p < 0.0001 (Pearson 0.96, p < 0.0001). For the precise scores, we refer to Subsection H.3. This includes a reduction from 56 to 40 tasks along with optimized task runtime speeding up the runtime on the benchmark (3.11 hours for GritLM-7B and 0.81 hours for all-MiniLM-L12 on an H100). We see that notably, the smaller English models (all-MiniLM-L12, all-MiniLM-L6, all-mpnet-base) perform worse on the new benchmark. This is likely because they were trained on MS MARCO and Natural questions, which were removed as part of the benchmark conversion to a zero-shot benchmark. 5 RELATED WORK Text Embedding Benchmarks. BEIR (Thakur et al., 2021) pioneered the use of publicly available datasets from diverse information retrieval
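The rank-agreement check reported for MTEB(eng, v1) and MTEB(eng, v2) can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' evaluation code. Because the v1 and v2 scores are only listed in Subsection H.3, the example reuses the "Average Across All" columns of Table 2 for MTEB(Multilingual) and MTEB(Europe) as stand-in input. The helper names (`average_ranks`, `pearson`, `spearman`) are hypothetical and introduced only for this sketch.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ascending ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# "Average Across All" scores from Table 2, listed in the same model order
# (stand-in data; NOT the v1/v2 scores discussed in the text).
multilingual = [63.2, 60.9, 60.3, 58.6, 57.0, 52.0, 55.5, 52.1, 48.8, 42.5, 42.2, 41.4]
europe = [62.2, 63.0, 61.7, 58.5, 57.2, 54.4, 55.0, 51.8, 51.7, 44.7, 44.4, 43.4]

print(f"Spearman: {spearman(multilingual, europe):.3f}")
print(f"Pearson:  {pearson(multilingual, europe):.3f}")
```

On this stand-in data the Spearman correlation is about 0.986, because only two pairs of adjacent models swap places between the two benchmarks. Spearman depends only on the ordering of models, whereas Pearson also reflects how far apart the scores are, which is why the two coefficients are reported side by side.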
5 RELATED WORK

Text Embedding Benchmarks. BEIR (Thakur et al., 2021) pioneered the use of publicly available datasets from diverse information retrieval (IR) tasks and domains and evaluated 10 different retrieval systems. MTEB (Muennighoff et al., 2023b) introduced a comprehensive text embedding benchmark that spans not only IR but also 7 other task categories, including clustering and re-ranking. MTEB covers a total of 58 tasks and 112 languages, though this multilinguality is mainly derived from machine-translated tasks or bitext mining. Its leaderboard has grown in popularity and evolved into the de facto embedding model benchmark, supporting over 300 models. MIRACL (Zhang et al., 2022) supports 18 languages from different language families for monolingual retrieval. MINERS (Winata et al., 2024b) is designed to evaluate the ability of multilingual LMs in semantic retrieval tasks, including classification and bitext mining, in more than 200 languages, including code-switching. Our work extends the number of languages to over 1000 (250 excluding bitext-mining tasks), particularly to cover more low-resource languages. We also expand MTEB's 8 embedding tasks to 10 and its 58 datasets to over 400, significantly broadening the scope of multilingual benchmarking.

Figure 5: Performance on MTEB(eng, v1) and MTEB(eng, v2).

Massive Collaborative Projects. Open research initiatives and participatory approaches to science have been shown to stimulate innovation (Park et al., 2023), reduce negative biases (Gudowsky, 2021; Gomez et al., 2022), and increase the diversity of data sources (Hanley et al., 2020; Singh et al., 2024b; Winata et al., 2024a).
By involving diverse stakeholders, these practices enhance ethical, robust, and reproducible research (Hagerty & Rubinov, 2019). Recently, the field of natural language processing has seen a growing number of community-driven collaborative projects. These can be grouped into several categories: (a) Model creation, such as BLOOM (BigScience Workshop et al., 2023; Muennighoff et al., 2023c), StarCoder (Li et al., 2023a; Lozhkov et al., 2024), the Aya model (Üstün et al., 2024), and Cendol (Cahyawijaya et al., 2024); (b) Dataset creation, such as NusaX (Winata et al., 2023b), OpenAssistant (Köpf et al., 2023), NusaWrites (Cahyawijaya et al., 2023c), and the Aya dataset (Singh et al., 2024b); (c) Benchmark creation, such as BIG-Bench (Srivastava et al., 2023), NusaCrowd (Cahyawijaya et al., 2023a), WorldCuisines (Winata et al., 2024a), HLE (Phan et al., 2025), SEACrowd (Lovenia et al., 2024), and Eval-Harnesses (Gao et al., 2021; Ben Allal et al., 2022; Biderman et al., 2024); and (d) Other artifacts, such as NL-Augmenter (Dhole et al., 2021), the Data Provenance Initiative (Longpre et al., 2023; 2024a;b), or the Wikibench annotation tool (Kuo et al., 2024). MMTEB expands upon earlier work within the Benchmark creation category. Our effort differs significantly from prior collaborative benchmarks in that we focus on text embeddings, use a custom point system to incentivize contributions, and handle all communication openly via GitHub.

6 CONCLUSION

This work introduced the Massive Multilingual Text Embedding Benchmark (MMTEB), a large-scale open collaboration resulting in a benchmark with more than 500 tasks covering more than 1000 languages. From these, we constructed three multilingual benchmarks: one fully multilingual (MTEB(Multilingual)) and two targeting Indic (MTEB(Indic)) and European languages (MTEB(Europe)), respectively. Acknowledging that multiple additional benchmarks can be constructed from the MMTEB additions, we propose a simple approach to constructing new benchmarks. To make these benchmarks accessible to low-resource communities, we introduced several optimizations: downsampling retrieval tasks using hard negative mining, and bootstrapping clustering evaluation to re-use encoded documents across sets. This leads to a notable reduction in the number of text samples that need to be embedded.

Our findings indicate that while large (7B) LLM-based embedding models obtain state-of-the-art performance on the English benchmark, they are still outperformed in highly multilingual or low-resource settings by smaller models based on XLM-R Large, even when accounting for notable improvements like prompt-based embeddings.

LIMITATIONS

English Leakage. While MMTEB filters out machine-translated datasets, it permits (human) translations. This inclusion leads to tasks like SIB200ClusteringS2S, where labels from English samples are transferred to their translations, potentially introducing bias towards English or models trained on translated content. Consequently, the benchmark may inadvertently encourage model developers to favor English or translated content by increasing their proportion in pre-training data.
Credit Assignment for Large-scale Collaborations. One of MMTEB's goals was to highlight the benefits of collaboration. The managing group believes the point system successfully defined contribution terms but acknowledges it isn't perfect. For instance, equal points were awarded for dataset submissions regardless of effort: some datasets were readily available, while others needed significant work like reformulation, HTML parsing, and multiple review rounds.

Language Representation. While the benchmark includes over 250 languages and 500 tasks, the distribution is skewed toward high-resource languages (see Figure 6), with low-resource languages being better represented in specific task categories like bitext-mining and classification. We encourage future collaborations to fill these gaps and enhance language diversity in the collection.

Figure 6: Number of tasks per language (x-axis: Language; y-axis: Number of Tasks). For readability, we remove English (290 tasks) and only plot the 100 languages with the most tasks.

ETHICAL CONSIDERATIONS

We acknowledge the environmental impact of the benchmark that stems from the compute needed across tasks. As such, emissions tracking is added using codecarbon (Courty et al., 2024) to measure kilograms of CO2-equivalents (CO2eq) and estimate the carbon footprint per task.
The benchmark is a collaborative project and contains datasets of different data quality and origin. Thus, additional efforts are still required to identify and minimize biases in the benchmark datasets.

REFERENCES

David Ifeoluwa Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev, Jesujoba O Alabi, Yanke Mao, Haonan Gao, and Annie En-Shiun Lee. Sib-200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects. arXiv preprint arXiv:2309.07445, 2023a.

David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, sana al azzawi, Blessing Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi, Tunde Ajayi, Tatiana Moteu, Brian Odhiambo, Abraham Owodunni, Nnaemeka Obiefuna, Muhidin Mohamed, Shamsuddeen Hassan Muhammad, Teshome Mulugeta Ababu, Saheed Abdullahi Salahudeen, Mesay Gemeda Yigezu, Tajuddeen Gwadabe, Idris Abdulmumin, Mahlet Taye, Oluwabusayo Awoyomi, Iyanuoluwa Shode, Tolulope Adelani, Habiba Abdulganiyu, Abdul-Hakeem Omotayo, Adetola Adeeko, Abeeb Afolabi, Anuoluwapo Aremu, Olanrewaju Samuel, Clemencia Siro, Wangari Kimotho, Onyekachi Ogbu, Chinedu Mbonu, Chiamaka Chukwuneke, Samuel Fanijo, Jessica Ojo, Oyinkansola Awosan, Tadesse Kebede, Toadoum Sari Sakayo, Pamela Nyatsine, Freedmore Sidume, Oreen Yousuf, Mardiyyah Oduwole, Tshinu Tshinu, Ussen Kimanuka, Thina Diko, Siyanda Nxakama, Sinodos Nigusse, Abdulmejid Johar, Shafie Mohamed, Fuad Mire Hassan, Moges Ahmed Mehamed, Evrard Ngabire, Jules Jules, Ivan Ssenkungu, and Pontus Stenetorp. Masakhanews: News topic classification for african languages, 2023b.
Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. Semeval-2012 task 6: a pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics - Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, SemEval '12, pp. 385–393, USA, 2012. Association for Computational Linguistics.

Eneko Agirre, Daniel Matthew Cer, Mona T. Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. *sem 2013 shared task: Semantic textual similarity. In International Workshop on Semantic Evaluation, 2013. URL https://api.semanticscholar.org/CorpusID:10241043.

Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, et al. Semeval-2015 task 2: Semantic textual similarity, english, spanish and pilot on interpretability. In Proceedings of the 9th international workshop on semantic evaluation (SemEval 2015), pp. 252–263, 2015.

Gustavo Aguilar, Sudipta Kar, and Thamar Solorio. Lince: A centralized benchmark for linguistic code-switching evaluation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pp. 1803–1813, 2020.

Orevaoghene Ahia, Julia Kreutzer, and Sara Hooker. The low-resource double bind: An empirical study of pruning for low-resource machine translation. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 3316–3333, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.282. URL https://aclanthology.org/2021.findings-emnlp.282.

Vesa Akerman, David Baines, Damien Daspit, Ulf Hermjakob, Taeho Jang, Colin Leong, Michael Martin, Joel Mathew, Jonathan Robie, and Marcus Schwarting. The ebible corpus: Data and model benchmarks for bible translation for low-resource languages. arXiv preprint arXiv:2304.09919, 2023.

Dimosthenis Antypas, Asahi Ushio, Jose Camacho-Collados, Leonardo Neves, Vitor Silva, and Francesco Barbieri. Twitter Topic Classification. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, October 2022. International Committee on Computational Linguistics.

Gaurav Arora. iNLTK: Natural language toolkit for indic languages. In Eunjeong L. Park, Masato Hagiwara, Dmitrijs Milajevs, Nelson F. Liu, Geeticka Chauhan, and Liling Tan (eds.), Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), pp. 66–71, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.nlposs-1.10. URL https://aclanthology.org/2020.nlposs-1.10.

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. On the cross-lingual transferability of monolingual representations. CoRR, abs/1910.11856, 2019.

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
Soran Badawi, Arefeh Kazemi, and Vali Rezaie. Kurdisent: a corpus for kurdish sentiment analysis. Language Resources and Evaluation, pp. 1–20, January 2024. doi: 10.1007/s10579-023-09716-6.

Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. The belebele benchmark: a parallel reading comprehension dataset in 122 language variants. arXiv preprint arXiv:2308.16884, 2023.

Anil Bandhakavi, Nirmalie Wiratunga, Deepak P, and Stewart Massie. Generating a word-emotion lexicon from #emotional tweets. In Johan Bos, Anette Frank, and Roberto Navigli (eds.), Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014), pp. 12–21, Dublin, Ireland, August 2014. Association for Computational Linguistics and Dublin City University. doi: 10.3115/v1/S14-1002. URL https://aclanthology.org/S14-1002.

Francesco Barbieri, Luis Espinosa Anke, and Jose Camacho-Collados. XLM-T: Multilingual language models in Twitter for sentiment analysis and beyond. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 258–266, Marseille, France, June 2022. European Language Resources Association. URL https://aclanthology.org/2022.lrec-1.27.

Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. Llm2vec: Large language models are secretly powerful text encoders. arXiv preprint arXiv:2404.05961, 2024.

Loubna Ben Allal, Niklas Muennighoff, Logesh Kumar Umapathi, Ben Lipkin, and Leandro von Werra. A framework for the evaluation of code generation models, 2022. URL https://github.com/bigcode-project/bigcode-evaluation-harness.
Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Wen-tau Yih, and Yejin Choi. Abductive commonsense reasoning. In International Conference on Learning Representations, 2020.

Paheli Bhattacharya, Kripabandhu Ghosh, Saptarshi Ghosh, Arindam Pal, Parth Mehta, Arnab Bhattacharya, and Prasenjit Majumder. Aila 2019 precedent & statute retrieval task, October 2020. URL https://doi.org/10.5281/zenodo.4063986.

Ergun Biçici. RTM-DCU: Predicting semantic similarity with referential translation machines. In Preslav Nakov, Torsten Zesch, Daniel Cer, and David Jurgens (eds.), Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 56–63, Denver, Colorado, June 2015. Association for Computational Linguistics. doi: 10.18653/v1/S15-2010. URL https://aclanthology.org/S15-2010.

Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, et al. Lessons from the trenches on reproducible evaluation of language models. arXiv preprint arXiv:2405.14782, 2024.

Teven Le Scao BigScience Workshop, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, et al. Bloom: A 176b-parameter open-access multilingual language model, 2023.
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 7432–7439, 2020.

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens, 2022.

Vera Boteva, Demian Gholipour, Artem Sokolov, and Stefan Riezler. A full-text learning to rank dataset for medical information retrieval. 2016. URL http://www.cl.uni-heidelberg.de/~riezler/publications/papers/ECIR2016.pdf.

Chris Buckley, Darrin Dimmick, Ian Soboroff, and Ellen Voorhees. Bias and the limits of pooling for large collections. Information retrieval, 10:491–508, 2007.

Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Winata, Bryan Wilie, Fajri Koto, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, et al. Nusacrowd: Open source initiative for indonesian nlp resources. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 13745–13818, 2023a.

Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Dea Adhista, Emmanuel Dave, Sarah Oktavianti, Salsabil Akbar, Jhonson Lee, Nuur Shadieq, Tjeng Wawan Cenggoro, Hanung Linuwih, Bryan Wilie, Galih Muridan, Genta Winata, David Moeljadi, Alham Fikri Aji, Ayu Purwarianti, and Pascale Fung. Nusawrites: Constructing high-quality corpora for underrepresented and extremely low-resource languages. In Jong C. Park, Yuki Arase, Baotian Hu, Wei Lu, Derry Wijaya, Ayu Purwarianti, and Adila Alfa Krisnadhi (eds.), Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 921–945, Nusa Dua, Bali, November 2023b. Association for Computational Linguistics. URL https://aclanthology.org/2023.ijcnlp-main.60.

Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Dea Adhista, Emmanuel Dave, Sarah Oktavianti, Salsabil Akbar, Jhonson Lee, Nuur Shadieq, Tjeng Wawan Cenggoro, et al. Nusawrites: Constructing high-quality corpora for underrepresented and extremely low-resource languages. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 921–945, 2023c.

Samuel Cahyawijaya, Holy Lovenia, Fajri Koto, Rifki Afina Putri, Emmanuel Dave, Jhonson Lee, Nuur Shadieq, Wawan Cenggoro, Salsabil Maulana Akbar, Muhammad Ihza Mahendra, et al. Cendol: Open instruction-tuned generative large language models for indonesian languages. arXiv preprint arXiv:2404.06138, 2024.

Iñigo Casanueva, Tadas Temcinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. Efficient intent detection with dual sentence encoders. In Tsung-Hsien Wen, Asli Celikyilmaz, Zhou Yu, Alexandros Papangelis, Mihail Eric, Anuj Kumar, Iñigo Casanueva, and Rushin Shah (eds.), Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pp. 38–45, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.nlp4convai-1.5. URL https://aclanthology.org/2020.nlp4convai-1.5.
Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Steven Bethard, Marine Carpuat, Marianna Apidianaki, Saif M. Mohammad, Daniel Cer, and David Jurgens (eds.), Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 1–14, Vancouver, Canada, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/S17-2001. URL https://aclanthology.org/S17-2001.

Ilias Chalkidis, Manos Fergadiotis, and Ion Androutsopoulos. Multieurlex - a multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2021. URL https://arxiv.org/abs/2109.00904.

Amit Kumar Chaudhary, Kurt Micallef, and Claudia Borg. Topic classification and headline generation for Maltese using a public news corpus. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Association for Computational Linguistics, May 2024.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, et al. Evaluating large language models trained on code, 2021.

Xi Chen, Ali Zeynali, Chico Camargo, Fabian Flöck, Devin Gaffney, Przemyslaw Grabowicz, Scott Hale, David Jurgens, and Mattia Samory. SemEval-2022 task 8: Multilingual news article similarity. In Guy Emerson, Natalie Schluter, Gabriel Stanovsky, Ritesh Kumar, Alexis Palmer, Nathan Schneider, Siddharth Singh, and Shyam Ratan (eds.), Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pp. 1094–1106, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.semeval-1.155. URL https://aclanthology.org/2022.semeval-1.155.

François Chollet. On the Measure of Intelligence. arXiv:1911.01547 [cs], November 2019. URL http://arxiv.org/abs/1911.01547. arXiv: 1911.01547.

Mathieu Ciancone, Imene Kerboua, Marion Schaeffer, and Wissam Siblini. Extending the massive text embedding benchmark to french, 2024.

cjadams, Daniel Borkan, inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and nithum. Jigsaw unintended bias in toxicity classification, 2019. URL https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

Benjamin Clavié. Jacolbert and hard negatives, towards better japanese-first embeddings for retrieval: Early technical report, 2023.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021.
Chunk 48 ¡ 1,990 chars
ers, Jordan Boyd- Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association 21 -- 21 of 57 -- Published as a conference paper at ICLR 2025 for Computational Linguistics (Volume 2: Short Papers), pp. 816â826, Toronto, Canada, jul 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-short.71. URL https://aclanthology.org/2023.acl-short.71. Andani Madodonga, Vukosi Marivate, and Matthew Adendorff. Izindaba-tindzaba: Machine learning news categorisation for long and short text for isizulu and siswati. 4, Jan. 2023. doi: 10.55492/dhasa. v4i01.4449. URL https://upjournals.up.ac.za/index.php/dhasa/article/view/4449. Wei Chen Maggie, Phil Culliton. Tweet sentiment extraction, 2020. URL https://kaggle.com/ competitions/tweet-sentiment-extraction. Rahmad Mahendra, Alham Fikri Aji, Samuel Louvan, Fahrurrozi Rahman, and Clara Vania. IndoNLI: A natural language inference dataset for Indonesian. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 10511â10527, Online and Punta Cana, Dominican Republic, nov 2021. Association for Computational Linguistics. URL https: //aclanthology.org/2021.emnlp-main.821. Arthur Malajyan, Karen Avetisyan, and Tsolak Ghukasyan. Arpa: Armenian paraphrase detection corpus and models, 2020. P. Malo, A. Sinha, P. Korhonen, J. Wallenius, and P. Takala. Good debt or bad debt: Detecting semantic orientations in economic texts. Journal of the Association for Information Science and Technology, 65, 2014. Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. A SICK cure for the evaluation of compositional distributional semantic models. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis (eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LRECâ14), pp.
Chunk 49 ¡ 1,997 chars
tion of compositional distributional semantic models. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis (eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LRECâ14), pp. 216â 223, Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA). URL http://www.lrec-conf.org/proceedings/lrec2014/pdf/363_Paper.pdf. Vukosi Marivate, Moseli MotsâOehli, Valencia Wagner, Richard Lastrucci, and Isheanesu Dzingirai. Puoberta: Training and evaluation of a curated language model for setswana. In SACAIR 2023 (To Appear), 2023. Philip May. Machine translated multilingual sts benchmark dataset. 2021. URL https://github. com/PhilipMay/stsb-multi-mt. Philip May, Brooke Fujita, and Tom Aarsen. stsb-multi-mt, 2021. URL https://github.com/ PhilipMay/stsb-multi-mt. GitHub repository. Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. Sfr- embedding-mistral:enhance text retrieval with transfer learning. Salesforce AI Research Blog, 2024. URL https://blog.salesforceairesearch.com/sfr-embedded-mistral/. Yev Meyer, Marjan Emadi, Dhruv Nathawani, Lipika Ramaswamy, Kendrick Boyd, Maarten Van Seg- broeck, Matthew Grossman, Piotr Mlocek, and Drew Newberry. Synthetic-Text-To-SQL: A syn- thetic dataset for training language models to generate sql queries from natural language prompts, April 2024. URL https://huggingface.co/datasets/gretelai/synthetic-text-to-sql. Roshanak Mirzaee, Hossein Rajaby Faghihi, Qiang Ning, and Parisa Kordjamshidi. Spartqa: A textual question answering benchmark for spatial reasoning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4582â4598, 2021. Julius Monsen and Arne J"onsson. A method for building non-english corpora for abstractive text summarization. In Proceedings of
Chunk 50 ¡ 1,993 chars
for spatial reasoning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4582â4598, 2021. Julius Monsen and Arne J"onsson. A method for building non-english corpora for abstractive text summarization. In Proceedings of CLARIN Annual Conference, 2021. Niklas Muennighoff. Sgpt: Gpt sentence embeddings for semantic search, 2022. Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. Octopack: Instruction tuning code large language models. arXiv preprint arXiv:2308.07124, 2023a. 22 -- 22 of 57 -- Published as a conference paper at ICLR 2025 Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. MTEB: Massive text em- bedding benchmark. In Andreas Vlachos and Isabelle Augenstein (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 2014â2037, Dubrovnik, Croatia, May 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.eacl-main.148. URL https://aclanthology.org/2023.eacl-main.148. Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. Crosslingual generalization through multitask finetuning, 2023c. Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. Generative representational instruction tuning, 2024. Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, Nedjma Ousidhoum, David Ifeoluwa Adelani, Seid Muhie Yimam, Ibrahim Saâid Ahmad, Meriem Beloucif, Saif Mo- hammad, Sebastian Ruder, Oumaima Hourrane, Pavel Brazdil, Felermino Dâario Mâario Antâonio Ali, Davis Davis,
Chunk 51 ¡ 1,995 chars
esentational instruction tuning, 2024. Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, Nedjma Ousidhoum, David Ifeoluwa Adelani, Seid Muhie Yimam, Ibrahim Saâid Ahmad, Meriem Beloucif, Saif Mo- hammad, Sebastian Ruder, Oumaima Hourrane, Pavel Brazdil, Felermino Dâario Mâario Antâonio Ali, Davis Davis, Salomey Osei, Bello Shehu Bello, Falalu Ibrahim, Tajuddeen Gwadabe, Samuel Rutunda, Tadesse Belay, Wendimu Baye Messelle, Hailu Beshada Balcha, Sisay Adugna Chala, Hagos Tesfahun Gebremichael, Bernard Opoku, and Steven Arthur. Afrisenti:a twitter sentiment analysis benchmark for african languages. 2023. Timo MĂśller, Julian Risch, and Malte Pietsch. Germanquad and germandpr: Improving non-english question answering and passage retrieval, 2021. Jørgen Johnsen Navjord and Jon-Mikkel Ryen Korsvik. Beyond extractive: advancing abstractive au- tomatic text summarization in norwegian with transformers. Masterâs thesis, Norwegian University of Life Sciences, Ă s, 2023. Thong Nguyen, Mariya Hendriksen, Andrew Yates, and Maarten de Rijke. Multimodal learned sparse retrieval with probabilistic expansion control. In European Conference on Information Retrieval, pp. 448â464. Springer, 2024. Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. MS MARCO: A human generated machine reading comprehension dataset. CoRR, abs/1611.09268, 2016. URL http://arxiv.org/abs/1611.09268. Dan Nielsen. ScandEval: A benchmark for Scandinavian natural language processing. In Tanel Alum"ae and Mark Fishel (eds.), Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pp. 185â201, Tâorshavn, Faroe Islands, may 2023. University of Tartu Library. URL https://aclanthology.org/2023.nodalida-1.20. Joel Niklaus, Matthias StĂźrmer, and Ilias Chalkidis. An empirical study on cross-x transfer for legal judgment prediction, 2022. Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher
Chunk 52 ¡ 1,995 chars
slands, may 2023. University of Tartu Library. URL https://aclanthology.org/2023.nodalida-1.20. Joel Niklaus, Matthias StĂźrmer, and Ilias Chalkidis. An empirical study on cross-x transfer for legal judgment prediction, 2022. Joakim Nivre, Marie-Catherine De Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, and Natalia Silveira. Universal dependen- cies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LRECâ16), pp. 1659â1666, 2016. Jeppe Nørregaard and Leon Derczynski. DanFEVER: claim verification dataset for Danish. In Simon Dobnik and Lilja Ăvrelid (eds.), Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pp. 422â428, Reykjavik, Iceland (Online), 2021. Link"oping University Electronic Press, Sweden. URL https://aclanthology.org/2021.nodalida-main.47. Maciej Ogrodniczuk and Mateusz Kopeâc. The Polish summaries corpus. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis (eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LRECâ14), pp. 3712â3715, Reykjavik, Iceland, may 2014. European Language Resources Association (ELRA). URL http://www.lrec-conf.org/ proceedings/lrec2014/pdf/1211_Paper.pdf. 23 -- 23 of 57 -- Published as a conference paper at ICLR 2025 Maciej Ogrodniczuk and Ĺukasz Kobyli´nski (eds.). Proceedings of the PolEval 2019 Workshop, Warsaw, Poland, 2019. Institute of Computer Science, Polish Academy of Sciences. ISBN 978-83- 63159-28-3. URL http://2019.poleval.pl/files/poleval2019.pdf. James OâNeill, Polina Rozenshtein, Ryuichi Kiryo, Motoko Kubota, and Danushka Bollegala. I wish I would have loved this one, but I didnât â a multilingual dataset for counterfactual detection in product review. In Marie-Francine Moens, Xuanjing
Chunk 53 ¡ 1,991 chars
ISBN 978-83- 63159-28-3. URL http://2019.poleval.pl/files/poleval2019.pdf. James OâNeill, Polina Rozenshtein, Ryuichi Kiryo, Motoko Kubota, and Danushka Bollegala. I wish I would have loved this one, but I didnât â a multilingual dataset for counterfactual detection in product review. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7092â7108, Online and Punta Cana, Dominican Republic, nov 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.568. URL https: //aclanthology.org/2021.emnlp-main.568. Nedjma Ousidhoum, Shamsuddeen Hassan Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Said Ahmad, Sanchit Ahuja, Alham Fikri Aji, Vladimir Araujo, Abinew Ali Ayele, Pavan Baswani, et al. Semrel2024: A collection of semantic textual relatedness datasets for 14 languages. arXiv preprint arXiv:2402.08638, 2024. Hille Pajupuu, Jaan Pajupuu, Rene Altrov, and Kairi Tamuri. Estonian Valence Corpus / Eesti valentsikorpus. 11 2023. doi: 10.6084/m9.figshare.24517054.v1. URL https://figshare.com/ articles/dataset/Estonian_Valence_Corpus_Eesti_valentsikorpus/24517054. Christos Papaloukas, Ilias Chalkidis, Konstantinos Athinaios, Despina-Athanasia Pantazi, and Mano- lis Koubarakis. Multi-granular legal topic classification on greek legislation. In Proceedings of the Natural Legal Language Processing Workshop 2021, pp. 63â75, Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. doi: 10.48550/arXiv.2109.15298. URL https://arxiv.org/abs/2109.15298. Shantipriya Parida, Sambit Sekhar, Soumendra Kumar Sahoo, Swateek Jena, Abhijeet Parida, Satya Ranjan Dash, and Guneet Singh Kohli. Odiagenai: Generative ai and llm initiative for the odia language. https://huggingface.co/OdiaGenAI, 2023. Michael Park, Erin Leahey, and Russell J. Funk. Papers and patents are becoming less disruptive over time. Nature,
Chunk 54 ¡ 1,996 chars
Soumendra Kumar Sahoo, Swateek Jena, Abhijeet Parida, Satya Ranjan Dash, and Guneet Singh Kohli. Odiagenai: Generative ai and llm initiative for the odia language. https://huggingface.co/OdiaGenAI, 2023. Michael Park, Erin Leahey, and Russell J. Funk. Papers and patents are becoming less disruptive over time. Nature, 613:138â144, 2023. URL https://api.semanticscholar.org/CorpusID: 255466666. Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, Adam Khoja, Ryan Kim, Jason Hausenloy, Oliver Zhang, Mantas Mazeika, Daron Anderson, Tung Nguyen, Mobeen Mahmood, Fiona Feng, Steven Y. Feng, Haoran Zhao, Michael Yu, Varun Gangal, Chelsea Zou, Zihan Wang, Jessica P. Wang, Pawan Kumar, Oleksandr Pokutnyi, Robert Gerbicz, Serguei Popov, John-Clark Levin, Mstyslav Kazakov, Johannes Schmitt, Geoff Galgon, Alvaro Sanchez, Yongki Lee, Will Yeadon, Scott Sauers, Marc Roth, Chidozie Agu, Søren Riis, Fabian Giska, Saiteja Utpala, Zachary Giboney, Gashaw M. Goshu, Joan of Arc Xavier, Sarah-Jane Crowson, Mohinder Maheshbhai Naiya, Noah Burns, Lennart Finke, Zerui Cheng, Hyunwoo Park, Francesco Fournier-Facio, John Wydallis, Mark Nandor, Ankit Singh, Tim Gehrunger, Jiaqi Cai, Ben McCarty, Darling Duclosel, Jungbae Nam, Jennifer Zampese, Ryan G. Hoerr, Aras Bacho, Gautier Abou Loume, Abdallah Galal, Hangrui Cao, Alexis C Garretson, Damien Sileo, Qiuyu Ren, Doru Cojoc, Pavel Arkhipov, Usman Qazi, Lianghui Li, Sumeet Motwani, Christian Schroeder de Witt, Edwin Taylor, Johannes Veith, Eric Singer, Taylor D. Hartman, Paolo Rissone, Jaehyeok Jin, Jack Wei Lun Shi, Chris G. Willcocks, Joshua Robinson, Aleksandar Mikov, Ameya Prabhu, Longke Tang, Xavier Alapont, Justine Leon Uro, Kevin Zhou, Emily de Oliveira Santos, Andrey Pupasov Maksimov, Edward Vendrow, Kengo Zenitani, Julien Guillod, Yuqi Li, Joshua Vendrow, Vladyslav Kuchkin, Ng Ze- An, Pierre Marion, Denis Efremov, Jayson Lynch, Kaiqu Liang, Andrew Gritsevskiy,
Chunk 55 ¡ 1,993 chars
eksandar Mikov, Ameya Prabhu, Longke Tang, Xavier Alapont, Justine Leon Uro, Kevin Zhou, Emily de Oliveira Santos, Andrey Pupasov Maksimov, Edward Vendrow, Kengo Zenitani, Julien Guillod, Yuqi Li, Joshua Vendrow, Vladyslav Kuchkin, Ng Ze- An, Pierre Marion, Denis Efremov, Jayson Lynch, Kaiqu Liang, Andrew Gritsevskiy, Dakotah Martinez, Ben Pageler, Nick Crispino, Dimitri Zvonkine, Natanael Wildner Fraga, Saeed Soori, Ori Press, Henry Tang, Julian Salazar, Sean R. Green, Lina Brßssel, Moon Twayana, Aymeric Dieuleveut, T. Ryan Rogers, Wenjin Zhang, Bikun Li, Jinzhou Yang, Arun Rao, Gabriel Loiseau, Mikhail Kalinin, Marco Lukas, Ciprian Manolescu, Subrata Mishra, Ariel Ghislain Kemogne Kamdoum, Tobias Kreiman, Tad Hogg, Alvin Jin, Carlo Bosio, Gongbo Sun, Brian P Coppola, Tim Tarver, Haline Heidinger, Rafael Sayous, Stefan Ivanov, Joseph M Cavanagh, Jiawei Shen, Joseph Marvin Imperial, Philippe Schwaller, Shaipranesh Senthilkuma, Andres M Bran, Ali Dehghan, Andres Algaba, Brecht Verbeken, David Noever, Ragavendran P V, Lisa Schut, Ilia 24 -- 24 of 57 -- Published as a conference paper at ICLR 2025 Sucholutsky, Evgenii Zheltonozhskii, Derek Lim, Richard Stanley, Shankar Sivarajan, Tong Yang, John Maar, Julian Wykowski, Martà Oller, Jennifer Sandlin, Anmol Sahu, Yuzheng Hu, Sara Fish, Nasser Heydari, Archimedes Apronti, Kaivalya Rawal, Tobias Garcia Vilchis, Yuexuan Zu, Martin Lackner, James Koppel, Jeremy Nguyen, Daniil S. Antonenko, Steffi Chern, Bingchen Zhao, Pierrot Arsene, Alan Goldfarb, Sergey Ivanov, RafaŠPo´swiata, Chenguang Wang, Daofeng Li, Donato Crisostomi, Andrea Achilleos, Benjamin Myklebust, Archan Sen, David Perrella, Nurdin Kaparov, Mark H Inlow, Allen Zang, Elliott Thornley, Daniil Orel, Vladislav Poritski, Shalev Ben-David, Zachary Berger, Parker Whitfill, Michael Foster, Daniel Munro, Linh Ho, Dan Bar Hava, Aleksey Kuchkin, Robert Lauff, David Holmes, Frank Sommerhage, Keith Schneider, Zakayo Kazibwe, Nate Stambaugh, Mukhwinder Singh, Ilias
Chunk 56 ¡ 1,995 chars
v, Mark H Inlow, Allen Zang, Elliott Thornley, Daniil Orel, Vladislav Poritski, Shalev Ben-David, Zachary Berger, Parker Whitfill, Michael Foster, Daniel Munro, Linh Ho, Dan Bar Hava, Aleksey Kuchkin, Robert Lauff, David Holmes, Frank Sommerhage, Keith Schneider, Zakayo Kazibwe, Nate Stambaugh, Mukhwinder Singh, Ilias Magoulas, Don Clarke, Dae Hyun Kim, Felipe Meneguitti Dias, Veit Elser, Kanu Priya Agarwal, Victor Efren Guadarrama Vilchis, Immo Klose, Christoph Demian, Ujjwala Anantheswaran, Adam Zweiger, Guglielmo Albani, Jeffery Li, Nicolas Daans, Maksim Radionov, VĂĄclav RozhoËn, Ziqiao Ma, Christian Stump, Mohammed Berkani, Jacob Platnick, Volodymyr Nevirkovets, Luke Basler, Marco Piccardo, Ferenc Jeanplong, Niv Cohen, Josef Tkadlec, Paul Rosu, Piotr Padlewski, Stanislaw Barzowski, Kyle Montgomery, Aline Menezes, Arkil Patel, Zixuan Wang, Jamie Tucker-Foltz, Jack Stade, Tom Goertzen, Fereshteh Kazemi, Jeremiah Milbauer, John Arnold Ambay, Abhishek Shukla, Yan Carlos Leyva Labrador, Alan GivrĂŠ, Hew Wolff, Vivien Rossbach, Muhammad Fayez Aziz, Younesse Kaddar, Yanxu Chen, Robin Zhang, Jiayi Pan, Antonio Terpin, Niklas Muennighoff, Hailey Schoelkopf, Eric Zheng, Avishy Carmi, Adam Jones, Jainam Shah, Ethan D. L. Brown, Kelin Zhu, Max Bartolo, Richard Wheeler, Andrew Ho, Shaul Barkan, Jiaqi Wang, Martin Stehberger, Egor Kretov, Kaustubh Sridhar, Zienab EL-Wasif, Anji Zhang, Daniel Pyda, Joanna Tam, David M. Cunningham, Vladimir Goryachev, Demosthenes Patramanis, Michael Krause, Andrew Redenti, Daniel Bugas, David Aldous, Jesyin Lai, Shannon Coleman, Mohsen Bahaloo, Jiangnan Xu, Sangwon Lee, Sandy Zhao, Ning Tang, Michael K. Cohen, Micah Carroll, Orr Paradise, Jan Hendrik Kirchner, Stefan Steinerberger, Maksym Ovchynnikov, Jason O. Matos, Adithya Shenoy, Benedito Alves de Oliveira Junior, Michael Wang, Yuzhou Nie, Paolo Giordano, Philipp Petersen, Anna Sztyber-Betley, Priti Shukla, Jonathan Crozier, Antonella Pinto, Shreyas Verma, Prashant Joshi, Zheng-Xin Yong,
Chunk 57 ¡ 1,997 chars
r Paradise, Jan Hendrik Kirchner, Stefan Steinerberger, Maksym Ovchynnikov, Jason O. Matos, Adithya Shenoy, Benedito Alves de Oliveira Junior, Michael Wang, Yuzhou Nie, Paolo Giordano, Philipp Petersen, Anna Sztyber-Betley, Priti Shukla, Jonathan Crozier, Antonella Pinto, Shreyas Verma, Prashant Joshi, Zheng-Xin Yong, Allison Tee, JĂŠrĂŠmy AndrĂŠoletti, Orion Weller, Raghav Singhal, Gang Zhang, Alexander Ivanov, Seri Khoury, Hamid Mostaghimi, Kunvar Thaman, Qijia Chen, Tran Quoc KhĂĄnh, Jacob Loader, Stefano Cavalleri, Hannah Szlyk, Zachary Brown, Jonathan Roberts, William Alley, Kunyang Sun, Ryan Stendall, Max Lamparth, Anka Reuel, Ting Wang, Hanmeng Xu, Sreenivas Goud Raparthi, Pablo HernĂĄndez-CĂĄmara, Freddie Martin, Dmitry Malishev, Thomas Preu, Tomek Korbak, Marcus Abramovitch, Dominic Williamson, Ziye Chen, BirĂł BĂĄlint, M Saiful Bari, Peyman Kassani, Zihao Wang, Behzad Ansarinejad, Laxman Prasad Goswami, Yewen Sun, Hossam Elgnainy, Daniel Tordera, George Balabanian, Earth Anderson, Lynna Kvistad, Alejandro JosĂŠ Moyano, Rajat Maheshwari, Ahmad Sakor, Murat Eron, Isaac C. McAlister, Javier Gimenez, Innocent Enyekwe, Andrew Favre D. O., Shailesh Shah, Xiaoxiang Zhou, Firuz Kamalov, Ronald Clark, Sherwin Abdoli, Tim Santens, Khalida Meer, Harrison K Wang, Kalyan Ramakrishnan, Evan Chen, Alessandro Tomasiello, G. Bruno De Luca, Shi-Zhuo Looi, Vinh-Kha Le, Noam Kolt, Niels MĂźndler, Avi Semler, Emma Rodman, Jacob Drori, Carl J Fossum, Milind Jagota, Ronak Pradeep, Honglu Fan, Tej Shah, Jonathan Eicher, Michael Chen, Kushal Thaman, William Merrill, Carter Harris, Jason Gross, Ilya Gusev, Asankhaya Sharma, Shashank Agnihotri, Pavel Zhelnov, Siranut Usawasutsakorn, Mohammadreza Mofayezi, Sergei Bogdanov, Alexander Piperski, Marc Carauleanu, David K. 
Zhang, Dylan Ler, Roman Leventov, Ignat Soroko, Thorben Jansen, Pascal Lauer, Joshua Duersch, Vage Taamazyan, Wiktor Morak, Wenjie Ma, William Held, Tran Ăuc Huy, Ruicheng Xian, Armel Randy Zebaze, Mohanad Mohamed, Julian Noah
Chunk 58 ¡ 1,998 chars
akorn, Mohammadreza Mofayezi, Sergei Bogdanov, Alexander Piperski, Marc Carauleanu, David K. Zhang, Dylan Ler, Roman Leventov, Ignat Soroko, Thorben Jansen, Pascal Lauer, Joshua Duersch, Vage Taamazyan, Wiktor Morak, Wenjie Ma, William Held, Tran Ăuc Huy, Ruicheng Xian, Armel Randy Zebaze, Mohanad Mohamed, Julian Noah Leser, Michelle X Yuan, Laila Yacar, Johannes Lengler, Hossein Shahrtash, Edson Oliveira, Joseph W. Jackson, Daniel Espinosa Gonzalez, Andy Zou, Muthu Chidambaram, Timothy Manik, Hector Haffenden, Dashiell Stander, Ali Dasouqi, Alexander Shen, Emilien Duc, Bita Golshani, David Stap, Mikalai Uzhou, Alina Borisovna Zhidkovskaya, Lukas Lewark, MĂĄtyĂĄs Vincze, Dustin Wehr, Colin Tang, Zaki Hossain, Shaun Phillips, Jiang Muzhen, Fredrik EkstrĂśm, Angela Hammon, Oam Patel, Nicolas Remy, Faraz Farhidi, George Medley, Forough Mohammadzadeh, Madellene PeĂąaflor, Haile Kassahun, Alena Friedrich, Claire Sparrow, Taom Sakal, Omkar Dhamane, Ali Khajegili Mirabadi, Eric Hallman, Mike Battaglia, Mohammad Maghsoudimehrabani, Hieu Hoang, Alon Amit, Dave Hulbert, Roberto Pereira, Simon Weber, Stephen Mensah, Nathan Andre, Anton Peristyy, Chris Harjadi, Himanshu Gupta, Stephen Malina, Samuel Albanie, Will Cai, Mustafa Mehkary, Frank Reidegeld, Anna-Katharina Dick, Cary Friday, Jasdeep Sidhu, Wanyoung Kim, Mariana Costa, Hubeyb Gurdogan, Brian Weber, 25 -- 25 of 57 -- Published as a conference paper at ICLR 2025 Harsh Kumar, Tong Jiang, Arunim Agarwal, Chiara Ceconello, Warren S. Vaz, Chao Zhuang, Haon Park, Andrew R. Tawfeek, Daattavya Aggarwal, Michael Kirchhof, Linjie Dai, Evan Kim, Johan Ferret, Yuzhou Wang, Minghao Yan, Krzysztof Burdzy, Lixin Zhang, Antonio Franca, Diana T. Pham, Kang Yong Loh, Joshua Robinson, Shreen Gul, Gunjan Chhablani, Zhehang Du, Adrian Cosma, Colin White, Robin Riblet, Prajvi Saxena, Jacob Votava, Vladimir Vinnikov, Ethan Delaney, Shiv Halasyamani, Syed M. Shahid, Jean-Christophe Mourrat, Lavr Vetoshkin, Renas Bacho, Vincent Ginis, Aleksandr
Chunk 59 ¡ 1,992 chars
tonio Franca, Diana T. Pham, Kang Yong Loh, Joshua Robinson, Shreen Gul, Gunjan Chhablani, Zhehang Du, Adrian Cosma, Colin White, Robin Riblet, Prajvi Saxena, Jacob Votava, Vladimir Vinnikov, Ethan Delaney, Shiv Halasyamani, Syed M. Shahid, Jean-Christophe Mourrat, Lavr Vetoshkin, Renas Bacho, Vincent Ginis, Aleksandr Maksapetyan, Florencia de la Rosa, Xiuyu Li, Guillaume Malod, Leon Lang, Julien Laurendeau, Fatimah Adesanya, Julien Portier, Lawrence Hollom, Victor Souza, Yuchen Anna Zhou, YiËgit YalÄąn, Gbenga Daniel Obikoya, Luca Arnaboldi, Rai, Filippo Bigi, Kaniuar Bacho, Pierre Clavier, Gabriel Recchia, Mara Popescu, Nikita Shulga, Ngefor Mildred Tanwie, Thomas C. H. Lux, Ben Rank, Colin Ni, Alesia Yakimchyk, Huanxu, Liu, Olle HäggstrĂśm, Emil Verkama, Himanshu Narayan, Hans Gundlach, Leonor Brito-Santana, Brian Amaro, Vivek Vajipey, Rynaa Grover, Yiyang Fan, Gabriel Poesia Reis e Silva, Linwei Xin, Yosi Kratish, Jakub Ĺucki, Wen-Ding Li, Justin Xu, Kevin Joseph Scaria, Freddie Vargus, Farzad Habibi, Long, Lian, Emanuele RodolĂ , Jules Robins, Vincent Cheng, Declan Grabb, Ida Bosio, Tony Fruhauff, Ido Akov, Eve J. Y. Lo, Hao Qi, Xi Jiang, Ben Segev, Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Michael P. Brenner, Mao Mao, Yibo Jiang, Xinyu Zhang, David Avagian, Eshawn Jessica Scipio, Muhammad Rehan Siddiqi, Alon Ragoler, Justin Tan, Deepakkumar Patil, Rebeka Plecnik, Aaron Kirtland, Roselynn Grace Montecillo, Stephane Durand, Omer Faruk Bodur, Zahra Adoul, Mohamed Zekry, Guillaume Douville, Ali Karakoc, Tania C. B. Santos, Samir Shamseldeen, Loukmane Karim, Anna Liakhovitskaia, Nate Resman, Nicholas Farina, Juan Carlos Gonzalez, Gabe Maayan, Sarah Hoback, Rodrigo De Oliveira Pena, Glen Sherman, Hodjat Mariji, Rasoul Pouriamanesh, Wentao Wu, GĂśzdenur Demir, Sandra Mendoza, Ismail Alarab, Joshua Cole, Danyelle Ferreira, Bryan Johnson, Hsiaoyun Milliron, Mohammad Safdari, Liangti Dai, Siriphan Arthornthurasuk, Alexey Pronin, Jing Fan, Angel
Chunk 60 ¡ 1,978 chars
lez, Gabe Maayan, Sarah Hoback, Rodrigo De Oliveira Pena, Glen Sherman, Hodjat Mariji, Rasoul Pouriamanesh, Wentao Wu, GĂśzdenur Demir, Sandra Mendoza, Ismail Alarab, Joshua Cole, Danyelle Ferreira, Bryan Johnson, Hsiaoyun Milliron, Mohammad Safdari, Liangti Dai, Siriphan Arthornthurasuk, Alexey Pronin, Jing Fan, Angel Ramirez-Trinidad, Ashley Cartwright, Daphiny Pottmaier, Omid Taheri, David Outevsky, Stanley Stepanic, Samuel Perry, Luke Askew, RaĂşl AdriĂĄn Huerta RodrĂguez, Abdelkader Dendane, Sam Ali, Ricardo Lorena, Krishnamurthy Iyer, Sk Md Salauddin, Murat Islam, Juan Gonzalez, Josh Ducey, Russell Campbell, Maja Somrak, Vasilios Mavroudis, Eric Vergo, Juehang Qin, BenjĂĄmin BorbĂĄs, Eric Chu, Jack Lindsey, Anil Radhakrishnan, Antoine Jallon, I. M. J. McInnis, Alex Hoover, SĂśren MĂśller, Song Bian, John Lai, Tejal Patwardhan, Summer Yue, Alexandr Wang, and Dan Hendrycks. Humanityâs last exam, 2025. URL https://arxiv.org/abs/2501.14249. Lidia Pivovarova, Ekaterina Pronoza, Elena Yagunova, and Anton Pronoza. Paraphraser: Russian paraphrase corpus and shared task. In Conference on artificial intelligence and natural language, pp. 211â225. Springer, 2017. RafaĹ Po´swiata, SĹawomir Dadas, and MichaĹ PereĹkiewicz. PL-MTEB: Polish Massive Text Embed- ding Benchmark. arXiv preprint arXiv:2405.10138, 2024. Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, and Mitesh Shantadevi Khapra. Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages. Transactions of the Association for Compu- tational Linguistics, 10:145â162, 02 2022. ISSN 2307-387X. doi: 10.1162/tacl_a_00452. URL https://doi.org/10.1162/tacl_a_00452. Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese
Chunk 61 ¡ 1,996 chars
icly Available Parallel Corpora Collection for 11 Indic Languages. Transactions of the Association for Compu- tational Linguistics, 10:145â162, 02 2022. ISSN 2307-387X. doi: 10.1162/tacl_a_00452. URL https://doi.org/10.1162/tacl_a_00452. Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks, 2019. Nils Reimers, Philip Beyer, and Iryna Gurevych. Task-oriented intrinsic evaluation of semantic textual similarity. In Yuji Matsumoto and Rashmi Prasad (eds.), Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 87â 96, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee. URL https: //aclanthology.org/C16-1009. Neil Christian R. Riego, Danny Bell Villarba, Ariel Antwaun Rolando C. Sison, Fernandez C. Pineda, and HerminiĂąo C. Lagunzad. Enhancement to low-resource text classification via sequential transfer learning. United International Journal for Research and Technology, 04:72â82, 2023. 26 -- 26 of 57 -- Published as a conference paper at ICLR 2025 Kirk Roberts, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, Kyle Lo, Ian Soboroff, Ellen Voorhees, Lucy Lu Wang, and William R Hersh. Searching for scientific evidence in a pandemic: An overview of trec-covid, 2021. Anna Rogers, Olga Kovaleva, Matthew Downey, and Anna Rumshisky. Getting closer to ai complete question answering: A set of prerequisite real tasks. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 8722â8731, 2020. Andrew Rosenberg and Julia Hirschberg. V-measure: A conditional entropy-based external cluster evaluation measure. In Jason Eisner (ed.), Proceedings of the 2007 Joint Conference on Empir- ical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 410â420, Prague, Czech Republic, June 2007. Association for Computa- tional Linguistics. URL https://aclanthology.org/D07-1043. Paul R"ottger, Bertie Vidgen,
Chunk 62 ¡ 1,998 chars
.), Proceedings of the 2007 Joint Conference on Empir- ical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 410â420, Prague, Czech Republic, June 2007. Association for Computa- tional Linguistics. URL https://aclanthology.org/D07-1043. Paul R"ottger, Bertie Vidgen, Dong Nguyen, Zeerak Waseem, Helen Margetts, and Janet Pier- rehumbert. HateCheck: Functional tests for hate speech detection models. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (eds.), Proceedings of the 59th Annual Meet- ing of the Association for Computational Linguistics and the 11th International Joint Con- ference on Natural Language Processing (Volume 1: Long Papers), pp. 41â58, Online, aug 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.4. URL https://aclanthology.org/2021.acl-long.4. Ivan Rybin, Vladislav Korablinov, Pavel Efimov, and Pavel Braslavski. Rubq 2.0: An innovated russian question answering dataset. In ESWC, pp. 532â547, 2021. Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99â106, 2021. Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social iqa: Common- sense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4463â4473, 2019. Salim Sazzed. Cross-lingual sentiment classification in low-resource bengali language. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pp. 50â60, 2020. Alexander Sboev, Aleksandr Naumov, and Roman Rybka. Data-driven model for emotion detection in russian texts. Procedia Computer Science, 190:637â642, 2021. Konstantinos Sechidis, Grigorios Tsoumakas, and Ioannis Vlahavas. On the stratification of multi- label data. In Machine
APPENDIX

TABLE OF CONTENTS

A Contributions
B Overview and Construction of Tasks
B.1 Introduction to benchmark tasks
B.2 Task construction
B.3 Novel datasets
B.4 Task Metadata
B.4.1 Domains
C Benchmark Optimizations
C.1 Speeding Up Tasks
C.1.1 Clustering
C.1.2 Retrieval
C.2 Code Optimizations
D Task Overview
D.1 Tasks
D.2 Languages
D.3 Examples
E Full results
E.1 Performance per Number of Speakers
F New Metrics
F.1 Abstention for retrieval and reranking tasks
G Models
H Benchmark Construction and Overview
H.1 Benchmark creation
H.2 Benchmark task overview
H.3 Performance on MTEB(eng, v2)
H.4 Performance on MTEB(Code)

A CONTRIBUTIONS

We list the contributions of every author in Table 3. The possible types of contributions and their associated points are:
• New dataset: Adding a new dataset means creating a new implementation (subclass) of a task that uses the dataset. 2 points were awarded for implementing the task and 4 points for each new language introduced by the task.
• New task: An implementation of a new task category, such as multi-label classification or instruction retrieval. 2 points were given for a new task, in addition to the points awarded for adding a new dataset.
• Annotations: Many existing datasets were not yet annotated with proper metadata. To encourage high-quality annotations, we awarded 1 point for each full dataset annotation.
• Fixes: These included bug fixes, usability fixes, speed improvements, and more. For these, we typically awarded 2-10 points depending on the size of the contribution.
• Running Models: This includes both running and implementing models for MMTEB. We typically awarded 1 point per model run on a full set of relevant tasks. Relevant tasks for a specific model are limited to those pertinent to its language; for instance, a Russian model does not need to be run on French tasks.
• Review PR: A large part of ensuring good dataset quality comes from the dataset review. We awarded 2 points for a review; if a PR had multiple reviewers, 2 points were awarded to each. Reviewers often finalized dataset additions, helped with data formatting, and resolved bugs. In many cases, 2 points per review was either too high (a near-perfect PR needing little to no correction) or too low (lengthy discussion examining dataset quality, debugging implementations, and more); on average, however, we believe it was appropriate.
• Writing: By this point, many of the authors writing the paper already qualified for co-authorship and thus had reasonable experience with the MMTEB point system. It was therefore generally possible to agree on a reasonable number of points based on the effort made in earlier stages.
• Coordination: Coordination of contributors and initial ideation were awarded points at the end of the project based on relative effort, similar to paper writing.

A total of 10 points had to be obtained to be invited as a co-author. To see each contribution mapped to specific PRs, see https://github.com/embeddings-benchmark/mteb/tree/main/docs/mmteb/points, where the name of each JSON file corresponds to the PR ID.

Table 3: Contributions by GitHub users. See Table 4 for the mapping between authors and GitHub handles.

| GitHub | Total | Bug fixes | Review PR | New dataset | Dataset annotations | Paper writing | Coordination | New task | Running Models |
|---|---|---|---|---|---|---|---|---|---|
| KennethEnevoldsen | 597 | 87 | 326 | 68 | 35 | 0 | 81 | 0 | 0 |
| isaac-chung | 433 | 50 | 194 | 120 | 1 | 12 | 54 | 2 | 0 |
| imenelydiaker | 358 | 24 | 144 | 120 | 0 | 0 | 70 | 0 | 0 |
| awinml | 302 | 0 | 2 | 300 | 0 | 0 | 0 | 0 | 0 |
| x-tabdeveloping | 239 | 10 | 32 | 144 | 0 | 0 | 41 | 12 | 0 |
| davidstap | 176 | 0 | 0 | 176 | 0 | 0 | 0 | 0 | 0 |
| jaygala24 | 149 | 0 | 0 | 149 | 0 | 0 | 0 | 0 | 0 |
| wissam-sib | 144 | 4 | 6 | 134 | 0 | 0 | 0 | 0 | 0 |
| Muennighoff | 142 | 0 | 48 | 0 | 0 | 0 | 70 | 0 | 24 |
| orionw | 125 | 20 | 20 | 0 | 0 | 0 | 75 | 10 | 0 |
| dokato | 112 | 12 | 6 | 94 | 0 | 0 | 0 | 0 | 0 |
| gentaiscool | 110 | 0 | 0 | 110 | 0 | 0 | 0 | 0 | 0 |
| jupyterjazz | 108 | 0 | 0 | 108 | 0 | 0 | 0 | 0 | 0 |
| SaitejaUtpala | 102 | 0 | 0 | 102 | 0 | 0 | 0 | 0 | 0 |
| vaibhavad | 93 | 8 | 4 | 6 | 0 | 0 | 75 | 0 | 0 |
| MathieuCiancone | 88 | 0 | 0 | 88 | 0 | 0 | 0 | 0 | 0 |
| schmarion | 88 | 0 | 0 | 88 | 0 | 0 | 0 | 0 | 0 |
| GabrielSequeira | 88 | 0 | 0 | 88 | 0 | 0 | 0 | 0 | 0 |
| digantamisra98 | 71 | 0 | 0 | 71 | 0 | 0 | 0 | 0 | 0 |
| shreeya-dhakal | 62 | 0 | 8 | 54 | 0 | 0 | 0 | 0 | 0 |
| Rysias | 58 | 0 | 0 | 58 | 0 | 0 | 0 | 0 | 0 |
| Samoed | 51 | 22 | 2 | 18 | 0 | 0 | 0 | 0 | 9 |
Table 3 (continued): Contributions by GitHub users. See Table 4 for the mapping between authors and GitHub handles.

| GitHub | Total | Bug fixes | Review PR | New dataset | Dataset annotations | Paper writing | Coordination | New task | Running Models |
|---|---|---|---|---|---|---|---|---|---|
| gowitheflow-1998 | 50 | 0 | 0 | 50 | 0 | 0 | 0 | 0 | 0 |
| sivareddyg | 50 | 0 | 0 | 0 | 0 | 0 | 50 | 0 | 0 |
| asparius | 48 | 0 | 14 | 34 | 0 | 0 | 0 | 0 | 0 |
| Akash190104 | 46 | 0 | 0 | 46 | 0 | 0 | 0 | 0 | 0 |
| MartinBernstorff | 43 | 13 | 8 | 2 | 0 | 0 | 20 | 0 | 0 |
| staoxiao | 40 | 0 | 0 | 40 | 0 | 0 | 0 | 0 | 0 |
| akshita-sukhlecha | 40 | 4 | 0 | 36 | 0 | 0 | 0 | 0 | 0 |
| rafalposwiata | 36 | 0 | 0 | 36 | 0 | 0 | 0 | 0 | 0 |
| bp-high | 36 | 0 | 0 | 36 | 0 | 0 | 0 | 0 | 0 |
| KranthiGV | 34 | 0 | 14 | 20 | 0 | 0 | 0 | 0 | 0 |
| bjoernpl | 28 | 0 | 0 | 28 | 0 | 0 | 0 | 0 | 0 |
| rasdani | 28 | 0 | 0 | 28 | 0 | 0 | 0 | 0 | 0 |
| loicmagne | 28 | 28 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| jphme | 28 | 0 | 0 | 28 | 0 | 0 | 0 | 0 | 0 |
| ShawonAshraf | 28 | 0 | 0 | 28 | 0 | 0 | 0 | 0 | 0 |
| violenil | 26 | 0 | 0 | 26 | 0 | 0 | 0 | 0 | 0 |
| mariyahendriksen | 24 | 0 | 0 | 0 | 0 | 24 | 0 | 0 | 0 |
| dwzhu-pku | 24 | 0 | 0 | 24 | 0 | 0 | 0 | 0 | 0 |
| hgissbkh | 23 | 13 | 2 | 0 | 0 | 3 | 0 | 5 | 0 |
| jankounchained | 22 | 8 | 0 | 14 | 0 | 0 | 0 | 0 | 0 |
| taeminlee | 22 | 0 | 0 | 22 | 0 | 0 | 0 | 0 | 0 |
| tomaarsen | 22 | 0 | 2 | 0 | 0 | 0 | 20 | 0 | 0 |
| kwojtasi | 22 | 0 | 0 | 22 | 0 | 0 | 0 | 0 | 0 |
| mrshu | 21 | 0 | 4 | 16 | 1 | 0 | 0 | 0 | 0 |
| crystina-z | 21 | 0 | 0 | 21 | 0 | 0 | 0 | 0 | 0 |
| ManuelFay | 20 | 13 | 0 | 2 | 0 | 0 | 0 | 5 | 0 |
| AlexeyVatolin | 20 | 20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Andrian0s | 20 | 2 | 4 | 14 | 0 | 0 | 0 | 0 | 0 |
| rbroc | 20 | 0 | 0 | 20 | 0 | 0 | 0 | 0 | 0 |
| john-b-yang | 20 | 0 | 0 | 0 | 0 | 20 | 0 | 0 | 0 |
| mmhamdy | 20 | 0 | 0 | 20 | 0 | 0 | 0 | 0 | 0 |
| manandey | 18 | 0 | 0 | 18 | 0 | 0 | 0 | 0 | 0 |
| thakur-nandan | 18 | 0 | 0 | 18 | 0 | 0 | 0 | 0 | 0 |
| PranjalChitale | 16 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 |
| Sakshamrzt | 16 | 0 | 4 | 12 | 0 | 0 | 0 | 0 | 0 |
| sted97 | 16 | 0 | 0 | 16 | 0 | 0 | 0 | 0 | 0 |
| dipam7 | 16 | 0 | 2 | 14 | 0 | 0 | 0 | 0 | 0 |
| artemsnegirev | 14 | 0 | 0 | 12 | 2 | 0 | 0 | 0 | 0 |
| taidnguyen | 14 | 0 | 0 | 14 | 0 | 0 | 0 | 0 | 0 |
| jordiclive | 12 | 10 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| guenthermi | 12 | 0 | 0 | 12 | 0 | 0 | 0 | 0 | 0 |
| slvnwhrl | 12 | 0 | 0 | 12 | 0 | 0 | 0 | 0 | 0 |
| Art3mis07 | 12 | 0 | 0 | 12 | 0 | 0 | 0 | 0 | 0 |
| xhluca | 12 | 4 | 2 | 6 | 0 | 0 | 0 | 0 | 0 |
Table 3 (continued): Contributions by GitHub users. See Table 4 for the mapping between authors and GitHub handles.

| GitHub | Total | Bug fixes | Review PR | New dataset | Dataset annotations | Paper writing | Coordination | New task | Running Models |
|---|---|---|---|---|---|---|---|---|---|
| anpalmak2003 | 12 | 0 | 0 | 9 | 3 | 0 | 0 | 0 | 0 |
| ab1992ao | 11 | 0 | 0 | 8 | 3 | 0 | 0 | 0 | 0 |
| MariyaTikhonova | 11 | 0 | 0 | 7 | 4 | 0 | 0 | 0 | 0 |
| henilp105 | 11 | 2 | 0 | 0 | 9 | 0 | 0 | 0 | 0 |
| simon-clematide | 10 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 |
| jimmy-lin | 10 | 0 | 0 | 0 | 0 | 0 | 10 | 0 | 0 |
| sarahooker | 10 | 0 | 0 | 0 | 0 | 10 | 0 | 0 | 0 |
| swj0419 | 10 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 |
| xiamengzhou | 10 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 |
| ABorghini | 10 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 |
| xu3kev | 10 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 |
| malteos | 10 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 |
| ljvmiranda921 | 10 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 |
| howard-yen | 10 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 |
| hongjin-su | 10 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 |
| guangyusong | 10 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 |
| Alenush | 10 | 0 | 0 | 6 | 4 | 0 | 0 | 0 | 0 |
| cassanof | 10 | 1 | 0 | 8 | 0 | 0 | 0 | 0 | 1 |
| HLasse | 10 | 5 | 0 | 0 | 5 | 0 | 0 | 0 | 0 |
| ZhengLiu101 | 10 | 0 | 0 | 10 | 0 | 0 | 0 | 0 | 0 |
| Ruqyai | 10 | 0 | 8 | 2 | 0 | 0 | 0 | 0 | 0 |
| izhx | 6 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 |
| marcobellagente93 | 6 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0 |
| monikernemo | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| NouamaneTazi | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| MexicanLemonade | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| bakrianoo | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| PhilipMay | 2 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| achibb | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| antoniolanza1996 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| cslizc | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| hanhainebula | 2 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |

B OVERVIEW AND CONSTRUCTION OF TASKS

In this appendix, we first provide an overview of the existing tasks in the MTEB benchmark and the tasks newly introduced in our benchmark (Section B.1). We then explain how the tasks were constructed from existing datasets (Section B.2). Lastly, we introduce newly constructed datasets specifically designed for MMTEB (Section B.3).

| GitHub | First name | Last name | Affiliations |
|---|---|---|---|
| KennethEnevoldsen | Kenneth | Enevoldsen | Aarhus University |
| x-tabdeveloping | Márton | Kardos | Aarhus University |
| imenelydiaker | Imene | Kerboua | INSA Lyon, LIRIS |
| wissam-sib | Wissam | Siblini | Individual Contributor |
| GabrielSequeira | Gabriel | Sequeira | Individual Contributor |
schmarion Marion Schaeffer Wikit
MathieuCiancone Mathieu Ciancone Wikit
MartinBernstorff Martin Bernstorff Aarhus University
staoxiao Shitao Xiao Beijing Academy of Artificial Intelligence
ZhengLiu101 Zheng Liu Beijing Academy of Artificial Intelligence
achibb Aaron Chibb Individual Contributor
cassanof Federico Cassano Northeastern University and Cursor AI
taidnguyen Nguyen Tai University of Pennsylvania
xu3kev Wen-Ding Li Cornell University
Rysias Jonathan Rystrøm University of Oxford
taeminlee Taemin Lee Korea University Human-Inspired AI Research
izhx Xin Zhang Harbin Institute of Technology
orionw Orion Weller Johns Hopkins University
slvnwhrl Silvan Wehrli Robert Koch Institute
manandey Manan Dey Salesforce
isaac-chung Isaac Chung Individual Contributor
asparius Ömer Çağatan Koç University, Turkey
rafalposwiata Rafał Poświata National Information Processing Institute
rbroc Roberta Rocca Aarhus University
awinml Ashwin Mathur Individual Contributor
guangyusong Guangyu Song Tano Labs
davidstap David Stap University of Amsterdam
HLasse Lasse Hansen Aarhus University
jaygala24 Jay Gala MBZUAI
digantamisra98 Diganta Misra Max Planck Institute for Intelligent Systems and ELLIS Institute Tübingen
PranjalChitale Pranjal Chitale Indian Institute of Technology
Akash190104 Akash Kundu Heritage Institute of Technology and Apart Research
dwzhu-pku Dawei Zhu Peking University
ljvmiranda921 Lester James Miranda Allen Institute for AI
Andrian0s Andrianos Michail University of Zurich
simon-clematide Simon Clematide University of Zurich
SaitejaUtpala Saiteja Utpala Microsoft Research
mmhamdy Mohammed Hamdy Cohere For AI Community
jupyterjazz Saba Sturua Jina AI
Ruqyai Ruqiya Bin Safi NaN
KranthiGV Kranthi Kiran GV New York University
shreeya-dhakal Shreeya Dhakal Individual Contributor
dipam7 Dipam Vasani Individual Contributor
Art3mis07 Gayatri K R. V. College of Engineering
jankounchained Jan Kostkan Aarhus University
bp-high Bhavish Pahwa Microsoft Research
rasdani Daniel Auras ellamind, Germany
ShawonAshraf Shawon Ashraf ellamind, Germany
bjoernpl Björn Plüster ellamind, Germany
jphme Jan Philipp Harries ellamind, Germany
malteos Malte Ostendorff Occiglot
ManuelFay Manuel Faysse CentraleSupélec and Illuin Technology
hgissbkh Hippolyte Gisserot-Boukhlef CentraleSupélec and Artefact Research Center
sted97 Simone Tedeschi Sapienza University of Rome
gentaiscool Genta Indra Winata Individual Contributor
henilp105 Henil Panchal Nirma University
ABorghini Alessia Borghini Sapienza University of Rome
jordiclive Jordan Clive Imperial College London
gowitheflow-1998 Chenghao Xiao Durham University
mariyahendriksen Mariya Hendriksen University of Amsterdam
dokato Dominik Krzemiński Cohere For AI Community
Samoed Roman Solomatin AI Talent Hub and ITMO University
Alenush Alena Fenogenova SaluteDevices
ab1992ao Aleksandr Abramov SaluteDevices
artemsnegirev Artem Snegirev SaluteDevices
anpalmak2003 Anna Maksimova SaluteDevices
MariyaTikhonova Maria Tikhonova SaluteDevices and HSE University
vaibhavad Vaibhav Adlakha Mila, McGill University and ServiceNow Research
sivareddyg Siva Reddy Mila, McGill University and ServiceNow Research
guenthermi Michael Günther Jina AI
violenil Isabelle Mohr Jina AI
akshita-sukhlecha Akshita Sukhlecha Individual Contributor
Muennighoff Niklas Muennighoff Stanford University and Contextual AI
AlexeyVatolin Aleksei Vatolin FRC CSC RAS
xhluca Xing Han Lù Mila, McGill University
crystina-z Xinyu Zhang University of Waterloo
tomaarsen Tom Aarsen Hugging Face
mrshu Marek Suppa Comenius University Bratislava and Cisco Systems
swj0419 Weijia Shi University of Washington
xiamengzhou Mengzhou Xia Princeton University
john-b-yang John Yang Stanford University
thakur-nandan Nandan Thakur University of Waterloo
loicmagne Loic Magne Individual Contributor
sarahooker Sara Hooker Cohere For AI
kwojtasi Konrad Wojtasik Wrocław University of Science and Technology
jimmy-lin Jimmy Lin University of Waterloo
hongjin-su Hongjin Su University of Hong Kong
howard-yen Howard Yen Princeton University
Sakshamrzt Saksham Thakur Individual Contributor
Table 4: Author overview, along with their affiliations and GitHub handles.
B.1 INTRODUCTION TO BENCHMARK TASKS
Classification: First, a training set is constructed by sampling n (8-16) samples for each label. If only a test set is available, a section is split off as a training set. Both sets are then embedded; the training embeddings are used to fit a logistic regression classifier with a maximum of 100 iterations, and performance metrics are calculated on the test embeddings. For robustness, this process is repeated 10 times.
Pair classification: For two paired texts, the goal is to predict the label.
Examples of such tasks include paraphrase detection or duplicate detection. The task is solved by embedding all documents and then computing the distance using either a model-specified metric or the cosine, Euclidean, dot-product, or Manhattan metric. Using the best binary threshold, performance metrics are computed.
Bitext mining: The dataset consists of matching pairs of sentences, and the goal is to find the match. All matching pairs of sentences are embedded, the closest match is found using cosine similarity, and metrics are reported.
Clustering and hierarchical clustering: Clustering starts with a set of documents and an associated set of labels. First, we embed all documents. Then, for each of 10 consecutive experiments, we sample a subset of size k from the embedded documents. The sampled embeddings are clustered using K-means clustering, and performance metrics are calculated between the estimated clusters and the labels. If the clustering problem is hierarchical, this procedure is repeated for each level of the hierarchy separately. Hierarchical tasks were formerly either split into multiple tasks, or later levels of the cluster hierarchy were ignored.
Note that this formulation differs from that of MTEB in that the sets are randomly sampled from the embedded documents instead of being specified a priori. This drastically reduces runtime, as one document can be used in multiple subsets without the need to embed it multiple times. The new formulation also allows us to gain a robust estimate of performance with a lower number of documents.
Retrieval: Retrieval tasks consist of a corpus, queries, and a mapping between the queries and their relevant documents.
The goal is to retrieve these relevant documents. Both queries and documents are embedded using the model. We allow these to be embedded differently depending on the model. For each query, the corpus documents are ranked using a similarity score, and performance metrics are calculated based on the reference mapping.
Multi-label classification: Classification tasks in MTEB were previously limited to one label per document. As such, some otherwise useful multi-label classification tasks had to be dropped or reformulated. We addressed this by introducing a multi-label classification task type. Similarly to our novel clustering task, we down-sample training sets for 10 experiments. We limit the training sets to include 8 instances of each unique label, and train a K-nearest-neighbours classifier. Every classifier is then evaluated on the same test set. We opted for accuracy, F1, and Label Ranking Average Precision (LRAP) as evaluation metrics.
Instruction retrieval: Instruction retrieval builds on the traditional retrieval task by incorporating detailed instructions alongside the queries. Unlike standard retrieval, where queries are usually brief keywords, instruction retrieval pairs each query with a comprehensive instruction that outlines the criteria for document relevance. These instructions are specific to each query and not generic to the entire dataset. Therefore, the task involves using both the query and its associated instruction to retrieve relevant documents from the corpus. For the main metric, we use Robustness@10.
Reranking: Similar to the retrieval task, reranking includes a corpus, a query, and a list of relevant and irrelevant reference texts.
The aim is to rank the results according to their relevance to the query. References and queries are embedded, and references are compared to the query using cosine similarity. The resulting ranking is scored for each query, and the scores are averaged across all queries. For the main metric, we use MAP@1000.
Semantic text similarity: Semantic text similarity (STS) tasks consist of sentence pairs, where the goal is to determine their similarity. Labels are continuous scores, with higher numbers indicating more similar sentences. All sentences are embedded using the model, and the similarity of each pair is computed using various distance metrics, allowing for model-specified similarity metrics. Distances are benchmarked with ground truth similarities using Pearson and Spearman correlations. Spearman correlation based on highest similarity serves as the main metric (Reimers et al., 2016).
B.2 TASK CONSTRUCTION
This section outlines our approach to constructing tasks, primarily from pre-existing data. For details on the newly introduced datasets in MMTEB, we refer to Section B.3.
Task construction from existing datasets consisted of a number of steps to ensure that the task is compatible with formulations in the benchmark and matches our standards:
1. Dataset preprocessing: we start by applying minimal additional processing to ensure the data is in the required format.
2. Dataset size reduction: to maintain manageable evaluation times, we proceed by reducing dataset size whenever applicable.
3. Relevance filtering: to ensure the datasets are relevant for the types of tasks being evaluated, we apply relevance-based dataset filtering.
4. Differentiation testing: we assess the task's ability to differentiate between the performance of two candidate models.
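The STS evaluation described in Section B.1 (one similarity per sentence pair, then a Spearman correlation against the gold scores) can be sketched as follows. This is a minimal, dependency-free illustration that assumes cosine similarity as the similarity function; the helper names (`cosine`, `average_ranks`, `sts_score`) are hypothetical and are not taken from the MTEB code base.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def sts_score(emb_a, emb_b, gold):
    """Main STS metric (assumed here: Spearman over cosine similarities)."""
    sims = [cosine(u, v) for u, v in zip(emb_a, emb_b)]
    return spearman(sims, gold)
```

For example, `sts_score(first_sentence_embeddings, second_sentence_embeddings, gold_scores)` returns a value in [-1, 1], where higher means the embedding similarities order the pairs more like the human labels do.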
For further details on dataset transformations for specific tasks, we refer to the dataset_transform method implementation for each task.
Classification and pair classification: For both classification tasks, we used existing datasets with minimal adjustments, primarily trimming them down to more manageable sizes. For performance evaluation, we rely on metrics such as F1 score, accuracy, or average precision. Whenever feasible, we align our choice of the primary metric with those used in related publications. If no specific guidance exists, we default to accuracy for general classification tasks and average precision for pairwise classification. In scenarios with significant class imbalance, the F1 score is prioritized.
Bitext mining: Bitext mining tasks were constructed using established paired datasets. Similar to the classification tasks, the primary focus was on adjusting the dataset sizes to maintain the same model rank while reducing computational load. F1 scores were chosen to be the primary metric, unless specified otherwise.
Clustering and hierarchical clustering: Clustering tasks were derived from existing corpora, such as news articles or encyclopedic entries. The source datasets typically included categories or labels assigned by their original authors or publishers. In some cases, like the SNL and VG datasets (Navjord & Korsvik, 2023), which featured hierarchical labels, we reformulated the tasks from flat to hierarchical clustering.
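As a companion to the clustering tasks, the evaluation loop from Section B.1 (embed once, then repeatedly sample k embedded documents, cluster them, and score against the labels) can be sketched as below. This is an illustration under stated assumptions: the V-measure is implemented from its standard definition, and `cluster_fn` is a hypothetical stand-in for the K-means step, supplied by the caller; none of these names come from the benchmark's code.

```python
import math
import random
from collections import Counter


def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given) for two aligned label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(true_labels, pred_labels):
    """Harmonic mean of homogeneity and completeness."""
    h_true, h_pred = _entropy(true_labels), _entropy(pred_labels)
    homogeneity = 1.0 if h_true == 0 else 1.0 - _conditional_entropy(true_labels, pred_labels) / h_true
    completeness = 1.0 if h_pred == 0 else 1.0 - _conditional_entropy(pred_labels, true_labels) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


def evaluate_clustering(embeddings, labels, cluster_fn, k, n_experiments=10, seed=0):
    """Score n_experiments random subsets of size k drawn from pre-computed embeddings.

    cluster_fn(vectors, n_clusters) -> list of cluster ids (hypothetical interface;
    the paper uses K-means for this step).
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_experiments):
        idx = rng.sample(range(len(embeddings)), k)  # no re-embedding needed
        subset = [embeddings[i] for i in idx]
        gold = [labels[i] for i in idx]
        predicted = cluster_fn(subset, len(set(gold)))
        scores.append(v_measure(gold, predicted))
    return scores
```

Because the subsets are drawn from embeddings that already exist, a document can appear in several experiments at no extra embedding cost, which is the source of the runtime savings the text describes.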
Retrieval: A variety of tasks were integrated as retrieval tasks, including existing retrieval, question-answer, and news datasets. For question-answer datasets, the questions were used as queries, and the answers formed the corpus, with correct answers identified as properly retrieved documents. In news datasets, headlines were treated as queries, and both the summaries and the full articles were considered part of the corpus, with matched summaries and articles serving as relevant documents. For the primary metric, we use nDCG@10, unless otherwise specified by the dataset publication.
Multi-label classification: For multi-label classification, we used existing datasets that required minimal adjustments. A critical aspect of these tasks was maintaining the balance of label distributions across the training and evaluation splits. To achieve this, we employed advanced stratification techniques (Szymański & Kajdanowicz, 2017; Sechidis et al., 2011) that consider higher-order relationships between labels, ensuring balanced samples and improved classification quality. For the main metric, we use accuracy.
Instruction retrieval: For instruction retrieval tasks, we incorporated datasets like FollowIR (Weller et al., 2024; 2025), which consist of comprehensive narratives created by professional assessors. These datasets were initially developed for TREC shared tasks and included rich, context-heavy queries to evaluate retrieval systems' performance on more intricate retrieval problems.
Reranking: For reranking tasks, we adapted datasets covering a range of topics and languages, including academic paper ranking, news articles (Wu et al., 2020b), QA pair relevance from online platforms, and passage ranking (Xie et al., 2023). For the primary metric, we use MAP unless otherwise specified by the dataset publication.
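The two ranking metrics named above, nDCG@10 for retrieval and MAP for reranking, can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: it assumes linear gain for nDCG (some toolkits use an exponential gain instead), and the function names and the `relevance` and `qrels` dictionary layouts are hypothetical.

```python
import math


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k with linear gain (an assumption of this sketch).

    ranked_ids: document ids ordered by decreasing similarity to the query.
    relevance:  dict mapping doc id -> graded relevance; missing ids count as 0.
    """
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query with binary relevance."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(rankings, qrels):
    """MAP: average precision per query, averaged across all queries.

    rankings: dict query id -> ranked list of doc ids.
    qrels:    dict query id -> set of relevant doc ids.
    """
    return sum(average_precision(rankings[q], qrels[q]) for q in qrels) / len(qrels)
```

A cutoff such as MAP@1000 would correspond to truncating each ranked list to its first 1000 entries before calling `average_precision`.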
Semantic text similarity: For STS tasks, we adapted well-known benchmarks like STSbenchmark (May et al., 2021) and cross-lingual STS datasets from SemEval (Agirre et al., 2015). We also adapted paraphrase datasets in various languages, such as the Russian ParaPhraser (Pivovarova et al., 2017) and the Finnish Paraphrase Corpus (Kanerva et al., 2021). As the main metric, we use Spearman correlation based on the highest similarity (Reimers et al., 2016).
B.3 NOVEL DATASETS
This section introduces tasks specifically created as part of the MMTEB contributions. For information on how existing datasets were adapted to MTEB, we refer to Appendix B.
PublicHealthQA: This retrieval task is built on top of a novel dataset containing question-and-answer pairs in public health, specifically related to COVID-19. They are sourced from Q&A pages and Frequently Asked Questions (FAQ) sections of the Centers for Disease Control and Prevention (CDC) and World Health Organization (WHO) websites. They were produced and collected between 2019-12 and 2020-04.
WebLINXReranking: This is a novel HTML reranking task derived from WebLINX, a benchmark for training and evaluating web agents with conversational capabilities (Lù et al., 2024). Whereas the original work introduces a retrieval task with the goal of retrieving HTML elements using a conversational context, we propose the first task with the goal of reranking HTML elements based on their relevance for actions executed in web environments, including clicks, hovers, and text insertions.
WikiClustering: This is a multilingual clustering benchmark based on Wikipedia's main topic classifications.
The goal is to create a clustering benchmark that works for multiple languages.
To construct a WikiClustering dataset for a given language, we apply the following steps. First, we
download the wiki dump of the categories, the articles, and the category links. Second, we find the
main topic classifications for all articles. The main topic classifications can be found by looking
at the category page for the language (see footnote 6). We only use the first paragraph of each article to construct
a paragraph-to-paragraph (P2P) task similar to other P2P tasks within MTEB. Third, we filter out
articles with more than one main topic and remove any topic with only one article associated with it.
This step avoids ambiguity in the clustering task. Finally, we sample 2048 articles with associated
main topics.
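The filtering and sampling steps above can be sketched as follows. This is a minimal illustration that assumes the dump has already been parsed into records with a first paragraph and a list of main topics; the record layout, the function name `build_wiki_clustering`, and the seed are hypothetical and are not taken from the released code.

```python
import random
from collections import Counter


def build_wiki_clustering(articles, n_samples=2048, seed=42):
    """Return (first_paragraph, main_topic) pairs for a clustering task.

    articles: list of dicts with keys
        "text"   -> first paragraph of the article
        "topics" -> list of main topic classifications for the article
    """
    # Keep only articles with exactly one main topic (drops ambiguous ones;
    # as a side effect of this sketch, articles with no topic are dropped too).
    single = [a for a in articles if len(a["topics"]) == 1]

    # Remove topics that have only one article associated with them.
    counts = Counter(a["topics"][0] for a in single)
    kept = [a for a in single if counts[a["topics"][0]] > 1]

    # Sample at most n_samples articles.
    rng = random.Random(seed)
    if len(kept) > n_samples:
        kept = rng.sample(kept, n_samples)

    return [(a["text"], a["topics"][0]) for a in kept]
```

The returned pairs map directly onto a paragraph-to-paragraph clustering task: the first element is the text to embed and the second is the gold cluster label.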
While the WikiClustering benchmark can be extended to any language with main topic classifications,
it is currently implemented for the following: Bosnian, Catalan, Czech, Danish, Basque, Manx,
Ilokano, Kurdish, Latvian, Minangkabau, Maltese, Scots, Albanian, and Walloon. All code is
available on GitHub.
WikipediaRetrievalMultilingual and WikipediaRerankingMultilingual: This is a multilingual
retrieval and reranking dataset based on succinct queries generated by a strong multilingual LLM
grounded in Wikipedia articles. The dataset was made to resemble SQuAD. Sampled Wikipedia
articles of a target language were chunked and passed to GPT-4o using the following prompt:
"""
Your task is to anticipate possible search queries by users in the form of a question
for a given document.
- The question must be written in {{ language }}
- The question should be formulated concretely and precisely and relate to the
information from the given document
- The question must be coherent and should make sense without knowing the document
- The question must be answerable by the document
- The question should focus on one aspect and avoid using subclauses connected with
'and'
- The question should not be overly specific and should mimic a request of a user who
is just starting to research the given topic
- Do not draw on your prior knowledge
Generate a question in {{ language }} for the following document:
<document>
{{ document }}
</document>
Search query:
"""
6 For details, we refer to https://en.wikipedia.org/wiki/Category:Main_topic_classifications for English.
Figure 7: Comparison of MRR on synthetic retrieval and gold (GermanQuAD). The synthetic dataset
was generated using GPT4-turbo.
We filtered out articles with fewer than 9 paragraphs and sampled 1500 articles from the top 100k viewed
articles. We then selected a random window of 9 consecutive paragraphs per article, chose the
middle one as the positive context, and generated a query for it with GPT-4o. The surrounding 8
paragraphs act as hard negatives. The 9 paragraphs per article are used for the reranking task with one
positive and 8 negatives. The one positive, 8 hard negatives, and the remaining corpus as negatives
are used in the retrieval task.
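The window selection described above can be sketched as follows. This is a minimal illustration: query generation with the LLM is left out, and the function name `build_example` and the returned dictionary layout are hypothetical.

```python
import random


def build_example(paragraphs, rng, window=9):
    """Pick a random window of consecutive paragraphs from one article.

    Returns None for articles with fewer than `window` paragraphs (these are
    filtered out). Otherwise the middle paragraph is the positive context and
    the surrounding paragraphs are hard negatives.
    """
    if len(paragraphs) < window:
        return None
    start = rng.randrange(len(paragraphs) - window + 1)
    chunk = paragraphs[start:start + window]
    mid = window // 2
    return {
        "positive": chunk[mid],
        "hard_negatives": chunk[:mid] + chunk[mid + 1:],
    }
```

For the reranking task, the candidate list of a query would be the positive plus its 8 hard negatives; for the retrieval task, the same 9 paragraphs are pooled with those of all other articles to form the corpus.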
These datasets were constructed for the following languages: "bul-Cyrl", "ben-Beng", "ces-Latn",
"dan-Latn", "deu-Latn", "eng-Latn", "fas-Arab", "fin-Latn", "hin-Deva", "ita-Latn", "nld-Latn",
"por-Latn", "ron-Latn", "srp-Cyrl", "dan-Latn", "nob-Latn", "swe-Latn".
To estimate the quality of these samples, we compare them to GermanQuAD (Möller et al., 2021) in Figure 7. We obtain a Spearman rank correlation of 0.93 with a 95% CI of [0.69; 1.].
B.4 TASK METADATA
Table 5 shows the metadata that must be filled in before adding a task to the benchmark. We provide a detailed description of each field, along with examples and possible values.
B.4.1 DOMAINS
For our domains, we include the following:
• Academic: Scholarly writing and research publications typically found in journals, theses, and dissertations.
• Blog: Informal or conversational posts often found on websites or personal pages, covering a wide range of topics.
• Constructed: Text or speech that is deliberately invented or constructed, often used for experimental purposes to target specific abilities.
• Encyclopaedic: Structured, reference-based texts that provide comprehensive and factual information on a wide range of subjects.
Field Description
Name A concise name for the task.
Description A brief explanation of the task's goals and objectives.
Type The primary task category (e.g., classification, summarization, retrieval).
Category The general data structure or format of the task. This can be specified using a combination of single-letter codes (e.g., "s" for sentence, "p" for paragraph, "d" for document). For example, "s2s" indicates a sentence-to-sentence task, "s2p" indicates a sentence-to-paragraph task, and "p2p" indicates a paragraph-to-paragraph task.
Task Subtype A more specific subcategory within the primary task type. This can be used to further refine the task
and provide additional context. For example, "Summarization" might have subtypes like "Extractive
Summarization" or "Abstractive Summarization".
Reference A URL or citation to the original source material (e.g., paper, dataset repository).
Evaluation Splits The specific subsets of the data used for training, validation, and testing.
Evaluation Languages A list of ISO 639-3 language codes (e.g., "eng", "fra") followed by ISO 15924 script codes (e.g.,
"Latn", "Cyrl") for each language used in the evaluation. For example: [("eng", "Latn"), ("fra",
"Latn")]. If multiple scripts are used within a single language, we specify them as a list (e.g., [("eng",
["Latn", "Grek"])]).
Date The time period when the data was gathered. Specified as a tuple of two dates.
Main score The primary metric used to evaluate task performance.
Form The format of the data (e.g., "spoken", "written")
License The licensing terms for the dataset (e.g., CC BY-SA, MIT).
Domains The subject areas or fields covered by the data (e.g., medical, legal, news). One dataset can belong to
multiple domains.
Annotation Creators The type of the annotators. Includes "expert-annotated" (annotated by experts), "human-annotated"
(annotated e.g. by mturkers), "derived" (derived from structure in the data), "LM-generated" (generated
using a language model) and "LM-generated and reviewed" (generated using a language model
and reviewed by humans or experts).
Dialect The specific dialect or regional variation of the language.
Text Creation How the text was generated. Includes "found", "created", "human-translated and localized", "human-translated", "machine-translated", "machine-translated and verified", "machine-translated and localized", "LM-generated and verified".
Bibtex Citation The BibTeX format citation for the dataset.
Number of samples The total number of data points in the dataset.
Avg. Number of characters The average character length of the samples in the dataset.
Table 5: Required metadata for adding a new task to MMTEB.
• Fiction: Narrative writing based on imaginative content, including novels, short stories, and other forms of storytelling.
• Government: Official documents, reports, and publications produced by governmental bodies.
• Legal: Documents and texts relating to laws, legal proceedings, contracts, and legal theory.
• Medical: Scientific and clinical literature related to healthcare, treatments, medical research, and patient care.
• News: Journalistic content that covers current events, politics, economy, and other topical issues.
• Non-fiction: Writing based on factual accounts and real-world subjects, such as biographies, essays, and documentaries.
• Poetry: Literary form focused on expressive language, often structured with meter, rhyme, or free verse.
• Religious: Texts related to religious teachings, doctrines, sacred scriptures, and spiritual discussions.
• Reviews: Critical evaluations of works such as books, movies, music, products, or services.
• Social: Written or spoken communication on social media platforms, forums, and other digital environments.
• Spoken: Oral communication, including speeches, dialogues, interviews, and recorded conversations.
• Subtitles: Textual transcriptions or translations of spoken language in films, videos, or multimedia presentations.
• Web: Text content found on websites, covering a wide range of subjects, often hyperlinked and multimedia-enriched.
• Written: General term for any form of text-based communication, whether printed or digital.
• Programming: Text written in programming languages to instruct computers, often for software development.
Our definition of domain aligns with that of the Universal Dependencies project (Nivre et al., 2016). We do not claim that our definition is either precise or comprehensive. The domains include subject fields such as "medical", "legal", and "news" as well as literary types such as "fiction" and "non-fiction". They are not mutually exclusive.
C BENCHMARK OPTIMIZATIONS
C.1 SPEEDING UP TASKS
We aim to reduce the total amount of time needed to run the complete set of MTEB tasks. In particular, we investigate how to drastically reduce runtime on clustering and retrieval tasks while maintaining relative model rankings. This appendix provides full details of the approach described in Section 2.3.2.
C.1.1 CLUSTERING
Task Spearman Speedup
Biorxiv P2P 0.9505 31.50x
Biorxiv S2S 0.9890 14.31x
Medrxiv P2P 0.9615 21.48x
Medrxiv S2S 0.9560 8.39x
Reddit S2S 0.9670 11.72x
Reddit P2P 0.9670 22.77x
StackExchange S2S 0.9121 9.55x
StackExchange P2P 0.9670 20.20x
TwentyNewsgroups 1.0000 5.02x
Average 0.9634 16.11x
Table 6: Agreement on model rankings on a selection of English clustering tasks using Spearman's correlation across the scores of 13 models of various sizes.
In the main paper, we present a down-sampled and bootstrapped version of the clustering task. We highlight the main results in Table 6. We observe an average speedup across tasks of 16.11x while maintaining the relative ordering of models on the evaluated tasks. The largest average speed-up was seen for e5-large (16.93x), but we expect this effect to be even more pronounced among 7b or larger models.
Nine single-level English clustering tasks are evaluated on 13 models of various sizes. A fraction of the documents is sampled, stratified by their target categories. At the same time, we wish to maintain the robustness of the evaluation, i.e., the fast approach should produce a model ranking highly similar to that of the original approach. As such, we investigate the extent of agreement on the model rankings between the original clustering task and ours for each task. The model ranking is determined from the mean of V-measure scores from evaluations, where a higher mean gives a higher model rank. Spearman's rank correlation score is then calculated based on the ranks from our approach and the original approach. We additionally calculate the significant model rank, which is determined by computing the significance of the given model's V-measure bootstrapped distribution based on its mean of V-measure scores using our approach against that of the original approach. The significant Spearman score (Sig. S) is then calculated based on the significant ranks from our approach and the original approach. To find a balance between speedup and the robustness of the approach, 4% of the dataset is chosen as the fraction to down-sample to, with the exception of RedditS2S and StackExchange, where n_samples = 32768.
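The stratified down-sampling step described above can be sketched as follows. This is an illustration under stated assumptions: the per-category rounding and the minimum of one document per category are choices made for this sketch, and the function name `stratified_downsample` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_downsample(labels, fraction=0.04, seed=0):
    """Return sorted indices of a sample that preserves label proportions.

    labels:   one target category per document.
    fraction: share of each category to keep (0.04 mirrors the 4% in the text).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    chosen = []
    for indices in by_label.values():
        # Assumption of this sketch: keep at least one document per category.
        n_keep = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, n_keep))
    return sorted(chosen)
```

The returned indices can be used to select both the documents to embed and their labels, so that the down-sampled task keeps roughly the same category distribution as the full one.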
Chunk 96 · 1,998 chars
Figure 8 reports the distribution of V-measure scores obtained from evaluation per model in each dataset for the ClusteringFast and the original approach. There is generally strong agreement between the rankings from both approaches. We also observe that the ClusteringFast approach often (5 out of 9 datasets) produces a smaller spread (i.e. smaller variance) in its V-measure distributions.

[Figure 8, panels (a) to (i)]
Figure 8: Distribution of scores per task across models.

Table 7: Agreement on model rankings on English clustering tasks using significant Spearman's rank correlation with selected models of various sizes.
Task | Sig. S
Biorxiv P2P | 0.9390
Biorxiv S2S | 0.9679
Medrxiv P2P | 0.8200
Medrxiv S2S | 0.9510
Reddit S2S | 0.9790
Reddit P2P | 0.7370
StackExchange S2S | 0.9486
StackExchange P2P | 0.9497
TwentyNewsgroups | 0.9832
Average | 0.9195

[Figure 9, panels: Natural Questions, TREC-COVID; x-axis: Number of Documents (2, 5, 10, 50, 100, 250, 500, All); y-axis: Rank (1 to 7); legend (Model Name): NV-Embed-v1, SFR-Embedding-Mistral, e5-mistral-7b-instruct, e5-large-v2, gte-base-en-v1.5, bge-large-en-v1.5, contriever-msmarco]
Figure 9: Ranking of different models on subsampled versions of the datasets using hard negatives. We see that NQ can be reduced to just two documents per query (relevant + 1 hard negative) while still maintaining the rank, while TREC-COVID is less stable.
Chunk 97 · 1,966 chars
Reddit P2P has the lowest significant Spearman score among this set. It also has the lowest average character length for its documents.

C.1.2 RETRIEVAL

In this section we provide details about the method used to downsample retrieval datasets. To ensure that downsampling preserved the efficacy of the evaluation, we examined several axes: (1) a wide range of models, to be sure that the downsampled task could still rank the models properly, just as if it were not downsampled; (2) whether the method works for retrieval datasets that are sparsely judged as well as densely judged; and (3) whether it is possible to use hard negatives from a smaller set of models, given the computational expense of gathering these hard negatives on the full datasets.⁷ To meet these goals we chose NQ (for sparse relevance annotations, one per query) and TREC-COVID (for dense judgements, > 500 per query). To test using a small set of hard negatives, we gather the hard negatives with e5-large-v2 only. We evaluate a wide range of models for this analysis, including the current state of the art and some previous state-of-the-art models: NV-Embed-v1 (Lee et al., 2024), SFR-Embedding-Mistral (Meng et al., 2024), e5-mistral-7b-instruct (Wang et al., 2023), e5-large-v2 (Wang et al., 2022), gte-base-en-v1.5 (Li et al., 2023b), bge-large-en-v1.5 (Xiao et al.,

⁷ We also tested whether ensuring that the ground truth relevant document is present in these hard negatives made a difference. We found that it did not: most models ranked the ground truth in the top N, so manually including it helped little, as it was already included.
Chunk 98 · 1,994 chars
[Figure 10, panels: Natural Questions, TREC-COVID; x-axis: Number of Documents (2, 5, 10, 50, 100, 250, 500, All); y-axis: nDCG@10; legend (Model Name): NV-Embed-v1, SFR-Embedding-Mistral, e5-mistral-7b-instruct, e5-large-v2, gte-base-en-v1.5, bge-large-en-v1.5, contriever-msmarco]
Figure 10: Absolute scores of different models on subsampled versions of the datasets using hard
negatives. NQ has 1 relevant document per query while TREC-COVID has 500+ relevant documents
per query, which is why we see NQ scores gradually increasing whereas TREC-COVID scores vary.
2023), and contriever-msmarco (Izacard et al., 2021). We then evaluated the models on versions of
the datasets with N hard negative documents per query, where N ∈ {2, 5, 10, 50, 100, 500, all}. We
then compared the absolute scores and the relative rank positions to see which settings best retain the
difficulty of the original task.
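The subsampling setup described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the MTEB implementation: the function name subsample_corpus, the include_relevant flag, and the toy similarity scores are hypothetical, and in the paper the scores would come from a single scoring model (e5-large-v2).

```python
from typing import Dict, Set


def subsample_corpus(
    scores: Dict[str, Dict[str, float]],
    qrels: Dict[str, Set[str]],
    n_per_query: int,
    include_relevant: bool = True,
) -> Set[str]:
    """Return the ids of the documents kept in the reduced corpus.

    scores maps query id -> {document id -> similarity score} (hypothetical input).
    qrels maps query id -> set of relevant document ids.
    """
    kept: Set[str] = set()
    for query_id, doc_scores in scores.items():
        # Rank candidate documents for this query from most to least similar;
        # the top-ranked ones act as the hard negatives.
        ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)
        kept.update(ranked[:n_per_query])
        if include_relevant:
            # Assumption: optionally force the judged-relevant documents in.
            kept.update(qrels.get(query_id, set()))
    return kept


# Toy data, made up for illustration only.
toy_scores = {
    "q1": {"d1": 0.9, "d2": 0.8, "d3": 0.1, "d4": 0.05},
    "q2": {"d1": 0.2, "d2": 0.1, "d3": 0.7, "d4": 0.6},
}
toy_qrels = {"q1": {"d1"}, "q2": {"d4"}}

reduced = subsample_corpus(toy_scores, toy_qrels, n_per_query=1)
reduced_no_force = subsample_corpus(
    toy_scores, toy_qrels, n_per_query=1, include_relevant=False
)
print(sorted(reduced), sorted(reduced_no_force))
```

The reduced corpus is the union over queries, so its size grows more slowly than N times the number of queries whenever queries share hard negatives.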
Ability to rank models correctly: A good evaluation must be able to rank models correctly
and determine the best model. We therefore examine how the ranking of the models changes when
we lower the number of hard negatives. For NQ the ranking remains stable even with just one hard
negative (Figure 9). For TREC-COVID the ranking becomes unstable starting at 100 hard negatives
and continues to change as the number gets smaller.
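One way to quantify the rank stability discussed above is to compare the model ranking on a reduced corpus with the ranking on the full corpus, for example with Spearman's rank correlation. The sketch below is illustrative only: the text inspects the rankings in Figure 9 and does not state that this statistic was used for retrieval, and the model names and scores are made-up toy values.

```python
from typing import Dict


def average_ranks(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank models by score (1 = best), giving tied models their average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of models tied with the model at position i.
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[ordered[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    ranks_a, ranks_b = average_ranks(a), average_ranks(b)
    models = sorted(ranks_a)
    xs = [ranks_a[m] for m in models]
    ys = [ranks_b[m] for m in models]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Toy nDCG@10 values, made up for illustration only.
full_corpus = {"model_a": 0.60, "model_b": 0.55, "model_c": 0.50, "model_d": 0.40}
reduced_250 = {"model_a": 0.70, "model_b": 0.66, "model_c": 0.61, "model_d": 0.52}
reduced_10 = {"model_a": 0.90, "model_b": 0.95, "model_c": 0.85, "model_d": 0.80}

rho_250 = spearman(full_corpus, reduced_250)  # same ordering as the full corpus
rho_10 = spearman(full_corpus, reduced_10)    # the top two models swap places
print(rho_250, rho_10)
```

A correlation close to 1 at a given corpus size would indicate that the reduced task still orders the models as the full task does, even if the absolute scores shift.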
Chunk 99 · 1,999 chars
Keeping the absolute score similar: Ideally, the scores for the task should remain similar and not trend towards perfect scores, so that the task remains useful. We see that scores go very high when there are only a few hard negatives for NQ (Figure 10). For TREC-COVID the scores are more stable, but we see some wider swings with fewer documents. Overall, the scores are relatively similar at 100+ hard negatives.

Summary: Overall, we see that staying above 100 hard negatives gives similar absolute scores while maintaining the ranking ability. Thus we opted for a conservative 250 documents per query to keep these characteristics.

C.2 CODE OPTIMIZATIONS

We here document the major code optimizations within MTEB that are not related to dataset scores or task reformulation.

Dataset loading: One important issue concerned the loading of multilingual and cross-lingual datasets composed of numerous small files in their repositories. Even for total dataset sizes under 10MB, loading could take hours due to significant overhead from managing a high number of network requests and the improper opening and closing of gzipped files. In collaboration with the datasets team (Lhoest et al., 2021), we addressed these problems with improvements on both sides: the datasets library optimized the loading of a large number of requested files, and we restructured the datasets and our codebase to leverage the benefits of the newer implementation. This ultimately reduced loading times by almost a factor of 100, bringing the loading of the largely cross-lingual bitext-mining datasets to under a minute.

Deduplication: Upon in-depth scrutiny of all datasets, cases with repeated samples were identified and deduplicated (e.g. MindSmallReranking). As this led to a change in scores, a second version of the task was introduced to maintain compatible scores with existing benchmarks.
Chunk 100 · 1,999 chars
To move the optimizations to existing MTEB tasks we implement a local cache to avoid encoding a sample twice.

D TASK OVERVIEW

D.1 TASKS

For an overview of all the tasks implemented in MMTEB, we refer to the automatically updated tables in the documentation⁸, which include the available metadata for all of the tasks, including license, task category, domains, etc.

⁸ For the latest version see https://github.com/embeddings-benchmark/mteb/blob/main/docs/tasks.md

D.2 LANGUAGES

Additionally, the top 100 out of the total 1051 languages in ISO 639-3 language codes and their respective task counts are in Table 8.

Table 8: The top 100 languages across all MMTEB tasks in ISO 639-3 language codes and their respective task counts.
ISO Code | Language | Family | BitextMining | Classification | Clustering | InstructionRetrieval | MultilabelClassification | PairClassification | Reranking | Retrieval | STS | Speed | Summarization | Sum
eng | English | Indo-European | 16 | 143 | 16 | 3 | 1 | 8 | 8 | 92 | 13 | 2 | 1 | 303
deu | German | Indo-European | 6 | 14 | 7 | 0 | 1 | 6 | 2 | 18 | 4 | 0 | 0 | 58
fra | French | Indo-European | 7 | 13 | 8 | 0 | 1 | 5 | 3 | 15 | 4 | 0 | 1 | 57
rus | Russian | Indo-European | 5 | 13 | 6 | 0 | 2 | 4 | 2 | 16 | 4 | 0 | 0 | 52
pol | Polish | Indo-European | 4 | 11 | 4 | 0 | 1 | 4 | 0 | 18 | 4 | 0 | 0 | 46
cmn | Mandarin Chinese | Sino-Tibetan | 4 | 10 | 4 | 0 | 0 | 3 | 4 | 10 | 9 | 0 | 0 | 44
spa | Spanish | Indo-European | 4 | 13 | 4 | 0 | 1 | 2 | 2 | 13 | 4 | 0 | 0 | 43
hin | Hindi | Indo-European | 9 | 12 | 2 | 0 | 0 | 1 | 2 | 10 | 2 | 0 | 0 | 38
code | unknown | Programming | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 37 | 0 | 0 | 0 | 37
jpn | Japanese | Japonic | 5 | 8 | 3 | 0 | 0 | 1 | 3 | 13 | 2 | 0 | 0 | 35
kor | Korean | Koreanic | 4 | 8 | 1 | 0 | 1 | 2 | 1 | 9 | 3 | 0 | 0 | 29
ara | Arabic | Afro-Asiatic | 2 | 12 | 0 | 0 | 0 | 2 | 1 | 9 | 2 | 0 | 0 | 28
ben | Bengali | Indo-European | 7 | 9 | 2 | 0 | 0 | 1 | 2 | 6 | 1 | 0 | 0 | 28
ita | Italian | Indo-European | 5 | 9 | 1 | 0 | 1 | 2 | 1 | 5 | 3 | 0 | 0 | 27
por | Portuguese | Indo-European | 4 | 9 | 1 | 0 | 2 | 2 | 1 | 5 | 3 | 0 | 0 | 27
tel | Telugu | Dravidian | 7 | 7 | 2 | 0 | 0 | 0 | 1 | 5 | 2 | 0 | 0 | 24
dan | Danish | Indo-European | 5 | 9 | 2 | 0 | 1 | 0 | 1 | 5 | 0 | 0 | 0 | 23
swe | Swedish | Indo-European | 4 | 8 | 3 | 0 | 1 | 1 | 1 | 4 | 0 | 0 | 0 | 22
ind | Indonesian | Austronesian | 6 | 7 | 1 | 0 | 0 | 1 | 1 | 4 | 1 | 0 | 0 | 21
tam | Tamil | Dravidian | 7 | 7 | 2 | 0 | 0 | 1 | 0 | 3 | 1 | 0 | 0 | 21
tha | Thai | Tai-Kadai | 4 | 8 | 1 | 0 | 0 | 1 | 1 | 6 | 0 | 0 | 0 | 21
mar | Marathi | Indo-European | 7 | 6 | 2 | 0 | 0 | 1 | 0 | 2 | 2 | 0 | 0 | 20
zho | Chinese | Sino-Tibetan | 2 | 2 | 1 | 0 | 0 | 1 | 1 | 13 | 0 | 0 | 0 | 20
fin | Finnish | Uralic | 3 | 5 | 1 | 0 | 1 | 1 | 2 | 5 | 1 | 0 | 0 | 19
kan | Kannada | Dravidian | 6 | 7 | 2 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 19
mal | Malayalam | Dravidian | 7 | 7 | 2 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 19
nld | Dutch | Indo-European | 6 | 6 | 1 | 0 | 1 | 0 | 1 | 2 | 2 | 0 | 0 | 19
nob | Norwegian Bokmål | Unclassified | 4 | 7 | 5 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 19
tur | Turkish | Turkic | 4 | 7 | 1 | 0 | 0 | 2 | 0 | 3 | 2 | 0 | 0 | 19
urd | Urdu | Indo-European | 7 | 8 | 2 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 19
guj | Gujarati | Indo-European | 6 | 6 | 2 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 18
pan | Panjabi | Indo-European | 6 | 6 | 2 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 18
ron | Romanian | Indo-European | 5 | 6 | 1 | 0 | 1 | 0 | 1 | 3 | 1 | 0 | 0 | 18
vie | Vietnamese | Austroasiatic | 5 | 6 | 1 | 0 | 0 | 1 | 0 | 5 | 0 | 0 | 0 | 18
fas | Persian | Indo-European | 1 | 4 | 0 | 0 | 0 | 1 | 2 | 9 | 0 | 0 | 0 | 17
ces | Czech | Indo-European | 4 | 5 | 2 | 0 | 1 | 1 | 1 | 2 | 0 | 0 | 0 | 16
ell | Modern Greek | Indo-European | 3 | 6 | 1 | 0 | 1 | 2 | 0 | 3 | 0 | 0 | 0 | 16
yor | Yoruba | Atlantic-Congo | 4 | 5 | 3 | 0 | 0 | 0 | 1 | 3 | 0 | 0 | 0 | 16
ory | Odia | Indo-European | 5 | 4 | 2 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 15
swa | Swahili | Atlantic-Congo | 1 | 7 | 2 | 0 | 0 | 1 | 1 | 3 | 0 | 0 | 0 | 15
amh | Amharic | Afro-Asiatic | 3 | 6 | 3 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 14
asm | Assamese | Indo-European | 5 | 3 | 2 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 14
hau | Hausa | Afro-Asiatic | 4 | 5 | 3 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 14
bul | Bulgarian | Indo-European | 3 | 4 | 1 | 0 | 1 | 1 | 1 | 2 | 0 | 0 | 0 | 13
jav | Javanese | Austronesian | 4 | 7 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 13
hun | Hungarian | Uralic | 5 | 3 | 1 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 12
ibo | Igbo | Atlantic-Congo | 3 | 5 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 12
slk | Slovak | Indo-European | 3 | 4 | 1 | 0 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 12
heb | Hebrew | Afro-Asiatic | 4 | 5 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 11
afr | Afrikaans | Indo-European | 3 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 10
hrv | Croatian | Indo-European | 4 | 3 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 10
kat | Georgian | Kartvelian | 4 | 3 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 10
san | Sanskrit | Indo-European | 5 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 10
slv | Slovenian | Indo-European | 3 | 4 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 10
xho | Xhosa | Atlantic-Congo | 3 | 3 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 10
hye | Armenian | Indo-European | 3 | 3 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 9
isl | Icelandic | Indo-European | 3 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 9
min | Minangkabau | Austronesian | 3 | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9
mlt | Maltese | Afro-Asiatic | 2 | 2 | 2 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 9
mya | Burmese | Sino-Tibetan | 3 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 9
som | Somali | Afro-Asiatic | 3 | 2 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 9
srp | Serbian | Indo-European | 4 | 1 | 1 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 9
sun | Sundanese | Austronesian | 3 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 9
arb | Standard Arabic | Afro-Asiatic | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 8
cat | Catalan | Indo-European | 3 | 2 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 8
cym | Welsh | Indo-European | 3 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8
est | Estonian | Uralic | 2 | 2 | 1 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 8
eus | Basque | Unclassified | 3 | 2 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 8
kaz | Kazakh | Turkic | 3 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 8
khm | Khmer | Austroasiatic | 3 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 8
kin | Kinyarwanda | Atlantic-Congo | 2 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 8
lin | Lingala | Atlantic-Congo | 2 | 2 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 8
lit | Lithuanian | Indo-European | 4 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 8
lug | Ganda | Atlantic-Congo | 2 | 2 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 8
nno | Norwegian Nynorsk | Unclassified | 4 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8
npi | Nepali | Indo-European | 4 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 8
sna | Shona | Atlantic-Congo | 2 | 2 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 8
snd | Sindhi | Indo-European | 4 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 8
tgl | Tagalog | Austronesian | 3 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 8
tir | Tigrinya | Afro-Asiatic | 2 | 2 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 8
ukr | Ukrainian | Indo-European | 4 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 8
ary | Moroccan Arabic | Afro-Asiatic | 1 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 7
bug | Buginese | Austronesian | 2 | 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7
fao | Faroese | Indo-European | 3 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 7
kir | Kirghiz | Turkic | 2 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7
mai | Maithili | Indo-European | 4 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7
mkd | Macedonian | Indo-European | 3 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7
mni | Manipuri | Sino-Tibetan | 4 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7
pcm | Nigerian Pidgin | Indo-European | 1 | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7
sat | Santali | Austroasiatic | 4 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7
sin | Sinhala | Indo-European | 2 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7
ssw | Swati | Atlantic-Congo | 2 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7
tsn | Tswana | Atlantic-Congo | 2 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7
tso | Tsonga | Atlantic-Congo | 1 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7
uig | Uighur | Turkic | 4 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7
zul | Zulu | Atlantic-Congo | 2 | 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 7
awa | Awadhi | Indo-European | 3 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6
bak | Bashkir | Turkic | 2 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6
bel | Belarusian | Indo-European | 4 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6
bho | Bhojpuri | Indo-European | 2 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 6
bod | Tibetan | Sino-Tibetan | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6
bos | Bosnian | Indo-European | 3 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6
ceb | Cebuano | Austronesian | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6
ckb | Central Kurdish | Indo-European | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6
ilo | Iloko | Austronesian | 2 | 1 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6

D.3 EXAMPLES

Table 9 and Table 10 provide examples for each new task type introduced in MMTEB. For examples of bitext mining, classification, clustering, pair classification, reranking, retrieval, STS, and summarization datasets, we refer to the MTEB paper (Muennighoff et al., 2023b).
Chunk 104 · 1,984 chars
Table 9: Instruction Retrieval examples.
Dataset: Robust04
Query: Who is involved in the Schengen agreement to eliminate border controls in Western Europe and what do they hope to accomplish?
OG Instructions: Relevant documents will contain any information about the actions of signatories of the Schengen agreement such as: measures to eliminate border controls (removal of traffic obstacles, lifting of traffic restrictions); implementation of the information system data bank that contains unified visa issuance procedures; or strengthening of border controls at the external borders of the treaty area in exchange for free movement at the internal borders. Discussions of border crossovers for business purposes are not relevant.
Short query: Find documents that answer this question on Schengen agreement actions.
Relevant Document: ... Schengen Space Concerning the mission traditionally performed by PAF - overseeing border traffic - the new directorate must fit into a Europe of immigration. The interior minister is therefore asking DICILC to step up its control of crossborder traffic, "particularly at the future external borders of the Schengen space." Originally scheduled in February 1994 but constantly postponed, the implementation of the agreements signed in Schengen by nine European countries (the Twelve, minus Great Britain, Ireland, and Denmark), provides for the free circulation of nationals within the space common to the territories of their nine countries...
Chunk 105 · 1,992 chars
E FULL RESULTS

During this work, multiple models were evaluated on more than 500 tasks, with many tasks containing multiple language subsets, covering more than 1000 languages. This makes a comprehensive overview unreasonable. While we have supplied scores aggregated across task categories, we realize that readers might be interested in examining scores for their specific language, domain of interest, and task.

Table 10: Multilabel Classification examples.
Dataset: Maltese News Categories
Text: Hi kellha 82 sena Id-dinja mużikali fl-Italja tinsab f'luttu wara l-mewt tal-attriċi u kantanta popolari Milva, li fis-snin 70 kienet meqjusa "ikona" fost it-Taljani. Milva kienet kisbet suċċess kbir, fl-istess epoka ta' Mina u Ornella Vanoni. Milva ħarġet numru kbir ta' albums tul il-karriera tagħha u ħadet sehem f'Sanremo għal xejn anqas minn 15-il darba; iżda qatt ma rebħet il-festival. Hi kellha 82 sena, u telqet mix-xena tal-ispettaklu eżatt 10 snin ilu.
Label: [ culture(2), international(10) ]

    import mteb
    from mteb.task_selection import results_to_dataframe

    # Select the tasks of interest by type, language, and domain.
    tasks = mteb.get_tasks(
        task_types=["Retrieval"], languages=["eng", "fra"], domains=["legal"]
    )

    model_names = [
        "intfloat/multilingual-e5-small",
        "intfloat/multilingual-e5-base",
        "intfloat/multilingual-e5-large",
    ]
    models = [mteb.get_model_meta(name) for name in model_names]

    # Load the stored results for these models and tasks.
    results = mteb.load_results(models=models, tasks=tasks)

    df = results_to_dataframe(results)

Figure 11: Simple example of how to obtain all scores on English (eng) and French (fra) retrieval tasks within the Legal domain for a set of models.
Chunk 106 · 1,998 chars
To ensure that such aggregation is available and easily accessible, we make all results available in the public and versioned results repository⁹. These results include time of run, evaluation time, a wide set of performance metrics per language subset, CO2 emission, version number, and more. To make these detailed results easy to analyze, we have added functionality for loading and aggregating them within the mteb package. It is, for instance, possible to retrieve the scores for specific models on all English (eng) and French (fra) retrieval tasks within the Legal domain using the code snippet in Figure 11. We refer to the documentation¹⁰ for the latest version of this code.

E.1 PERFORMANCE PER NUMBER OF SPEAKERS

Figure 12: Models' rank on MTEB(Multilingual) by the total number of speakers of a language. Trendlines represent a moving average with a window size of 10.

F NEW METRICS

F.1 ABSTENTION FOR RETRIEVAL AND RERANKING TASKS

In addition to the existing ranking metrics used for Retrieval and Reranking tasks (Muennighoff et al., 2023b), we propose to assess score calibration through the evaluation of model abstention ability, using the implementation of Gisserot-Boukhlef et al. (2024). Intuitively, a model abstains on a given instance $(q, d_1, \dots, d_k)$ (one query and $k$ candidate documents) if $c(q, d_1, \dots, d_k) < \tau$, where $c$ is a confidence function¹¹ and $\tau$ is a threshold regulating abstention likelihood. Therefore, to evaluate abstention capacity on a given test set $S$, one approach is to vary $\tau$ to achieve several abstention rates. In the case of effective abstention, the metric score increases with the abstention rate.
Chunk 107 · 1,997 chars
More formally, models' ability to abstain is evaluated by computing the normalized area under the metric-abstention curve (nAUC). Given a confidence function $c$, a metric function $m$¹² and a labeled test dataset $S$, nAUC is computed as follows:

1. Multi-thresholding: Given a model $f$ and dataset $D$, we define a set of abstention thresholds $\tau_1, \dots, \tau_n$, such that $\tau_1 < \dots < \tau_n$. For each threshold $\tau_i$, we construct a corresponding sub-dataset $S_i \subseteq D$ by applying the abstention criterion. We then evaluate the model $f$ on each sub-dataset $S_i$ using the metric function $m$. To quantify the model's performance across these thresholds, we compute the area under the metric-abstention curve, denoted $\mathrm{AUC}_{model}$.
2. Compute lower-bound: Since $\mathrm{AUC}_{model}$ depends on the model's raw performance without abstention, we compute the effective lower bound $\mathrm{AUC}^{-}$. This corresponds to the area under the curve when the metric remains constant as abstention increases, representing the baseline where abstention does not improve the metric.
3. Compute upper-bound: To establish the upper bound, $\mathrm{AUC}^{+}$, we evaluate an oracle model that has access to the true labels. The oracle can selectively retain the best instances at each abstention rate, yielding the theoretical maximum area under the metric-abstention curve. This represents the optimal model performance under abstention.
4. Compute normalized AUC: Finally, we compute the normalized area under the curve, denoted $\mathrm{nAUC}_{model}$, by scaling $\mathrm{AUC}_{model}$ between the lower and upper bounds:

$$\mathrm{nAUC}_{model} = \frac{\mathrm{AUC}_{model} - \mathrm{AUC}^{-}}{\mathrm{AUC}^{+} - \mathrm{AUC}^{-}}.$$

⁹ https://github.com/embeddings-benchmark/results; for the specific version of the repository used for this work, see commit id 9a79f7e07542ad2f5cb47490fa1e5ac2ba57d7a8
¹⁰ https://github.com/embeddings-benchmark/mteb
¹¹ In our implementation, we rely on three simple confidence functions, all taking the instance's query-document cosine similarity scores as input: the maximum score, the standard deviation of scores, and the difference between the highest and second-highest scores.
¹² We utilize the metrics initially implemented for the evaluation of Retrieval and Reranking MTEB tasks (Muennighoff et al., 2023b).

Table 11: Model name as it appears in the paper, its name on Huggingface Hub, and its associated revision ID. Note: s-t stands for sentence-transformers.
Name in Paper | HF Name | Revision ID
GritLM-7B | GritLM/GritLM-7B | 13f00a0e36500c80ce12870ea513846a066004af
e5-mistral-7b-instruct | intfloat/e5-mistral-7b-instruct | 07163b72af1488142a360786df853f237b1a3ca1
multilingual-e5-base | intfloat/multilingual-e5-base | d13f1b27baf31030b7fd040960d60d909913633f
multilingual-e5-large | intfloat/multilingual-e5-large | 4dc6d853a804b9c8886ede6dda8a073b7dc08a81
multilingual-e5-large-instruct | intfloat/multilingual-e5-large-instruct | baa7be480a7de1539afce709c8f13f833a510e0a
multilingual-e5-small | intfloat/multilingual-e5-small | e4ce9877abf3edfe10b0d82785e83bdcb973e22e
LaBSE | s-t/LaBSE | e34fab64a3011d2176c99545a93d5cbddc9a91b7
all-MiniLM-L12 | s-t/all-MiniLM-L12-v2 | a05860a77cef7b37e0048a7864658139bc18a854
all-MiniLM-L6 | s-t/all-MiniLM-L6-v2 | 8b3219a92973c328a8e22fadcfa821b5dc75636a
all-mpnet-base | s-t/all-mpnet-base-v2 | 84f2bcc00d77236f9e89c8a360a00fb1139bf47d
multilingual-MiniLM-L12 | s-t/paraphrase-multilingual-MiniLM-L12-v2 | bf3bf13ab40c3157080a7ab344c831b9ad18b5eb
multilingual-mpnet-base | s-t/paraphrase-multilingual-mpnet-base-v2 | 79f2382ceacceacdf38563d7c5d16b9ff8d725d6

G MODELS

Models used for task selection along with their revision IDs can be found in Table 11. Code for running the models, including prompts, is available within MTEB's model registry at https://github.com/embeddings-benchmark/mteb/tree/main/mteb/models. Unless otherwise specified within the model implementation, the prompt is available in the file https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/instructions.py. As some debugging happened while running the models, multiple versions of MTEB were used. Due to the computational cost of running these large models on the vast number of datasets, it was deemed infeasible to run all the models using the exact same version. However, for each task, all models were run on the same version of that task. Model results can be found in JSON format in the results repository; these include additional performance metrics, model metadata, CO2 emission, time of run, and the exact version of MTEB used: https://github.com/embeddings-benchmark/results/tree/9a79f7e07542ad2f5cb47490fa1e5ac2ba57d7a8.

H BENCHMARK CONSTRUCTION AND OVERVIEW

H.1 BENCHMARK CREATION

The following section introduces benchmarks created as a part of the MMTEB open contribution which aren't introduced within the main article.
Chunk 110 · 1,999 chars
MTEB additionally includes a variety of benchmarks, including language-specific ones, notably the original English MTEB, MTEB(eng, v2) (Muennighoff et al., 2023b), the Scandinavian embedding benchmark MTEB(Scandinavian) (Enevoldsen et al., 2024), the French benchmark MTEB(fra) (Ciancone et al., 2024), the German benchmark MTEB(deu) (Wehrli et al., 2024), the Korean benchmark MTEB(kor), the Chinese benchmark (Xiao et al., 2024b), and the Polish benchmark MTEB(pol) (Poświata et al., 2024). Along with these, MTEB also includes an instruction-based retrieval benchmark MTEB(FollowIR) (Weller et al., 2024), a benchmark for law MTEB(Law), the bitext section of the MINERS benchmark, MINERSBitextMining, targeted at low-resource languages (Winata et al., 2024b), and the CoIR benchmark for code retrieval, CoIR (Li et al., 2024). For these benchmarks, we refer to their associated papers and pull requests. For an up-to-date overview of maintained benchmarks, please see the benchmark registry.¹³

MTEB(rus) (Snegirev et al., 2024): Although Russian has approximately 258 million speakers worldwide, it was almost completely absent from the original benchmark and represented in only a few multilingual datasets (e.g., MassiveIntentClassification). To address this problem, we included a number of Russian datasets in the new multilingual benchmark. For this, we selected popular, time-tested and community-tested Russian datasets representing the main MMTEB tasks. Additionally, we performed data cleaning and automatic filtering where necessary, and formatted the datasets in the MMTEB format. The final Russian part includes 18 datasets covering 7 main tasks: Classification (7 datasets), Clustering (3 datasets), MultiLabelClassification (2 tasks), PairClassification (1 task), Reranking (1 task), Retrieval (2 tasks), and STS (2 tasks).

¹³ https://github.com/embeddings-benchmark/mteb/blob/main/mteb/benchmarks.py
Chunk 111 · 1,993 chars
This dataset was manually constructed.

RAR-b: The Reasoning as Retrieval Benchmark (RAR-b) (Xiao et al., 2024a) evaluates reasoning-level understanding abilities stored in embedding models, and assesses whether correct answers to reasoning questions can be retrieved as the top matches for queries, both with and without instructions. The benchmark provides insights into whether representations of nuanced expressions are aligned and well-encoded by current embedding models, going beyond the established reliance on evaluating with STS or traditional topical-level IR tasks. The benchmark puts together 17 tasks made from 15 datasets (with reasoning questions from 12 datasets and 3 extra datasets to enlarge the corpus), covering 1) commonsense reasoning: WinoGrande, PIQA, SIQA, αNLI, HellaSwag, ARC-Challenge, Quail, CSTS (Sakaguchi et al., 2021; Bisk et al., 2020; Sap et al., 2019; Bhagavatula et al., 2020; Zellers et al., 2019; Clark et al., 2018; Rogers et al., 2020; Deshpande et al., 2023), 2) temporal reasoning (Tan et al., 2023), 3) spatial reasoning: SpartQA (Mirzaee et al., 2021), 4) numerical reasoning: GSM8K, MATH (Hendrycks et al., 2021b; Cobbe et al., 2021; Yu et al., 2023), and 5) symbolic reasoning: HumanEvalPack and MBPP (Husain et al., 2019; Austin et al., 2021; Chen et al., 2021; Muennighoff et al., 2023a). The comprehensive assessment provides an early checkpoint for abilities envisioned to be necessary for next-generation embedding models (Xiao et al., 2024a).
Chunk 112 ¡ 1,989 chars
MTEB(Europe): We begin by selecting 56 official languages of the European Union, along with languages recognized by Schengen-area countries, such as Norwegian Bokmål, Icelandic, Romani, and Basque. This initial selection results in 420 tasks. We then reduce this selection by filtering out machine-translated datasets, datasets with unclear licenses, and highly specialized datasets (e.g., code retrieval datasets). Additionally, we remove tasks such as AfriSentiClassification, which, while containing European languages, primarily target African or Indic languages. After these exclusions, 228 tasks remain. Next, we run a representative selection of models (see Section [3.1]) and iteratively filter out the most predictable tasks (see Section [2.3.3]). To preserve language diversity and ensure fair representation across task categories, we avoid removing any task if it would eliminate a language from a particular task category. Furthermore, we retain tasks where the mean squared error between predicted and observed performance exceeds 0.5 standard deviations. This process continues until the most predictable task yields a Spearman correlation of less than 0.8 between predicted and observed scores, or until no further tasks can be removed. Ultimately, this results in a final selection of 96 tasks. Finally, contributors proficient in the target languages review the selected tasks, replacing some manually with higher-quality alternatives if necessary.

MTEB(Indic): This benchmark is constructed similarly to the previous European benchmark but focuses on a set of Indic languages (see footnote 14). Initially, we selected 55 tasks. After manual filtering, 44 tasks remain, and following task selection and review, the final benchmark contains 23 tasks.
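The iterative filtering described for MTEB(Europe) can be illustrated with a minimal sketch. It is not the authors' implementation: the predictor used here (the mean score of the remaining tasks) is a simplified stand-in for the prediction procedure of Section 2.3.3, the mean-squared-error retention rule is omitted, and the names `select_tasks`, `spearman`, `rank`, and `protected` are hypothetical. The `protected` set stands in for tasks that must stay because removing them would drop a language from a task category.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def select_tasks(scores, protected, threshold=0.8):
    """Repeatedly drop the most predictable unprotected task.

    scores maps a task name to its per-model scores (same model order).
    Assumption: a task is "predicted" by the mean of the other tasks.
    """
    kept = dict(scores)
    while len(kept) > 2:
        best_task, best_rho = None, None
        for task, observed in kept.items():
            if task in protected:
                continue
            others = [v for name, v in kept.items() if name != task]
            predicted = [mean(column) for column in zip(*others)]
            rho = spearman(predicted, observed)
            if rho >= threshold and (best_rho is None or rho > best_rho):
                best_task, best_rho = task, rho
        if best_task is None:  # most predictable task is below the threshold
            break
        del kept[best_task]
    return sorted(kept)


# Toy scores for five models on four made-up tasks.
toy_scores = {
    "A": [10, 20, 30, 40, 50],
    "B": [11, 21, 31, 41, 51],
    "C": [50, 10, 40, 20, 30],
    "D": [12, 22, 32, 42, 52],
}
print(select_tasks(toy_scores, protected={"A"}))  # ['A', 'C', 'D']
```

In this toy run, task B is removed because the remaining tasks rank the models almost identically, while C carries distinct information and A is protected. The loop then stops because no remaining unprotected task reaches the 0.8 threshold.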
Chunk 113 ¡ 1,999 chars
H.2 BENCHMARK TASK OVERVIEW

The following tables give an overview of the tasks available within the constructed benchmarks. For more information about the specific tasks, we refer to the task metadata available through the mteb package (see footnote 15).

• Table 12 and Table 13: overview of the "MTEB(Multilingual)" benchmark
• Table 14: overview of the "MTEB(Europe)" benchmark
• Table 15: overview of the "MTEB(Indic)" benchmark
• Table 16: overview of the "MTEB(eng, v2)" benchmark
• Table 17: overview of the "MTEB(Code)" benchmark

14 The following iso639-3 codes: asm, awa, ben, bgc, bho, doi, gbm, gom, guj, hin, hne, kan, kas, mai, mal, mar, mni, mup, mwr, nep, npi, ori, ory, pan, raj, san, snd, tam, tel, urd
15 https://github.com/embeddings-benchmark/mteb

Type Name Languages Domains Sample creation Annotations creators Nb samples
BitextMining
BUCC.v2 Zweigenbaum et al. (2017) ['cmn', 'deu', 'eng', ...] ['Written'] human-translated human-annotated 35000
BibleNLPBitextMining Akerman et al. (2023) ['aai', 'aak', 'aau', ...] ['Religious', 'Written'] created expert-annotated 417452
BornholmBitextMining Derczynski & Kjeldsen ['dan'] ['Web', 'Social', 'Fiction', ...] created expert-annotated 500
DiaBlaBitextMining González et al. (2019) ['eng', 'fra'] ['Social', 'Written'] created human-annotated 11496
FloresBitextMining Goyal et al. (2022) ['ace', 'acm', 'acq', ...] ['Non-fiction', 'Encyclopaedic', 'Written'] created human-annotated 41908944
IN22GenBitextMining Gala et al. (2023) ['asm', 'ben', 'brx', ...] ['Web', 'Legal', 'Government', ...] created expert-annotated 518144
Chunk 114 ¡ 1,967 chars
IndicGenBenchFloresBitextMining Singh et al. (2024a) ['asm', 'awa', 'ben', ...] ['Web', 'News', 'Written'] human-translated and localized expert-annotated 116522
NTREXBitextMining Federmann et al. (2022) ['afr', 'amh', 'arb', ...] ['News', 'Written'] human-translated and localized expert-annotated 3826252
NollySentiBitextMining Shode et al. (2023) ['eng', 'hau', 'ibo', ...] ['Social', 'Reviews', 'Written'] found human-annotated 1640
NorwegianCourtsBitextMining Tiedemann & Thottingal (2020) ['nno', 'nob'] ['Legal', 'Written'] found human-annotated 228
NusaTranslationBitextMining Cahyawijaya et al. (2023c) ['abs', 'bbc', 'bew', ...] ['Social', 'Written'] created human-annotated 50200
NusaXBitextMining Winata et al. (2023b) ['ace', 'ban', 'bbc', ...] ['Reviews', 'Written'] created human-annotated 5500
Tatoeba community (2021) ['afr', 'amh', 'ang', ...] ['Written'] found human-annotated 88877
Classification
AfriSentiClassification Muhammad et al. (2023) ['amh', 'arq', 'ary', ...] ['Social', 'Written'] found derived 18222
AmazonCounterfactualClassification O'Neill et al. (2021) ['deu', 'eng', 'jpn'] ['Reviews', 'Written'] found human-annotated 5805
BulgarianStoreReviewSentimentClassfication Georgieva-Trifonova et al. (2018) ['bul'] ['Reviews', 'Written'] found human-annotated 182
CSFDSKMovieReviewSentimentClassification Štefánik et al. (2023) ['slk'] ['Reviews', 'Written'] found derived 2048
CataloniaTweetClassification Zotova et al. (2020) ['cat', 'spa'] ['Social', 'Government', 'Written'] created expert-annotated 8051
Chunk 115 ¡ 1,994 chars
CyrillicTurkicLangClassification Goldhahn et al. (2012) ['bak', 'chv', 'kaz', ...] ['Web', 'Written'] found derived 2048
CzechProductReviewSentimentClassification Habernal et al. (2013) ['ces'] ['Reviews', 'Written'] found derived 2048
DBpediaClassification Zhang et al. (2015) ['eng'] ['Encyclopaedic', 'Written'] found derived 2048
DalajClassification Volodina et al. (2021) ['swe'] ['Non-fiction', 'Written'] created expert-annotated 888
EstonianValenceClassification Pajupuu et al. (2023) ['est'] ['News', 'Written'] found human-annotated 818
FilipinoShopeeReviewsClassification Riego et al. (2023) ['fil'] ['Social', 'Written'] found human-annotated 4096
FinancialPhrasebankClassification Malo et al. (2014) ['eng'] ['News', 'Written', 'Financial'] found expert-annotated 2264
GreekLegalCodeClassification Papaloukas et al. (2021) ['ell'] ['Legal', 'Written'] found human-annotated 4096
GujaratiNewsClassification ['guj'] ['News', 'Written'] found derived 1318
IndicLangClassification Madhani et al. (2023) ['asm', 'ben', 'brx', ...] ['Web', 'Non-fiction', 'Written'] created expert-annotated 30418
IndonesianIdClickbaitClassification William & Sari (2020) ['ind'] ['News', 'Written'] found expert-annotated 2048
IsiZuluNewsClassification Madodonga et al. (2023) ['zul'] ['News', 'Written'] found human-annotated 752
ItaCaseholdClassification Licari et al. (2023) ['ita'] ['Legal', 'Government', 'Written'] found expert-annotated 221
KorSarcasmClassification Kim & Cho (2019) ['kor'] ['Social', 'Written'] found expert-annotated 2048
KurdishSentimentClassification Badawi et al. (2024) ['kur'] ['Web', 'Written'] found derived 1987
Chunk 116 ¡ 1,992 chars
MacedonianTweetSentimentClassification Jovanoski et al. (2015) ['mkd'] ['Social', 'Written'] found human-annotated 1139
MasakhaNEWSClassification Adelani et al. (2023b) ['amh', 'eng', 'fra', ...] ['News', 'Written'] found expert-annotated 6242
MassiveIntentClassification FitzGerald et al. (2022) ['afr', 'amh', 'ara', ...] ['Spoken'] human-translated and localized human-annotated 255357
MultiHateClassification Röttger et al. (2021) ['ara', 'cmn', 'deu', ...] ['Constructed', 'Written'] created expert-annotated 11000
NepaliNewsClassification Arora (2020) ['nep'] ['News', 'Written'] found derived 2048
NordicLangClassification Haas & Derczynski (2021) ['dan', 'fao', 'isl', ...] ['Encyclopaedic'] found derived 3000
NusaParagraphEmotionClassification Cahyawijaya et al. (2023b) ['bbc', 'bew', 'bug', ...] ['Non-fiction', 'Fiction', 'Written'] found human-annotated 5700
NusaX-senti Winata et al. (2022) ['ace', 'ban', 'bbc', ...] ['Reviews', 'Web', 'Social', ...] found expert-annotated 4800
OdiaNewsClassification Kunchukuttan et al. (2020) ['ory'] ['News', 'Written'] found derived 2048
PAC Łukasz Augustyniak et al. (2022) ['pol'] ['Legal', 'Written'] 3453
PoemSentimentClassification Sheng & Uthus (2020) ['eng'] ['Reviews', 'Written'] found human-annotated 209
PolEmo2.0-OUT ['pol'] ['Written', 'Social'] 494
PunjabiNewsClassification Kunchukuttan et al. (2020) ['pan'] ['News', 'Written'] found derived 157
ScalaClassification Nielsen (2023) ['dan', 'nno', 'nob', ...] ['Fiction', 'News', 'Non-fiction', ...] created human-annotated 8192
SentimentAnalysisHindi Parida et al. (2023) ['hin'] ['Reviews', 'Written'] found derived 2048
Chunk 117 ¡ 1,994 chars
SinhalaNewsClassification de Silva (2015) ['sin'] ['News', 'Written'] found derived 2048
SiswatiNewsClassification Madodonga et al. (2023) ['ssw'] ['News', 'Written'] found human-annotated 80
SlovakMovieReviewSentimentClassification Stefánik et al. (2023) ['svk'] ['Reviews', 'Written'] found derived 2048
SwahiliNewsClassification Davis (2020) ['swa'] ['News', 'Written'] found derived 2048
SwissJudgementClassification Niklaus et al. (2022) ['deu', 'fra', 'ita'] ['Legal', 'Written'] found expert-annotated 4908
ToxicConversationsClassification cjadams et al. (2019) ['eng'] ['Social', 'Written'] found human-annotated 2048
TswanaNewsClassification Marivate et al. (2023) ['tsn'] ['News', 'Written'] found derived 487
TweetTopicSingleClassification Antypas et al. (2022) ['eng'] ['Social', 'News', 'Written'] found expert-annotated 1693
Clustering
AlloProfClusteringS2S.v2 Lefebvre-Brossard et al. (2023) ['fra'] ['Encyclopaedic', 'Written'] found human-annotated 2556
ArXivHierarchicalClusteringP2P ['eng'] ['Academic', 'Written'] found derived 2048
ArXivHierarchicalClusteringS2S ['eng'] ['Academic', 'Written'] found derived 2048
BigPatentClustering.v2 Sharma et al. (2019) ['eng'] ['Legal', 'Written'] found derived 2048
BiorxivClusteringP2P.v2 ['eng'] ['Academic', 'Written'] created derived 53787
CLSClusteringP2P.v2 Li et al. (2022) ['cmn'] ['Academic', 'Written'] found derived 2048
HALClusteringS2S.v2 Ciancone et al. (2024) ['fra'] ['Academic', 'Written'] found human-annotated 2048
MasakhaNEWSClusteringS2S Adelani et al. (2023b) ['amh', 'eng', 'fra', ...] None 80
Chunk 118 ¡ 1,986 chars
MedrxivClusteringP2P.v2 ['eng'] ['Academic', 'Medical', 'Written'] created derived 37500
PlscClusteringP2P.v2 ['pol'] ['Academic', 'Written'] found derived 2048
RomaniBibleClustering ['rom'] ['Religious', 'Written'] human-translated and localized derived
SIB200ClusteringS2S Adelani et al. (2023a) ['ace', 'acm', 'acq', ...] ['News', 'Written'] human-translated and localized expert-annotated 197788
SNLHierarchicalClusteringP2P Navjord & Korsvik (2023) ['nob'] ['Encyclopaedic', 'Non-fiction', 'Written'] found derived 1300
StackExchangeClustering.v2 Geigle et al. (2021) ['eng'] ['Web', 'Written'] found derived 2048
SwednClusteringP2P Monsen & Jönsson (2021) ['swe'] ['News', 'Non-fiction', 'Written'] found derived 68752
WikiCitiesClustering Foundation ['eng'] ['Encyclopaedic', 'Written'] found derived
WikiClusteringP2P.v2 ['bos', 'cat', 'ces', ...] ['Encyclopaedic', 'Written'] created derived 28672

Table 12: The tasks included in MTEB(Multilingual) (part 1).

H.3 PERFORMANCE ON MTEB(eng, v2)

Table 18 shows the performance of our representative set of models on MTEB(eng, v2).

H.4 PERFORMANCE ON MTEB(Code)

Table 19 shows the performance of our representative set of models on MTEB(Code).

Type Name Languages Domains Sample creators Annotations creators Nb samples*
InstructionReranking
Core17InstructionRetrieval Weller et al. (2024) ['eng'] ['News', 'Written'] found derived 19939
News21InstructionRetrieval Weller et al. (2024) ['eng'] ['News', 'Written'] found derived 30985
Robust04InstructionRetrieval Weller et al. (2024) ['eng'] ['News', 'Written'] found derived 47596
Chunk 119 ¡ 1,996 chars
MultilabelClassification
BrazilianToxicTweetsClassification Leite et al. (2020) ['por'] ['Constructed', 'Written'] found expert-annotated 2048
CEDRClassification Sboev et al. (2021) ['rus'] ['Web', 'Social', 'Blog', ...] found human-annotated 1882
KorHateSpeechMLClassification Lee et al. (2022) ['kor'] ['Social', 'Written'] found expert-annotated 2037
MalteseNewsClassification Chaudhary et al. (2024) ['mlt'] ['Constructed', 'Written'] found expert-annotated 2297
MultiEURLEXMultilabelClassification Chalkidis et al. (2021) ['bul', 'ces', 'dan', ...] ['Legal', 'Government', 'Written'] found expert-annotated 115000
PairClassification
ArmenianParaphrasePC Malajyan et al. (2020) ['hye'] ['News', 'Written'] found derived 1470
CTKFactsNLI Ullrich et al. (2023) ['ces'] ['News', 'Written'] found human-annotated 680
OpusparcusPC Creutz (2018) ['deu', 'eng', 'fin', ...] ['Spoken', 'Spoken'] created human-annotated 18207
PawsXPairClassification Yang et al. (2019) ['cmn', 'deu', 'eng', ...] ['Web', 'Encyclopaedic', 'Written'] human-translated human-annotated 28000
PpcPC Dadas (2022) ['pol'] ['Fiction', 'Non-fiction', 'Web', ...] found derived 1000
RTE3 Giampiccolo et al. (2007) ['deu', 'eng', 'fra', ...] ['News', 'Web', 'Encyclopaedic', ...] found expert-annotated 1923
SprintDuplicateQuestions Shah et al. (2018) ['eng'] ['Programming', 'Written'] found derived 101000
TERRa Shavrina et al. (2020) ['rus'] ['News', 'Web', 'Written'] found human-annotated 307
TwitterURLCorpus Lan et al. (2017) ['eng'] ['Social', 'Written'] found derived 51534
Chunk 120 ¡ 1,979 chars
XNLI Conneau et al. (2018) ['ara', 'bul', 'deu', ...] ['Non-fiction', 'Fiction', 'Government', ...] created expert-annotated 38220
indonli Mahendra et al. (2021) ['ind'] ['Encyclopaedic', 'Web', 'News', ...] found expert-annotated 2040
Reranking
AlloprofReranking Lefebvre-Brossard et al. (2023) ['fra'] ['Web', 'Academic', 'Written'] found expert-annotated 27355
RuBQReranking Rybin et al. (2021) ['rus'] ['Encyclopaedic', 'Written'] created human-annotated 38998
T2Reranking Xie et al. (2023) ['cmn'] None 103330
VoyageMMarcoReranking Clavié (2023) ['jpn'] ['Academic', 'Non-fiction', 'Written'] found derived 55423
WebLINXCandidatesReranking Lù et al. (2024) ['eng'] ['Academic', 'Web', 'Written'] created expert-annotated 5592142
WikipediaRerankingMultilingual Foundation ['ben', 'bul', 'ces', ...] ['Encyclopaedic', 'Written'] LM-generated and verified LM-generated and reviewed 240000
Retrieval
AILAStatutes Bhattacharya et al. (2020) ['eng'] ['Legal', 'Written'] found derived 82 - 50
ArguAna Boteva et al. (2016) ['eng'] ['Medical', 'Written'] 8674 - 1406
BelebeleRetrieval Bandarkar et al. (2023) ['acm', 'afr', 'als', ...] ['Web', 'News', 'Written'] created expert-annotated 183488 - 338378
CUREv1 ['eng', 'fra', 'spa'] ['Medical', 'Academic', 'Written'] created expert-annotated 1541613 - 12000
CovidRetrieval ['cmn'] None 100001 - 949
HagridRetrieval Kamalloo et al. (2023) ['eng'] ['Encyclopaedic', 'Written'] found expert-annotated 496 - 496
LEMBPasskeyRetrieval Zhu et al. (2024) ['eng'] ['Fiction', 'Written'] found derived 800 - 400
Chunk 121 ¡ 1,999 chars
LegalBenchCorporateLobbying Guha et al. (2023) ['eng'] ['Legal', 'Written'] found derived 319 - 340
MIRACLRetrievalHardNegatives Zhang et al. (2023) ['ara', 'ben', 'deu', ...] ['Encyclopaedic', 'Written'] created expert-annotated 2449382 - 11076
MLQARetrieval Lewis et al. (2019) ['ara', 'deu', 'eng', ...] ['Encyclopaedic', 'Written'] found human-annotated 152379 - 173776
SCIDOCS Cohan et al. (2020b) ['eng'] ['Academic', 'Written', 'Non-fiction'] found 25657 - 1000
SpartQA Xiao et al. (2024a) ['eng'] ['Encyclopaedic', 'Written'] found derived 1592 - 3594
StackOverflowQA Li et al. (2024) ['eng'] ['Programming', 'Written'] found derived 19931 - 1994
StatcanDialogueDatasetRetrieval Lu et al. (2023) ['eng', 'fra'] ['Government', 'Web', 'Written'] found derived 23628 - 9436
TRECCOVID Roberts et al. (2021) ['eng'] ['Medical', 'Academic', 'Written'] 171332 - 50
TempReasonL1 Xiao et al. (2024a) ['eng'] ['Encyclopaedic', 'Written'] found derived 12504 - 4000
TwitterHjerneRetrieval Holm (2024) ['dan'] ['Social', 'Written'] found derived 262 - 78
WikipediaRetrievalMultilingual ['ben', 'bul', 'ces', ...] ['Encyclopaedic', 'Written'] LM-generated and verified LM-generated and reviewed 216000 - 24000
WinoGrande Xiao et al. (2024a) ['eng'] ['Encyclopaedic', 'Written'] found derived 5095 - 1267
STS
FaroeseSTS Snæbjarnarson et al. (2023) ['fao'] ['News', 'Web', 'Written'] found human-annotated 729
FinParaSTS Kanerva et al. (2021) ['fin'] ['News', 'Subtitles', 'Written'] found expert-annotated 2000
GermanSTSBenchmark May (2021) ['deu'] None 2879
IndicCrosslingualSTS Ramesh et al. (2022) ['asm', 'ben', 'eng', ...] ['News', 'Non-fiction', 'Web', ...] created expert-annotated 3072
Chunk 122 ¡ 1,993 chars
JSICK Yanaka & Mineshima (2022) ['jpn'] ['Web', 'Written'] found human-annotated 1986
SICK-R Marelli et al. (2014) ['eng'] ['Web', 'Written'] human-annotated 9927
STS12 Agirre et al. (2012) ['eng'] ['Encyclopaedic', 'News', 'Written'] created human-annotated 3108
STS13 Agirre et al. (2013) ['eng'] ['Web', 'News', 'Non-fiction', ...] created human-annotated 1500
STS14 Bandhakavi et al. (2014) ['eng'] ['Blog', 'Web', 'Spoken'] created derived 3750
STS15 Biçici (2015) ['eng'] ['Blog', 'News', 'Web', ...] created human-annotated 3000
STS17 Cer et al. (2017) ['ara', 'deu', 'eng', ...] ['News', 'Web', 'Written'] created human-annotated 5346
STS22.v2 Chen et al. (2022) ['ara', 'cmn', 'deu', ...] ['News', 'Written'] found human-annotated 3958
STSB Xiao et al. (2024b) ['cmn'] None 2819
STSBenchmark May (2021) ['eng'] ['Blog', 'News', 'Written'] machine-translated and verified human-annotated 1379
STSES Agirre et al. (2015) ['spa'] ['Written'] 155
SemRel24STS Ousidhoum et al. (2024) ['afr', 'amh', 'arb', ...] ['Spoken', 'Written'] created human-annotated 7498

Table 13: The tasks included in MTEB(Multilingual) (part 2). * The number of samples is the total across all included languages; for Retrieval tasks it is given as (number of queries - number of documents).

Type Name Languages Domains Sample creation Annotation creators Nb Samples*
BitextMining
BornholmBitextMining Derczynski & Kjeldsen ['dan'] ['Web', 'Social', 'Fiction', ...] created expert-annotated 500
Chunk 123 ¡ 1,996 chars
BibleNLPBitextMining Akerman et al. (2023) ['aai', 'aak', 'aau', ...] ['Religious', 'Written'] created expert-annotated 417452
BUCC.v2 Zweigenbaum et al. (2017) ['cmn', 'deu', 'eng', ...] ['Written'] human-translated human-annotated 35000
DiaBlaBitextMining González et al. (2019) ['eng', 'fra'] ['Social', 'Written'] created human-annotated 11496
FloresBitextMining Goyal et al. (2022) ['ace', 'acm', 'acq', ...] ['Non-fiction', 'Encyclopaedic', 'Written'] created human-annotated 41908944
NorwegianCourtsBitextMining Tiedemann & Thottingal (2020) ['nno', 'nob'] ['Legal', 'Written'] found human-annotated 228
NTREXBitextMining Federmann et al. (2022) ['afr', 'amh', 'arb', ...] ['News', 'Written'] human-translated and localized expert-annotated 3826252
Classification
BulgarianStoreReviewSentimentClassfication Georgieva-Trifonova et al. (2018) ['bul'] ['Reviews', 'Written'] found human-annotated 182
CzechProductReviewSentimentClassification Habernal et al. (2013) ['ces'] ['Reviews', 'Written'] found derived 2048
GreekLegalCodeClassification Papaloukas et al. (2021) ['ell'] ['Legal', 'Written'] found human-annotated 4096
DBpediaClassification Zhang et al. (2015) ['eng'] ['Encyclopaedic', 'Written'] found derived 2048
FinancialPhrasebankClassification Malo et al. (2014) ['eng'] ['News', 'Written', 'Financial'] found expert-annotated 2264
PoemSentimentClassification Sheng & Uthus (2020) ['eng'] ['Reviews', 'Written'] found human-annotated 209
ToxicChatClassification Lin et al. (2023) ['eng'] ['Constructed', 'Written'] found expert-annotated 1164
Chunk 124 ¡ 1,999 chars
ToxicConversationsClassification cjadams et al. (2019) ['eng'] ['Social', 'Written'] found human-annotated 2048
EstonianValenceClassification Pajupuu et al. (2023) ['est'] ['News', 'Written'] found human-annotated 818
ItaCaseholdClassification Licari et al. (2023) ['ita'] ['Legal', 'Government', 'Written'] found expert-annotated 221
AmazonCounterfactualClassification O'Neill et al. (2021) ['deu', 'eng', 'jpn'] ['Reviews', 'Written'] found human-annotated 5805
MassiveScenarioClassification FitzGerald et al. (2022) ['afr', 'amh', 'ara', ...] ['Spoken'] human-translated and localized human-annotated 255357
MultiHateClassification Röttger et al. (2021) ['ara', 'cmn', 'deu', ...] ['Constructed', 'Written'] created expert-annotated 11000
NordicLangClassification Haas & Derczynski (2021) ['dan', 'fao', 'isl', ...] ['Encyclopaedic'] found derived 3000
ScalaClassification Nielsen (2023) ['dan', 'nno', 'nob', ...] ['Fiction', 'News', 'Non-fiction', ...] created human-annotated 8192
SwissJudgementClassification Niklaus et al. (2022) ['deu', 'fra', 'ita'] ['Legal', 'Written'] found expert-annotated 4908
TweetSentimentClassification Barbieri et al. (2022) ['ara', 'deu', 'eng', ...] ['Social', 'Written'] found human-annotated 2048
CBD Ogrodniczuk & Łukasz Kobyliński (2019) ['pol'] ['Written', 'Social'] found human-annotated 1000
PolEmo2.0-OUT ['pol'] ['Written', 'Social'] 494
CSFDSKMovieReviewSentimentClassification Štefánik et al. (2023) ['slk'] ['Reviews', 'Written'] found derived 2048
DalajClassification Volodina et al. (2021) ['swe'] ['Non-fiction', 'Written'] created expert-annotated 888
Chunk 125 ¡ 1,993 chars
Clustering
WikiCitiesClustering Foundation ['eng'] ['Encyclopaedic', 'Written'] found derived 1
RomaniBibleClustering ['rom'] ['Religious', 'Written'] human-translated and localized derived 4
BigPatentClustering.v2 Sharma et al. (2019) ['eng'] ['Legal', 'Written'] found derived 2048
BiorxivClusteringP2P.v2 ['eng'] ['Academic', 'Written'] created derived 53787
AlloProfClusteringS2S.v2 Lefebvre-Brossard et al. (2023) ['fra'] ['Encyclopaedic', 'Written'] found human-annotated 2556
HALClusteringS2S.v2 Ciancone et al. (2024) ['fra'] ['Academic', 'Written'] found human-annotated 2048
SIB200ClusteringS2S Adelani et al. (2023a) ['ace', 'acm', 'acq', ...] ['News', 'Written'] human-translated and localized expert-annotated 197788
WikiClusteringP2P.v2 ['bos', 'cat', 'ces', ...] ['Encyclopaedic', 'Written'] created derived 28672
Retrieval
StackOverflowQA Li et al. (2024) ['eng'] ['Programming', 'Written'] found derived 19931 - 1994
TwitterHjerneRetrieval Holm (2024) ['dan'] ['Social', 'Written'] found derived 262 - 78
LegalQuAD Hoppe et al. (2021) ['deu'] ['Legal', 'Written'] found derived 200 - 200
ArguAna Boteva et al. (2016) ['eng'] ['Medical', 'Written'] 8674 - 1406
HagridRetrieval Kamalloo et al. (2023) ['eng'] ['Encyclopaedic', 'Written'] found expert-annotated 496 - 496
LegalBenchCorporateLobbying Guha et al. (2023) ['eng'] ['Legal', 'Written'] found derived 319 - 340
LEMBPasskeyRetrieval Zhu et al. (2024) ['eng'] ['Fiction', 'Written'] found derived 800 - 400
SCIDOCS Cohan et al. (2020b) ['eng'] ['Academic', 'Written', 'Non-fiction'] found 25657 - 1000
Chunk 126 ¡ 1,989 chars
SpartQA Xiao et al. (2024a) ['eng'] ['Encyclopaedic', 'Written'] found derived 1592 - 3594
TempReasonL1 Xiao et al. (2024a) ['eng'] ['Encyclopaedic', 'Written'] found derived 12504 - 4000
WinoGrande Xiao et al. (2024a) ['eng'] ['Encyclopaedic', 'Written'] found derived 5095 - 1267
AlloprofRetrieval Lefebvre-Brossard et al. (2023) ['fra'] ['Encyclopaedic', 'Written'] found human-annotated 2556 - 2316
BelebeleRetrieval Bandarkar et al. (2023) ['acm', 'afr', 'als', ...] ['Web', 'News', 'Written'] created expert-annotated 183488 - 338378
StatcanDialogueDatasetRetrieval Lu et al. (2023) ['eng', 'fra'] ['Government', 'Web', 'Written'] found derived 23628 - 9436
WikipediaRetrievalMultilingual ['ben', 'bul', 'ces', ...] ['Encyclopaedic', 'Written'] LM-generated and verified LM-generated and reviewed 216000 - 24000
InstructionReranking
Core17InstructionRetrieval Weller et al. (2024) ['eng'] ['News', 'Written'] found derived 19939
News21InstructionRetrieval Weller et al. (2024) ['eng'] ['News', 'Written'] found derived 30985
Robust04InstructionRetrieval Weller et al. (2024) ['eng'] ['News', 'Written'] found derived 47596
MultilabelClassification
MalteseNewsClassification Chaudhary et al. (2024) ['mlt'] ['Constructed', 'Written'] found expert-annotated 2297
MultiEURLEXMultilabelClassification Chalkidis et al. (2021) ['bul', 'ces', 'dan', ...] ['Legal', 'Government', 'Written'] found expert-annotated 115000
PairClassification
CTKFactsNLI Ullrich et al. (2023) ['ces'] ['News', 'Written'] found human-annotated 680
SprintDuplicateQuestions Shah et al. (2018) ['eng'] ['Programming', 'Written'] found derived 101000
Chunk 127 ¡ 1,998 chars
OpusparcusPC Creutz (2018) ['deu', 'eng', 'fin', ...] ['Spoken', 'Spoken'] created human-annotated 18207
RTE3 Giampiccolo et al. (2007) ['deu', 'eng', 'fra', ...] ['News', 'Web', 'Encyclopaedic', ...] found expert-annotated 1923
XNLI Conneau et al. (2018) ['ara', 'bul', 'deu', ...] ['Non-fiction', 'Fiction', 'Government', ...] created expert-annotated 38220
PSC Ogrodniczuk & Kopeć (2014) ['pol'] ['News', 'Written'] found derived 1078
Reranking
WebLINXCandidatesReranking Lù et al. (2024) ['eng'] ['Academic', 'Web', 'Written'] created expert-annotated 5592142
AlloprofReranking Lefebvre-Brossard et al. (2023) ['fra'] ['Web', 'Academic', 'Written'] found expert-annotated 27355
WikipediaRerankingMultilingual Foundation ['ben', 'bul', 'ces', ...] ['Encyclopaedic', 'Written'] LM-generated and verified LM-generated and reviewed 240000
STS
SICK-R Marelli et al. (2014) ['eng'] ['Web', 'Written'] human-annotated 9927
STS12 Agirre et al. (2012) ['eng'] ['Encyclopaedic', 'News', 'Written'] created human-annotated 3108
STS14 Bandhakavi et al. (2014) ['eng'] ['Blog', 'Web', 'Spoken'] created derived 3750
STS15 Biçici (2015) ['eng'] ['Blog', 'News', 'Web', ...] created human-annotated 3000
STSBenchmark May (2021) ['eng'] ['Blog', 'News', 'Written'] machine-translated and verified human-annotated 1379
FinParaSTS Kanerva et al. (2021) ['fin'] ['News', 'Subtitles', 'Written'] found expert-annotated 2000
STS17 Cer et al. (2017) ['ara', 'deu', 'eng', ...] ['News', 'Web', 'Written'] created human-annotated 5346
SICK-R-PL Dadas et al. (2020) ['pol'] ['Web', 'Written'] human-translated and localized human-annotated 4906
Chunk 128 ¡ 1,996 chars
STSES Agirre et al. (2015) ['spa'] ['Written'] 155

Table 14: The tasks included in MTEB(Europe). The language column shows all the languages of the task. When running the tasks we limit it to the languages specified in the benchmark. * The number of samples is the total across all included languages; for Retrieval tasks it is given as (number of queries - number of documents).

Type Name Languages Domains Sample creation Annotation creators Nb samples*
BitextMining
IN22ConvBitextMining Gala et al. (2023) ['asm', 'ben', 'brx', ...] ['Social', 'Spoken', 'Fiction', ...] created expert-annotated 760518
IN22GenBitextMining Gala et al. (2023) ['asm', 'ben', 'brx', ...] ['Web', 'Legal', 'Government', ...] created expert-annotated 518144
IndicGenBenchFloresBitextMining Singh et al. (2024a) ['asm', 'awa', 'ben', ...] ['Web', 'News', 'Written'] human-translated and localized expert-annotated 116522
LinceMTBitextMining Aguilar et al. (2020) ['eng', 'hin'] ['Social', 'Written'] found human-annotated 8059
Classification
BengaliSentimentAnalysis Sazzed (2020) ['ben'] ['Reviews', 'Written'] found human-annotated 2048
GujaratiNewsClassification ['guj'] ['News', 'Written'] found derived 1318
HindiDiscourseClassification Dhanwal et al. (2020) ['hin'] ['Fiction', 'Social', 'Written'] found expert-annotated 2048
IndicLangClassification Madhani et al. (2023) ['asm', 'ben', 'brx', ...] ['Web', 'Non-fiction', 'Written'] created expert-annotated 30418
MTOPIntentClassification Li et al. (2021) ['deu', 'eng', 'fra', ...] ['Spoken', 'Spoken'] created human-annotated 30517
| Classification | MalayalamNewsClassification Kunchukuttan et al. (2020) | ['mal'] | ['News', 'Written'] | found | derived | 1260 |
| Classification | MultiHateClassification Röttger et al. (2021) | ['ara', 'cmn', 'deu', ...] | ['Constructed', 'Written'] | created | expert-annotated | 11000 |
| Classification | NepaliNewsClassification Arora (2020) | ['nep'] | ['News', 'Written'] | found | derived | 2048 |
| Classification | PunjabiNewsClassification Kunchukuttan et al. (2020) | ['pan'] | ['News', 'Written'] | found | derived | 157 |
| Classification | SanskritShlokasClassification Arora (2020) | ['san'] | ['Religious', 'Written'] | found | derived | 479 |
| Classification | SentimentAnalysisHindi Parida et al. (2023) | ['hin'] | ['Reviews', 'Written'] | found | derived | 2048 |
| Classification | TweetSentimentClassification Barbieri et al. (2022) | ['ara', 'deu', 'eng', ...] | ['Social', 'Written'] | found | human-annotated | 2048 |
| Classification | UrduRomanSentimentClassification Sharf (2018) | ['urd'] | ['Social', 'Written'] | found | derived | 2048 |
| Clustering | SIB200ClusteringS2S Adelani et al. (2023a) | ['ace', 'acm', 'acq', ...] | ['News', 'Written'] | human-translated and localized | expert-annotated | 197788 |
| PairClassification | XNLI Conneau et al. (2018) | ['ara', 'bul', 'deu', ...] | ['Non-fiction', 'Fiction', 'Government', ...] | created | expert-annotated | 38220 |
| Reranking | WikipediaRerankingMultilingual Foundation | ['ben', 'bul', 'ces', ...] | ['Encyclopaedic', 'Written'] | LM-generated and verified | LM-generated and reviewed | 240000 |
| Retrieval | BelebeleRetrieval Bandarkar et al. (2023) | ['acm', 'afr', 'als', ...] | ['Web', 'News', 'Written'] | created | expert-annotated | 183488 - 338378 |
| Retrieval | XQuADRetrieval Artetxe et al. (2019) | ['arb', 'deu', 'ell', ...] | ['Web', 'Written'] | created | human-annotated | 2880 - 14199 |
| STS | IndicCrosslingualSTS Ramesh et al. (2022) | ['asm', 'ben', 'eng', ...] | ['News', 'Non-fiction', 'Web', ...] | created | expert-annotated | 3072 |

Table 15: The tasks included in MTEB(Indic). The Languages column lists all languages of the task; when running the tasks, we limit them to the Indic languages specified in the benchmark. * Nb samples is the total number of samples across all included languages; for Retrieval tasks, it is given as (number of queries - number of documents).

| Type | Name | Languages | Domains | Sample creation | Annotation creators | Nb samples* |
|---|---|---|---|---|---|---|
| Classification | AmazonCounterfactualClassification O'Neill et al. (2021) | ['deu', 'eng', 'jpn'] | ['Reviews', 'Written'] | found | human-annotated | 5805 |
| Classification | Banking77Classification Casanueva et al. (2020) | ['eng'] | ['Written'] | found | human-annotated | 3080 |
| Classification | ImdbClassification Maas et al. (2011) | ['eng'] | ['Reviews', 'Written'] | found | derived | 25000 |
| Classification | MTOPDomainClassification Li et al. (2021) | ['deu', 'eng', 'fra', ...] | ['Spoken', 'Spoken'] | created | human-annotated | 30517 |
| Classification | MassiveIntentClassification FitzGerald et al. (2022) | ['afr', 'amh', 'ara', ...] | ['Spoken'] | human-translated and localized | human-annotated | 255357 |
| Classification | MassiveScenarioClassification FitzGerald et al. (2022) | ['afr', 'amh', 'ara', ...] | ['Spoken'] | human-translated and localized | human-annotated | 255357 |
| Classification | ToxicConversationsClassification cjadams et al. (2019) | ['eng'] | ['Social', 'Written'] | found | human-annotated | 2048 |
| Classification | TweetSentimentExtractionClassification Maggie (2020) | ['eng'] | ['Social', 'Written'] | found | human-annotated | 3534 |
| Clustering | ArXivHierarchicalClusteringP2P | ['eng'] | ['Academic', 'Written'] | found | derived | 2048 |
| Clustering | ArXivHierarchicalClusteringS2S | ['eng'] | ['Academic', 'Written'] | found | derived | 2048 |
| Clustering | BiorxivClusteringP2P.v2 | ['eng'] | ['Academic', 'Written'] | created | derived | 53787 |
| Clustering | MedrxivClusteringP2P.v2 | ['eng'] | ['Academic', 'Medical', 'Written'] | created | derived | 37500 |
| Clustering | MedrxivClusteringS2S.v2 | ['eng'] | ['Academic', 'Medical', 'Written'] | created | derived | 37500 |
| Clustering | StackExchangeClustering.v2 Geigle et al. (2021) | ['eng'] | ['Web', 'Written'] | found | derived | 2048 |
| Clustering | StackExchangeClusteringP2P.v2 Geigle et al. (2021) | ['eng'] | ['Web', 'Written'] | found | derived | 74914 |
| Clustering | TwentyNewsgroupsClustering.v2 Lang (1995) | ['eng'] | ['News', 'Written'] | found | derived | 59545 |
| PairClassification | SprintDuplicateQuestions Shah et al. (2018) | ['eng'] | ['Programming', 'Written'] | found | derived | 101000 |
| PairClassification | TwitterSemEval2015 Xu et al. (2015) | ['eng'] | ['Social', 'Written'] | found | human-annotated | 16777 |
| PairClassification | TwitterURLCorpus Lan et al. (2017) | ['eng'] | ['Social', 'Written'] | found | derived | 51534 |
| Reranking | AskUbuntuDupQuestions Wang et al. (2021a) | ['eng'] | ['Programming', 'Web'] | found | human-annotated | 7581 |
| Reranking | MindSmallReranking Wu et al. (2020a) | ['eng'] | ['News', 'Written'] | found | expert-annotated | 2367791 |
| Retrieval | ArguAna Boteva et al. (2016) | ['eng'] | ['Medical', 'Written'] | | | 8674 - 1406 |
| Retrieval | CQADupstackGamingRetrieval Hoogeveen et al. (2015) | ['eng'] | ['Web', 'Written'] | found | derived | 45301 - 1595 |
| Retrieval | CQADupstackUnixRetrieval Hoogeveen et al. (2015) | ['eng'] | ['Written', 'Web', 'Programming'] | found | derived | 47382 - 1072 |
| Retrieval | ClimateFEVERHardNegatives Diggelmann et al. (2021) | ['eng'] | ['Encyclopaedic', 'Written'] | found | human-annotated | 47416 - 1000 |
| Retrieval | FEVERHardNegatives Thorne et al. (2018a) | ['eng'] | None | | | 163698 - 1000 |
| Retrieval | FiQA2018 Thakur et al. (2021) | ['eng'] | ['Written', 'Financial'] | found | human-annotated | 57638 - 648 |
| Retrieval | HotpotQAHardNegatives Yang et al. (2018) | ['eng'] | ['Web', 'Written'] | found | human-annotated | 225621 - 1000 |
| Retrieval | SCIDOCS Cohan et al. (2020b) | ['eng'] | ['Academic', 'Written', 'Non-fiction'] | found | | 25657 - 1000 |
| Retrieval | TRECCOVID Roberts et al. (2021) | ['eng'] | ['Medical', 'Academic', 'Written'] | | | 171332 - 50 |
| Retrieval | Touche2020Retrieval.v3 Thakur et al. (2024) | ['eng'] | ['Academic'] | found | human-annotated | 303732 - 49 |
| STS | BIOSSES Soğancıoğlu et al. (2017) | ['eng'] | ['Medical'] | found | derived | 100 |
| STS | SICK-R Marelli et al. (2014) | ['eng'] | ['Web', 'Written'] | | human-annotated | 9927 |
| STS | STS12 Agirre et al. (2012) | ['eng'] | ['Encyclopaedic', 'News', 'Written'] | created | human-annotated | 3108 |
| STS | STS13 Agirre et al. (2013) | ['eng'] | ['Web', 'News', 'Non-fiction', ...] | created | human-annotated | 1500 |
| STS | STS14 Bandhakavi et al. (2014) | ['eng'] | ['Blog', 'Web', 'Spoken'] | created | derived | 3750 |
| STS | STS15 Biçici (2015) | ['eng'] | ['Blog', 'News', 'Web', ...] | created | human-annotated | 3000 |
| STS | STS17 Cer et al. (2017) | ['ara', 'deu', 'eng', ...] | ['News', 'Web', 'Written'] | created | human-annotated | 5346 |
| STS | STS22.v2 Chen et al. (2022) | ['ara', 'cmn', 'deu', ...] | ['News', 'Written'] | found | human-annotated | 3958 |
| STS | STSBenchmark May (2021) | ['eng'] | ['Blog', 'News', 'Written'] | machine-translated and verified | human-annotated | 1379 |
| Summarization | SummEvalSummarization.v2 Fabbri et al. (2020) | ['eng'] | ['News', 'Written'] | created | human-annotated | 100 |
Table 16: The tasks included in MTEB(eng, v2). The Languages column lists all languages of the task; when running the tasks, we limit them to the languages specified in the benchmark. * Nb samples is the total number of samples across all included languages; for Retrieval tasks, it is given as (number of queries - number of documents).

| Type | Name | Languages | Domains | Sample creation | Annotation creators | Nb samples* |
|---|---|---|---|---|---|---|
| Retrieval | AppsRetrieval Hendrycks et al. (2021a) | ['eng', 'python'] | ['Programming', 'Written'] | found | derived | 3765 - 8765 |
| Retrieval | COIRCodeSearchNetRetrieval Husain et al. (2019) | ['go', 'java', 'javascript', 'php'] | ['Programming', 'Written'] | found | derived | 52561 - 1003765 |
| Retrieval | CodeEditSearchRetrieval Muennighoff et al. (2023a) | ['c', 'c++', 'go', 'java'] | ['Programming', 'Written'] | found | derived | 13000 - 13000 |
| Retrieval | CodeFeedbackMT Zheng et al. (2024) | ['eng'] | ['Programming', 'Written'] | found | derived | 13277 - 66383 |
| Retrieval | CodeFeedbackST Li et al. (2024) | ['eng'] | ['Programming', 'Written'] | found | derived | 31306 - 156526 |
| Retrieval | CodeSearchNetCCRetrieval Li et al. (2024) | ['go', 'java', 'javascript', 'php'] | ['Programming', 'Written'] | found | derived | 52561 - 1005474 |
| Retrieval | CodeSearchNetRetrieval Husain et al. (2019) | ['go', 'java', 'javascript', 'php'] | ['Programming', 'Written'] | found | derived | 6000 - 6000 |
| Retrieval | CodeTransOceanContest Yan et al. (2023) | ['c++', 'python'] | ['Programming', 'Written'] | found | derived | 221 - 1008 |
| Retrieval | CodeTransOceanDL Yan et al. (2023) | ['python'] | ['Programming', 'Written'] | found | derived | 180 - 816 |
| Retrieval | CosQA Huang et al. (2021) | ['eng', 'python'] | ['Programming', 'Written'] | found | derived | 500 - 20604 |
| Retrieval | StackOverflowQA Li et al. (2024) | ['eng'] | ['Programming', 'Written'] | found | derived | 1994 - 19931 |
| Retrieval | SyntheticText2SQL Meyer et al. (2024) | ['eng', 'sql'] | ['Programming', 'Written'] | found | derived | 5851 - 105851 |
Table 17: The tasks included in MTEB(Code). * Nb samples is the total number of samples across all included languages; for Retrieval tasks, it is given as (number of queries - number of documents).

| Model | Rank (Borda count) | Avg. across: All | Avg. across: Category | Avg. by category: Pair Clf. | Clf. | STS | Retrieval | Clustering | Reranking |
|---|---|---|---|---|---|---|---|---|---|
| e5-mistral-7b-instruct | 1 (393) | 67.0 | 67.2 | 88.4 | 75.2 | 83.6 | 54.8 | 51.4 | 49.8 |
| GritLM-7B | 2 (384) | 66.4 | 66.7 | 87.3 | 77.0 | 82.5 | 53.2 | 50.8 | 49.6 |
| multilingual-e5-large-instruct | 3 (357) | 65.2 | 65.6 | 86.2 | 73.2 | 84.3 | 51.0 | 49.9 | 48.7 |
| multilingual-e5-large | 4 (270) | 62.1 | 62.4 | 84.7 | 72.8 | 80.6 | 49.0 | 42.8 | 44.7 |
| all-mpnet-base-v2 | 5 (211) | 56.0 | 58.1 | 83.0 | 56.6 | 72.2 | 41.9 | 46.6 | 48.4 |
| multilingual-e5-base | 6 (211) | 60.2 | 60.9 | 83.6 | 70.0 | 79.1 | 46.1 | 42.2 | 44.3 |
| paraphrase-multilingual-mpnet-base-v2 | 7 (188) | 57.3 | 58.8 | 81.7 | 68.6 | 79.8 | 34.1 | 43.5 | 45.2 |
| all-MiniLM-L12-v2 | 8 (172) | 54.7 | 57.0 | 82.5 | 55.8 | 70.7 | 40.7 | 44.6 | 47.5 |
| all-MiniLM-L6-v2 | 9 (149) | 54.4 | 56.7 | 82.4 | 55.4 | 70.4 | 39.8 | 44.9 | 47.1 |
| multilingual-e5-small | 10 (147) | 58.4 | 59.3 | 82.7 | 67.7 | 77.6 | 43.7 | 40.8 | 43.2 |
| paraphrase-multilingual-MiniLM-L12-v2 | 11 (109) | 55.1 | 57.0 | 80.0 | 64.4 | 77.5 | 32.8 | 41.7 | 45.4 |
| LaBSE | 12 (49) | 48.6 | 51.7 | 78.9 | 66.8 | 70.2 | 16.8 | 36.1 | 41.3 |

Table 18: Performance on MTEB(eng, v2) across task categories.

| Model | Rank (Borda count) | Avg. across: All | Avg. by language: C++ | Go | Java | JavaScript | PHP | Python | Ruby |
|---|---|---|---|---|---|---|---|---|---|
| GritLM-7B | 1 (88) | 73.6 | 73.1 | 83.8 | 84.9 | 81.7 | 77.8 | 86.4 | 83.8 |
| e5-mistral-7b-instruct | 2 (74) | 69.2 | 68.3 | 83.0 | 80.9 | 79.4 | 75.6 | 83.6 | 81.1 |
| multilingual-e5-large-instruct | 3 (65) | 65.0 | 56.4 | 74.7 | 74.7 | 71.7 | 71.6 | 79.1 | 74.9 |
| multilingual-e5-large | 4 (63) | 61.7 | 46.8 | 73.4 | 72.2 | 66.6 | 69.1 | 75.7 | 73.4 |
| multilingual-e5-base | 5 (55) | 57.5 | 48.9 | 73.2 | 71.0 | 66.1 | 67.8 | 75.2 | 72.7 |
| multilingual-e5-small | 6 (53) | 58.4 | 48.4 | 70.6 | 67.9 | 65.2 | 66.6 | 73.6 | 68.1 |
| all-mpnet-base-v2 | 7 (44) | 56.4 | 46.3 | 67.4 | 62.2 | 63.1 | 61.7 | 69.0 | 65.7 |
| all-MiniLM-L6-v2 | 8 (34) | 52.7 | 48.1 | 64.4 | 57.4 | 62.2 | 60.4 | 68.1 | 66.6 |
| all-MiniLM-L12-v2 | 9 (27) | 50.2 | 46.8 | 68.1 | 57.3 | 63.6 | 62.7 | 68.7 | 67.8 |
| LaBSE | 10 (11) | 28.8 | 27.6 | 40.6 | 36.6 | 42.3 | 34.8 | 43.9 | 42.2 |

Table 19: Performance on MTEB(Code) across task categories. Because all code-related tasks are for retrieval, metrics by category are omitted.
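The rank columns in Tables 18 and 19 report a Borda count in parentheses. As a rough illustration of how such a count can be derived from per-task scores, the sketch below treats each task as one voter. The tie rule (half a point per tied opponent), the function names, and the toy scores are all assumptions made for this example; they are not taken from the paper and do not reproduce the table values.

```python
def borda_count(scores_by_task):
    """Return {model: points}, treating each task as one voter.

    Assumed rule (illustrative, not the paper's exact procedure): on every
    task a model earns 1 point per model it strictly beats and 0.5 points
    per model it ties with.
    """
    points = {}
    for task_scores in scores_by_task.values():
        for model, score in task_scores.items():
            points.setdefault(model, 0.0)
            for other, other_score in task_scores.items():
                if other == model:
                    continue
                if score > other_score:
                    points[model] += 1.0
                elif score == other_score:
                    points[model] += 0.5
    return points


def rank_models(scores_by_task):
    """Order models by descending Borda points, breaking ties by name."""
    points = borda_count(scores_by_task)
    return sorted(points, key=lambda m: (-points[m], m))


# Hypothetical scores for three made-up models on three made-up tasks.
toy_scores = {
    "task_a": {"model_x": 0.80, "model_y": 0.70, "model_z": 0.60},
    "task_b": {"model_x": 0.50, "model_y": 0.65, "model_z": 0.40},
    "task_c": {"model_x": 0.90, "model_y": 0.30, "model_z": 0.30},
}

print(borda_count(toy_scores))
print(rank_models(toy_scores))
```

Because the count depends only on the ordering within each task, a model can rank above another with a higher mean score, as rows 5 and 6 of Table 18 show.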