VietJobs: A Vietnamese Job Advertisement Dataset
Hieu Pham Dinh, Hung Nguyen Huy, Mo El-Haj
College of Engineering and Computer Science, VinUniversity
{24hieu.pd, 25hung.nh, elhaj.m}@vinuni.edu.vn
Abstract
VietJobs is the first large-scale, publicly available corpus of Vietnamese job advertisements, comprising 48,092
postings and over 15 million words collected from all 34 provinces and municipalities across Vietnam. The
dataset provides extensive linguistic and structured information, including job titles, categories, salaries, skills, and
employment conditions, covering 16 occupational domains and multiple employment types (full-time, part-time, and
internship). Designed to support research in natural language processing and labour market analytics, VietJobs
captures substantial linguistic, regional, and socio-economic diversity. We benchmark several generative large
language models (LLMs) on two core tasks: job category classification and salary estimation. Instruction-tuned
models such as Qwen2.5-7B-Instruct and Llama-SEA-LION-v3-8B-IT demonstrate notable gains under few-shot and
fine-tuned settings, while highlighting challenges in multilingual and Vietnamese-specific modelling for structured
labour market prediction. VietJobs establishes a new benchmark for Vietnamese NLP and offers a valuable
foundation for future research on recruitment language, socio-economic representation, and AI-driven labour market
analysis. All code and resources are available at: https://github.com/VinNLP/VietJobs.
Keywords: Vietnamese NLP, job advertisements, large language models, job classification, salary estimation, low-resource language, dataset creation, labour market analysis
1. Introduction
Vietnam's labour market has expanded alongside digital transformation, with online recruitment platforms such as TopCV playing an increasingly central role in connecting job seekers and employers. Recent research highlights how such platforms both reflect and reproduce existing social norms. For instance, explicit references to gender, age, and physical appearance remain common in Vietnamese job postings, influencing wage offers and perceptions of employability (Perroni et al., 2023; Packard, 2006). Studies have also shown that job advertisements may privilege particular demographic or aesthetic traits, such as youth or attractiveness, contributing to differentiated opportunities across groups (Perroni et al., 2023). Despite these observations, the linguistic and structural features of Vietnamese recruitment language remain underexplored, in part due to the limited availability of large, well-annotated, and publicly accessible datasets for computational analysis (Tran et al., 2022).

While previous research on recruitment language has focused mainly on English and other high-resource languages, Vietnamese remains comparatively under-resourced for NLP. The language's tonal structure, compounding morphology, and frequent code-switching with English present additional challenges for tokenisation, normalisation, and semantic interpretation (Bonoli and Hinrichs, 2012). This scarcity of data and tools has constrained the development of domain-specific and socially informed NLP applications in Vietnamese. Recent advances in NLP for Vietnamese, including work on job description analysis and fake job detection, have begun to address these challenges, but large, linguistically representative resources remain limited (Vu et al., 2025; Tran et al., 2022).

This paper introduces VietJobs, the first large-scale, open-access dataset of Vietnamese job advertisements. Comprising 48,092 postings collected nationwide,
VietJobs integrates linguistic, demographic, and occupational information to provide a structured and diverse view of Vietnam's online labour market. The dataset supports a wide range of NLP and analytical studies, including linguistic characterisation of recruitment language and benchmarking of state-of-the-art large language models (LLMs) for job category classification and salary estimation under zero-shot, few-shot, and fine-tuned settings (Tran et al., 2022; Vu et al., 2025; Otani et al., 2024). Through these contributions, VietJobs establishes a foundational resource for Vietnamese NLP and labour market analysis. It bridges computational and socio-economic perspectives, offering a platform for studying how language represents professional, educational, and regional variation in recruitment contexts, and supporting future research on language, employment, and AI applications in Southeast Asia (Perroni et al., 2023; Bonoli and Hinrichs, 2012; Otani et al., 2024).

arXiv:2603.05262v1 [cs.CL] 5 Mar 2026

2. Related Works

Job advertisements have become a key resource for computational research on labour markets and recruitment language (Vogt et al., 2023). Advances in NLP now enable large-scale analyses of job postings to identify hiring trends, examine linguistic patterns, and estimate salaries or job categories (Dawson et al., 2020). However, most existing studies focus on English and other high-resource languages, leaving low- and mid-resource contexts such as Vietnamese largely underrepresented.

Several corpora have supported empirical research on recruitment language. The Adzuna Global Job Listings dataset
(Karunarathna, 2025) contains over 17,000 English job postings with metadata on compensation and contract type, while the Djinni Recruitment Dataset (Drushchak and Romanyshyn, 2024) includes 150,000 jobs and 230,000 anonymised candidate profiles in English and Ukrainian. Research on recruitment language increasingly explores how wording reflects social and cultural norms in hiring contexts. Studies have examined gendered, age-related, and appearance-based phrasing using approaches ranging from lexicon-based methods to transformer models such as RoBERTa (Sharma, 2025), which perform strongly in identifying recurrent linguistic patterns in IT and STEM job markets (Kanij et al., 2024). Work in low-resource languages remains limited, though multilingual initiatives such as the AraJobs corpus for Arabic job advertisements (El-Haj, 2025) highlight the value of culturally grounded datasets for analysing recruitment discourse.

VietJobs builds upon this direction by introducing a large-scale, linguistically diverse Vietnamese dataset that enables analysis of language use and representation in Southeast Asian labour markets. Another resource, the Vietnam Jobs Dataset available on Kaggle (Nguyen, 2025), focuses primarily on Vietnamese job titles without including full textual descriptions, which limits its use for NLP-based analysis. Nevertheless, we employ this dataset for comparison purposes, as discussed later in Section 4.

Salary prediction has also become a key focus of computational labour market research. Prior work has applied regression, ensemble learning, and deep neural models to estimate compensation, showing that variables such as job
title, company profile, location, and skills are strong predictors (Pluijmaekers and Lelli, 2022; Bana, 2022). Many datasets include incomplete or inconsistent salary information, limiting model reliability (Alsheyab et al., 2025; El-Haj, 2025). VietJobs mitigates this by including explicit minimum, maximum, and average salary fields in over 70% of postings, enabling consistent benchmarking for predictive and economic analyses.

Overall, VietJobs extends the linguistic and socio-economic scope of existing Vietnamese resources and complements regionally focused datasets such as AraJobs, SkillSpan, JobSkape, and Djinni (El-Haj, 2025; Zhang et al., 2022; Magron et al., 2024; Drushchak and Romanyshyn, 2024). By capturing the linguistic and cultural specificities of Vietnam, it provides a robust resource for context-sensitive modelling and contributes to the broader advancement of low-resource NLP.

3. VietJobs Dataset

3.1. Data Collection and Corpus Overview

The VietJobs¹ dataset was compiled from publicly accessible online recruitment platforms in Vietnam during July 2025. Data were collected using the open-source Crawl4AI framework (UncleCode, 2024), combined with LLM-assisted parsing through GPT-4o (Hurst et al., 2024) and Gemini 2.5 (Comanici et al., 2025), which enabled structured extraction from diverse HTML templates while preserving the linguistic integrity of the text. Access to GPT-4o and Gemini 2.5 was provided through API integration, enabling scalable and efficient parsing and information extraction. Both models were guided using a detailed, task-specific prompt to ensure consistent and schema-compliant outputs.
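The schema-guided parsing step can be illustrated with a minimal sketch: a prompt template that requests strict JSON and a validator that rejects non-compliant model outputs. The field list, prompt wording, and function names below are assumptions for illustration, not the paper's actual pipeline.

```python
import json

# Hypothetical extraction schema; the paper's exact field inventory is not given.
REQUIRED_FIELDS = [
    "job_title", "job_category", "salary_min", "salary_max",
    "location", "contract_type", "experience_required",
]

def build_extraction_prompt(page_text: str) -> str:
    """Build a task-specific prompt asking the LLM for schema-compliant JSON."""
    return (
        "Extract the following fields from the Vietnamese job advertisement "
        "below and answer with a single JSON object using exactly these keys: "
        + ", ".join(REQUIRED_FIELDS)
        + ". Keep the original Vietnamese text of each field and use null when "
        "a field is absent.\n\nAdvertisement:\n" + page_text
    )

def parse_llm_response(raw: str):
    """Validate model output: must be JSON carrying exactly the schema keys."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None  # non-JSON output is discarded
    if not isinstance(record, dict) or set(record) != set(REQUIRED_FIELDS):
        return None  # missing or extra keys violate the schema
    return record
```

Rejecting any response whose key set deviates from the schema is one simple way to keep extractions consistent across heterogeneous HTML templates.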
All collection procedures were conducted ethically and in compliance with national and institutional regulations, as detailed in Section 7. The crawling process involved two stages: (1) initial URL acquisition, which took approximately 4-5 hours, and (2) full-page crawling and LLM-based information extraction, which required over 7 days. The resulting corpus captures detailed linguistic and structural information from job advertisements representing all 34 provinces and municipalities of Vietnam, with the highest posting volumes in Hanoi and Ho Chi Minh City. A statistical overview of the dataset is presented in Table 1.

With over 15 million words and a vocabulary exceeding 78,000 unique tokens, VietJobs provides an extensive and representative sample of contemporary Vietnamese recruitment discourse. Its geographical and occupational diversity supports both linguistic and computational analyses, enabling the study of bias, code-switching, and socio-economic variation across regions and industries. The unified occupational taxonomy, adapted from ISCO-08, O*NET, and ESCO standards, ensures analytical consistency and facilitates downstream applications such as job classification, salary estimation, and fairness-aware modelling.

¹ https://github.com/VinNLP/VietJobs
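Summary statistics of the kind reported in Table 1 can be derived with a simple pass over the corpus. This is a sketch only: the paper does not specify its tokenisation, and the ASCII-only heuristic for flagging English tokens is an illustrative assumption, not the authors' method.

```python
import re
from collections import Counter

def corpus_stats(postings: list) -> dict:
    """Whitespace-level token statistics over a list of posting texts."""
    tokens = [tok for text in postings for tok in text.lower().split()]
    # Crude heuristic: treat ASCII-only alphabetic words as English candidates.
    # Vietnamese syllables that happen to be pure ASCII will be miscounted.
    english = [t for t in tokens if re.fullmatch(r"[a-z]+", t)]
    return {
        "total_postings": len(postings),
        "total_tokens": len(tokens),
        "avg_tokens_per_posting": len(tokens) / max(len(postings), 1),
        "vocabulary_size": len(Counter(tokens)),
        "english_token_share": len(english) / max(len(tokens), 1),
    }
```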
Statistic                                Value
General Information
  Total job postings                     48,092
  Total tokens (words)                   15,429,581
  Average tokens per posting             321 (mean)
  Vocabulary size                        78,002
  Proportion of English tokens           0.32% (49,375 words)
  Distinct job categories                16
  Collection duration                    1 week
Geographical Coverage
  Regions covered                        34 provinces
  Top regions                            Hanoi, Ho Chi Minh City
Salary Information
  Salary range (min-max)                 1-500M VND/month
  Median (min / avg / max)               10 / 13 / 15M VND
Textual Characteristics
  Longest postings                       IT & Digital Engineering (354.8 words)
  Shortest postings                      Languages & Translation (275.1 words)
Employment Characteristics
  Contract types                         Full-time, Part-time, Internship
  Benefits / skills fields               Limited coverage

Table 1: Summary statistics of the VietJobs dataset.

3.2. Occupational Category Distribution

The job_category field denotes the occupational domain or functional area represented in each advertisement (e.g., Sales, Engineering, Healthcare). The original categorisation provided by source platforms consisted of 24 distinct labels, many of which exhibited semantic overlap or inconsistent assignment. For example, positions such as Financial Analyst and Accountant were placed under separate categories, despite belonging to the same broader industrial sector.

To enhance interpretability and comparability, we conducted a systematic category normalisation process. All raw category labels were reviewed and mapped onto a harmonised taxonomy comprising 16 consolidated occupational domains. This approach aligns with established practices in occupational data standardisation in Vietnam (Moroz and Nguyen, 2019), ensuring analytical consistency while preserving sufficient granularity for downstream NLP tasks. The resulting taxonomy maintains a balance between semantic precision and practical usability, enabling both fine-grained linguistic analyses and macro-level labour market comparisons.
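The normalisation step described above amounts to a many-to-one label mapping. A minimal sketch follows; the raw labels shown are hypothetical examples, not the paper's actual 24-label inventory, and the fallback behaviour is an assumption.

```python
# Illustrative fragment of a raw-label -> 16-domain mapping (hypothetical labels).
CATEGORY_MAP = {
    "Financial Analyst": "Finance, Accounting, Banking & Insurance",
    "Accountant": "Finance, Accounting, Banking & Insurance",
    "Sales Executive": "Business, Sales & Customer Service",
    "Customer Support": "Business, Sales & Customer Service",
    "Software Engineer": "Information Technology & Digital Engineering",
}

def normalise_category(raw_label: str) -> str:
    """Map a raw platform label onto the harmonised taxonomy,
    falling back to the catch-all domain for unmapped labels."""
    return CATEGORY_MAP.get(raw_label.strip(), "Other Occupations")
```

Under this mapping, Financial Analyst and Accountant, separate in the source platforms' scheme, collapse into the same consolidated domain.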
Table 2 summarises the distribution of job postings across the 16 categories. The largest segments correspond to Business, Sales & Customer Service (8,276 ads) and Manufacturing, Manual Labour & Mechanics (6,407 ads), reflecting Vietnam's continued emphasis on commerce, production, and industrial operations. By contrast, specialised fields such as Agriculture, Energy & Environment (322 ads) and Languages & Translation (384 ads) account for smaller shares, indicative of niche professional markets. In total, the dataset comprises 48,092 valid and categorised job advertisements, offering comprehensive coverage of Vietnam's online recruitment landscape.

Job Category (translated to English)                        Ad Count
Business, Sales & Customer Service                             8,276
Manufacturing, Manual Labour & Mechanics                       6,407
Marketing, Communications, Advertising & Content               5,950
Finance, Accounting, Banking & Insurance                       5,530
Tourism, Hospitality & Services                                4,243
Design, Arts, Entertainment, Media & Journalism                3,465
Human Resources, Administration, Legal & Consulting            3,276
Construction, Architecture & Real Estate                       2,811
Information Technology & Digital Engineering                   1,906
Logistics, Transportation & Supply Chain                       1,813
Electrical, Electronics & Telecommunications Engineering       1,236
Education, Training & Research                                 1,165
Healthcare, Pharmaceuticals & Biotechnology                      963
Languages & Translation                                          384
Agriculture, Energy & Environment                                322
Other Occupations                                                345
Total                                                         48,092

Table 2: Distribution of job advertisements across the 16 consolidated occupational categories.
3.3. Salary Distribution

Salary information in the VietJobs dataset was normalised into three quantitative fields: salary_min, salary_max, and salary_avg, each expressed in millions of Vietnamese Dong (VND) per month. Out of the total 48,092 job postings, 34,365 (71.5%) explicitly specify a salary range, while the remaining 13,727 (28.5%) are listed as negotiable. This distribution aligns with prior research on labour market transparency in Southeast Asia, where employers often withhold salary details for competitive or negotiation-related reasons (Van Thang et al., 2020).

Figure 1 presents the distribution of minimum and maximum salary values after removing extreme outliers. The median minimum salary is approximately 10 million VND, while the median maximum salary is around 15 million VND. This concentration suggests that the majority of advertised positions correspond to entry- or mid-level roles within Vietnam's labour market. Nonetheless, the upper quartiles extend beyond 30 million VND, indicating the presence of high-paying professional and managerial roles.

The wider spread of maximum salaries relative to minimum salaries highlights substantial wage variability across postings. This variation likely reflects factors such as job seniority, sectoral demand, and skill specialisation. For example, service-oriented occupations, particularly within Business, Sales & Customer Service, tend to exhibit broader salary ranges than technical domains like Information Technology & Digital Engineering. Such disparities suggest that employers frequently adopt flexible salary ranges to appeal to a wider pool of applicants, a strategy commonly observed in dynamic and rapidly expanding economies (Lu, 2023).

Figure 1: The distribution of minimum and maximum salaries (extreme outliers are removed for better visualisation).
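The normalisation of free-text salary strings into salary_min, salary_max, and salary_avg could be sketched as below. The paper does not publish its parsing rules, so the regular expression and the handling of negotiable postings are assumptions for illustration.

```python
import re

def parse_salary(text: str):
    """Parse strings such as '10 - 15 triệu' into (salary_min, salary_max,
    salary_avg) in million VND; return None for negotiable/unspecified text."""
    # Normalise decimal commas, then pull out all numeric values.
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text.replace(",", "."))]
    if not numbers:
        return None  # e.g. 'Thoả thuận' (negotiable) carries no salary fields
    lo, hi = min(numbers), max(numbers)
    return lo, hi, (lo + hi) / 2
```

A single-value posting such as "Đến 20 triệu" (up to 20 million) degenerates to an equal minimum and maximum under this sketch.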
The salary distribution exhibits a right-skewed pattern, characterised by a concentration of lower- to mid-income positions alongside a smaller subset of high-income roles. This structure forms the empirical basis for the salary estimation experiments described in Section 4, where models are tasked with predicting salary values for postings from key details such as the job title or required experience.

4. Evaluating Job Classification and Salary Estimation with Large Language Models

4.1. Job Category Classification

The job category classification task focuses on predicting the standardised job_category label (16 classes) from the textual content of each advertisement, primarily the description field. We evaluate a range of generative large language models (LLMs) available on Hugging Face under three experimental conditions: zero-shot, few-shot, and fine-tuned. All models employ their chat or instruct variants to ensure consistent adherence to natural language instructions and task formatting.

Prompting and fine-tuning. In the zero-shot setting, each model receives a concise instruction prompt requesting a single category prediction from the 16 predefined occupational labels. This setup tests a model's ability to generalise without prior exposure to task-specific examples. In the few-shot setting, the prompt includes a small number of annotated examples before the test instance, enabling the model to infer task structure and category mappings from limited in-context demonstrations.
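A few-shot classification prompt of the kind described here could be assembled as follows. The instruction wording, demonstration format, and the truncated category list are assumptions for illustration; the paper does not publish its prompts.

```python
# Illustrative subset of the 16-class taxonomy (hypothetical prompt format).
CATEGORIES = [
    "Business, Sales & Customer Service",
    "Information Technology & Digital Engineering",
    "Languages & Translation",
    # ... remaining labels of the 16-class taxonomy
]

def build_classification_prompt(description: str, examples: list) -> str:
    """Assemble instruction + in-context demonstrations + test instance."""
    header = (
        "Classify the Vietnamese job advertisement into exactly one of the "
        "following categories and answer with the category name only:\n"
        + "\n".join(f"- {c}" for c in CATEGORIES)
    )
    shots = "\n\n".join(
        f"Description: {desc}\nCategory: {label}" for desc, label in examples
    )
    return f"{header}\n\n{shots}\n\nDescription: {description}\nCategory:"
```

Ending the prompt at "Category:" encourages the model to emit only the label, which simplifies the strict output matching described below.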
For fine-tuning, the models are trained using structured instruction-response pairs, where the prompt contains the job description and the target output corresponds to the correct category label. This conversational-style format allows models to internalise consistent mappings between job descriptions and their respective categories. During inference, model outputs are restricted to the canonical list of category names. Any responses that deviate from these standard labels are automatically treated as incorrect. This ensures comparability across models and mitigates the influence of generative variability, enabling a fair evaluation of classification accuracy and robustness across instruction-tuned architectures.

Evaluation metrics. Performance is assessed using Accuracy and Macro F1 scores on the held-out test set. Let N denote the total number of samples, K the number of classes, \hat{y}_i the predicted label, and y_i the ground truth. Accuracy is computed as:

\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(\hat{y}_i = y_i)

For each class k, Precision, Recall, and F1 are defined as:

\mathrm{Precision}_k = \frac{TP_k}{TP_k + FP_k}, \quad \mathrm{Recall}_k = \frac{TP_k}{TP_k + FN_k}, \quad \mathrm{F1}_k = \frac{2 \times \mathrm{Precision}_k \times \mathrm{Recall}_k}{\mathrm{Precision}_k + \mathrm{Recall}_k}

The overall Macro F1 is then given by:

\mathrm{Macro\,F1} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{F1}_k

Accuracy provides a measure of overall correctness, while Macro F1 offers a class-balanced evaluation that accounts for performance across both frequent and infrequent job categories.

4.2. Salary Estimation

The salary estimation task aims to predict the expected salary range (expressed in the format "X triệu", meaning X million VND) based on structured job information. Each sample includes the following input fields: job_title, contract_type, location, country, and experience_required.
We evaluate a set of generative large language models (LLMs) from Hugging Face under three configurations: zero-shot, few-shot, and fine-tuned. All models employ their chat or instruct variants to enhance instruction-following capability.

Prompting and fine-tuning. In the zero-shot setting, models are prompted to generate a salary prediction in the required "X triệu" format using only the provided job attributes. In the few-shot setting, the prompt includes a small number of example pairs of job attributes and their corresponding salary values before the test instance, allowing the model to infer the desired output structure and relationship through in-context learning. For the fine-tuned configuration, training instances are structured as instruction-response pairs following the same conversational schema, enabling models to learn the mapping between job attributes and corresponding salary values. During inference, the model output is required to strictly adhere to the "X triệu" format. The numeric value X is extracted and compared to the gold-standard salary value. Predictions that are unparsable or deviate from the expected format are considered invalid.

Evaluation metrics. Model performance is evaluated using Root Mean Square Error (RMSE) and the coefficient of determination (R²) on the held-out test set. Let N be the number of samples, y_i the ground-truth salary, and \hat{y}_i the predicted value parsed from the model output. RMSE is computed as:

\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}

and R² is defined as:

R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}

where \bar{y} denotes the mean of the ground-truth salary values.
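The metrics defined in Sections 4.1 and 4.2 translate directly into code. A minimal pure-Python sketch of Macro F1, RMSE, and R² (not the authors' evaluation script):

```python
import math

def macro_f1(y_true: list, y_pred: list, labels: list) -> float:
    """Unweighted mean of per-class F1 scores over the given label set."""
    f1s = []
    for k in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != k and p == k)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == k and p != k)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def rmse(y_true: list, y_pred: list) -> float:
    """Root mean square error of salary predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true: list, y_pred: list) -> float:
    """Coefficient of determination; can be negative for poor predictors."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

Note that R² below zero, as in some zero-shot results reported in Section 5.2, simply means the model predicts worse than the constant mean-salary baseline.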
RMSE captures the magnitude of prediction errors, while R² quantifies the proportion of salary variance explained by the model, providing complementary perspectives on predictive accuracy and generalisation.

4.3. Model Selection

To evaluate model performance on Vietnamese job classification and salary estimation, we selected a diverse suite of generative large language models (LLMs) that vary in scale, linguistic coverage, and regional focus. The selection aims to provide a comprehensive comparison across three categories: globally-trained multilingual models, regionally-oriented ASEAN models, and Vietnamese-specialised models.

Multilingual models. We include Qwen2.5-7B-Instruct (Qwen et al., 2025), Llama-3.1-8B-Instruct (Dubey et al., 2024), Granite-3.3-8B-Instruct (Team, 2025), and Ministral-8B-Instruct-2410². These models are trained on extensive multilingual corpora and demonstrate robust generalisation across languages. Their ability to process code-mixed or English-Vietnamese inputs makes them suitable for tasks involving international or bilingual job postings commonly observed in Vietnamese recruitment data.

ASEAN-focused models. To assess models tailored for regional linguistic contexts, we evaluate Llama-SEA-LION-v3-8B-IT (Chan et al., 2024), Sailor2-8B-Chat (Dou et al., 2025), and SeaLLMs-v3-7B-Chat (Zhang et al., 2024). These models are explicitly designed to improve coverage of Southeast Asian languages, including Vietnamese, Thai, and Indonesian. Their training objectives and tokenisation strategies better accommodate regional linguistic characteristics, which may enhance performance on language-specific nuances in job advertisements.

² https://huggingface.co/mistralai/Ministral-8B-Instruct-2410
Vietnamese-specific models. Finally, we include PhoGPT-4B-Chat (Nguyen et al., 2023), BloomVN-8B-Chat (BlossomsAI, 2025), and Vistral-7B-Chat (Van Nguyen et al., 2023), all of which are trained or fine-tuned predominantly on Vietnamese textual data. These models capture syntactic, lexical, and cultural particularities of Vietnamese, potentially providing superior understanding of contextually rich or idiomatic expressions frequently used in job descriptions.

This configuration enables a systematic comparison between globally multilingual, regionally adapted, and locally specialised models. By evaluating them under both zero-shot and fine-tuned settings, we aim to analyse how linguistic coverage, training focus, and cultural alignment influence LLM performance in Vietnamese recruitment-related tasks.

4.4. Fine-tuning Configuration

Data Split. The dataset is partitioned into training, development, and test subsets using an 80%/10%/10% ratio. The development set is used for hyperparameter tuning and early stopping, while the final evaluation is conducted exclusively on the held-out test set to ensure fair comparison across models.

Fine-tuning Hyperparameters. Fine-tuning is performed on a single NVIDIA A40 GPU using Low-Rank Adaptation (LoRA) to enable efficient training of large models. The configuration is consistent across all model variants to ensure comparability.
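The 80/10/10 partition described above can be reproduced with a seeded shuffle. This is a sketch under stated assumptions: the paper reports only the ratio, so the shuffling procedure and seed are illustrative.

```python
import random

def split_dataset(samples: list, seed: int = 13):
    """Partition into 80% train / 10% dev / 10% test after a seeded shuffle.
    The shuffle and seed value are assumptions; only the ratio is from the paper."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = int(0.8 * len(items))
    n_dev = int(0.1 * len(items))
    return (items[:n_train],                 # training set
            items[n_train:n_train + n_dev],  # development set
            items[n_train + n_dev:])         # held-out test set
```

Fixing the seed keeps the held-out test set identical across all model families, which is what makes the cross-model comparison fair.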
The key hyperparameters are as follows:

• LoRA parameters: rank = 8, α = 16, dropout = 0.2
• Target modules: q_proj, k_proj, v_proj, o_proj
• Optimiser: AdamW with learning rate 5 × 10^-5
• Batch size: micro-batch size = 4; effective batch size = 64
• Training epochs: 2
• Maximum sequence length: 512 tokens
• Evaluation and checkpoint frequency: every 200 steps
• Precision: BF16 enabled

Compute and Efficiency. All fine-tuning experiments were executed under identical computational conditions. The job classification fine-tuning process required approximately 5 hours per model, while salary estimation models converged within 1-2 hours. The use of LoRA substantially reduced memory consumption and training time compared to full model fine-tuning, allowing efficient experimentation across multiple model families.

5. Results and Discussion

5.1. Job Classification Results

Table 3 summarises the performance of the evaluated LLMs on the job classification task across three evaluation settings: zero-shot, few-shot, and fine-tuned.

Zero-shot performance. In the absence of task-specific examples, Qwen2.5-7B-Instruct achieves the highest scores, with an accuracy of 0.31 and a Macro F1 of 0.32. This demonstrates its strong cross-lingual generalisation and instruction-following capability, even without prior exposure to the classification schema. In contrast, models such as PhoGPT-4B-Chat and Granite-3.3-8B-Instruct produce inconsistent or malformed outputs that fail to match the canonical label set, indicating limited understanding of task constraints and label semantics under zero-shot prompting.

Few-shot performance. Incorporating a small number of in-context examples substantially improves performance across all models, with accuracy scores rising to the 0.4x range. Qwen2.5-7B-Instruct again leads with 0.47 accuracy, followed by Llama-SEA-LION-v3-8B-IT (0.45) and Sailor2-8B-Chat (0.44).
These results highlight the effectiveness of few-shot prompting in structured text classification, where minimal context allows LLMs to infer task boundaries, label semantics, and linguistic cues from examples. The improvement underscores the adaptability of instruction-tuned models to new domains with limited supervision.

Fine-tuned performance. Parameter-efficient fine-tuning yields moderate gains relative to the zero-shot baseline but does not consistently outperform few-shot prompting. Accuracy values plateau around 0.3x across models, suggesting that while fine-tuning enhances task alignment, it may not fully replicate the contextual reasoning and flexibility afforded by in-context learning. Notably, Ministral-8B-Instruct-2410 exhibits a slight decline in both accuracy and Macro F1 after fine-tuning, possibly reflecting overfitting.

Discussion. Across all experimental settings, Qwen2.5-7B-Instruct demonstrates the most consistent and robust performance, achieving the best results in both zero-shot and few-shot configurations. The strong results of instruction-tuned multilingual models relative to Vietnamese-specific ones suggest that large-scale multilingual pretraining remains advantageous for generalisable task understanding, even in resource-specific contexts such as Vietnamese recruitment data.

5.2. Salary Estimation Results

Table 4 summarises the performance of the evaluated LLMs on the salary estimation task under five experimental settings: zero-shot, few-shot, fine-tuned on VietJobs, fine-tuned on the Vietnam Jobs Dataset (Nguyen, 2025), and fine-tuned on the combined datasets.
Overall, model performance improves consistently as task-specific supervision is introduced through fine-tuning. For example, Qwen2.5-7B-Instruct achieves an RMSE of 14.06 and an R² of −0.46 in the zero-shot setting on VietJobs, which improves to an RMSE of 11.32 and an R² of 0.06 after fine-tuning on both datasets. Comparable trends are observed across the other models, highlighting the importance of domain adaptation in salary prediction.

Across all configurations, Llama-SEA-LION-v3-8B-IT delivers the strongest and most consistent results. Even without explicit task conditioning, it achieves an RMSE of 11.72 and an R² of 0.07 in the zero-shot scenario on VietJobs, outperforming all other models. With few-shot prompting, its performance further improves to 10.65 RMSE and 0.16 R². After fine-tuning on both datasets, the model maintains robust performance with an RMSE of 12.40 and an R² of 0.16 when evaluated on the combined data. This pattern suggests that Llama-SEA-LION effectively integrates pre-trained multilingual knowledge with task-specific cues, allowing it to generalise well across both datasets.

The observed performance gains from fine-tuning on the Vietnam Jobs Dataset and the combined data can be attributed to increased data diversity and representativeness. Compared with VietJobs, the Vietnam Jobs Dataset provides a broader range of industries, salary brackets, and contextual descriptors, offering richer information for modelling salary patterns. Fine-tuning on the combined corpus exposes models to this wider distribution of job contexts, leading to lower RMSE values and higher R² scores, indicative of improved predictive accuracy and generalisability.

In summary, the results reveal three key findings: (1) task-specific fine-tuning substantially enhances model accuracy for salary estimation; (2) performance generally follows the trend zero-shot < few-shot < fine-tuned on VietJobs < fine-tuned on Vietnam Jobs Dataset < fine-tuned on both datasets; and (3) Llama-SEA-LION-v3-8B-IT emerges as the most robust and effective model for this task, demonstrating strong adaptability and generalisation across different data sources.

6. Limitations

While VietJobs represents the largest publicly available dataset of Vietnamese job advertisements, several limitations remain. It is sourced from a single online recruitment platform, TopCV, which may not reflect all sectors or informal employment in Vietnam, resulting in underrepresentation of some industries or job types. The postings also follow the platform's linguistic and structural conventions, which could introduce systematic biases. Salary information, though common, is not consistently standardised and may involve rounding or omission. Textual fields such as job descriptions sometimes contain duplicated or templated content, potentially affecting linguistic analyses and model outcomes. This study focuses on two core tasks, job category classification and salary estimation, using general-purpose large language models. These provide initial benchmarks but do not capture the full scope of possible NLP applications. Finally, although the dataset excludes identifiable personal data, downstream users should still apply appropriate ethical and legal safeguards.
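One practical consequence of the non-standardised salary fields noted above is that postings quote ranges, open-ended amounts, or "negotiable" before any regression target can be derived. A hypothetical preprocessing sketch (the patterns and the range-midpoint convention are assumptions for illustration, not the paper's actual pipeline):

```python
# Hypothetical normalisation of Vietnamese salary strings into a single
# numeric target in million VND; NOT the paper's actual preprocessing.
import re

def parse_salary(text):
    """Return a salary in million VND, or None for negotiable/missing values.
    Ranges are collapsed to their midpoint (an assumed convention)."""
    text = text.lower().strip()
    # "thoả thuận" / "thỏa thuận" = "negotiable": no numeric target.
    if "thoả thuận" in text or "thỏa thuận" in text:
        return None
    numbers = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", text.replace(",", "."))]
    if not numbers:
        return None
    if len(numbers) >= 2:            # e.g. "10 - 15 triệu" -> 12.5
        return (numbers[0] + numbers[1]) / 2
    return numbers[0]                # e.g. "trên 20 triệu" ("above 20M") -> 20.0

print(parse_salary("10 - 15 triệu"))
print(parse_salary("Thoả thuận"))
```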
LLM                          Job Classification    Salary Estimation
                             Acc     Macro F1      RMSE     R²
Zero-shot
Qwen2.5-7B-Instruct          0.31    0.32          14.06    -0.46
Llama-3.1-8B-Instruct        0.20    0.20          16.01    -0.87
Ministral-8B-Instruct-2410   0.19    0.16          40.75    -13.12
Llama-SEA-LION-v3-8B-IT      0.26    0.29          11.72    0.07
Sailor2-8B-Chat              0.16    0.16          13.74    -3.29
PhoGPT-4B-Chat               0.00    0.00          -        -
Granite-3.3-8B-Instruct      0.00    0.00          35.37    -8.19
BloomVN-8B-chat              0.00    0.00          18.73    -0.91
SeaLLMs-v3-7B-Chat           0.09    0.07          13.14    -0.28
Vistral-7B-Chat              -       -             167.07   -195.89
Few-shot
Qwen2.5-7B-Instruct          0.47    0.42          11.45    0.03
Llama-3.1-8B-Instruct        0.42    0.36          14.73    -0.60
Ministral-8B-Instruct-2410   0.32    0.20          46.76    -15.26
Llama-SEA-LION-v3-8B-IT      0.45    0.38          10.65    0.16
Sailor2-8B-Chat              0.44    0.42          11.25    0.07
SeaLLMs-v3-7B-Chat           0.40    0.34          20.80    -2.19
Fine-tuned
Qwen2.5-7B-Instruct          0.34    0.33          11.47    0.03
Llama-3.1-8B-Instruct        0.30    0.32          10.62    0.17
Ministral-8B-Instruct-2410   0.11    0.04          12.05    -0.07
Llama-SEA-LION-v3-8B-IT      0.30    0.33          10.60    0.17
Sailor2-8B-Chat              0.30    0.30          13.31    -0.30
SeaLLMs-v3-7B-Chat           0.32    0.31          10.75    0.15

Table 3: Performance comparison of various large language models (LLMs) across two tasks: job classification and salary estimation.

7. Conclusion and Future Work

This paper introduced VietJobs, a large-scale dataset of 48,092 Vietnamese job advertisements for research in job classification, salary estimation, and labour market analysis. The dataset covers 16 normalised job categories and includes structured fields such as job titles, salaries, skills, and employment conditions. Its utility was evaluated through two core tasks using large language models, with instruction-tuned systems such as Qwen2.5-7B-Instruct and Llama-SEA-LION-v3-8B-IT achieving the most consistent results. Fine-tuning on combined datasets produced the most accurate salary estimates.
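The zero- versus few-shot gap in Table 3 comes down to how in-context examples are prepended to the query posting. A minimal sketch of assembling such a classification prompt (the category names, example postings, and wording are hypothetical, not the paper's actual prompts):

```python
# Illustrative few-shot prompt assembly for job category classification.
# Categories and example postings below are hypothetical, not from VietJobs.
CATEGORIES = ["IT - Software", "Sales", "Accounting"]  # hypothetical subset

FEW_SHOT_EXAMPLES = [
    # ("posting text", "gold category"); Vietnamese text is illustrative.
    ("Tuyển lập trình viên Python, lương 20-25 triệu", "IT - Software"),
    ("Nhân viên kinh doanh bất động sản", "Sales"),
]

def build_prompt(job_text, examples=FEW_SHOT_EXAMPLES):
    """Prepend labelled in-context examples before the query posting."""
    lines = ["Classify the job posting into one of: " + ", ".join(CATEGORIES) + "."]
    for text, label in examples:
        lines.append(f"Posting: {text}\nCategory: {label}")
    # The model is expected to complete the final "Category:" field.
    lines.append(f"Posting: {job_text}\nCategory:")
    return "\n\n".join(lines)

print(build_prompt("Kế toán tổng hợp, kinh nghiệm 2 năm"))
```

In a zero-shot run the `FEW_SHOT_EXAMPLES` list would simply be empty, leaving the model only the instruction and label inventory to work from.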
VietJobs provides a foundation for future research in computational labour market monitoring and recruitment modelling. Future work may expand coverage across additional platforms and time periods, incorporate multilingual or demographic data, and explore advanced modelling methods such as retrieval-augmented generation or domain-adaptive pretraining. In addition, while this study focused exclusively on the performance of large language models, future research should conduct systematic comparisons with traditional machine learning approaches, such as TF-IDF feature representations combined with classifiers like Logistic Regression. Such comparisons would help quantify the relative benefits of LLM-based methods over conventional baselines in Vietnamese job classification and salary prediction tasks.

Ethical Considerations and Limitations

All data used in this study were collected from publicly accessible pages of the TopCV.vn website, which does not prohibit web scraping under its robots.txt policy. Only non-personal, publicly available job advertisement content was included; no user profiles, résumés, or private materials were accessed. The dataset contains no personally identifiable information (PII), and all text was processed solely for research purposes in accordance with fair-use and data protection principles. This study received ethical approval from an Institutional Ethical Review Board under the principal investigator, Mo El-Haj. The approval (Decision No. J/2025/CN/HDDD) was granted in accordance with Circular 43/2024/TT-BYT, which regulates the establishment and operation of ethics councils in research, alongside relevant institutional policies.
The study was classified as minimal risk and authorised for the period October–December 2025. All research activities adhered to institutional guidelines, Vietnamese national regulations, and internationally recognised ethical standards, including the principles of the Declaration of Helsinki. No personal or sensitive data were collected, and no human subjects were involved.

LLM                          VietJobs          Vietnam Jobs Dataset    Both
                             RMSE     R²       RMSE       R²           RMSE      R²
Base Model (zero-shot)
Qwen2.5-7B-Instruct          14.06   -0.46     16.19     -0.31         15.61    -0.34
Llama-3.1-8B-Instruct        16.01   -0.87     1319.94   -7744.05      1092.21  -6031.91
Ministral-8B-Instruct-2410   40.75   -13.12    2600.09   -30621.95     2190.20  -25135.32
Llama-SEA-LION-v3-8B-IT      11.72   0.07      15.78     -0.05         14.61    -0.02
Sailor2-8B-Chat              13.74   -3.29     99.92     -12.80        85.19    -12.55
SeaLLMs-v3-7B-Chat           13.14   -0.28     17.93     -0.08         15.82    -0.14
Few-shot
Qwen2.5-7B-Instruct          11.45   0.03      15.43     -0.18         14.40    -0.14
Llama-3.1-8B-Instruct        14.73   -0.60     20.80     -1.16         19.25    -1.04
Ministral-8B-Instruct-2410   46.76   -15.26    45.66     -9.48         45.99    -10.72
Llama-SEA-LION-v3-8B-IT      10.65   0.16      14.57     -0.06         13.40    0.01
Sailor2-8B-Chat              11.25   0.07      15.62     -0.21         14.53    -0.15
SeaLLMs-v3-7B-Chat           20.80   -2.19     21.78     -1.37         21.50    -1.54
Fine-tuned on VietJobs
Qwen2.5-7B-Instruct          11.47   0.03      14.28     -0.02         13.48    0.00
Llama-3.1-8B-Instruct        10.62   0.17      13.93     0.03          13.08    0.06
Ministral-8B-Instruct-2410   12.05   -0.07     14.62     -0.07         13.94    -0.07
Llama-SEA-LION-v3-8B-IT      10.60   0.17      13.89     0.04          13.04    0.07
Sailor2-8B-Chat              13.31   -0.30     16.84     -0.41         15.97    -0.40
SeaLLMs-v3-7B-Chat           10.75   0.15      14.05     0.02          13.20    0.04
Fine-tuned on Vietnam Jobs Dataset
Qwen2.5-7B-Instruct          10.77   0.15      13.40     0.10          12.70    0.11
Llama-3.1-8B-Instruct        10.82   0.14      13.34     0.11          12.68    0.12
Ministral-8B-Instruct-2410   11.70   -0.01     14.40     -0.03         13.69    -0.03
Llama-SEA-LION-v3-8B-IT      10.70   0.16      13.24     0.12          12.58    0.13
Sailor2-8B-Chat              10.94   0.12      13.78     0.05          13.03    0.07
SeaLLMs-v3-7B-Chat           10.83   0.14      13.44     0.10          12.75    0.11
Fine-tuned on Both
Qwen2.5-7B-Instruct          11.32   0.06      13.45     0.10          12.88    0.09
Llama-3.1-8B-Instruct        10.31   0.22      13.20     0.13          12.44    0.15
Ministral-8B-Instruct-2410   11.85   -0.03     14.29     -0.02         13.64    -0.02
Llama-SEA-LION-v3-8B-IT      10.24   0.23      13.17     0.13          12.40    0.16
Sailor2-8B-Chat              12.13   -0.08     13.73     0.06          13.34    0.02
SeaLLMs-v3-7B-Chat           10.48   0.19      13.41     0.10          12.64    0.12

Table 4: Performance comparison of LLMs on salary estimation under different settings.

References

Abdel Rahman Alsheyab, Mohammad Alkhasawneh, and Nidal Shahin. 2025. Job market cheat codes: Prototyping salary prediction and job grouping with synthetic job listings.

Sarah H Bana. 2022. Work2vec: Using language models to understand wage premia.

BlossomsAI. 2025. BloomVN-8B-chat.

Giuliano Bonoli and Karl Hinrichs. 2012. Statistical discrimination and employers' recruitment: Practices for low-skilled workers. European Societies, 14(3):338–361.

Adwin Chan, Nicholas Cheng, Esther Choa, Yuli Huang, Adithya Venkatadri Hulagadri, Wayne Lau, Chwan Ren Lee, Wai Yi Leong, Wei Qi Leong, Peerat Limkonchotiwat, Bing Jie Darius Liu, Jann Railey Montalan, Boon Cheong Raymond Ng, Jian Gang Ngui, Thanh Ngan Nguyen, Brandon Ong, Tat-Wee David Ong, Zhi Hao Ong, Hamsawardhini Rengarajan, Bryan Siow, Yosephine Susanto, Ngee Chia Tai, Choon Meng Tan, Walter Teng, Eng Sipp Leslie Teo, Wei Yi Teo, William Tjhi, Yeow Tong Yeo, and Xianbin Yong. 2024. Llama-SEA-LION-v3-8B-IT: Southeast
Asian LLM with instruction tuning. https://huggingface.co/aisingapore/Llama-SEA-LION-v3-8B-IT.

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

Nikolas Dawson, Marian-Andrei Rizoiu, Benjamin Johnston, and Mary-Anne Williams. 2020. Predicting skill shortages in labor markets: A machine learning approach. In 2020 IEEE International Conference on Big Data (Big Data), pages 3052–3061. IEEE.

Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, et al. 2025. Sailor2: Sailing in Southeast Asia with inclusive multilingual LLMs. arXiv preprint arXiv:2502.12982.

Nazarii Drushchak and Mariana Romanyshyn. 2024. Introducing the Djinni recruitment dataset: A corpus of anonymized CVs and job postings. In Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024, pages 8–13, Torino, Italia. ELRA and ICCL.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The Llama 3 herd of models. arXiv e-prints, pages arXiv–2407.

Mo El-Haj. 2025. ArabJobs: A multinational corpus of Arabic job ads. In Proceedings of the Third Arabic Natural Language Processing Conference (ArabicNLP 2025), Suzhou, China. Association for Computational Linguistics. Co-located with EMNLP 2025.

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. GPT-4o system card.

Tanjila Kanij, John Grundy, and Jennifer McIntosh. 2024. Enhancing understanding and addressing gender bias in IT/SE job advertisements. Journal of Systems and Software, 217:112169.

Kanchana Karunarathna. 2025. Adzuna global job listings 2025.

Jackson G Lu. 2023. Asians don't ask? Relational concerns, negotiation propensity, and starting salaries. Journal of Applied Psychology, 108(2):273.

Antoine Magron, Anna Dai, Mike Zhang, Syrielle Montariol, and Antoine Bosselut. 2024. JobSkape: A framework for generating synthetic job postings to enhance skill matching. In Proceedings of the First Workshop on Natural Language Processing for Human Resources (NLP4HR 2024), pages 43–58, St. Julian's, Malta. Association for Computational Linguistics.

Harry Moroz and Nga Thi Nguyen. 2019. Skills profiling of priority occupations in Vietnam. Technical report, World Bank.

Chitinh Nguyen. 2025. Vietnam jobs dataset.

Dat Quoc Nguyen, Linh The Nguyen, Chi Tran, Dung Ngoc Nguyen, Dinh Phung, and Hung Bui. 2023. PhoGPT: Generative pre-training for Vietnamese. arXiv preprint, arXiv:2311.02945.

Naoki Otani, Nikita Bhutani, and Estevam Hruschka. 2024. Natural language processing for human resources: A survey. arXiv preprint arXiv:2410.16498.

Le Anh Tu Packard. 2006. Gender dimensions of Viet Nam's comprehensive macroeconomic and structural reform policies. 14. UNRISD Occasional Paper.

Carlo Perroni, Kimberly Scharf, Oleksandr Talavera, and Linh Vi. 2023. Gender beauty premia in wage offers: Evidence from Vietnamese online job postings.

Pieterbas Pluijmaekers and Francesco Lelli. 2022. A dataset containing job descriptions suitable for NLP and NN processing.

Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. 2025. Qwen2.5 technical report.

Rohan Prakash Sharma. 2025. Recognizing and explaining bias in job descriptions: A RoBERTa-powered recruitment framework. Science and Engineering Research Journal, 13(3):1–9.

Granite Team. 2025. Granite-3.3-8B-Instruct.

Viet-Trung Tran, Hai-Nam Cao, and Tuan-Dung Cao. 2022. A practical method for occupational skills detection in Vietnamese job listings. In Asian Conference on Intelligent Information and Database Systems, pages 571–581. Springer.

UncleCode. 2024. Crawl4AI: Open-source LLM friendly web crawler and scraper. https://github.com/unclecode/crawl4ai.

Chien Van Nguyen, Thuat Nguyen, Quan Nguyen, Huy Nguyen, Björn Plüster, Nam Pham, Huu Nguyen, Patrick Schramowski, and Thien Nguyen. 2023. Vistral-7B-Chat: Towards a state-of-the-art large language model for Vietnamese.

N. Van Thang, J. M. Peiró, L. Q. Canh, V. González-Romá, and V. Martínez-Tur. 2020. Vietnamese graduates' labour market entry and employment: A tracer study. Social Sciences, 12(2):94.

Jan Vogt, Thilo Voigt, Annika Nowak, and Jan M Pawlowski. 2023. Development of a job advertisement analysis for assessing data science competencies. Data Science Journal, 22(1).

Dinh-Hong Vu, Kien Nguyen, Khai Thien Tran, Bay Vo, and Tuong Le. 2025. Improving fake job description detection using deep learning-based NLP techniques. Journal of Information and Telecommunication, 9(1):113–125.

Mike Zhang, Kristian Nørgaard Jensen, Sif Dam Sonniks, and Barbara Plank. 2022. SkillSpan: Hard and soft skill extraction from English job postings. arXiv preprint arXiv:2204.12811.

Wenxuan Zhang, Hou Pong Chan, Yiran Zhao, Mahani Aljunied, Jianyu Wang, Chaoqun Liu, Yue Deng, Zhiqiang Hu, Weiwen Xu, Yew Ken Chia, et al. 2024. SeaLLMs 3: Open foundation and chat multilingual large language models for Southeast Asian languages. arXiv preprint arXiv:2407.19672.