K/DA: Automated Data Generation Pipeline for Detoxifying Implicitly Offensive Language in Korean

1Korea University   2Seoul National University   3KAIST AI
*Indicates Equal Contribution

ACL 2025
K/DA teaser
An overview of K/DA, the pipeline for automated offensive language data generation.

Step 1
Retrieve the 9 most semantically similar sentences from online community posts using cosine similarity. An LLM then synthesizes a toxic version of the input by incorporating trend-aligned slang from these retrieved sentences.
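
As an illustration of Step 1, here is a minimal retrieval sketch assuming sentence-transformers embeddings and an in-memory corpus; the embedding model, placeholder corpus, and function names are our assumptions, not the paper's exact configuration.

```python
# Minimal sketch of Step 1: retrieve the top-9 community sentences most
# similar to a neutral input. The embedding model and toy corpus are
# illustrative assumptions, not the paper's exact setup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# In practice this corpus would be sentences crawled from online communities.
community_sentences = [
    "placeholder community sentence A",
    "placeholder community sentence B",
    "placeholder community sentence C",
]
corpus_embeddings = model.encode(community_sentences, convert_to_tensor=True)

def retrieve_similar(neutral_sentence: str, k: int = 9) -> list[str]:
    """Return the k corpus sentences with the highest cosine similarity."""
    query_embedding = model.encode(neutral_sentence, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=k)[0]
    return [community_sentences[hit["corpus_id"]] for hit in hits]
```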

Step 2
An off-the-shelf LLM filters the candidates based on two criteria (a sketch of this filter follows below):
Pair consistency: the degree to which the neutral and toxic sentences convey the same content.
Implicit offensiveness: the toxic sentence should avoid explicit offensiveness while still carrying a subtle or implicit form of toxicity.
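
A hedged sketch of the Step 2 filter, assuming an OpenAI-style chat API as the off-the-shelf LLM; the prompt wording, model name, and score threshold are illustrative choices, not the paper's.

```python
# Sketch of Step 2: have an off-the-shelf LLM score each neutral-toxic
# candidate pair and keep only pairs passing both criteria. The prompt,
# model name, and threshold are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

def passes_filter(neutral: str, toxic: str, threshold: int = 4) -> bool:
    prompt = (
        "Rate the following sentence pair on a 1-5 scale for two criteria.\n"
        "pair_consistency: do both sentences convey the same content?\n"
        "implicit_offensiveness: is the toxic sentence offensive in a subtle,\n"
        "implicit way rather than through explicit profanity?\n"
        'Reply as JSON with integer fields "pair_consistency" and '
        '"implicit_offensiveness".\n\n'
        f"Neutral: {neutral}\nToxic: {toxic}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    scores = json.loads(response.choices[0].message.content)
    return (scores["pair_consistency"] >= threshold
            and scores["implicit_offensiveness"] >= threshold)
```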

Abstract

⚠️ Caution: This research includes content that may be considered offensive.
Language detoxification involves removing toxicity from offensive language. While a neutral-toxic paired dataset provides a straightforward approach to training detoxification models, creating such datasets presents several challenges: i) the need for human annotation to build paired data, and ii) the rapid evolution of offensive terms, which renders static datasets quickly outdated. To tackle these challenges, we introduce an automated paired data generation pipeline, called K/DA. This pipeline is designed to generate offensive language with implicit offensiveness and trend-aligned slang, making the resulting dataset suitable for training detoxification models. We demonstrate that the dataset generated by K/DA exhibits high pair consistency and greater implicit offensiveness compared to existing Korean datasets, and that the pipeline is also applicable to other languages. Furthermore, it enables effective training of a high-performing detoxification model with simple instruction fine-tuning.

Definition of Implicit Offensiveness

Types of implicitly offensive comments


Implicit offensiveness is a form of offensive language characterized by a tone of disregard or mockery that conveys derogatory meaning, such as sarcasm or social bias within context, while avoiding explicit expressions. This figure illustrates the types of offensive comments collected from Korean online communities. These expressions are hard to capture without proper context. We divide the implicitly offensive comments into three subcategories:
(1) disregard and mockery, consistent with past definitions of implicit offensiveness
(2) community-specific slang that is familiar within certain groups but difficult for outsiders to interpret
(3) variations of profanity used to avoid detection

In particular, communities using high-context languages such as Korean are more likely to employ these types of implicitly offensive expressions. We therefore use these categories to guide the data generation process. Furthermore, we demonstrate the language- and model-agnostic nature of the pipeline by also generating data in English.

Dataset Evaluation Results

Dataset comparison


This table presents G-Eval scores for the dataset generated by the K/DA pipeline alongside existing Korean offensive language datasets. With the proposed pipeline, we were able to create a paired dataset with greater implicit offensiveness and higher consistency between pairs. That overall offensiveness is lowest while implicit offensiveness remains highest indicates the dataset is constructed as intended, matching the definition of offensive language targeted in our paper.
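
For reference, here is a simplified sketch of how a G-Eval-style criterion score can be computed: the full method weights candidate scores by their token probabilities, which we approximate with the judge's top logprobs. The prompt wording and model choice are our assumptions.

```python
# Simplified G-Eval-style scoring: ask an LLM judge for a 1-5 rating and
# compute a probability-weighted score from the logprobs of digit tokens.
# Prompt wording and model choice are illustrative assumptions.
import math
from openai import OpenAI

client = OpenAI()

def g_eval_score(criterion: str, sentence: str) -> float:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (f"On a scale of 1-5, rate the {criterion} of the "
                        f"following sentence. Reply with a single digit.\n\n"
                        f"Sentence: {sentence}"),
        }],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    probs = {t.token: math.exp(t.logprob) for t in top
             if t.token.strip() in {"1", "2", "3", "4", "5"}}
    if not probs:
        raise ValueError("judge did not return a digit score")
    # Expected score under the judge's (renormalized) digit distribution.
    return sum(int(tok) * p for tok, p in probs.items()) / sum(probs.values())
```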

Instruction Tuning Results

G-Eval results on instruction tuning


This table presents the G-Eval results for detoxification. The goal is to achieve low offensiveness in the detoxified output. Along with reducing offensiveness, high consistency and fluency scores are essential, as a model could easily lower offensiveness by removing most of the potentially offensive content, but this would result in lower consistency and fluency scores.

The instruction-tuned detoxification model based on K/DA demonstrates improvements across all five criteria when tested on our dataset and KOLD. The superior detoxification performance achieved through instruction tuning on K/DA diminishes as we move to more distant test sets and disappears in the most challenging transfer setting, BEEP. This decline stems primarily from the limited coverage of the neutral sentences in the dataset used, a limitation that can be addressed by diversifying the neutral sentence data.
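
To make "simple instruction fine-tuning" concrete, here is a hedged sketch using TRL's SFTTrainer; the base model, prompt template, and hyperparameters are our assumptions, not the paper's exact recipe.

```python
# Sketch of instruction fine-tuning a detoxification model on K/DA pairs
# with TRL's SFTTrainer. Base model, prompt template, and hyperparameters
# are illustrative assumptions.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Neutral-toxic pairs produced by the K/DA pipeline (placeholders here).
pairs = [
    {"toxic": "<implicitly offensive sentence>",
     "neutral": "<neutral paraphrase>"},
]

def to_instruction(example):
    """Format each pair as a detoxification instruction-response example."""
    return {"text": (
        "### Instruction: Rewrite the sentence so it is no longer offensive "
        "while preserving its content.\n"
        f"### Input: {example['toxic']}\n"
        f"### Response: {example['neutral']}"
    )}

dataset = Dataset.from_list(pairs).map(to_instruction)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative base model
    train_dataset=dataset,
    args=SFTConfig(output_dir="kda-detox", max_seq_length=512),
)
trainer.train()
```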

Human Evaluation Results

Human evaluation dataset comparison


Human evaluation results on 50 random samples from K/DA and K-OMG; the numbers in parentheses are Cronbach's α. K/DA received higher scores for offensiveness (O) and implicit offensiveness (I), which K-OMG merges into a single O score, indicating that K/DA better reflects the offensive language used in online communities. While K-OMG achieved a higher consistency (C) score, its Cronbach's α was relatively low, making direct comparison less reliable. Fluency was also rated higher for K-OMG; however, unlike K-OMG's evaluation instructions, which allowed evaluators to disregard grammatical errors, ours included no such provision, leading to lower fluency scores in our evaluation.
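
For clarity, Cronbach's α here measures inter-rater agreement. A minimal computation over a samples × raters score matrix looks like this (the ratings below are made up for illustration):

```python
# Cronbach's alpha over a (n_samples x n_raters) score matrix, treating
# raters as "items". The example ratings are fabricated for illustration.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of per-rater variances / total variance)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()   # per-rater variances
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

ratings = np.array([[4, 5, 4], [3, 3, 4], [5, 5, 5], [2, 3, 2]])
print(f"alpha = {cronbach_alpha(ratings):.2f}")
```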

Detoxification Performance Comparison

Detoxification preference comparison


Human evaluation of detoxification performance. The chart shows the percentage of evaluators preferring detoxified responses generated by our model, preferring those from models trained on other datasets (K-OMG, translated CADD), or finding the outputs indistinguishable.

Video Presentation

Poster

BibTeX

@misc{jeon2025kdaautomateddatageneration,
      title={K/DA: Automated Data Generation Pipeline for Detoxifying Implicitly Offensive Language in Korean}, 
      author={Minkyeong Jeon and Hyemin Jeong and Yerang Kim and Jiyoung Kim and Jae Hyeon Cho and Byung-Jun Lee},
      year={2025},
      eprint={2506.13513},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.13513}, 
}