Fairness in NLP · 2026
Mitigating Identity Bias in Hindi–English Code-Mixed Toxicity Detection
A controlled comparison of Counterfactual Data Augmentation and Adversarial Debiasing on XLM-RoBERTa, evaluated across four identity subgroups using FPR Disparity, Demographic Parity, and Counterfactual Fairness.
Abstract
Toxicity classifiers for Hindi–English code-mixed text systematically over-flag benign content that mentions identity groups such as religion, caste, gender, and region. We fine-tune XLM-RoBERTa on a 29,550-sample code-mixed corpus and compare two mitigation strategies: Counterfactual Data Augmentation (CDA), which augments training data by swapping identity terms, and Adversarial Debiasing, which jointly trains the classifier against an identity predictor via gradient reversal. CDA reduces False Positive Rate Disparity by 54% and the Demographic Parity gap by 73% with only a 1.8 point F1 drop. Adversarial debiasing better preserves accuracy but worsens counterfactual fairness, suggesting it superficially decorrelates identity from prediction without changing the underlying decision logic. We release code, processed data, and a counterfactual fairness test set for reproducibility.
Dataset
We use the publicly available Kaggle Code-Mixed Hinglish Hate Speech corpus, a combined release aggregating multiple Hindi–English social media sources. Identity terms are auto-annotated using a curated lexicon spanning four protected subgroups.
Corpus Composition
| Split | Samples | Hate | Non-Hate |
|---|---|---|---|
| Train | 20,684 | 9,607 | 11,077 |
| Dev | 2,955 | 1,373 | 1,582 |
| Test | 5,911 | 2,745 | 3,166 |
| Total | 29,550 | 13,725 | 15,825 |
70 / 10 / 20 stratified split, seed 42.
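The 70/10/20 stratified split can be reproduced with two passes of scikit-learn's `train_test_split`; the helper name `stratified_70_10_20` is ours, not from the released code, and this is a sketch of the splitting logic rather than the repository's implementation.

```python
from sklearn.model_selection import train_test_split

def stratified_70_10_20(texts, labels, seed=42):
    """Split into 70% train / 10% dev / 20% test, stratified on the label."""
    # First carve off the 20% test set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        texts, labels, test_size=0.20, stratify=labels, random_state=seed)
    # Dev is 10% of the full corpus, i.e. 0.10 / 0.80 = 12.5% of the remainder.
    X_train, X_dev, y_train, y_dev = train_test_split(
        X_rest, y_rest, test_size=0.125, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_dev, y_dev), (X_test, y_test)
```

Stratifying both passes keeps the hate/non-hate ratio near-identical across the three splits, which the corpus table above reflects.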
Identity Coverage
| Group | Samples | % of corpus |
|---|---|---|
| Gender | 4,722 | 16.0% |
| Religion | 1,777 | 6.0% |
| Caste | 411 | 1.4% |
| Region | 63 | 0.2% |
| Any identity | 6,973 | 23.6% |
The region subgroup is small, so its per-group metrics are treated cautiously in the analysis.
Methods
Three models are trained on the same data with identical hyperparameters; only the objective changes.
XLM-R Fine-tune
Standard fine-tuning of XLM-RoBERTa-base with cross-entropy on the toxicity label. No fairness intervention.
- Loss: L_tox
- No identity signal used
Counterfactual Augmentation
For each training instance containing an identity term, generate a counterfactual by swapping the term with another from the same group. Train on the union of original and augmented data (1:1 ratio).
- Loss: L_tox on the augmented set
- Augmentation ratio: 1.0
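The swap-and-union step above can be sketched as follows. The two-group mini-lexicon and function names are illustrative only; the paper's curated lexicon covers four subgroups and many more spellings.

```python
import random

# Hypothetical mini-lexicon for illustration; the actual curated lexicon
# spans religion, caste, gender, and region with informal spellings.
IDENTITY_LEXICON = {
    "religion": ["hindu", "muslim", "sikh", "christian"],
    "gender":   ["aurat", "mard", "ladki", "ladka"],
}

def counterfactual(text, rng=random.Random(42)):
    """Swap each identity term for another term from the same subgroup."""
    out = []
    for tok in text.split():
        low = tok.lower()
        for terms in IDENTITY_LEXICON.values():
            if low in terms:
                tok = rng.choice([t for t in terms if t != low])
                break
        out.append(tok)
    return " ".join(out)

def augment(dataset):
    """Return originals plus one counterfactual per identity-bearing example
    (1:1 ratio); the toxicity label is carried over unchanged."""
    augmented = list(dataset)
    for text, label in dataset:
        cf = counterfactual(text)
        if cf != text:  # only identity-bearing examples get a counterfactual twin
            augmented.append((cf, label))
    return augmented
```

Keeping the label fixed while swapping the identity term is what breaks the identity-token/toxicity correlation: the model sees the same context marked toxic (or benign) under multiple identities.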
Gradient Reversal
Two-head model: a toxicity classifier and an identity predictor share the encoder. Gradient reversal pushes the encoder to be uninformative about identity.
- Loss: L_total = L_tox − λ·L_adv
- λ = 0.5, adversary hidden dim 128
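A minimal PyTorch sketch of the gradient-reversal setup, assuming a pooled encoder output of dimension 768 (xlm-roberta-base) and the λ = 0.5, hidden-dim-128 settings above; class names are ours. With the reversal layer scaling gradients by −λ, simply summing the two cross-entropy losses gives the encoder the L_tox − λ·L_adv objective while the identity head still minimises its own loss.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # None: no gradient w.r.t. lam

class TwoHeadModel(nn.Module):
    """Shared encoder features feed a toxicity head and an adversarial identity head."""
    def __init__(self, encoder_dim=768, hidden=128, n_identity=4, lam=0.5):
        super().__init__()
        self.lam = lam
        self.tox_head = nn.Linear(encoder_dim, 2)
        self.adv_head = nn.Sequential(
            nn.Linear(encoder_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_identity))

    def forward(self, h):  # h: pooled encoder output, shape (batch, encoder_dim)
        tox_logits = self.tox_head(h)
        adv_logits = self.adv_head(GradReverse.apply(h, self.lam))
        return tox_logits, adv_logits
```

At the optimum of this game the identity head predicts at chance, i.e. the encoder carries no linearly decodable identity signal on average; as the analysis below notes, this is a weaker guarantee than counterfactual invariance.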
Training Configuration
| Hyperparameter | Value |
|---|---|
| Backbone | xlm-roberta-base |
| Max sequence length | 128 tokens |
| Batch size | 32 |
| Epochs | 5 |
| Optimizer | AdamW (β₁=0.9, β₂=0.98) |
| Learning rate | 2 × 10⁻⁵ |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Gradient clip | 1.0 |
| Seed | 42 |
Results
All metrics computed on the 5,911-sample held-out test set. Bias metrics computed across the four identity subgroups; lower is better.
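The two group-level bias metrics can be computed as max-minus-min gaps across subgroups. A sketch under that reading (function name ours; `group_masks` maps each subgroup to a boolean mask over test examples):

```python
import numpy as np

def group_metrics(y_true, y_pred, group_masks):
    """FPR Disparity = max - min false-positive rate across subgroups;
    DP gap = max - min positive-prediction rate across subgroups."""
    fprs, pos_rates = {}, {}
    for name, mask in group_masks.items():
        yt, yp = y_true[mask], y_pred[mask]
        negatives = yt == 0
        # FPR: fraction of truly non-hateful examples flagged as hateful.
        fprs[name] = (yp[negatives] == 1).mean() if negatives.any() else float("nan")
        pos_rates[name] = (yp == 1).mean()
    fpr_disparity = max(fprs.values()) - min(fprs.values())
    dp_gap = max(pos_rates.values()) - min(pos_rates.values())
    return fpr_disparity, dp_gap
```

Both quantities are zero when the classifier treats all subgroups identically, which is why lower is better throughout the tables below.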
Headline Comparison
| Model | Acc | Prec | Rec | F1 | Macro F1 | FPR Δ | DP Δ | CFT Gap |
|---|---|---|---|---|---|---|---|---|
| Baseline | **0.749** | 0.778 | **0.642** | **0.703** | **0.743** | 0.421 | 0.505 | 0.052 |
| CDA | 0.734 | 0.761 | 0.624 | 0.685 | 0.728 | **0.196** | **0.136** | **0.042** |
| Adversarial | 0.747 | **0.781** | 0.633 | 0.699 | 0.741 | 0.256 | 0.294 | 0.068 |

Bold = best per column. Acc through Macro F1 are utility metrics (higher is better); FPR Δ, DP Δ, and CFT Gap are fairness metrics (lower is better). CDA achieves the best fairness on all three metrics; the F1 cost is 1.8 percentage points.
Per-Group False Positive Rates
CDA reduces caste-group FPR from 0.42 to 0.34. Religion FPR is unchanged across CDA/Baseline; Adversarial improves it slightly. Region (n=63) is volatile and should be interpreted with caution.
Analysis & Discussion
1. CDA wins on every fairness metric
By directly exposing the model to identity-swapped variants of training examples, CDA breaks the spurious correlation between identity tokens and the toxicity label. The 73% reduction in Demographic Parity gap shows the model's positive prediction rate is now nearly equal across groups, and the CFT Gap reduction confirms that identity-only edits no longer flip the prediction.
2. Adversarial debiasing has a hidden failure mode
Adversarial debiasing improves FPR Disparity and DP Δ but worsens the CFT Gap (0.068 vs 0.052 baseline). Gradient reversal pushes the encoder toward identity-invariant representations on average, but does not guarantee that swapping a single identity token in a fixed context preserves the prediction. This is consistent with prior findings that adversarial debiasing can produce shallow decorrelation rather than genuine counterfactual robustness.
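The CFT Gap discussed here can be read as a flip rate over identity-swapped pairs: how often does editing only the identity term change the predicted label? A minimal sketch of that reading (function name ours; `predict` is any callable returning a hard label):

```python
def cft_gap(predict, pairs):
    """Counterfactual fairness gap: fraction of (original, identity-swapped)
    pairs on which the predicted label flips. 0.0 = fully invariant."""
    flips = sum(predict(a) != predict(b) for a, b in pairs)
    return flips / len(pairs)
```

Because this metric probes individual examples rather than group averages, a model can equalise group-level rates (low FPR Δ and DP Δ) while still failing it, which is exactly the adversarial model's failure mode.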
3. The fairness–utility tradeoff is small
CDA costs 1.8 points of F1 in exchange for a 54% reduction in FPR Disparity. For any deployment where false positives on identity-mentioning content carry a real harm (community moderation, journalism, legal text), this tradeoff is favourable.
4. Limitations
- Single dataset. Results are reported on one Hindi–English code-mixed corpus. Generalisation to other code-mixed languages or domains requires further study.
- Small region subgroup. Only 63 samples mention regional identity terms. Per-group metrics for region are noisy and reported but not interpreted strongly.
- Lexicon-based identity detection. Identity annotation relies on a curated lexicon, which under-counts implicit references and informal spellings.
- Binary toxicity label. We do not distinguish between hate speech, profanity, and offensive humour, which limits the granularity of conclusions.
Reproduce
Five steps recreate every number on this page.

1. Install dependencies

```bash
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
```

2. Prepare data

```bash
# place combined_hate_speech_dataset.csv in data/raw/
.venv/bin/python main.py --stage data
```

3. Train baseline

```bash
.venv/bin/python main.py --stage baseline
```

4. Train CDA & Adversarial

```bash
.venv/bin/python main.py --stage cda
.venv/bin/python main.py --stage adversarial
```

5. Evaluate & compare

```bash
.venv/bin/python main.py --stage evaluate   # → results/model_comparison.csv
```
Environment
- Python 3.9+
- PyTorch 2.0+ (CUDA, MPS, or CPU)
- transformers ≥ 4.35
- ~1.5 GB disk for model weights