Fairness in NLP · 2026

Mitigating Identity Bias in Hindi–English Code-Mixed Toxicity Detection

A controlled comparison of Counterfactual Data Augmentation and Adversarial Debiasing on XLM-RoBERTa, evaluated across four identity subgroups using FPR Disparity, Demographic Parity, and Counterfactual Fairness.

29,550 Annotated samples
−54% FPR Disparity (CDA)
−73% Demographic Parity Δ (CDA)
1.8% F1 cost

Abstract

Toxicity classifiers for Hindi–English code-mixed text systematically over-flag benign content that mentions identity groups such as religion, caste, gender, and region. We fine-tune XLM-RoBERTa on a 29,550-sample code-mixed corpus and compare two mitigation strategies: Counterfactual Data Augmentation (CDA), which augments training data by swapping identity terms, and Adversarial Debiasing, which jointly trains the classifier against an identity predictor via gradient reversal. CDA reduces False Positive Rate Disparity by 54% and the Demographic Parity gap by 73% with only a 1.8 point F1 drop. Adversarial debiasing better preserves accuracy but worsens counterfactual fairness, suggesting it superficially decorrelates identity from prediction without changing the underlying decision logic. We release code, processed data, and a counterfactual fairness test set for reproducibility.

Authors

Department of Computer Science and Engineering
SRM Institute of Science and Technology, Uttar Pradesh, India — 201204

Paper + Simple Guide

Choose the full research paper for technical detail, or open the plain-language PDF for a simpler walkthrough of the same ideas.

Dataset

We use the publicly available Kaggle Code-Mixed Hinglish Hate Speech corpus, a combined release aggregating multiple Hindi–English social media sources. Identity terms are auto-annotated using a curated lexicon spanning four protected subgroups.

Corpus Composition

Split     Samples    Hate      Non-Hate
Train     20,684     9,607     11,077
Dev        2,955     1,373      1,582
Test       5,911     2,745      3,166
Total     29,550    13,725     15,825

70 / 10 / 20 stratified split, seed 42.
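The split can be reproduced with scikit-learn's two-stage stratified splitting; a minimal sketch (function and variable names are illustrative, not from the released code):

```python
from sklearn.model_selection import train_test_split

def stratified_split(texts, labels, seed=42):
    """70 / 10 / 20 stratified train/dev/test split with a fixed seed."""
    # First carve out the 20% test set, stratified on the toxicity label.
    X_rem, X_test, y_rem, y_test = train_test_split(
        texts, labels, test_size=0.20, stratify=labels, random_state=seed)
    # Then split the remaining 80% into 70% train / 10% dev (10/80 = 0.125).
    X_train, X_dev, y_train, y_dev = train_test_split(
        X_rem, y_rem, test_size=0.125, stratify=y_rem, random_state=seed)
    return (X_train, y_train), (X_dev, y_dev), (X_test, y_test)
```

On the full corpus this yields the 20,684 / 2,955 / 5,911 partition shown above, with the hate/non-hate ratio preserved in each split.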

Identity Coverage

Group          Samples   % of corpus
Gender           4,722     16.0%
Religion         1,777      6.0%
Caste              411      1.4%
Region              63      0.2%
Any identity     6,973     23.6%

Region subgroup is small — treated cautiously in analysis.

Methods

Three models are trained on the same data with identical hyperparameters; only the objective changes.

Baseline

XLM-R Fine-tune

Standard fine-tuning of XLM-RoBERTa-base with cross-entropy on the toxicity label. No fairness intervention.

  • Loss: L_tox
  • No identity signal used
CDA

Counterfactual Augmentation

For each training instance containing an identity term, generate a counterfactual by swapping the term with another from the same group. Train on the union of original and augmented data (1:1 ratio).

  • Loss: L_tox on augmented set
  • Augmentation ratio: 1.0
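The swap step can be sketched as follows. The mini-lexicon here is purely illustrative; the paper's curated lexicon is larger and covers all four subgroups:

```python
import random

# Illustrative mini-lexicon (NOT the paper's curated lexicon).
IDENTITY_LEXICON = {
    "religion": ["hindu", "muslim", "sikh", "christian"],
    "gender":   ["ladki", "ladka", "aurat", "aadmi"],
}

def counterfactual(tokens, lexicon=IDENTITY_LEXICON, seed=42):
    """Return a copy of `tokens` with every identity term replaced by a
    different term drawn from the same subgroup; other tokens are kept."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        replaced = False
        for terms in lexicon.values():
            if tok.lower() in terms:
                out.append(rng.choice([t for t in terms if t != tok.lower()]))
                replaced = True
                break
        if not replaced:
            out.append(tok)
    return out
```

Training then proceeds on the union of original and swapped examples, giving the 1:1 augmentation ratio.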
Adversarial

Gradient Reversal

Two-head model: a toxicity classifier and an identity predictor share the encoder. Gradient reversal pushes the encoder to be uninformative about identity.

  • Loss: L_total = L_tox − λ·L_adv
  • λ = 0.5, hidden dim 128
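A minimal PyTorch sketch of the gradient-reversal layer: it is the identity in the forward pass and multiplies gradients by −λ in the backward pass, so minimizing L_tox + L_adv over the whole network optimizes L_tox − λ·L_adv with respect to the encoder.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity forward; scales gradients by -lambda on the way back."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # No gradient is needed for the scalar lam, hence the trailing None.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=0.5):
    return GradReverse.apply(x, lam)
```

The identity head sits on top of `grad_reverse(encoder_output)`, while the toxicity head reads the encoder output directly.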

Training Configuration

Backbone              xlm-roberta-base
Max sequence length   128 tokens
Batch size            32
Epochs                5
Optimizer             AdamW (β₁ = 0.9, β₂ = 0.98)
Learning rate         2 × 10⁻⁵
Warmup ratio          0.1
Weight decay          0.01
Gradient clip         1.0
Seed                  42
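The table maps onto a standard AdamW setup with linear warmup followed by linear decay; a sketch with the schedule hand-rolled via `LambdaLR` (the released code may use a library helper instead):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(params, num_training_steps,
                    lr=2e-5, warmup_ratio=0.1, weight_decay=0.01):
    """AdamW + linear warmup/decay matching the configuration above."""
    optimizer = torch.optim.AdamW(
        params, lr=lr, betas=(0.9, 0.98), weight_decay=weight_decay)
    warmup_steps = int(warmup_ratio * num_training_steps)

    def lr_lambda(step):
        if step < warmup_steps:          # linear warmup from 0 to lr
            return step / max(1, warmup_steps)
        return max(0.0, (num_training_steps - step)
                   / max(1, num_training_steps - warmup_steps))

    return optimizer, LambdaLR(optimizer, lr_lambda)
```

Gradient clipping is applied per step with `torch.nn.utils.clip_grad_norm_(params, 1.0)`.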

Results

All metrics computed on the 5,911-sample held-out test set. Bias metrics computed across the four identity subgroups; lower is better.

Headline Comparison

              ───────────── Utility ─────────────    ─ Fairness (lower is better) ─
Model         Acc     Prec    Rec     F1      Macro F1    FPR Δ    DP Δ    CFT Gap
Baseline      0.749   0.778   0.642   0.703   0.743       0.421    0.505   0.052
CDA           0.734   0.761   0.624   0.685   0.728       0.196    0.136   0.042
Adversarial   0.747   0.781   0.633   0.699   0.741       0.256    0.294   0.068

CDA achieves the best fairness on all three metrics; its F1 cost relative to Baseline is 1.8 percentage points.

FPR Disparity −54% CDA vs Baseline
Demographic Parity Δ −73% CDA vs Baseline
CFT Gap −19% CDA vs Baseline
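The gap metrics can be computed per subgroup as follows. This sketch assumes the common max-minus-min convention for FPR Disparity and Demographic Parity Δ; the paper's exact aggregation may differ:

```python
def group_rates(y_true, y_pred, group_mask):
    """FPR and positive-prediction rate restricted to one identity subgroup."""
    yt = [t for t, m in zip(y_true, group_mask) if m]
    yp = [p for p, m in zip(y_pred, group_mask) if m]
    neg_preds = [p for t, p in zip(yt, yp) if t == 0]
    fpr = sum(neg_preds) / len(neg_preds) if neg_preds else 0.0
    ppr = sum(yp) / len(yp) if yp else 0.0
    return fpr, ppr

def disparity(y_true, y_pred, group_masks):
    """Max-minus-min gap across subgroups (one common definition)."""
    fprs, pprs = zip(*(group_rates(y_true, y_pred, m) for m in group_masks))
    return max(fprs) - min(fprs), max(pprs) - min(pprs)
```

The CFT Gap, by contrast, is instance-level: it compares predictions on each counterfactual test pair rather than aggregating rates per group.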

Per-Group False Positive Rates

CDA reduces caste-group FPR from 0.42 to 0.34. Religion FPR is unchanged across CDA/Baseline; Adversarial improves it slightly. Region (n=63) is volatile and should be interpreted with caution.

Analysis & Discussion

1. CDA wins on every fairness metric

By directly exposing the model to identity-swapped variants of training examples, CDA breaks the spurious correlation between identity tokens and the toxicity label. The 73% reduction in Demographic Parity gap shows the model's positive prediction rate is now nearly equal across groups, and the CFT Gap reduction confirms that identity-only edits no longer flip the prediction.

2. Adversarial debiasing has a hidden failure mode

Adversarial debiasing improves FPR Disparity and DP Δ but worsens the CFT Gap (0.068 vs 0.052 baseline). Gradient reversal pushes the encoder toward identity-invariant representations on average, but does not guarantee that swapping a single identity token in a fixed context preserves the prediction. This is consistent with prior findings that adversarial debiasing can produce shallow decorrelation rather than genuine counterfactual robustness.

3. The fairness–utility tradeoff is small

CDA costs 1.8 points of F1 in exchange for a 54% reduction in FPR Disparity. For any deployment where false positives on identity-mentioning content carry a real harm (community moderation, journalism, legal text), this tradeoff is favourable.

4. Limitations

  • Single dataset. Results are reported on one Hindi–English code-mixed corpus. Generalisation to other code-mixed languages or domains requires further study.
  • Small region subgroup. Only 63 samples mention regional identity terms. Per-group metrics for region are noisy and reported but not interpreted strongly.
  • Lexicon-based identity detection. Identity annotation relies on a curated lexicon, which under-counts implicit references and informal spellings.
  • Binary toxicity label. We do not distinguish between hate speech, profanity, and offensive humour, which limits the granularity of conclusions.

Reproduce

Five commands recreate every number on this page.

  1. Install dependencies

    python3 -m venv .venv
    .venv/bin/pip install -r requirements.txt
  2. Prepare data

    # place combined_hate_speech_dataset.csv in data/raw/
    .venv/bin/python main.py --stage data
  3. Train baseline

    .venv/bin/python main.py --stage baseline
  4. Train CDA & Adversarial

    .venv/bin/python main.py --stage cda
    .venv/bin/python main.py --stage adversarial
  5. Evaluate & compare

    .venv/bin/python main.py --stage evaluate
    # → results/model_comparison.csv

Environment

  • Python 3.9+
  • PyTorch 2.0+ (CUDA, MPS, or CPU)
  • transformers ≥ 4.35
  • ~1.5 GB disk for model weights