Fairness in NLP · 2026
Mitigating Identity Bias in Hindi–English Code-Mixed Toxicity Detection
A controlled comparison of Counterfactual Data Augmentation and Adversarial Debiasing on XLM-RoBERTa, evaluated across four identity subgroups using FPR Disparity, Demographic Parity, and Counterfactual Fairness.
Abstract
Toxicity classifiers for Hindi–English code-mixed text systematically over-flag benign content that mentions identity groups such as religion, caste, gender, and region. We fine-tune XLM-RoBERTa on a 29,550-sample code-mixed corpus and compare two mitigation strategies: Counterfactual Data Augmentation (CDA), which augments training data by swapping identity terms, and Adversarial Debiasing, which jointly trains the classifier against an identity predictor via gradient reversal. CDA reduces False Positive Rate Disparity by 54% and the Demographic Parity gap by 73% with only a 1.8 point F1 drop. Adversarial debiasing better preserves accuracy but worsens counterfactual fairness, suggesting it superficially decorrelates identity from prediction without changing the underlying decision logic. We release code, processed data, and a counterfactual fairness test set for reproducibility.
Dataset
We use the publicly available Kaggle Code-Mixed Hinglish Hate Speech corpus, a combined release aggregating multiple Hindi–English social media sources. Identity terms are auto-annotated using a curated lexicon spanning four protected subgroups.
Corpus Composition
| Split | Samples | Hate | Non-Hate |
|---|---|---|---|
| Train | 20,684 | 9,607 | 11,077 |
| Dev | 2,955 | 1,373 | 1,582 |
| Test | 5,911 | 2,745 | 3,166 |
| Total | 29,550 | 13,725 | 15,825 |
70 / 10 / 20 stratified split, seed 42.
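The 70/10/20 stratified split can be reproduced with two passes of scikit-learn's `train_test_split`; the helper name `stratified_70_10_20` is ours, not from the released code, and this is a sketch of the splitting logic rather than the repository's implementation.

```python
from sklearn.model_selection import train_test_split

def stratified_70_10_20(texts, labels, seed=42):
    """Split into 70% train / 10% dev / 20% test, stratified on the label."""
    # First carve off the 20% test set.
    X_rest, X_test, y_rest, y_test = train_test_split(
        texts, labels, test_size=0.20, stratify=labels, random_state=seed)
    # Dev is 10% of the full corpus, i.e. 0.10 / 0.80 = 12.5% of the remainder.
    X_train, X_dev, y_train, y_dev = train_test_split(
        X_rest, y_rest, test_size=0.125, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_dev, y_dev), (X_test, y_test)
```

Stratifying both passes keeps the hate/non-hate ratio near-identical across the three splits, which the corpus table above reflects.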
Identity Coverage
| Group | Samples | % of corpus |
|---|---|---|
| Gender | 4,722 | 16.0% |
| Religion | 1,777 | 6.0% |
| Caste | 411 | 1.4% |
| Region | 63 | 0.2% |
| Any identity | 6,973 | 23.6% |
The region subgroup is small, so its per-group metrics are treated cautiously in the analysis.
Methods
Three models are trained on the same data with identical hyperparameters; only the objective changes.
XLM-R Fine-tune
Standard fine-tuning of XLM-RoBERTa-base with cross-entropy on the toxicity label. No fairness intervention.
- Loss: L_tox
- No identity signal used
Counterfactual Augmentation
For each training instance containing an identity term, generate a counterfactual by swapping the term with another from the same group. Train on the union of original and augmented data (1:1 ratio).
- Loss: L_tox on the augmented set
- Augmentation ratio: 1.0
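The swap-and-union step above can be sketched as follows. The two-group mini-lexicon and function names are illustrative only; the paper's curated lexicon covers four subgroups and many more spellings.

```python
import random

# Hypothetical mini-lexicon for illustration; the actual curated lexicon
# spans religion, caste, gender, and region with informal spellings.
IDENTITY_LEXICON = {
    "religion": ["hindu", "muslim", "sikh", "christian"],
    "gender":   ["aurat", "mard", "ladki", "ladka"],
}

def counterfactual(text, rng=random.Random(42)):
    """Swap each identity term for another term from the same subgroup."""
    out = []
    for tok in text.split():
        low = tok.lower()
        for terms in IDENTITY_LEXICON.values():
            if low in terms:
                tok = rng.choice([t for t in terms if t != low])
                break
        out.append(tok)
    return " ".join(out)

def augment(dataset):
    """Return originals plus one counterfactual per identity-bearing example
    (1:1 ratio); the toxicity label is carried over unchanged."""
    augmented = list(dataset)
    for text, label in dataset:
        cf = counterfactual(text)
        if cf != text:  # only identity-bearing examples get a counterfactual twin
            augmented.append((cf, label))
    return augmented
```

Keeping the label fixed while swapping the identity term is what breaks the identity-token/toxicity correlation: the model sees the same context marked toxic (or benign) under multiple identities.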
Gradient Reversal
Two-head model: a toxicity classifier and an identity predictor share the encoder. Gradient reversal pushes the encoder to be uninformative about identity.
- Loss: L_total = L_tox − λ·L_adv
- λ = 0.5, adversary hidden dim 128
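A minimal PyTorch sketch of the gradient-reversal setup, assuming a pooled encoder output of dimension 768 (xlm-roberta-base) and the λ = 0.5, hidden-dim-128 settings above; class names are ours. With the reversal layer scaling gradients by −λ, simply summing the two cross-entropy losses gives the encoder the L_tox − λ·L_adv objective while the identity head still minimises its own loss.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # None: no gradient w.r.t. lam

class TwoHeadModel(nn.Module):
    """Shared encoder features feed a toxicity head and an adversarial identity head."""
    def __init__(self, encoder_dim=768, hidden=128, n_identity=4, lam=0.5):
        super().__init__()
        self.lam = lam
        self.tox_head = nn.Linear(encoder_dim, 2)
        self.adv_head = nn.Sequential(
            nn.Linear(encoder_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_identity))

    def forward(self, h):  # h: pooled encoder output, shape (batch, encoder_dim)
        tox_logits = self.tox_head(h)
        adv_logits = self.adv_head(GradReverse.apply(h, self.lam))
        return tox_logits, adv_logits
```

At the optimum of this game the identity head predicts at chance, i.e. the encoder carries no linearly decodable identity signal on average; as the analysis below notes, this is a weaker guarantee than counterfactual invariance.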
Training Configuration
| Hyperparameter | Value |
|---|---|
| Backbone | xlm-roberta-base |
| Max sequence length | 128 tokens |
| Batch size | 32 |
| Epochs | 5 |
| Optimizer | AdamW (β₁=0.9, β₂=0.98) |
| Learning rate | 2 × 10⁻⁵ |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Gradient clip | 1.0 |
| Seed | 42 |
Results
All metrics computed on the 5,911-sample held-out test set. Bias metrics computed across the four identity subgroups; lower is better.
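The two group-level bias metrics can be computed as max-minus-min gaps across subgroups. A sketch under that reading (function name ours; `group_masks` maps each subgroup to a boolean mask over test examples):

```python
import numpy as np

def group_metrics(y_true, y_pred, group_masks):
    """FPR Disparity = max - min false-positive rate across subgroups;
    DP gap = max - min positive-prediction rate across subgroups."""
    fprs, pos_rates = {}, {}
    for name, mask in group_masks.items():
        yt, yp = y_true[mask], y_pred[mask]
        negatives = yt == 0
        # FPR: fraction of truly non-hateful examples flagged as hateful.
        fprs[name] = (yp[negatives] == 1).mean() if negatives.any() else float("nan")
        pos_rates[name] = (yp == 1).mean()
    fpr_disparity = max(fprs.values()) - min(fprs.values())
    dp_gap = max(pos_rates.values()) - min(pos_rates.values())
    return fpr_disparity, dp_gap
```

Both quantities are zero when the classifier treats all subgroups identically, which is why lower is better throughout the tables below.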
Headline Comparison
| Model | Acc | Prec | Rec | F1 | Macro F1 | FPR Δ | DP Δ | CFT Gap |
|---|---|---|---|---|---|---|---|---|
| Baseline | **0.749** | 0.778 | **0.642** | **0.703** | **0.743** | 0.421 | 0.505 | 0.052 |
| CDA | 0.734 | 0.761 | 0.624 | 0.685 | 0.728 | **0.196** | **0.136** | **0.042** |
| Adversarial | 0.747 | **0.781** | 0.633 | 0.699 | 0.741 | 0.256 | 0.294 | 0.068 |

Bold = best per column. Acc through Macro F1 are utility metrics (higher is better); FPR Δ, DP Δ, and CFT Gap are fairness metrics (lower is better). CDA achieves the best fairness on all three metrics; the F1 cost is 1.8 percentage points.
Per-Group False Positive Rates
CDA reduces caste-group FPR from 0.42 to 0.34. Religion FPR is unchanged across CDA/Baseline; Adversarial improves it slightly. Region (n=63) is volatile and should be interpreted with caution.
Analysis & Discussion
1. CDA wins on every fairness metric
By directly exposing the model to identity-swapped variants of training examples, CDA breaks the spurious correlation between identity tokens and the toxicity label. The 73% reduction in Demographic Parity gap shows the model's positive prediction rate is now nearly equal across groups, and the CFT Gap reduction confirms that identity-only edits no longer flip the prediction.
2. Adversarial debiasing has a hidden failure mode
Adversarial debiasing improves FPR Disparity and DP Δ but worsens the CFT Gap (0.068 vs 0.052 baseline). Gradient reversal pushes the encoder toward identity-invariant representations on average, but does not guarantee that swapping a single identity token in a fixed context preserves the prediction. This is consistent with prior findings that adversarial debiasing can produce shallow decorrelation rather than genuine counterfactual robustness.
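The CFT Gap discussed here can be read as a flip rate over identity-swapped pairs: how often does editing only the identity term change the predicted label? A minimal sketch of that reading (function name ours; `predict` is any callable returning a hard label):

```python
def cft_gap(predict, pairs):
    """Counterfactual fairness gap: fraction of (original, identity-swapped)
    pairs on which the predicted label flips. 0.0 = fully invariant."""
    flips = sum(predict(a) != predict(b) for a, b in pairs)
    return flips / len(pairs)
```

Because this metric probes individual examples rather than group averages, a model can equalise group-level rates (low FPR Δ and DP Δ) while still failing it, which is exactly the adversarial model's failure mode.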
3. The fairness–utility tradeoff is small
CDA costs 1.8 points of F1 in exchange for a 54% reduction in FPR Disparity. For any deployment where false positives on identity-mentioning content carry a real harm (community moderation, journalism, legal text), this tradeoff is favourable.
4. Limitations
- Single dataset. Results are reported on one Hindi–English code-mixed corpus. Generalisation to other code-mixed languages or domains requires further study.
- Small region subgroup. Only 63 samples mention regional identity terms. Per-group metrics for region are noisy and reported but not interpreted strongly.
- Lexicon-based identity detection. Identity annotation relies on a curated lexicon, which under-counts implicit references and informal spellings.
- Binary toxicity label. We do not distinguish between hate speech, profanity, and offensive humour, which limits the granularity of conclusions.
Reproduce
Five steps recreate every number on this page.

1. Install dependencies

```bash
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
```

2. Prepare data

```bash
# place combined_hate_speech_dataset.csv in data/raw/
.venv/bin/python main.py --stage data
```

3. Train baseline

```bash
.venv/bin/python main.py --stage baseline
```

4. Train CDA & Adversarial

```bash
.venv/bin/python main.py --stage cda
.venv/bin/python main.py --stage adversarial
```

5. Evaluate & compare

```bash
.venv/bin/python main.py --stage evaluate   # → results/model_comparison.csv
```
Environment
- Python 3.9+
- PyTorch 2.0+ (CUDA, MPS, or CPU)
- transformers ≥ 4.35
- ~1.5 GB disk for model weights