Hindi Toxicity Bias: XLM-R fairness audit

Fairness in Hindi-English code-mixed NLP

Quantifying identity bias in Hinglish toxicity classifiers

A controlled XLM-RoBERTa study comparing standard fine-tuning, Counterfactual Data Augmentation, and Gradient-Reversal Adversarial Debiasing.

Dataset: 29,550 Hinglish social posts
Identity-bearing: 23.6% (6,973 samples)
Best mitigation: CDA (wins all fairness metrics)
Utility cost: 1.8 F1 points vs baseline
01

Identity Coverage

Gender: 4,722
Religion: 1,777
Caste: 411
Region: 63

Region results are reported but not interpreted strongly because the subgroup is small (n = 63).

CDA reduces identity-correlated false positives while preserving most classifier utility.

02

Implemented pipeline

From raw Hinglish text to fairness-aware evaluation

Data

Preprocess + split

Auto-detect text and hate labels, then create 70 / 10 / 20 stratified splits.

src/data_utils.py
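
The 70 / 10 / 20 stratified split can be illustrated with a minimal standard-library sketch. This is not the code in src/data_utils.py; the function name `stratified_split`, the rounding rule, and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=42):
    """Return (train, val, test) index lists that preserve label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test split
    return train, val, test

# Toy labels (hypothetical): 30% toxic, 70% non-toxic.
labels = [1] * 300 + [0] * 700
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting within each label keeps the toxic share roughly equal across the three splits, so the held-out test set is comparable to the training data.
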
Lexicon

Identity tagging

Mark religion, caste, gender, and region terms with curated Hinglish variants.

src/identity_detection.py
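
Lexicon-based identity tagging can be sketched as below. The tiny `LEXICON`, the function name `tag_identities`, and the whole-word matching rule are illustrative assumptions; the curated lexicon in src/identity_detection.py is not reproduced here.

```python
import re

# Hypothetical mini-lexicon; the real curated list is larger and
# includes more Hinglish spelling variants.
LEXICON = {
    "religion": ["hindu", "muslim", "musalman", "sikh"],
    "caste": ["dalit", "brahmin", "brahman"],
    "gender": ["ladki", "ladka", "aurat", "mahila"],
    "region": ["bihari", "punjabi", "bengali"],
}

# One case-insensitive, whole-word pattern per identity group.
PATTERNS = {
    group: re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE)
    for group, terms in LEXICON.items()
}

def tag_identities(text):
    """Map each identity group to the lexicon terms found in the text."""
    found = {}
    for group, pattern in PATTERNS.items():
        hits = [m.group(0).lower() for m in pattern.finditer(text)]
        if hits:
            found[group] = hits
    return found

print(tag_identities("Yeh ladki Bihari hai"))
```

A post counts as identity-bearing when the returned dictionary is non-empty. Whole-word matching avoids tagging substrings inside unrelated words, at the cost of missing spellings that are absent from the lexicon.
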
Mitigation

CDA generation

Swap identity terms within the same group to build CAHH and CFT test pairs.

src/cda.py
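
Within-group term swapping can be sketched as follows. The swap lists and the function name `counterfactuals` are hypothetical, and the sketch swaps one term occurrence at a time; src/cda.py may handle multiple occurrences, casing, and grammatical agreement differently.

```python
import re

# Hypothetical swap lists; terms are only exchanged within their own group.
SWAP_GROUPS = {
    "religion": ["hindu", "muslim", "sikh"],
    "gender": ["ladki", "ladka"],
}

TERM_TO_GROUP = {t: g for g, terms in SWAP_GROUPS.items() for t in terms}
TERM_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, TERM_TO_GROUP)) + r")\b", re.IGNORECASE
)

def counterfactuals(text):
    """Return copies of text with one identity term replaced by a same-group term."""
    variants = []
    for match in TERM_RE.finditer(text):
        term = match.group(0).lower()
        for alt in SWAP_GROUPS[TERM_TO_GROUP[term]]:
            if alt != term:
                # Replacement is lowercase; original casing is not preserved.
                variants.append(text[:match.start()] + alt + text[match.end():])
    return variants

print(counterfactuals("woh ladki smart hai"))  # ['woh ladka smart hai']
```

Each variant keeps the toxicity label of its source post, so the classifier sees the same label attached to different identity terms.
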
Models

Three matched runs

Baseline, CDA, and adversarial models share XLM-RoBERTa hyperparameters.

src/train_*.py
Audit

Bias metrics

Compare FPR disparity, demographic parity gap, and counterfactual prediction shift.

src/bias_metrics.py
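
The three audit metrics are named above but not defined on this page. The sketch below uses one common set of definitions, all of which are assumptions: FPR disparity and demographic parity gap as the max-minus-min spread across identity groups, and the counterfactual gap as the mean absolute change in predicted toxicity probability between a post and its identity-swapped copy. The function names are hypothetical.

```python
from collections import defaultdict

def group_rates(y_true, y_pred, groups, negatives_only):
    """Positive-prediction rate per group (over true negatives only for FPR)."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positive predictions, total]
    for t, p, g in zip(y_true, y_pred, groups):
        if negatives_only and t != 0:
            continue
        counts[g][0] += p
        counts[g][1] += 1
    return {g: pos / total for g, (pos, total) in counts.items()}

def fpr_disparity(y_true, y_pred, groups):
    rates = group_rates(y_true, y_pred, groups, negatives_only=True)
    return max(rates.values()) - min(rates.values())

def demographic_parity_gap(y_true, y_pred, groups):
    rates = group_rates(y_true, y_pred, groups, negatives_only=False)
    return max(rates.values()) - min(rates.values())

def counterfactual_gap(p_original, p_swapped):
    diffs = [abs(a - b) for a, b in zip(p_original, p_swapped)]
    return sum(diffs) / len(diffs)

# Toy data (hypothetical): 1 = toxic, 0 = non-toxic.
y_true = [0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]
groups = ["caste", "caste", "caste", "gender", "gender", "gender"]
print(fpr_disparity(y_true, y_pred, groups))           # 0.5
print(demographic_parity_gap(y_true, y_pred, groups))
print(counterfactual_gap([0.9, 0.2], [0.7, 0.3]))
```

All three are lower-is-better: zero means the groups, or the original and swapped posts, are treated identically under that metric.
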
Baseline: L_tox

Standard toxicity classification objective.

CDA: L_tox + CAHH

Original plus identity-swapped examples.

Adversarial: L_tox - λ * L_adv

Gradient reversal with λ = 0.5.
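
As a sketch of how gradient reversal realizes the adversarial objective, the standard formulation is written out below. The symbols are assumptions not used elsewhere on this page: h is the encoder representation, θ_f the encoder parameters, θ_a the adversary parameters, and η the learning rate.

```latex
% Gradient reversal layer: identity on the forward pass,
% gradient scaled by -\lambda on the backward pass.
R_\lambda(h) = h, \qquad \frac{\partial R_\lambda}{\partial h} = -\lambda I

% The adversary minimizes its own loss on the reversed features:
\theta_a \leftarrow \theta_a - \eta \, \frac{\partial L_{adv}}{\partial \theta_a}

% The encoder therefore descends on L_{tox} - \lambda L_{adv}:
\theta_f \leftarrow \theta_f - \eta \left(
  \frac{\partial L_{tox}}{\partial \theta_f}
  - \lambda \, \frac{\partial L_{adv}}{\partial \theta_f}
\right), \qquad \lambda = 0.5
```

The encoder is pushed toward representations from which the adversary cannot recover identity information, while the toxicity head is trained as usual.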

Shared settings: XLM-RoBERTa-base, 128 tokens, batch size 32, 5 epochs, AdamW (learning rate 2e-5), seed 42.
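
One way to keep the three runs matched is a single shared configuration object, sketched below. The values come from the settings listed above; the class name `RunConfig`, the field names, and the `use_cda` / `adv_lambda` switches are hypothetical and not taken from src/train_*.py.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    model_name: str = "xlm-roberta-base"
    max_length: int = 128        # tokens
    batch_size: int = 32
    epochs: int = 5
    optimizer: str = "AdamW"
    learning_rate: float = 2e-5
    seed: int = 42
    use_cda: bool = False        # train on original plus identity-swapped examples
    adv_lambda: float = 0.0      # 0.0 disables the adversarial branch

# Only the mitigation switches differ between runs.
RUNS = {
    "baseline": RunConfig(),
    "cda": RunConfig(use_cda=True),
    "adversarial": RunConfig(adv_lambda=0.5),
}

for name, cfg in RUNS.items():
    print(name, cfg)
```

Freezing the dataclass guards against a run silently changing a shared hyperparameter, which would undermine the controlled comparison.
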
03

Held-out test set, n = 5,911

CDA produces the strongest fairness gains

Fairness Metrics (lower is better for FPR Δ, DP Δ, and CFT Gap)
Model | F1 | FPR Δ | DP Δ | CFT Gap
FPR Disparity: -54% (CDA vs baseline)

Demographic Parity: -73% (CDA vs baseline)

Adversarial CFT Gap: +31% (worse than baseline)
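
The percentage figures are relative changes against the baseline. As a quick check, the +31% for the adversarial CFT gap follows from the values 0.052 (baseline) and 0.068 (adversarial) reported in the interpretation section; the helper name `relative_change_pct` is illustrative.

```python
def relative_change_pct(baseline, new):
    """Percentage change of `new` relative to `baseline`."""
    return 100.0 * (new - baseline) / baseline

cft_change = relative_change_pct(0.052, 0.068)
print(f"{cft_change:+.0f}%")  # prints +31%
```
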

Figure: Per-Group False Positive Rate (identity audit)

Figure: Utility vs Bias (top-left is best)
04

Presentation-ready interpretation

What the results mean

01

CDA is the practical winner

It improves every fairness metric with a small F1 tradeoff, making it the strongest mitigation strategy for this project.

02

Caste false positives are highest

Baseline caste FPR is 0.421, nearly 3x the gender FPR, which shows why fairness audits that account for South Asian identities matter.

03

Adversarial debiasing masks risk

It improves group-level fairness but increases counterfactual instability: the CFT gap rises from 0.052 to 0.068, about 31% above baseline.

04

Limitations are explicit

Single dataset, lexicon-based tagging, binary labels, single-seed runs, and a small region subgroup constrain the claims.

Main PPT line: Counterfactual Data Augmentation is the more robust fairness intervention for Hinglish toxicity classification.