Fairness in Hindi-English code-mixed NLP
Quantifying identity bias in Hinglish toxicity classifiers
A controlled XLM-RoBERTa study comparing standard fine-tuning, Counterfactual Data Augmentation, and Gradient-Reversal Adversarial Debiasing.
Identity Coverage
Region results are reported but not interpreted strongly because the subgroup is small (n = 63).
CDA reduces identity-correlated false positives while preserving most classifier utility.
Implemented pipeline
From raw Hinglish text to fairness-aware evaluation
Preprocess + split
Auto-detect the text and hate-label columns, then create 70 / 10 / 20 stratified train / dev / test splits.
src/data_utils.py
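The split step can be illustrated with a minimal, standard-library sketch. It is not the code in src/data_utils.py; the function name `stratified_split`, the seed, and the toy labels are assumptions made for illustration. It shuffles the indices of each label separately and cuts each label's list at 70% and 80%, so every split keeps roughly the same hate / non-hate ratio.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=13):
    """Return (train, dev, test) index lists that preserve label proportions.

    Hypothetical helper: name, default seed, and rounding policy are
    illustrative assumptions, not taken from the project code.
    """
    rng = random.Random(seed)

    # Group example indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, dev, test = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_dev = round(fractions[1] * len(indices))
        train.extend(indices[:n_train])
        dev.extend(indices[n_train:n_train + n_dev])
        # Whatever remains goes to test, so no example is dropped.
        test.extend(indices[n_train + n_dev:])
    return train, dev, test


# Toy data: 30 hateful (1) and 70 non-hateful (0) examples.
labels = [1] * 30 + [0] * 70
train_idx, dev_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(dev_idx), len(test_idx))
```

Splitting per label rather than over the whole dataset is what keeps the minority class from being under-represented in the smaller dev split.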
Identity tagging
Mark religion, caste, gender, and region terms with curated Hinglish variants.
src/identity_detection.py
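A lexicon-based tagger of this kind might look like the sketch below. The word lists are a small hypothetical sample, not the curated lexicon in src/identity_detection.py, and `tag_identities` is an assumed name. Each group gets one case-insensitive, word-bounded regular expression.

```python
import re

# Hypothetical mini-lexicon for illustration only; the project's curated
# Hinglish variant lists are not reproduced here.
IDENTITY_LEXICON = {
    "religion": ["hindu", "muslim", "musalman", "sikh", "isai"],
    "caste": ["dalit", "brahmin", "baniya"],
    "gender": ["aurat", "ladki", "ladka", "mard"],
    "region": ["bihari", "punjabi", "gujarati"],
}


def _compile(terms):
    # Longest terms first so a longer variant wins over a shorter prefix.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


PATTERNS = {group: _compile(terms) for group, terms in IDENTITY_LEXICON.items()}


def tag_identities(text):
    """Return {group: [matched terms, lowercased]} for groups found in text."""
    tags = {}
    for group, pattern in PATTERNS.items():
        matches = [m.group(0).lower() for m in pattern.finditer(text)]
        if matches:
            tags[group] = matches
    return tags


print(tag_identities("Ye Bihari ladki bahut mehnati hai"))
```

Because matching is word-bounded, inflected or respelled forms (for example a plural such as "ladkiyan") are only caught if they are listed explicitly, which is one reason lexicon-based tagging appears among the limitations.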
CDA generation
Swap identity terms within the same group to build CAHH and CFT test pairs.
src/cda.py
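The within-group swap can be sketched as follows. This is an illustrative approximation rather than the contents of src/cda.py: `SWAP_GROUPS` and `counterfactuals` are hypothetical names, and the term lists are a small sample. For every identity term found in a sentence, it emits one rewritten copy per alternative term from the same group.

```python
import re

# Hypothetical swap groups for illustration; the real lists live in the project.
SWAP_GROUPS = {
    "religion": ["hindu", "muslim", "sikh", "isai"],
    "gender": ["ladki", "ladka"],
}


def counterfactuals(text):
    """Return (group, original_term, new_term, rewritten_text) tuples."""
    pairs = []
    for group, terms in SWAP_GROUPS.items():
        for term in terms:
            pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
            if not pattern.search(text):
                continue
            # Swap only within the same group, never across groups.
            for other in terms:
                if other != term:
                    pairs.append((group, term, other, pattern.sub(other, text)))
    return pairs


for row in counterfactuals("woh hindu ladka yahan rehta hai"):
    print(row)
```

In a CDA training set, each rewritten sentence would typically keep the original label. Plain term swapping does not repair Hindi gender agreement (a swap of "ladka" to "ladki" leaves "rehta" unchanged), so gender counterfactuals produced this way can be slightly ungrammatical.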
Three matched runs
Baseline, CDA, and adversarial models share XLM-RoBERTa hyperparameters.
src/train_*.py
Bias metrics
Compare FPR disparity, demographic parity gap, and counterfactual prediction shift.
src/bias_metrics.py
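The three metrics can be made concrete with a short sketch. The exact definitions in src/bias_metrics.py are not shown here, so the ones below are assumptions: FPR disparity and demographic parity gap are taken as the maximum minus the minimum across identity groups, and counterfactual prediction shift as the mean absolute change in predicted toxicity probability between an original sentence and its identity-swapped copy. All function names are hypothetical.

```python
def false_positive_rate(y_true, y_pred):
    """Share of truly non-toxic examples (label 0) predicted toxic (1)."""
    negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    return sum(negatives) / len(negatives) if negatives else float("nan")


def positive_rate(y_pred):
    """Share of examples predicted toxic, regardless of the true label."""
    return sum(y_pred) / len(y_pred) if y_pred else float("nan")


def group_gaps(y_true, y_pred, groups):
    """Return (FPR gap, demographic parity gap) as max - min across groups.

    Assumed definition; groups without any negative example yield NaN
    and should be filtered out before calling this.
    """
    fprs, rates = {}, {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        fprs[g] = false_positive_rate([y_true[i] for i in idx],
                                      [y_pred[i] for i in idx])
        rates[g] = positive_rate([y_pred[i] for i in idx])
    fpr_gap = max(fprs.values()) - min(fprs.values())
    dp_gap = max(rates.values()) - min(rates.values())
    return fpr_gap, dp_gap


def counterfactual_gap(p_original, p_swapped):
    """Mean absolute change in predicted probability after an identity swap."""
    diffs = [abs(a - b) for a, b in zip(p_original, p_swapped)]
    return sum(diffs) / len(diffs)


# Toy example with two identity groups of four examples each.
y_true = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["caste"] * 4 + ["gender"] * 4
print(group_gaps(y_true, y_pred, groups))
print(counterfactual_gap([0.9, 0.2], [0.7, 0.3]))
```

The group gaps measure disparity between groups, while the counterfactual gap measures instability on individual sentence pairs, so a model can improve on one and worsen on the other.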
Baseline: standard toxicity classification objective.
CDA: original plus identity-swapped training examples.
Adversarial: gradient reversal with λ = 0.5.
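For reference, gradient reversal as it is usually formulated can be written out as below. The notation is an assumption introduced here, not taken from the project: θ_f for the shared XLM-RoBERTa encoder, θ_y for the toxicity head, θ_d for an adversary head that is assumed to predict the identity group, η for the learning rate, and L_tox and L_id for the two losses. The layer is the identity in the forward pass and scales gradients by −λ in the backward pass, so the encoder is pushed to make identity harder to predict while the adversary still tries to predict it.

```latex
% Gradient-reversal layer R_lambda applied to the encoder output h
R_\lambda(h) = h,
\qquad
\frac{\partial R_\lambda}{\partial h} = -\lambda I,
\qquad
\lambda = 0.5

% Resulting parameter updates
\theta_y \leftarrow \theta_y - \eta \, \frac{\partial L_{\mathrm{tox}}}{\partial \theta_y}
\qquad
\theta_d \leftarrow \theta_d - \eta \, \frac{\partial L_{\mathrm{id}}}{\partial \theta_d}

\theta_f \leftarrow \theta_f - \eta \left(
    \frac{\partial L_{\mathrm{tox}}}{\partial \theta_f}
    - \lambda \, \frac{\partial L_{\mathrm{id}}}{\partial \theta_f}
\right)
```

With λ = 0.5, the adversarial signal reaching the encoder carries half the weight of the toxicity gradient.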
Held-out test set, n = 5,911
CDA produces the strongest fairness gains
Fairness Metrics
Lower is better
| Model | F1 | FPR Δ | DP Δ | CFT Gap |
|---|---|---|---|---|
Per-Group False Positive Rate
Identity audit
Utility vs Bias
Top-left is best
Presentation-ready interpretation
What the results mean
CDA is the practical winner
It improves every fairness metric with a small F1 tradeoff, making it the strongest mitigation strategy for this project.
Caste false positives are highest
Baseline caste FPR is 0.421, nearly 3x gender FPR, showing why South Asian identity-aware fairness audits matter.
Adversarial debiasing masks risk
It improves group-level fairness but increases counterfactual instability from 0.052 to 0.068.
Limitations are explicit
Single dataset, lexicon-based tagging, binary labels, single-seed runs, and a small region subgroup constrain the claims.
Main PPT line: Counterfactual Data Augmentation is the more robust fairness intervention for Hinglish toxicity classification.