Fairness in Hindi-English code-mixed NLP

Quantifying identity bias in Hinglish toxicity classifiers

A controlled XLM-RoBERTa study comparing standard fine-tuning, Counterfactual Data Augmentation, and Gradient-Reversal Adversarial Debiasing.

Dataset: 29,550 Hinglish social posts

Identity-bearing: 6,973 samples (23.6%)

Best mitigation: CDA, which wins on all fairness metrics

Utility cost: 1.8 F1 points vs. baseline

01

Identity Coverage

Group	Samples
Gender	4,722
Religion	1,777
Caste	411
Region	63

Region results are reported but not interpreted strongly because the subgroup has only 63 samples.

CDA reduces identity-correlated false positives while preserving most classifier utility.

02

Implemented pipeline

From raw Hinglish text to fairness-aware evaluation

Data

Preprocess + split

Auto-detect the text and hate-label columns, then create stratified 70 / 10 / 20 train / validation / test splits.

src/data_utils.py
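
The stratified 70 / 10 / 20 split can be sketched as follows. This is a minimal standard-library illustration, not the contents of src/data_utils.py; the function name `stratified_split` and the toy class ratio are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=42):
    """Return (train, val, test) index lists that preserve per-label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])  # remainder goes to the test split
    return train, val, test

# Toy labels: 1 = hateful, 0 = not hateful (the class ratio is illustrative only).
labels = [1] * 300 + [0] * 700
train_idx, val_idx, test_idx = stratified_split(labels)
```

Splitting each label bucket separately keeps the hateful share roughly equal across the three splits, which matters when per-group false positive rates are later compared on the test set.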

Lexicon

Identity tagging

Mark religion, caste, gender, and region terms with curated Hinglish variants.

src/identity_detection.py
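
Lexicon-based identity tagging of this kind might look like the sketch below. The term lists are illustrative placeholders, not the project's curated lexicon, and `tag_identities` is a hypothetical helper name.

```python
import re

# Illustrative entries only; the curated Hinglish lexicon is not shown in the source.
IDENTITY_LEXICON = {
    "religion": ["hindu", "muslim", "musalman", "sikh", "isai"],
    "caste": ["dalit", "brahmin", "rajput"],
    "gender": ["aurat", "ladki", "mahila", "mard", "ladka"],
    "region": ["bihari", "punjabi", "bengali"],
}

# One compiled pattern per group; \b stops "mard" from matching inside "mardana".
PATTERNS = {
    group: re.compile(r"\b(?:" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE)
    for group, terms in IDENTITY_LEXICON.items()
}

def tag_identities(text):
    """Return {group: [matched terms]} for every identity group mentioned in text."""
    tags = {}
    for group, pattern in PATTERNS.items():
        matches = [m.group(0).lower() for m in pattern.finditer(text)]
        if matches:
            tags[group] = matches
    return tags

example = tag_identities("Woh ladki Bihari hai aur bahut mehnati hai")
```

Whole-word matching keeps the tagger simple, but it will miss spelling variants that are absent from the lexicon, which is one reason lexicon-based tagging appears among the stated limitations.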

Mitigation

CDA generation

Swap identity terms within the same group to build CAHH and CFT test pairs.

src/cda.py
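
A same-group swap can be sketched as below. This is a simplified illustration under stated assumptions: the swap groups are hypothetical, and the function `counterfactuals` is not taken from src/cda.py.

```python
import re

# Hypothetical swap groups; the real term lists belong to the project lexicon.
SWAP_GROUPS = {
    "religion": ["hindu", "muslim", "sikh"],
    "gender": ["ladki", "ladka"],
}

def counterfactuals(text):
    """Yield (group, original_term, new_term, swapped_text) for each same-group swap."""
    for group, terms in SWAP_GROUPS.items():
        for term in terms:
            pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
            if not pattern.search(text):
                continue
            for replacement in terms:
                if replacement != term:
                    yield group, term, replacement, pattern.sub(replacement, text)

pairs = list(counterfactuals("yeh ladki bahut tez hai"))
```

A plain string swap like this ignores grammatical agreement: Hindi verbs and adjectives often inflect for gender, so a real generator would likely need extra handling for gender swaps.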

Models

Three matched runs

Baseline, CDA, and adversarial models share XLM-RoBERTa hyperparameters.

src/train_*.py

Audit

Bias metrics

Compare FPR disparity, demographic parity gap, and counterfactual prediction shift.

src/bias_metrics.py
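
The three audit metrics can be illustrated with a small sketch. The source does not state the exact formulas, so the definitions here are assumptions: FPR disparity and demographic parity gap are taken as max minus min across groups, and the counterfactual gap as the mean absolute change in predicted toxicity probability between an original and its identity-swapped pair. All function names are hypothetical.

```python
def fpr(y_true, y_pred):
    """False positive rate: share of truly non-toxic items predicted toxic."""
    negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    return sum(negatives) / len(negatives) if negatives else 0.0

def positive_rate(y_pred):
    """Share of items predicted toxic, regardless of the true label."""
    return sum(y_pred) / len(y_pred) if y_pred else 0.0

def gap(values):
    return max(values) - min(values)

def bias_report(y_true, y_pred, groups):
    """Per-group FPR plus max-min gaps for FPR and positive prediction rate."""
    per_group_fpr, per_group_rate = {}, {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        per_group_fpr[g] = fpr([y_true[i] for i in idx], [y_pred[i] for i in idx])
        per_group_rate[g] = positive_rate([y_pred[i] for i in idx])
    return {
        "fpr_by_group": per_group_fpr,
        "fpr_disparity": gap(list(per_group_fpr.values())),
        "dp_gap": gap(list(per_group_rate.values())),
    }

def cft_gap(p_original, p_counterfactual):
    """Mean absolute shift in predicted probability across counterfactual pairs."""
    return sum(abs(a - b) for a, b in zip(p_original, p_counterfactual)) / len(p_original)

# Toy data, not project results.
y_true = [0, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0, 0, 1]
groups = ["caste"] * 4 + ["gender"] * 4
report = bias_report(y_true, y_pred, groups)
```

The group-level gaps and the counterfactual gap measure different things: the first two compare aggregate rates between groups, while the last compares each post against its own swapped version. A model can therefore improve on one and regress on the other.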

Baseline: L_tox

Standard toxicity classification objective.

CDA: L_tox + CAHH

Original plus identity-swapped examples.

Adversarial: L_tox - λ * L_adv

Gradient reversal with λ = 0.5.
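
The adversarial objective can be read as a set of parameter updates. The following is a sketch of the standard gradient-reversal formulation, not a statement of the project's exact implementation. The symbols θ_f (shared encoder), θ_y (toxicity head), θ_a (identity adversary head), and η (learning rate) are introduced here for illustration.

```latex
\[
\begin{aligned}
\theta_y &\leftarrow \theta_y - \eta\,\frac{\partial L_{\text{tox}}}{\partial \theta_y} \\
\theta_a &\leftarrow \theta_a - \eta\,\frac{\partial L_{\text{adv}}}{\partial \theta_a} \\
\theta_f &\leftarrow \theta_f - \eta\left(\frac{\partial L_{\text{tox}}}{\partial \theta_f}
            - \lambda\,\frac{\partial L_{\text{adv}}}{\partial \theta_f}\right),
            \qquad \lambda = 0.5
\end{aligned}
\]
```

The adversary head descends on L_adv while the encoder ascends on it, scaled by λ. A gradient-reversal layer produces this behavior by acting as the identity in the forward pass and multiplying the incoming gradient by -λ in the backward pass.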

XLM-RoBERTa-base, 128 tokens, batch 32, 5 epochs, AdamW 2e-5, seed 42

03

Held-out test set, n = 5,911

CDA produces the strongest fairness gains

Fairness Metrics

Lower is better for FPR Δ, DP Δ, and CFT Gap; higher is better for F1

Model	F1	FPR Δ	DP Δ	CFT Gap

FPR Disparity: -54% (CDA vs. baseline)

Demographic Parity: -73% (CDA vs. baseline)

Adversarial CFT Gap: +31% (worse than baseline)
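
The percentage figures are relative changes against the baseline. As a quick check, the snippet below recomputes the adversarial CFT gap change from the two values quoted in the interpretation section (0.052 for the baseline, 0.068 for the adversarial model). The helper name `relative_change` is an assumption for the example.

```python
def relative_change(baseline, treated):
    """Signed change of treated relative to baseline, as a fraction."""
    return (treated - baseline) / baseline

# CFT gap values quoted in the interpretation section of this page.
adv_cft_change = relative_change(0.052, 0.068)
print(f"{adv_cft_change:+.0%}")  # prints "+31%"
```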

[Chart: Per-Group False Positive Rate (identity audit)]

[Chart: Utility vs Bias (top-left is best)]

04

Presentation-ready interpretation

What the results mean

01

CDA is the practical winner

It improves every fairness metric with a small F1 tradeoff, making it the strongest mitigation strategy for this project.

02

Caste false positives are highest

Baseline caste FPR is 0.421, nearly 3x the gender FPR, which shows why fairness audits need to account for South Asian identity categories such as caste.

03

Adversarial debiasing masks risk

It improves group-level fairness but increases counterfactual instability: the CFT gap rises from 0.052 to 0.068, about 31% above baseline.

04

Limitations are explicit

Single dataset, lexicon-based tagging, binary labels, single-seed runs, and a small region subgroup constrain the claims.

Main PPT line: Counterfactual Data Augmentation is the more robust fairness intervention for Hinglish toxicity classification.