publications
Publications by category in reverse chronological order.
2024
- EMNLP: Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains. Krithika Ramesh, Nupoor Gandhi, Pulkit Madaan, Lisa Bauer, Charith Peris, and Anjalie Field. In Findings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024.
The difficulty of anonymizing text data hinders the development and deployment of NLP in high-stakes domains that involve private data, such as healthcare and social services. Poorly anonymized sensitive data cannot be easily shared with annotators or external researchers, nor can it be used to train public models. In this work, we explore the feasibility of using synthetic data generated from differentially private language models in place of real data to facilitate the development of NLP in these domains without compromising privacy. In contrast to prior work, we generate synthetic data for real high-stakes domains, and we propose and conduct use-inspired evaluations to assess data quality. Our results show that prior simplistic evaluations have failed to highlight utility, privacy, and fairness issues in the synthetic data. Overall, our work underscores the need for further improvements to synthetic data generation for it to be a viable way to enable privacy-preserving data sharing.
@inproceedings{ramesh-etal-2024-evaluating,
  title = {Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains},
  author = {Ramesh, Krithika and Gandhi, Nupoor and Madaan, Pulkit and Bauer, Lisa and Peris, Charith and Field, Anjalie},
  editor = {Al-Onaizan, Yaser and Bansal, Mohit and Chen, Yun-Nung},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2024},
  month = nov,
  year = {2024},
  address = {Miami, Florida, USA},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2024.findings-emnlp.894/},
  doi = {10.18653/v1/2024.findings-emnlp.894},
  pages = {15254--15269}
}
2022
- NeurIPS: A Case for Rejection in Low Resource ML Deployment. Jerome White, Pulkit Madaan, Nikhil Shenoy, Apoorv Agnihotri, Makkunda Sharma, and Jigar Doshi. In NeurIPS 2022 Workshop for Challenges in Deploying and Monitoring Machine Learning Systems, Nov 2022.
Building reliable AI decision support systems requires a robust set of data on which to train models, with respect to both quantity and diversity. Obtaining such datasets can be difficult in resource-limited settings, or for applications in early stages of deployment. Sample rejection is one way to work around this challenge; however, much of the existing work in this area is ill-suited for such scenarios. This paper substantiates that position and proposes a simple solution as a proof-of-concept baseline.
@inproceedings{https://doi.org/10.48550/arxiv.2208.06359,
  title = {A Case for Rejection in Low Resource ML Deployment},
  author = {White, Jerome and Madaan, Pulkit and Shenoy, Nikhil and Agnihotri, Apoorv and Sharma, Makkunda and Doshi, Jigar},
  booktitle = {NeurIPS 2022 Workshop for Challenges in Deploying and Monitoring Machine Learning Systems},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license},
}
2020
- LREC: Multilingual Neural Machine Translation involving Indian Languages. Pulkit Madaan and Fatiha Sadat. In Proceedings of the WILDRE5– 5th Workshop on Indian Language Data: Resources and Evaluation, May 2020.
Neural Machine Translation (NMT) models are capable of translating a single bilingual pair and require a new model for each new language pair. Multilingual NMT models are capable of translating multiple language pairs, even pairs they have not seen during training. The limited availability of parallel sentences is a known problem in machine translation. A multilingual NMT model leverages information from all the languages to improve itself and performs better. We propose a data augmentation technique that further improves this model substantially. The technique helps achieve a jump of more than 15 points in BLEU score over the multilingual NMT model. A BLEU score of 36.2 was achieved for Sindhi–English translation, which is higher than any score on the leaderboard of the LoResMT Shared Task at MT Summit 2019, which provided the data for the experiments.
@inproceedings{madaan-sadat-2020-multilingual,
  title = {Multilingual Neural Machine Translation involving Indian Languages},
  author = {Madaan, Pulkit and Sadat, Fatiha},
  booktitle = {Proceedings of the WILDRE5{--} 5th Workshop on Indian Language Data: Resources and Evaluation},
  month = may,
  year = {2020},
  address = {Marseille, France},
  publisher = {European Language Resources Association (ELRA)},
  url = {https://www.aclweb.org/anthology/2020.wildre-1.6},
  pages = {29--32},
  language = {English},
  isbn = {979-10-95546-67-2},
}
2019
- Thesis: Deep mean shift clustering. Pulkit Madaan, Abhishek Maiti, Saket Anand, and Sushil Mittal. May 2019.
We use mean shift clustering in the latent space of an auto-encoder to obtain a better representation of the data and a more structured latent space. Instead of just using the mode of the distribution calculated using kernel density estimates, we use the trajectories of data points leading to the modes to better model the basin of attraction of each mode. This helps in better structuring of the latent space and results in a more inferential model. Since mean shift can be modelled as an RNN block, our method is end-to-end trainable. Tuning the bandwidth of mean shift gives us the flexibility of clustering the latent space at different hierarchical levels. We modify the original trajectory-based LSTM model by incorporating a discounting mechanism. We also modify the mean shift implementation by using a fixed kernel for the mean shift iterations. We also apply a new loss (Support Set Loss) to penalize the clusters made on the latent space. This loss uses the trajectories of the points, segregated into groups that ended up in the same mode and those that didn’t. We have used this loss function in both semi-supervised and unsupervised fashion. Finally, we also propose a model which uses Contrastive Predictive Coding loss in the latent space as well as a regularizer for the encoding network model.
@article{madaan2019deep,
  title = {Deep mean shift clustering},
  author = {Madaan, Pulkit and Maiti, Abhishek and Anand, Saket and Mittal, Sushil},
  year = {2019},
  publisher = {IIIT-Delhi},
  url = {https://repository.iiitd.edu.in/jspui/handle/123456789/915},
  language = {English},
}