Selected Projects

IndoLib NLP Toolkit for Low-Resource South Asian Languages

Launched on September 1, 2021

Natural Language Processing
Machine Learning
Deep Learning
AI Ethics

Undertaken as a Master's thesis at Harvard University from Sep 2021 to Oct 2022, IndoLib is a toolkit that supports Natural Language Processing (NLP) research for underrepresented South Asian languages. It set new language modeling benchmarks across Indo-Aryan, Dravidian, and Sino-Tibetan languages.

A snapshot of diverse South Asian scripts

Core Contributions

  • Toolkit Development: Built IndoLib to address linguistic challenges across 31 Indic languages, with optimized named entity recognition (NER) and summarization models.
  • Benchmark Redefinition: Surpassed existing benchmarks in language modeling.
  • Text Normalization: Established a robust normalization and sampling pipeline.
  • Sanskrit-English Translation: Achieved benchmark results in Sanskrit-English machine translation; the work is under peer review for publication.

Technical Proficiency

  • Large Language Models (LLMs): Used large language models for multilingual analysis.
  • Generative AI: Explored generative models for language translation.
  • AI Ethics: Adhered to ethical guidelines in AI, ensuring respectful representation of linguistic diversity.

Academic Recognition

  • Publication Under Review: A paper on the Sanskrit-English translation results is under peer review.
  • Interdisciplinary Collaboration: Engaged with linguists and technologists to ensure accurate language representation.