Research

Open datasets, foundation models, and applied AI built for Bharat — with full transparency on what we shipped, what we’re building, and what comes next.


Datasets

Indian Legal Corpus v1

Coming soon (2026)

The largest open corpus for Indian legal AI to date. An anonymized sample will be free for research at launch.

We extracted, cleaned, and structured 76 years of Indian jurisprudence, from 1950 to 2026, across all 25 High Courts and the Supreme Court: 14.48 million documents, approximately 87 GB of clean JSONL, and approximately 29.80 billion tokens of high-quality legal text.

It is licensed under CC-BY-4.0. The anonymized public sample may be used for research, training, fine-tuning, retrieval, and embeddings. Victim identities are removed in line with Section 228A IPC, the POCSO Act, and Supreme Court anonymization guidelines. Sensitive categories such as sexual offences, POCSO, and juvenile matters are excluded from the public release. The full corpus is available via verified research access. We ask that you cite the corpus and use it responsibly.
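For readers new to JSONL, the format stores one JSON object per line, so a corpus of this size can be streamed record by record instead of being loaded whole. The sketch below is illustrative only: the record schema has not been published, so the field names `court`, `year`, and `text` and the two sample records are hypothetical placeholders, not the corpus's actual layout.

```python
import io
import json


def iter_records(lines):
    """Yield one parsed JSON object per non-empty line of a JSONL stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-record sample; field names are assumptions, since the
# real schema of the corpus has not been published.
sample = io.StringIO(
    '{"court": "Example High Court", "year": 1950, "text": "First judgment."}\n'
    '{"court": "Example High Court", "year": 2026, "text": "Second judgment."}\n'
)

records = list(iter_records(sample))
years = [record["year"] for record in records]
print(len(records), min(years), max(years))
```

The same generator should work unchanged on a real file by passing an open file handle in place of the in-memory sample.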

Why we built it: India has 1.4 billion people but no comprehensive open legal AI dataset. Without one, every Indian legal AI startup either pays expensive proprietary providers or builds inferior systems. That is not how a knowledge ecosystem grows. We chose to release this for free so that 100 startups can build, not just one.

  • Documents: 14,480,000
  • Clean text size: ≈87 GB
  • Estimated tokens: ≈29.80 billion
  • Sources: 25 High Courts + Supreme Court
  • Time coverage: 1950–2026 (76 years)
  • License: CC-BY-4.0
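As a rough sanity check, the figures listed above imply some per-document averages. The sketch below derives them; it assumes 1 GB means 10^9 bytes, which the page does not specify, and the function and constant names are ours.

```python
# Figures quoted on this page.
DOCUMENTS = 14_480_000
TOKENS = 29.80e9
CLEAN_TEXT_BYTES = 87e9  # assumption: 1 GB = 10**9 bytes


def corpus_averages(documents, tokens, size_bytes):
    """Return simple per-document and per-token averages for a corpus."""
    return {
        "tokens_per_document": tokens / documents,
        "bytes_per_token": size_bytes / tokens,
        "bytes_per_document": size_bytes / documents,
    }


averages = corpus_averages(DOCUMENTS, TOKENS, CLEAN_TEXT_BYTES)
for name, value in averages.items():
    print(f"{name}: {value:,.2f}")
```

Under that assumption, an average document comes to roughly 2,058 tokens and about 6 KB of clean text, or just under 3 bytes per token. These are averages only; individual judgments will vary widely in length.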

Foundation Models

We pre-train language models from scratch on diverse Indian-language corpora, with Hinglish as a first-class citizen. Our work is intentional, not glamorous: we use consumer hardware, share what works publicly, and document what doesn't.

Current model line-up

TRISHA 137M: Pilot model. 12 layers, hidden size 768. Trained on 1.78B tokens. Architecture validation milestone.

TRISHA 373M: Training complete. 28 layers, hidden size 1024. Trained on 7.54B tokens. Best eval loss 2.4705. Languages: Hinglish, Hindi, Odia, English. Pending safety and identity fine-tuning before public release.

TRISHA 1.55B (in training): Dense foundation model. 26 layers with grouped-query attention (GQA), QK-Norm, and RoPE. 64K vocabulary, Hinglish-native with 22 Indian languages. Trained from scratch on a single home GPU. Target: 2026. A larger 14B professional model is planned next.
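One way to compare the two models with reported token counts is training tokens per parameter. The sketch below computes that ratio; it assumes the nominal sizes in the model names (137M and 373M) are the actual parameter counts, which the line-up does not confirm.

```python
# (name, parameters, training tokens) as quoted in the line-up.
# Assumption: parameter counts are taken at face value from the model names.
MODELS = [
    ("TRISHA 137M", 137e6, 1.78e9),
    ("TRISHA 373M", 373e6, 7.54e9),
]


def tokens_per_parameter(parameters, tokens):
    """Ratio of training tokens to model parameters."""
    return tokens / parameters


ratios = {name: tokens_per_parameter(p, t) for name, p, t in MODELS}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} training tokens per parameter")
```

On these numbers the pilot saw about 13 tokens per parameter and the 373M model about 20. The 1.55B model is left out because no token count is given for it.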

More on our model design choices, training infrastructure, and lessons learned will follow in upcoming research notes.


What’s Next

Now

Indian Legal Corpus v1 release. Vakil GPT private beta with select legal practitioners. Anonymized sample corpus for the research community.

Next

TRISHA 1.55B foundation model (in training, 2026). Vector embeddings service for the legal corpus. Vakil GPT public launch.

Future (2027+)

Multilingual Indic foundation model suite (Hindi, Bengali, Tamil, Telugu, Marathi, and more). Domain-specific datasets in healthcare, education, agriculture. Open research collaborations with Indian academic institutions.


Cite the Work

@dataset{trisha_legal_corpus_2026,
  title   = {Indian Legal Corpus v1},
  author  = {TRISHA Vision Team},
  year    = {2026},
  url     = {https://huggingface.co/datasets/trishavision-ai/indian-legal-corpus-v1},
  license = {CC-BY-4.0}
}

Licensing for commercial enrichments

The anonymized sample corpus is free. We also offer paid enrichments: pre-tokenized versions, vector embeddings, citation-resolved subsets, and live-update subscriptions. Reach out at legal@trishavision.com.
