Research

Open datasets, foundation models, and applied AI built for Bharat — with full transparency on what we shipped, what we’re building, and what comes next.


Datasets

Indian Legal Corpus v1

Launching May 2026 on Hugging Face

To our knowledge, the largest open corpus of Indian legal text for AI to date. Free for the community at launch.

We extracted, cleaned, and structured 76 years of Indian jurisprudence, from 1950 to 2026, covering all 25 High Courts and the Supreme Court. The result is 15.2 million documents, 87 GB of clean JSONL, and approximately 20 billion tokens of high-quality legal text.

It is licensed under CC-BY-4.0. You can use it for research, commercial training, fine-tuning, retrieval, embeddings, or anything else. The license requires attribution, so please cite the corpus when you use it.
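Because the corpus is described as JSONL (one JSON object per line), it can be read one record at a time without loading the full 87 GB into memory. The following is a minimal sketch using only the Python standard library. The function names are illustrative, and the record schema and file layout of the release are not assumed here, since neither has been published yet.

```python
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def count_records(path: str) -> int:
    """Count records lazily, holding only one line in memory at a time."""
    return sum(1 for _ in iter_records(path))
```

Calling `count_records` on a downloaded file would return the number of documents in that file. Any field access inside a record (for example the judgment text or the court name) depends on the schema published with the release.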

Why we built it: India has 1.4 billion people but no comprehensive open legal AI dataset. Without one, every Indian legal AI startup either pays expensive proprietary providers or builds inferior systems. That is not how a knowledge ecosystem grows. We chose to release this for free so that 100 startups can build, not just one.

  • Documents: 15,200,000
  • Clean text size: 87 GB
  • Estimated tokens: ≈20 billion
  • Sources: 25 High Courts + Supreme Court
  • Time coverage: 1950–2026 (76 years)
  • License: CC-BY-4.0
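For a sense of scale, the headline figures above imply some per-document averages. The short calculation below is only back-of-the-envelope arithmetic on the published numbers. It assumes decimal gigabytes, and it treats the 87 GB as if it were all text, although JSONL files also carry some structural overhead.

```python
# Headline figures quoted for Indian Legal Corpus v1.
DOCUMENTS = 15_200_000
CLEAN_TEXT_BYTES = 87 * 10**9   # assumption: decimal gigabytes
ESTIMATED_TOKENS = 20 * 10**9   # stated as approximate

# Implied averages.
tokens_per_document = ESTIMATED_TOKENS / DOCUMENTS
bytes_per_document = CLEAN_TEXT_BYTES / DOCUMENTS
bytes_per_token = CLEAN_TEXT_BYTES / ESTIMATED_TOKENS

print(f"{tokens_per_document:,.0f} tokens per document")
print(f"{bytes_per_document / 1000:.1f} KB per document")
print(f"{bytes_per_token:.2f} bytes per token")
```

This works out to roughly 1,300 tokens and 5.7 KB per document on average, at about 4.35 bytes per token. Individual judgments will vary widely around these averages.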

Foundation Models

We pre-train language models from scratch on diverse Indian language corpora, with Hinglish as a first-class citizen. Our work is deliberate rather than glamorous: we train on consumer hardware, share what works publicly, and document what doesn’t.

Current model line-up

TRISHA 137M — Pilot model. 12 layers, 768d. 1.78B tokens trained. Architecture validation milestone.

TRISHA 373M — Training complete. 28 layers, 1024d. 7.54B tokens. Best eval loss 2.4705. Hinglish, Hindi, Odia, English. Pending safety and identity fine-tuning before public release.

TRISHA 2.7B (in development) — Vision-language MoE model. 32 routed experts + 1 shared. SigLIP2 visual connector. Cloud training planned. Target release: Q3 2026.
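The figures quoted for the two trained models can be put in more familiar terms with a little arithmetic. The sketch below rests on two assumptions that the text does not confirm: that the eval loss is a mean per-token cross-entropy in nats (so perplexity is its exponential), and that "137M" and "373M" are total parameter counts.

```python
import math

# Assumption: eval loss is mean cross-entropy per token, measured in nats.
best_eval_loss = 2.4705
perplexity = math.exp(best_eval_loss)

# Assumption: the model names give total parameter counts.
# Each entry maps a model name to (parameters, training tokens).
models = {
    "TRISHA 137M": (137e6, 1.78e9),
    "TRISHA 373M": (373e6, 7.54e9),
}

tokens_per_parameter = {
    name: tokens / params for name, (params, tokens) in models.items()
}

print(f"Perplexity at best eval loss: {perplexity:.2f}")
for name, ratio in tokens_per_parameter.items():
    print(f"{name}: {ratio:.1f} training tokens per parameter")
```

Under those assumptions, the 373M model's best eval loss corresponds to a perplexity of about 11.8, and the two models saw roughly 13 and 20 training tokens per parameter respectively.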

We will cover our model design choices, training infrastructure, and lessons learned in upcoming research notes.


What’s Next

Now (May–Jun 2026)

Indian Legal Corpus v1 release. Vakil GPT private beta with select legal practitioners. Pre-tokenized corpus for the community.

Next (Q3 2026)

TRISHA 2.7B multimodal model. Vector embeddings service for the legal corpus. Vakil GPT public launch.

Future (2027+)

Multilingual Indic foundation model suite (Hindi, Bengali, Tamil, Telugu, Marathi, and more). Domain-specific datasets in healthcare, education, agriculture. Open research collaborations with Indian academic institutions.


Cite the Work

@dataset{trisha_legal_corpus_2026,
  title  = {Indian Legal Corpus v1},
  author = {TRISHA Vision Team},
  year   = {2026},
  url    = {https://huggingface.co/datasets/trishavision-ai/indian-legal-corpus-v1},
  license = {CC-BY-4.0}
}

Licensing for commercial enrichments

The base corpus is free. We also offer paid enrichments: pre-tokenized versions, vector embeddings, citation-resolved subsets, and live-update subscriptions. Reach out at legal@trishavision.com.
