AI Systems · In Development

Project Corpus

Harvesting the entirety of human literature and vectorizing it into usable, queryable knowledge.

Overview

Corpus is a large-scale literature ingestion and vectorization platform targeting 50,000–100,000 biomedical and scientific papers. The pipeline combines programmatic retrieval, AI-driven semantic interpretation, and dense vector embeddings to produce a retrieval-augmented knowledge base that Cadence and other systems can query with biological and conceptual precision.
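The three stages named above (retrieval, interpretation, embedding) can be pictured as a simple data flow. The sketch below is illustrative only and is not Corpus's actual code: Paper, KnowledgeRecord, run_pipeline, toy_interpret, and toy_embed are hypothetical names, and the toy functions stand in for the real AI interpretation layer and embedding model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Paper:
    """A retrieved article (hypothetical shape; real records carry more fields)."""
    paper_id: str
    abstract: str


@dataclass
class KnowledgeRecord:
    """What the knowledge base would store for each paper."""
    paper_id: str
    insight: str
    vector: List[float]


def run_pipeline(
    papers: Iterable[Paper],
    interpret: Callable[[Paper], str],
    embed: Callable[[str], List[float]],
) -> List[KnowledgeRecord]:
    """Interpret each retrieved paper, then embed the extracted insight."""
    records = []
    for paper in papers:
        insight = interpret(paper)
        records.append(KnowledgeRecord(paper.paper_id, insight, embed(insight)))
    return records


def toy_interpret(paper: Paper) -> str:
    # Placeholder for the AI interpretation layer: keep the first sentence.
    return paper.abstract.split(".")[0].strip()


def toy_embed(text: str, dim: int = 8) -> List[float]:
    # Placeholder for a real embedding model: hash each word into a bucket.
    vector = [0.0] * dim
    for word in text.lower().split():
        vector[sum(ord(ch) for ch in word) % dim] += 1.0
    return vector


papers = [
    Paper("p1", "Sleep loss impairs memory consolidation. Effects persist for days."),
    Paper("p2", "Caloric restriction extends lifespan in mice. Mechanisms remain debated."),
]
records = run_pipeline(papers, toy_interpret, toy_embed)
```

Passing the interpretation and embedding steps in as functions keeps the flow independent of any particular model, so either stage could be swapped without touching the loop.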

Key Capabilities

  • PubMed MCP connector for programmatic article retrieval at scale
  • AI interpretation layer — extracts structured insight, not just text
  • Dense vector embeddings indexed for semantic similarity retrieval
  • Targeted ingestion: neuroscience, genomics, longevity, AI architecture
  • Self-modeling loop — Cadence reads its own architecture literature
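To make the semantic similarity retrieval capability concrete, here is a minimal sketch of ranking stored vectors against a query vector by cosine similarity. It assumes a small in-memory index and hand-written three-dimensional vectors; cosine_similarity, top_k, and the paper identifiers are hypothetical, and a system at the stated scale would more likely use a dedicated vector index than a linear scan.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: List[float], index: Dict[str, List[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k stored items most similar to the query, best first."""
    scored = [(paper_id, cosine_similarity(query, vec)) for paper_id, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy index: the vectors are made up for illustration, not real embeddings.
index = {
    "paper-a": [1.0, 0.0, 0.0],
    "paper-b": [0.9, 0.1, 0.0],
    "paper-c": [0.0, 1.0, 0.0],
}
results = top_k([1.0, 0.0, 0.0], index, k=2)
```

Here results lists paper-a first and paper-b second, because their vectors point in nearly the same direction as the query, while paper-c is orthogonal to it and is dropped.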

Strategic Motivation

The limiting factor in AI-assisted research is not compute — it is the absence of deep, domain-specific knowledge grounded in primary literature. Corpus closes that gap.