Overview

Almanac's data pipeline transforms external data into searchable knowledge through three distinct phases:

External APIs → Syncing → Vector Indexing + Graph Indexing → Search

The Three Phases

1. Syncing (Data Collection)

What: Fetches raw data from MCP servers and stores it in MongoDB

Input: MCP server tools (API calls)

Output: Normalized records in MongoDB

Duration: Depends on data volume (minutes to hours for large datasets)

Learn More →

2. Vector Indexing (Semantic Search)

What: Creates embeddings for semantic similarity search

Input: MongoDB records

Output: Vector embeddings in Qdrant

Duration: ~0.2-0.5s per document

Learn More →

3. Graph Indexing (Knowledge Graph)

What: Extracts entities and relationships for graph traversal

Input: MongoDB records

Output: Knowledge graph in Memgraph

Duration: ~1-3s per document (includes LLM extraction)

Learn More →

Why Three Phases?

Each phase serves a different purpose:

| Phase | Purpose | Storage | Speed |
| --- | --- | --- | --- |
| Syncing | Raw data collection | MongoDB | Variable |
| Vector Index | Semantic similarity | Qdrant | Fast |
| Graph Index | Entity/relationship discovery | Memgraph | Slower |

Separation Benefits

  • Flexibility: Re-index without re-syncing

  • Efficiency: Only sync what changed

  • Reliability: Each phase can fail independently

Complete Workflow

Here's what happens when you connect a new data source:

Step 1: Configuration

The config defines:

  • Which tools to call (list_channels, get_messages)

  • How to transform the data (field mappings)

  • What record types to create (channel, message)
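As an illustrative sketch only (the field names below are assumptions for this example, not Almanac's actual config schema), a data-source config covering those three concerns might look like:

```python
# Hypothetical shape of a data-source config. Field names are
# illustrative assumptions, not Almanac's actual schema.
slack_config = {
    "tools": ["list_channels", "get_messages"],   # which MCP tools to call
    "record_types": {
        "channel": {                              # how to map tool output
            "tool": "list_channels",
            "fields": {"id": "channel.id", "name": "channel.name"},
        },
        "message": {
            "tool": "get_messages",
            "fields": {"id": "ts", "text": "text", "channel_id": "channel"},
        },
    },
}

# The three concerns from the list above, read back out of the config:
tools = slack_config["tools"]
record_types = list(slack_config["record_types"])
```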

Step 2: Syncing

Example:
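A minimal sketch of the sync phase, with the MCP call stubbed out (`call_mcp_tool` and the record fields are illustrative, not Almanac's actual API):

```python
# Minimal sketch of the sync phase. `call_mcp_tool` stands in for a real
# MCP client call; record fields are illustrative, not Almanac's schema.

def call_mcp_tool(tool, **params):
    """Stub: pretend to hit an external API via an MCP server."""
    return [{"ts": "1718000000.000100", "text": "Ship it!", "channel": "C123"}]

def normalize(raw):
    """Map a raw API payload onto a uniform record shape."""
    return {
        "record_type": "message",
        "source_id": raw["ts"],
        "content": raw["text"],
        "metadata": {"channel_id": raw["channel"]},
    }

def sync(store):
    for raw in call_mcp_tool("get_messages", channel="C123"):
        record = normalize(raw)
        store[record["source_id"]] = record   # upsert into MongoDB in reality
    return store

mongo = sync({})
```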

Step 3: Vector Indexing

Example:
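A toy, self-contained sketch of the vector-indexing phase. A real deployment would call an embedding model and upsert into Qdrant; here a deterministic hash-based "embedding" keeps the example runnable:

```python
import hashlib
import math

def embed(text, dim=8):
    """Toy stand-in for an embedding model (deterministic, not semantic)."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:dim]]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Index: record id -> embedding (stands in for a Qdrant collection)
index = {rid: embed(text) for rid, text in {
    "msg-1": "Ship it!",
    "msg-2": "Deploy to production",
}.items()}

# Query time: nearest neighbour by cosine similarity
query = embed("Ship it!")
best = max(index, key=lambda rid: cosine(query, index[rid]))
```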

Step 4: Graph Indexing

Example:
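A sketch of the graph-indexing phase with the LLM call stubbed out (that call is where the ~1-3s per document goes). Node and edge shapes here are illustrative, not Memgraph's actual schema:

```python
def extract_with_llm(text):
    """Stub for the LLM call that pulls entities and relationships.
    A real implementation would prompt a model with `text`."""
    return {
        "entities": [("Alice", "Person"), ("Almanac", "Project")],
        "relations": [("Alice", "WORKS_ON", "Almanac")],
    }

def index_graph(record, nodes, edges):
    result = extract_with_llm(record["content"])
    for name, label in result["entities"]:
        nodes[name] = label                    # MERGE node in Memgraph
    for src, rel, dst in result["relations"]:
        edges.append((src, rel, dst))          # MERGE relationship
    return nodes, edges

nodes, edges = index_graph({"content": "Alice works on Almanac"}, {}, [])
```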

Step 5: Ready to Query!
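Once both indexes exist, a query can combine vector recall (Qdrant) with graph expansion (Memgraph). Purely as a sketch, with hypothetical function names rather than Almanac's actual API:

```python
# Illustrative hybrid query: semantic recall seeds the search, then the
# knowledge graph expands it by a fixed number of hops.

def hybrid_query(question, vector_search, graph_neighbors, hops=1):
    seeds = vector_search(question, top_k=3)           # semantic recall
    context = set(seeds)
    for _ in range(hops):                              # graph expansion
        context |= {n for s in list(context) for n in graph_neighbors(s)}
    return sorted(context)

results = hybrid_query(
    "who works on Almanac?",
    vector_search=lambda q, top_k: ["Alice"],
    graph_neighbors=lambda n: ["Almanac"] if n == "Alice" else [],
)
```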

Storage Architecture

Almanac uses multiple databases, each optimized for its purpose:

MongoDB (Document Store)

Stores: Raw records from external APIs

Why:

  • Flexible schema (different data sources have different structures)

  • Fast writes (optimized for syncing)

  • Source of truth for all data

Example:
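An illustrative record shape (field names are assumptions for this sketch, not the actual Almanac schema). The point of a flexible-schema store is that each source can carry its own fields:

```python
# Two records from different sources with different shapes --
# field names are illustrative assumptions, not Almanac's schema.
message_record = {
    "_id": "slack:C123:1718000000.000100",
    "record_type": "message",
    "source": "slack",
    "content": "Ship it!",
    "metadata": {"channel_id": "C123", "author": "alice"},
    "synced_at": "2024-06-10T08:00:00Z",
}

github_record = {
    "_id": "github:acme/almanac:42",
    "record_type": "issue",          # different source, different fields
    "source": "github",
    "content": "Sync fails on rate limit",
    "metadata": {"repo": "acme/almanac", "state": "open"},
}
```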

Qdrant (Vector Store)

Stores: Embeddings for semantic search

Why:

  • Optimized for vector similarity search

  • Fast nearest-neighbor queries

  • Scales to billions of vectors

Example:
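An illustrative shape of a vector-store entry (a Qdrant "point"): an id, the embedding vector, and a payload linking back to the MongoDB record. Dimensions and payload fields are assumptions for this sketch:

```python
# Illustrative Qdrant point -- real embeddings have hundreds to
# thousands of dimensions; four are shown here for readability.
point = {
    "id": "slack:C123:1718000000.000100",
    "vector": [0.12, -0.48, 0.33, 0.91],
    "payload": {
        "record_type": "message",
        # join key back to MongoDB, the source of truth
        "mongo_id": "slack:C123:1718000000.000100",
    },
}
```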

Memgraph (Graph Store)

Stores: Entities, relationships, and graph structure

Why:

  • Fast graph traversal (1-hop, 2-hop, etc.)

  • Complex relationship queries

  • Optimized for connected data

Example:
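A tiny in-memory stand-in for the kind of traversal Memgraph is built for: breadth-first expansion out to a fixed number of hops (entity names here are illustrative):

```python
from collections import deque

# Toy adjacency list standing in for the knowledge graph.
graph = {
    "Alice":   ["Almanac"],
    "Almanac": ["Alice", "MongoDB", "Qdrant"],
    "MongoDB": ["Almanac"],
    "Qdrant":  ["Almanac"],
}

def neighbors_within(graph, start, hops):
    """Return every node reachable from `start` in at most `hops` hops."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue                       # don't expand past the hop limit
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    seen.discard(start)
    return sorted(seen)

one_hop = neighbors_within(graph, "Alice", 1)
two_hop = neighbors_within(graph, "Alice", 2)
```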

Redis (Cache)

Stores: Temporary data, sessions, job queues

Why:

  • Ultra-fast in-memory access

  • Perfect for caching and coordination

  • Handles high throughput
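As a toy stand-in for the Redis-style caching described above (in production this would be a Redis client setting keys with a TTL; here a plain dict keeps the sketch self-contained):

```python
import time

# In-memory cache with per-key expiry, mimicking TTL-based caching.
cache = {}

def cache_set(key, value, ttl_seconds):
    cache[key] = (value, time.monotonic() + ttl_seconds)

def cache_get(key):
    if key not in cache:
        return None
    value, expires = cache[key]
    if time.monotonic() >= expires:
        del cache[key]                    # expired: evict and miss
        return None
    return value

cache_set("sync:slack:status", "running", ttl_seconds=60)
```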

Data Flow Diagram
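Sketched from the phases and stores described above:

```
External APIs (MCP servers)
        │  sync
        ▼
     MongoDB  (source of truth)
      │            │
      │ vector     │ graph
      │ indexing   │ indexing
      ▼            ▼
    Qdrant      Memgraph
      │            │
      └─────┬──────┘
            ▼
          Search

(Redis caches and coordinates throughout)
```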

Performance Characteristics

Syncing

  • Throughput: Limited by external API rate limits

  • Parallelization: 32 concurrent requests by default

  • Bottleneck: External API speed

Vector Indexing

  • Throughput: ~100-200 docs/sec (depends on embedding model)

  • Parallelization: 32 concurrent operations by default

  • Bottleneck: Embedding generation (API or local)
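A rough sanity check on these numbers, assuming steady-state throughput ≈ concurrency / per-document latency and the ~0.2-0.5s per-document figure quoted earlier:

```python
# Back-of-envelope: 32 concurrent operations over 0.2-0.5 s per document.
concurrency = 32
fast = concurrency / 0.2   # optimistic end: 160 docs/sec
slow = concurrency / 0.5   # pessimistic end: 64 docs/sec
```

The result brackets the ~100-200 docs/sec figure above, which is why embedding latency, not concurrency, is the lever that matters here.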

Graph Indexing

  • Throughput: ~5-20 docs/sec (includes LLM extraction)

  • Parallelization: 32 concurrent operations by default

  • Bottleneck: LLM extraction latency

Query Time

  • Naive mode: 50-100ms

  • Local/Global mode: 100-300ms

  • Hybrid mode: 200-400ms

  • Mix mode: 300-600ms (includes reranking)

Monitoring the Pipeline

Via CLI

Via UI

  1. Navigate to Dashboard

  2. See real-time progress:

    • Records synced

    • Documents indexed

    • Entities extracted

    • Relationships found

Via Logs

Common Operations

Re-sync Everything

Re-index Vectors Only

Re-index Graph Only

Incremental Sync
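The "only sync what changed" idea can be sketched as filtering on a last-sync checkpoint (illustrative only; field names are assumptions, not Almanac's actual implementation):

```python
# Illustrative incremental sync: skip records unchanged since the last
# successful run, and advance the checkpoint for the next run.

def incremental_sync(fetch_all, store, last_synced_at):
    newest = last_synced_at
    for raw in fetch_all():
        if raw["updated_at"] <= last_synced_at:
            continue                      # unchanged since last run: skip
        store[raw["id"]] = raw            # upsert only what changed
        newest = max(newest, raw["updated_at"])
    return newest                         # checkpoint for the next run

records = [
    {"id": "a", "updated_at": 100},
    {"id": "b", "updated_at": 250},
]
store = {}
checkpoint = incremental_sync(lambda: records, store, last_synced_at=200)
```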

Error Handling

Each phase handles errors independently:

Sync Errors

Index Errors

Failed operations are logged and can be retried without affecting successful operations.
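That retry behavior can be sketched as per-item error isolation (a sketch of the pattern, not Almanac's actual code):

```python
# Per-item error isolation: one failing document does not abort the
# batch, and failures are collected for a later retry pass.

def index_batch(docs, index_one):
    succeeded, failed = [], []
    for doc in docs:
        try:
            index_one(doc)
            succeeded.append(doc)
        except Exception as err:
            failed.append((doc, str(err)))   # logged, retried later
    return succeeded, failed

def flaky(doc):
    if doc == "bad":
        raise ValueError("boom")

ok, bad = index_batch(["d1", "bad", "d2"], flaky)
```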

Next Steps
