Data Syncing

How Almanac fetches and synchronizes data from MCP servers.

Overview

Syncing is the first phase of the indexing pipeline, responsible for fetching raw data from MCP servers and storing it in MongoDB for later processing.

MCP Server → Fetch Data → MongoDB → Ready for Indexing

Sync Types

1. Initial Sync

First-time sync of all available data.

// Trigger via API
POST /api/sync
{
  "dataSource": "slack",
  "mode": "full"
}

Process:

  1. Connect to MCP server

  2. Discover available tools/resources

  3. Fetch all historical data

  4. Store in MongoDB with metadata

  5. Mark records as "pending indexing"

Duration: Depends on data volume

  • Small (< 10K records): 5-15 minutes

  • Medium (10K - 100K records): 30-60 minutes

  • Large (> 100K records): several hours

2. Incremental Sync

Fetch only new/changed records since last sync.

Process:

  1. Check last sync timestamp

  2. Fetch only new records

  3. Update changed records

  4. Delete removed records

  5. Mark as "pending indexing"

Duration: Much faster than a full sync (seconds to minutes)
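
The incremental steps above can be sketched as follows. This is a minimal illustration, not Almanac's actual implementation: the in-memory `store` dict stands in for MongoDB, and the record fields (`id`, `updated_at`, `deleted`, `body`) are assumed names.

```python
from datetime import datetime, timezone


def incremental_sync(store, remote_records, last_sync):
    """Apply remote changes made after `last_sync` to `store` (dict keyed by id).

    Each remote record is a dict with hypothetical fields:
    "id", "updated_at" (datetime), "body", and an optional "deleted" flag.
    """
    stats = {"upserted": 0, "deleted": 0, "skipped": 0}
    for record in remote_records:
        # Steps 1-2: ignore anything not modified since the last sync.
        if record["updated_at"] <= last_sync:
            stats["skipped"] += 1
            continue
        # Step 4: drop records that were removed at the source.
        if record.get("deleted"):
            if store.pop(record["id"], None) is not None:
                stats["deleted"] += 1
            continue
        # Steps 3 and 5: insert or update, then flag for the indexing phase.
        store[record["id"]] = {
            "body": record["body"],
            "updated_at": record["updated_at"],
            "status": "pending indexing",
        }
        stats["upserted"] += 1
    return stats
```

The returned counters make it easy to log what each run changed.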

3. Scheduled Sync

Automatic syncing on a schedule.

Common Schedules:

  • Every hour: "0 * * * *"

  • Every 6 hours: "0 */6 * * *"

  • Daily at 2 AM: "0 2 * * *"

  • Weekly on Sunday: "0 0 * * 0"
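
To show how the five fields of these expressions are read (minute, hour, day of month, month, day of week), here is a small matcher. It is a simplified sketch that only handles `*`, `*/n`, and comma-separated integers, which is enough for the schedules listed above; it is not a full cron parser and does not reflect how Almanac evaluates schedules internally.

```python
from datetime import datetime


def _field_matches(field, value, minimum):
    """Match one cron field: '*', '*/n', or comma-separated integers."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            # Steps are counted from the field's minimum (0 for hours, 1 for days).
            if (value - minimum) % int(part[2:]) == 0:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expression, moment):
    """Return True if `moment` falls on the schedule described by `expression`."""
    minute, hour, day, month, weekday = expression.split()
    # Cron counts Sunday as 0; isoweekday() counts Sunday as 7.
    cron_weekday = moment.isoweekday() % 7
    return (
        _field_matches(minute, moment.minute, 0)
        and _field_matches(hour, moment.hour, 0)
        and _field_matches(day, moment.day, 1)
        and _field_matches(month, moment.month, 1)
        and _field_matches(weekday, cron_weekday, 0)
    )
```

For example, `cron_matches("0 */6 * * *", datetime(2024, 1, 1, 12, 0))` is true, while the same expression at 13:00 is not.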

Data Sources

Slack

Tools Used:

  • list_channels - Get all channels

  • get_channel_history - Fetch messages

  • get_thread_replies - Fetch thread replies

  • get_users - Fetch user info

Sync Strategy:

Data Stored:

  • Messages

  • Thread replies

  • Channel metadata

  • User profiles

  • Reactions

GitHub

Tools Used:

  • list_repos - Get repositories

  • get_issues - Fetch issues

  • get_pull_requests - Fetch PRs

  • get_commits - Fetch commits

  • get_readme - Fetch documentation

Sync Strategy:

Data Stored:

  • Issues

  • Pull requests

  • Commits

  • Comments

  • README files

  • Code files

Notion

Tools Used:

  • search - Search all content

  • get_page - Fetch page content

  • get_database - Fetch database

  • get_blocks - Fetch page blocks

Sync Strategy:

Data Stored:

  • Pages

  • Databases

  • Blocks (text content)

  • Properties

  • Relations

Sync Configuration

Via UI

  1. Navigate to Data Sources

  2. Select data source

  3. Click "Sync Settings"

  4. Configure options:

    • Sync mode (full/incremental)

    • Schedule (manual/automatic)

    • Filters (channels, repos, etc.)

  5. Save and trigger sync

Via API
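
A sync can be triggered programmatically with the same `POST /api/sync` request shown under Initial Sync. The sketch below only builds the request; the base URL is a placeholder, and `"incremental"` as a `mode` value is an assumption taken from the UI options rather than a confirmed API value.

```python
import json
import urllib.request


def build_sync_request(base_url, data_source, mode="incremental"):
    """Build (but do not send) a POST /api/sync request."""
    if mode not in ("full", "incremental"):
        raise ValueError(f"unsupported sync mode: {mode!r}")
    payload = {"dataSource": data_source, "mode": mode}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/sync",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it; any authentication headers your deployment requires would need to be added first.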

Sync Process Details

1. Pre-Sync Validation

Before starting:

2. Data Fetching

Fetch with pagination and rate limiting:
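
One common way to combine the two is cursor pagination with a minimum delay between requests. The sketch below assumes a hypothetical `fetch_page(cursor)` callable that returns `(records, next_cursor)`; the actual MCP tool signatures may differ.

```python
import time


def fetch_all_pages(fetch_page, min_interval=0.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Yield every record from a cursor-paginated source.

    `fetch_page(cursor)` must return (records, next_cursor), with
    next_cursor set to None on the last page. At least `min_interval`
    seconds are left between the starts of consecutive requests.
    """
    cursor = None
    last_call = None
    while True:
        if last_call is not None:
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        records, cursor = fetch_page(cursor)
        yield from records
        if cursor is None:
            break
```

Because the function is a generator, records can be stored as they arrive instead of being held in memory until the last page.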

3. Data Storage

Store in MongoDB with metadata:
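
The stored document might wrap the raw payload with sync metadata along these lines. All field names here (`dataSource`, `sourceId`, `syncedAt`, `indexingStatus`) are illustrative assumptions, not Almanac's actual schema.

```python
from datetime import datetime, timezone


def to_storage_document(data_source, record_id, payload, synced_at=None):
    """Wrap a raw record with the metadata needed by the indexing phase."""
    synced_at = synced_at or datetime.now(timezone.utc)
    return {
        # A deterministic _id makes re-syncing the same record an overwrite
        # rather than a duplicate insert.
        "_id": f"{data_source}:{record_id}",
        "dataSource": data_source,
        "sourceId": record_id,
        "payload": payload,
        "syncedAt": synced_at,
        "indexingStatus": "pending",
    }
```

With pymongo, such a document could be written idempotently using `collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)`.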

4. Batch Processing

Process in batches for efficiency:
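
A generic batching helper, shown here as a sketch, splits any stream of records into fixed-size lists so that each database write covers many records. The batch size is yours to choose; nothing in this document specifies the value Almanac uses.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield successive lists of at most `batch_size` records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded batch could then be passed to a bulk write such as pymongo's `insert_many`.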

5. Error Handling

Handle failures gracefully:
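
One widely used pattern for transient failures is retrying with exponential backoff. The sketch below is an assumption about what graceful handling could look like, not a description of Almanac's retry policy; `TransientSyncError` is a hypothetical exception type.

```python
import time


class TransientSyncError(Exception):
    """Hypothetical error for failures worth retrying (timeouts, 5xx responses)."""


def with_retries(operation, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `operation`, retrying transient failures with doubling delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientSyncError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

Other exception types propagate immediately, so permanent errors such as invalid credentials are not retried.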

Performance Optimization

1. Parallel Fetching

Fetch multiple resources in parallel:
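
Since fetching is I/O-bound, a thread pool is one simple way to do this. In the sketch below, `fetch_one` is a hypothetical callable that fetches a single resource (for example, one channel's history), and `max_workers` should stay low enough to respect the source's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_in_parallel(fetch_one, resource_ids, max_workers=4):
    """Fetch several resources concurrently; returns {resource_id: result}."""
    ids = list(resource_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with ids.
        return dict(zip(ids, pool.map(fetch_one, ids)))
```

If any single fetch raises, the exception surfaces when the results are collected, so callers may want to wrap `fetch_one` with their own error handling.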

2. Rate Limit Management

Stay within API limits:
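
A token bucket is a common way to cap request rates while still allowing short bursts. This sketch is illustrative only; the rate and capacity values must come from the limits published by each upstream API, and the document does not state which algorithm Almanac uses.

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second, capacity, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.updated = clock()

    def try_acquire(self):
        """Take one token if available; return whether the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller would check `try_acquire()` before each request and wait briefly when it returns `False`.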

3. Incremental Sync Optimization

Only fetch what changed:
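
One way to do this is to keep a separate high-water mark per resource (channel, repository, page), so each is fetched from the point it last reached. The sketch assumes a hypothetical `fetch_since(resource_id, since)` callable and records carrying an `updated_at` value; neither name comes from Almanac itself.

```python
def fetch_changes(fetch_since, watermarks, resource_id):
    """Fetch records newer than the stored watermark and advance it.

    `watermarks` maps resource ids to the newest `updated_at` seen so far;
    a missing entry means the resource has never been synced.
    """
    since = watermarks.get(resource_id)
    records = fetch_since(resource_id, since)
    if records:
        # Only advance when something came back, so an empty response
        # never moves the watermark.
        watermarks[resource_id] = max(r["updated_at"] for r in records)
    return records
```

Persisting `watermarks` only after the fetched records are safely stored avoids skipping data if a run fails midway.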

4. Caching

Cache frequently accessed data:
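
User profiles are a typical candidate, since the same author appears on many messages. The sketch below wraps a hypothetical `fetch_user` callable with Python's `functools.lru_cache`; what Almanac actually caches, and for how long, is not specified here.

```python
from functools import lru_cache


def make_cached_user_lookup(fetch_user, maxsize=1024):
    """Return a lookup function that remembers up to `maxsize` users."""

    @lru_cache(maxsize=maxsize)
    def lookup(user_id):
        return fetch_user(user_id)

    return lookup
```

Creating a new lookup at the start of each sync run keeps cached profiles from going stale across runs.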

Monitoring

Sync Status

Track sync progress:
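
A progress tracker might look like the sketch below. The field names and the `"running"`/`"completed"` states are assumptions for illustration and do not describe the status format Almanac exposes.

```python
from dataclasses import dataclass


@dataclass
class SyncStatus:
    """Running totals for one sync job."""

    data_source: str
    total: int
    processed: int = 0
    failed: int = 0
    state: str = "running"

    def record(self, ok=True):
        """Count one processed record, noting whether it failed."""
        self.processed += 1
        if not ok:
            self.failed += 1
        if self.processed >= self.total:
            self.state = "completed"

    @property
    def percent_complete(self):
        if self.total == 0:
            return 100.0
        return round(100.0 * self.processed / self.total, 1)
```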

Metrics

Monitor sync performance:

Alerts

Set up alerts for issues:

Troubleshooting

Sync Stuck

Rate Limit Exceeded

Duplicate Records

Missing Records

Best Practices

1. Start with Incremental

2. Filter Unnecessary Data

3. Monitor Sync Health

4. Handle Failures Gracefully
