Architecture Overview

DataBeak is built as a Model Context Protocol (MCP) server that provides AI assistants with comprehensive CSV data manipulation capabilities. This document explains the technical architecture and design decisions.

Technology Stack

  • Framework: FastMCP 2.11.3+ (Model Context Protocol)
  • Data Processing: Pandas 2.2.3+, NumPy 2.1.3+
  • Package Manager: uv (ultra-fast Python package management)
  • Build System: Hatchling
  • Code Quality: Ruff (linting and formatting), MyPy (type checking)
  • Configuration: Pydantic Settings for environment management

Core Components

src/databeak/
├── server.py                 # FastMCP server composition & routing
├── core/                     # Core infrastructure
│   ├── session.py              # Session management with lifecycle
│   ├── settings.py             # Environment-based configuration
│   └── json_schema_validate.py # Custom JSON schema validation
├── models/                   # Data models and type definitions
│   ├── data_models.py          # Core data types & enums
│   ├── data_session.py         # DataFrame session wrapper
│   ├── typed_dicts.py          # TypedDict definitions for type safety
│   ├── tool_responses.py       # Pydantic response models
│   ├── statistics_models.py    # Statistics response types
│   └── expression_models.py    # Secure expression types
├── servers/                  # Specialized MCP servers (server composition)
│   ├── io_server.py            # Load/export operations
│   ├── transformation_server.py # Data transformation
│   ├── statistics_server.py    # Statistical analysis
│   ├── discovery_server.py     # Data profiling & discovery
│   ├── validation_server.py    # Schema validation & quality
│   ├── column_server.py        # Column operations
│   ├── column_text_server.py   # Text manipulation
│   ├── row_operations_server.py # Row-level operations
│   └── system_server.py        # Health & system info
├── services/                 # Backend operation implementations
│   └── data_operations.py      # Data preview and transformation utilities
├── utils/                    # Utility functions
│   ├── validators.py           # Input validation
│   ├── secure_evaluator.py     # Safe expression evaluation
│   ├── logging_config.py       # Logging configuration
│   └── pydantic_validators.py  # Custom Pydantic validators
├── exceptions.py             # Custom error handling
└── _version.py              # Dynamic version loading
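
The servers/ directory reflects FastMCP's server composition pattern: server.py assembles the specialized sub-servers into a single MCP endpoint. A minimal sketch of the idea, with hypothetical export names (the exact mount() signature varies across FastMCP releases):

# Hypothetical composition sketch; io_server and stats_server are illustrative
# names, not DataBeak's actual exports.
from fastmcp import FastMCP

from databeak.servers.io_server import io_server
from databeak.servers.statistics_server import stats_server

mcp = FastMCP("DataBeak")

# Each specialized server contributes its tools to the composed endpoint.
mcp.mount(io_server)
mcp.mount(stats_server)

if __name__ == "__main__":
    mcp.run()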

Key Features

Session Management

  • Multi-session support with automatic cleanup
  • Configurable timeouts and resource limits
  • Session isolation for concurrent users
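
A simplified sketch of what timeout-based cleanup can look like; the class and field names here are illustrative, not DataBeak's actual internals:

import time
import uuid

SESSION_TIMEOUT = 3600  # seconds, mirroring DATABEAK_SESSION_TIMEOUT's default

class SessionStore:
    """Illustrative multi-session store with timeout-based cleanup."""

    def __init__(self) -> None:
        self._sessions: dict[str, dict[str, float]] = {}

    def create(self) -> str:
        session_id = str(uuid.uuid4())
        self._sessions[session_id] = {"last_used": time.time()}
        return session_id

    def touch(self, session_id: str) -> None:
        self._sessions[session_id]["last_used"] = time.time()

    def cleanup(self) -> None:
        # Sessions are isolated, so expiring one never affects concurrent users.
        now = time.time()
        for sid in [s for s, v in self._sessions.items()
                    if now - v["last_used"] > SESSION_TIMEOUT]:
            del self._sessions[sid]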

Data Operations

  • 40+ tools covering I/O, manipulation, analysis, and validation
  • Multiple format support: CSV, JSON, Excel, Parquet, HTML, Markdown
  • Streaming processing for large files (see the sketch after this list)
  • Type-safe operations with Pydantic validation
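
One natural implementation of streaming is pandas' chunked-read pattern; a minimal sketch using the documented default chunk size (DataBeak's actual code path may differ):

import pandas as pd

CHUNK_SIZE = 10_000  # mirrors DATABEAK_CHUNK_SIZE's default

def count_rows(path: str) -> int:
    """Count rows without loading the whole file into memory."""
    total = 0
    # chunksize makes read_csv yield DataFrames of at most CHUNK_SIZE rows.
    for chunk in pd.read_csv(path, chunksize=CHUNK_SIZE):
        total += len(chunk)
    return total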

Auto-Save & History

  • Automatic saving after each operation
  • Undo/redo functionality with operation tracking
  • Persistent history with JSON storage
  • Configurable strategies: overwrite, backup, versioned
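
The three save strategies could be modeled as a small enum dispatching to different write behaviors; a hypothetical sketch, not DataBeak's actual implementation:

from enum import Enum
from pathlib import Path
from shutil import copy2

class SaveStrategy(Enum):  # names mirror the documented strategies
    OVERWRITE = "overwrite"
    BACKUP = "backup"
    VERSIONED = "versioned"

def resolve_target(path: Path, strategy: SaveStrategy, version: int) -> Path:
    """Pick where the next save lands under each strategy."""
    if strategy is SaveStrategy.BACKUP:
        copy2(path, path.with_suffix(path.suffix + ".bak"))  # keep the previous copy
        return path
    if strategy is SaveStrategy.VERSIONED:
        return path.with_name(f"{path.stem}_v{version}{path.suffix}")
    return path  # OVERWRITE writes in place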

Configuration Management

  • Environment-based settings using Pydantic Settings
  • Centralized configuration in DataBeakSettings class
  • Runtime version detection via importlib.metadata

Code Quality & Architecture

  • Zero static analysis violations - Clean ruff compliance across all categories
  • Strong type safety - 100% mypy compliance with minimal Any usage
  • High test coverage - 983 unit tests + 43 integration tests with 88%+ coverage
  • Server composition pattern - Modular FastMCP servers for different domains
  • Context-based logging - MCP-integrated logging for better traceability
  • Clear API design - Keyword-only boolean parameters, no boolean traps (see the sketch after this list)
  • Defensive practices - Proper exception handling, input validation
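
The "no boolean traps" rule means booleans must be passed by keyword, which keeps call sites self-documenting. A two-line illustration (export_data is a made-up name):

def export_data(path: str, *, include_header: bool = True) -> None:
    ...  # the bare * makes include_header keyword-only

export_data("out.csv", include_header=False)  # intent is clear at the call site
# export_data("out.csv", False)  -> TypeError: the boolean trap is rejected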

Environment Variables

All configuration uses the DATABEAK_ prefix:

Variable                             Default  Purpose
DATABEAK_MAX_FILE_SIZE_MB            1024     Maximum file size (MB)
DATABEAK_SESSION_TIMEOUT             3600     Session timeout (seconds)
DATABEAK_CHUNK_SIZE                  10000    Processing chunk size
DATABEAK_MEMORY_THRESHOLD_MB         2048     Memory threshold for health monitoring (MB)
DATABEAK_MAX_VALIDATION_VIOLATIONS   1000     Maximum validation violations to report
DATABEAK_MAX_ANOMALY_SAMPLE_SIZE     10000    Maximum sample size for anomaly detection
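
With Pydantic Settings, the DATABEAK_ prefix maps each variable onto a field of the settings class. A minimal sketch consistent with the table above; the field names are illustrative:

from pydantic_settings import BaseSettings, SettingsConfigDict

class DataBeakSettings(BaseSettings):
    """Illustrative settings class; env_prefix binds DATABEAK_* variables."""

    model_config = SettingsConfigDict(env_prefix="DATABEAK_")

    max_file_size_mb: int = 1024
    session_timeout: int = 3600
    chunk_size: int = 10000
    memory_threshold_mb: int = 2048

settings = DataBeakSettings()  # e.g. DATABEAK_CHUNK_SIZE=5000 overrides chunk_size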

MCP Integration

The server implements the Model Context Protocol standard:

  • Tools: 40+ data manipulation functions
  • Resources: Session and data access
  • Prompts: Data analysis templates
  • Error Handling: Structured error responses
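
A hedged sketch of what a tool registration looks like under this standard, with Field descriptions carrying the parameter docs (preview_rows is an illustrative name, not necessarily one of DataBeak's tools):

from typing import Annotated

from fastmcp import FastMCP
from pydantic import Field

mcp = FastMCP("DataBeak")

@mcp.tool()
def preview_rows(
    session_id: Annotated[str, Field(description="Target session identifier")],
    limit: Annotated[int, Field(description="Maximum rows to return", gt=0)] = 10,
) -> dict:
    """Return the first rows of the session's data."""
    ...  # parameter docs live in Field descriptions, not an Args section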

Tool Categories

  1. I/O Operations - Load/export data in multiple formats
  2. Data Manipulation - Transform, filter, sort, and modify data
  3. Data Analysis - Statistics, correlations, outliers, profiling
  4. Data Validation - Schema validation, quality checking, anomaly detection
  5. Session Management - Stateless data processing with external context management
  6. System Tools - Health monitoring and server information

Design Principles

  1. Type Safety: Full type annotations with Pydantic validation
  2. Modularity: Clear separation of concerns across modules
  3. Performance: Streaming operations for large datasets
  4. Reliability: Comprehensive error handling and logging
  5. Usability: Simple installation and configuration
  6. Maintainability: Modern tooling and clear documentation

Quality Standards

DataBeak maintains strict code quality standards with automated enforcement:

Code Quality Metrics

  • Zero ruff violations - Perfect linting compliance across 46 rules
  • 100% MyPy compliance - Complete type safety with minimal Any usage
  • Perfect MCP documentation - Comprehensive Field descriptions, no Args sections
  • High test coverage - 983 unit tests + 43 integration tests validating all functionality
  • Clean architecture - Stateless MCP design that eliminates unnecessary complexity

Quality Enforcement Tools

  • Ruff - Comprehensive linting and formatting (46 rules enabled)
  • MyPy - Static type checking with strict configuration
  • Pre-commit hooks - Automated quality gates preventing regressions
  • Custom MCP checkers - Specialized tools for MCP documentation standards (sketched below):
      • check_docstring_args.py - Ensures no Args sections in MCP tool docstrings
      • check_mcp_field_descriptions.py - Validates comprehensive Field descriptions
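
A greatly simplified sketch of what such a checker can do with the standard ast module (the real scripts are more thorough):

import ast
import sys
from pathlib import Path

def has_args_section(path: str) -> bool:
    """Report whether any function docstring in the file contains an Args: section."""
    tree = ast.parse(Path(path).read_text(encoding="utf-8"))
    found = False
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node) or ""
            if "Args:" in doc:
                print(f"{path}:{node.lineno}: Args section in {node.name}()")
                found = True
    return found

if __name__ == "__main__":
    # List comprehension (not a generator) so every file is checked and reported.
    sys.exit(1 if any([has_args_section(p) for p in sys.argv[1:]]) else 0)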

Quality Commands

# Run all quality checks
uv run pre-commit run --all-files

# Individual checks
uv run ruff check src/ tests/            # Linting
uv run mypy src/databeak/                # Type checking
uv run pytest tests/unit/                # Unit tests
scripts/check_docstring_args.py          # MCP Args compliance
scripts/check_mcp_field_descriptions.py  # MCP Field compliance

Development Workflow

Package Management

uv sync              # Install dependencies
uv run databeak      # Run server
uv run -m pytest     # Run tests
uv run ruff check && uv run ruff format --check && uv run mypy src/databeak/ && uv run -m pytest  # Full quality gate

Version Management

  • Single source of truth: pyproject.toml
  • Automatic synchronization: uv run sync-versions
  • Dynamic loading: via importlib.metadata
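
Dynamic loading keeps the runtime version in step with pyproject.toml via the standard library; roughly what _version.py does (the fallback value is an assumption, not DataBeak's):

from importlib.metadata import PackageNotFoundError, version

try:
    __version__ = version("databeak")  # read from installed package metadata
except PackageNotFoundError:  # e.g. running from an uninstalled source checkout
    __version__ = "0.0.0+unknown"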

Quality Assurance

  • Linting: Ruff with comprehensive rule set
  • Formatting: Ruff with 100-character lines
  • Type checking: MyPy with strict configuration
  • Testing: Three-tier testing strategy
      • Unit tests (tests/unit/): Fast, isolated module testing
      • Integration tests (tests/integration/): Component interaction validation
      • E2E tests (tests/e2e/): Complete workflow verification
  • Coverage target: 80%+ with pytest-cov
  • See the Testing Guide for best practices

Future Considerations

  • Advanced transformation interface for complex operations
  • Real-time collaboration features
  • Machine learning integrations for data insights
  • Cloud storage support for remote data sources
  • Advanced visualization tools for data exploration

For implementation details and contributing guidelines, see CONTRIBUTING.md.