# Architecture Overview
DataBeak is built as a Model Context Protocol (MCP) server that provides AI assistants with comprehensive CSV data manipulation capabilities. This document explains the technical architecture and design decisions.
## Technology Stack
- Framework: FastMCP 2.11.3+ (Model Context Protocol)
- Data Processing: Pandas 2.2.3+, NumPy 2.1.3+
- Package Manager: uv (ultra-fast Python package management)
- Build System: Hatchling
- Code Quality: Ruff (linting and formatting), MyPy (type checking)
- Configuration: Pydantic Settings for environment management
## Core Components

```text
src/databeak/
├── server.py                     # FastMCP server composition & routing
├── core/                         # Core infrastructure
│   ├── session.py                # Session management with lifecycle
│   ├── settings.py               # Environment-based configuration
│   └── json_schema_validate.py   # Custom JSON schema validation
├── models/                       # Data models and type definitions
│   ├── data_models.py            # Core data types & enums
│   ├── data_session.py           # DataFrame session wrapper
│   ├── typed_dicts.py            # TypedDict definitions for type safety
│   ├── tool_responses.py         # Pydantic response models
│   ├── statistics_models.py      # Statistics response types
│   └── expression_models.py      # Secure expression types
├── servers/                      # Specialized MCP servers (server composition)
│   ├── io_server.py              # Load/export operations
│   ├── transformation_server.py  # Data transformation
│   ├── statistics_server.py      # Statistical analysis
│   ├── discovery_server.py       # Data profiling & discovery
│   ├── validation_server.py      # Schema validation & quality
│   ├── column_server.py          # Column operations
│   ├── column_text_server.py     # Text manipulation
│   ├── row_operations_server.py  # Row-level operations
│   └── system_server.py          # Health & system info
├── services/                     # Backend operation implementations
│   └── data_operations.py        # Data preview and transformation utilities
├── utils/                        # Utility functions
│   ├── validators.py             # Input validation
│   ├── secure_evaluator.py       # Safe expression evaluation
│   ├── logging_config.py         # Logging configuration
│   └── pydantic_validators.py    # Custom Pydantic validators
├── exceptions.py                 # Custom error handling
└── _version.py                   # Dynamic version loading
```
## Key Features

### Session Management
- Multi-session support with automatic cleanup
- Configurable timeouts and resource limits
- Session isolation for concurrent users
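A minimal sketch of how timeout-based cleanup over an in-memory session registry can work, assuming a monotonic-clock idle check (the real lifecycle logic lives in `core/session.py`; class and method names here are illustrative):

```python
import time
import uuid
from dataclasses import dataclass, field


# Illustrative sketch only; DataBeak's actual session lifecycle is
# implemented in src/databeak/core/session.py and may differ.
@dataclass
class Session:
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    last_access: float = field(default_factory=time.monotonic)


class SessionManager:
    def __init__(self, timeout_seconds: float = 3600.0) -> None:
        self._timeout = timeout_seconds
        self._sessions: dict[str, Session] = {}  # one isolated entry per session

    def create(self) -> Session:
        session = Session()
        self._sessions[session.session_id] = session
        return session

    def sweep(self) -> None:
        """Drop sessions idle longer than the configured timeout."""
        now = time.monotonic()
        expired = [sid for sid, s in self._sessions.items()
                   if now - s.last_access > self._timeout]
        for sid in expired:
            del self._sessions[sid]
```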
### Data Operations
- 40+ tools covering I/O, manipulation, analysis, and validation
- Multiple format support: CSV, JSON, Excel, Parquet, HTML, Markdown
- Streaming processing for large files
- Type-safe operations with Pydantic validation
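To illustrate the streaming approach, pandas can iterate over a large CSV in fixed-size chunks instead of loading the whole file; the chunk size here mirrors the `DATABEAK_CHUNK_SIZE` default documented below, and the aggregation is illustrative:

```python
import pandas as pd

CHUNK_SIZE = 10_000  # matches the documented DATABEAK_CHUNK_SIZE default

# Each iteration yields a DataFrame of at most CHUNK_SIZE rows, keeping
# memory usage bounded regardless of total file size.
row_count = 0
for chunk in pd.read_csv("large_file.csv", chunksize=CHUNK_SIZE):
    row_count += len(chunk)

print(f"processed {row_count} rows")
```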
### Auto-Save & History
- Automatic saving after each operation
- Undo/redo functionality with operation tracking
- Persistent history with JSON storage
- Configurable strategies: overwrite, backup, versioned
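Undo/redo over an operation log is typically a pair of stacks. The sketch below shows the general shape, assuming each tracked operation is a JSON-serializable dict; the persisted JSON history format is DataBeak's own and is not reproduced here:

```python
# Illustrative two-stack undo/redo over recorded operations; not
# DataBeak's actual history implementation.
class OperationHistory:
    def __init__(self) -> None:
        self._undo: list[dict] = []
        self._redo: list[dict] = []

    def record(self, operation: dict) -> None:
        self._undo.append(operation)
        self._redo.clear()  # a new operation invalidates the redo chain

    def undo(self) -> dict | None:
        if not self._undo:
            return None
        operation = self._undo.pop()
        self._redo.append(operation)
        return operation

    def redo(self) -> dict | None:
        if not self._redo:
            return None
        operation = self._redo.pop()
        self._undo.append(operation)
        return operation
```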
### Configuration Management
- Environment-based settings using Pydantic Settings
- Centralized configuration in DataBeakSettings class
- Runtime version detection via importlib.metadata
### Code Quality & Architecture
- Zero static analysis violations - Clean ruff compliance across all categories
- Strong type safety - 100% mypy compliance with minimal Any usage
- High test coverage - 983 unit tests + 43 integration tests with 88%+ coverage
- Server composition pattern - Modular FastMCP servers for different domains
- Context-based logging - MCP-integrated logging for better traceability
- Clear API design - Keyword-only boolean parameters to avoid boolean traps (see the sketch after this list)
- Defensive practices - Proper exception handling, input validation
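To illustrate the boolean-trap rule: making boolean flags keyword-only forces every call site to name them, so intent is always readable. The function below is hypothetical, not an actual DataBeak tool:

```python
# Hypothetical signature illustrating the keyword-only convention.
def export_data(session_id: str, *, include_header: bool = True) -> None: ...

export_data("abc123", include_header=False)  # intent explicit at the call site
# export_data("abc123", False)  # TypeError: positional boolean rejected
```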
## Environment Variables

All configuration uses the `DATABEAK_` prefix:
| Variable | Default | Purpose |
|---|---|---|
| `DATABEAK_MAX_FILE_SIZE_MB` | 1024 | Maximum file size limit (MB) |
| `DATABEAK_SESSION_TIMEOUT` | 3600 | Session timeout (seconds) |
| `DATABEAK_CHUNK_SIZE` | 10000 | Processing chunk size |
| `DATABEAK_MEMORY_THRESHOLD_MB` | 2048 | Memory threshold for health monitoring (MB) |
| `DATABEAK_MAX_VALIDATION_VIOLATIONS` | 1000 | Maximum validation violations to report |
| `DATABEAK_MAX_ANOMALY_SAMPLE_SIZE` | 10000 | Maximum sample size for anomaly detection |
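Taken together with the defaults above, a minimal Pydantic Settings sketch might look like this; field names and defaults mirror the documented variables, while the real `DataBeakSettings` in `core/settings.py` may define more:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class DataBeakSettings(BaseSettings):
    # Sketch only: fields mirror the documented DATABEAK_* variables;
    # the actual class in src/databeak/core/settings.py may differ.
    model_config = SettingsConfigDict(env_prefix="DATABEAK_")

    max_file_size_mb: int = 1024
    session_timeout: int = 3600
    chunk_size: int = 10000
    memory_threshold_mb: int = 2048
    max_validation_violations: int = 1000
    max_anomaly_sample_size: int = 10000


settings = DataBeakSettings()  # reads DATABEAK_* from the environment
```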
## MCP Integration
The server implements the Model Context Protocol standard:
- Tools: 40+ data manipulation functions
- Resources: Session and data access
- Prompts: Data analysis templates
- Error Handling: Structured error responses
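The server composition pattern behind `servers/` can be pictured with FastMCP's mounting mechanism. The snippet below is a schematic sketch: tool and sub-server names are illustrative, not DataBeak's actual wiring, and the exact `mount()` signature varies between FastMCP releases:

```python
from fastmcp import FastMCP

# Illustrative composition sketch, not DataBeak's actual module layout.
io_server = FastMCP("io")


@io_server.tool
def load_csv(path: str) -> dict:
    """Load a CSV file and return a summary of the new session."""
    return {"path": path, "status": "loaded"}


app = FastMCP("DataBeak")
app.mount(io_server, prefix="io")  # io tools surface as e.g. "io_load_csv"

if __name__ == "__main__":
    app.run()
```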
### Tool Categories
- I/O Operations - Load/export data in multiple formats
- Data Manipulation - Transform, filter, sort, and modify data
- Data Analysis - Statistics, correlations, outliers, profiling
- Data Validation - Schema validation, quality checking, anomaly detection
- Session Management - Stateless data processing with external context management
- System Tools - Health monitoring and server information
## Design Principles
- Type Safety: Full type annotations with Pydantic validation
- Modularity: Clear separation of concerns across modules
- Performance: Streaming operations for large datasets
- Reliability: Comprehensive error handling and logging
- Usability: Simple installation and configuration
- Maintainability: Modern tooling and clear documentation
## Quality Standards
DataBeak maintains strict code quality standards with automated enforcement:
### Code Quality Metrics
- Zero ruff violations - Perfect linting compliance across 46 rules
- 100% MyPy compliance - Complete type safety with minimal Any usage
- Perfect MCP documentation - Comprehensive Field descriptions, no Args sections
- High test coverage - 983 unit tests + 43 integration tests validating all functionality
- Clean architecture - Simplified, stateless MCP design
### Quality Enforcement Tools
- Ruff - Comprehensive linting and formatting (46 rules enabled)
- MyPy - Static type checking with strict configuration
- Pre-commit hooks - Automated quality gates preventing regressions
- Custom MCP checkers - Specialized tools for MCP documentation standards:
    - `check_docstring_args.py` - ensures no Args sections in MCP tool docstrings
    - `check_mcp_field_descriptions.py` - validates comprehensive Field descriptions
### Quality Commands

```bash
# Run all quality checks
uv run pre-commit run --all-files

# Individual checks
uv run ruff check src/ tests/             # Linting
uv run mypy src/databeak/                 # Type checking
uv run pytest tests/unit/                 # Unit tests
scripts/check_docstring_args.py           # MCP Args compliance
scripts/check_mcp_field_descriptions.py   # MCP Field compliance
```
## Development Workflow

### Package Management

```bash
uv sync           # Install dependencies
uv run databeak   # Run server
uv run -m pytest  # Run tests

uv run ruff check && uv run ruff format --check && uv run mypy src/databeak/ && uv run -m pytest
```
### Version Management

- Single source of truth: `pyproject.toml`
- Automatic synchronization: `uv run sync-versions`
- Dynamic loading: via `importlib.metadata`
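The dynamic-loading pattern in `_version.py` is commonly implemented as below; the fallback branch is an assumption, not necessarily DataBeak's exact code:

```python
from importlib.metadata import PackageNotFoundError, version

try:
    __version__ = version("databeak")  # resolved from installed package metadata
except PackageNotFoundError:
    # e.g. running from a source checkout without an installed package
    __version__ = "0.0.0+unknown"
```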
### Quality Assurance

- Linting: Ruff with a comprehensive rule set
- Formatting: Ruff with 100-character lines
- Type checking: MyPy with strict configuration
- Testing: three-tier testing strategy
    - Unit tests (`tests/unit/`): fast, isolated module testing
    - Integration tests (`tests/integration/`): component interaction validation
    - E2E tests (`tests/e2e/`): complete workflow verification
- Coverage target: 80%+ with pytest-cov
- See the Testing Guide for best practices
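A test in the fast unit tier is an ordinary pytest function; the helper under test below is hypothetical, shown only to indicate the shape of `tests/unit/`:

```python
import pandas as pd


def drop_blank_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper, not an actual DataBeak function."""
    return df.dropna(how="all")


def test_drop_blank_rows_removes_fully_empty_rows() -> None:
    df = pd.DataFrame({"a": [1, None], "b": [2, None]})
    assert len(drop_blank_rows(df)) == 1
```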
## Future Considerations
- Advanced transformation interface for complex operations
- Real-time collaboration features
- Machine learning integrations for data insights
- Cloud storage support for remote data sources
- Advanced visualization tools for data exploration
For implementation details and contributing guidelines, see `CONTRIBUTING.md`.