Architecture Overview
DataBeak is a Model Context Protocol (MCP) server that gives AI assistants tools for loading, transforming, analyzing, and validating CSV data. This document describes its technical architecture and the design decisions behind it.
Technology Stack
- Framework: FastMCP 2.11.3+ (Model Context Protocol)
- Data Processing: Pandas 2.2.3+, NumPy 2.1.3+
- Package Manager: uv (ultra-fast Python package management)
- Build System: Hatchling
- Code Quality: Ruff (linting and formatting), MyPy (type checking)
- Configuration: Pydantic Settings for environment management
Core Components
src/databeak/
├── server.py # FastMCP server entry point
├── models/ # Data models and session management
│ ├── csv_session.py # Session management & settings
│ ├── data_models.py # Core data types
│ └── data_session.py # Data operations
├── tools/ # MCP tool implementations
│ ├── data_io.py # Load/export operations
│ ├── data_manipulation.py # Transform operations
│ ├── data_analysis.py # Statistics & analysis
│ └── data_validation.py # Schema validation
├── exceptions.py # Custom error handling
└── _version.py # Dynamic version loading
Key Features
Session Management
- Multi-session support with automatic cleanup
- Configurable timeouts and resource limits
- Session isolation for concurrent users
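The session behavior listed above can be illustrated with a minimal standard-library sketch. The names `Session`, `SessionRegistry`, and `max_sessions` are hypothetical and are not taken from DataBeak's `csv_session.py`; the sketch only shows one way idle-timeout cleanup and per-session isolation could fit together.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Session:
    """One isolated workspace; `data` stands in for a loaded table."""

    session_id: str
    data: dict = field(default_factory=dict)
    last_access: float = field(default_factory=time.monotonic)


class SessionRegistry:
    """Hypothetical registry: sessions expire after `timeout` idle seconds."""

    def __init__(self, timeout: float = 3600.0, max_sessions: int = 100) -> None:
        self.timeout = timeout
        self.max_sessions = max_sessions  # assumed resource limit
        self._sessions: Dict[str, Session] = {}

    def create(self) -> Session:
        self.cleanup()
        if len(self._sessions) >= self.max_sessions:
            raise RuntimeError("session limit reached")
        session = Session(session_id=uuid.uuid4().hex)
        self._sessions[session.session_id] = session
        return session

    def get(self, session_id: str) -> Session:
        self.cleanup()
        session = self._sessions[session_id]  # KeyError if expired or unknown
        session.last_access = time.monotonic()
        return session

    def cleanup(self, now: Optional[float] = None) -> int:
        """Drop idle sessions and return how many were removed."""
        now = time.monotonic() if now is None else now
        expired = [
            sid
            for sid, session in self._sessions.items()
            if now - session.last_access > self.timeout
        ]
        for sid in expired:
            del self._sessions[sid]
        return len(expired)
```

Each session owns its own `data` dictionary, so a change made through one session ID is not visible through another. Cleanup runs on every `create` and `get` call here; a background task would be an equally reasonable choice.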
Data Operations
- 40+ tools covering I/O, manipulation, analysis, and validation
- Multiple format support: CSV, JSON, Excel, Parquet, HTML, Markdown
- Streaming processing for large files
- Type-safe operations with Pydantic validation
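As a rough illustration of the streaming idea, the sketch below reads a CSV in fixed-size chunks using only the standard library. It is not DataBeak's reader: `iter_chunks` and `count_rows` are hypothetical helpers. With Pandas, the comparable approach is to pass `chunksize` to `pandas.read_csv`, which yields DataFrames one chunk at a time.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List, TextIO


def iter_chunks(handle: TextIO, chunk_size: int = 10000) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most `chunk_size` rows so the full file is never held in memory."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def count_rows(handle: TextIO, chunk_size: int = 10000) -> int:
    """Example aggregate computed one chunk at a time."""
    return sum(len(chunk) for chunk in iter_chunks(handle, chunk_size))


# Small in-memory file standing in for a large CSV on disk.
sample = io.StringIO("id,value\n1,a\n2,b\n3,c\n")
chunk_lengths = [len(chunk) for chunk in iter_chunks(sample, chunk_size=2)]
```

With three data rows and a chunk size of two, the reader produces one full chunk followed by one partial chunk.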
Auto-Save & History
- Automatic saving after each operation
- Undo/redo functionality with operation tracking
- Persistent history with JSON storage
- Configurable strategies: overwrite, backup, versioned
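Undo/redo with JSON persistence can be sketched as two stacks of operation records. The class below is a hypothetical illustration, not DataBeak's history implementation: it tracks operation metadata only, and a real undo would also need to restore the data itself (for example from a snapshot).

```python
import json
from typing import Any, Dict, List, Optional


class OperationHistory:
    """Hypothetical undo/redo tracker over recorded operations."""

    def __init__(self) -> None:
        self._done: List[Dict[str, Any]] = []
        self._undone: List[Dict[str, Any]] = []

    def record(self, name: str, **details: Any) -> None:
        self._done.append({"operation": name, "details": details})
        self._undone.clear()  # a new operation invalidates the redo stack

    def undo(self) -> Optional[Dict[str, Any]]:
        if not self._done:
            return None
        entry = self._done.pop()
        self._undone.append(entry)
        return entry

    def redo(self) -> Optional[Dict[str, Any]]:
        if not self._undone:
            return None
        entry = self._undone.pop()
        self._done.append(entry)
        return entry

    def to_json(self) -> str:
        """Serialize both stacks so the history survives a restart."""
        return json.dumps({"done": self._done, "undone": self._undone})

    @classmethod
    def from_json(cls, text: str) -> "OperationHistory":
        payload = json.loads(text)
        history = cls()
        history._done = payload["done"]
        history._undone = payload["undone"]
        return history
```

Clearing the redo stack on every new operation keeps the history linear, which is the usual trade-off for a simple undo model.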
Configuration Management
- Environment-based settings using Pydantic Settings
- Centralized configuration in CSVSettings class
- Runtime version detection via importlib.metadata
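Runtime version detection with `importlib.metadata` typically looks like the sketch below. The distribution name `"databeak"` and the fallback string are assumptions made for illustration; the contents of `_version.py` are not shown in this document.

```python
from importlib.metadata import PackageNotFoundError, version


def get_version(distribution: str = "databeak") -> str:
    """Return the installed distribution's version, or a placeholder if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        # Assumed fallback for source checkouts that were never installed.
        return "0.0.0+unknown"
```

Reading the version from package metadata means the number declared in `pyproject.toml` does not have to be duplicated in the source tree.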
Environment Variables
All configuration uses the DATABEAK_ prefix:
Variable | Default | Purpose |
---|---|---|
DATABEAK_MAX_FILE_SIZE_MB | 1024 | Maximum file size (MB) |
DATABEAK_CSV_HISTORY_DIR | "." | History storage location |
DATABEAK_SESSION_TIMEOUT | 3600 | Session timeout (seconds) |
DATABEAK_CHUNK_SIZE | 10000 | Processing chunk size |
DATABEAK_AUTO_SAVE | true | Enable auto-save |
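To show how the prefix maps onto typed settings, here is a standard-library stand-in. DataBeak itself uses Pydantic Settings with a `CSVSettings` class; the `Settings` dataclass, its lowercase field names, and the accepted boolean spellings below are assumptions made only for this sketch.

```python
import os
from dataclasses import dataclass
from typing import Any, Callable, Mapping

PREFIX = "DATABEAK_"


def _as_bool(raw: str) -> bool:
    # Assumed truthy spellings; the real parser may accept a different set.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    """Defaults mirror the table above."""

    max_file_size_mb: int = 1024
    csv_history_dir: str = "."
    session_timeout: int = 3600
    chunk_size: int = 10000
    auto_save: bool = True

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        defaults = cls()

        def read(name: str, cast: Callable[[str], Any], default: Any) -> Any:
            raw = env.get(PREFIX + name)
            return default if raw is None else cast(raw)

        return cls(
            max_file_size_mb=read("MAX_FILE_SIZE_MB", int, defaults.max_file_size_mb),
            csv_history_dir=read("CSV_HISTORY_DIR", str, defaults.csv_history_dir),
            session_timeout=read("SESSION_TIMEOUT", int, defaults.session_timeout),
            chunk_size=read("CHUNK_SIZE", int, defaults.chunk_size),
            auto_save=read("AUTO_SAVE", _as_bool, defaults.auto_save),
        )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests, for example `Settings.from_env({"DATABEAK_CHUNK_SIZE": "500"})`.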
MCP Integration
The server implements the Model Context Protocol standard:
- Tools: 40+ data manipulation functions
- Resources: Session and data access
- Prompts: Data analysis templates
- Error Handling: Structured error responses
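One plausible shape for structured error responses is an exception hierarchy whose instances serialize to a dictionary. Everything in this sketch is hypothetical: the `ToolError` and `ColumnNotFoundError` classes, the `select_column` function, and the response keys are illustrations, not the contents of DataBeak's `exceptions.py`.

```python
from typing import Any, Dict, List


class ToolError(Exception):
    """Hypothetical base error carrying a machine-readable code and details."""

    code = "tool_error"

    def __init__(self, message: str, **details: Any) -> None:
        super().__init__(message)
        self.message = message
        self.details = details

    def to_dict(self) -> Dict[str, Any]:
        return {
            "success": False,
            "error": {"type": self.code, "message": self.message, "details": self.details},
        }


class ColumnNotFoundError(ToolError):
    code = "column_not_found"


def select_column(rows: List[Dict[str, Any]], column: str) -> Dict[str, Any]:
    """Tool-style function: returns a result dict instead of raising to the caller."""
    try:
        if rows and column not in rows[0]:
            raise ColumnNotFoundError(
                f"Column {column!r} not found",
                column=column,
                available=sorted(rows[0]),
            )
        return {"success": True, "values": [row[column] for row in rows]}
    except ToolError as exc:
        return exc.to_dict()
```

Returning a dictionary with a stable `type` field lets a client branch on the error kind without parsing the message text.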
Tool Categories
- I/O Operations - Load/export data in multiple formats
- Data Manipulation - Transform, filter, sort, and modify data
- Data Analysis - Statistics, correlations, outliers, profiling
- Data Validation - Schema validation, quality checking, anomaly detection
- Session Management - Auto-save, history, undo/redo operations
- System Tools - Health monitoring and server information
Design Principles
- Type Safety: Full type annotations with Pydantic validation
- Modularity: Clear separation of concerns across modules
- Performance: Streaming operations for large datasets
- Reliability: Comprehensive error handling and logging
- Usability: Simple installation and configuration
- Maintainability: Modern tooling and clear documentation
Development Workflow
Package Management
uv sync # Install dependencies
uv run databeak # Run server
uv run -m pytest # Run tests
uv run ruff check && uv run ruff format --check && uv run mypy src/ && uv run -m pytest # Run all quality checks
Version Management
- Single source of truth: pyproject.toml
- Automatic synchronization: uv run sync-versions
- Dynamic loading: via importlib.metadata
Quality Assurance
- Linting: Ruff with comprehensive rule set
- Formatting: Ruff with 100-character lines
- Type checking: MyPy with strict configuration
- Testing: Three-tier testing strategy
  - Unit tests (tests/unit/): Fast, isolated module testing
  - Integration tests (tests/integration/): Component interaction validation
  - E2E tests (tests/e2e/): Complete workflow verification
  - Coverage target: 80%+ with pytest-cov
  - See Testing Guide for best practices
Future Considerations
- Advanced transformation interface for complex operations
- Real-time collaboration features
- Machine learning integrations for data insights
- Cloud storage support for remote data sources
- Advanced visualization tools for data exploration
For implementation details and contributing guidelines, see CONTRIBUTING.md.