Data Flow & Transformations
Overview
The MDK uses a Data Abstraction Layer (DAL) to enable seamless data communication between models written in different programming languages (Python, Julia, Go). The DAL provides multiple transfer modes optimized for different data sizes and use cases, complemented by a caching system to avoid redundant computation.
Why the DAL Exists
Real-world workflows combine specialist models built by different teams in different languages. A Go model's output needs to become a Python model's input without requiring either team to write custom integration code. The DAL solves this by providing a universal data transfer mechanism that works across any language pairing.
The DAL implements the MDK's core principle: data moves automatically between models, regardless of how they were built.
Data Abstraction Layer (DAL)
The DAL solves two key problems:
- Cross-language interoperability — Models in different languages need a common way to exchange data
- Performance at scale — Different data sizes require different transfer mechanisms
The DAL provides three transfer modes:
| Mode | Mechanism | Best For |
|---|---|---|
| File-Based | Shared file system | Large datasets, complex structures, multi-dimensional arrays |
| Cache-Based | Fast in-memory storage | Small to medium data, state management, cached results |
| Direct | HTTP request/response | Small payloads, simple testing, real-time inference |
The transfer mode is automatically selected based on the data size and workflow configuration. Model authors don't need to know which mode is in use — they simply read inputs and write outputs through a standard interface.
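The standard interface described above can be sketched as follows. This is a minimal illustration, not the MDK's actual API: the class and method names (`TaskIO`, `read_input`, `write_output`) are assumptions.

```python
# Hypothetical sketch of the standard task interface: the model only
# reads inputs and writes outputs; the DAL decides the transfer mode.
class TaskIO:
    """Minimal stand-in for the DAL-facing task interface (names assumed)."""

    def __init__(self, inputs: dict, params: dict):
        self._inputs = inputs
        self._params = params
        self.outputs: dict = {}

    def read_input(self, name: str):
        # The DAL resolves this from a file, the cache, or an HTTP
        # payload; the model never needs to know which mode is in use.
        return self._inputs[name]

    def read_param(self, name: str):
        return self._params[name]

    def write_output(self, name: str, value) -> None:
        self.outputs[name] = value


io = TaskIO(inputs={"raw": [1, 2, 3]}, params={"scale": 2})
scale = io.read_param("scale")
io.write_output("scaled", [x * scale for x in io.read_input("raw")])
```

The point of the indirection is that the model body never branches on the transfer mode; swapping file-based for cache-based transfer requires no model changes.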
Data Transfer Patterns
1. File-Based Transfer
The standard pattern for large data and complex structures. Models write output files to a shared location, and downstream tasks read from that location.
When Used:
- Large data volumes (simulation results, sensor data, image sets)
- Complex nested structures
- Multiple downstream consumers need access to the same data
- Standard pattern for output dependencies between tasks
How It Works:
- Task A completes and writes output to a structured location
- The WEM records the output file location
- When Task B becomes ready, the WEM constructs an input descriptor pointing to Task A's output
- Task B reads the file and processes the data
Benefits:
- Efficient for large data (written once, read once or multiple times)
- Models work with files natively in their runtime
- No memory constraints from network payloads
- Multiple consumers don't duplicate data
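The four steps above can be sketched end to end. A temp directory stands in for the shared file system, and the path layout and descriptor fields are assumptions, not the WEM's real conventions.

```python
# Sketch of file-based transfer: Task A writes to a structured
# location, the WEM records it, Task B reads from a descriptor.
import json
import tempfile
from pathlib import Path

shared = Path(tempfile.mkdtemp())  # stand-in for the shared file system

# 1. Task A completes and writes output to a structured location.
out_path = shared / "task_a" / "output.json"
out_path.parent.mkdir(parents=True)
out_path.write_text(json.dumps({"demand_forecast": [100, 120, 95]}))

# 2. The WEM records the output file location.
recorded_location = str(out_path)

# 3. When Task B is ready, the WEM constructs an input descriptor
#    pointing at Task A's output.
descriptor = {"input": "demand_forecast", "path": recorded_location}

# 4. Task B reads the file and processes the data.
data = json.loads(Path(descriptor["path"]).read_text())
```

A second downstream task would receive its own descriptor pointing at the same path, so the data is written once and never duplicated.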
2. Cache-Based Transfer
Used for smaller data, workflow state management, and performance optimization.
When Used:
- Small to medium payloads
- Workflow coordination and state
- Task synchronization
- Caching expensive computation results
How It Works:
- Task A stores output to the cache with a unique key
- The WEM records the cache key
- Task B retrieves the data from cache using the key
- Data is available instantly without file I/O
Benefits:
- Fast access (in-memory retrieval)
- Ideal for coordination data
- Supports caching layer for result reuse
- Lower overhead than file system for small data
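The same four-step flow applies to cache-based transfer. In this sketch an in-process dict stands in for the fast in-memory store, and the key format is an assumption.

```python
# Sketch of cache-based transfer: store under a unique key, record
# the key, retrieve instantly with no file I/O.
import uuid

cache: dict[str, object] = {}  # stand-in for the in-memory store

# 1. Task A stores output to the cache with a unique key.
key = f"task_a/output/{uuid.uuid4()}"
cache[key] = {"confidence_score": 0.85}

# 2. The WEM records the cache key.
recorded_key = key

# 3. Task B retrieves the data from cache using the key.
result = cache[recorded_key]
```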
3. Direct API Transfer
Used when invoking containerized models or microservices for real-time inference.
When Used:
- Invoking ML models deployed as containers
- Real-time inference requirements
- Small input/output payloads
- Stateless operations
How It Works:
- The WEM builds a request with task inputs and parameters
- HTTP POST to the model's endpoint
- Model returns output directly in the response
- Output is stored to the appropriate DAL location for downstream use
Benefits:
- Low latency for small payloads
- Simple integration for stateless models
- Direct invocation pattern
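The request/response steps above can be sketched as follows. A stub function stands in for the model's HTTP endpoint so the example runs without a network; in the real flow the WEM issues an HTTP POST to the container. The payload shape is an assumption.

```python
# Sketch of direct transfer: the WEM builds a request, POSTs it to
# the model endpoint (stubbed here), and gets the output back.
import json


def build_request(inputs: dict, params: dict) -> bytes:
    """Assumed shape of the body the WEM would POST to the model."""
    return json.dumps({"input": inputs, "params": params}).encode()


def stub_model_endpoint(body: bytes) -> dict:
    """Stand-in for a containerized model: scales each input value."""
    payload = json.loads(body)
    factor = payload["params"]["factor"]
    return {"output": [x * factor for x in payload["input"]["values"]]}


response = stub_model_endpoint(build_request({"values": [1, 2, 3]}, {"factor": 2}))
# The WEM then stores response["output"] in the appropriate DAL
# location so downstream tasks can consume it.
```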
Cross-Language Interoperability
One of the DAL's core capabilities is enabling models in different languages to work together seamlessly.
The OR HTTP Contract
Every model implements a standard three-type pattern:
- Input — Data the model receives
- Params — Configuration parameters
- Output — Data the model produces
This contract is language-agnostic. A Julia model can consume a Python model's output without either team coordinating data formats — the DAL handles the translation.
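One way to picture the three-type pattern is as three declared types plus a run function. The field names below are illustrative, not a real MDK schema.

```python
# Sketch of the Input / Params / Output contract for a single model.
from dataclasses import dataclass


@dataclass
class Input:          # data the model receives
    raw_counts: list[int]


@dataclass
class Params:         # configuration parameters
    smoothing: float = 0.5


@dataclass
class Output:         # data the model produces
    demand_forecast: list[float]


def run(inp: Input, params: Params) -> Output:
    # Simple exponential smoothing as a placeholder model body.
    prev = float(inp.raw_counts[0])
    smoothed = []
    for x in inp.raw_counts:
        prev = params.smoothing * x + (1 - params.smoothing) * prev
        smoothed.append(prev)
    return Output(demand_forecast=smoothed)


forecast = run(Input(raw_counts=[10, 20]), Params(smoothing=0.5))
```

Because only the three declared types cross the task boundary, the same contract can be expressed with structs in Go or with types in Julia, and the DAL translates between them.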
Example: Multi-Language Pipeline
Consider a workflow that combines three models:
- Data Preprocessor (Python) — Cleans and transforms raw data
- Traffic Simulator (Julia) — Runs high-performance simulation
- Result Analyzer (Go) — Aggregates and visualizes results
Without the DAL, teams would need to coordinate:
- File formats and naming conventions
- Data serialization (JSON vs CSV vs binary)
- File system locations and cleanup
With the DAL:
- Each model reads inputs and writes outputs through its standard interface
- The WEM coordinates file locations automatically
- Data flows from Python → Julia → Go with zero custom integration code
Input/Output Matching
When tasks have output dependencies, the WEM automatically determines which upstream output fields map to which downstream input fields.
Name-Based Matching
The WEM matches fields by name:
Task A Output:

```json
{
  "demand_forecast": {...},
  "confidence_score": 0.85
}
```

Task B Input Definition:

```json
{
  "demand_forecast": {...}  // Matched by name
}
```

Task B automatically receives the demand_forecast from Task A. The WEM constructs the input without manual configuration.
Type Validation
The WEM validates that matched fields have compatible types:
- Ensures data integrity across task boundaries
- Prevents type mismatches that would cause runtime errors
- Provides early validation before execution begins
Partial Matching
Downstream tasks don't need to consume all upstream outputs:
- Task B can read only the fields it needs from Task A
- Unused output fields are ignored
- Multiple downstream tasks can read different subsets of the same output
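Name-based matching, type validation, and partial matching can be captured in one small function. This is an illustrative sketch of the behavior, not the WEM's actual implementation.

```python
# Sketch of input/output matching: match by name, validate types
# early, and ignore upstream fields the downstream task doesn't declare.
def match_inputs(upstream_output: dict, input_schema: dict[str, type]) -> dict:
    matched = {}
    for name, expected_type in input_schema.items():
        if name not in upstream_output:
            raise KeyError(f"no upstream field named {name!r}")
        value = upstream_output[name]
        if not isinstance(value, expected_type):
            # Early validation, before any task executes.
            raise TypeError(f"{name!r}: expected {expected_type.__name__}")
        matched[name] = value
    return matched


task_a_output = {"demand_forecast": {"mon": 100}, "confidence_score": 0.85}
# Task B declares only demand_forecast; confidence_score is ignored.
task_b_input = match_inputs(task_a_output, {"demand_forecast": dict})
```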
Caching with useCache
The DAL integrates with a caching system to avoid redundant computation. Enabling useCache on a task activates the behavior described below.
How Caching Works
- Before execution, the WEM computes a cache key from:
    - Task input data
    - Task parameter values
    - Task configuration
- The WEM checks if a cached result exists for that key
- If cache hit: Task is skipped, cached output is used
- If cache miss: Task executes normally, result is cached
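The cache-key step can be sketched as a stable hash over the three inputs listed above. The key format and the use of SHA-256 are assumptions about one reasonable implementation, not the WEM's actual scheme.

```python
# Sketch of cache-key computation over inputs, params, and config.
import hashlib
import json


def cache_key(inputs: dict, params: dict, config: dict) -> str:
    # sort_keys makes the key independent of dict insertion order,
    # so identical data always hashes to the same key.
    blob = json.dumps(
        {"inputs": inputs, "params": params, "config": config},
        sort_keys=True,
    ).encode()
    return hashlib.sha256(blob).hexdigest()


k1 = cache_key({"n": 10}, {"seed": 1}, {"image": "sim:v2"})
k2 = cache_key({"n": 10}, {"seed": 1}, {"image": "sim:v2"})
k3 = cache_key({"n": 11}, {"seed": 1}, {"image": "sim:v2"})
# Identical inputs -> same key (cache hit); any change -> new key.
```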
Benefits
- Faster experimentation — Changing one task doesn't re-run unchanged tasks
- Cost savings — Expensive computations run only once for identical inputs
- Reproducibility — Same inputs always produce same outputs (or use cached results)
When to Use Caching
Good candidates:
- Expensive models (long-running simulations, complex optimizations)
- Tasks with stable outputs (preprocessing, feature engineering)
- Tasks early in the workflow (many dependent tasks benefit)
Avoid caching:
- Non-deterministic models (outputs vary for same inputs)
- Real-time data ingestion (always want fresh data)
- Tasks with side effects (external API calls, database writes)
Data Transformation
Operators for Custom Transformations
When automatic field matching isn't enough, custom operators can transform data:
Use Cases:
- Unit conversion (miles to kilometers)
- Data reshaping (pivot tables, aggregations)
- Filtering and sampling
- Format conversion
- Enrichment with external data
Operators are Python functions that read outputs from one or more tasks and produce transformed inputs for downstream tasks.
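A minimal operator, matching the unit-conversion use case above, might look like this. The function name and field names are illustrative.

```python
# Sketch of a custom operator: a plain Python function that reads an
# upstream output and produces a transformed downstream input.
def miles_to_km_operator(upstream_output: dict) -> dict:
    """Convert distances so the downstream model receives kilometers."""
    return {
        "distances_km": [
            round(miles * 1.60934, 3)
            for miles in upstream_output["distances_miles"]
        ]
    }


transformed = miles_to_km_operator({"distances_miles": [1.0, 10.0]})
```

Keeping the conversion in an operator means neither model needs to know the other's units, which preserves the zero-custom-integration property of the pipeline.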
IMPORT_DDK and EXPORT_DDK
Special component types connect workflows to external data systems:
IMPORT_DDK:
- Queries data from DDK-generated GraphQL servers
- Brings structured domain data into workflows
- Example: Load rail network topology before running capacity simulation
EXPORT_DDK:
- Writes workflow results back to DDK servers
- Makes model outputs available to operational systems
- Example: Export optimized schedules to fleet management system
Performance Optimization
Transfer Mode Selection
The WEM automatically selects the most efficient transfer mode:
- Small data (< 1MB): Cache-based transfer for speed
- Medium data (1MB - 100MB): File-based transfer for reliability
- Large data (> 100MB): File-based with streaming where possible
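The threshold logic above reduces to a small selection function. The exact cutoffs are the documented defaults, and per the operator guidance later in this page they would be tunable in a real deployment.

```python
# Sketch of size-based transfer mode selection.
MB = 1024 * 1024


def select_transfer_mode(size_bytes: int) -> str:
    if size_bytes < 1 * MB:
        return "cache"            # small: in-memory for speed
    if size_bytes <= 100 * MB:
        return "file"             # medium: file-based for reliability
    return "file-streaming"       # large: file-based, streamed where possible


small = select_transfer_mode(200 * 1024)
medium = select_transfer_mode(50 * MB)
large = select_transfer_mode(500 * MB)
```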
Parallel Data Movement
When tasks have independent data dependencies:
- Data transfers can happen in parallel
- Multiple tasks can read from the same cached output simultaneously
- Reduces overall workflow execution time
Cleanup and Resource Management
The DAL manages storage automatically:
- Temporary files are cleaned up after downstream tasks complete
- Cache entries have configurable expiration
- Storage usage is monitored to prevent exhaustion
Best Practices
For Model Developers
- Use clear field names — Descriptive names improve automatic matching
- Document your schema — OpenAPI documentation helps users understand your model
- Keep outputs focused — Only produce data that downstream tasks need
- Use appropriate data types — Primitives (numbers, strings) transfer more reliably than complex objects
For Workflow Designers
- Enable caching on expensive tasks — Especially early-stage preprocessing
- Group related transformations — Reduce number of task boundaries
- Use operators for complex transformations — Don't force models to handle data reshaping
- Test with small data first — Validate logic before scaling to production datasets
For Platform Operators
- Monitor storage growth — Large workflows generate significant intermediate data
- Configure cache expiration — Balance performance vs storage costs
- Tune transfer mode thresholds — Based on infrastructure characteristics
- Implement cleanup policies — Automatic removal of old workflow data
See Also
- Workflow Execution Manager — How the WEM orchestrates the data flows described here
- Component Types — The four task types (MODEL, OPERATOR, IMPORT_DDK, EXPORT_DDK)
- Architecture — How the DAL fits into the broader MDK architecture
