Data Flow & Transformations
Overview
The MDK uses a Data Abstraction Layer (DAL) to enable seamless data communication between models written in different programming languages (Python, Julia, Go). The DAL provides multiple transfer modes optimized for different data sizes and use cases, complemented by a caching system to avoid redundant computation.
Why the DAL Exists
Real-world workflows combine specialist models built by different teams in different languages. A Go model's output needs to become a Python model's input without requiring either team to write custom integration code. The DAL solves this by providing a universal data transfer mechanism that works across any language pairing.
The DAL implements the MDK's core principle: data moves automatically between models, regardless of how they were built.
Data Abstraction Layer (DAL)
The DAL solves two key problems:
- Cross-language interoperability — Models in different languages need a common way to exchange data
- Performance at scale — Different data sizes require different transfer mechanisms
The DAL provides three transfer modes:
| Mode | Mechanism | Best For |
|---|---|---|
| File-Based | Shared file system | Large datasets, complex structures, multi-dimensional arrays |
| Cache-Based | Fast in-memory storage | Small to medium data, state management, cached results |
| Direct | HTTP request/response | Small payloads, simple testing, real-time inference |
The transfer mode is automatically selected based on the data size and workflow configuration. Model authors don't need to know which mode is in use — they simply read inputs and write outputs through a standard interface.
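The standard interface described above can be sketched as follows. This is a minimal illustration, not the MDK's actual API: the class and method names (`TaskIO`, `read_input`, `write_output`) are assumptions.

```python
# Hypothetical sketch of the standard task interface: the model only
# reads inputs and writes outputs; the DAL decides the transfer mode.
class TaskIO:
    """Minimal stand-in for the DAL-facing task interface (names assumed)."""

    def __init__(self, inputs: dict, params: dict):
        self._inputs = inputs
        self._params = params
        self.outputs: dict = {}

    def read_input(self, name: str):
        # The DAL resolves this from a file, the cache, or an HTTP
        # payload; the model never needs to know which mode is in use.
        return self._inputs[name]

    def read_param(self, name: str):
        return self._params[name]

    def write_output(self, name: str, value) -> None:
        self.outputs[name] = value


io = TaskIO(inputs={"raw": [1, 2, 3]}, params={"scale": 2})
scale = io.read_param("scale")
io.write_output("scaled", [x * scale for x in io.read_input("raw")])
```

The point of the indirection is that the model body never branches on the transfer mode; swapping file-based for cache-based transfer requires no model changes.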
Data Transfer Patterns
1. File-Based Transfer
The standard pattern for large data and complex structures. Models write output files to a shared location, and downstream tasks read from that location.
When Used:
- Large data volumes (simulation results, sensor data, image sets)
- Complex nested structures
- Multiple downstream consumers need access to the same data
- Standard pattern for output dependencies between tasks
How It Works:
- Task A completes and writes output to a structured location
- The WEM records the output file location
- When Task B becomes ready, the WEM constructs an input descriptor pointing to Task A's output
- Task B reads the file and processes the data
Benefits:
- Efficient for large data (written once, read once or multiple times)
- Models work with files natively in their runtime
- No memory constraints from network payloads
- Multiple consumers don't duplicate data
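The four steps above can be sketched end to end. A temp directory stands in for the shared file system, and the path layout and descriptor fields are assumptions, not the WEM's real conventions.

```python
# Sketch of file-based transfer: Task A writes to a structured
# location, the WEM records it, Task B reads from a descriptor.
import json
import tempfile
from pathlib import Path

shared = Path(tempfile.mkdtemp())  # stand-in for the shared file system

# 1. Task A completes and writes output to a structured location.
out_path = shared / "task_a" / "output.json"
out_path.parent.mkdir(parents=True)
out_path.write_text(json.dumps({"demand_forecast": [100, 120, 95]}))

# 2. The WEM records the output file location.
recorded_location = str(out_path)

# 3. When Task B is ready, the WEM constructs an input descriptor
#    pointing at Task A's output.
descriptor = {"input": "demand_forecast", "path": recorded_location}

# 4. Task B reads the file and processes the data.
data = json.loads(Path(descriptor["path"]).read_text())
```

A second downstream task would receive its own descriptor pointing at the same path, so the data is written once and never duplicated.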
2. Cache-Based Transfer
Used for smaller data, workflow state management, and performance optimization.
When Used:
- Small to medium payloads
- Workflow coordination and state
- Task synchronization
- Caching expensive computation results
How It Works:
- Task A stores output to the cache with a unique key
- The WEM records the cache key
- Task B retrieves the data from cache using the key
- Data is available instantly without file I/O
Benefits:
- Fast access (in-memory retrieval)
- Ideal for coordination data
- Supports caching layer for result reuse
- Lower overhead than file system for small data
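The same four-step flow applies to cache-based transfer. In this sketch an in-process dict stands in for the fast in-memory store, and the key format is an assumption.

```python
# Sketch of cache-based transfer: store under a unique key, record
# the key, retrieve instantly with no file I/O.
import uuid

cache: dict[str, object] = {}  # stand-in for the in-memory store

# 1. Task A stores output to the cache with a unique key.
key = f"task_a/output/{uuid.uuid4()}"
cache[key] = {"confidence_score": 0.85}

# 2. The WEM records the cache key.
recorded_key = key

# 3. Task B retrieves the data from cache using the key.
result = cache[recorded_key]
```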
3. Direct API Transfer
Used when invoking containerized models or microservices for real-time inference.
When Used:
- Invoking ML models deployed as containers
- Real-time inference requirements
- Small input/output payloads
- Stateless operations
How It Works:
- The WEM builds a request with task inputs and parameters
- HTTP POST to the model's endpoint
- Model returns output directly in the response
- Output is stored to the appropriate DAL location for downstream use
Benefits:
- Low latency for small payloads
- Simple integration for stateless models
- Direct invocation pattern
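The request/response steps above can be sketched as follows. A stub function stands in for the model's HTTP endpoint so the example runs without a network; in the real flow the WEM issues an HTTP POST to the container. The payload shape is an assumption.

```python
# Sketch of direct transfer: the WEM builds a request, POSTs it to
# the model endpoint (stubbed here), and gets the output back.
import json


def build_request(inputs: dict, params: dict) -> bytes:
    """Assumed shape of the body the WEM would POST to the model."""
    return json.dumps({"input": inputs, "params": params}).encode()


def stub_model_endpoint(body: bytes) -> dict:
    """Stand-in for a containerized model: scales each input value."""
    payload = json.loads(body)
    factor = payload["params"]["factor"]
    return {"output": [x * factor for x in payload["input"]["values"]]}


response = stub_model_endpoint(build_request({"values": [1, 2, 3]}, {"factor": 2}))
# The WEM then stores response["output"] in the appropriate DAL
# location so downstream tasks can consume it.
```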
Cross-Language Interoperability
One of the DAL's core capabilities is enabling models in different languages to work together seamlessly.
The OR HTTP Contract
Every model implements a standard three-type pattern:
- Input — Data the model receives
- Params — Configuration parameters
- Output — Data the model produces
This contract is language-agnostic. A Julia model can consume a Python model's output without either team coordinating data formats — the DAL handles the translation.
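One way to picture the three-type pattern is as three declared types plus a run function. The field names below are illustrative, not a real MDK schema.

```python
# Sketch of the Input / Params / Output contract for a single model.
from dataclasses import dataclass


@dataclass
class Input:          # data the model receives
    raw_counts: list[int]


@dataclass
class Params:         # configuration parameters
    smoothing: float = 0.5


@dataclass
class Output:         # data the model produces
    demand_forecast: list[float]


def run(inp: Input, params: Params) -> Output:
    # Simple exponential smoothing as a placeholder model body.
    prev = float(inp.raw_counts[0])
    smoothed = []
    for x in inp.raw_counts:
        prev = params.smoothing * x + (1 - params.smoothing) * prev
        smoothed.append(prev)
    return Output(demand_forecast=smoothed)


forecast = run(Input(raw_counts=[10, 20]), Params(smoothing=0.5))
```

Because only the three declared types cross the task boundary, the same contract can be expressed with structs in Go or with types in Julia, and the DAL translates between them.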
Example: Multi-Language Pipeline
Consider a workflow that combines three models:
- Data Preprocessor (Python) — Cleans and transforms raw data
- Traffic Simulator (Julia) — Runs high-performance simulation
- Result Analyzer (Go) — Aggregates and visualizes results
Without the DAL, teams would need to coordinate:
- File formats and naming conventions
- Data serialization (JSON vs CSV vs binary)
- File system locations and cleanup
With the DAL:
- Each model reads inputs and writes outputs through its standard interface
- The WEM coordinates file locations automatically
- Data flows from Python → Julia → Go with zero custom integration code
Input/Output Matching
When tasks have output dependencies, the WEM automatically determines which upstream output fields map to which downstream input fields.
Name-Based Matching
The WEM matches fields by name:
Task A Output:

```json
{
  "demand_forecast": {...},
  "confidence_score": 0.85
}
```

Task B Input Definition:

```json
{
  "demand_forecast": {...}  // Matched by name
}
```

Task B automatically receives the demand_forecast from Task A. The WEM constructs the input without manual configuration.
Type Validation
The WEM validates that matched fields have compatible types:
- Ensures data integrity across task boundaries
- Prevents type mismatches that would cause runtime errors
- Provides early validation before execution begins
Partial Matching
Downstream tasks don't need to consume all upstream outputs:
- Task B can read only the fields it needs from Task A
- Unused output fields are ignored
- Multiple downstream tasks can read different subsets of the same output
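Name-based matching, type validation, and partial matching can be captured in one small function. This is an illustrative sketch of the behavior, not the WEM's actual implementation.

```python
# Sketch of input/output matching: match by name, validate types
# early, and ignore upstream fields the downstream task doesn't declare.
def match_inputs(upstream_output: dict, input_schema: dict[str, type]) -> dict:
    matched = {}
    for name, expected_type in input_schema.items():
        if name not in upstream_output:
            raise KeyError(f"no upstream field named {name!r}")
        value = upstream_output[name]
        if not isinstance(value, expected_type):
            # Early validation, before any task executes.
            raise TypeError(f"{name!r}: expected {expected_type.__name__}")
        matched[name] = value
    return matched


task_a_output = {"demand_forecast": {"mon": 100}, "confidence_score": 0.85}
# Task B declares only demand_forecast; confidence_score is ignored.
task_b_input = match_inputs(task_a_output, {"demand_forecast": dict})
```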
Caching with useCache
The DAL integrates with a caching system to avoid redundant computation. Enabling useCache on a task activates the behavior described below.
How Caching Works
- Before execution, the WEM computes a cache key from:
    - Task input data
    - Task parameter values
    - Task configuration
- The WEM checks if a cached result exists for that key
- If cache hit: Task is skipped, cached output is used
- If cache miss: Task executes normally, result is cached
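The cache-key step can be sketched as a stable hash over the three inputs listed above. The key format and the use of SHA-256 are assumptions about one reasonable implementation, not the WEM's actual scheme.

```python
# Sketch of cache-key computation over inputs, params, and config.
import hashlib
import json


def cache_key(inputs: dict, params: dict, config: dict) -> str:
    # sort_keys makes the key independent of dict insertion order,
    # so identical data always hashes to the same key.
    blob = json.dumps(
        {"inputs": inputs, "params": params, "config": config},
        sort_keys=True,
    ).encode()
    return hashlib.sha256(blob).hexdigest()


k1 = cache_key({"n": 10}, {"seed": 1}, {"image": "sim:v2"})
k2 = cache_key({"n": 10}, {"seed": 1}, {"image": "sim:v2"})
k3 = cache_key({"n": 11}, {"seed": 1}, {"image": "sim:v2"})
# Identical inputs -> same key (cache hit); any change -> new key.
```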
Benefits
- Faster experimentation — Changing one task doesn't re-run unchanged tasks
- Cost savings — Expensive computations run only once for identical inputs
- Reproducibility — Same inputs always produce same outputs (or use cached results)
When to Use Caching
Good candidates:
- Expensive models (long-running simulations, complex optimizations)
- Tasks with stable outputs (preprocessing, feature engineering)
- Tasks early in the workflow (many dependent tasks benefit)
Avoid caching:
- Non-deterministic models (outputs vary for same inputs)
- Real-time data ingestion (always want fresh data)
- Tasks with side effects (external API calls, database writes)
Data Transformation
Operators for Custom Transformations
When automatic field matching isn't enough, custom operators can transform data:
Use Cases:
- Unit conversion (miles to kilometers)
- Data reshaping (pivot tables, aggregations)
- Filtering and sampling
- Format conversion
- Enrichment with external data
Operators are Python functions that read outputs from one or more tasks and produce transformed inputs for downstream tasks.
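A minimal operator, matching the unit-conversion use case above, might look like this. The function name and field names are illustrative.

```python
# Sketch of a custom operator: a plain Python function that reads an
# upstream output and produces a transformed downstream input.
def miles_to_km_operator(upstream_output: dict) -> dict:
    """Convert distances so the downstream model receives kilometers."""
    return {
        "distances_km": [
            round(miles * 1.60934, 3)
            for miles in upstream_output["distances_miles"]
        ]
    }


transformed = miles_to_km_operator({"distances_miles": [1.0, 10.0]})
```

Keeping the conversion in an operator means neither model needs to know the other's units, which preserves the zero-custom-integration property of the pipeline.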
IMPORT_DDK and EXPORT_DDK
Special component types connect workflows to external data systems:
IMPORT_DDK:
- Queries data from DDK-generated GraphQL servers
- Brings structured domain data into workflows
- Example: Load rail network topology before running capacity simulation
EXPORT_DDK:
- Writes workflow results back to DDK servers
- Makes model outputs available to operational systems
- Example: Export optimized schedules to fleet management system
Performance Optimization
Transfer Mode Selection
The WEM automatically selects the most efficient transfer mode:
- Small data (< 1MB): Cache-based transfer for speed
- Medium data (1MB - 100MB): File-based transfer for reliability
- Large data (> 100MB): File-based with streaming where possible
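The threshold logic above reduces to a small selection function. The exact cutoffs are the documented defaults, and per the operator guidance later in this page they would be tunable in a real deployment.

```python
# Sketch of size-based transfer mode selection.
MB = 1024 * 1024


def select_transfer_mode(size_bytes: int) -> str:
    if size_bytes < 1 * MB:
        return "cache"            # small: in-memory for speed
    if size_bytes <= 100 * MB:
        return "file"             # medium: file-based for reliability
    return "file-streaming"       # large: file-based, streamed where possible


small = select_transfer_mode(200 * 1024)
medium = select_transfer_mode(50 * MB)
large = select_transfer_mode(500 * MB)
```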
Parallel Data Movement
When tasks have independent data dependencies:
- Data transfers can happen in parallel
- Multiple tasks can read from the same cached output simultaneously
- Reduces overall workflow execution time
Cleanup and Resource Management
The DAL manages storage automatically:
- Temporary files are cleaned up after downstream tasks complete
- Cache entries have configurable expiration
- Storage usage is monitored to prevent exhaustion
Best Practices
For Model Developers
- Use clear field names — Descriptive names improve automatic matching
- Document your schema — OpenAPI documentation helps users understand your model
- Keep outputs focused — Only produce data that downstream tasks need
- Use appropriate data types — Primitives (numbers, strings) transfer more reliably than complex objects
For Workflow Designers
- Enable caching on expensive tasks — Especially early-stage preprocessing
- Group related transformations — Reduce number of task boundaries
- Use operators for complex transformations — Don't force models to handle data reshaping
- Test with small data first — Validate logic before scaling to production datasets
For Platform Operators
- Monitor storage growth — Large workflows generate significant intermediate data
- Configure cache expiration — Balance performance vs storage costs
- Tune transfer mode thresholds — Based on infrastructure characteristics
- Implement cleanup policies — Automatic removal of old workflow data
See Also
- Workflow Execution Manager — How the WEM orchestrates the data flows described here
- Component Types — The four task types (MODEL, OPERATOR, IMPORT_DDK, EXPORT_DDK)
- Architecture — How the DAL fits into the broader MDK architecture
