
Data Flow & Transformations

[Platform architecture diagram: platform users (engineers and low-code ops users) work through ORA, the SDK UI, FDK, DDK, MDK (WEM, DAL, Experiment Manager, Nexus), and SCDK; these layers connect through the SDK GraphQL federation API and domain microservices, and Nexus deploys the resulting OR applications (rail, mine, and port operations dashboards) to application users on operations teams.]

Overview

The MDK uses a Data Abstraction Layer (DAL) to enable seamless data communication between models written in different programming languages (Python, Julia, Go). The DAL provides multiple transfer modes optimized for different data sizes and use cases, complemented by a caching system to avoid redundant computation.

Why the DAL Exists

Real-world workflows combine specialist models built by different teams in different languages. A Go model's output needs to become a Python model's input without requiring either team to write custom integration code. The DAL solves this by providing a universal data transfer mechanism that works across any language pairing.

The DAL implements the MDK's core principle: data moves automatically between models, regardless of how they were built.

Data Abstraction Layer (DAL)

The DAL solves two key problems:

  1. Cross-language interoperability — Models in different languages need a common way to exchange data
  2. Performance at scale — Different data sizes require different transfer mechanisms

The DAL provides three transfer modes:

Mode         Mechanism               Best For
File-Based   Shared file system      Large datasets, complex structures, multi-dimensional arrays
Cache-Based  Fast in-memory storage  Small to medium data, state management, cached results
Direct       HTTP request/response   Small payloads, simple testing, real-time inference

The transfer mode is automatically selected based on the data size and workflow configuration. Model authors don't need to know which mode is in use — they simply read inputs and write outputs through a standard interface.
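This "read and write through a standard interface, let the DAL pick the mode" idea can be sketched in a few lines. The class below is a toy stand-in, not the real DAL API: it routes small payloads through an in-memory cache and larger ones through files, hidden behind a single read/write interface. All names (`DAL`, `read`, `write`, the 1 MB threshold) are illustrative assumptions.

```python
import json
import os
import tempfile


class DAL:
    """Toy stand-in for the DAL: small payloads go to an in-memory
    cache, larger ones to temporary files. The caller never sees which."""

    CACHE_LIMIT = 1 << 20  # 1 MB threshold, matching the sizes quoted later

    def __init__(self):
        self._store = {}

    def write(self, key, obj):
        data = json.dumps(obj).encode()
        if len(data) <= self.CACHE_LIMIT:
            self._store[key] = data              # cache-based mode
        else:
            fd, path = tempfile.mkstemp()        # file-based mode
            with os.fdopen(fd, "wb") as f:
                f.write(data)
            self._store[key] = ("file", path)

    def read(self, key):
        entry = self._store[key]
        if isinstance(entry, tuple):             # stored as a file
            with open(entry[1], "rb") as f:
                return json.loads(f.read())
        return json.loads(entry)                 # stored in the cache


dal = DAL()
dal.write("task_a/output", {"demand_forecast": [1, 2, 3]})
print(dal.read("task_a/output"))  # {'demand_forecast': [1, 2, 3]}
```

A model author only ever calls `read` and `write`; whether the bytes lived in memory or on disk is an implementation detail of the layer.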

Data Transfer Patterns

1. File-Based Transfer

The standard pattern for large data and complex structures. Models write output files to a shared location, and downstream tasks read from that location.

When Used:

  • Large data volumes (simulation results, sensor data, image sets)
  • Complex nested structures
  • Multiple downstream consumers need access to the same data
  • Standard pattern for output dependencies between tasks

How It Works:

  1. Task A completes and writes output to a structured location
  2. The WEM records the output file location
  3. When Task B becomes ready, the WEM constructs an input descriptor pointing to Task A's output
  4. Task B reads the file and processes the data
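The four steps above can be sketched concretely. The registry and descriptor structures here are hypothetical illustrations of the bookkeeping the WEM performs, not its actual data model.

```python
import json
import os
import tempfile

# 1. Task A completes and writes output to a structured location.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "task_a", "output.json")
os.makedirs(os.path.dirname(out_path))
with open(out_path, "w") as f:
    json.dump({"demand_forecast": [10.2, 11.7]}, f)

# 2. The WEM records the output file location.
registry = {"task_a": {"output": out_path}}

# 3. When Task B becomes ready, the WEM constructs an input descriptor
#    pointing at Task A's output rather than copying the data.
descriptor = {"source_task": "task_a", "path": registry["task_a"]["output"]}

# 4. Task B reads the file and processes the data.
with open(descriptor["path"]) as f:
    task_b_input = json.load(f)
print(task_b_input["demand_forecast"])  # [10.2, 11.7]
```

The key point is that only a small descriptor moves between the WEM and the tasks; the data itself is written once and read in place.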

Benefits:

  • Efficient for large data (written once, read once or multiple times)
  • Models work with files natively in their runtime
  • No memory constraints from network payloads
  • Multiple consumers don't duplicate data

2. Cache-Based Transfer

Used for smaller data, workflow state management, and performance optimization.

When Used:

  • Small to medium payloads
  • Workflow coordination and state
  • Task synchronization
  • Caching expensive computation results

How It Works:

  1. Task A stores output to the cache with a unique key
  2. The WEM records the cache key
  3. Task B retrieves the data from cache using the key
  4. Data is available instantly without file I/O
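The same steps for the cache-based mode reduce to key bookkeeping. A plain dictionary stands in for the real cache here; the key format is an illustrative assumption.

```python
import uuid

cache = {}  # stand-in for the fast in-memory store

# 1. Task A stores output to the cache with a unique key.
key = f"task_a/{uuid.uuid4()}"
cache[key] = {"confidence_score": 0.85}

# 2. The WEM records the cache key.
recorded = {"task_a": key}

# 3-4. Task B retrieves the data by key, instantly and with no file I/O.
task_b_input = cache[recorded["task_a"]]
print(task_b_input)  # {'confidence_score': 0.85}
```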

Benefits:

  • Fast access (in-memory retrieval)
  • Ideal for coordination data
  • Supports caching layer for result reuse
  • Lower overhead than file system for small data

3. Direct API Transfer

Used when invoking containerized models or microservices for real-time inference.

When Used:

  • Invoking ML models deployed as containers
  • Real-time inference requirements
  • Small input/output payloads
  • Stateless operations

How It Works:

  1. The WEM builds a request with task inputs and parameters
  2. HTTP POST to the model's endpoint
  3. Model returns output directly in the response
  4. Output is stored to the appropriate DAL location for downstream use
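The request/response round trip can be sketched end to end. The echo server below is a stand-in for a containerized model; the endpoint path, payload shape, and model logic are all hypothetical.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class ModelStub(BaseHTTPRequestHandler):
    """Stand-in for a deployed model container."""

    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        result = {"prediction": body["input"]["value"] * body["params"]["scale"]}
        payload = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the demo quiet
        pass


server = HTTPServer(("127.0.0.1", 0), ModelStub)
threading.Thread(target=server.serve_forever, daemon=True).start()

# 1-2. The WEM builds a request with inputs and params, then POSTs it.
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/infer",
    data=json.dumps({"input": {"value": 21}, "params": {"scale": 2}}).encode(),
    headers={"Content-Type": "application/json"},
)
# 3. The model returns its output directly in the response.
with urllib.request.urlopen(req) as resp:
    output = json.loads(resp.read())
print(output)  # {'prediction': 42}
server.shutdown()
```

Step 4 (persisting the response to a DAL location for downstream tasks) would follow the file- or cache-based patterns shown earlier.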

Benefits:

  • Low latency for small payloads
  • Simple integration for stateless models
  • Direct invocation pattern

Cross-Language Interoperability

One of the DAL's core capabilities is enabling models in different languages to work together seamlessly.

The OR HTTP Contract

Every model implements a standard three-type pattern:

  • Input — Data the model receives
  • Params — Configuration parameters
  • Output — Data the model produces

This contract is language-agnostic. A Julia model can consume a Python model's output without either team coordinating data formats — the DAL handles the translation.
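In Python, the three-type pattern might look like the sketch below. The field names, dataclass representation, and `run` signature are illustrative assumptions, not the contract's actual wire format.

```python
from dataclasses import dataclass


@dataclass
class Input:           # data the model receives
    raw_counts: list


@dataclass
class Params:          # configuration parameters
    smoothing: float = 0.5


@dataclass
class Output:          # data the model produces
    demand_forecast: list


def run(inp: Input, params: Params) -> Output:
    # A trivial model body: scale each count by the smoothing parameter.
    return Output(demand_forecast=[c * params.smoothing for c in inp.raw_counts])


result = run(Input(raw_counts=[2.0, 4.0]), Params())
print(result.demand_forecast)  # [1.0, 2.0]
```

Because each side only sees its own `Input`/`Params`/`Output` types, a Julia or Go model can play either role in the same pipeline.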

Example: Multi-Language Pipeline

Consider a workflow that combines three models:

  1. Data Preprocessor (Python) — Cleans and transforms raw data
  2. Traffic Simulator (Julia) — Runs high-performance simulation
  3. Result Analyzer (Go) — Aggregates and visualizes results

Without the DAL, teams would need to coordinate:

  • File formats and naming conventions
  • Data serialization (JSON vs CSV vs binary)
  • File system locations and cleanup

With the DAL:

  • Each model reads inputs and writes outputs through its standard interface
  • The WEM coordinates file locations automatically
  • Data flows from Python → Julia → Go with zero custom integration code

Input/Output Matching

When tasks have output dependencies, the WEM automatically determines which upstream output fields map to which downstream input fields.

Name-Based Matching

The WEM matches fields by name:

Task A Output:

{
  "demand_forecast": {...},
  "confidence_score": 0.85
}

Task B Input Definition:

{
  "demand_forecast": {...}  // Matched by name
}

Task B automatically receives the demand_forecast from Task A. The WEM constructs the input without manual configuration.

Type Validation

The WEM validates that matched fields have compatible types:

  • Ensures data integrity across task boundaries
  • Prevents type mismatches that would cause runtime errors
  • Provides early validation before execution begins
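A toy version of this early validation pass (illustrative only; the real WEM works from the models' declared schemas, not Python `isinstance` checks):

```python
# Declared types for the fields Task B expects, and an upstream output
# where one field has the wrong type.
schema = {"demand_forecast": dict, "confidence_score": float}
upstream_output = {"demand_forecast": {}, "confidence_score": "high"}

# Collect mismatches before execution begins, so the workflow can fail
# fast instead of erroring mid-run.
errors = [name for name, expected in schema.items()
          if name in upstream_output
          and not isinstance(upstream_output[name], expected)]
print(errors)  # ['confidence_score']
```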

Partial Matching

Downstream tasks don't need to consume all upstream outputs:

  • Task B can read only the fields it needs from Task A
  • Unused output fields are ignored
  • Multiple downstream tasks can read different subsets of the same output
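Name-based and partial matching together amount to a filtered field lookup. The sketch below assumes a downstream task declares the field names it needs; unmatched upstream fields are simply ignored.

```python
task_a_output = {
    "demand_forecast": {"mon": 120, "tue": 135},
    "confidence_score": 0.85,
}

# Task B's input definition names only the fields it consumes.
task_b_schema = ["demand_forecast"]

# The WEM-style match: pull each declared field from the upstream
# output by name, leaving the rest untouched.
task_b_input = {name: task_a_output[name]
                for name in task_b_schema
                if name in task_a_output}
print(task_b_input)  # {'demand_forecast': {'mon': 120, 'tue': 135}}
```

A second downstream task declaring `["confidence_score"]` would read its own subset of the same output without any copying or coordination.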

Caching with useCache

The DAL integrates with a caching system to avoid redundant computation. When useCache is enabled on a task, the WEM applies the process described below.

How Caching Works

  1. Before execution, the WEM computes a cache key from:
    • Task input data
    • Task parameter values
    • Task configuration
  2. The WEM checks if a cached result exists for that key
  3. If cache hit: Task is skipped, cached output is used
  4. If cache miss: Task executes normally, result is cached
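One common way to implement this hit/miss logic is to hash a canonical serialization of the inputs, params, and config. The sketch below is a minimal illustration, not the WEM's actual key scheme.

```python
import hashlib
import json


def cache_key(inputs, params, config):
    # Canonical JSON (sorted keys) so identical data always hashes the same.
    blob = json.dumps({"inputs": inputs, "params": params, "config": config},
                      sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()


cache = {}
expensive_calls = 0


def simulate(inputs, params):
    global expensive_calls
    expensive_calls += 1  # count how often the expensive body actually runs
    return sum(inputs["loads"]) * params["factor"]


def run_task(inputs, params, config, compute):
    key = cache_key(inputs, params, config)
    if key in cache:                       # cache hit: skip execution
        return cache[key]
    result = compute(inputs, params)       # cache miss: execute and store
    cache[key] = result
    return result


r1 = run_task({"loads": [1, 2]}, {"factor": 3}, {}, simulate)
r2 = run_task({"loads": [1, 2]}, {"factor": 3}, {}, simulate)
print(r1, r2, expensive_calls)  # 9 9 1
```

The second call returns the cached result, so the expensive body runs exactly once for identical inputs.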

Benefits

  • Faster experimentation — Changing one task doesn't re-run unchanged tasks
  • Cost savings — Expensive computations run only once for identical inputs
  • Reproducibility — Same inputs always produce same outputs (or use cached results)

When to Use Caching

Good candidates:

  • Expensive models (long-running simulations, complex optimizations)
  • Tasks with stable outputs (preprocessing, feature engineering)
  • Tasks early in the workflow (many dependent tasks benefit)

Avoid caching:

  • Non-deterministic models (outputs vary for same inputs)
  • Real-time data ingestion (always want fresh data)
  • Tasks with side effects (external API calls, database writes)

Data Transformation

Operators for Custom Transformations

When automatic field matching isn't enough, custom operators can transform data:

Use Cases:

  • Unit conversion (miles to kilometers)
  • Data reshaping (pivot tables, aggregations)
  • Filtering and sampling
  • Format conversion
  • Enrichment with external data

Operators are Python functions that read outputs from one or more tasks and produce transformed inputs for downstream tasks.
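A minimal operator might look like the sketch below: a hypothetical unit-conversion function whose field names (`distances_miles`, `distances_km`) are invented for illustration.

```python
def miles_to_km_operator(upstream_output: dict) -> dict:
    """Read one task's output and reshape it for the next task:
    convert distances from miles to kilometres."""
    MILES_TO_KM = 1.609344
    return {
        "distances_km": [d * MILES_TO_KM
                         for d in upstream_output["distances_miles"]]
    }


print(miles_to_km_operator({"distances_miles": [10.0]}))
```

Because the operator sits between task boundaries, neither the upstream nor the downstream model needs to know the other's units.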

IMPORT_DDK and EXPORT_DDK

Special component types connect workflows to external data systems:

IMPORT_DDK:

  • Queries data from DDK-generated GraphQL servers
  • Brings structured domain data into workflows
  • Example: Load rail network topology before running capacity simulation

EXPORT_DDK:

  • Writes workflow results back to DDK servers
  • Makes model outputs available to operational systems
  • Example: Export optimized schedules to fleet management system

Performance Optimization

Transfer Mode Selection

The WEM automatically selects the most efficient transfer mode:

  • Small data (< 1MB): Cache-based transfer for speed
  • Medium data (1MB - 100MB): File-based transfer for reliability
  • Large data (> 100MB): File-based with streaming where possible
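The threshold logic above reduces to a simple size check. This sketch hard-codes the sizes quoted in the list, but as noted under "Best Practices", the real thresholds are tunable per deployment.

```python
MB = 1 << 20  # 1 MiB


def select_mode(size_bytes: int) -> str:
    """Pick a transfer mode from the payload size (illustrative thresholds)."""
    if size_bytes < 1 * MB:
        return "cache"            # small: in-memory for speed
    if size_bytes <= 100 * MB:
        return "file"             # medium: file-based for reliability
    return "file+streaming"       # large: file-based, streamed where possible


print(select_mode(512 * 1024), select_mode(10 * MB), select_mode(500 * MB))
# cache file file+streaming
```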

Parallel Data Movement

When tasks have independent data dependencies:

  • Data transfers can happen in parallel
  • Multiple tasks can read from the same cached output simultaneously
  • Reduces overall workflow execution time

Cleanup and Resource Management

The DAL manages storage automatically:

  • Temporary files are cleaned up after downstream tasks complete
  • Cache entries have configurable expiration
  • Storage usage is monitored to prevent exhaustion

Best Practices

For Model Developers

  1. Use clear field names — Descriptive names improve automatic matching
  2. Document your schema — OpenAPI documentation helps users understand your model
  3. Keep outputs focused — Only produce data that downstream tasks need
  4. Use appropriate data types — Primitives (numbers, strings) transfer more reliably than complex objects

For Workflow Designers

  1. Enable caching on expensive tasks — Especially early-stage preprocessing
  2. Group related transformations — Reduce number of task boundaries
  3. Use operators for complex transformations — Don't force models to handle data reshaping
  4. Test with small data first — Validate logic before scaling to production datasets

For Platform Operators

  1. Monitor storage growth — Large workflows generate significant intermediate data
  2. Configure cache expiration — Balance performance vs storage costs
  3. Tune transfer mode thresholds — Based on infrastructure characteristics
  4. Implement cleanup policies — Automatic removal of old workflow data

See Also

User documentation for Optimal Reality