
Data Ingestion

[Platform architecture diagram: Platform Users (engineers, low-code ops) → ORA AI planning interface and SDK toolkits (FDK, DDK, MDK, SCDK) → SDK API GraphQL federation gateway → domain microservices (data pipeline, metrics & analytics, spatial & geo, simulation, event detection, fire & resource optimisation, satellite modelling) → deployed OR applications (rail, mine, and port ops dashboards) → Application Users (operations teams)]

Overview

The Data Ingestion service provides reliable data acquisition from external sources through HTTP-based APIs. It handles data retrieval with configurable retry logic, timeout management, and specialized support for real-time transit data formats (GTFS-Realtime).

This service is designed for teams that need to bring external data into workflows — from real-time feeds to periodic data pulls — with robust error handling and transformation capabilities.

Key Capabilities

HTTP Data Retrieval

  • Flexible HTTP Requests — GET, POST, PUT, DELETE operations
  • Retry Logic — Automatic retries for transient failures
  • Timeout Handling — Configurable timeouts to prevent hangs
  • Response Processing — Parse JSON, XML, and other formats
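
The retrieval loop behind these capabilities can be sketched as a generic retry wrapper. This is an illustrative pattern, not the service's implementation: `fetch_with_retry` and the stub `flaky` callable are assumptions for the example.

```python
import time

def fetch_with_retry(fetch, retries=3, base_delay=1.0, timeout=10.0):
    """Call `fetch(timeout)` until it succeeds or retries are exhausted.

    `fetch` is any callable that performs the HTTP request (e.g. a thin
    wrapper around an HTTP client's get/post) and raises on failure.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return fetch(timeout)
        except Exception as exc:  # in practice, catch transport errors only
            last_error = exc
            if attempt < retries:
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise last_error

# Demo with a stub that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky(timeout):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return {"status": 200, "body": "ok"}

result = fetch_with_retry(flaky, retries=3, base_delay=0.01)
```

The timeout is threaded through to the underlying HTTP call so a slow endpoint cannot hang the workflow indefinitely.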

GTFS-Realtime Support

  • Transit Feed Parsing — Specialized support for GTFS-R protocol buffers
  • Real-Time Updates — Vehicle positions, trip updates, service alerts
  • Data Transformation — Convert GTFS-R to structured formats
  • Feed Validation — Check data quality and completeness
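
Assuming the protocol-buffer feed has already been decoded (for example with the standard gtfs-realtime-bindings package) into plain dict-like structures, the transformation step might flatten vehicle positions into rows like this. The field names follow the GTFS-Realtime spec; the surrounding code is a sketch, not the service's internals.

```python
def flatten_vehicle_positions(feed):
    """Extract vehicle-position rows from a decoded GTFS-Realtime feed.

    `feed` mirrors the FeedMessage structure: a dict with an `entity`
    list, where each entity optionally carries a `vehicle` position.
    """
    rows = []
    for entity in feed.get("entity", []):
        vehicle = entity.get("vehicle")
        if not vehicle:
            continue  # entity may be a trip update or service alert instead
        pos = vehicle.get("position", {})
        rows.append({
            "vehicle_id": vehicle.get("vehicle", {}).get("id"),
            "trip_id": vehicle.get("trip", {}).get("trip_id"),
            "lat": pos.get("latitude"),
            "lon": pos.get("longitude"),
            "timestamp": vehicle.get("timestamp"),
        })
    return rows

feed = {"entity": [
    {"id": "1", "vehicle": {"vehicle": {"id": "bus-42"},
                            "trip": {"trip_id": "t-7"},
                            "position": {"latitude": -33.86, "longitude": 151.21},
                            "timestamp": 1700000000}},
    {"id": "2", "alert": {"header_text": "detour"}},  # skipped: no position
]}
rows = flatten_vehicle_positions(feed)
```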

Error Handling

  • Automatic Retries — Configurable retry attempts and backoff strategies
  • Timeout Protection — Prevent indefinite waits
  • Error Logging — Detailed error messages for debugging
  • Graceful Degradation — Partial data handling when possible

Data Transformation

  • Format Conversion — Transform between data formats
  • Field Mapping — Rename and restructure fields
  • Filtering — Extract only needed data
  • Enrichment — Add computed fields or metadata
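
The four transformation steps above compose naturally into a single pass over the records. A minimal sketch, assuming dict-shaped records (the `transform` helper and its parameters are illustrative, not the service's API):

```python
def transform(records, mapping, keep=None, enrich=None):
    """Apply field mapping, filtering, and enrichment to raw records.

    mapping: {source_field: target_field} renames
    keep:    predicate deciding which records to retain (filtering)
    enrich:  function returning computed fields to add to each record
    """
    out = []
    for rec in records:
        if keep and not keep(rec):
            continue
        renamed = {mapping.get(k, k): v for k, v in rec.items()}
        if enrich:
            renamed.update(enrich(renamed))
        out.append(renamed)
    return out

raw = [{"temp_c": 21.5, "site": "A"}, {"temp_c": None, "site": "B"}]
clean = transform(
    raw,
    mapping={"temp_c": "temperature_c"},                 # field mapping
    keep=lambda r: r["temp_c"] is not None,              # filtering
    enrich=lambda r: {"temperature_f": r["temperature_c"] * 9 / 5 + 32},
)
```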

Use Cases

Real-Time Transit Data Ingestion

Scenario: A transportation planning team needs live vehicle positions and arrival predictions from transit agencies.

Workflow:

  1. Configure GTFS-Realtime feed URLs
  2. Set polling interval (e.g., every 30 seconds)
  3. Ingest vehicle positions and trip updates
  4. Transform to standardized format
  5. Feed into traffic model or analysis workflow

Value: Enable real-time analysis and passenger information systems.

Periodic API Data Collection

Scenario: A supply chain team pulls inventory levels from multiple vendor APIs daily.

Workflow:

  1. Define API endpoints and authentication
  2. Schedule daily ingestion workflow
  3. Retrieve data from each vendor
  4. Standardize format across vendors
  5. Load into data warehouse for analysis

Value: Centralize data from disparate sources for unified analysis.

Weather Data Integration

Scenario: An agricultural planning workflow needs current and forecast weather data.

Workflow:

  1. Call weather API with location parameters
  2. Retrieve current conditions and forecasts
  3. Extract relevant fields (temperature, precipitation)
  4. Combine with crop and soil data
  5. Feed into irrigation or planting models

Value: Make decisions based on current and predicted conditions.

Market Data Feeds

Scenario: A pricing optimization system needs current market prices for raw materials.

Workflow:

  1. Connect to market data feeds
  2. Retrieve latest commodity prices
  3. Filter to relevant materials
  4. Transform to internal format
  5. Update pricing models with current data

Value: Keep pricing competitive based on real-time market conditions.

Model Inputs

The Data Ingestion service accepts:

  • API Endpoints — URLs to retrieve data from
  • Request Configuration — Headers, authentication, parameters
  • Retry Settings — Number of attempts, backoff strategy
  • Timeout Values — Maximum wait time for responses
  • Transformation Rules — How to process retrieved data

Model Outputs

The service produces:

  • Retrieved Data — Raw or transformed data from external sources
  • Status Information — Success/failure, retry counts, response times
  • Error Messages — Detailed diagnostics for failures
  • Metadata — Timestamps, source info, data quality indicators

Configuration Options

Key parameters you can configure:

  • Endpoint URL — Where to retrieve data from
  • HTTP Method — GET, POST, PUT, DELETE
  • Authentication — API keys, tokens, basic auth
  • Retry Count — How many times to retry failed requests
  • Retry Delay — Time between retry attempts
  • Timeout — Maximum wait time for response
  • Response Format — JSON, XML, Protobuf, etc.
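
Put together, a configuration covering these parameters might look like the following. The key names here are illustrative, not the service's actual schema; consult your workspace's configuration reference for the exact fields.

```python
# Hypothetical ingestion configuration covering the options above.
ingestion_config = {
    "endpoint_url": "https://api.example.com/v1/feed",
    "http_method": "GET",
    "auth": {"type": "api_key", "header": "X-API-Key"},  # key itself comes from a secret store
    "retry": {"count": 3, "delay_seconds": 2, "strategy": "exponential"},
    "timeout_seconds": 30,
    "response_format": "json",
}
```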

Supported Data Formats

GTFS-Realtime (Protocol Buffers)

Specialized support for transit data:

  • Vehicle Positions — Real-time location of transit vehicles
  • Trip Updates — Predicted arrival/departure times
  • Service Alerts — Disruptions, detours, schedule changes

JSON

Standard JSON parsing and transformation:

  • Nested objects and arrays
  • Field extraction and mapping
  • Schema validation

XML

XML document processing:

  • Element and attribute extraction
  • XPath queries
  • Namespace handling
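
As a sketch of these three capabilities using Python's standard library (the document and namespace URI are invented for the example; `xml.etree.ElementTree` supports a limited subset of XPath):

```python
import xml.etree.ElementTree as ET

doc = """<feed xmlns:t="http://example.com/transit">
  <t:stop id="s1"><t:name>Central</t:name></t:stop>
  <t:stop id="s2"><t:name>Harbour</t:name></t:stop>
</feed>"""

root = ET.fromstring(doc)
ns = {"t": "http://example.com/transit"}            # namespace handling
stops = root.findall("t:stop", ns)                  # XPath-style query
names = [s.findtext("t:name", namespaces=ns) for s in stops]  # element extraction
ids = [s.get("id") for s in stops]                  # attribute extraction
```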

CSV

Comma-separated values:

  • Header detection
  • Field parsing
  • Type conversion
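
These three steps map directly onto the standard library's `csv` module; a minimal example with made-up stop data:

```python
import csv
import io

raw = "stop_id,lat,lon\n101,-33.86,151.21\n102,-33.87,151.20\n"

reader = csv.DictReader(io.StringIO(raw))           # header detection + field parsing
rows = [{"stop_id": int(r["stop_id"]),              # type conversion
         "lat": float(r["lat"]),
         "lon": float(r["lon"])} for r in reader]
```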

Integration with Other Models

The Data Ingestion service works well with:

  • Data Loader — Ingest data, then load into databases
  • Traffic Model — Feed real-time transit data into simulations
  • Tiny Time Mixers — Ingest historical data for forecasting
  • AI Agent Python — Agents can trigger ingestion based on needs

Retry Strategies

Exponential Backoff

  • First retry: Wait 1 second
  • Second retry: Wait 2 seconds
  • Third retry: Wait 4 seconds
  • Reduces load on failing services
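
The doubling schedule above can be computed as `base * 2**attempt`, usually with a cap and random jitter so many clients recovering at once don't retry in lockstep. A sketch (the helper name and defaults are assumptions):

```python
import random

def backoff_delays(retries, base=1.0, cap=30.0, jitter=True):
    """Exponential backoff schedule: base * 2**attempt, capped, with
    optional full jitter to avoid synchronized retry storms."""
    delays = []
    for attempt in range(retries):
        delay = min(cap, base * (2 ** attempt))
        delays.append(random.uniform(0, delay) if jitter else delay)
    return delays

schedule = backoff_delays(3, base=1.0, jitter=False)  # 1s, 2s, 4s
```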

Fixed Interval

  • Wait same amount between each retry
  • Simpler but can overwhelm recovering services

Immediate Retry

  • Retry without delay
  • For transient network issues
  • Risk of overwhelming service

Error Handling Patterns

Transient Failures

Network issues, temporary unavailability:

  • Automatic retry with backoff
  • Log warnings but don't fail workflow
  • Return partial data if some sources succeed

Permanent Failures

Invalid credentials, non-existent endpoints:

  • Fail immediately without retries
  • Provide clear error message
  • Don't block unrelated workflow tasks

Partial Success

Some data retrieved, some failed:

  • Return available data with warnings
  • Log which sources failed
  • Allow workflow to continue with partial data
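
The partial-success pattern amounts to attempting every source independently and returning whatever succeeded alongside a failure log. A minimal sketch (the `ingest_all` helper and the vendor stubs are invented for illustration):

```python
def ingest_all(sources):
    """Attempt every source; return successful data plus a failure log,
    so the workflow can continue with partial data."""
    data, failures = {}, {}
    for name, fetch in sources.items():
        try:
            data[name] = fetch()
        except Exception as exc:
            failures[name] = str(exc)  # record which sources failed and why
    return data, failures

def failing_fetch():
    raise TimeoutError("no response")

data, failures = ingest_all({
    "vendor_a": lambda: [{"sku": "X", "qty": 5}],
    "vendor_b": failing_fetch,
})
```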

Performance Notes

  • Polling Frequency — More frequent polling increases load and costs
  • Batch When Possible — Retrieve multiple items in single request
  • Use Caching — Cache responses when data doesn't change rapidly
  • Parallel Requests — Call independent endpoints concurrently
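
The parallel-requests note can be sketched with a thread pool, since HTTP calls are I/O-bound. The `fetch_weather` stub stands in for a real HTTP call:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_weather(city):
    # Stand-in for a real HTTP request to a weather API.
    return {"city": city, "temp_c": 20.0}

cities = ["Sydney", "Perth", "Brisbane"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() runs the calls concurrently but preserves input order.
    results = list(pool.map(fetch_weather, cities))
```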

Getting Started

Basic Workflow

  1. Identify Data Source — Find API endpoint for needed data
  2. Configure Request — Set URL, auth, parameters
  3. Set Retry Policy — Choose retry count and delays
  4. Add to Workflow — Drag Data Ingestion into canvas
  5. Test & Monitor — Verify data retrieval and handle errors

Example: Simple API Call

[Data Ingestion: API] → [Process Data] → [Store Results]

This workflow retrieves data from an API, processes it, and stores the results.

Example: Scheduled Data Collection

[Schedule Trigger] → [Data Ingestion: Multiple Sources] → [Data Loader] → [Database]

This workflow runs on a schedule, ingests from multiple sources, and loads into a database.

Best Practices

Reliability

  1. Use Retries — Networks are unreliable; plan for failures
  2. Set Timeouts — Don't wait forever for responses
  3. Log Everything — Track successes, failures, and retries
  4. Monitor Patterns — Watch for recurring failures to fix root causes

Performance

  1. Cache When Appropriate — Don't re-fetch static or slow-changing data
  2. Batch Requests — Reduce overhead with fewer, larger requests
  3. Parallel Retrieval — Call independent sources concurrently
  4. Compress Data — Use gzip for large transfers if supported

Security

  1. Secure Credentials — Don't hardcode API keys in configurations
  2. Use HTTPS — Encrypt data in transit
  3. Validate Certificates — Ensure you're connecting to legitimate endpoints
  4. Rate Limit — Respect API provider limits to avoid bans

Data Quality

  1. Validate Responses — Check data structure and values
  2. Handle Missing Fields — Gracefully deal with incomplete data
  3. Timestamp Everything — Record when data was retrieved
  4. Version APIs — Track which API version you're using

Troubleshooting

Requests Timing Out

  • Increase timeout value
  • Check if endpoint is slow or down
  • Try smaller requests if API supports pagination
  • Verify network connectivity

Authentication Failures

  • Verify credentials are correct and current
  • Check if API key has expired
  • Ensure proper authentication header format
  • Review API documentation for auth requirements

Data Format Errors

  • Verify API is returning expected format
  • Check if API version has changed
  • Review transformation rules
  • Log raw responses for debugging

Retry Exhaustion

  • The API may be down; check its status page
  • If the failure is transient, increase retry count or delays
  • Investigate root cause if persistent
  • Set up alerts for repeated failures
