
Data Ingestion

[Platform architecture diagram: developer kits (FDK, DDK, MDK, SCDK) and the SDK GraphQL federation gateway sit above the domain microservices — including the Data Pipeline — which back the deployed OR applications (Rail Ops, Mine Mgmt, and Port Ops dashboards), spanning platform users through to operations teams.]

Overview

The Data Ingestion service is the primary entry point for polling-based external data into the OR platform. It collects data from HTTP, FTP, and SQS sources — including client internal feeds (via the ESB), third-party APIs, and government open data endpoints — and forwards it to the Data Transformer for normalisation into OR-compliant formats.

Each external API endpoint has its own schema, authentication method, and update frequency. Rather than building bespoke integrations per source, Data Ingestion provides a configuration-driven ingestion framework that handles the polling lifecycle, credential management, and delivery to the transformation layer. This pattern keeps complexity low and latency minimal, while making it straightforward to add or remove data sources as operational requirements evolve.

Within the broader data pipeline, Data Ingestion sits at the earliest stage — upstream of Data Transformer, Data Fusion, and all downstream consumers. It operates alongside Data Stream Ingestion (which handles event-driven AMQP sources) and Data Redis Ingestion (which handles Redis-based computer vision feeds), collectively forming the platform's ingestion layer.

Architecture

  • Port: :7000
  • Language: Julia
  • Scaling: Fully distributable (multi-tenant or single-tenant)
  • Protocols: HTTPS, SFTP, SQS

Key Components

  • Configuration-driven polling — A per-environment YAML config defines each data source with its protocol, target URL, request type, response format, headers, API key references, and polling frequency. API keys are loaded as environment variables; certificates are mounted via volume mounts.
  • Redis-managed frequency control — A Redis connection controls request scheduling across tenants. Each source has a Redis key whose TTL governs when the next poll is allowed. This enables multi-tenant scaling where multiple pods can coordinate without duplicate requests.
  • Threaded ingestion — When a source becomes eligible to poll (signalled by Redis keyspace expiry), the request runs on a separate thread. Received data is forwarded via HTTP POST to the Data Transformer's /data endpoint.
  • HTTP health server — The first thread runs an HTTP server for Kubernetes /ready and /live health checks.
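The Redis TTL gate described above can be sketched as follows. The service itself is written in Julia and reacts to keyspace-expiry notifications; this is an illustrative Python sketch of the underlying idea using an in-memory stand-in for Redis, with the key naming (`ingest:<source>`) and function names being assumptions, not the service's actual identifiers:

```python
import time

class FakeRedis:
    """Minimal in-memory stand-in for the one Redis operation the pattern
    needs (SET key NX EX ttl). Illustration only -- the real service talks
    to an actual Redis instance."""
    def __init__(self):
        self._store = {}  # key -> expiry timestamp

    def set_nx_ex(self, key, ttl_seconds):
        now = time.monotonic()
        expiry = self._store.get(key)
        if expiry is not None and expiry > now:
            return False  # key still alive: some pod polled recently
        self._store[key] = now + ttl_seconds
        return True

def try_acquire_poll(redis, source_name, frequency_seconds):
    """Return True if this pod should poll `source_name` now.

    SET .. NX EX is atomic in real Redis, so exactly one pod wins the
    slot per frequency window -- no duplicate requests across tenants."""
    return redis.set_nx_ex(f"ingest:{source_name}", frequency_seconds)

r = FakeRedis()
first = try_acquire_poll(r, "ADDINSIGHT", 60)   # wins the slot
second = try_acquire_poll(r, "ADDINSIGHT", 60)  # blocked until the TTL lapses
```

Note that rather than re-checking the key in a loop as this sketch implies, the real service is driven by Redis keyspace expiry events, so a poll is triggered as soon as the key's TTL elapses.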

Data Flow

External APIs / SFTP / SQS
        ↓ (configurable polling frequency)
Data Ingestion [:7000]
        ↓ (HTTP POST with target + message)
Data Transformer [:5800]
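The forwarding step in the flow above can be sketched as a payload builder. The exact field names (`target`, `message`) are assumptions for illustration; the flow only states that the POST to the Data Transformer carries a target plus a message:

```python
import json

def build_transformer_payload(target_name, message):
    """Assemble the body POSTed to the Data Transformer's /data endpoint.

    `target_name` matches a target_name from the ingestion config, which
    the Data Transformer uses to route processing; `message` is the raw
    response payload from the external source."""
    return json.dumps({"target": target_name, "message": message})

body = build_transformer_payload(
    "ADDINSIGHT_LINKS",
    [{"id": 42, "latest_stats": {"travel_time": 180}}],
)
```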

Configuration

The ingestion config defines data sources as a YAML dictionary. Each source specifies its protocol, polling frequency, and one or more target endpoints:

```yaml
DATA_SOURCES:
  ADDINSIGHT:
    protocol: https
    ingestion_frequency: 60
    target_config:
      - target_name: ADDINSIGHT_LINKS
        url: https://data-exchange-api.vicroads.vic.gov.au/bluetooth_data/links?expand=latest_stats
        http_config:
          request: GET
          response_type: Vector{Dict}
          headers:
            Content-Type: application/json
            Ocp-Apim-Subscription-Key: '{API_KEY_ADDINSIGHT}'
```

Key configuration fields:

| Field | Description |
| --- | --- |
| `protocol` | Connection type: `https`, `sftp`, or `sqs` |
| `ingestion_frequency` | Polling interval in seconds |
| `target_name` | Identifier used by the Data Transformer to route processing |
| `url` | Source endpoint |
| `http_config` | Request method, response type, and headers |
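The `'{API_KEY_ADDINSIGHT}'` placeholder in the sample config is resolved from an environment variable, since the docs note that API keys are loaded as environment variables. A minimal sketch of that substitution, assuming the brace syntax shown in the config (the function name and regex are illustrative, not the service's implementation):

```python
import os
import re

def resolve_placeholders(value, env=os.environ):
    """Replace '{SOME_ENV_VAR}' placeholders in a config string with the
    matching environment-variable value. Raises KeyError if the variable
    is unset, which surfaces missing credentials at startup."""
    return re.sub(r"\{([A-Z0-9_]+)\}", lambda m: env[m.group(1)], value)

# Hypothetical key value injected for the example:
header = resolve_placeholders(
    "{API_KEY_ADDINSIGHT}",
    env={"API_KEY_ADDINSIGHT": "secret-123"},
)
# header == "secret-123"
```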

Ingested Sources

The service currently polls a wide range of client and third-party sources:

| Source | Protocol | Frequency |
| --- | --- | --- |
| AddInsight Links | HTTPS | 30s |
| STREAMS Vehicle Detectors | HTTPS | 5 min |
| RTDMS | HTTPS | 60s |
| SITREP (Road Closures) | HTTPS | 30s |
| LUMS | HTTPS | 10 min |
| VSLS | HTTPS | 10 min |
| VMS / VMS Composites | HTTPS | 10 min |
| Metro Train Positions | HTTPS | 30s |
| Metro Train Trip Updates | HTTPS | 30s |
| Metro Train Service Alerts | HTTPS | 30s |
| PTV Disruptions | HTTPS | 5 min |
| BOM Rainfall | SFTP | 5 min |
| RAI Jobs | HTTPS | 5 min |
| IRS EyeFi | HTTPS | 30s |
| ServiceNow IRS | HTTPS | 30 min |
| Tow Allocation | HTTPS | 60s |
| RWE | HTTPS | 5 min |
| ETS | HTTPS | 5 min |
| Ramp AHS / Metering / Operations | HTTPS | 5 min |
| Off Ramps | HTTPS | 5 min |
| ESLS | HTTPS | 2 min |
| SCATS Site Status / PFL | HTTPS | 1 hour |
| RID Impacts | SQS | Variable (long poll) |
| OneView | HTTPS | 5 min |

SQS Ingestion (RID)

RID publishes impact data onto an Amazon SQS queue (via SNS). Unlike HTTP sources that poll on a fixed interval, SQS ingestion uses a long polling approach: the service requests data and waits up to 20 seconds for a message, reducing both end-to-end latency and API call volume.

Long polling is achieved by intentionally deleting the Redis key that controls ingestion frequency for RID when a message is received, allowing a new ingestion cycle to start immediately rather than waiting for key expiry.
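One RID ingestion cycle, combining the long poll with the key-deletion trick, can be sketched as below. The SQS receive, Redis delete, and downstream forward are injected as stubs so the control flow stands alone; the callable names and the `ingest:RID` key are illustrative assumptions:

```python
def sqs_ingest_cycle(receive_message, redis_delete, forward):
    """One RID ingestion cycle.

    Long-poll SQS (up to 20 s in the real service), forward any message
    downstream, then delete the Redis frequency key so the next cycle
    starts immediately instead of waiting for the key's TTL to lapse."""
    message = receive_message()          # blocks up to the long-poll window
    if message is not None:
        forward(message)                 # hand off to the Data Transformer
        redis_delete("ingest:RID")       # re-arm ingestion immediately
        return True
    return False                         # timed out: wait for the next cycle

deleted, forwarded = [], []
got = sqs_ingest_cycle(
    receive_message=lambda: {"impact_id": "abc"},
    redis_delete=deleted.append,
    forward=forwarded.append,
)
```

The timeout branch returns without touching the Redis key, so an idle queue falls back to ordinary TTL-gated scheduling.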

CloudWatch logging tracks which RID impacts have been successfully read from the queue, supporting debugging and audit workflows.

Related Services

  • Data Transformer — Downstream consumer that normalises ingested data
  • Data Stream Ingestion — Sibling ingestion service for AMQP event streams
  • Data Redis Ingestion — Sibling ingestion service for Redis-based sources
  • Data Fusion — Fuses transformed data from multiple sources
  • Experiment Manager — Central coordination service (GraphQL on :5100)

User documentation for Optimal Reality