Data Ingestion

Overview

The Data Ingestion service is the primary entry point for polled external data into the OR platform. It collects data from HTTPS, SFTP, and SQS sources, including client internal feeds (via the ESB), third-party APIs, and government open data endpoints, and forwards it to the Data Transformer for normalisation into OR-compliant formats.

Each external API endpoint has its own schema, authentication method, and update frequency. Rather than building a bespoke integration per source, Data Ingestion provides a configuration-driven framework that handles the polling lifecycle, credential management, and delivery to the transformation layer. This pattern keeps per-source complexity and added latency low, and makes it straightforward to add or remove data sources as operational requirements evolve.

Within the broader data pipeline, Data Ingestion sits at the earliest stage — upstream of Data Transformer, Data Fusion, and all downstream consumers. It operates alongside Data Stream Ingestion (which handles event-driven AMQP sources) and Data Redis Ingestion (which handles Redis-based computer vision feeds), collectively forming the platform's ingestion layer.

Architecture

Port: :7000
Language: Julia
Scaling: Fully distributable (multi-tenant or single-tenant)
Protocols: HTTPS, SFTP, SQS

Key Components

Configuration-driven polling — A per-environment YAML config defines each data source with its protocol, target URL, request type, response format, headers, API key references, and polling frequency. API keys are loaded as environment variables; certificates are mounted via volume mounts.
Redis-managed frequency control — A Redis connection controls request scheduling across tenants. Each source has a Redis key whose TTL governs when the next poll is allowed. This enables multi-tenant scaling where multiple pods can coordinate without duplicate requests.
Threaded ingestion — When a request is available (based on Redis keyspace expiry), it runs on a separate thread. Data received is forwarded via HTTP POST to the Data Transformer /data endpoint.
HTTP health server — The first thread runs an HTTP server for Kubernetes /ready and /live health checks.
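
The frequency control and threaded dispatch described above can be sketched as follows. This is an illustrative Python sketch, not the service's Julia code. It assumes the "claim" on a source is taken with a Redis `SET` using the `NX` and `EX` options, which the source does not confirm. The `InMemoryTTLStore` class, the `ingestion:<source>` key naming, and the function names are all hypothetical; the store only mirrors the shape of redis-py's `set(name, value, nx=True, ex=seconds)` so the example runs without a Redis server.

```python
import threading
import time


class InMemoryTTLStore:
    """Stand-in for Redis that supports SET with NX + EX only (assumption)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expiry = {}  # key -> absolute expiry time
        self._lock = threading.Lock()

    def set(self, name, value, nx=False, ex=None):
        # `value` is ignored: only the key's presence and TTL matter here.
        with self._lock:
            now = self._clock()
            expires_at = self._expiry.get(name)
            live = expires_at is not None and expires_at > now
            if nx and live:
                return None  # redis-py also returns None when NX fails
            self._expiry[name] = now + ex if ex is not None else float("inf")
            return True


def try_claim_poll(store, source_name, frequency_seconds):
    """Return True if this worker may poll `source_name` now."""
    key = f"ingestion:{source_name}"  # hypothetical key naming
    return store.set(key, "1", nx=True, ex=frequency_seconds) is True


def run_due_sources(store, sources, ingest):
    """Start one thread per source whose key has expired; return the threads."""
    threads = []
    for name, frequency in sources.items():
        if try_claim_poll(store, name, frequency):
            worker = threading.Thread(target=ingest, args=(name,), daemon=True)
            worker.start()
            threads.append(worker)
    return threads
```

Because the claim is a single atomic operation, two pods that call `try_claim_poll` for the same source within one polling window cannot both succeed, which is one way to obtain the duplicate-free coordination the list describes.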

Data Flow

External APIs / SFTP / SQS
        ↓ (configurable polling frequency)
Data Ingestion [:7000]
        ↓ (HTTP POST with target + message)
Data Transformer [:5800]
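
The hand-off in the last step of this flow might look like the sketch below, written in Python for illustration. The diagram only states that the POST carries a target and a message, so the JSON field names `target` and `message`, the `data-transformer` hostname, and both function names are assumptions. Only the port `:5800` and the `/data` path come from this page.

```python
import json
import urllib.request

TRANSFORMER_URL = "http://data-transformer:5800/data"  # hypothetical hostname


def build_forward_request(target_name, message, url=TRANSFORMER_URL):
    """Build (but do not send) the POST that hands one payload to Data Transformer."""
    # Assumed body shape: the target name plus the raw ingested message.
    body = json.dumps({"target": target_name, "message": message}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def forward(target_name, message, url=TRANSFORMER_URL, timeout=10):
    """Send the payload and return the HTTP status code."""
    request = build_forward_request(target_name, message, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Splitting request construction from sending keeps the payload shape checkable without a running Data Transformer.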

Configuration

The ingestion config defines data sources as a YAML dictionary. Each source specifies its protocol, polling frequency, and one or more target endpoints:

```yaml
DATA_SOURCES:
  ADDINSIGHT:
    protocol: https
    ingestion_frequency: 60
    target_config:
      - target_name: ADDINSIGHT_LINKS
        url: https://data-exchange-api.vicroads.vic.gov.au/bluetooth_data/links?expand=latest_stats
        http_config:
          request: GET
          response_type: Vector{Dict}
          headers:
            Content-Type: application/json
            Ocp-Apim-Subscription-Key: '{API_KEY_ADDINSIGHT}'
```

Key configuration fields:

| Field | Description |
| --- | --- |
| `protocol` | Connection type: `https`, `sftp`, or `sqs` |
| `ingestion_frequency` | Polling interval in seconds |
| `target_name` | Identifier used by Data Transformer to route processing |
| `url` | Source endpoint |
| `http_config` | Request method, response type, and headers |
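
The example config refers to its API key as `'{API_KEY_ADDINSIGHT}'`, and API keys are loaded as environment variables. One plausible way to connect the two is a placeholder substitution pass over the parsed config, sketched here in Python. The `{NAME}` matching rule, the `resolve_placeholders` function, and the choice to fail on a missing variable are assumptions, not documented behaviour of the service.

```python
import os
import re

# Matches placeholders such as {API_KEY_ADDINSIGHT} (assumed naming rule).
PLACEHOLDER = re.compile(r"\{([A-Z][A-Z0-9_]*)\}")


def resolve_placeholders(value, env=None):
    """Recursively replace `{NAME}` in strings with the value of env var NAME."""
    env = os.environ if env is None else env
    if isinstance(value, str):
        # A missing variable raises KeyError, so a bad config fails at load time.
        return PLACEHOLDER.sub(lambda match: env[match.group(1)], value)
    if isinstance(value, dict):
        return {key: resolve_placeholders(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_placeholders(item, env) for item in value]
    return value  # numbers, booleans, and None pass through unchanged
```

Applied to the `headers` mapping from the example, this would replace the subscription key placeholder and leave `Content-Type` untouched.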

Ingested Sources

The service currently polls a wide range of client and third-party sources:

| Source | Protocol | Frequency |
| --- | --- | --- |
| AddInsight Links | HTTPS | 30s |
| STREAMS Vehicle Detectors | HTTPS | 5 min |
| RTDMS | HTTPS | 60s |
| SITREP (Road Closures) | HTTPS | 30s |
| LUMS | HTTPS | 10 min |
| VSLS | HTTPS | 10 min |
| VMS / VMS Composites | HTTPS | 10 min |
| Metro Train Positions | HTTPS | 30s |
| Metro Train Trip Updates | HTTPS | 30s |
| Metro Train Service Alerts | HTTPS | 30s |
| PTV Disruptions | HTTPS | 5 min |
| BOM Rainfall | SFTP | 5 min |
| RAI Jobs | HTTPS | 5 min |
| IRS EyeFi | HTTPS | 30s |
| ServiceNow IRS | HTTPS | 30 min |
| Tow Allocation | HTTPS | 60s |
| RWE | HTTPS | 5 min |
| ETS | HTTPS | 5 min |
| Ramp AHS / Metering / Operations | HTTPS | 5 min |
| Off Ramps | HTTPS | 5 min |
| ESLS | HTTPS | 2 min |
| SCATS Site Status / PFL | HTTPS | 1 hour |
| RID Impacts | SQS | Variable (long poll) |
| OneView | HTTPS | 5 min |

SQS Ingestion (RID)

RID publishes impact data onto an Amazon SQS queue (via SNS). Unlike HTTP sources that poll on a fixed interval, SQS ingestion uses a long polling approach: the service requests data and waits up to 20 seconds for a message, reducing both end-to-end latency and API call volume.

The service keeps this long-poll loop continuous by deliberately deleting the Redis key that controls RID's ingestion frequency whenever a message is received, so a new ingestion cycle starts immediately instead of waiting for the key to expire.
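
A single cycle of this behaviour can be sketched in Python as below. The sketch assumes a boto3-style SQS client (its `receive_message` call accepts `QueueUrl`, `WaitTimeSeconds`, and `MaxNumberOfMessages`) and a redis-py-style store with a `delete` method, both passed in by the caller. The function name, the `ingestion:RID_IMPACTS` key, and the `handle_message` callback are hypothetical, and message acknowledgement is deliberately left out because this page does not describe it.

```python
def poll_rid_once(sqs_client, store, queue_url, handle_message,
                  frequency_key="ingestion:RID_IMPACTS", wait_seconds=20):
    """Long-poll the queue once; clear the frequency key if anything arrived."""
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        WaitTimeSeconds=wait_seconds,  # SQS caps long-poll waits at 20 seconds
        MaxNumberOfMessages=10,        # SQS maximum per call
    )
    # The "Messages" entry is absent when the wait times out with nothing queued.
    messages = response.get("Messages", [])
    for message in messages:
        handle_message(message["Body"])
    if messages:
        # Dropping the key lets the next cycle start without waiting for expiry.
        store.delete(frequency_key)
    return len(messages)
```

When the queue is empty the key is left alone, so the next attempt is governed by the normal TTL rather than starting straight away.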

CloudWatch logging tracks which RID impacts have been successfully read from the queue, supporting debugging and audit workflows.

Related Services

Data Transformer — Downstream consumer that normalises ingested data
Data Stream Ingestion — Sibling ingestion service for AMQP event streams
Data Redis Ingestion — Sibling ingestion service for Redis-based sources
Data Fusion — Fuses transformed data from multiple sources
Experiment Manager — Central coordination service (GraphQL on :5100)
