Data Ingestion
Overview
The Data Ingestion service provides reliable data acquisition from external sources through HTTP-based APIs. It handles data retrieval with configurable retry logic, timeout management, and specialized support for real-time transit data formats (GTFS-Realtime).
This service is designed for teams that need to bring external data into workflows — from real-time feeds to periodic data pulls — with robust error handling and transformation capabilities.
Key Capabilities
HTTP Data Retrieval
- Flexible HTTP Requests — GET, POST, PUT, DELETE operations
- Retry Logic — Automatic retries for transient failures
- Timeout Handling — Configurable timeouts to prevent hangs
- Response Processing — Parse JSON, XML, and other formats
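The retry and timeout behavior described above can be sketched with Python's standard library. This is a minimal illustration, not the service's actual implementation; `fetch_with_retry` and its parameters are hypothetical names chosen for the example:

```python
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, retries=3, delay=1.0, timeout=10.0,
                     opener=urllib.request.urlopen):
    """Fetch `url`, retrying transient failures up to `retries` times."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except urllib.error.URLError as err:
            last_error = err
            if attempt < retries:
                time.sleep(delay)  # wait before the next attempt
    raise last_error
```

Passing the opener as a parameter keeps the retry logic testable without a live endpoint.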
GTFS-Realtime Support
- Transit Feed Parsing — Specialized support for GTFS-R protocol buffers
- Real-Time Updates — Vehicle positions, trip updates, service alerts
- Data Transformation — Convert GTFS-R to structured formats
- Feed Validation — Check data quality and completeness
Error Handling
- Automatic Retries — Configurable retry attempts and backoff strategies
- Timeout Protection — Prevent indefinite waits
- Error Logging — Detailed error messages for debugging
- Graceful Degradation — Partial data handling when possible
Data Transformation
- Format Conversion — Transform between data formats
- Field Mapping — Rename and restructure fields
- Filtering — Extract only needed data
- Enrichment — Add computed fields or metadata
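Field mapping, filtering, and enrichment can be combined in a single pass over each record. The helper below is an illustrative sketch (the function name, parameters, and sample fields are assumptions, not part of the service's schema):

```python
def transform(record, field_map, keep, extra=None):
    """Rename fields per `field_map`, keep only the fields in `keep`,
    and merge `extra` metadata into the result."""
    renamed = {field_map.get(k, k): v for k, v in record.items()}
    filtered = {k: v for k, v in renamed.items() if k in keep}
    filtered.update(extra or {})
    return filtered

raw = {"temp_f": 68, "stn": "KSFO", "noise": 1}
clean = transform(
    raw,
    field_map={"temp_f": "temperature_f", "stn": "station"},
    keep={"temperature_f", "station"},
    extra={"retrieved_at": "2024-01-01T00:00:00Z"},  # illustrative timestamp
)
# clean == {"temperature_f": 68, "station": "KSFO",
#           "retrieved_at": "2024-01-01T00:00:00Z"}
```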
Use Cases
Real-Time Transit Data Ingestion
Scenario: A transportation planning team needs live vehicle positions and arrival predictions from transit agencies.
Workflow:
- Configure GTFS-Realtime feed URLs
- Set polling interval (e.g., every 30 seconds)
- Ingest vehicle positions and trip updates
- Transform to standardized format
- Feed into traffic model or analysis workflow
Value: Enable real-time analysis and passenger information systems.
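The polling step in this workflow amounts to calling the ingestion on a fixed interval. A minimal sketch, assuming the actual ingestion is injected as a callable (`poll` and its signature are illustrative, not the service's API):

```python
import time

def poll(ingest, interval_seconds=30, iterations=None):
    """Call `ingest()` every `interval_seconds`.
    Stop after `iterations` cycles; `iterations=None` polls forever."""
    results = []
    count = 0
    while iterations is None or count < iterations:
        results.append(ingest())
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval_seconds)  # wait for the next cycle
    return results
```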
Periodic API Data Collection
Scenario: A supply chain team pulls inventory levels from multiple vendor APIs daily.
Workflow:
- Define API endpoints and authentication
- Schedule daily ingestion workflow
- Retrieve data from each vendor
- Standardize format across vendors
- Load into data warehouse for analysis
Value: Centralize data from disparate sources for unified analysis.
Weather Data Integration
Scenario: An agricultural planning workflow needs current and forecast weather data.
Workflow:
- Call weather API with location parameters
- Retrieve current conditions and forecasts
- Extract relevant fields (temperature, precipitation)
- Combine with crop and soil data
- Feed into irrigation or planting models
Value: Make decisions based on current and predicted conditions.
Market Data Feeds
Scenario: A pricing optimization system needs current market prices for raw materials.
Workflow:
- Connect to market data feeds
- Retrieve latest commodity prices
- Filter to relevant materials
- Transform to internal format
- Update pricing models with current data
Value: Keep pricing competitive based on real-time market conditions.
Model Inputs
The Data Ingestion service accepts:
- API Endpoints — URLs to retrieve data from
- Request Configuration — Headers, authentication, parameters
- Retry Settings — Number of attempts, backoff strategy
- Timeout Values — Maximum wait time for responses
- Transformation Rules — How to process retrieved data
Model Outputs
The service produces:
- Retrieved Data — Raw or transformed data from external sources
- Status Information — Success/failure, retry counts, response times
- Error Messages — Detailed diagnostics for failures
- Metadata — Timestamps, source info, data quality indicators
Configuration Options
Key parameters you can configure:
- Endpoint URL — Where to retrieve data from
- HTTP Method — GET, POST, PUT, DELETE
- Authentication — API keys, tokens, basic auth
- Retry Count — How many times to retry failed requests
- Retry Delay — Time between retry attempts
- Timeout — Maximum wait time for response
- Response Format — JSON, XML, Protobuf, etc.
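Taken together, these parameters might look like the following configuration. The key names are hypothetical, shown only to illustrate how the options fit together, and do not reflect the service's actual schema:

```python
# Illustrative request configuration; key names and the endpoint URL
# are assumptions, not the service's actual configuration schema.
ingestion_config = {
    "endpoint_url": "https://api.example.com/v1/inventory",
    "http_method": "GET",
    "authentication": {"type": "api_key", "header": "X-API-Key"},
    "retry_count": 3,
    "retry_delay_seconds": 2,
    "timeout_seconds": 30,
    "response_format": "json",
}
```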
Supported Data Formats
GTFS-Realtime (Protocol Buffers)
Specialized support for transit data:
- Vehicle Positions — Real-time location of transit vehicles
- Trip Updates — Predicted arrival/departure times
- Service Alerts — Disruptions, detours, schedule changes
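Converting GTFS-R to a structured format typically means flattening feed entities into tabular rows. The sketch below assumes the feed has already been decoded from protocol buffers into plain dicts (real feeds are binary protobuf; the field names follow the GTFS-Realtime message structure, but the helper itself is illustrative):

```python
def flatten_vehicle_positions(feed):
    """Flatten GTFS-Realtime vehicle-position entities (represented here
    as plain dicts, as if already decoded from protobuf) into rows."""
    rows = []
    for entity in feed.get("entity", []):
        vehicle = entity.get("vehicle", {})
        position = vehicle.get("position", {})
        rows.append({
            "entity_id": entity.get("id"),
            "trip_id": vehicle.get("trip", {}).get("trip_id"),
            "latitude": position.get("latitude"),
            "longitude": position.get("longitude"),
            "timestamp": vehicle.get("timestamp"),
        })
    return rows

feed = {"entity": [{"id": "1", "vehicle": {
    "trip": {"trip_id": "T42"},
    "position": {"latitude": 37.77, "longitude": -122.42},
    "timestamp": 1700000000,
}}]}
rows = flatten_vehicle_positions(feed)
```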
JSON
Standard JSON parsing and transformation:
- Nested objects and arrays
- Field extraction and mapping
- Schema validation
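Extracting a field from nested objects and arrays can be sketched with a dotted-path walker (a minimal illustration using the standard library; `extract` is a hypothetical helper, not the service's API):

```python
import json

def extract(payload, path, default=None):
    """Walk a dotted `path` through nested dicts and lists;
    return `default` when a key is missing."""
    node = payload
    for key in path.split("."):
        if isinstance(node, list):
            node = node[int(key)]  # numeric segment indexes into a list
        elif isinstance(node, dict) and key in node:
            node = node[key]
        else:
            return default
    return node

doc = json.loads('{"station": {"readings": [{"temp_c": 21.5}]}}')
temp = extract(doc, "station.readings.0.temp_c")
```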
XML
XML document processing:
- Element and attribute extraction
- XPath queries
- Namespace handling
CSV
Comma-separated values:
- Header detection
- Field parsing
- Type conversion
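Header detection and type conversion for CSV input can be sketched with the standard `csv` module (the `parse_csv` helper and its converter mapping are illustrative):

```python
import csv
import io

def parse_csv(text, converters):
    """Parse CSV text with a header row, applying per-column
    type converters; columns without a converter stay strings."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for row in reader:
        rows.append({k: converters.get(k, str)(v) for k, v in row.items()})
    return rows

sample = "sku,qty,price\nA-100,4,19.99\n"
rows = parse_csv(sample, {"qty": int, "price": float})
# rows == [{"sku": "A-100", "qty": 4, "price": 19.99}]
```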
Integration with Other Models
The Data Ingestion service works well with:
- Data Loader — Ingest data, then load into databases
- Traffic Model — Feed real-time transit data into simulations
- Tiny Time Mixers — Ingest historical data for forecasting
- AI Agent Python — Agents can trigger ingestion based on needs
Retry Strategies
Exponential Backoff
- First retry: Wait 1 second
- Second retry: Wait 2 seconds
- Third retry: Wait 4 seconds
- Reduces load on failing services
Fixed Interval
- Wait same amount between each retry
- Simpler but can overwhelm recovering services
Immediate Retry
- Retry without delay
- For transient network issues
- Risk of overwhelming service
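The three strategies differ only in how the wait before each retry is computed. A sketch (the function name is illustrative; the exponential schedule matches the 1 s, 2 s, 4 s progression above):

```python
def backoff_delays(strategy, attempts, base=1.0):
    """Return the wait in seconds before each retry for a given strategy."""
    if strategy == "exponential":
        return [base * 2 ** i for i in range(attempts)]  # 1, 2, 4, ...
    if strategy == "fixed":
        return [base] * attempts  # same wait every time
    if strategy == "immediate":
        return [0.0] * attempts  # no wait at all
    raise ValueError(f"unknown strategy: {strategy}")
```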
Error Handling Patterns
Transient Failures
Network issues, temporary unavailability:
- Automatic retry with backoff
- Log warnings but don't fail workflow
- Return partial data if some sources succeed
Permanent Failures
Invalid credentials, non-existent endpoints:
- Fail immediately without retries
- Provide clear error message
- Don't block unrelated workflow tasks
Partial Success
Some data retrieved, some failed:
- Return available data with warnings
- Log which sources failed
- Allow workflow to continue with partial data
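The partial-success pattern boils down to attempting every source, collecting whatever succeeded, and recording what failed. A minimal sketch (`ingest_all` is an illustrative helper, not the service's API):

```python
def ingest_all(sources):
    """Try every source; return the data that succeeded
    plus a list of (name, error) pairs for the failures."""
    data, failures = {}, []
    for name, fetch in sources.items():
        try:
            data[name] = fetch()
        except Exception as err:
            failures.append((name, str(err)))  # record and continue
    return data, failures
```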
Performance Notes
- Polling Frequency — More frequent polling increases load and costs
- Batch When Possible — Retrieve multiple items in single request
- Use Caching — Cache responses when data doesn't change rapidly
- Parallel Requests — Call independent endpoints concurrently
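Calling independent endpoints concurrently can be sketched with a thread pool from the standard library (`fetch_parallel` is an illustrative helper; each fetcher is any zero-argument callable):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_parallel(fetchers, max_workers=4):
    """Run independent fetch callables concurrently;
    return their results keyed by name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in fetchers.items()}
        return {name: f.result() for name, f in futures.items()}
```

Threads suit this I/O-bound case: each worker spends most of its time waiting on the network, so requests overlap rather than queue.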
Getting Started
Basic Workflow
- Identify Data Source — Find API endpoint for needed data
- Configure Request — Set URL, auth, parameters
- Set Retry Policy — Choose retry count and delays
- Add to Workflow — Drag Data Ingestion onto the canvas
- Test & Monitor — Verify data retrieval and handle errors
Example: Simple API Call
[Data Ingestion: API] → [Process Data] → [Store Results]

This workflow retrieves data from an API, processes it, and stores the results.
Example: Scheduled Data Collection
[Schedule Trigger] → [Data Ingestion: Multiple Sources] → [Data Loader] → [Database]

This workflow runs on a schedule, ingests from multiple sources, and loads the results into a database.
Best Practices
Reliability
- Use Retries — Network is unreliable, plan for failures
- Set Timeouts — Don't wait forever for responses
- Log Everything — Track successes, failures, and retries
- Monitor Patterns — Watch for recurring failures to fix root causes
Performance
- Cache When Appropriate — Don't re-fetch static or slow-changing data
- Batch Requests — Reduce overhead with fewer, larger requests
- Parallel Retrieval — Call independent sources concurrently
- Compress Data — Use gzip for large transfers if supported
Security
- Secure Credentials — Don't hardcode API keys in configurations
- Use HTTPS — Encrypt data in transit
- Validate Certificates — Ensure you're connecting to real endpoints
- Rate Limit — Respect API provider limits to avoid bans
Data Quality
- Validate Responses — Check data structure and values
- Handle Missing Fields — Gracefully deal with incomplete data
- Timestamp Everything — Record when data was retrieved
- Version APIs — Track which API version you're using
Troubleshooting
Requests Timing Out
- Increase timeout value
- Check if endpoint is slow or down
- Try smaller requests if API supports pagination
- Verify network connectivity
Authentication Failures
- Verify credentials are correct and current
- Check if API key has expired
- Ensure proper authentication header format
- Review API documentation for auth requirements
Data Format Errors
- Verify API is returning expected format
- Check if API version has changed
- Review transformation rules
- Log raw responses for debugging
Retry Exhaustion
- API may be down, check status pages
- Increase retry count or delays if transient
- Investigate root cause if persistent
- Set up alerts for repeated failures
Next Steps
- Load data into databases: Data Loader
- Build a workflow: Building and Configuring Workflows
- Explore other models: Modelling Library
