Data Ingestion
Overview
The Data Ingestion service provides reliable data acquisition from external sources through HTTP-based APIs. It handles data retrieval with configurable retry logic, timeout management, and specialized support for real-time transit data formats (GTFS-Realtime).
This service is designed for teams that need to bring external data into workflows — from real-time feeds to periodic data pulls — with robust error handling and transformation capabilities.
Key Capabilities
HTTP Data Retrieval
- Flexible HTTP Requests — GET, POST, PUT, DELETE operations
- Retry Logic — Automatic retries for transient failures
- Timeout Handling — Configurable timeouts to prevent hangs
- Response Processing — Parse JSON, XML, and other formats
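The retry and timeout behavior described above can be sketched with Python's standard library. This is a minimal illustration, not the service's actual implementation; `fetch_with_retry` and its parameters are hypothetical names chosen for the example:

```python
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, retries=3, delay=1.0, timeout=10.0,
                     opener=urllib.request.urlopen):
    """Fetch `url`, retrying transient failures up to `retries` times."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except urllib.error.URLError as err:
            last_error = err
            if attempt < retries:
                time.sleep(delay)  # wait before the next attempt
    raise last_error
```

Passing the opener as a parameter keeps the retry logic testable without a live endpoint.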
GTFS-Realtime Support
- Transit Feed Parsing — Specialized support for GTFS-R protocol buffers
- Real-Time Updates — Vehicle positions, trip updates, service alerts
- Data Transformation — Convert GTFS-R to structured formats
- Feed Validation — Check data quality and completeness
Error Handling
- Automatic Retries — Configurable retry attempts and backoff strategies
- Timeout Protection — Prevent indefinite waits
- Error Logging — Detailed error messages for debugging
- Graceful Degradation — Partial data handling when possible
Data Transformation
- Format Conversion — Transform between data formats
- Field Mapping — Rename and restructure fields
- Filtering — Extract only needed data
- Enrichment — Add computed fields or metadata
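Field mapping, filtering, and enrichment can be combined in a single pass over each record. The helper below is an illustrative sketch (the function name, parameters, and sample fields are assumptions, not part of the service's schema):

```python
def transform(record, field_map, keep, extra=None):
    """Rename fields per `field_map`, keep only the fields in `keep`,
    and merge `extra` metadata into the result."""
    renamed = {field_map.get(k, k): v for k, v in record.items()}
    filtered = {k: v for k, v in renamed.items() if k in keep}
    filtered.update(extra or {})
    return filtered

raw = {"temp_f": 68, "stn": "KSFO", "noise": 1}
clean = transform(
    raw,
    field_map={"temp_f": "temperature_f", "stn": "station"},
    keep={"temperature_f", "station"},
    extra={"retrieved_at": "2024-01-01T00:00:00Z"},  # illustrative timestamp
)
# clean == {"temperature_f": 68, "station": "KSFO",
#           "retrieved_at": "2024-01-01T00:00:00Z"}
```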
Use Cases
Real-Time Transit Data Ingestion
Scenario: A transportation planning team needs live vehicle positions and arrival predictions from transit agencies.
Workflow:
- Configure GTFS-Realtime feed URLs
- Set polling interval (e.g., every 30 seconds)
- Ingest vehicle positions and trip updates
- Transform to standardized format
- Feed into traffic model or analysis workflow
Value: Enable real-time analysis and passenger information systems.
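The polling step in this workflow amounts to calling the ingestion on a fixed interval. A minimal sketch, assuming the actual ingestion is injected as a callable (`poll` and its signature are illustrative, not the service's API):

```python
import time

def poll(ingest, interval_seconds=30, iterations=None):
    """Call `ingest()` every `interval_seconds`.
    Stop after `iterations` cycles; `iterations=None` polls forever."""
    results = []
    count = 0
    while iterations is None or count < iterations:
        results.append(ingest())
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval_seconds)  # wait for the next cycle
    return results
```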
Periodic API Data Collection
Scenario: A supply chain team pulls inventory levels from multiple vendor APIs daily.
Workflow:
- Define API endpoints and authentication
- Schedule daily ingestion workflow
- Retrieve data from each vendor
- Standardize format across vendors
- Load into data warehouse for analysis
Value: Centralize data from disparate sources for unified analysis.
Weather Data Integration
Scenario: An agricultural planning workflow needs current and forecast weather data.
Workflow:
- Call weather API with location parameters
- Retrieve current conditions and forecasts
- Extract relevant fields (temperature, precipitation)
- Combine with crop and soil data
- Feed into irrigation or planting models
Value: Make decisions based on current and predicted conditions.
Market Data Feeds
Scenario: A pricing optimization system needs current market prices for raw materials.
Workflow:
- Connect to market data feeds
- Retrieve latest commodity prices
- Filter to relevant materials
- Transform to internal format
- Update pricing models with current data
Value: Keep pricing competitive based on real-time market conditions.
Model Inputs
The Data Ingestion service accepts:
- API Endpoints — URLs to retrieve data from
- Request Configuration — Headers, authentication, parameters
- Retry Settings — Number of attempts, backoff strategy
- Timeout Values — Maximum wait time for responses
- Transformation Rules — How to process retrieved data
Model Outputs
The service produces:
- Retrieved Data — Raw or transformed data from external sources
- Status Information — Success/failure, retry counts, response times
- Error Messages — Detailed diagnostics for failures
- Metadata — Timestamps, source info, data quality indicators
Configuration Options
Key parameters you can configure:
- Endpoint URL — Where to retrieve data from
- HTTP Method — GET, POST, PUT, DELETE
- Authentication — API keys, tokens, basic auth
- Retry Count — How many times to retry failed requests
- Retry Delay — Time between retry attempts
- Timeout — Maximum wait time for response
- Response Format — JSON, XML, Protobuf, etc.
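Taken together, these parameters might look like the following configuration. The key names are hypothetical, shown only to illustrate how the options fit together, and do not reflect the service's actual schema:

```python
# Illustrative request configuration; key names and the endpoint URL
# are assumptions, not the service's actual configuration schema.
ingestion_config = {
    "endpoint_url": "https://api.example.com/v1/inventory",
    "http_method": "GET",
    "authentication": {"type": "api_key", "header": "X-API-Key"},
    "retry_count": 3,
    "retry_delay_seconds": 2,
    "timeout_seconds": 30,
    "response_format": "json",
}
```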
Supported Data Formats
GTFS-Realtime (Protocol Buffers)
Specialized support for transit data:
- Vehicle Positions — Real-time location of transit vehicles
- Trip Updates — Predicted arrival/departure times
- Service Alerts — Disruptions, detours, schedule changes
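Converting GTFS-R to a structured format typically means flattening feed entities into tabular rows. The sketch below assumes the feed has already been decoded from protocol buffers into plain dicts (real feeds are binary protobuf; the field names follow the GTFS-Realtime message structure, but the helper itself is illustrative):

```python
def flatten_vehicle_positions(feed):
    """Flatten GTFS-Realtime vehicle-position entities (represented here
    as plain dicts, as if already decoded from protobuf) into rows."""
    rows = []
    for entity in feed.get("entity", []):
        vehicle = entity.get("vehicle", {})
        position = vehicle.get("position", {})
        rows.append({
            "entity_id": entity.get("id"),
            "trip_id": vehicle.get("trip", {}).get("trip_id"),
            "latitude": position.get("latitude"),
            "longitude": position.get("longitude"),
            "timestamp": vehicle.get("timestamp"),
        })
    return rows

feed = {"entity": [{"id": "1", "vehicle": {
    "trip": {"trip_id": "T42"},
    "position": {"latitude": 37.77, "longitude": -122.42},
    "timestamp": 1700000000,
}}]}
rows = flatten_vehicle_positions(feed)
```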
JSON
Standard JSON parsing and transformation:
- Nested objects and arrays
- Field extraction and mapping
- Schema validation
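Extracting a field from nested objects and arrays can be sketched with a dotted-path walker (a minimal illustration using the standard library; `extract` is a hypothetical helper, not the service's API):

```python
import json

def extract(payload, path, default=None):
    """Walk a dotted `path` through nested dicts and lists;
    return `default` when a key is missing."""
    node = payload
    for key in path.split("."):
        if isinstance(node, list):
            node = node[int(key)]  # numeric segment indexes into a list
        elif isinstance(node, dict) and key in node:
            node = node[key]
        else:
            return default
    return node

doc = json.loads('{"station": {"readings": [{"temp_c": 21.5}]}}')
temp = extract(doc, "station.readings.0.temp_c")
```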
XML
XML document processing:
- Element and attribute extraction
- XPath queries
- Namespace handling
CSV
Comma-separated values:
- Header detection
- Field parsing
- Type conversion
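Header detection and type conversion for CSV input can be sketched with the standard `csv` module (the `parse_csv` helper and its converter mapping are illustrative):

```python
import csv
import io

def parse_csv(text, converters):
    """Parse CSV text with a header row, applying per-column
    type converters; columns without a converter stay strings."""
    reader = csv.DictReader(io.StringIO(text))
    rows = []
    for row in reader:
        rows.append({k: converters.get(k, str)(v) for k, v in row.items()})
    return rows

sample = "sku,qty,price\nA-100,4,19.99\n"
rows = parse_csv(sample, {"qty": int, "price": float})
# rows == [{"sku": "A-100", "qty": 4, "price": 19.99}]
```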
Integration with Other Models
The Data Ingestion service works well with:
- Data Loader — Ingest data, then load into databases
- Traffic Model — Feed real-time transit data into simulations
- Tiny Time Mixers — Ingest historical data for forecasting
- AI Agent Python — Agents can trigger ingestion based on needs
Retry Strategies
Exponential Backoff
- First retry: Wait 1 second
- Second retry: Wait 2 seconds
- Third retry: Wait 4 seconds
- Reduces load on failing services
Fixed Interval
- Wait same amount between each retry
- Simpler but can overwhelm recovering services
Immediate Retry
- Retry without delay
- For transient network issues
- Risk of overwhelming service
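The three strategies differ only in how the wait before each retry is computed. A sketch (the function name is illustrative; the exponential schedule matches the 1 s, 2 s, 4 s progression above):

```python
def backoff_delays(strategy, attempts, base=1.0):
    """Return the wait in seconds before each retry for a given strategy."""
    if strategy == "exponential":
        return [base * 2 ** i for i in range(attempts)]  # 1, 2, 4, ...
    if strategy == "fixed":
        return [base] * attempts  # same wait every time
    if strategy == "immediate":
        return [0.0] * attempts  # no wait at all
    raise ValueError(f"unknown strategy: {strategy}")
```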
Error Handling Patterns
Transient Failures
Network issues, temporary unavailability:
- Automatic retry with backoff
- Log warnings but don't fail workflow
- Return partial data if some sources succeed
Permanent Failures
Invalid credentials, non-existent endpoints:
- Fail immediately without retries
- Provide clear error message
- Don't block unrelated workflow tasks
Partial Success
Some data retrieved, some failed:
- Return available data with warnings
- Log which sources failed
- Allow workflow to continue with partial data
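The partial-success pattern boils down to attempting every source, collecting whatever succeeded, and recording what failed. A minimal sketch (`ingest_all` is an illustrative helper, not the service's API):

```python
def ingest_all(sources):
    """Try every source; return the data that succeeded
    plus a list of (name, error) pairs for the failures."""
    data, failures = {}, []
    for name, fetch in sources.items():
        try:
            data[name] = fetch()
        except Exception as err:
            failures.append((name, str(err)))  # record and continue
    return data, failures
```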
Performance Notes
- Polling Frequency — More frequent polling increases load and costs
- Batch When Possible — Retrieve multiple items in single request
- Use Caching — Cache responses when data doesn't change rapidly
- Parallel Requests — Call independent endpoints concurrently
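Calling independent endpoints concurrently can be sketched with a thread pool from the standard library (`fetch_parallel` is an illustrative helper; each fetcher is any zero-argument callable):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_parallel(fetchers, max_workers=4):
    """Run independent fetch callables concurrently;
    return their results keyed by name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in fetchers.items()}
        return {name: f.result() for name, f in futures.items()}
```

Threads suit this I/O-bound case: each worker spends most of its time waiting on the network, so requests overlap rather than queue.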
Getting Started
Basic Workflow
- Identify Data Source — Find API endpoint for needed data
- Configure Request — Set URL, auth, parameters
- Set Retry Policy — Choose retry count and delays
- Add to Workflow — Drag Data Ingestion onto the canvas
- Test & Monitor — Verify data retrieval and handle errors
Example: Simple API Call
[Data Ingestion: API] → [Process Data] → [Store Results]

This workflow retrieves data from an API, processes it, and stores the results.
Example: Scheduled Data Collection
[Schedule Trigger] → [Data Ingestion: Multiple Sources] → [Data Loader] → [Database]

This workflow runs on a schedule, ingests from multiple sources, and loads the results into a database.
Best Practices
Reliability
- Use Retries — Network is unreliable, plan for failures
- Set Timeouts — Don't wait forever for responses
- Log Everything — Track successes, failures, and retries
- Monitor Patterns — Watch for recurring failures to fix root causes
Performance
- Cache When Appropriate — Don't re-fetch static or slow-changing data
- Batch Requests — Reduce overhead with fewer, larger requests
- Parallel Retrieval — Call independent sources concurrently
- Compress Data — Use gzip for large transfers if supported
Security
- Secure Credentials — Don't hardcode API keys in configurations
- Use HTTPS — Encrypt data in transit
- Validate Certificates — Ensure you're connecting to real endpoints
- Rate Limit — Respect API provider limits to avoid bans
Data Quality
- Validate Responses — Check data structure and values
- Handle Missing Fields — Gracefully deal with incomplete data
- Timestamp Everything — Record when data was retrieved
- Version APIs — Track which API version you're using
Troubleshooting
Requests Timing Out
- Increase timeout value
- Check if endpoint is slow or down
- Try smaller requests if API supports pagination
- Verify network connectivity
Authentication Failures
- Verify credentials are correct and current
- Check if API key has expired
- Ensure proper authentication header format
- Review API documentation for auth requirements
Data Format Errors
- Verify API is returning expected format
- Check if API version has changed
- Review transformation rules
- Log raw responses for debugging
Retry Exhaustion
- API may be down, check status pages
- Increase retry count or delays if transient
- Investigate root cause if persistent
- Set up alerts for repeated failures
Next Steps
- Load data into databases: Data Loader
- Build a workflow: Building and Configuring Workflows
- Explore other models: Modelling Library
