Core Concepts
Understanding the core concepts behind the Lume Python SDK
Understanding the fundamental concepts behind Lume will help you build more effective and secure data transformation pipelines. Lume’s architecture is designed to provide a secure, reliable, and scalable transformation engine while minimizing direct access to your production systems.
The Sync-Transform-Sync Model
The Lume Python SDK operates on a “Sync-Transform-Sync” model, which is designed for maximum security and operational simplicity. When you trigger a run, you are not executing the transformation logic in your own environment. Instead, you are orchestrating a pipeline on the Lume platform.
This architecture means:
- Your application only needs credentials for the Lume API, not for your source or target data systems.
- Transformation logic is centralized and managed within your Lume Flow Version.
- The Lume platform handles the heavy lifting of data syncing and transformation in a scalable, isolated environment, unaffected by the performance of your source or target systems.
Architecture Diagram
This diagram illustrates the flow of data during a `lume.run()` execution.
Key Objects
Connector
A Connector is a pre-configured, authenticated link to one of your external data systems, such as an object store or a relational database. Connectors are created and managed securely within the Lume UI.
- A Source Connector is used to ingest data from your system to Lume’s staging area.
- A Target Connector is used to sync transformed data from Lume’s staging area to your system.
A Flow Version must be associated with at least one source and one target connector.
Supported Systems
Lume provides connectors for a variety of systems.
Object Storage: Primary storage option for handling ad-hoc documents such as CSV and JSON files. Supported object stores include:
- Amazon S3
- Azure Blob Storage
Relational Databases (Recommended): Optimal storage option for large datasets and structured data. Supported databases include:
- PostgreSQL
- Snowflake
- MySQL
- Microsoft SQL Server
- Databricks
Flow
A Flow is a logical mapping template that defines the blueprint for a transformation. This includes:
- Target Schema: The structure of your output data.
- Transformation Logic: How to map input data to the target schema.
- Validation Rules: Quality checks and business logic.
- Error Handling: How to handle malformed or missing data.
- Connectors: The source and target systems for the data pipeline.
Flows are created and managed in the Lume UI, not via the API.
Version
A Version is an immutable snapshot of a Flow at a specific point in time. Think of it like a Git commit - once created, it never changes.
Why Versions?
- Reproducibility: Same input always produces same output
- Safety: Changes to flows don’t affect running jobs
- Rollback: Easy to revert to previous versions
- Testing: Test new versions before promoting to production
- Compliance: Maintain audit trails
Run
A Run is a single execution of a Flow Version against a specific batch of data. You create a run by calling `lume.run()`.
Each run is defined by two key parameters:
- `flow_version`: The immutable logic to execute.
- `source_path`: A string that tells Lume what specific data to process. See Understanding `source_path` below for details.
This one function call orchestrates the entire Sync-Transform-Sync pipeline.
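A minimal sketch of such a call, assuming the SDK is imported as `lume`, that `flow_version` and `source_path` are passed as keyword arguments, and that `lume.run()` returns a run object exposing `wait()` (as the tip below suggests); the flow version name and file URI are placeholders:

```python
import lume

# Kick off the Sync-Transform-Sync pipeline on the Lume platform.
# Both values below are hypothetical placeholders.
run = lume.run(
    flow_version="invoice_mapping:v4",                    # immutable Flow Version to execute
    source_path="s3://my-customer-data/new_records.csv",  # which data to process
)

# Block until the run reaches a terminal state. Fine for scripts;
# see the webhook tip below for production workloads.
run.wait()
```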
Pro Tip: Use Webhooks for Production
While `run.wait()` is great for simple scripts and getting started, we strongly recommend using Webhooks for production applications. They are more scalable and efficient than continuous polling.
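As an illustration of the webhook approach, the sketch below shows a minimal receiver. The endpoint path, the payload fields (`run_id`, `status`), and the assumption that Lume delivers a JSON POST on run completion are all hypothetical; configure webhooks and consult the actual event schema in the Lume documentation.

```python
# Hypothetical webhook receiver sketch (FastAPI). The payload shape shown
# here is an assumption for illustration, not a documented contract.
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/lume/webhook")
async def handle_lume_event(request: Request):
    event = await request.json()
    run_id = event.get("run_id")    # assumed field name
    status = event.get("status")    # assumed field name
    if status == "SUCCEEDED":
        ...  # e.g., trigger downstream processing for run_id
    elif status in ("FAILED", "PARTIAL_FAILED", "CRASHED"):
        ...  # e.g., alert or enqueue the run for review
    return {"ok": True}
```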
Understanding `source_path`
The `source_path` parameter is a string that uniquely identifies the data your pipeline will process. Its meaning depends on the type of Source Connector used by your Flow Version.
For Object Storage (S3, Azure Blob)
When your source is an object store, `source_path` is the full URI to a specific file.
- Example: `s3://my-customer-data/new_records.csv`
Lume will fetch this specific file for processing.
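A run against that file might look like the sketch below (the flow version name is a placeholder):

```python
import lume

# For object-store sources, source_path is the full URI of the object to ingest.
run = lume.run(
    flow_version="customer_cleanup:v2",  # placeholder Flow Version name
    source_path="s3://my-customer-data/new_records.csv",
)
```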
For Relational Databases (PostgreSQL, Snowflake)
When your source is a database, `source_path` is not a direct path but rather a logical identifier for a batch of data. It’s a string you provide (e.g., a batch ID, a date range) that your pre-configured query in the Lume UI uses to select the correct rows.
- Example: `"batch_202407291430"`
The Connector configuration in Lume contains the actual SQL query. This query must reference the `source_path` to filter the data. For example, your query might look like: `SELECT * FROM invoices WHERE batch_id = :source_path;`
This design prevents SQL injection and separates orchestration logic (the `source_path` your code provides) from data access logic (the query managed in Lume).
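Putting the two halves together, the calling side might look like the sketch below; the batch identifier and flow version name are placeholders, and the SQL shown above lives in the Connector configuration, not in your code:

```python
import lume

# For database sources, source_path is a logical batch identifier. The
# Connector's pre-configured query (e.g. SELECT * FROM invoices WHERE
# batch_id = :source_path) uses it to select the rows to ingest.
run = lume.run(
    flow_version="invoice_mapping:v4",  # placeholder Flow Version name
    source_path="batch_202407291430",
)
```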
For a complete, step-by-step guide to running your first pipeline, see the Quickstart Guide.
Run Lifecycle
Every run goes through an expanded set of states reflecting the Sync-Transform-Sync process:
A run can also terminate in `FAILED`, `PARTIAL_FAILED`, or `CRASHED`.
Status Meanings
| Status | Description |
|---|---|
| `CREATED` | Run has been accepted and is waiting to be scheduled. |
| `SYNCING_SOURCE` | Lume is actively ingesting data from your source system into its secure staging area. |
| `TRANSFORMING` | The data transformation logic is being executed on the staged data. |
| `SYNCING_TARGET` | Lume is writing the transformed data and metadata to your target system. |
| `SUCCEEDED` | The entire pipeline, including both sync steps and the transformation, completed successfully. |
| `PARTIAL_FAILED` | The pipeline completed. Some rows were transformed successfully, while others were rejected due to validation or mapping errors. Both mapped and rejected data are written to the target system. See Handling Partial Failures for details. |
| `FAILED` | A non-recoverable error occurred during one of the steps. Check metadata for details. |
| `CRASHED` | A fatal system error occurred. Contact support. |
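As a sketch of reacting to these terminal states, assuming the run object exposes its final state as a string (the `run.status` attribute name here is an assumption, not a documented part of the SDK):

```python
import lume

run = lume.run(
    flow_version="invoice_mapping:v4",  # placeholder Flow Version name
    source_path="batch_202407291430",   # placeholder batch identifier
)
run.wait()  # blocks until a terminal state

# `run.status` is assumed for illustration; check the SDK reference for the
# actual attribute or accessor that exposes the final state.
if run.status == "SUCCEEDED":
    print("All rows transformed and synced to the target.")
elif run.status == "PARTIAL_FAILED":
    print("Pipeline finished, but some rows were rejected; review them in the target system.")
elif run.status == "FAILED":
    print("Non-recoverable error; inspect the run metadata for details.")
elif run.status == "CRASHED":
    print("Fatal system error; contact Lume support.")
```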