Understanding the fundamental concepts behind Lume will help you build more effective and secure data transformation pipelines. Lume’s architecture is designed to provide a secure, reliable, and scalable transformation engine while minimizing direct access to your production systems.

The Sync-Transform-Sync Model

The Lume Python SDK operates on a “Sync-Transform-Sync” model, which is designed for maximum security and operational simplicity. When you trigger a run, you are not executing the transformation logic in your own environment. Instead, you are orchestrating a pipeline on the Lume platform.

This architecture means:

  • Your application only needs credentials for the Lume API, not for your source or target data systems.
  • Transformation logic is centralized and managed within your Lume Flow Version.
  • The Lume platform handles the heavy lifting of data syncing and transformation in a scalable, isolated environment, unaffected by the performance of your source or target systems.
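
As a minimal sketch of the first point, the only secret your application handles is a Lume API credential. The LUME_API_KEY variable name below is an assumption for illustration, not a documented SDK setting; the important part is that no database or object-store credentials appear anywhere in your code.

import os

import lume

# Illustrative assumption: the SDK picks up its API credential from the
# environment. No source-database or S3 credentials are needed here; those
# live in the Connectors you configure in the Lume UI.
os.environ.setdefault("LUME_API_KEY", "<your-lume-api-key>")

run = lume.run(
    flow_version="invoice_cleaner:v4",
    source_path="s3://my-company-invoices/new/2024-08-01.csv",
)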

Architecture Diagram

This diagram illustrates the flow of data during a lume.run() execution.

Key Objects

Connector

A Connector is a pre-configured, authenticated link to one of your external data systems, such as an object store or a relational database. Connectors are created and managed securely within the Lume UI.

  • A Source Connector is used to ingest data from your system to Lume’s staging area.
  • A Target Connector is used to sync transformed data from Lume’s staging area to your system.

A Flow Version must be associated with at least one source and one target connector.

Supported Systems

Lume provides connectors for a variety of systems.

Object Storage

Primary storage solution for handling ad-hoc documents such as CSV and JSON files. Supported object stores include:

  • Amazon S3
  • Azure Blob Storage

Relational Databases (Recommended)

Optimal storage solution for large datasets and structured data. Supported databases include:

  • PostgreSQL
  • Snowflake
  • MySQL
  • Microsoft SQL Server
  • Databricks

Flow

A Flow is a logical mapping template that defines the blueprint for a transformation. This includes:

  • Target Schema: The structure of your output data.
  • Transformation Logic: How to map input data to the target schema.
  • Validation Rules: Quality checks and business logic.
  • Error Handling: How to handle malformed or missing data.
  • Connectors: The source and target systems for the data pipeline.

Flows are created and managed in the Lume UI, not via the API.

Version

A Version is an immutable snapshot of a Flow at a specific point in time. Think of it like a Git commit: once created, it never changes.

# Examples of Flow Versions
"customer_data:v2"      # Version 2 of the customer_data flow
"product_catalog:v1"      # Version 1 of the product_catalog flow

Why Versions?

  • Reproducibility: Same input always produces same output
  • Safety: Changes to flows don’t affect running jobs
  • Rollback: Easy to revert to previous versions
  • Testing: Test new versions before promoting to production
  • Compliance: Maintain audit trails
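
One practical pattern that follows from this is pinning the exact flow_version your service uses in a single configuration constant, so promoting a new version or rolling back becomes a one-line change. The constant name and source path below are illustrative; only lume.run() and the "name:vN" version format come from this page.

import lume

# Pin the immutable version in one place. Rolling back to customer_data:v1
# (or promoting to v3) is a single-line change, and every run stays
# reproducible against the snapshot it names.
CUSTOMER_DATA_FLOW = "customer_data:v2"

run = lume.run(
    flow_version=CUSTOMER_DATA_FLOW,
    source_path="s3://my-customer-data/new_records.csv",
)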

Run

A Run is a single execution of a Flow Version against a specific batch of data. You create a run by calling lume.run().

Each run is defined by two key parameters:

  1. flow_version: The immutable logic to execute.
  2. source_path: A string that tells Lume what specific data to process. See Understanding source_path below for details.

This one function call orchestrates the entire Sync-Transform-Sync pipeline.

import lume

# This one call tells Lume to find the data at the given path,
# execute the "invoice_cleaner:v4" logic, and sync the results
# to the target defined in the flow's connectors.
run = lume.run(
    flow_version="invoice_cleaner:v4",
    source_path="s3://my-company-invoices/new/2024-08-01.csv"
)

# You can then wait for the run to complete
run.wait()

print(f"Run {run.id} finished with status: {run.status}")

Pro Tip: Use Webhooks for Production

While run.wait() is great for simple scripts and getting started, we strongly recommend using Webhooks for production applications. They are more scalable and efficient than continuous polling.
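
A minimal sketch of a webhook receiver is shown below, assuming Lume POSTs a JSON payload when a run reaches a terminal state. The endpoint path and the payload fields (run_id, status) are illustrative assumptions; consult the Webhooks documentation for the actual contract.

from flask import Flask, request

app = Flask(__name__)

@app.route("/lume/webhook", methods=["POST"])
def handle_lume_event():
    # Assumed payload shape for illustration only; the real field names are
    # defined in the Webhooks documentation.
    event = request.get_json(force=True)
    run_id = event.get("run_id")
    status = event.get("status")

    if status == "SUCCEEDED":
        pass  # e.g., trigger downstream processing for run_id
    elif status in ("FAILED", "PARTIAL_FAILED", "CRASHED"):
        pass  # e.g., alert the on-call channel with run_id and status

    return "", 204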

Understanding source_path

The source_path parameter is a string that uniquely identifies the data your pipeline will process. Its meaning depends on the type of Source Connector used by your Flow Version.

For Object Storage (S3, Azure Blob)

When your source is an object store, source_path is the full URI to a specific file.

  • Example: s3://my-customer-data/new_records.csv

Lume will fetch this specific file for processing.
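
For example, a daily job might build the URI for that day's file before handing it to lume.run(). The bucket and key layout below are illustrative assumptions.

from datetime import date, timedelta

import lume

# Assumed layout: one CSV per day under new/ in the invoices bucket.
yesterday = date.today() - timedelta(days=1)
source_path = f"s3://my-company-invoices/new/{yesterday.isoformat()}.csv"

run = lume.run(flow_version="invoice_cleaner:v4", source_path=source_path)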

For Relational Databases (PostgreSQL, Snowflake)

When your source is a database, source_path is not a direct path but rather a logical identifier for a batch of data. It’s a string you provide (e.g., a batch ID, a date range) that your pre-configured query in the Lume UI uses to select the correct rows.

  • Example: "batch_202407291430"

The Connector configuration in Lume contains the actual SQL query. This query must reference the source_path to filter the data. For example, your query might look like: SELECT * FROM invoices WHERE batch_id = :source_path;

This design prevents SQL injection and separates orchestration logic (the source_path your code provides) from data access logic (the query managed in Lume).
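
On the orchestration side, your code only needs to produce an identifier in the format the connector's query expects, as sketched below for a timestamp-style batch ID like the one above. The exact format is an assumption you agree on when configuring the query in the Lume UI; the code never touches the database directly.

from datetime import datetime, timezone

import lume

# Generate a batch identifier matching what the connector's query filters on,
# e.g. SELECT * FROM invoices WHERE batch_id = :source_path (configured in
# the Lume UI).
batch_id = datetime.now(timezone.utc).strftime("batch_%Y%m%d%H%M")

run = lume.run(flow_version="invoice_cleaner:v4", source_path=batch_id)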

For a complete, step-by-step guide to running your first pipeline, see the Quickstart Guide.

Run Lifecycle

Every run moves through a set of states that reflect the Sync-Transform-Sync process:

CREATED → SYNCING_SOURCE → TRANSFORMING → SYNCING_TARGET → SUCCEEDED

A run can also terminate in FAILED, PARTIAL_FAILED, or CRASHED.

Status Meanings

  • CREATED: Run has been accepted and is waiting to be scheduled.
  • SYNCING_SOURCE: Lume is actively ingesting data from your source system into its secure staging area.
  • TRANSFORMING: The data transformation logic is being executed on the staged data.
  • SYNCING_TARGET: Lume is writing the transformed data and metadata to your target system.
  • SUCCEEDED: The entire pipeline, including both sync steps and the transformation, completed successfully.
  • PARTIAL_FAILED: The pipeline completed. Some rows were transformed successfully, while others were rejected due to validation or mapping errors. Both mapped and rejected data are written to the target system. See Handling Partial Failures for details.
  • FAILED: A non-recoverable error occurred during one of the steps. Check metadata for details.
  • CRASHED: A fatal system error occurred. Contact support.
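
A short sketch of reacting to terminal statuses after run.wait() follows; it assumes run.status exposes the state as a plain string matching the names above, which is how the earlier example prints it.

import lume

run = lume.run(
    flow_version="invoice_cleaner:v4",
    source_path="s3://my-company-invoices/new/2024-08-01.csv",
)
run.wait()

# Assumption: run.status is the state name as a string, as printed in the
# earlier example.
if run.status == "SUCCEEDED":
    print(f"Run {run.id} completed cleanly.")
elif run.status == "PARTIAL_FAILED":
    # Mapped and rejected rows were both written to the target; review the
    # rejected records before relying on the output.
    print(f"Run {run.id} completed with rejected rows.")
elif run.status == "FAILED":
    print(f"Run {run.id} failed; check the run metadata for details.")
else:  # CRASHED
    print(f"Run {run.id} crashed; contact support.")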