From Zero to One: How I Built a Lean Data Stack with Dagster, DuckDB, and S3-Compatible Object Storage

I did not want a heavy data platform. I wanted a stack that was simple enough to ship quickly, cheap enough to run, and flexible enough to grow later.

Character Design

I tend to choose practical tools over perfect architecture.

When I build a data system, I usually care about three things first:

Can the team move fast?
Can we keep the cost under control?
Can we evolve the system later without rebuilding everything?

With that in mind, I ended up with a simple stack:

Dagster for orchestration, DuckDB for computation, and S3-compatible object storage for layered storage.

In one sentence:

Dagster handles asset orchestration, DuckDB handles SQL-heavy transformation, and object storage holds the data in Parquet or CSV.

Content

Why I chose this stack

At the beginning, I was not trying to build the most advanced modern data stack.

I was trying to build something that actually worked for the stage I was in.

I needed a setup that could support real delivery, but without the cost and complexity of a larger platform. That is why this combination made sense to me.

It is lightweight, clear, and good enough for many small-to-medium analytical workloads.

How I split the architecture

1. Orchestration: Dagster

I use Dagster assets to represent dependencies directly instead of wiring everything together with scripts.

That gives me a few clear benefits:

The pipeline is easier to understand
Upstream and downstream relationships are explicit
Re-runs are easier to control
Lineage and execution status are easier to observe

I usually organize assets by layers such as:

raw
stg
dwd

This keeps the codebase readable and helps a lot once the number of datasets starts growing.

2. Compute: DuckDB

My default rule is simple:

If SQL can do it, I prefer SQL.

DuckDB lets me read data from object storage, run transformations in one session, and write the results back out again.

For semi-structured inputs or messy edge cases, I still use Python when needed. But I try to keep the main transformation logic in SQL.

That approach works well for me because SQL is usually easier to read, easier to review, and easier to standardize across a team.

DuckDB is a good fit here because it runs in-process, with no server to operate, while still handling a lot of real analytical work on a single machine.

3. Storage: S3-compatible object storage

On the storage side, I treat everything as an S3 URI and use Parquet as the main format. When another team or external user needs a simpler export, I also write CSV.

This gives me a few useful defaults:

One storage protocol
One layered path structure
One main delivery format

That consistency removes a lot of unnecessary decisions from daily work.
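
One way to enforce those defaults is a single helper that builds every path. The sketch below assumes a `bucket/layer/dataset` layout and a hypothetical bucket name. The article does not specify the real path convention, so treat this as one possible shape.

```python
LAYERS = ("raw", "stg", "dwd")
FORMATS = ("parquet", "csv")


def layer_uri(bucket: str, layer: str, dataset: str, fmt: str = "parquet") -> str:
    """Build an S3 URI for a dataset in a given layer (hypothetical layout)."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"s3://{bucket}/{layer}/{dataset}/{dataset}.{fmt}"


# Parquet is the default; CSV is requested explicitly for simpler exports.
main_uri = layer_uri("example-bucket", "stg", "detail")
export_uri = layer_uri("example-bucket", "dwd", "detail_avg", fmt="csv")
```

Routing every read and write through one function like this keeps the layer names and formats from drifting across assets.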

A real pipeline example

In one anonymized workflow, the pipeline looked roughly like this:

raw_event_asset
stg_detail_asset
dwd_detail_asset
dwd_event_asset
dwd_detail_latest_asset
dwd_detail_avg_asset

This kind of workflow is already too complex for a few scripts, but it still does not require a heavy distributed platform.

That is exactly the gap this stack helped me cover.
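
To make the shape of such a workflow concrete, the sketch below models the six assets as a dependency graph using only the Python standard library. The asset names come from the list above, but the edges between them are guesses made for illustration. Dagster resolves this ordering itself, so the code only shows the idea.

```python
from graphlib import TopologicalSorter

# Asset -> set of upstream assets. The edges are assumed, not taken from the real pipeline.
DEPENDS_ON = {
    "raw_event_asset": set(),
    "stg_detail_asset": {"raw_event_asset"},
    "dwd_detail_asset": {"stg_detail_asset"},
    "dwd_event_asset": {"raw_event_asset"},
    "dwd_detail_latest_asset": {"dwd_detail_asset"},
    "dwd_detail_avg_asset": {"dwd_detail_asset"},
}


def run_order(graph):
    """Return one valid execution order in which upstream assets come first."""
    return list(TopologicalSorter(graph).static_order())


def downstream_of(graph, asset):
    """Return every asset that would need a re-run if `asset` changed."""
    affected = set()
    frontier = [asset]
    while frontier:
        current = frontier.pop()
        for name, upstream in graph.items():
            if current in upstream and name not in affected:
                affected.add(name)
                frontier.append(name)
    return affected


order = run_order(DEPENDS_ON)
to_rerun = downstream_of(DEPENDS_ON, "stg_detail_asset")
```

With even six assets, answering "what must be rebuilt if staging changes?" by hand gets error-prone, which is the point at which an orchestrator starts to pay for itself.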

The boundaries I try to keep clear

One thing I care about a lot is keeping responsibilities separate:

Dagster is for orchestration
DuckDB and SQL are for transformation
Object storage is for storage and delivery

I try not to mix these concerns too much.

Once the boundaries become blurry, a lightweight architecture can become messy very quickly.

When this stack is a good fit

I would not judge this stack only by data volume.

I usually look at several things together:

Data size
Freshness requirements
Concurrency
Source complexity
Team size
Cost pressure
Governance needs
Business stage

In my experience, this setup is a strong fit when:

You are in the early 0-to-1 or 1-to-N stage
Most workloads are batch or near-real-time
Your team is comfortable with SQL
You want fast iteration without introducing too much infrastructure

For that kind of situation, Dagster + DuckDB + S3-compatible object storage feels like a very good balance.

What this architecture helped me do

This stack does not solve every problem.

But it helped me solve an important one:

How do I keep delivering value without letting cost or complexity grow too early?

That is why I like this setup.

It gave me a practical middle ground between under-building and over-engineering.

If you are in a similar stage, I think these are the three questions worth asking:

Can we do without distributed infrastructure for now?
Can SQL handle most of our transformations?
Are we okay with a simple, pragmatic setup first, then upgrading later when the bottleneck becomes real?

If the answer to all three is yes, this stack is worth trying.

About the Author

I write about the modern data stack, data analysis, and cities.

I am especially interested in practical work with DuckDB, Dagster, and lightweight data systems.

I care about useful tools, clear thinking, and real-world data work.
