From Zero to One: How I Built a Lean Data Stack with Dagster, DuckDB, and S3-Compatible Object Storage


I did not want a heavy data platform. I wanted a stack that was simple enough to ship quickly, cheap enough to run, and flexible enough to grow later.

Character Design

I tend to choose practical tools over perfect architecture.

When I build a data system, I usually care about three things first:

  • Can the team move fast?

  • Can we keep the cost under control?

  • Can we evolve the system later without rebuilding everything?

With that in mind, I ended up with a simple stack:

Dagster for orchestration, DuckDB for computation, and S3-compatible object storage for layered storage.

In one sentence:

Dagster handles asset orchestration, DuckDB handles SQL-heavy transformation, and object storage holds the data in Parquet or CSV.


Content

Why I chose this stack

At the beginning, I was not trying to build the most advanced modern data stack.

I was trying to build something that actually worked for the stage I was in.

I needed a setup that could support real delivery, but without the cost and complexity of a larger platform. That is why this combination made sense to me.

It is lightweight, clear, and good enough for many small-to-medium analytical workloads.


How I split the architecture

1. Orchestration: Dagster

I use Dagster assets to represent dependencies directly instead of wiring everything together with scripts.

That gives me a few clear benefits:

  • The pipeline is easier to understand

  • Upstream and downstream relationships are explicit

  • Re-runs are easier to control

  • Lineage and execution status are easier to observe

I usually organize assets by layers such as:

  • raw

  • stg

  • dwd

This keeps the codebase readable and helps a lot once the number of datasets starts growing.
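As a rough sketch of what layer-based organization could look like in code, the snippet below derives the layer from an asset name and reuses it as the Dagster group. The helper names `layer_of` and `define_assets`, the empty asset bodies, and the choice of `group_name` to express layers are illustrative assumptions rather than the layout of the original project. The `deps` argument assumes a reasonably recent Dagster release.

```python
LAYERS = ("raw", "stg", "dwd")


def layer_of(asset_name: str) -> str:
    """Return the layer prefix of an asset name such as 'stg_detail_asset'."""
    prefix = asset_name.split("_", 1)[0]
    if prefix not in LAYERS:
        raise ValueError(f"unknown layer in asset name: {asset_name!r}")
    return prefix


def define_assets():
    """Declare two assets, one per layer, with an explicit dependency.

    Dagster is imported lazily so layer_of() stays usable without it.
    """
    from dagster import asset

    @asset(group_name=layer_of("raw_event_asset"))
    def raw_event_asset() -> None:
        # Hypothetical body: land source events in the raw layer.
        ...

    @asset(group_name=layer_of("stg_detail_asset"), deps=[raw_event_asset])
    def stg_detail_asset() -> None:
        # Hypothetical body: clean raw events into staged detail rows.
        ...

    return [raw_event_asset, stg_detail_asset]
```

Grouping by layer is only one option. The same prefix could also drive folder names or asset key prefixes, as long as the convention is applied consistently.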


2. Compute: DuckDB

My default rule is simple:

If SQL can do it, I prefer SQL.

DuckDB lets me read data from object storage, run transformations in one session, and write the results back out again.

For semi-structured inputs or messy edge cases, I still use Python when needed. But I try to keep the main transformation logic in SQL.

That approach works well for me because SQL is usually easier to read, easier to review, and easier to standardize across a team.

DuckDB is a good fit here because it runs in-process, with no server or cluster to operate, while still being powerful enough for a lot of real analytical work.
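A minimal sketch of the read, transform, write loop described above, assuming the DuckDB Python package and its `httpfs` extension. The bucket name, column names, and the helpers `build_copy_sql` and `run_transform` are hypothetical, and the credential and endpoint setup for an S3-compatible store is deliberately left out because it depends on the environment.

```python
def build_copy_sql(select_sql: str, target_uri: str) -> str:
    """Wrap a SELECT in a DuckDB COPY statement that writes Parquet."""
    safe_uri = target_uri.replace("'", "''")  # escape quotes for the SQL literal
    return f"COPY ({select_sql}) TO '{safe_uri}' (FORMAT PARQUET)"


def run_transform(select_sql: str, target_uri: str) -> None:
    """Read, transform, and write back within one in-memory DuckDB session."""
    import duckdb  # lazy import: only needed when actually executing

    con = duckdb.connect()  # in-memory database, nothing to provision
    try:
        con.execute("INSTALL httpfs")
        con.execute("LOAD httpfs")  # enables s3:// reads and writes
        # Credentials and endpoint settings are omitted in this sketch.
        con.execute(build_copy_sql(select_sql, target_uri))
    finally:
        con.close()


# Hypothetical staging query: the bucket and columns are placeholders.
STG_DETAIL_SQL = """
SELECT event_id, CAST(event_time AS TIMESTAMP) AS event_time, detail
FROM read_parquet('s3://example-bucket/raw/event/*.parquet')
WHERE event_id IS NOT NULL
""".strip()
```

Keeping the SQL as a plain string and the execution in one small function means the query can be reviewed on its own, which matches the "SQL first" rule.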


3. Storage: S3-compatible object storage

On the storage side, I treat everything as an S3 URI and use Parquet as the main format. When another team or external user needs a simpler export, I also write CSV.

This gives me a few useful defaults:

  • One storage protocol

  • One layered path structure

  • One main delivery format

That consistency removes a lot of unnecessary decisions from daily work.
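To make "one layered path structure" concrete, here is a small sketch of a URI builder. The `s3://bucket/layer/dataset/dt=YYYY-MM-DD/data.ext` layout, the bucket name, and the function `layer_uri` are assumptions chosen for illustration, not the exact convention used in the original project.

```python
from datetime import date

LAYERS = ("raw", "stg", "dwd")
FORMATS = ("parquet", "csv")  # Parquet by default, CSV for simpler exports


def layer_uri(bucket: str, layer: str, dataset: str, day: date,
              fmt: str = "parquet") -> str:
    """Build the S3 URI for one dataset partition in a given layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"s3://{bucket}/{layer}/{dataset}/dt={day.isoformat()}/data.{fmt}"


# Example: the staged detail table for one day, in both formats.
parquet_uri = layer_uri("example-bucket", "stg", "detail", date(2024, 1, 31))
csv_uri = layer_uri("example-bucket", "stg", "detail", date(2024, 1, 31), "csv")
```

Routing every read and write through one function like this is what keeps the path structure uniform across assets.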


A real pipeline example

In one anonymized workflow, the pipeline looked roughly like this:

  1. raw_event_asset

  2. stg_detail_asset

  3. dwd_detail_asset

  4. dwd_event_asset

  5. dwd_detail_latest_asset

  6. dwd_detail_avg_asset

This kind of workflow is already too complex for a few scripts, but it still does not require a heavy distributed platform.

That is exactly the gap this stack helped me cover.
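The asset list above can be modeled as a small dependency graph. The sketch below uses only the Python standard library, not Dagster, to show why explicit dependencies help with ordering and re-runs. The edges between assets are my guess from the asset names; the original workflow may have been wired differently.

```python
from graphlib import TopologicalSorter

# Maps each asset to its direct upstream assets (assumed edges).
UPSTREAM: dict[str, set[str]] = {
    "raw_event_asset": set(),
    "stg_detail_asset": {"raw_event_asset"},
    "dwd_detail_asset": {"stg_detail_asset"},
    "dwd_event_asset": {"raw_event_asset"},
    "dwd_detail_latest_asset": {"dwd_detail_asset"},
    "dwd_detail_avg_asset": {"dwd_detail_asset"},
}


def run_order(upstream: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order in which upstream assets come first."""
    return list(TopologicalSorter(upstream).static_order())


def downstream_of(upstream: dict[str, set[str]], name: str) -> set[str]:
    """Return every asset that transitively depends on `name`."""
    affected: set[str] = set()
    frontier = [name]
    while frontier:
        current = frontier.pop()
        for child, parents in upstream.items():
            if current in parents and child not in affected:
                affected.add(child)
                frontier.append(child)
    return affected


order = run_order(UPSTREAM)
to_refresh = downstream_of(UPSTREAM, "dwd_detail_asset")
```

With these assumed edges, a change to `dwd_detail_asset` only requires refreshing the two assets derived from it. An orchestrator does this bookkeeping for you, and it gets tedious to do by hand with loose scripts.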


The boundaries I try to keep clear

One thing I care about a lot is keeping responsibilities separate:

  • Dagster is for orchestration

  • DuckDB and SQL are for transformation

  • Object storage is for storage and delivery

I try not to mix these concerns too much.

Once the boundaries become blurry, a lightweight architecture can become messy very quickly.


When this stack is a good fit

I would not judge this stack only by data volume.

I usually look at several things together:

  • Data size

  • Freshness requirements

  • Concurrency

  • Source complexity

  • Team size

  • Cost pressure

  • Governance needs

  • Business stage

In my experience, this setup is a strong fit when:

  • You are in the early 0-to-1 or 1-to-N stage

  • Most workloads are batch or near-real-time

  • Your team is comfortable with SQL

  • You want fast iteration without introducing too much infrastructure

For that kind of situation, Dagster + DuckDB + S3-compatible object storage feels like a very good balance.


What this architecture helped me do

This stack does not solve every problem.

But it helped me solve an important one:

How do I keep delivering value without letting cost or complexity grow too early?

That is why I like this setup.

It gave me a practical middle ground between under-building and over-engineering.

If you are in a similar stage, I think these are the three questions worth asking:

  1. Do we really need distributed infrastructure right now?

  2. Can SQL handle most of our transformations?

  3. Are we okay with a simple, pragmatic setup first, then upgrading later when the bottleneck becomes real?

If your answers are no, yes, and yes, this stack is worth trying.


About the Author

I write about the modern data stack, data analysis, and cities.

I am especially interested in practical work with DuckDB, Dagster, and lightweight data systems.

I care about useful tools, clear thinking, and real-world data work.