From Zero to One: How I Built a Lean Data Stack with Dagster, DuckDB, and S3-Compatible Object Storage

I did not want a heavy data platform. I wanted a stack that was simple enough to ship quickly, cheap enough to run, and flexible enough to grow later.
I tend to choose practical tools over perfect architecture.
When I build a data system, I usually care about three things first:
Can the team move fast?
Can we keep the cost under control?
Can we evolve the system later without rebuilding everything?
With that in mind, I ended up with a simple stack:
Dagster for orchestration, DuckDB for computation, and S3-compatible object storage for layered storage.
In one sentence:
Dagster handles asset orchestration, DuckDB handles SQL-heavy transformation, and object storage holds the data in Parquet or CSV.
Why I chose this stack
At the beginning, I was not trying to build the most advanced modern data stack.
I was trying to build something that actually worked for the stage I was in.
I needed a setup that could support real delivery, but without the cost and complexity of a larger platform. That is why this combination made sense to me.
It is lightweight, clear, and good enough for many small-to-medium analytical workloads.
How I split the architecture
1. Orchestration: Dagster
I use Dagster assets to represent dependencies directly instead of wiring everything together with scripts.
That gives me a few clear benefits:
The pipeline is easier to understand
Upstream and downstream relationships are explicit
Re-runs are easier to control
Lineage and execution status are easier to observe
I usually organize assets by layers such as:
raw → stg → dwd
This keeps the codebase readable and helps a lot once the number of datasets starts growing.
2. Compute: DuckDB
My default rule is simple:
If SQL can do it, I prefer SQL.
DuckDB lets me read data from object storage, run transformations in one session, and write the results back out again.
For semi-structured inputs or messy edge cases, I still use Python when needed. But I try to keep the main transformation logic in SQL.
That approach works well for me because SQL is usually easier to read, easier to review, and easier to standardize across a team.
DuckDB is a good fit here because it stays lightweight while still being powerful enough for a lot of real work.
3. Storage: S3-compatible object storage
On the storage side, I treat everything as an S3 URI and use Parquet as the main format. When another team or external user needs a simpler export, I also write CSV.
This gives me a few useful defaults:
One storage protocol
One layered path structure
One main delivery format
That consistency removes a lot of unnecessary decisions from daily work.
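One way to enforce those defaults is a single helper that every asset calls to build its output location. This is only a sketch: the bucket name, the `dt=` partition folder, and the `data.<fmt>` file name are assumptions, not the layout used in the original project.

```python
from datetime import date

BUCKET = "example-data-lake"  # hypothetical bucket name
LAYERS = ("raw", "stg", "dwd")
FORMATS = ("parquet", "csv")  # Parquet by default, CSV for simple exports


def layer_uri(layer: str, dataset: str, day: date, fmt: str = "parquet") -> str:
    """Build one predictable S3 URI per layer, dataset, and day."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"s3://{BUCKET}/{layer}/{dataset}/dt={day.isoformat()}/data.{fmt}"


if __name__ == "__main__":
    print(layer_uri("stg", "detail", date(2024, 1, 2)))
    print(layer_uri("dwd", "detail_avg", date(2024, 1, 2), fmt="csv"))
```

Routing every path through one function means a typo in a layer name fails loudly instead of silently creating a new prefix in the bucket.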
A real pipeline example
In one anonymized workflow, the pipeline looked roughly like this:
raw_event_asset
stg_detail_asset
dwd_detail_asset
dwd_event_asset
dwd_detail_latest_asset
dwd_detail_avg_asset
This kind of workflow is already too complex for a few scripts, but it still does not require a heavy distributed platform.
That is exactly the gap this stack helped me cover.
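To show why a workflow of this shape outgrows a handful of scripts, here is a plain-Python sketch of the dependency bookkeeping an orchestrator does for you. The asset names come from the workflow above, but the edges between them are my assumption for illustration, and this uses the standard library rather than Dagster's API.

```python
from graphlib import TopologicalSorter

# Asset names come from the workflow above; the edges are assumed for
# illustration only. Each key maps to the assets it depends on.
DEPS = {
    "raw_event_asset": set(),
    "stg_detail_asset": {"raw_event_asset"},
    "dwd_detail_asset": {"stg_detail_asset"},
    "dwd_event_asset": {"raw_event_asset"},
    "dwd_detail_latest_asset": {"dwd_detail_asset"},
    "dwd_detail_avg_asset": {"dwd_detail_asset"},
}


def downstream_of(asset: str) -> set[str]:
    """Return every asset that directly or indirectly depends on `asset`."""
    found: set[str] = set()
    frontier = [asset]
    while frontier:
        current = frontier.pop()
        for name, parents in DEPS.items():
            if current in parents and name not in found:
                found.add(name)
                frontier.append(name)
    return found


# A valid execution order: every asset appears after its dependencies.
order = list(TopologicalSorter(DEPS).static_order())

if __name__ == "__main__":
    print(order)
    print(sorted(downstream_of("dwd_detail_asset")))
```

With scripts, both the run order and the "what must be re-run if this changes" question have to be tracked by hand. With assets, they fall out of the declared dependencies.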
The boundaries I try to keep clear
One thing I care about a lot is keeping responsibilities separate:
Dagster is for orchestration
DuckDB and SQL are for transformation
Object storage is for storage and delivery
I try not to mix these concerns too much.
Once the boundaries become blurry, a lightweight architecture can become messy very quickly.
When this stack is a good fit
I would not judge this stack only by data volume.
I usually look at several things together:
Data size
Freshness requirements
Concurrency
Source complexity
Team size
Cost pressure
Governance needs
Business stage
In my experience, this setup is a strong fit when:
You are in the early 0-to-1 or 1-to-N stage
Most workloads are batch or near-real-time
Your team is comfortable with SQL
You want fast iteration without introducing too much infrastructure
For that kind of situation, Dagster + DuckDB + S3-compatible object storage feels like a very good balance.
What this architecture helped me do
This stack does not solve every problem.
But it helped me solve an important one:
How do I keep delivering value without letting cost or complexity grow too early?
That is why I like this setup.
It gave me a practical middle ground between under-building and over-engineering.
If you are in a similar stage, I think these are the three questions worth asking:
Do we really need distributed infrastructure right now?
Can SQL handle most of our transformations?
Are we okay with a simple, pragmatic setup first, then upgrading later when the bottleneck becomes real?
If you do not need distributed infrastructure yet, and the answer to the other two questions is yes, this stack is worth trying.
About the Author
I write about the modern data stack, data analysis, and cities.
I am especially interested in practical work with DuckDB, Dagster, and lightweight data systems.
I care about useful tools, clear thinking, and real-world data work.


