Vibe-Engineering LakeFlow Pipelines, the Advancing Analytics Way

Written by Simon Whiteley | Jan 20, 2026 10:56:12 AM

Want to watch this blog instead? Find the YouTube video here: https://youtu.be/GFa6Cf6GEA0 

If you work in data engineering today, you’ve probably seen the rise of vibe-coding - that moment where someone prompts an LLM, gets a pipeline out the other side, deploys it, and calls it a day. And in fairness, for the person who gets their clean table of data at the end, job done.

But for the engineering teams who have to support hundreds of those pipelines afterwards?
That’s where things begin to creak.

It’s not that LLM‑driven development is bad. Far from it. It’s that doing it without structure produces… well, let’s call it what it is: AI slop. A thousand pipelines all written slightly differently, none of them predictable, and all of them reliant on someone’s late‑night vibe-coding session. And as fun as that sounds, it’s not something you can sustainably build a platform on.

So at Advancing Analytics, we’ve been asking ourselves a very simple question:

How do we embrace the speed and power of LLM‑generated development, without creating chaos?

And over the past few months, behind the scenes, we’ve been building an answer.

The Future Everyone’s Talking About… and the Messy Middle We Actually Live In

There’s a lot of noise in the industry about where we’re heading: ontologies, agentic systems, decentralised data meshes that dynamically map your entire data estate on demand.

And yes, that is where we're heading. But we’re not there yet.

Right now, we’re in the messy middle; we still need repeatability, guardrails, engineering. Because the idea that every pipeline can be different, and an agent will happily maintain them all? Not today - and not tomorrow either.

So the challenge becomes:

How do we let people move fast using AI, without sacrificing structure, quality, or supportability?

The Breakthrough: What We’ve Been Building

Over the past month, we’ve been quietly prototyping something completely new. Something fast to deploy, surprisingly powerful, and designed to solve exactly the problem we’re all feeling.

We call it:

LakeForge

LakeForge is our new engineering framework for Databricks LakeFlow pipelines - a lightweight, opinionated way to get standardised, predictable, repeatable pipelines that can scale.

Think of it as the opposite of wild-west vibe coding.
This is vibe coding on rails.

What LakeForge actually does

At its core, LakeForge provides:

  • A standardised ETL pattern
    Boilerplates, decorators and configuration blocks create a consistent foundation - reducing delivery time, cutting down on rework, and making multi‑team collaboration genuinely scalable.
  • A library of reusable functions
    Data is loaded, cleaned and structured in predictable ways using Auto Loader, rescue columns, naming rules and more. The result: fewer defects, faster onboarding, and immediate alignment with enterprise data standards.
  • A design‑validation‑refinement loop
    Multiple agents (our 'Pantheon') design, review, critique and refine pipelines before anything is generated. This gives you higher‑quality pipelines from day one, massively reduces time spent in code reviews, and ensures governance is baked in from the start.
  • Deterministic output
    No free‑form “here’s some random Spark code”. LakeForge produces specification‑driven templates that look the same every time. That means predictable support costs, reliable production behaviour, and pipelines that remain maintainable long after the original developer has moved on.

In short, you point LakeForge at your data.
It analyses it, proposes transformations, validates them, iterates until they meet quality thresholds, and finally generates a full set of LakeFlow declarative pipeline files.

All in a couple of minutes - turning days of engineering effort into minutes of automated, high‑quality output.
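
To make that concrete, here's a minimal sketch of the kind of declarative pipeline file the process ends with. It isn't LakeForge's actual output - the table name, paths, rename map and expectations are all placeholders - but it shows the shape: Auto Loader ingestion, a rescue column for malformed records, and data quality expectations declared up front.

```python
# Hypothetical sketch of a generated bronze-layer pipeline file.
# Names, paths and rules are placeholders, not LakeForge's real output.
import dlt

RENAME_MAP = {"CustID": "customer_id", "OrderDt": "order_date"}

EXPECTATIONS = {
    "valid_customer_id": "customer_id IS NOT NULL",
    "valid_order_date": "order_date >= '2000-01-01'",
}

@dlt.table(
    name="bronze_orders",
    comment="Raw orders ingested via Auto Loader, standardised from a spec.",
)
@dlt.expect_all(EXPECTATIONS)  # failing rows are kept, but recorded in expectation metrics
def bronze_orders():
    # `spark` is provided by the pipeline runtime.
    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/Volumes/landing/orders/_schemas")
        .load("/Volumes/landing/orders/data")
    )
    # Apply the spec-driven rename map; malformed fields land in _rescued_data.
    for old, new in RENAME_MAP.items():
        df = df.withColumnRenamed(old, new)
    return df
```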

And this isn’t theoretical. We’re already testing it with client datasets, running it against real tables, validating how it scales, and hardening the prompts and quality gates to ensure it performs in real enterprise environments.

Under the Hood: The Technical Foundation of LakeForge

LakeForge isn’t just a pretty wrapper around LLM calls. Underneath, it’s built on a deliberate separation of determinism and creativity, which is what keeps the outputs consistent while still benefiting from AI‑assisted design.

Here’s a snapshot of the core engineering principles:

1. Declarative pipeline generation

LakeForge produces LakeFlow declarative specs, not handwritten notebooks.
These specs contain:

  • Rename maps
  • Data quality expectations
  • Transformation steps
  • Table‑level metadata
  • Lineage‑aware dependencies

Because everything is declared rather than coded, pipelines behave the same way regardless of who - or what - generated them.
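
As an illustration of what such a spec might carry (the field names here are our own invention for this post, not LakeForge's internal format), imagine something like:

```python
# Illustrative only - a made-up spec structure, not LakeForge's internal format.
order_spec = {
    "table": "bronze_orders",
    "source": {"path": "/Volumes/landing/orders/data", "format": "json"},
    "rename_map": {"CustID": "customer_id", "OrderDt": "order_date"},
    "expectations": {
        "valid_customer_id": "customer_id IS NOT NULL",
        "valid_order_date": "order_date >= '2000-01-01'",
    },
    "transformations": ["trim_strings", "standardise_timestamps"],
    "metadata": {"domain": "sales", "owner": "data_platform"},
    "depends_on": [],  # lineage-aware dependencies, resolved at generation time
}
```

Because a spec is just data, it can be versioned, diffed and validated like any other artefact, which is a large part of what keeps the generated pipelines predictable.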

2. Deterministic function library

All heavy lifting happens in a curated function set (our forge library). LLMs never write transformation logic themselves; they only provide specifications.
The execution logic lives in:

  • Auto Loader wrappers
  • Schema enforcement utilities
  • Pattern‑driven sanitisation functions
  • Incremental merge helpers
  • Schema evolution safety rails

This is what keeps pipeline behaviour identical across hundreds of tables.
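
To show the split in practice, here's a rough sketch of what a spec-driven loader and a sanitisation helper could look like. The function names and signatures are assumptions made for this example, not the forge library's real API.

```python
# Hypothetical forge-style helpers: the LLM supplies the spec, this code never changes.
from pyspark.sql import DataFrame, SparkSession


def load_with_autoloader(spark: SparkSession, spec: dict) -> DataFrame:
    """Read a source with Auto Loader and apply the spec's rename map."""
    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", spec["source"]["format"])
        .option("cloudFiles.schemaLocation", spec["source"]["path"] + "/_schemas")
        .load(spec["source"]["path"])
    )
    for old, new in spec.get("rename_map", {}).items():
        df = df.withColumnRenamed(old, new)
    return df


def sanitise_column_names(df: DataFrame) -> DataFrame:
    """Pattern-driven sanitisation: trimmed, lower-case, snake_case column names."""
    for col in df.columns:
        df = df.withColumnRenamed(col, col.strip().lower().replace(" ", "_"))
    return df
```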

3. Agentic design loop with validation gates

Our Pantheon agents operate in a controlled sequence:

  1. Profiling
  2. Pipeline design proposal
  3. Syntactic validation
  4. Semantic validation
  5. Refinement or regeneration
  6. Template generation

Each stage has both LLM‑based reasoning and deterministic checks. If the LLM hallucinates a column or suggests an invalid rule, validation catches it, and only that component is regenerated - not the full pipeline.
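
A deterministic gate can be as simple as checking a proposed component against the profiled schema before anything is generated. The sketch below is an illustration of the idea, not the Pantheon's actual implementation:

```python
# Illustrative validation gate: a hallucinated column fails fast, and only the
# offending component goes back for regeneration.
def validate_rename_map(spec: dict, profiled_columns: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the gate passes."""
    problems = []
    for source_col in spec.get("rename_map", {}):
        if source_col not in profiled_columns:
            problems.append(f"rename_map references unknown column '{source_col}'")
    return problems


spec = {"rename_map": {"CustID": "customer_id", "OrderDt": "order_date"}}
profiled = {"CustID", "Amount"}  # the profiler never saw "OrderDt"
if issues := validate_rename_map(spec, profiled):
    print("Gate failed:", issues)  # regenerate just the rename map, not the pipeline
```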

4. Repeatability at scale

Because the spec format is consistent and the generation loop is automated, we can:

  • Ingest hundreds of tables at once
  • Apply organisation-wide data standards automatically
  • Refine and regenerate pipelines in bulk
  • Enforce naming conventions system‑wide

It means LakeForge is engineered for real enterprise workloads where scale and supportability matter. This is the part that turns vibe-coding from chaos into a legitimate engineering workflow.
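
To picture the bulk side: enforcing an organisation-wide naming convention across hundreds of tables is just a deterministic function applied in a loop. A toy sketch, with a made-up convention:

```python
# Toy illustration of applying an organisation-wide naming convention in bulk.
import re


def enforce_naming(table_name: str, domain: str) -> str:
    """Made-up convention: bronze_<domain>_<snake_case_source_name>."""
    snake = re.sub(r"[^a-z0-9]+", "_", table_name.lower()).strip("_")
    return f"bronze_{domain}_{snake}"


for name in ["Customer Orders", "ProductCatalogue", "web-events"]:
    print(enforce_naming(name, domain="sales"))
# bronze_sales_customer_orders
# bronze_sales_productcatalogue
# bronze_sales_web_events
```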

Why We Built It

Because real engineering isn’t about avoiding AI. It’s about harnessing it.

Data teams aren’t going to be replaced by agents. But data teams who know how to build systems that use agents well? They’re the ones who will shape the next decade of our industry.

Vibe-coding isn’t the enemy. The lack of structure around vibe-coding is.

LakeForge gives people the freedom to work fast, while still producing pipelines that are:

  • Scalable
  • Repeatable
  • Supportable
  • Clean
  • Grounded in engineering best practice

It brings order to what could otherwise become a very chaotic future.

What’s Next

This is just the beginning. We're hardening the framework and refining its prompts, and we're already working with clients on first iterations, gathering a huge amount of real-world feedback to improve it further.

LakeForge is becoming the foundation for how we think pipelines will be built over the next couple of years.

We’re incredibly excited by what’s possible, and even more excited to finally talk about what we’ve been working on behind the scenes.

Watch this space - LakeForge will be launched later this month. We can't wait to show you what it can do.