
AI-Powered Data Engineering: My Stack for Faster, Smarter Analytics
Build an AI-assisted analytics engineering stack that eliminates context switching and accelerates dbt workflows. Learn how to integrate Roo, Claude Code, MCP servers, and Motherduck for an intelligent, automated development experience.

A few weeks ago, I hit a breaking point. I'd just wrapped up a call with seventeen action items, half of which involved translating obscure business requirements into dbt models while cross-referencing documentation scattered across Confluence, Notion, and someone's Google sheet named "Copy of Copy of Copy of Copy of Untitled".
The thought of manually implementing all of this made me want to throw my laptop out the window. So instead, I decided to build something better: an AI-powered data engineering workflow that actually works.
The Vision
Here's what I wanted: an environment where I could dump my meeting notes into a system and get back working dbt models, properly formatted SQL, and documentation that actually matches what stakeholders asked for. No more context switching between seventeen browser tabs. No more "wait, which Slack thread had that metric definition?"
The stack I landed on:
- VS Code as the editor (because obviously)
- Roo as the agentic framework
- Claude Code as the LLM
- CLAUDE.md for project context
- dbt for transformations
- MCP servers for everything else
Let me walk you through how I set this up, including all the sharp edges I hit along the way.
Setting Up the Editor
First things first: VS Code. I know, revolutionary choice. But the magic isn't in VS Code itself; it's in how we wire it up to talk to everything else.
The real game-changer here is Roo. If you haven't used it yet, Roo is essentially a code-aware AI assistant that can actually “understand” your entire codebase.
- Roo integrates directly into VS Code, acting as an autonomous coding partner that can understand natural language, generate code, refactor, debug, and answer questions about your codebase.
- It can read/write files, execute terminal commands, and even automate browser tasks. Users can customize Roo’s behavior through Custom Modes, connect to any OpenAI-compatible model, or run it with local models.
- Roo uses MCP to add custom tools and integrations (e.g., APIs, databases, specialized dev utilities), making it adaptable for roles like system architect, QA engineer, or product manager.
But here's the thing: the default setup uses CLI-based search, which is... fine. But we can do better.
```bash
# Spin up Qdrant for semantic code search
docker run -p 6333:6333 qdrant/qdrant
```
Once it's up and running, configure codebase indexing in Roo.
Now, Roo can build a proper semantic index of your codebase. This means when you ask "where do we calculate customer lifetime value?", it actually finds the relevant code instead of just grepping for "LTV" and hoping for the best.
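One tweak worth making: give Qdrant a persistent volume so the index survives container restarts. A minimal sketch, using Qdrant's default storage path:
```bash
# Same as above, but detached and with a named volume for the index
docker run -d --name qdrant \
  -p 6333:6333 \
  -v qdrant_storage:/qdrant/storage \
  qdrant/qdrant
```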
Cool, right? Yeah, but we’re far from done.
I’m using Claude Code as the LLM of choice. After some testing, it just feels like the most capable “model” to work with, although performance varies with the task you give it and the model version. If you have access to multiple providers, it’s worth experimenting with all of them.
For the bulk of my tasks, I haven’t needed any finer tuning of Roo’s configuration, but maybe we can take a look at that in a follow-up post.
The CLAUDE.md File: Your AI's Cheat Sheet
Here's a trick I learned: Claude automatically pulls in any file named CLAUDE.md when starting a conversation. This is perfect for encoding all those unwritten rules and tribal knowledge that usually live in senior engineers' heads.
I structure mine like this:
```markdown
# AI-Assisted Jaffle Shop dbt Project
This is a dbt project that uses Motherduck as its data warehouse.
## Common Commands
### dbt Commands
- `dbt run`: Build all models in the project
- `dbt run --select staging`: Build only staging models (materialized as views)
- `dbt test`: Run all data tests
- `dbt seed --full-refresh --vars '{"load_source_data": true}'`: Load seed data
### Making Changes
1. Research: Consult "Jaffle Shop Corporate Knowledge Base" in Notion for relevant metrics definitions and stakeholder context
2. Plan: Understand impact on downstream models using `dbt ls --select +model_name+`
3. Develop: Create/modify models in appropriate staging or marts folders
4. Document: Update corresponding YAML files with descriptions and tests
5. Compile: Run `dbt compile` to check for syntax and Jinja errors
6. Test Build: Run `dbt run --select <model_name>+` to test your changes
7. Validate: Run `dbt test --select <model_name>+` to validate data quality
8. Validate in warehouse: Execute queries in Motherduck to make sure the models are hydrated and run lightweight custom data quality checks
9. Cross-reference: Verify model outputs align with business definitions in corporate knowledge base
10. Lint: Run `sqlfluff lint` to check SQL style compliance
11. Commit: Use descriptive commit messages, pre-commit hooks will run automatically
### Notion Integration for Development
- Before building new models: Search corporate knowledge base for existing metrics definitions
- During development: Reference stakeholder discussions to understand business requirements
- For metrics questions: Check latest corporate conversations about calculation methodology
- When documenting: Align model descriptions with corporate terminology and definitions
```
This becomes your AI's reference manual. Every time Claude starts working on your project, it knows your conventions, your tools, and your workflow. Tailor these instructions to your own workflow as much as you can; they're especially useful if you also give Claude access to the same tools you'd use yourself.
One thing to keep in mind is context length: these instructions get stuffed into the prompt every time, leaving less room to work with. That's not necessarily a problem, but be mindful of a bloated CLAUDE.md file.
My current version is around 300 lines long, which seems to be a fairly good sweet spot and doesn’t ruin the performance of the model when working on tasks.
MCP Servers: The Secret Sauce
MCP (Model Context Protocol) servers are what turn Claude from a smart text generator into an actual data engineer. Think of them as tools Claude can use to interact with your infrastructure.
Setting them up in VS Code is straightforward. Add these to your settings (with your auth configs):
```json
{
  "mcpServers": {
    "dbt-mcp": {
      "command": "uvx",
      "args": [
        "--env-file",
        "/path/to/your/.env",
        "dbt-mcp"
      ]
    },
    "mcp-server-motherduck": {
      "command": "uvx",
      "args": [
        "mcp-server-motherduck",
        "--db-path",
        "md:",
        "--motherduck-token",
        "your-token-here"
      ]
    },
    "notion": {
      "command": "npx",
      "args": [
        "-y",
        "@notionhq/notion-mcp-server"
      ],
      "env": {
        "OPENAPI_MCP_HEADERS": "{\"Authorization\": \"Bearer xy\", \"Notion-Version\": \"2022-06-28\"}"
      }
    }
  }
}
```
This configuration snippet sets up MCP servers so Roo (or another MCP-compatible tool) can use external services as part of its extended capabilities:
- Each block under "mcpServers" defines a different MCP server, specifying the command to run and arguments to pass. For example:
  - dbt-mcp uses uvx to launch a dbt integration with environment variables from a .env file.
  - mcp-server-motherduck connects to MotherDuck using a token for authentication.
  - notion runs a Notion MCP server via npx and injects API credentials in env.
- Once configured, these servers act as custom tools Roo can call, enabling it to interact with dbt, query MotherDuck, or access Notion through structured API calls, all from within your coding workflow.
The dbt MCP server is particularly clever. It can:
- Execute dbt commands (core or cloud) and parse outputs
- Understand your project structure
- Access model metadata and documentation
- Even interact with your metrics layer if you have one
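For the dbt-mcp entry above, the .env file is where the server learns about your project. A sketch of what it might contain; the variable names here are my assumption, so verify them against the dbt-mcp README before copying:
```bash
# .env for dbt-mcp (variable names are assumptions; check the dbt-mcp docs)
DBT_PROJECT_DIR=/path/to/jaffle_shop   # local dbt Core project
DBT_PATH=/path/to/venv/bin/dbt         # dbt executable to invoke
DISABLE_SEMANTIC_LAYER=true            # no Semantic Layer in this demo
DISABLE_DISCOVERY=true                 # no dbt Cloud Discovery API either
```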
The Motherduck MCP server lets Claude query your warehouse directly. No more "can you check if this table exists?" followed by you alt-tabbing to DBeaver. Claude just... checks.
Oh, and Claude can look things up in Notion too, which is where all the documentation and call transcripts for this demo project live. There’s a Confluence MCP server too, for the brave.
dbt Configuration with Motherduck
I chose Motherduck because it's DuckDB in the cloud, which means it's fast, cheap, and doesn't require a PhD in cloud architecture to set up. Your profiles.yml looks like this:
```yaml
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: duckdb
      schema: dev_dani
      path: 'md:jaffle_shop'
      threads: 16
      extensions:
        - httpfs
```
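A quick note on authentication: with an `md:` path, dbt-duckdb needs a MotherDuck token to connect. The usual approach is to export it as an environment variable rather than hard-coding it anywhere; double-check the exact variable name against the current MotherDuck docs:
```bash
# Token comes from the MotherDuck UI; keep it out of version control
export motherduck_token='<your-motherduck-token>'
```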
To create your virtual Python environment, just do the usual dance (or use uv if you’re fancy):
```bash
# Create a fresh environment
mkdir ai-data-stack && cd ai-data-stack
python -m venv venv
source venv/bin/activate

# Install dbt with DuckDB adapter
pip install dbt-duckdb
```
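And since uv came up: the equivalent, roughly, looks like this (uv puts the environment in .venv by default):
```bash
# Same setup with uv instead of venv + pip
uv venv && source .venv/bin/activate
uv pip install dbt-duckdb
```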
And voilà, you should be able to run `dbt debug`. But why would you do it yourself when you can tell the AI to do it?
Here’s me telling Roo to “test the dbt connection”.
Yes, it cost me $0.05 to do this.
SQLFluff: Because Formatting Matters
Look, I know code formatting is the tabs-vs-spaces of the data world, but consistent SQL is readable SQL. Instead of setting up another MCP server, I just document the SQLFluff commands in CLAUDE.md:
```
pip install sqlfluff

# In CLAUDE.md:
- `sqlfluff lint`: Check SQL formatting
- `sqlfluff fix`: Auto-fix formatting issues
```
Claude reads this and knows to run SQLFluff after generating models. It's simple, it works, and it means I never have to look at uppercase SELECT statements again.
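If you want Claude and every human on the team linting against the same rules, a .sqlfluff file at the project root helps. A minimal sketch, assuming the DuckDB dialect and the dbt templater (the latter needs the sqlfluff-templater-dbt package):
```ini
# .sqlfluff — shared SQL style rules (a sketch)
[sqlfluff]
dialect = duckdb
templater = dbt

[sqlfluff:templater:dbt]
profiles_dir = ~/.dbt
```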
The Workflow in Action
Let me show you what this actually looks like. After a client call, I have notes like:
Need to add customer segmentation based on order frequency. Marketing wants to identify VIP customers (>10 orders or >$1000 total spend). Also need to track customer acquisition channels from the new utm_source field.
I literally paste this into Roo and watch the magic happen:
First, Roo queries the Notion MCP server to check if we have existing definitions for VIP customers. It finds a discussion between the CMO and CFO about the thresholds.
Then it uses the Motherduck MCP to inspect the current schema:
```sql
-- Roo runs this automatically
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'raw_orders';
```
It discovers the utm_source field doesn't exist yet in staging, so it updates the staging model:
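Roughly speaking, that change just surfaces the new field in stg_orders, something like this (source and column names here are assumptions, not the real project's):
```sql
-- models/staging/stg_orders.sql (sketch)
select
    id as order_id,
    customer_id,
    total_amount_cents,
    created_at,
    utm_source as acquisition_channel
from {{ source('jaffle_shop', 'raw_orders') }}
```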
Then it creates a new mart model for customer segmentation:
```sql
-- models/marts/customer_segments.sql
with customer_orders as (
    select
        customer_id,
        count(distinct order_id) as total_orders,
        sum(total_amount_cents) / 100.0 as lifetime_value,
        min(created_at) as first_order_date,
        max(acquisition_channel) as acquisition_channel
    from {{ ref('stg_orders') }}
    group by customer_id
),

segmented as (
    select
        *,
        case
            when total_orders > 10 or lifetime_value > 1000 then 'VIP'
            when total_orders > 5 or lifetime_value > 500 then 'Regular'
            else 'New'
        end as customer_segment
    from customer_orders
)

select * from segmented
```
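Per step 4 of the workflow, the matching YAML with descriptions and tests would look something like this (a sketch; the specific tests are my assumption):
```yaml
# models/marts/customer_segments.yml (sketch)
version: 2

models:
  - name: customer_segments
    description: "Customer-level segmentation based on order frequency and lifetime value"
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
      - name: customer_segment
        tests:
          - accepted_values:
              values: ['VIP', 'Regular', 'New']
```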
It runs dbt to build the models, executes tests, and even queries the results to verify the segmentation makes sense. All of this happens in about 30 seconds.
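That "queries the results" step is the kind of quick sanity check you could also run yourself through the Motherduck MCP, along the lines of:
```sql
-- Hypothetical sanity check: segment sizes and average lifetime value
select
    customer_segment,
    count(*) as customers,
    round(avg(lifetime_value), 2) as avg_lifetime_value
from customer_segments
group by customer_segment
order by customers desc;
```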
The Real Magic: Context Awareness
What makes this setup special isn't just the automation, it's the context awareness. When I ask Roo to "add that new supply chain KPI we discussed," it:
- Searches Notion for recent supply chain discussions
- Finds the email thread about inventory turnover calculations
- Checks existing models for related metrics
- Implements the KPI using our existing conventions
No more "wait, how did we decide to calculate that again?" No more digging through Slack threads from three months ago.
Sharp Edges and Workarounds
Not everything is perfect. Here are some of the issues I hit:
Claude is not great at generating SQL that is compatible with Motherduck, presumably because Motherduck is still fairly new. The workaround would be to put a few examples into CLAUDE.md to help the model out, but I didn't think it was critical enough to spend context on.
Motherduck's “did you mean xy” error messages are actually very helpful when the model runs into an error, since they give it a clear next step to try.
Another thing that came up is that the Notion MCP server is not very reliable. I ran into all kinds of issues with it and found no clear solutions; I'm chalking this up to it being new as well.
Overall, the biggest hurdle is collecting and connecting all the disparate components of the development environment. This will get better with time for sure, but for now, expect some plumbing.
Why This Actually Matters
Last week, I implemented fifteen new models based on a requirements doc. The old way would have taken me two days of context switching between documentation, SQL writing, and testing. With this setup? Three hours, including a coffee break.
But the real win isn't just speed, it's consistency. Every model follows our conventions. Every metric matches the business definition. Every piece of documentation is up to date.
What's Next
I'm working on adding a few more MCP servers:
- GitHub integration for automatic PR creation
- Slack server for pulling in those "quick questions" that turn into requirements
- Tableau/Looker integration to understand how models are actually used
The dream is a system where I can say "implement the Q4 board deck requirements" and come back to tested, documented, production-ready code. We're not quite there yet, but we're closer than I ever thought we'd be.
Try It Yourself
The whole setup takes about an hour if you follow the steps above. Start with just VS Code and Roo, then add MCP servers as you need them. The productivity gains are immediate, but the real value comes from having an AI assistant that actually understands your entire data stack.
Just remember to put your conventions in CLAUDE.md. Trust me, you'll thank yourself later when Claude automatically lowercases all your SQL keywords without being asked.
And please, for the love of all that is holy, use semantic code search. Your future self will thank you when you need to find "that weird edge case handler for fiscal year calculations" at 5 PM on a Friday.
Now if you'll excuse me, I have seventeen more action items to feed into my AI assistant. At least this time, I might actually make it home for dinner.
FAQs
1. What is an AI-assisted analytics engineering stack?
A development setup where an AI agent (here, Roo running Claude inside VS Code) can read your codebase and, through MCP servers, run dbt, query the warehouse, and pull business context from tools like Notion, so requirements can go from meeting notes to tested models with minimal manual plumbing.
2. Why use Roo and Claude Code together?
Roo provides the editor integration, semantic codebase indexing, and tool access; Claude provides the reasoning and code generation. Together they can plan, implement, test, and document dbt changes end to end.
3. What role does CLAUDE.md play in this setup?
It acts as the project's cheat sheet: conventions, common commands, and the step-by-step workflow that Claude reads at the start of every session, so its output matches how the team actually works.

About the author
Dani is a data professional with a rich background in data engineering and real-time data platforms. At Estuary, Dani focuses on promoting cutting-edge streaming solutions, helping to bridge the gap between technical innovation and developer adoption. With deep expertise in cloud-native and streaming technologies, Dani has successfully supported startups and enterprises in building robust data solutions.
