
Air pollution is a growing public health and environmental concern for many urban regions in Nepal. Rapid urbanization, increased vehicular traffic, industrial activity, and seasonal factors such as temperature inversions, pressure, and wind speeds contribute to fluctuating air quality levels across cities. This article analyzes Air Quality Index (AQI) patterns across major cities in Nepal and examines how specific pollutants, along with weather conditions like temperature and wind speed, influence overall air quality.
Air quality data is inherently time-sensitive, and pollution levels can change rapidly due to weather, traffic, industrial activity, or seasonal effects. Meaningful analysis often depends on understanding what's happening right now, as well as how current conditions compare to recent history. This makes air quality data a strong fit for right-time pipelines, where data freshness is configurable rather than fixed at either the real-time or batch extreme.
In this article, we'll walk through a practical, end-to-end workflow for building an AI-ready air quality analytics pipeline using Estuary and MotherDuck. Estuary ingests hourly air pollution data for major cities in Nepal from the OpenWeather API via an HTTP Webhook capture and streams it into MotherDuck, a DuckDB-powered analytics warehouse. Prompt-based AI workflows within MotherDuck then transform the aggregated metrics into human-readable weekly air quality summaries for each city, producing conversational output suitable for reporting and RAG applications.
Key Takeaways
Right-time pipelines ensure air quality data remains fresh and actionable.
MotherDuck + DuckDB provides fast, low-overhead analytics without heavy infrastructure.
Prompt-based SQL enables explainable AI summaries grounded directly in data.
The architecture scales naturally to additional cities, pollutants, and AI agents.
On Estuary
Estuary is a right-time data platform designed to unify cost, latency, and deployment flexibility within one dependable managed system. Instead of stitching together separate tools for change data capture, batch ingestion, streaming pipelines, and application synchronization, Estuary provides a single platform that supports analytics, operational workloads, and AI use cases across a wide range of latencies and environments.
When working with air quality data at scale, the objective is not just to collect measurements, but to:
- Analyze trends as they evolve
- Explore anomalies quickly
- Operationalize insights for dashboards, alerts, and AI-driven explanations
To support these workflows, data must be clean, structured, and continuously usable. Raw API responses alone are not sufficient; they need to be normalized, stored reliably, and made available to downstream systems with predictable freshness. Estuary ensures seamless integration of air quality API data into MotherDuck through continuous ingestion and real-time synchronization of the latest measurements.
Users can set up seamless data pipelines from multiple sources to MotherDuck, ensuring efficient and timely data availability for analysis. This integration leverages Estuary's robust data movement capabilities to enhance the performance and utility of MotherDuck’s DuckDB-based architecture.
On MotherDuck
MotherDuck is a managed, serverless cloud service built on top of DuckDB that brings the same fast, lightweight query engine into a shared cloud environment. It preserves DuckDB’s simplicity and performance while adding the features needed for team-based production analytics:
- Serverless architecture: No servers to provision or manage. MotherDuck automatically scales to meet analytical workloads.
- Cost-effective: Well-suited for analytical workloads without the cost profile of large data warehouse platforms.
- Collaboration: Teams can share queries, datasets, and results in real time, working from a single source of truth.
- Secure secret storage: Credentials and external connections are managed safely in the cloud.
- Pipeline integration: Platforms such as Estuary can write directly into MotherDuck, ensuring analytics tables remain continuously up to date.
Why We Chose MotherDuck
While building air quality analytics and AI-driven explanations, we considered several popular data platforms, including Snowflake, Databricks, BigQuery, and Postgres. Each of these systems is powerful, but most proved either too expensive, too complex, or too operationally heavy for a workload centered on fast analytics, iterative exploration, and downstream AI use cases.
MotherDuck allows us to process, aggregate, and unify the incoming air quality data into a fast, reliable analytical dataset. Because MotherDuck is built on DuckDB, it's fast enough for both exploration and reporting. With MotherDuck handling analytics, we can layer on LLM-based agents that explain air quality trends using up-to-date data.
Overview of Our Pipeline
Datasets Used:
- Air Quality Data (OpenWeather Air Pollution API): The primary dataset used in this workflow comes from the OpenWeather Air Pollution API, which provides hourly city-level measurements for major air pollutants. This API aggregates data from multiple monitoring stations and models, making it suitable for regional and cross-city analysis. This project focuses on cities across Nepal.
- Current Weather Data (OpenWeather Weather API): To enrich the air quality measurements with environmental context, we also use weather data from the OpenWeather Weather API. This secondary dataset allows us to correlate air quality patterns with meteorological factors, such as low wind speeds or lack of rainfall, that can influence pollution levels.
Prerequisites:
- Estuary: Sign up for a free trial of Estuary, which will be used to ingest API data via HTTP/Webhooks and stream it into MotherDuck.
- OpenWeather API Key: Create an account with OpenWeather to access both the Air Pollution API and Weather API.
- Python Environment: Any IDE capable of running Python scripts will work for calling the OpenWeather APIs and sending data to Estuary’s HTTP Webhook endpoint.
- MotherDuck Account: Sign up for a 21-day free trial of MotherDuck, which will serve as the analytical warehouse and AI-ready storage layer. Retrieve your MotherDuck access token / API key from the account settings after sign-up.
- AWS S3 Account (Staging): An Amazon S3 bucket is used for staging or intermediate storage when loading data into MotherDuck.
Let's begin!
Step-by-Step Guide: Streaming API Data into MotherDuck Using Estuary
Step 1: Configure Estuary’s HTTP Webhooks Capture
1. Click on New Capture and search for HTTP Webhook.
2. Name your Capture: For example, air-quality-data. Then, create your own Authentication token.
3. Add the webhook ID name: air-quality-data. Then test and Save to Collection.
4. Once your changes are saved, click on your capture to see its Public Endpoint:
By clicking the link, you can test it:
You can directly use the following curl command to add API values into your POST link:
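A command along these lines works; the endpoint, token, and sample payload below are placeholders to replace with the values shown on your capture's page:

```shell
curl -X POST "https://<your-public-endpoint>/air-quality-data" \
  -H "Authorization: Bearer <your-auth-token>" \
  -H "Content-Type: application/json" \
  -d '{"city": "Kathmandu", "aqi": 3}'
```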
Save the above Webhook URL and Authentication Token securely, as both will be used later. Then create a second Estuary Webhook source the same way for the weather data.
Step 2: Build a Python Ingestion Script to Send Air Quality and Weather API Data to Estuary via HTTP/Webhooks
Air Quality API and Weather API Data: After signing up for a free OpenWeather account, retrieve your API key from the account dashboard and save it in a .env file or somewhere secure.
We'll be looking at 5 cities in Nepal, using two OpenWeather endpoints:
- The Air Pollution API returns historical-through-current air quality data, so we can pull a week of hourly records for all five cities in one pass.
- The Current Weather API only returns present conditions (it does not store history), so we poll it hourly with a Python script to accumulate historical weather data over time.
I. Fetch Air Quality Data:
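A minimal sketch of the air quality fetcher is below. The city list, coordinates, environment-variable names, and webhook URL are assumptions; substitute your own values.

```python
# fetch_nepal_air_quality.py -- illustrative sketch, not the article's exact script.
import json
import os
import urllib.request

# Environment-variable names are assumptions; set them to your own values.
OPENWEATHER_KEY = os.environ.get("OPENWEATHER_API_KEY", "")
ESTUARY_WEBHOOK = os.environ.get("ESTUARY_WEBHOOK_URL", "")
ESTUARY_TOKEN = os.environ.get("ESTUARY_AUTH_TOKEN", "")

# Five Nepali cities with approximate coordinates.
CITIES = {
    "Kathmandu": (27.7172, 85.3240),
    "Pokhara": (28.2096, 83.9856),
    "Lalitpur": (27.6644, 85.3188),
    "Biratnagar": (26.4525, 87.2718),
    "Bharatpur": (27.6833, 84.4333),
}

def air_pollution_url(lat, lon, api_key):
    """Build the OpenWeather Air Pollution endpoint URL for one location."""
    return ("http://api.openweathermap.org/data/2.5/air_pollution"
            f"?lat={lat}&lon={lon}&appid={api_key}")

def shape_record(city, payload):
    """Flatten one API response into a single document for Estuary."""
    entry = payload["list"][0]
    return {
        "city": city,
        "timestamp": entry["dt"],        # unix seconds
        "aqi": entry["main"]["aqi"],     # OpenWeather's 1-5 index
        **entry["components"],           # pm2_5, pm10, no2, o3, so2, co, ...
    }

def post_to_estuary(record):
    """POST one JSON document to the Estuary HTTP webhook capture."""
    req = urllib.request.Request(
        ESTUARY_WEBHOOK,
        data=json.dumps(record).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {ESTUARY_TOKEN}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()

if __name__ == "__main__" and OPENWEATHER_KEY:
    for city, (lat, lon) in CITIES.items():
        with urllib.request.urlopen(
                air_pollution_url(lat, lon, OPENWEATHER_KEY)) as resp:
            post_to_estuary(shape_record(city, json.load(resp)))
```

Each document carries the city name and unix timestamp so that downstream tables can be grouped per city per hour.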
II. Fetch Weather Data:
fetch_nepal_weather.py
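A sketch of the weather fetcher follows; environment-variable names and the record layout are assumptions chosen to mirror the air-quality script.

```python
# fetch_nepal_weather.py -- illustrative sketch, not the article's exact script.
import json
import os
import urllib.request

# Environment-variable names are assumptions; set them to your own values.
OPENWEATHER_KEY = os.environ.get("OPENWEATHER_API_KEY", "")
WEATHER_WEBHOOK = os.environ.get("ESTUARY_WEATHER_WEBHOOK_URL", "")
ESTUARY_TOKEN = os.environ.get("ESTUARY_AUTH_TOKEN", "")

CITIES = {
    "Kathmandu": (27.7172, 85.3240),
    "Pokhara": (28.2096, 83.9856),
    "Lalitpur": (27.6644, 85.3188),
    "Biratnagar": (26.4525, 87.2718),
    "Bharatpur": (27.6833, 84.4333),
}

def weather_url(lat, lon, api_key):
    """Current-weather endpoint; units=metric returns Celsius and m/s."""
    return ("https://api.openweathermap.org/data/2.5/weather"
            f"?lat={lat}&lon={lon}&units=metric&appid={api_key}")

def shape_weather(city, payload):
    """Keep just the fields we later correlate against air quality."""
    return {
        "city": city,
        "timestamp": payload["dt"],
        "temp_c": payload["main"]["temp"],
        "humidity": payload["main"]["humidity"],
        "wind_speed": payload["wind"]["speed"],
        "clouds_pct": payload["clouds"]["all"],
    }

def post_to_estuary(record):
    """POST one JSON document to the weather webhook capture."""
    req = urllib.request.Request(
        WEATHER_WEBHOOK,
        data=json.dumps(record).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {ESTUARY_TOKEN}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()

if __name__ == "__main__" and OPENWEATHER_KEY:
    for city, (lat, lon) in CITIES.items():
        with urllib.request.urlopen(
                weather_url(lat, lon, OPENWEATHER_KEY)) as resp:
            post_to_estuary(shape_weather(city, json.load(resp)))
```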
Run Cron job to run both scripts hourly:
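A crontab entry along these lines runs both scripts at the top of every hour; the interpreter path, script paths, and log locations are placeholders:

```shell
# Run both ingestion scripts hourly; adjust paths to your environment.
0 * * * * /usr/bin/python3 /path/to/fetch_nepal_air_quality.py >> /var/log/air_quality.log 2>&1
5 * * * * /usr/bin/python3 /path/to/fetch_nepal_weather.py >> /var/log/weather.log 2>&1
```

Staggering the second job by a few minutes keeps the two API calls from competing for the same rate-limit window.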
.env:
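A .env file along these lines keeps secrets out of the scripts; the variable names are illustrative, and you'd load them with python-dotenv or export them into the shell:

```shell
OPENWEATHER_API_KEY=your_openweather_api_key
ESTUARY_WEBHOOK_URL=https://<your-public-endpoint>/air-quality-data
ESTUARY_WEATHER_WEBHOOK_URL=https://<your-public-endpoint>/weather-data
ESTUARY_AUTH_TOKEN=your_estuary_token
```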
Output: once the cron job is running, you'll see hourly captured documents accumulating in both the weather and air quality collections in Estuary.
Step 3: Setting Up Estuary with MotherDuck
I. Configure AWS S3 Staging (Secure Intermediate Storage)
Before data can be written into MotherDuck, Estuary requires a temporary staging location. We'll use Amazon S3 for this purpose.
- In your AWS account, create an S3 Bucket for Staging (for example: motherduck-staging), and optionally create a folder inside the bucket. Copy the S3 URL, as it will be required during MotherDuck materialization setup.
- Navigate to IAM → Users.
- Create or use an IAM user that has read and write access to the S3 bucket.
- Attach a policy with at least the following actions:
- s3:PutObject
- s3:GetObject
- s3:ListBucket
- Select your IAM user, open the Security credentials tab, and click Create access key.
- Save both the access key and secret key securely. These credentials allow Estuary to write staged data into S3.
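The policy attached to the IAM user can be as narrow as the following sketch; the bucket name here assumes the motherduck-staging example above:

```shell
# iam-policy.json -- minimal S3 staging permissions (illustrative)
# {
#   "Version": "2012-10-17",
#   "Statement": [
#     {
#       "Effect": "Allow",
#       "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
#       "Resource": [
#         "arn:aws:s3:::motherduck-staging",
#         "arn:aws:s3:::motherduck-staging/*"
#       ]
#     }
#   ]
# }
```

Note that s3:ListBucket applies to the bucket ARN itself, while the object actions apply to the `/*` path.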
II. Create MotherDuck Account and Retrieve MotherDuck API Keys
Sign up for a MotherDuck account, and go to Integrations → Access Tokens → Click on Create Token and save it securely.
Step 4: Create a MotherDuck Materialization in Estuary
In Estuary, go to Destinations → click on New Materialization → Type MotherDuck.
Fill in the following fields:
- MotherDuck Service Token: Paste the service token retrieved earlier from MotherDuck.
- Database: Create a database in MotherDuck and add it here (For example: air_quality_db).
- Database Schema - Add a database schema name.
- For Staging Bucket Configuration, choose S3 and add the following:
- S3 Staging Bucket name: Add your S3 bucket name from earlier (For example: motherduck-staging).
- Access Key ID and Secret Access Key: Paste the values you saved earlier.
- AWS S3 Bucket Region: Add the region of your S3 bucket.
Link your Capture - Add the two Webhooks captures:
Output:
Step 5: Performing Analytical Queries in MotherDuck
Querying Air Quality Data:
The table shown above displays hourly air quality records for Kathmandu. Each row represents:
- One city
- One timestamp
- Pollutant measurements (PM2.5, PM10, NO₂, O₃, etc.)
- Calculated AQI
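A query along these lines produces that view; the database, table, and column names below assume the materialization settings used earlier and may differ in your setup:

```sql
-- Last 24 hourly readings for one city (names are illustrative)
SELECT city,
       to_timestamp(timestamp) AS observed_at,
       aqi, pm2_5, pm10, no2, o3
FROM air_quality_db.main.air_quality_data
WHERE city = 'Kathmandu'
ORDER BY observed_at DESC
LIMIT 24;
```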
Querying Weather Data:
These are weather measurements such as temperature, humidity, wind speed, and cloud cover captured hourly. Because the OpenWeather weather endpoint does not provide historical data, these values are accumulated over time via hourly polling.
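A sketch of the equivalent weather query, again assuming illustrative table and column names:

```sql
-- Most recent polled weather observations per city (names are illustrative)
SELECT city,
       to_timestamp(timestamp) AS observed_at,
       temp_c, humidity, wind_speed, clouds_pct
FROM air_quality_db.main.weather_data
ORDER BY observed_at DESC
LIMIT 24;
```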
Step 6: Enriching Data - Dominant Pollutant & Health Classification
WHO and EPA Health Classification
The World Health Organization’s Air Quality Guidelines (AQG) set a global target for national, regional, and city governments working to improve their citizens’ health by reducing air pollution. WHO offers global baseline recommendations, making them ideal for cross-country or multi-city analysis (e.g., comparing Nepal, India, and the U.S.).
The Environmental Protection Agency (EPA) is a federal US agency responsible for protecting human health and the environment through enforceable regulations. For air quality, the EPA defines the Air Quality Index (AQI) to translate pollutant concentrations into categories such as Good, Moderate, Unhealthy, and Hazardous. EPA classifications are widely used in operational dashboards and public alerts, making them practical for real-time decision-making with specific consideration for sensitive groups (children, elderly, people with asthma).
Using WHO and EPA health classifications to enrich air-quality data adds public-health context to raw pollutant measurements, transforming numbers into actionable insight. On their own, AQI values or pollutant concentrations are hard to interpret.
Health categories give LLMs and rule-based systems a semantic layer: “PM2.5 exceeded WHO limits but remains within EPA ‘Moderate’ range.” This could be useful for various use cases such as RAG pipelines, automated summaries, and alert agents.
Here's what the following query shows:
- A CASE expression that identifies the pollutant with the highest average concentration per day.
- Health categories derived using official thresholds:
- EPA (U.S.) 24-hour standards
- WHO 2021 daily guidelines
- Each day is labeled as:
- Good / Moderate / Unhealthy (EPA)
- Compliant / Above WHO Guidelines (WHO)
This converts numeric sensor data into human-interpretable health signals.
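A sketch of such a query is below. The table and column names are assumptions, and the cutoffs shown are the commonly cited PM2.5 values (EPA 24-hour breakpoints; WHO 2021 daily guideline of 15 µg/m³); verify them against the official tables before relying on the labels:

```sql
WITH daily AS (
  SELECT
    city,
    DATE_TRUNC('day', to_timestamp(timestamp)) AS day,
    AVG(pm2_5) AS avg_pm25,
    AVG(pm10)  AS avg_pm10,
    AVG(no2)   AS avg_no2,
    AVG(o3)    AS avg_o3
  FROM air_quality_data
  GROUP BY 1, 2
)
SELECT
  city,
  day,
  -- Dominant pollutant: whichever daily average is highest
  CASE GREATEST(avg_pm25, avg_pm10, avg_no2, avg_o3)
    WHEN avg_pm25 THEN 'PM2.5'
    WHEN avg_pm10 THEN 'PM10'
    WHEN avg_no2  THEN 'NO2'
    ELSE 'O3'
  END AS dominant_pollutant,
  -- EPA-style daily PM2.5 category (illustrative breakpoints)
  CASE
    WHEN avg_pm25 <= 9.0  THEN 'Good'
    WHEN avg_pm25 <= 35.4 THEN 'Moderate'
    ELSE 'Unhealthy'
  END AS epa_category,
  -- WHO 2021 daily PM2.5 guideline
  CASE WHEN avg_pm25 <= 15 THEN 'Compliant'
       ELSE 'Above WHO Guidelines'
  END AS who_category
FROM daily
ORDER BY city, day;
```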
Step 7: Generating AI-Ready Summaries Using prompt()
One great feature within MotherDuck is the prompt function, which allows you to interact with Large Language Models (LLMs) directly from SQL. You can generate both free-form text and structured data outputs.
Example of an AI-driven insight: “Air quality worsened due to PM2.5 exceeding the World Health Organization’s 24-hour guideline, increasing short-term respiratory risk, particularly for sensitive populations.”
MotherDuck’s built-in LLM function converts structured SQL results into natural-language summaries. These summaries are grounded strictly in database values, can be indexed for RAG applications, and are suitable for dashboards, alerts, and reports.
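A minimal sketch of generating such summaries with prompt(); the source table name and column set are assumptions standing in for whatever daily or weekly aggregate you built in the previous step:

```sql
SELECT
  city,
  prompt(
    'In two sentences, explain this week''s air quality for a general audience. '
    || 'City: ' || city
    || ', average PM2.5: ' || ROUND(avg_pm25, 1) || ' ug/m3'
    || ', dominant pollutant: ' || dominant_pollutant
  ) AS weekly_summary
FROM weekly_air_quality_summary;
```

Because the prompt is built entirely from column values, each generated sentence stays grounded in the underlying rows.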
Read more on prompt() in the MotherDuck documentation.
Another powerful feature of MotherDuck is automatic query correction: when a query fails with a syntax error, the UI suggests a fixed version, so you can write and refine SQL without getting stuck on typos.
Conclusion
In this tutorial, we built a complete, production-ready pipeline for air quality analytics and AI-driven interpretation using Estuary and MotherDuck. By combining right-time data ingestion with fast analytical querying and embedded LLM capabilities, we moved beyond static dashboards toward explainable, continuously updated insights. This approach is well-suited for public health monitoring, environmental reporting, and AI-powered decision support systems where data freshness and interpretability matter as much as accuracy.

About the author
Ruhee has a background in Computer Science and Economics and has worked as a Data Engineer at SaaS tech startups, where she automated ETL processes using cutting-edge technologies and migrated data infrastructures to the cloud with AWS/Azure services. She is currently pursuing a Master’s in Business Analytics with a focus on Operations and AI at Worcester Polytechnic Institute.