Managing a modern data lake with an AI-powered coding assistant


Introduction

I maintain a sophisticated data lake built on the medallion architecture pattern. Managing dozens of SQL transformations, ensuring data quality, and iterating quickly on analytics pipelines can be challenging—especially as the codebase grows.

Enter Antigravity, Google’s agentic AI coding assistant. By integrating Antigravity into our development workflow, I’ve transformed how I build, debug, and maintain our data infrastructure. In this post, I’ll share how I’ve configured Antigravity to understand our data lake, the rules and workflows I’ve created, and real-world use cases that have dramatically improved our productivity.


Our Data Lake Architecture: The Medallion Pattern

Our data lake follows the medallion architecture, a design pattern that organizes data into progressively refined layers:

┌──────────────────────────────────────────────────────────────────────┐
│                        MEDALLION ARCHITECTURE                        │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   ┌──────────┐     ┌───────────────┐     ┌─────────┐                 │
│   │  BRONZE  │ ──► │ BRONZE_SILVER │ ──► │ SILVER  │                 │
│   └──────────┘     └───────────────┘     └─────────┘                 │
│       Raw Data        Cleaned &           Single                     │
│       + Aliases       Structured          Source of                  │
│                                           Truth                      │
│                                               │                      │
│                                               ▼                      │
│                                        ┌─────────────┐               │
│                                        │ SILVER_GOLD │               │
│                                        └─────────────┘               │
│                                           Complex                    │
│                                           Aggregations               │
│                                               │                      │
│                                               ▼                      │
│                                         ┌────────┐                   │
│                                         │  GOLD  │                   │
│                                         └────────┘                   │
│                                          Analytics-                  │
│                                          Ready Data                  │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Layer Breakdown

| Layer         | Description                                                       | Example Tables                                 |
|---------------|-------------------------------------------------------------------|------------------------------------------------|
| Bronze        | Raw data ingestion with aliases and basic filtering               | brz_crm_opportunity_scd                        |
| Bronze-Silver | Initial cleaning, casting, and structuring                        | brz_slv_opportunity_daily                      |
| Silver        | Conformed, business-level views—the “single source of truth”      | slv_opportunity_daily_snapshot                 |
| Silver-Gold   | Intermediate aggregations for specific business use cases         | slv_gld_retention_daily_snapshot               |
| Gold          | Highly aggregated, analytics-ready datasets for BI and reporting  | gld_retention_daily, gld_user_analytics_daily  |

Each SQL file in our repository corresponds directly to a table name in the data lake, making navigation intuitive. The SQL transformations are compiled by a Python script (compile_layers.py) that generates _compiled.sql files—which Antigravity knows to ignore.
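
The compile step itself is not the focus of this post, but for orientation, here is a minimal sketch of what a script like compile_layers.py could do. This is an assumption-heavy illustration rather than the real script: the {{schema}} placeholder and the layer directory names are hypothetical.

#!/usr/bin/env python3
"""Illustrative sketch of a compile step like compile_layers.py (not the real script)."""
from pathlib import Path

# Assumed layer directories, named after the medallion layers described above.
LAYER_DIRS = ["bronze", "bronze_silver", "silver", "silver_gold", "gold"]
SCHEMA = "analytics"  # all tables live in this schema (see the databricks.md rule)

def compile_layers(root: Path = Path(".")) -> None:
    for layer in LAYER_DIRS:
        layer_dir = root / layer
        if not layer_dir.is_dir():
            continue  # skip layers that do not exist in this checkout
        for sql_path in sorted(layer_dir.glob("*.sql")):
            if sql_path.name.endswith("_compiled.sql"):
                continue  # never recompile generated output
            sql = sql_path.read_text()
            # Hypothetical templating: substitute a schema placeholder.
            compiled = sql.replace("{{schema}}", SCHEMA)
            sql_path.with_name(sql_path.stem + "_compiled.sql").write_text(compiled)

if __name__ == "__main__":
    compile_layers()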


Teaching Antigravity About Our Repository

The key to making Antigravity effective is context. Rules and workflows teach the agent our repository structure, our conventions, and how to interact with our data infrastructure.

Rules: Always-On Context

I store rules in .agent/rules/ as markdown files. These provide persistent context that Antigravity references in every interaction:

databricks.md — Connecting to Our Data Warehouse

---
trigger: always_on
---

In order to connect to the databricks sql cluster, use the `dbsqlcli -e "<QUERY>"` 
terminal command. All tables use the schema `analytics`.

This rule ensures Antigravity knows how to execute SQL queries against our Databricks cluster. With this context, the agent can:

  • Run ad-hoc queries to validate data
  • Check record counts after transformations
  • Verify data quality constraints

medallion-tables.md — Understanding Our Architecture

---
trigger: always_on
---

This repository consists of SQL and Python files used to populate tables and views 
of a data lake following the "medallion architecture". The layers consist of:

- bronze - raw data with aliases and some filtering
- bronze_silver - Initial cleaning, casting, and structuring
- silver - Conformed, business-level view, the "single source of truth"
- silver_gold - Intermediate tables with complex aggregations
- gold - Analytics-ready datasets for BI and reporting

This rule teaches Antigravity the semantic meaning of each directory and layer, enabling it to make intelligent decisions about where to place new tables or find existing transformations.

ignore.md — Avoiding Noise

---
trigger: always_on
---

Ignore the following directories:
@/keep/
@/apps/
@/prompts/
@/tasks/

Some directories contain experimental code, archived work, or unrelated applications. This rule prevents Antigravity from surfacing irrelevant files.


Workflows: Reusable Multi-Step Processes

Workflows are stored in .agent/workflows/ and define repeatable procedures triggered by slash commands.
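
Taken together, the agent configuration lives alongside the code in a small directory tree like the one below. The rule files are the ones shown above; the workflow file names are my shorthand for the slash commands described next.

.agent/
├── rules/
│   ├── databricks.md
│   ├── medallion-tables.md
│   └── ignore.md
└── workflows/
    ├── check-pk.md      # exposed as /check-pk
    └── lineage.md       # exposed as /lineage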

/check-pk — Validate Primary Key Uniqueness

---
description: Checks that there are no duplicates of the primary key.
---

Run a SQL query against the databricks sql cluster to check that there are 
no duplicates in the table and primary key(s) mentioned in the prompt. 
Return the number of duplicates.

Usage:

/check-pk slv_orders_daily on order_date, order_id

Antigravity will generate and execute a SQL query like:

SELECT order_date, order_id, COUNT(*) as cnt
FROM analytics.slv_orders_daily
GROUP BY order_date, order_id
HAVING COUNT(*) > 1

This is invaluable for validating data integrity after modifying transformations.

/lineage — Discover Data Lineage

---
description: Find the uses and data lineage of a given table.
---

1. Execute the utility to find definition and usages of the table.
// turbo
2. Run the command: `python3 utils/lineage.py {table_name}`
3. Analyze the output to identify which files define and use the table.
4. If appropriate, summarize the lineage for the user.

Usage:

/lineage slv_orders_daily

The agent executes our custom lineage utility, which scans all SQL files to find:

  • Where the table is defined (the source SQL file)
  • Where the table is used (downstream dependencies)

This provides instant impact analysis before making changes.
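
For illustration, here is a minimal sketch of what a utility like utils/lineage.py could look like. It is not the actual script, and the real one may use different heuristics, but the core idea is a scan over the repository's .sql files: a CREATE statement naming the table counts as a definition, and any other mention counts as a usage.

#!/usr/bin/env python3
"""Rough sketch of a lineage helper in the spirit of utils/lineage.py."""
import re
import sys
from pathlib import Path

def find_lineage(table_name: str, root: Path = Path(".")):
    # A CREATE TABLE/VIEW statement naming the table marks the defining file.
    define_re = re.compile(
        rf"CREATE\s+(OR\s+REPLACE\s+)?(TABLE|VIEW)\s+(IF\s+NOT\s+EXISTS\s+)?"
        rf"(\w+\.)?{re.escape(table_name)}\b",
        re.IGNORECASE,
    )
    mention_re = re.compile(rf"\b{re.escape(table_name)}\b", re.IGNORECASE)

    defined_in, used_in = [], []
    for path in sorted(root.rglob("*.sql")):
        if path.name.endswith("_compiled.sql"):
            continue  # generated files would only duplicate the results
        text = path.read_text(errors="ignore")
        if define_re.search(text):
            defined_in.append(path)
        elif mention_re.search(text):
            used_in.append(path)
    return defined_in, used_in

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python3 utils/lineage.py <table_name>")
    table = sys.argv[1]
    defined, used = find_lineage(table)
    print(f"Searching for table: {table}\n")
    print("DEFINED IN:")
    for p in defined:
        print(f"  {p}")
    print(f"\nUSED IN ({len(used)} files):")
    for p in used:
        print(f"  {p}")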


Connecting to Databricks via CLI

A critical piece of our setup is the Databricks SQL CLI (dbsqlcli). This allows Antigravity to execute queries directly against our data warehouse.

Setup

  1. Install the CLI:

    pip install databricks-sql-cli
    
  2. Configure environment variables:

    export DBSQLCLI_HOST_NAME="your-workspace.cloud.databricks.com"
    export DBSQLCLI_HTTP_PATH="/sql/1.0/warehouses/your_warehouse_id"
    export DBSQLCLI_ACCESS_TOKEN="your_access_token_here"
    
  3. Run queries:

    dbsqlcli -e "SELECT count(*) FROM analytics.gld_retention_daily"
    

With this integration, Antigravity can autonomously:

  • Run validation queries
  • Explore table schemas with DESCRIBE
  • Sample data to understand transformations
  • Debug data issues by inspecting actual records

Real-World Use Cases

1. Making Changes to the Pipeline

Scenario: Add a new column renewal_probability from your CRM to the retention snapshot.

Prompt:

Add the renewal_probability field from crm.opportunity to our retention pipeline, flowing through bronze → silver → gold.

Antigravity’s Approach:

  1. Identifies the bronze table (brz_crm_opportunity_scd.sql)
  2. Adds the column with the correct alias (renewal_probability_c AS renewal_probability)
  3. Propagates to silver (stg_slv_opportunity_daily_snapshot.sql)
  4. Updates the downstream aggregations that feed gold (slv_gld_retention_daily_snapshot.sql)
  5. Ignores _compiled.sql files (per our rules)

Time saved: What would take 30+ minutes of manual file navigation and editing is completed in seconds.


2. Adding New Data Assets

Scenario: Create a new silver table for tracking account health scores.

Prompt:

Create a new silver table called slv_account_health that combines data from our engagement tables and CRM accounts.

Antigravity’s Approach:

  1. Creates silver/slv_account_health.sql
  2. Uses the correct schema prefix (analytics)
  3. References existing tables appropriately
  4. Follows our naming conventions (snake_case, layer prefix)

3. Discovering Data Lineage

Scenario: Before modifying slv_orders_daily, understand its dependencies.

Prompt:

/lineage slv_orders_daily

Output:

Searching for table: slv_orders_daily

DEFINED IN:
  notebooks/stg_slv_orders_daily.sql

USED IN (3 files):
  silver_gold/slv_gld_revenue_daily_snapshot.sql
  gold/gld_revenue_daily.sql
  notebooks/dev_slv_org_daily.sql

Now we know exactly which downstream tables will be affected by any changes.


4. Debugging Data Issues

Scenario: A stakeholder reports that retention numbers look incorrect for Q4.

Prompt:

Check if there are any duplicate records in gld_retention_daily for Q4 2024 on account_id and snapshot_date.

Antigravity’s Approach:

  1. Constructs a validation query:

    SELECT account_id, snapshot_date, COUNT(*) as cnt
    FROM analytics.gld_retention_daily
    WHERE snapshot_date >= '2024-10-01' AND snapshot_date < '2025-01-01'
    GROUP BY account_id, snapshot_date
    HAVING COUNT(*) > 1
    
  2. Executes via dbsqlcli
  3. Reports findings with analysis

5. Data Analysis and Insights

Scenario: Quickly get top accounts by ARR for executive reporting.

Prompt:

What are the top 10 accounts by ARR for this quarter?

Antigravity’s Approach:

  1. Understands our table structure (knows ARR is in retention tables)
  2. Writes and executes the query:

    SELECT account_name, arr_usd
    FROM analytics.slv_gld_retention_daily_snapshot
    WHERE snapshot_date = CURRENT_DATE()
    ORDER BY arr_usd DESC
    LIMIT 10
    
  3. Returns formatted results

6. Ad-Hoc Schema Exploration

Scenario: You need to understand what columns are available in a table.

Prompt:

Describe the columns in brz_crm_opportunity_scd

Antigravity’s Approach:

  1. Runs DESCRIBE TABLE analytics.brz_crm_opportunity_scd
  2. Returns a formatted table of column names, types, and descriptions

Benefits We’ve Experienced

| Before Antigravity                                  | After Antigravity                                |
|-----------------------------------------------------|--------------------------------------------------|
| Manual file navigation across 100+ SQL files        | Natural language queries to find relevant code   |
| Copy-paste errors when propagating column changes   | Automatic multi-file updates with consistency    |
| Running queries in separate Databricks notebooks    | Inline query execution with immediate results    |
| Tribal knowledge about table dependencies           | Instant lineage discovery                        |
| Time-consuming data validation                      | One-command primary key checks                   |

Conclusion

By teaching Antigravity about our medallion architecture through rules and workflows, and connecting it to Databricks via the dbsqlcli, I’ve created a powerful development environment that:

  • Understands context — Knows our table layers, naming conventions, and file structure
  • Executes queries — Can validate data, explore schemas, and debug issues in real-time
  • Accelerates development — Reduces multi-file changes from minutes to seconds
  • Preserves knowledge — Workflows encode best practices that any team member can use

The combination of AI-assisted coding with direct database connectivity has fundamentally changed how we build and maintain our data lake. What once required deep institutional knowledge and careful manual work is now accessible through natural language conversations.


If you’re managing a data lake and haven’t explored agentic AI coding tools, I highly recommend giving Antigravity a try. The time investment in setting up rules and workflows pays dividends immediately.


Resources: