dbt Airbnb Analytics, South Aegean

Oct 1, 2024

dbt Airbnb Analytics, South Aegean thumbnail

Project Overview

The goal of this project is to build a modern data pipeline that covers:

  1. Loading raw Airbnb datasets from an AWS S3 bucket into Snowflake using a Storage Integration
  2. Cleaning, transforming, and testing the data with dbt (data build tool)
  3. Modeling key business metrics for occupancy, income, and reviews
  4. Visualizing insights in Metabase through interactive dashboards

Data Source

Public data from Inside Airbnb – South Aegean, stored in an S3 bucket. Datasets used:

  • listings.csv.gz – Metadata for each listing
  • calendar.csv.gz – Daily availability and pricing
  • reviews.csv.gz – Guest reviews and review dates

For an initial assessment and quality check, I profiled the datasets with the DataPrep profiling tool to identify the data structure, missing values, outliers, and potential quality issues such as inconsistent formats or invalid entries.
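DataPrep produces a full profiling report; once the raw files land in the warehouse, a similar spot-check can be reproduced in plain SQL. The table and column names below are illustrative, not the project's actual identifiers:

```sql
-- Quick data-quality spot-check on a raw landing table
-- (table and column names are assumed for illustration).
SELECT
    COUNT(*)                      AS row_count,
    COUNT(*) - COUNT(price)       AS null_prices,
    COUNT(*) - COUNT(DISTINCT id) AS duplicate_listing_ids,
    MIN(price)                    AS min_price,  -- negative values would be invalid entries
    MAX(price)                    AS max_price   -- extreme values flag outliers
FROM raw.listings;
```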

Tech Stack

dbt model diagram

The data pipeline uses a modern and modular stack. The process starts with [InsideAirbnb](https://insideairbnb.com/get-the-data/), which provides open Airbnb data.

  • Snowflake: Cloud data warehouse for storage
  • dbt: Data modeling (dimensional), transformation, testing, and documentation
  • SQL: Core transformation language
  • Metabase: Dashboard and data visualization

Key Metrics Modelled

🏠 Market Overview

  • Total listings by location, room type, and rental type
  • Average nightly price by location (neighbourhood, island)
    (dim_listings × dim_location)

💰 Occupancy & Income

  • Average income per host
    • by location
    • by season
    • by property type
      Insight: Identify which areas or property types are growing or declining in profitability
  • Income trend over time
  • Top income-generating neighbourhoods and islands
  • Occupancy rate over time (daily / weekly / monthly)
  • Seasonal occupancy per location or property type
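As a sketch of how the occupancy metric can be expressed against the star schema walked through later in this post (an attribute such as year_month on dim_date is an assumption):

```sql
-- Monthly occupancy rate per island; unavailable dates are treated as
-- booked, per the assumption discussed in the Challenges section.
SELECT
    d.year_month,                 -- assumed attribute on dim_date
    l.island_name,
    AVG(CASE WHEN f.is_available THEN 0 ELSE 1 END) AS occupancy_rate
FROM fct_occupancy f
JOIN dim_date     d ON f.date_key = d.date_key
JOIN dim_location l ON f.neighbourhood_key = l.neighbourhood_key
GROUP BY d.year_month, l.island_name
ORDER BY d.year_month, l.island_name;
```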

Reviews

  • Review volume over time
  • Review sentiment

Pipeline

1) Ingestion (Snowflake + S3)

  • Storage Integration: S3 role trust + stage (S3_AIRBNB_STAGE)
  • Load Strategy:
    • Method 1: Explicit schema (strict contracts)
    • Method 2: INFER_SCHEMA (rapid prototyping)
  • Landing Zone: RAW schema tables for listings, calendar, reviews
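The ingestion steps above translate roughly to the following Snowflake SQL. Integration, role, bucket, and file-format names are placeholders, not the project's real values:

```sql
-- 1) Trust relationship between Snowflake and the S3 IAM role
CREATE STORAGE INTEGRATION s3_airbnb_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<account_id>:role/<snowflake_role>'
  STORAGE_ALLOWED_LOCATIONS = ('s3://<bucket>/airbnb/');

-- 2) External stage on top of the integration
CREATE STAGE s3_airbnb_stage
  STORAGE_INTEGRATION = s3_airbnb_int
  URL = 's3://<bucket>/airbnb/'
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1
                 FIELD_OPTIONALLY_ENCLOSED_BY = '"' COMPRESSION = GZIP);

-- Method 1: load into a table with an explicit, pre-declared schema
COPY INTO raw.listings FROM @s3_airbnb_stage/listings.csv.gz;

-- Method 2: let Snowflake infer the schema for rapid prototyping
-- (INFER_SCHEMA requires a named file format object)
CREATE TABLE raw.reviews USING TEMPLATE (
  SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
  FROM TABLE(INFER_SCHEMA(
    LOCATION    => '@s3_airbnb_stage/reviews.csv.gz',
    FILE_FORMAT => 'csv_gz_format')));
```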

2) Transformation (dbt)

Layer overview

The project follows a layered dbt approach:

  • Staging (stg_*): Direct, minimally transformed 1:1 reflections of raw sources, with column standardization and light cleaning.
    1. stg_calendar
      • Grain: listing_id × calendar_date (one row per listing per date appearing in the source).
      • Purpose: Pass through availability and nightly constraints.
      • Normalization: Denormalized relative to source; acceptable at staging.
    2. stg_listings
      • Grain: listing_id (one row per listing snapshot in the source; filtered to rows where price is not null).
      • Contains host, neighbourhood, pricing, and review score attributes. Quite wide and intentionally not normalized here.
      • Normalization: Low (host + location + listing details together). Appropriate for staging.
    3. stg_reviews
      • Grain: review_id (one row per review).
      • Normalization: Minimal; raw review text carried forward.
  • Dimensions (dim_*): Conformed, surrogate-keyed entities.
    1. dim_date
      • Grain: date_day (unique calendar date).
      • Key: date_key (YYYYMMDD numeric surrogate).
      • Normalization: Fully atomic (one row per date). Pure time dimension.
    2. dim_host
      • Grain: host_id (distinct hosts from listings).
      • Key: host_key surrogate; natural key host_id retained.
      • Normalization: Moderate; derived attributes (segment, tenure) live here instead of in separate tables, which is fine for an analytics star schema.
    3. dim_location
      • Grain: (neighbourhood_raw, island_name) pair; surrogate neighbourhood_key.
      • Attributes: island_group for hierarchical rollups.
      • Normalization: Slight snowflaking could split island vs. neighbourhood, but the current design keeps the hierarchy in one table, which is pragmatic.
    4. dim_listings
      • Grain: listing_id.
      • Key: listing_key surrogate.
      • Foreign keys: host_key, neighbourhood_key.
      • Contains classification buckets (size, bedroom_category, price_category, review_category, availability_category) and derived metrics (estimated_occupancy_rate_pct).
      • Normalization: Intentional denormalization for performance and readability; the star schema's central dimension.
    5. dim_calendar
      • Grain: listing_key × date_key.
      • Key: calendar_key surrogate (bridge).
      • Acts as a periodic snapshot dimension / bridge between date and listing with availability attributes.
      • Normalization: Proper (each atomic combination appears once). Serves as the base for the occupancy fact.
  • Facts (fct_*, incremental): Atomic event and snapshot tables.
    1. fct_occupancy
      • Grain: listing_key × date_key (daily).
      • Measures: is_available (degenerate boolean), nightly_price (copied from the listing), plus dimension foreign keys (neighbourhood_key, host_key).
      • Type: Transactional-ish / periodic snapshot fact.
      • Normalization: Standard fact table referencing conformed dims.
    2. fct_reviews
      • Grain: review_id (one row per review event).
      • Measures/Attributes: sentiment_compound, sentiment_label.
      • Foreign linkage: listing_key (fact-to-dim join).
      • Normalization: Proper atomic event fact.
      • Enhancements: Calculated a sentiment score and category with a Python UDF using VADER. 160,290 of 976,657 rows (about 16.4%) have a sentiment_compound of 0, indicating reviews written in languages other than English.
      • Used a LanguageDetector to identify each review's language, but translation is omitted due to cost (external API or Snowflake Cortex AI).
  • Marts (mart_*): Aggregated, analytics-ready rollups.
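As an illustration of the staging layer, a minimal stg_listings model might look like the following. Source column names follow Inside Airbnb's published schema, but the exact cleaning rules here are assumptions, not the project's actual code:

```sql
-- models/staging/stg_listings.sql (sketch)
with source as (
    select * from {{ source('raw', 'listings') }}
),

renamed as (
    select
        id                     as listing_id,
        host_id,
        neighbourhood_cleansed as neighbourhood_raw,
        room_type,
        -- Inside Airbnb ships price as a string such as "$1,250.00"
        try_to_decimal(replace(replace(price, '$', ''), ',', ''), 10, 2) as price,
        review_scores_rating
    from source
)

select * from renamed
where price is not null  -- matches the stated staging grain filter
```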
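Further downstream, a mart rolling up income per host by location and season could be sketched as below. It leans on the fact and dimension tables described above; the model name and the season attribute on dim_date are assumptions:

```sql
-- mart_income_per_host.sql (sketch; unavailable dates counted as booked,
-- per the assumption noted in the Challenges section)
select
    loc.island_name,
    dd.season,                -- assumed attribute on dim_date
    h.host_id,
    sum(case when not f.is_available then f.nightly_price else 0 end)
        as estimated_income
from {{ ref('fct_occupancy') }} f
join {{ ref('dim_date') }}     dd  on f.date_key = dd.date_key
join {{ ref('dim_location') }} loc on f.neighbourhood_key = loc.neighbourhood_key
join {{ ref('dim_host') }}     h   on f.host_key = h.host_key
group by 1, 2, 3
```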

3) Visualization (Metabase)

Metabase runs here in Docker. Alternatives include running the server as a Java JAR file or using Metabase Cloud (a free trial exists).

docker run -d -p 3000:3000 --name metabase metabase/metabase

Open http://localhost:3000

Dashboards

Market Overview

  • Total listings by island, neighbourhood, and room type
  • Average price by location
  • Distribution of property types ![[Pasted image 20251114000745.png]]

Occupancy & Income

  • Occupancy trends (daily/weekly/monthly)
  • Income per host by location, season, and property type
  • Top-performing neighbourhoods ![[Pasted image 20251114000803.png]]

Reviews & Sentiment
  • Review volume over time
  • Positive/neutral/negative sentiment
  • Location-level sentiment differences ![[Pasted image 20251114000721.png]]

4) Orchestration

Requirements

uv add dagster-dbt
uv add dagster-webserver

Initialize Dagster

dagster-dbt project scaffold --project-name dbt_dagster_project_airbnb --dbt-project-dir=airbnb

Run the Dagster server

uv run dagster dev

Challenges

  • Availability ≠ booking certainty: is_available does not necessarily mean a listing is booked; the date could be blocked by the host, reserved privately, and so on. From publicly available (scraped) data we cannot know this for sure, which is why the fact table is named fct_occupancy rather than fct_bookings. For price-related metrics, the working assumption is that unavailable dates count as booked; this proxy is clearly flagged in metrics and dashboards.
  • This data model does not capture time-varying prices: nightly_price is copied from the listing rather than taken from the daily calendar.

Key Takeaways

  • Gained practical experience with Snowflake and with layered dimensional modelling in dbt.
  • Learned a new tool to build interactive dashboards in Metabase and structure them around key business questions (occupancy, revenue, reviews).

Further Enhancements

  • Automate ingestion when new data is loaded into the S3 bucket
  • SCD type 2 for listings
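The SCD Type 2 enhancement maps naturally onto a dbt snapshot. A minimal sketch, where the snapshot name and check columns are assumptions:

```sql
-- snapshots/listings_snapshot.sql (sketch of the planned SCD2 history)
{% snapshot listings_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='listing_id',
        strategy='check',
        check_cols=['price', 'room_type']
    )
}}

select * from {{ ref('stg_listings') }}

{% endsnapshot %}
```

Each time a tracked column changes, dbt closes the current row and inserts a new version, preserving listing history over successive Inside Airbnb scrapes.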