🟡 HubSpot Operations Practical Textbook — 2026 Edition
Chapter 4

Data Studio

“I want to analyze HubSpot's CRM data together with sales data in Snowflake.” “I want to match a budget table in Google Sheets against the deal pipeline.” Until now, the only options were to have engineers write ETL pipelines or to merge CSV files by hand. Data Studio lets you blend and transform multiple data sources, apply formulas, and build datasets in a spreadsheet-like no-code UI. This chapter covers the overall picture of Data Studio, connecting data sources, transformation and blending, AI formula suggestions, and pipeline design for each use case.

📖 Estimated reading time: 25 minutes
🎯 Target audience: RevOps, data analysts, BI engineers, HubSpot administrators
🔧 Required plan: Operations Hub Professional or higher

📋 Contents of this chapter

  1. 4-1 What is Data Studio? Differences from ETL tools and architecture
  2. 4-2 Connecting data sources—HubSpot CRM, external DBs, spreadsheets
  3. 4-3 Data transformation and blend design (joins, aggregations, filters)
  4. 4-4 Formula suggestions by AI—automatic generation of calculated columns and conditional classifications
  5. 4-5 Pipeline design by use case—reports, scoring, enrichment
Section 4-1

What is Data Studio? Differences from ETL tools and architecture

Data Studio is a no-code data integration and transformation platform added to HubSpot Data Hub in 2025. It combines and transforms HubSpot CRM data with multiple external data sources (Snowflake, BigQuery, AWS S3, Google Sheets, CSV files) in a spreadsheet-like UI, and publishes the result as a dataset for use within HubSpot.

📊 Data Studio — Revenue Attribution Dataset (example)

Sources: 🟡 HubSpot CRM / ❄️ Snowflake / 📊 Google Sheets
Transform: 🔗 JOIN settings / ⚙️ calculated columns / 🔍 filters
Output: dataset publication → 📈 report linkage

Preview — Revenue Attribution Dataset (150 rows): HubSpot Deals (CRM object) LEFT JOIN SF Revenue Data (Snowflake table), plus 3 AI-generated calculated columns.

| Deal Name | HS Amount | SF Revenue | Diff % | Status |
|---|---|---|---|---|
| ABC Co., Ltd. new introduction | ¥2,400,000 | ¥2,400,000 | 0.0% | ✓ Match |
| XYZ Corp Enterprise | ¥8,500,000 | ¥8,200,000 | -3.5% | ⚠ Confirmation required |
| Tanaka Seisakusho additional license | ¥1,200,000 | ¥1,200,000 | 0.0% | ✓ Match |

Calculated column [Diff %] = (SF_REVENUE - HS_AMOUNT) / HS_AMOUNT * 100

Differences from ETL tools and BI tools

| | Data Studio (HubSpot) | ETL tools (Fivetran / dbt / Airbyte) | BI tools (Looker / Tableau / Power BI) |
|---|---|---|---|
| Strengths | Completes within HubSpot (no tool switching); no-code UI, no engineers needed; native integration with CRM data; connects directly to HubSpot reports | Supports 200+ data sources; handles complex transformations and large-scale data; supports version control and CI/CD | Advanced visualization; real-time analysis of large-scale data; interactive drill-down analysis |
| Weaknesses | Supported data sources are limited; limits on complex transformation logic | Engineering knowledge required; cost and management of a separate tool; extra setup to join with HubSpot data | Hard to integrate bidirectionally with HubSpot; high cost ($50+/user/month); cannot write transformed values back to the CRM |
💡 Scenarios where Data Studio shines the most

"I want to combine HubSpot CRM data with one or two external data sources and use the results for HubSpot reporting and automation." Data Studio fits this need exactly. If large-scale, complex ETL is required, combining it with dbt/Fivetran is more appropriate. Data Studio can cover roughly 80% of the day-to-day data integration and reconciliation work that RevOps teams handle.

Section 4-2

Connecting data sources—HubSpot CRM, external DB, spreadsheets

There are two main types of data sources that can be connected to Data Studio: "HubSpot internal sources" and "external sources." Authentication settings are required to connect to external sources, and the sources available vary depending on the plan.

🟡 HubSpot CRM objects (internal source)
Connects to all properties and associated data for contacts, companies, deals, tickets, and custom objects. References the latest CRM values in real time.
Plan: Professional and above (all property access)

❄️ Snowflake (external data warehouse)
Connects directly to tables in a Snowflake database schema. Simply set the account identifier, warehouse name, role, and credentials, and tables become referenceable from Data Studio.
Plan: Enterprise (full integration) / Professional (read only)

🔷 BigQuery (external data warehouse)
Connects to BigQuery dataset tables in Google Cloud. Authenticates with a service account JSON key. Often used to combine GA4 or Google Ads data with HubSpot CRM.
Plan: Enterprise (full) / Professional (read only)

🟠 AWS S3 (object storage)
Reads CSV/Parquet files in an S3 bucket directly from Data Studio. Authenticates with an IAM role or access key. Used to combine with data uploaded to S3 on a regular schedule.
Plan: Professional and above

📊 Google Sheets (spreadsheet)
Connects to any sheet after authenticating with a Google account. The easiest option for merging data managed outside the CRM, such as budget tables, territory maps, and product price lists.
Plan: Professional and above

📄 CSV file upload (file source)
Upload a local CSV and use it as a one-off data source. Convenient for reference data that does not need regular updates (industry benchmarks, ICP classification tables, etc.). Maximum file size: 100 MB.
Plan: Professional and above

Best practices for connection settings

| Connection | Recommended authentication | Precautions |
|---|---|---|
| Snowflake | Dedicated service account + read-only role | Do not grant write permission to the production database. Create a read-only role dedicated to Data Studio |
| BigQuery | Service account JSON key + BigQuery Data Viewer role | Do not grant project owner privileges. Set a maximum scan volume to manage query costs |
| AWS S3 | IAM user (least privilege: s3:GetObject only) | Restrict access to a specific prefix (folder) rather than the entire bucket |
| Google Sheets | OAuth (Google account authentication) | A connection made with a departed employee's account stops retrieving data once that account is deactivated. Connect with a shared service account |
Section 4-3

Data transformation and blend design (joins, aggregations, filters)

The core of Data Studio is the ability to build pipelines that JOIN, transform, and aggregate multiple tables with no code. As you drag and drop each step in the UI, an SQL query is generated automatically behind the scenes.

🔄 Data Studio transformation pipeline (5 steps)

  1. 📥 Source selection: e.g. all contact properties, all company properties
  2. 🔗 JOIN settings: select the join key (e.g. email address / ID) and JOIN type (LEFT / INNER / FULL)
  3. 🔍 Filter: exclude unneeded rows (e.g. only Lifecycle Stage = Customer, only amount > 0)
  4. ⚙️ Transform / calculated columns: add new columns using formulas, conditional branching, and string operations. You can also ask AI for suggestions.
  5. 📤 Dataset publication: save the completed table as a dataset, available from HubSpot reports and workflows
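Conceptually, the five steps above behave like the following plain-Python sketch. This is purely illustrative: Data Studio generates SQL behind the scenes, and the field names and the ARR calculated column here are hypothetical examples, not HubSpot APIs.

```python
# Hypothetical input rows (field names are made up for illustration)
contacts = [
    {"email": "a@example.com", "lifecycle_stage": "customer"},
    {"email": "b@example.com", "lifecycle_stage": "lead"},
]
deals = [
    {"email": "a@example.com", "amount": 2400000},
]

# Step 2: LEFT JOIN on email — contacts with no matching deal keep None
deal_by_email = {d["email"]: d for d in deals}
joined = [
    {**c, "amount": deal_by_email.get(c["email"], {}).get("amount")}
    for c in contacts
]

# Step 3: filter — keep only Lifecycle Stage = customer
rows = [r for r in joined if r["lifecycle_stage"] == "customer"]

# Step 4: calculated column — e.g. annualize a monthly amount
for r in rows:
    r["arr"] = (r["amount"] or 0) * 12

print(rows)
```

The output is a single row for a@example.com with an added `arr` column; the lead contact is filtered out, exactly as steps 3 and 4 describe.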

Types and usage of JOIN

| JOIN type | Behavior | Typical usage with HubSpot |
|---|---|---|
| LEFT JOIN (recommended default) | Keeps all rows of the left table; rows with no match in the right table become NULL | "All HubSpot contacts, plus Snowflake sales data where it exists." Contacts never disappear |
| INNER JOIN | Keeps only rows that match in both tables | "Analyze only rows that exist in both HubSpot deals AND Snowflake billing records" |
| FULL OUTER JOIN | Keeps all rows from both tables; unmatched rows are NULL | "Detect both deals that are in HubSpot but not in Snowflake (missed billing) and deals that are in Snowflake but not in HubSpot (unregistered deals)" |
⚠️ A data-type mismatch in JOIN keys is the most common cause of errors

Trying to JOIN a HubSpot contact ID (numeric) to a Snowflake customer_id (string) fails. Unify the data types with a calculated column before the JOIN (example: convert the HubSpot ID to a string with CAST(contact_id AS VARCHAR)). The fastest route is to ask the AI from the UI's "Add calculated column" to "convert this column to a string."
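A tiny Python sketch of why the mismatch bites and how casting fixes it (the IDs and field names below are made up; in Data Studio the cast happens in a calculated column, not in Python):

```python
# Numeric ID on the HubSpot side, string ID on the Snowflake side
hs_deals = [{"contact_id": 301, "amount": 500000}]
sf_rows = [{"customer_id": "301", "revenue": 480000}]

sf_by_id = {r["customer_id"]: r for r in sf_rows}

# Naive join fails silently: the int 301 never matches the string "301"
assert sf_by_id.get(hs_deals[0]["contact_id"]) is None

# Fix — the equivalent of CAST(contact_id AS VARCHAR) before the JOIN
joined = [
    {**d, **sf_by_id.get(str(d["contact_id"]), {})}
    for d in hs_deals
]
print(joined[0])
```

After the cast, the row carries both `amount` and `revenue`, which is what a successful JOIN should produce.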

Section 4-4

Formula suggestion by AI—Automatic generation of calculated columns and condition classifications

One of Data Studio's major differentiating features is the "have AI suggest a formula" function. On the "Add calculated column" screen, give an instruction in natural language, such as "I want to calculate ARR from this data" or "I want to classify tiers based on industry and company size," and the AI generates the corresponding formula.

🤖 Examples of AI formula suggestions
Give an instruction in natural language via the "Ask AI to suggest a formula" button, and the formula is generated automatically.

1. ARR calculation
Instruction: "Calculate annual recurring revenue (ARR) from the monthly contract amount"
= IF(BILLING_CYCLE = "Monthly", AMOUNT * 12, IF(BILLING_CYCLE = "Quarterly", AMOUNT * 4, AMOUNT))
Annualizes the amount according to the billing cycle (monthly, quarterly, yearly). A safe design: if BILLING_CYCLE is unrecognized, AMOUNT is returned as is.

2. Customer tier classification
Instruction: "Classify customers into three tiers based on ARR and employee count"
= CASE WHEN ARR >= 5000000 AND NUM_EMPLOYEES >= 100 THEN "Enterprise" WHEN ARR >= 1000000 OR NUM_EMPLOYEES >= 50 THEN "Mid-Market" ELSE "SMB" END
A CASE expression: ARR of ¥5M or more and 100+ employees is Enterprise; ARR of ¥1M or more or 50+ employees is Mid-Market; everything else is SMB.

3. Health score calculation
Instruction: "Calculate a customer health score from login frequency, support ticket count, and number of features used"
= LEAST(100, (COALESCE(LOGIN_COUNT_30D, 0) * 2) + (COALESCE(FEATURES_USED, 0) * 5) - (COALESCE(OPEN_TICKETS, 0) * 10))
Score = logins × 2 points + features used × 5 points − unresolved tickets × 10 points. COALESCE converts NULL to 0; LEAST caps the score at 100.

4. Date difference / churn detection
Instruction: "Calculate the days elapsed since the last login and classify churn risk"
= CASE WHEN DATEDIFF('day', LAST_LOGIN_DATE, CURRENT_DATE) > 60 THEN "High Risk" WHEN DATEDIFF('day', LAST_LOGIN_DATE, CURRENT_DATE) > 30 THEN "Medium Risk" ELSE "Healthy" END
More than 60 days since the last login is high churn risk, 31 to 60 days is medium risk, and 30 days or less is healthy. DATEDIFF computes the day count.

5. Email domain extraction
Instruction: "Extract the company domain from the email address and exclude free email providers"
= CASE WHEN SPLIT_PART(EMAIL, '@', 2) IN ('gmail.com','yahoo.co.jp','hotmail.com','outlook.com') THEN NULL ELSE SPLIT_PART(EMAIL, '@', 2) END
Extracts the part after @ and returns NULL for major free email domains. Use this when only a business email domain should serve as a company identifier.
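For illustration, two of the suggested formulas can be re-expressed in Python. This is a sketch of the logic only; the column and function names mirror the SQL above, and the dates are hypothetical.

```python
from datetime import date


def churn_risk(last_login, today):
    """Mirror of the DATEDIFF churn formula above."""
    days = (today - last_login).days  # DATEDIFF('day', LAST_LOGIN_DATE, CURRENT_DATE)
    if days > 60:
        return "High Risk"
    if days > 30:
        return "Medium Risk"
    return "Healthy"


FREE_DOMAINS = {"gmail.com", "yahoo.co.jp", "hotmail.com", "outlook.com"}


def company_domain(email):
    """Mirror of the SPLIT_PART domain-extraction formula above."""
    domain = email.split("@", 1)[1]  # SPLIT_PART(EMAIL, '@', 2)
    return None if domain in FREE_DOMAINS else domain


print(churn_risk(date(2026, 1, 1), date(2026, 3, 15)))  # 73 days -> High Risk
print(company_domain("taro@gmail.com"))       # free domain -> None
print(company_domain("taro@example.co.jp"))   # business domain kept
```

Running the logic locally like this is a quick way to sanity-check boundary behavior (e.g. that exactly 30 days still counts as Healthy) before trusting a generated formula.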
✅ Tips for using AI formula suggestions

Instead of just "Calculate ARR," say: "The BILLING_CYCLE field has three values (monthly, quarterly, yearly); create an annualized ARR column by multiplying by 12, by 4, or leaving the value as is, respectively." The more specifically you state field names, conditions, and expected output, the more accurate the suggestion.

Section 4-5

Pipeline design by use case—reports, scoring, enrichment

The actual usage patterns of Data Studio can be broadly classified into three types. Let's take a closer look at each pipeline design.

Pattern A: Building a Revenue Attribution report

By reconciling deal amounts in HubSpot against billing data in the finance system, you can detect problems with forecast accuracy, unbilled deals, and duplicate registrations.

💰 Revenue Attribution Pipeline
Matching HubSpot Opportunities x Snowflake Billing Data
1
Source selection: HubSpot Deals (Closed Won only)
Select the Deals object and filter to Closed Won deals only, so that the reconciliation covers actual won revenue (get: deal_id, deal_name, amount).
2
JOIN: Snowflake's revenue_recognition table
Join key: hs_deal_id = sf_deal_id (mapped in advance on the Snowflake side). JOIN type: FULL OUTER JOIN (to detect records that exist on only one side)
JOIN snowflake.revenue.recognition ON hs_deals.deal_id = revenue.hs_deal_id
3
Add calculated columns: difference amount, difference rate, status flag
Tell the AI ​​to "Calculate the difference between HubSpot's amount and Snowflake's recognized_revenue, and flag any difference greater than 5% as 'needs confirmation'"
diff_amount = COALESCE(recognized_revenue, 0) - COALESCE(amount, 0)
diff_pct = diff_amount / NULLIF(amount, 0) * 100
status_flag = CASE WHEN ABS(diff_pct) > 5 THEN 'Confirmation required' ELSE 'Normal' END
4
Publish dataset → Connect to HubSpot reports
Publish as the "Revenue Reconciliation Dataset". Select this dataset in HubSpot's custom reports to create a list view of deals needing confirmation and a trend graph of monthly variance amounts.
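Steps 2 and 3 amount to the following reconciliation logic, sketched in Python with made-up deal IDs. Data Studio runs this as generated SQL; the sketch only shows the FULL OUTER JOIN and flagging behavior.

```python
# deal_id -> amount in HubSpot, deal_id -> recognized revenue in Snowflake
hs = {"D-1": 2400000, "D-2": 8500000, "D-3": 1200000}
sf = {"D-1": 2400000, "D-2": 8200000, "D-4": 300000}

results = []
for deal_id in sorted(set(hs) | set(sf)):  # FULL OUTER JOIN: keys from either side
    amount = hs.get(deal_id, 0)
    revenue = sf.get(deal_id, 0)
    diff = revenue - amount
    diff_pct = diff / amount * 100 if amount else None  # NULLIF(amount, 0)
    flag = (
        "Confirmation required"
        if diff_pct is None or abs(diff_pct) > 5
        else "Normal"
    )
    results.append((deal_id, diff, diff_pct, flag))

for row in results:
    print(row)
```

D-2 differs by about -3.5%, inside the 5% tolerance, so it stays Normal; D-3 (in HubSpot only, missed billing) and D-4 (in Snowflake only, unregistered deal) are both flagged, which is exactly why FULL OUTER JOIN is chosen here.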

Pattern B: Customer health scoring dataset

The use case is to combine HubSpot CRM + product usage data (BigQuery) + support tickets (Zendesk API) to calculate a customer health score and write it back to contact properties in HubSpot. The CS team's weekly manual merging of CSV files can now be completely automated.

🏥 Customer Health Scoring Pipeline
Integrated score calculation of CRM x product usage data x support data
1
Source ①: HubSpot Contacts (Customer only)
Filter by Lifecycle Stage = "customer". Get: contact_id, email, company, contract_start_date, csm_owner_id, plan_type
2
Source ②: BigQuery product_usage table (LEFT JOIN)
Join key: email. Get: login_count_30d, features_used_count, last_login_date, session_duration_avg
3
Source ③: S3 zendesk_tickets_weekly.csv (LEFT JOIN)
References the weekly export of Zendesk data to S3. Join key: customer_email. Get: open_ticket_count, avg_response_hours, csat_score_avg
4
Calculated column: Health score + risk classification
Ask AI to suggest a formula to calculate your score out of 100 and High/Medium/Low risk tier.
health_score = LEAST(100, (COALESCE(login_count_30d,0)*2) + (COALESCE(features_used_count,0)*5) + (COALESCE(csat_score_avg,3)*8) - (COALESCE(open_ticket_count,0)*10))
risk_tier = CASE WHEN health_score >= 70 THEN 'Green' WHEN health_score >= 40 THEN 'Yellow' ELSE 'Red' END
5
Publish dataset → write back to CRM with workflow
Refresh the dataset on a weekly schedule. A HubSpot workflow (Chapter 7) reads the dataset values and automatically updates the contact's health_score and risk_tier properties.
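The score formula in step 4, re-expressed as a Python sketch. Note one assumption: Python's `or` replaces 0 as well as None, unlike SQL's COALESCE, which only replaces NULL; that difference is acceptable for this illustration.

```python
def health_score(login_count_30d, features_used, csat_avg, open_tickets):
    # COALESCE(x, default) approximated by "x or default" (also catches 0)
    score = (
        (login_count_30d or 0) * 2
        + (features_used or 0) * 5
        + (csat_avg or 3) * 8
        - (open_tickets or 0) * 10
    )
    return min(100, score)  # LEAST(100, ...)


def risk_tier(score):
    if score >= 70:
        return "Green"
    if score >= 40:
        return "Yellow"
    return "Red"


s = health_score(login_count_30d=20, features_used=6, csat_avg=4, open_tickets=1)
print(s, risk_tier(s))  # 92 Green
```

A contact with all-NULL usage data lands at 24 points (the CSAT default of 3 alone) and is classified Red, which is the conservative behavior you want for customers who never log in.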
⚡ Concept of dataset update schedule

Data Studio datasets can be refreshed manually or on a schedule (hourly at the shortest). As a rule of thumb: every 1 to 4 hours for dashboards that need fresh data, daily for monthly reporting, and weekly for datasets with heavy transformation processing. Scheduled refreshes consume API quota, so increase the frequency only when you genuinely need near-real-time data.

📌 Chapter 4 Summary

Data Studio is the fastest route for "CRM × external data" combinations

ETL tools (dbt/Fivetran) are flexible but require engineering effort. BI tools (Tableau/Looker) offer powerful visualization but cannot write values back to the CRM. Data Studio is the fastest no-code way to meet the need of "combining HubSpot CRM data with 1 to 3 external sources and using the results within HubSpot."

Six types of data sources are available

HubSpot CRM (internal), Snowflake, BigQuery, AWS S3, Google Sheets, and CSV upload. External connections require a dedicated service account and least-privilege authentication. Connect with a shared organizational account rather than an individual employee's personal account, which stops working when that person leaves.

JOIN defaults to LEFT JOIN—beware of type mismatch

For most use cases, LEFT JOIN (keeps all HubSpot records) is the correct choice. Use FULL OUTER JOIN only when you want to detect differences between records that exist on both sides. Mismatched JOIN key data types (numeric vs. string) is the most common cause of errors—convert the type in the computed column before joining.

For AI formula suggestions, state field names, conditions, and expected values specifically

Instead of just "Calculate ARR," give a concrete instruction such as: "The BILLING_CYCLE field has three values, Monthly/Quarterly/Annual; create an ARR column that multiplies Monthly by 12, Quarterly by 4, and leaves Annual as is." Always preview the generated formula before applying it to production.

Revenue Reconciliation detects “discrepancies between CRM numbers and financial numbers”

With a FULL OUTER JOIN of HubSpot Opportunities x Snowflake billing data, you can automatically detect opportunities that are in HubSpot but have not been billed or that have been billed but are not in HubSpot. By simply flagging discrepancies over 5% as "confirmation required" and creating a report, the monthly sales reconciliation process will be reduced from 1 hour to 5 minutes.

Set refresh schedules to match data freshness requirements

Not every dataset needs real-time refreshes. A good guideline: every 1 to 4 hours for dashboards, daily for monthly analysis, and weekly for heavy join processing. Refreshing too frequently consumes API quota and can affect other processes. When setting a schedule, ask how stale the data can be before it affects decision-making.

Next Chapter
Chapter 5: Data Agent and Smart Properties——AI automatically generates data →