Quick Answer
Measure true AI/automation ROI with holdout experiments, not correlation metrics. Split your audience (e.g., 80% treatment, 20% control) → Run for 4+ weeks → Calculate Revenue Lift = (Treatment conversion - Control conversion) / Control conversion. Example: 11.3% vs 2.4% = +371% lift, 161× ROI. Step-by-step Excel formulas and statistical significance testing included.
The $2M Marketing Spend Nobody Believed In
In Q4 2023, a 47-person B2B SaaS company spent $2.1M on demand generation campaigns (paid ads, webinars, content syndication). Revenue increased by $8.7M that quarter.
The CEO asked: "Did marketing generate $8.7M, or would we have closed those deals anyway?"
The CMO showed correlation charts: email open rates (32%), webinar attendance (847 people), MQL volume (+127% YoY). All trending upward alongside revenue.
The CEO responded: "That's correlation, not causation. Show me the incremental revenue we wouldn't have earned without marketing."
The CMO couldn't answer. Budget was cut by 40% the following quarter.
The Problem with Correlation Metrics
Most marketing and sales teams measure correlation, not causation:
- Email open rate: Opens correlate with revenue, but do they cause revenue?
- Demo completion rate: Completed demos correlate with deals, but would deals close without demos?
- MQL volume: More MQLs correlate with pipeline growth, but are they incremental or were those leads already inbound?
Correlation ≠ Causation. To prove ROI, you must measure incremental impact (what wouldn't have happened without your action).
What You'll Learn
This guide teaches you how to measure causal impact using holdout experiments (also called A/B tests or controlled experiments):
- Correlation vs Causation: Why most ROI metrics are misleading
- Holdout Experiment Design: 5-step framework for controlled experiments
- Revenue Lift Calculation: How to quantify incremental revenue
- Statistical Significance Testing: P-values, T-tests, confidence intervals
- Common Pitfalls: Simpson's Paradox, selection bias, survivorship bias
- Excel Implementation: Zero-budget tools for holdout experiments
📊 Real-World Impact
A 23-person marketing SaaS used holdout experiments to prove their nurture campaigns generated +$340K incremental ARR (16.7% lift, p=0.003). Budget was increased by 80% the following year.
Chapter 1: Correlation vs Causation
The Correlation Trap
Correlation means two variables move together. Causation means one variable causes the other to change.
Example of correlation without causation:
- Ice cream sales and drowning deaths are correlated (both increase in summer)
- Does ice cream cause drowning? No. A third factor (hot weather) drives both.
B2B Marketing Examples of Correlation Traps:
| Metric | Correlation Observed | Potential Confounding Factor |
|---|---|---|
| Email sends | +100 emails → +$50K revenue | Those leads were already high-intent (would buy anyway) |
| Webinar attendance | Webinar attendees → 3.2x higher close rate | Self-selection bias (motivated buyers attend) |
| Demo completion | Demo completers → 47% close rate vs 12% non-demo | Sales team only demos qualified leads (selection bias) |
| Content downloads | Whitepaper downloaders → 2.1x pipeline | High-intent leads download content (reverse causation) |
In all these cases, correlation exists but causation is unclear. The metric may be a symptom of buyer intent, not the cause of revenue.
Methods for Proving Causation
To prove causation, you need one of the following methods:
1. Randomized Controlled Experiments (Holdout Tests)
Split your audience into two randomly assigned groups:
- Treatment Group: Receives the campaign (emails, ads, outreach)
- Control Group: Does not receive the campaign (holdout)
Compare conversion rates between the two groups. The difference is incremental impact (causation).
Example:
- Treatment Group (n=1,000): 15.7% conversion → 157 conversions
- Control Group (n=1,000): 12.3% conversion → 123 conversions
- Lift: (15.7% - 12.3%) / 12.3% = +27.6%
- Incremental conversions: 157 - 123 = 34 conversions
2. Quasi-Experimental Methods
When randomization is not possible, use:
- Difference-in-Differences (DiD): Compare the treatment group (before vs after) to the control group (before vs after). Example: Launch the campaign in US East (treatment), withhold it in US West (control), compare the changes (see the sketch after this list).
- Synthetic Control: Create a "synthetic" control group from historical data. Example: Forecast what revenue would have been without the campaign, compare to actual revenue.
- Regression Discontinuity: Use natural cutoffs. Example: Leads scoring 100+ get outreach (treatment), leads scoring 90-99 don't (control). Compare outcomes at the 100-point threshold.
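To make the DiD arithmetic concrete, here is a minimal Python sketch of the US East / US West example above; all conversion rates are made-up placeholders.

```python
# Difference-in-Differences (DiD): minimal sketch with illustrative numbers.
# DiD estimate = (treatment change) - (control change).

# Hypothetical conversion rates, before vs after campaign launch
treatment_before, treatment_after = 0.082, 0.118   # US East (received campaign)
control_before, control_after = 0.079, 0.091       # US West (holdout region)

treatment_change = treatment_after - treatment_before   # campaign effect + market trend
control_change = control_after - control_before         # market trend only

did_estimate = treatment_change - control_change         # incremental effect attributable to the campaign
print(f"Treatment change: {treatment_change:+.1%}")
print(f"Control change:   {control_change:+.1%}")
print(f"DiD estimate:     {did_estimate:+.1%} (absolute lift in conversion rate)")
```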
3. Time-Series Analysis
Compare metrics before and after campaign launch. Requires stable baseline (no seasonality, no external shocks).
Weak method (correlation risk high) but acceptable if:
- Business has stable weekly/monthly patterns
- No competing campaigns running simultaneously
- Long baseline period (12+ weeks of pre-campaign data)
⚠️ Causal Inference Hierarchy
- Gold Standard: Randomized Controlled Experiment (holdout test)
- Silver Standard: Quasi-Experimental Methods (DiD, synthetic control)
- Bronze Standard: Time-Series Analysis (before/after comparison)
- Not Acceptable: Correlation metrics without control groups
Chapter 2: What is a Holdout Experiment?
A holdout experiment (also called A/B test, controlled experiment, or randomized controlled trial) is the gold standard for measuring causal impact.
Control vs Treatment Groups
Treatment Group:
- Receives the campaign/action you want to test (emails, ads, outreach, product feature)
- Typically 70-90% of total audience
Control Group (Holdout):
- Does NOT receive the campaign/action
- Typically 10-30% of total audience
- Must be randomly assigned (no cherry-picking)
Why withhold from 10-30%?
- Smaller holdout (5%) = weak statistical power (hard to detect lift)
- Larger holdout (40%) = too much opportunity cost (lost revenue from untreated group)
- Sweet spot: 20% holdout balances statistical power and opportunity cost
Why Randomization Matters
Random assignment ensures control and treatment groups are identical in all aspects except the campaign.
Example of bad (non-random) assignment:
- Treatment Group: High-intent leads (200+ signal score)
- Control Group: Low-intent leads (0-100 signal score)
Result: Treatment group converts at 27%, control at 8%. This is NOT lift—it's selection bias. High-intent leads would have converted anyway.
Example of good (random) assignment:
- Use Excel's =RAND() function to assign a random number (0-1) to each lead
- If RAND() < 0.2 → Control Group (20%)
- If RAND() ≥ 0.2 → Treatment Group (80%)
Result: Both groups have similar signal scores, industries, deal sizes, regions. Any difference in conversion is due to the campaign (causation).
Sample Size Calculation
Minimum sample size depends on:
- Baseline conversion rate: Lower baseline = larger sample needed
- Expected lift: Smaller expected lift = larger sample needed
- Statistical power: Typically 80% (20% chance of false negative)
- Significance level: Typically 95% (5% chance of false positive, p=0.05)
Rule of Thumb:
- Minimum 100 conversions in treatment group
- Minimum 50 conversions in control group
Example Calculation:
Baseline conversion rate: 10%
Expected lift: +20% (10% → 12%)
Significance level: 95% (p=0.05)
Statistical power: 80%

Formula (simplified):
n = 16 × (p × (1-p)) / (lift²)
n = 16 × (0.10 × 0.90) / (0.02²)
n = 16 × 0.09 / 0.0004
n = 3,600 leads per group

Total sample size: 7,200 leads (3,600 treatment + 3,600 control)
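If you'd rather script this than reach for a calculator, here is a small Python helper implementing the simplified rule-of-thumb formula above (it approximates 80% power at p=0.05; use a proper power calculator for borderline cases):

```python
import math

def rule_of_thumb_sample_size(baseline_rate: float, absolute_lift: float) -> int:
    """Simplified per-group sample size: n = 16 * p * (1 - p) / delta^2.

    baseline_rate: control conversion rate, e.g. 0.10 for 10%
    absolute_lift: expected absolute improvement, e.g. 0.02 for 10% -> 12%
    Approximates 80% power at a 5% significance level (two-sided).
    """
    p = baseline_rate
    n = 16 * p * (1 - p) / (absolute_lift ** 2)
    return math.ceil(n)

per_group = rule_of_thumb_sample_size(baseline_rate=0.10, absolute_lift=0.02)
print(f"Per group: {per_group:,} leads, total: {2 * per_group:,} leads")
# -> Per group: 3,600 leads, total: 7,200 leads
```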
For a precise figure, use an online A/B test sample-size calculator (for example, Evan Miller's).
💡 What if I don't have 7,200 leads?
Run the experiment for longer duration. Example: If you have 1,200 leads/month, run for 6 months to accumulate 7,200 leads. Alternatively, accept lower statistical power (60% instead of 80%) and run with smaller sample (n=1,800).
Chapter 3: Designing Holdout Experiments (5 Steps)
Step 1: Hypothesis Setting
Define what you want to prove. Good hypotheses are specific, measurable, and falsifiable.
Bad Hypothesis:
- "Email marketing improves revenue." (too vague)
Good Hypothesis:
- "Sending 3 nurture emails over 14 days to leads who visited /pricing but didn't book a demo will increase demo booking rate by 15%."
Hypothesis Template:
"[Action] applied to [Audience] will increase [Metric] from [Baseline] to [Target] ([Expected Lift]%)."
Examples:
- "Hot-lead outreach (5-minute response to /pricing visits) applied to SMB leads will increase demo booking rate from 8% to 12% (+50% lift)."
- "Lost-deal reactivation emails (90 days post-loss) applied to SMB closed-lost deals will reactivate 10% of deals (baseline: 2% organic reactivation, +8pp lift)."
- "Product usage alerts (usage dropped 50%+ in 7 days) applied to trial users will increase trial-to-paid conversion from 18% to 25% (+39% lift)."
Step 2: Metric Definition
Define primary metric (what you're trying to improve) and guardrail metrics (what you don't want to hurt).
Primary Metric Examples:
- Conversion rate: % of leads who book demo, sign contract, activate product
- Revenue: Total revenue, average contract value (ACV), annual recurring revenue (ARR) — see Revenue Velocity Optimization for calculation methods
- Speed: Time-to-close, time-to-first-value, sales cycle length
Guardrail Metrics (watch for negative side effects):
- Churn rate: Did aggressive outreach increase churn?
- Unsubscribe rate: Did email frequency cause opt-outs?
- Customer satisfaction: Did speed sacrifice quality (NPS drop)?
| Experiment Type | Primary Metric | Guardrail Metrics |
|---|---|---|
| Nurture emails | Demo booking rate | Unsubscribe rate, spam complaints |
| Pricing page CTA | Trial signup rate | Trial-to-paid conversion (quality of signups) |
| Sales demo script | Demo-to-close rate | Sales cycle length, discount rate |
| Onboarding automation | Activation rate (7-day usage) | Support ticket volume, NPS |
Step 3: Group Allocation
Random Assignment Process:
- Export your lead/customer list to Excel or CSV
- Add a column with =RAND() (generates a random number between 0 and 1)
- Sort by the RAND() column (ascending)
- Top 20% → Control Group, bottom 80% → Treatment Group
- Mark each lead with group = "control" or group = "treatment"
Stratified Randomization (for segmented audiences):
- If you have distinct segments (SMB vs Enterprise, US vs EU), randomize within each segment
- Example: 20% control for SMB, 20% control for Enterprise (ensures both segments are represented)
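If your lead list lives in a CSV rather than Excel, the same simple and stratified assignment can be done with pandas. A minimal sketch, assuming a hypothetical leads.csv export with lead_id and segment columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # fixed seed so the assignment is reproducible

leads = pd.read_csv("leads.csv")  # hypothetical export with lead_id, segment, ...

# Simple randomization: ~20% control, ~80% treatment
leads["experiment_group"] = np.where(rng.random(len(leads)) < 0.2, "control", "treatment")

# Stratified randomization: draw ~20% control within each segment
leads["experiment_group_stratified"] = leads.groupby("segment")["lead_id"].transform(
    lambda s: np.where(rng.random(len(s)) < 0.2, "control", "treatment")
)

leads.to_csv("leads_with_groups.csv", index=False)  # re-import into your CRM
print(leads["experiment_group"].value_counts(normalize=True))
```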
Step 4: Experiment Duration
Minimum duration = 1 sales cycle
- SMB (7-30 day sales cycle): Run for 30 days minimum
- Mid-market (30-90 day cycle): Run for 90 days minimum
- Enterprise (90-180 day cycle): Run for 180 days minimum
Why full sales cycle?
- Early stopping leads to false positives (novelty effect, seasonal spikes)
- Example: Email campaign shows +30% lift in Week 1 (novelty), but drops to +5% by Week 4 (fatigue). If you stop at Week 1, you overestimate lift.
Sequential Testing (for early stopping):
- Use Bayesian A/B testing or sequential probability ratio test (SPRT) to stop early if lift is conclusive
- Tools: Evan Miller's Sequential A/B Testing Calculator
Step 5: Result Analysis
Key Questions to Answer:
- Is the lift real? (Statistical significance: p-value < 0.05)
- How large is the lift? (Effect size: % improvement)
- What is the confidence interval? (Range of plausible lift values)
- Did guardrail metrics degrade? (Check unsubscribe rate, NPS, churn)
Analysis Template (Excel):
| Group | Leads | Conversions | Conv Rate | Lift |
|---|---|---|---|---|
| Treatment | 4,000 | 627 | 15.7% | +27.6% |
| Control | 1,000 | 123 | 12.3% | (baseline) |

Statistical Significance:
- T-Test P-Value: 0.0023 (p < 0.05 ✅ Significant)
- 95% Confidence Interval: [+18.2%, +37.0%]

Guardrail Check:
- Unsubscribe Rate: 0.4% (treatment) vs 0.3% (control) ✅ Acceptable
- NPS: 47 (treatment) vs 48 (control) ✅ No degradation
✅ Decision Framework
- p < 0.05 AND lift > 10% AND guardrails OK → Ship it (scale campaign)
- p < 0.05 AND lift < 10% → Marginally positive (consider cost/benefit)
- p ≥ 0.05 → Inconclusive (run longer or redesign)
- Lift < 0% → Negative impact (kill campaign immediately)
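If you run many experiments, the decision framework above is easy to codify. A minimal Python sketch (the thresholds are the ones from this guide; the degraded-guardrail branch is an assumption about how you'd likely want to handle it):

```python
def experiment_decision(p_value: float, lift: float, guardrails_ok: bool) -> str:
    """Apply the ship / inconclusive / kill rules from the decision framework."""
    if lift < 0:
        return "Negative impact - kill campaign immediately"
    if p_value >= 0.05:
        return "Inconclusive - run longer or redesign"
    if not guardrails_ok:
        # Assumed handling: significant lift but a guardrail degraded
        return "Lift is real but guardrails degraded - fix before scaling"
    if lift > 0.10:
        return "Ship it - scale campaign"
    return "Marginally positive - weigh cost vs benefit"

print(experiment_decision(p_value=0.0023, lift=0.276, guardrails_ok=True))
# -> Ship it - scale campaign
```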
Chapter 4: Revenue Lift Calculation
Lift Formula & Examples
Lift Formula:
Lift = (Treatment Metric - Control Metric) / Control Metric
Example 1: Conversion Rate Lift
Treatment Group: 15.7% conversion rate
Control Group: 12.3% conversion rate

Lift = (15.7% - 12.3%) / 12.3%
Lift = 3.4% / 12.3%
Lift = 27.6%

Interpretation: The campaign improved conversion rate by 27.6%.
Example 2: Revenue Lift
Treatment Group (n=4,000): $2.3M revenue ($575 per lead)
Control Group (n=1,000): $450K revenue ($450 per lead)

Lift = ($575 - $450) / $450
Lift = $125 / $450
Lift = 27.8%

Interpretation: The campaign generated 27.8% more revenue per lead.
Incremental Revenue Calculation
Incremental Revenue = revenue you earned because of the campaign (that wouldn't have been earned otherwise).
Formula:
Incremental Revenue = (Treatment Revenue - (Treatment Size × Control Revenue per Lead))
Step-by-Step Example:
Treatment Group:
- Size: 4,000 leads
- Revenue: $2.3M
- Revenue per lead: $575

Control Group:
- Size: 1,000 leads
- Revenue: $450K
- Revenue per lead: $450

Step 1: Calculate what the treatment group would have earned without the campaign
Counterfactual Revenue = 4,000 leads × $450/lead = $1.8M

Step 2: Calculate incremental revenue
Incremental Revenue = $2.3M (actual) - $1.8M (counterfactual)
Incremental Revenue = $500K

Interpretation: The campaign generated $500K in revenue that wouldn't have been earned without it.
Annualized Incremental Revenue (for ongoing campaigns):
Experiment Duration: 90 days
Incremental Revenue: $500K (90 days)

Annualized Incremental Revenue = $500K × (365 / 90)
Annualized Incremental Revenue = $500K × 4.06
Annualized Incremental Revenue = $2.03M/year
ROI Calculation:
Campaign Cost:
- Email platform: $500/month × 3 months = $1,500
- Content creation: $5,000 (one-time)
- Sales rep time: 20 hours × $50/hour = $1,000
Total Cost: $7,500

Incremental Revenue (90 days): $500K

ROI = (Incremental Revenue - Cost) / Cost
ROI = ($500K - $7.5K) / $7.5K
ROI = 65.7x

Interpretation: For every $1 spent, the campaign returned $65.70 in net incremental revenue.
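Here is the full lift → incremental revenue → ROI chain as a short Python sketch, using the numbers from the examples above:

```python
# Revenue lift, incremental revenue, and ROI from a holdout experiment.
treatment_leads, treatment_revenue = 4_000, 2_300_000
control_leads, control_revenue = 1_000, 450_000
campaign_cost = 7_500

rev_per_lead_treatment = treatment_revenue / treatment_leads   # $575
rev_per_lead_control = control_revenue / control_leads         # $450

lift = (rev_per_lead_treatment - rev_per_lead_control) / rev_per_lead_control

# Counterfactual: what the treatment group would have earned without the campaign
counterfactual_revenue = treatment_leads * rev_per_lead_control
incremental_revenue = treatment_revenue - counterfactual_revenue

roi = (incremental_revenue - campaign_cost) / campaign_cost

print(f"Revenue lift per lead: {lift:+.1%}")                     # +27.8%
print(f"Incremental revenue:   ${incremental_revenue:,.0f}")      # $500,000
print(f"ROI:                   {roi:,.1f}x")                      # 65.7x
```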
💰 Real-World Example: Lost Deal Reactivation
A 19-person ID verification SaaS ran a 120-day holdout experiment on closed-lost deals (n=847):
- Treatment (n=678): Automated reactivation emails at 90 days post-loss
- Control (n=169): No outreach
- Result: 11.3% reactivation rate (treatment) vs 2.4% (control)
- Lift: +371% (p=0.001)
- Incremental ARR: $340K
- Cost: $2,100 (automation setup + email platform)
- ROI: 161x
Chapter 5: Statistical Significance Testing
Understanding P-Value
P-value = the probability of seeing a lift at least this large purely by random chance, assuming the campaign actually had no effect.
Interpretation:
- p = 0.05: 5% chance the observed lift is random noise (roughly 95% confidence it's real)
- p = 0.01: 1% chance the lift is random noise (roughly 99% confidence it's real)
- p = 0.20: 20% chance the lift is random noise (not statistically significant)
Industry Standards:
- p < 0.05: Statistically significant (acceptable for most decisions)
- p < 0.01: Highly significant (use for high-stakes decisions, e.g., $100K+ budgets)
- p < 0.10: Marginally significant (acceptable for low-risk experiments)
T-Test in Excel
T-Test compares means of two groups and calculates p-value.
Excel Formula:
=T.TEST(treatment_array, control_array, 2, 3)
Parameters:
- treatment_array: Range of treatment group conversion data (0 or 1 for each lead)
- control_array: Range of control group conversion data (0 or 1 for each lead)
- 2 (third argument, tails): Two-tailed test (can detect positive or negative lift)
- 3 (fourth argument, type): Two-sample test assuming unequal variances (the most conservative option)
Step-by-Step Example:
Step 1: Create a conversion column (0 = no conversion, 1 = conversion)

| Lead ID | Group | Converted |
|---|---|---|
| 1 | treatment | 1 |
| 2 | treatment | 0 |
| 3 | treatment | 1 |
| ... | ... | ... |
| 4001 | control | 0 |
| 4002 | control | 1 |
| ... | ... | ... |

Step 2: Create two arrays
Treatment Conversions: Range C2:C4001 (n=4,000)
Control Conversions: Range C4002:C5001 (n=1,000)

Step 3: Run the T-Test
=T.TEST(C2:C4001, C4002:C5001, 2, 3)

Result: 0.0023 (p-value)
Interpretation: p=0.0023 < 0.05 → Statistically significant ✅
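If your conversion data is in Python rather than Excel, SciPy's ttest_ind runs the equivalent unequal-variance (Welch) test. A minimal sketch that rebuilds the 0/1 arrays from the conversion counts in this example:

```python
import numpy as np
from scipy import stats

# 0/1 conversion arrays matching the example: 627/4,000 vs 123/1,000
treatment = np.array([1] * 627 + [0] * (4_000 - 627))
control = np.array([1] * 123 + [0] * (1_000 - 123))

# Welch's two-sample t-test (unequal variances), two-sided
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# For conversion rates, a two-sample z-test for proportions is also common:
# from statsmodels.stats.proportion import proportions_ztest
# z, p = proportions_ztest(count=[627, 123], nobs=[4000, 1000])
```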
Confidence Intervals
Confidence Interval (CI) = range of plausible values for lift.
95% CI Interpretation:
- If you repeated this experiment 100 times and computed a CI each time, about 95 of those intervals would contain the true lift
Example:
- Observed Lift: +27.6%
- 95% CI: [+18.2%, +37.0%]
- Interpretation: True lift is between +18.2% and +37.0% with 95% confidence
Excel Calculation (simplified):
Step 1: Calculate Standard Error (SE)
SE = SQRT((p_treatment × (1 - p_treatment) / n_treatment) +
(p_control × (1 - p_control) / n_control))
Example:
p_treatment = 15.7% = 0.157
p_control = 12.3% = 0.123
n_treatment = 4,000
n_control = 1,000
SE = SQRT((0.157 × 0.843 / 4000) + (0.123 × 0.877 / 1000))
SE = SQRT(0.0000331 + 0.0001079)
SE = 0.0119
Step 2: Calculate Margin of Error (95% CI uses z=1.96)
Margin = 1.96 × SE = 1.96 × 0.0119 = 0.0233 (2.33%)
Step 3: Calculate the CI for the difference in conversion rates
Difference = 15.7% - 12.3% = 3.4 percentage points
Lower Bound = 3.4% - 2.33% = 1.1 percentage points
Upper Bound = 3.4% + 2.33% = 5.7 percentage points
95% CI for the absolute difference: [+1.1pp, +5.7pp]
(Relative to the 12.3% control baseline, that is roughly a +9% to +47% lift.)
⚠️ Wide Confidence Intervals
If CI is wide (e.g., [-5%, +40%]), your experiment lacks statistical power. Solutions: (1) Run longer to accumulate more conversions, (2) Increase sample size, (3) Accept wider CI if directionally positive.
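The same standard-error and confidence-interval arithmetic as a short Python sketch (normal approximation for the difference between two proportions; the function name is mine):

```python
import math

def diff_ci(p_t: float, n_t: int, p_c: float, n_c: int, z: float = 1.96):
    """95% CI for the absolute difference in conversion rates (normal approximation)."""
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    diff = p_t - p_c
    margin = z * se
    return diff - margin, diff + margin

lower, upper = diff_ci(p_t=0.157, n_t=4_000, p_c=0.123, n_c=1_000)
print(f"Difference: {0.157 - 0.123:+.1%}, 95% CI: [{lower:+.1%}, {upper:+.1%}]")
# Roughly [+1.1pp, +5.7pp]; divide by the 12.3% baseline for relative lift bounds.
```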
Chapter 6: Common Measurement Pitfalls
Simpson's Paradox
Simpson's Paradox occurs when an overall trend reverses when data is segmented.
Example:
Overall Results:
- Treatment: 15.0% conversion (600 / 4,000)
- Control: 15.5% conversion (155 / 1,000)
- Lift: -3.2% ❌ Negative lift

Segmented Results (by deal size):

SMB Segment:
- Treatment: 20.0% conversion (400 / 2,000)
- Control: 15.0% conversion (75 / 500)
- Lift: +33.3% ✅ Positive

Enterprise Segment:
- Treatment: 10.0% conversion (200 / 2,000)
- Control: 16.0% conversion (80 / 500)
- Lift: -37.5% ❌ Negative

Explanation:
- The campaign works for SMB (+33%) but fails for Enterprise (-37%)
- Overall lift is negative because the Enterprise decline outweighs the SMB gain when the segments are pooled
- Action: Apply the campaign only to SMB, exclude Enterprise
Prevention:
- Always segment by key dimensions: industry, deal size, region, customer type
- Report segment-level lift, not just overall lift
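To run the segment-level check described above, a pandas groupby does the work. A minimal sketch, assuming a hypothetical experiment_results.csv export with segment, experiment_group, and converted (0/1) columns:

```python
import pandas as pd

df = pd.read_csv("experiment_results.csv")  # hypothetical export: segment, experiment_group, converted

# Conversion rate per segment and group
rates = (
    df.groupby(["segment", "experiment_group"])["converted"]
      .mean()
      .unstack("experiment_group")
)

# Segment-level lift: (treatment - control) / control
rates["lift"] = (rates["treatment"] - rates["control"]) / rates["control"]
print(rates)

# Overall lift for comparison; if it disagrees in sign with the segment lifts,
# you may be looking at aggregation masking the real story (Simpson's Paradox).
overall = df.groupby("experiment_group")["converted"].mean()
overall_lift = (overall["treatment"] - overall["control"]) / overall["control"]
print(f"Overall lift: {overall_lift:+.1%}")
```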
Selection Bias
Selection Bias occurs when control and treatment groups differ in non-random ways.
Example:
- Treatment Group: Leads who opened email (self-selected high-intent)
- Control Group: Leads who didn't open email (low-intent)
- Result: Treatment converts at 30%, control at 8%. This is NOT lift—it's selection bias.
Prevention:
- Random assignment BEFORE any action (assign groups before sending emails, not based on who opened)
- Never cherry-pick control groups
Survivorship Bias
Survivorship Bias occurs when you analyze only "survivors" (leads who didn't churn, unsubscribe, or drop out).
Example:
- You send 10 nurture emails over 90 days
- 30% unsubscribe after Email 3
- You measure conversion rate of the remaining 70% → 25% conversion
- Conclusion: "Nurture emails drive 25% conversion!" ❌ Wrong
True Calculation:
- 70% survived × 25% converted = 17.5% overall conversion
- 30% unsubscribed × 0% conversion = 0%
- Total: 17.5% conversion (not 25%)
Prevention:
- Include all leads in analysis (even unsubscribes, drop-outs)
- Use "intent-to-treat" analysis (measure based on original group assignment, not final status)
Novelty Effect
Novelty Effect occurs when early lift is inflated due to newness, then fades over time.
Example:
Week 1: +30% lift (users excited by new email series)
Week 2: +18% lift (excitement fades)
Week 3: +10% lift (fatigue sets in)
Week 4: +5% lift (steady state)

If you stopped at Week 1, you'd think the lift is +30%. The true steady-state lift is only +5%.
Prevention:
- Run experiments for minimum 1 sales cycle (30-180 days)
- Track lift over time (plot weekly/monthly lift)
- Use steady-state lift (last 25% of experiment duration) for ROI calculations
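To apply the last two prevention steps, compute lift week by week and base your ROI math on the steady-state portion (the last 25% of the experiment). A minimal pandas sketch with illustrative weekly rates, mirroring the novelty-effect example above:

```python
import pandas as pd

# Hypothetical weekly conversion rates over an 8-week experiment
weekly = pd.DataFrame({
    "week": range(1, 9),
    "treatment_rate": [0.156, 0.142, 0.132, 0.126, 0.127, 0.125, 0.126, 0.126],
    "control_rate":   [0.120, 0.120, 0.120, 0.120, 0.121, 0.119, 0.120, 0.120],
})

weekly["lift"] = (weekly["treatment_rate"] - weekly["control_rate"]) / weekly["control_rate"]

# Steady-state lift: average over the last 25% of the experiment
steady_weeks = max(1, len(weekly) // 4)
steady_state_lift = weekly["lift"].tail(steady_weeks).mean()

print(weekly[["week", "lift"]].round(3))
print(f"Steady-state lift (last {steady_weeks} weeks): {steady_state_lift:+.1%}")
```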
🚨 Most Common Mistake
Stopping experiments too early leads to false positives. Always run for full sales cycle. A study of 1,500 A/B tests found 40% of "winners" in Week 1 became losers by Week 4.
Chapter 7: Excel Implementation
Data Preparation
Step 1: Export Lead Data
Export from CRM (HubSpot, Salesforce) with these columns:
- lead_id: Unique identifier
- create_date: When the lead was created
- conversion_date: When the lead converted (blank if not converted)
- revenue: Deal value (if converted)
- segment: SMB, Mid-Market, Enterprise (optional, for stratification)
Step 2: Create Conversion Column
=IF(ISBLANK(conversion_date), 0, 1)
Randomization (RAND Function)
Step 3: Assign Random Numbers
=RAND()
This generates a random number between 0 and 1 for each lead.
Step 4: Assign Groups
=IF(RAND_column < 0.2, "control", "treatment")
This assigns 20% to control, 80% to treatment.
Important: After running RAND(), copy the entire column and "Paste Special → Values" to freeze random assignments (otherwise they'll regenerate on every edit).
T-Test Calculation
Step 5: Create Summary Table
| Group | Count | Conversions | Conv Rate |
|---|---|---|---|
| Treatment | =COUNTIF(...) | =SUMIFS(...) | =B2/A2 |
| Control | =COUNTIF(...) | =SUMIFS(...) | =B3/A3 |

Formulas:
- A2 (Treatment Count): =COUNTIF(group_column, "treatment")
- B2 (Treatment Conversions): =SUMIFS(conversion_column, group_column, "treatment")
- C2 (Treatment Conv Rate): =B2/A2
Step 6: Run T-Test
=T.TEST(treatment_conversion_column, control_conversion_column, 2, 3)
Full Example:
Assuming:
- Column A: lead_id
- Column B: group ("treatment" or "control")
- Column C: converted (0 or 1)
Step 1: Filter treatment group conversions
Treatment Range: =FILTER(C:C, B:B="treatment")
Step 2: Filter control group conversions
Control Range: =FILTER(C:C, B:B="control")
Step 3: Run T-Test
=T.TEST(FILTER(C:C, B:B="treatment"), FILTER(C:C, B:B="control"), 2, 3)
Result: 0.0023 (p-value)
If p < 0.05 → Statistically significant ✅
📥 Download: Excel Template
Pre-built Excel template with randomization, T-tests, and lift calculation formulas (coming in Phase 2).
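Prefer Python to Excel? Here is a pandas/SciPy sketch that reproduces Steps 5-6 (summary table, lift, and t-test), assuming a hypothetical leads_with_groups.csv export with experiment_group and converted columns:

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("leads_with_groups.csv")  # hypothetical export: experiment_group, converted (0/1)

# Step 5 equivalent: summary table (count, conversions, conversion rate)
summary = df.groupby("experiment_group")["converted"].agg(
    leads="count", conversions="sum", conv_rate="mean"
)
print(summary)

# Lift relative to the control group's conversion rate
control_rate = summary.loc["control", "conv_rate"]
lift = (summary.loc["treatment", "conv_rate"] - control_rate) / control_rate
print(f"Lift: {lift:+.1%}")

# Step 6 equivalent: Welch's t-test on the raw 0/1 conversion columns
treatment = df.loc[df["experiment_group"] == "treatment", "converted"]
control = df.loc[df["experiment_group"] == "control", "converted"]
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"p-value: {p_value:.4f}")
```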
Chapter 8: Advanced: Marketing Mix Modeling
For businesses running multiple campaigns simultaneously, holdout experiments for individual campaigns may not be feasible. Use Marketing Mix Modeling (MMM).
What is Marketing Mix Modeling?
MMM uses regression analysis to estimate the contribution of each marketing channel (email, ads, SEO, events) to total revenue.
Example Model:
Revenue = β0 + β1×(Email Sends) + β2×(Ad Spend) + β3×(SEO Traffic) + ε

Where:
- β0 = baseline revenue (without any marketing)
- β1 = incremental revenue per email send
- β2 = incremental revenue per $1 of ad spend
- β3 = incremental revenue per SEO visit
- ε = error term (unexplained variance)

Example Output (using historical data):
Revenue = $50K + $2.30×(Email) + $1.87×(Ads) + $0.45×(SEO)

Interpretation:
- Each additional email generates $2.30 in revenue
- Each $1 in ad spend generates $1.87 in revenue (ROI = 0.87x)
- Each SEO visit generates $0.45 in revenue
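As a rough illustration of the regression behind MMM, here is a scikit-learn sketch on synthetic weekly data; the channel names and "true" coefficients are made up for demonstration, and a production MMM would also model adstock, saturation, and seasonality:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(seed=7)
weeks = 52

# Synthetic weekly channel activity (illustrative only)
email_sends = rng.integers(500, 2_000, size=weeks)
ad_spend = rng.integers(5_000, 20_000, size=weeks)
seo_visits = rng.integers(1_000, 4_000, size=weeks)

# Synthetic revenue built from known "true" contributions plus noise
revenue = (
    50_000 + 2.30 * email_sends + 1.87 * ad_spend + 0.45 * seo_visits
    + rng.normal(0, 5_000, size=weeks)
)

X = np.column_stack([email_sends, ad_spend, seo_visits])
model = LinearRegression().fit(X, revenue)

print(f"Baseline revenue (beta0): ${model.intercept_:,.0f}")
for name, coef in zip(["email send", "ad dollar", "SEO visit"], model.coef_):
    print(f"Incremental revenue per {name}: ${coef:.2f}")
```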
When to Use MMM
- Multiple channels running simultaneously (can't isolate one)
- Historical data available (12+ months of weekly/monthly data)
- Budget allocation decisions (which channel to invest in?)
Limitations of MMM
- Correlation-based (not as strong as holdout experiments for causation)
- Requires large datasets (minimum 52 weeks of data)
- Sensitive to multicollinearity (if channels are correlated, attribution becomes noisy)
Tools for MMM:
- R: the lm() function for linear regression (free)
- Python: the scikit-learn library (free)
- Google Sheets: the =LINEST() function (free)
- Commercial tools: Nielsen MMM, Analytic Partners, Neustar MarketShare
🎓 Recommendation
Start with holdout experiments for individual campaigns (simpler, stronger causation). Graduate to MMM once you have 12+ months of multi-channel data and need cross-channel attribution.
Chapter 9: 30-Day Holdout Experiment Roadmap
Week 1 (Day 1-7): Design & Setup
Day 1-2: Hypothesis & Metric Definition
- Define hypothesis (action → audience → metric → expected lift)
- Define primary metric (conversion rate, revenue, speed)
- Define guardrail metrics (churn, unsubscribe, NPS)
- Document in a 1-page experiment brief
Day 3-4: Sample Size Calculation
- Calculate minimum sample size (use online calculators)
- Determine experiment duration (accumulate enough conversions)
- If the sample size is too large, consider: (1) run longer, (2) accept lower power, (3) test on a subset
Day 5-7: Group Assignment
- Export lead/customer list from CRM
- Run randomization in Excel (RAND function)
- Assign 20% to control, 80% to treatment
- Upload group assignments back to CRM (custom field: "experiment_group")
- QA check: Verify control and treatment groups have similar baseline metrics
Week 2-4 (Day 8-28): Experiment Execution
Day 8: Launch Campaign
- Apply campaign to treatment group only
- Ensure control group is excluded (use CRM filters: "experiment_group = treatment")
- Double-check: No leakage to control group
Day 8-28: Monitor Metrics (Weekly)
- Track conversion rate, revenue, guardrail metrics
- Check for data quality issues (missing data, duplicates)
- Do NOT stop early (resist the temptation to peek at results and ship immediately)
Day 28: Guardrail Check
- If unsubscribe rate spikes (>2x baseline), pause campaign
- If NPS drops (>5 points), investigate customer feedback
- If churn increases (>1.5x), stop experiment immediately
Week 5 (Day 29-30): Analysis & Decision
Day 29: Statistical Analysis
- Export final data (all conversions, revenue)
- Calculate conversion rate lift
- Run T-Test (p-value)
- Calculate 95% confidence interval
- Check guardrail metrics (unsubscribe, NPS, churn)
Day 30: Decision & Documentation
- If p < 0.05 AND lift > 10% AND guardrails OK → Ship (scale to 100%)
- If p ≥ 0.05 → Inconclusive (run longer or redesign)
- If lift < 0% → Kill campaign
- Document results in a 1-page report (share with stakeholders)
✅ Success Criteria
By Day 30, you should have: (1) Statistically significant result (p < 0.05), (2) Lift quantified with confidence interval, (3) Incremental revenue calculated, (4) Go/No-Go decision made, (5) Documented learnings for future experiments.
Chapter 10: Implementation Checklist
Pre-Launch Checklist
During Experiment Checklist
Post-Experiment Checklist
📥 Download Checklist
Printable checklist template (PDF + Excel) coming in Phase 2.
3 Steps to Start Measuring Causation Today
Step 1: Pick One Campaign to Test (30 min)
Choose a low-risk, high-volume campaign for your first holdout experiment:
- Best candidates: Nurture emails, webinar follow-ups, lost-deal reactivation
- Avoid: High-stakes campaigns (product launches, executive outreach)
- Minimum: 1,000+ leads/month volume
Step 2: Run Randomization in Excel (15 min)
Export leads, assign groups, upload to CRM:
- Export lead list from CRM (CSV)
- Add a column with =RAND()
- Assign groups with =IF(RAND_column < 0.2, "control", "treatment")
- Upload to CRM (custom field: "experiment_group")
Step 3: Set Calendar Reminder for Analysis (30 days)
Schedule analysis date (30-90 days from launch):
- Calendar event: "Analyze Holdout Experiment Results"
- Remind yourself to NOT peek at results before then
- On the analysis date: Run the T-Test, calculate lift, make the decision
Ready to Prove ROI with Automated Holdout Experiments?
Optifai runs holdout experiments automatically for every campaign. No Excel, no manual randomization, no complex analysis. Just click "Launch Experiment" and get results in 30 days.
Remember: Correlation is easy. Causation is hard. But proving causation is the only way to defend your budget, earn executive trust, and scale revenue predictably.
Good luck with your first holdout experiment. 🚀