Epic Caboodle Data Warehouse: How the Data Model Works and Who Needs It vs Clarity
Most health systems that implement Epic Caboodle use it as a more expensive version of Clarity rather than as the enterprise data warehouse it is designed to be. They query Caboodle the same way they query Clarity, skip the dimensional model’s advantages, and then wonder why the investment did not produce the analytical capability they expected. This article explains how Caboodle’s data model actually works, when Caboodle is the right tool versus when Clarity is sufficient, and what operational and analytical use cases each platform serves best.
- Epic Caboodle Architecture: Where It Fits in the Data Stack
- The Dimensional Data Model: Facts, Dimensions, and Star Schema
- Key Caboodle Tables for Analysts
- Caboodle vs Clarity: When to Use Which for Analytics
- External Data Integration in Epic Caboodle
- Pre-Built Derived Metrics and When to Validate Them
- Caboodle on Azure Synapse: Cloud Deployment Considerations
- Connecting Caboodle to External BI Platforms
- Caboodle Implementation: What Teams Consistently Underestimate
- Roles, Skills, and Career Path for Caboodle Work
Epic Caboodle Architecture: Where It Fits in the Data Stack
Epic’s data architecture has three distinct layers. Chronicles is the operational database where clinical transactions occur in real time. Clarity is the relational reporting database extracted nightly from Chronicles, structured for SQL reporting. Caboodle is the enterprise data warehouse (EDW) built from Clarity data through a second ETL layer, structured for dimensional analysis and BI platform integration.
Caboodle sits above Clarity in the data stack. It is not a replacement for Clarity – it depends on Clarity as its primary data source. The ETL from Clarity to Caboodle runs after the Clarity ETL completes, typically in the early morning hours. Caboodle is therefore refreshed one ETL cycle later than Clarity – if the Clarity ETL finishes at 4:00 AM and the Caboodle ETL runs from 4:30 AM to 6:00 AM, both platforms reflect clinical activity through the prior evening, but Caboodle's refresh becomes available roughly two hours after Clarity's each morning. The full Cogito architecture context is covered in the Epic Cogito Reporting and Analytics guide.
The key distinction between Caboodle and Clarity is the data model, not the data itself. Both contain largely the same clinical information – but Clarity presents it in normalized relational tables that mirror Epic’s operational structure, while Caboodle presents it in a dimensional model optimized for analytical queries. The data model difference determines which analytical tasks are easier, faster, and more natural in each platform. Understanding the full Epic module landscape provides context for how Caboodle fits into the broader implementation – covered in the Epic EHR Learning Hub.
The Dimensional Data Model: Facts, Dimensions, and Star Schema
Caboodle is organized as a dimensional model – specifically, a star schema. A star schema has two types of tables: fact tables that contain measurable events and pre-computed metrics, and dimension tables that contain descriptive attributes used to filter and group those facts. The name comes from the visual pattern when you draw the schema: the fact table is at the center, and dimension tables radiate outward like points of a star.
Fact Tables: Measurements and Events
Fact tables in Caboodle contain one row per measurable event. EncounterFact has one row per patient encounter. MedicationAdministrationFact has one row per medication administration. OrderFact has one row per clinical order. Each fact table contains the measurements relevant to its event – LOS for encounters, administration time for medication events, result value for lab orders – and a set of foreign keys that link to dimension tables.
Fact tables are designed to be queried in aggregate. You SUM charges from EncounterFact, AVERAGE LOS, COUNT encounters by department. Querying individual rows from a fact table is technically possible but not the primary use case. If you need patient-level detail with many columns, Clarity is often more appropriate than Caboodle for that specific query pattern.
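The aggregate query pattern can be sketched with a toy star-schema fragment. Table and column names follow the examples above, but the schema is an illustration, not Epic's actual Caboodle DDL, and sqlite3 stands in here for SQL Server:

```python
import sqlite3

# Illustrative star-schema fragment: one fact table, one dimension.
# EncounterFact carries the pre-computed measure (LOS_DAYS) plus a
# foreign key into DepartmentDim, which carries descriptive attributes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DepartmentDim (
    DepartmentKey INTEGER PRIMARY KEY,
    DEPT_NAME     TEXT
);
CREATE TABLE EncounterFact (
    EncounterKey  INTEGER PRIMARY KEY,
    DepartmentKey INTEGER REFERENCES DepartmentDim(DepartmentKey),
    LOS_DAYS      REAL
);
INSERT INTO DepartmentDim VALUES (1, 'Medicine'), (2, 'Surgery');
INSERT INTO EncounterFact VALUES (10, 1, 3.0), (11, 1, 5.0), (12, 2, 2.0);
""")

# The canonical fact-table query: aggregate the measure, group and
# filter by dimension attributes -- no row-level detail retrieval.
rows = conn.execute("""
    SELECT d.DEPT_NAME, COUNT(*) AS encounters, AVG(f.LOS_DAYS) AS avg_los
    FROM EncounterFact f
    JOIN DepartmentDim d ON d.DepartmentKey = f.DepartmentKey
    GROUP BY d.DEPT_NAME
    ORDER BY d.DEPT_NAME
""").fetchall()
```

The join shape is the same for any number of dimensions: one hop from the fact table to each dimension, never dimension-to-dimension.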
Dimension Tables: Descriptive Attributes
Dimension tables contain the descriptive context that makes fact table metrics meaningful. PatientDim contains patient demographics – age, sex, race, insurance type. DepartmentDim contains department name, specialty, facility, and location. ProviderDim contains provider name, specialty, and NPI. DiagnosisDim contains ICD-10 codes and descriptions. DateDim contains calendar attributes like fiscal year, quarter, week number, and day of week.
The DateDim is one of Caboodle’s most powerful features. A query that breaks down encounter volume by fiscal quarter, by day of week, by month-over-month comparison – all of these are made trivial by joining EncounterFact to DateDim. In Clarity, the same analysis requires date arithmetic functions applied to raw timestamp columns. In Caboodle, you join to DateDim and filter on FISCAL_QUARTER or DAY_OF_WEEK_NAME. The dimensional model eliminates much of the date calculation complexity that makes Clarity analytical queries difficult to write correctly.
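A toy version of that join makes the point concrete. A real DateDim carries many more attributes, and the fiscal calendar below is assumed to match the calendar year, which most health systems' fiscal calendars do not:

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DateDim (
    DateKey          TEXT PRIMARY KEY,
    FISCAL_QUARTER   TEXT,
    DAY_OF_WEEK_NAME TEXT
);
CREATE TABLE EncounterFact (EncounterKey INTEGER PRIMARY KEY, DateKey TEXT);
""")

# Populate one week of DateDim rows. In Caboodle this table is built
# once by the ETL; the analyst never does date arithmetic again.
d = date(2025, 6, 28)
while d <= date(2025, 7, 4):
    fiscal_quarter = f"FY2025-Q{(d.month - 1) // 3 + 1}"  # assumed calendar-aligned
    conn.execute("INSERT INTO DateDim VALUES (?, ?, ?)",
                 (d.isoformat(), fiscal_quarter, d.strftime("%A")))
    d += timedelta(days=1)

conn.executemany("INSERT INTO EncounterFact VALUES (?, ?)",
                 [(1, "2025-06-30"), (2, "2025-07-01"), (3, "2025-07-02")])

# No DATEDIFF, no timestamp math -- just a join and a GROUP BY on an
# attribute the ETL already computed.
rows = conn.execute("""
    SELECT dd.FISCAL_QUARTER, COUNT(*) AS encounters
    FROM EncounterFact f
    JOIN DateDim dd ON dd.DateKey = f.DateKey
    GROUP BY dd.FISCAL_QUARTER
    ORDER BY dd.FISCAL_QUARTER
""").fetchall()
```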
Why Star Schema Queries Are Faster for Analytics
The star schema’s analytical performance advantage comes from the pre-aggregation and pre-computation built into the ETL process. When Caboodle’s ETL loads EncounterFact, it calculates and stores the LOS for each encounter directly in the fact table. A Clarity query for average LOS must calculate DATEDIFF(MINUTE,…)/1440 for every row in the result set. A Caboodle query can simply AVG the pre-computed LOS_DAYS column. This pre-computation advantage compounds when you are aggregating millions of encounter rows across multiple dimensions simultaneously.
The tradeoff is that pre-computed metrics are static until the next ETL run. The LOS in EncounterFact reflects the LOS as calculated during the most recent ETL, using the metric definition built into the Caboodle ETL logic. If your organization’s LOS definition differs from Epic’s Caboodle default calculation, the pre-computed value is wrong – and you must either modify the ETL logic or recalculate in your query. This is the validation step most Caboodle implementations skip.
Key Caboodle Tables for Analysts
| Table Name | Type | Grain (One Row Per) | Key Pre-Computed Metrics | Primary Use Case |
|---|---|---|---|---|
| EncounterFact | Fact | Patient encounter | LOS_DAYS, READMISSION_FLG, RISK_SCORE | Volume, LOS, readmission, population analytics |
| PatientDim | Dimension | Patient | AGE_AT_ENCOUNTER, PAYER_TYPE_C | Demographics, payer mix, population segmentation |
| DepartmentDim | Dimension | Department | DEPT_NAME, SPECIALTY, FACILITY | Departmental filtering and grouping |
| DiagnosisDim | Dimension | Diagnosis code | ICD10_CODE, DESCRIPTION, CHRONIC_FLG | Diagnosis analysis, case mix, chronic disease |
| ProviderDim | Dimension | Provider | PROV_NAME, NPI, SPECIALTY, DEPT_ID | Provider productivity, referral patterns, credentialing |
| DateDim | Dimension | Calendar date | FISCAL_YEAR, FISCAL_QUARTER, DAY_OF_WEEK_NAME | Time-based analysis, fiscal period reporting |
| MedicationAdministrationFact | Fact | Medication administration | ADMIN_ACTION, DOSE_GIVEN | Medication compliance, pharmacy analytics |
| OrderFact | Fact | Clinical order | ORDER_TO_RESULT_MINUTES (lab), ORDER_STATUS | CPOE utilization, lab turnaround, ordering patterns |
An academic medical center’s quality team needed a 30-day all-cause readmission analysis broken down by primary diagnosis category (using CCS groupings of ICD-10 codes), payer type, age group, and attending service – for the last 24 months, compared month-over-month. In Clarity, this query required joining PAT_ENC, PATIENT, PAT_ENC_DX, CLARITY_EDG, and CLARITY_SER across 24 months of data, with complex date arithmetic for month-over-month comparison and manual CCS code grouping logic. The initial Clarity query took 22 minutes to run and required 340 lines of SQL written by a senior data analyst. The same analysis in Caboodle joined EncounterFact to PatientDim (payer type), DiagnosisDim (CCS grouping pre-loaded in dim), ProviderDim (attending service), and DateDim (month-over-month via FISCAL_MONTH columns). The Caboodle query ran in 4 minutes, used 28 lines of SQL, and the READMISSION_FLG pre-computed metric eliminated the self-join that had been the most complex part of the Clarity version. The 22-minute Clarity run time was also a problem during the ETL window – the Caboodle query could run at any time without impacting the Clarity reporting environment.
Caboodle vs Clarity: When to Use Which for Analytics
The decision between Caboodle and Clarity is not about which one is better – both are valuable for different analytical tasks. The decision is about fit: which platform’s data model matches the question being asked?
| Dimension | Epic Clarity | Epic Caboodle |
|---|---|---|
| Data model | Normalized relational (mirrors Chronicles structure) | Dimensional star schema (optimized for analytics) |
| Query complexity | High for multi-dimensional analysis – many joins required | Lower for multi-dimensional analysis – pre-aggregated |
| Performance at scale | Can slow significantly for large multi-table analytical queries | Better for large analytical queries across many rows |
| Near-real-time data | Yes – near-real-time extracts available for select tables | No – nightly ETL only; one cycle behind Clarity |
| Operational reporting | Better fit – Workbench reports run against Clarity | Not designed for operational/near-real-time reporting |
| Enterprise analytics | Workable but complex for multi-dimensional analysis | Better fit – built for enterprise analytical queries |
| External data integration | Epic data only | Supports non-Epic data sources (financial, claims, HR) |
| BI platform integration | Possible – SQL Server connection direct | Preferred – dimensional model maps naturally to BI tools |
| Pre-computed metrics | None – all calculations at query time | LOS, readmission flags, risk scores pre-calculated in ETL |
| Implementation cost | Included in Epic license (Clarity access) | Additional license + infrastructure investment |
A practical decision framework: if the question can be answered by a Reporting Workbench report running against Clarity, use Clarity. If the question requires multi-year trend analysis across multiple dimensions, integration with non-Epic data, or BI platform visualization at scale, use Caboodle. If the question requires near-real-time data (same-day operational metrics), neither Caboodle nor the nightly Clarity extract will work – you need near-real-time Clarity extracts or a different data source. Writing direct SQL queries against Clarity for operational analysis is covered in the Epic Clarity SQL for Analysts guide.
External Data Integration in Epic Caboodle
Caboodle’s most underused capability is its support for non-Epic data sources. Caboodle’s ETL framework can ingest data from external systems – financial accounting systems, claims data, HR systems, patient satisfaction platforms (Press Ganey, HCAHPS), external benchmarks (Vizient, Premier) – and link it to Epic clinical data through common keys like patient MRN, encounter ID, or provider NPI.
This integration capability is what makes Caboodle a genuine enterprise data warehouse rather than just an Epic-specific analytical database. A health system that integrates claims data into Caboodle can answer questions like: which patients seen in our ambulatory clinics subsequently had claims with another health system, and what were those services? A health system that integrates financial data can answer: what is the actual margin per DRG after accounting for labor costs from the HR system? Neither of these questions is answerable from Epic data alone.
External Data Integration Architecture
External data enters Caboodle through the Epic Foundation System’s ETL pipeline or through a custom ETL process that loads data directly into Caboodle’s custom schema. Epic provides a custom schema namespace within Caboodle specifically for non-Epic data – this keeps external data organized separately from the Epic-managed fact and dimension tables and prevents it from being overwritten by Epic ETL updates.
The linking mechanism between external data and Epic data requires a common identifier. Patient-level linkage uses PAT_ID or MRN. Encounter-level linkage uses CSN (Contact Serial Number) or visit date range. Provider-level linkage uses NPI. Build analysts designing external data integration must confirm which identifier the external system uses and verify that the same patient/provider appears with the same identifier in both systems before assuming a JOIN will produce correct matches.
A regional health system integrated Medicare claims data into Caboodle to identify patients who had visits at the health system but received subsequent care at competing facilities. The integration plan linked claims to Epic patients via Medicare Beneficiary Identifier (MBI). During data validation, the team discovered that Epic’s patient records stored the MBI in a custom field that had inconsistent formatting – some records had hyphens (1EG4-TE5-MK72), some had no hyphens (1EG4TE5MK72), and some had leading spaces from a legacy import. The claims data used the hyphenated format consistently. The JOIN linked only 67% of expected patient records because of the format inconsistency. Resolving the linkage required a data cleansing step on the Epic MBI field before the JOIN would work correctly. This is a routine data quality problem in any enterprise integration – but it is often not discovered until the first validation run, which in this case was three weeks before the scheduled production deployment.
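The cleansing step in that case study can be sketched as a normalization function applied to both sides of the join key. `normalize_mbi` is a hypothetical helper, and the sample identifiers mirror the formats described above rather than real MBIs:

```python
def normalize_mbi(raw: str) -> str:
    """Normalize a Medicare Beneficiary Identifier for joining:
    strip surrounding whitespace, drop hyphens, uppercase.
    (Hypothetical helper illustrating the cleansing step; not an
    Epic- or CMS-provided function.)"""
    return raw.strip().replace("-", "").upper()

# Illustrative Epic-side values in the three formats found in the
# custom field: hyphenated, bare, and leading-space from a legacy import.
epic_mbis   = ["1EG4-TE5-MK72", "1EG4TE5MK73", " 1EG4TE5MK74"]
# Claims-side values, consistently hyphenated.
claims_mbis = ["1EG4-TE5-MK72", "1EG4-TE5-MK73", "1EG4-TE5-MK74"]

# Joining on raw values matches only 1 of 3 patients...
raw_matches = len(set(epic_mbis) & set(claims_mbis))

# ...normalizing both sides before the join matches all 3.
clean_matches = len({normalize_mbi(m) for m in epic_mbis}
                    & {normalize_mbi(m) for m in claims_mbis})
```

In production the same normalization would run as a transform in the custom-schema ETL, not at query time, so every downstream join inherits the clean key.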
Pre-Built Derived Metrics and When to Validate Them
Caboodle’s pre-built derived metrics are one of its most valuable features and one of its most common sources of reporting errors. Epic calculates and stores metrics like LOS, 30-day readmission flags, and risk scores directly in fact tables during the ETL process. These pre-computed values are faster to query than real-time calculations. But they use Epic’s definition of each metric, which may not match your organization’s adopted definition.
Readmission Flag Validation
The READMISSION_FLG in EncounterFact uses Epic’s default readmission definition: an unplanned inpatient readmission within 30 days of an index inpatient discharge, for the same patient at the same facility. This definition differs from CMS’s Hospital Readmission Reduction Program (HRRP) definition, which applies risk adjustment based on diagnosis and excludes planned readmissions using a different exclusion list. It differs from some organizations’ internal definitions that count readmissions to any facility in the network, not just the same facility.
Before using any Caboodle derived metric for performance management, regulatory reporting, or payer contracting, validate the metric definition against your organization’s canonical definition. The validation process: run the pre-computed Caboodle metric for a prior period, then run your own calculation of the same metric from first principles, compare the results. Any discrepancy requires investigation. The BAT vs UAT methodology that applies to clinical system testing applies equally to data warehouse validation – described in the BAT vs UAT guide.
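That validation loop can be sketched in a few lines. The extract below is invented, and the simplified readmission rule (any inpatient readmission within 30 days of the prior discharge for the same patient) is a stand-in for your organization's canonical definition, not Epic's or CMS's:

```python
from datetime import date

# Hypothetical extract: (encounter_id, patient_id, admit, discharge,
# stored_flag), where stored_flag stands in for a pre-computed
# READMISSION_FLG pulled from EncounterFact.
encounters = [
    (1, "A", date(2025, 1, 1),  date(2025, 1, 5),  "N"),
    (2, "A", date(2025, 1, 20), date(2025, 1, 22), "Y"),  # 15 days after #1
    (3, "A", date(2025, 4, 1),  date(2025, 4, 3),  "Y"),  # 69 days -- stored flag disagrees
    (4, "B", date(2025, 2, 1),  date(2025, 2, 2),  "N"),
]

def recompute_flags(encs):
    """First-principles recomputation under the simplified rule above."""
    flags, last_discharge = {}, {}
    for enc_id, pat, admit, discharge, _ in sorted(encs, key=lambda e: e[2]):
        prior = last_discharge.get(pat)
        is_readmit = prior is not None and (admit - prior).days <= 30
        flags[enc_id] = "Y" if is_readmit else "N"
        last_discharge[pat] = discharge
    return flags

recomputed = recompute_flags(encounters)
# Every stored/recomputed disagreement is a discrepancy to investigate.
discrepancies = [enc_id for enc_id, _, _, _, stored in encounters
                 if stored != recomputed[enc_id]]
```

Encounter 3 surfaces as a discrepancy: the stored flag says readmission, the recomputation says the gap exceeds 30 days. Whether the stored value or the recomputation is "right" is exactly the definitional question the validation exists to settle.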
LOS Validation in Caboodle
LOS_DAYS in EncounterFact is calculated from admission and discharge timestamps during the Caboodle ETL. The arithmetic (elapsed minutes converted to decimal days) is consistent; the definitional question is which timestamps feed it. Epic's LOS may use the A01 (admit) to A03 (discharge) ADT event timestamps, which is the correct clinical LOS. Some organizations' billing LOS definitions use registration timestamp to discharge timestamp. These produce slightly different numbers that matter when your Caboodle LOS is being compared against payer data that uses a different definition.
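The arithmetic itself is trivial, which is what makes the timestamp choice the whole question. `los_days` is a hypothetical helper, and the two timestamp pairs stand in for ADT admit/discharge versus registration/discharge events for the same stay:

```python
from datetime import datetime

def los_days(start: datetime, discharge: datetime) -> float:
    """Elapsed minutes converted to decimal days, rounded to 2 places.
    (Illustrative helper -- the arithmetic is uncontroversial; which
    two timestamps you pass in is the definitional decision.)"""
    return round((discharge - start).total_seconds() / 60 / 1440, 2)

discharge = datetime(2025, 3, 4, 10, 0)       # A03 discharge event
adt_admit = datetime(2025, 3, 1, 14, 0)       # A01 admit event
registration = datetime(2025, 3, 1, 11, 30)   # earlier registration timestamp

clinical_los = los_days(adt_admit, discharge)     # ADT-based definition
billing_los  = los_days(registration, discharge)  # registration-based definition
```

The same stay yields 2.83 days under one definition and 2.94 under the other – small per-encounter, but systematic across a year of payer comparisons.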
Caboodle on Azure Synapse: Cloud Deployment Considerations
Epic increasingly supports Caboodle deployment on Microsoft Azure Synapse Analytics rather than on-premises SQL Server. Azure Synapse is a cloud-based distributed analytics platform that provides elastic compute scaling, integration with Azure Machine Learning, native Power BI connectivity, and consumption-based pricing. For large health systems with multi-facility Epic deployments and complex analytical workloads, Azure Synapse offers significant performance advantages for the largest Caboodle queries.
Cloud Caboodle vs On-Premises Caboodle
| Dimension | On-Premises Caboodle (SQL Server) | Cloud Caboodle (Azure Synapse) |
|---|---|---|
| Infrastructure | Organization-managed SQL Server hardware and licensing | Microsoft Azure-managed – no hardware to provision |
| Compute scaling | Fixed – limited by purchased server capacity | Elastic – scale up/down based on query demand |
| Large query performance | Can bottleneck at server capacity ceiling | Distributed query engine handles large parallel workloads |
| Cost model | Fixed capital cost (hardware + SQL Server license) | Consumption-based – cost increases with query volume |
| HIPAA compliance | Managed by organization’s data center security controls | Requires Microsoft Azure HIPAA BAA and configuration review |
| Power BI integration | SQL Server connection via Power BI Desktop or gateway | Native Azure connectivity – no gateway required |
| Best for | Organizations with existing data center infrastructure; predictable query loads | Organizations seeking elastic scaling; ML integration; Power BI-first BI strategy |
The most significant risk with cloud Caboodle is unoptimized query cost. Azure Synapse pricing is consumption-based – a poorly optimized query that performs a full distributed scan of a large fact table in 8 seconds might generate significant compute charges if run frequently. Organizations migrating from on-premises Caboodle where query cost was invisible (included in fixed infrastructure) must establish query governance, resource monitoring, and cost alerting before migrating analytical workloads to Azure Synapse.
Connecting Caboodle to External BI Platforms
Caboodle’s dimensional model makes it the natural backend for external Business Intelligence platforms – Power BI, Tableau, Qlik, and others. The star schema structure maps directly to the way BI tools think about data: drag-and-drop dimensions for filters and groupings, pre-computed measures for instant aggregation. Connecting a BI platform to Caboodle typically produces far better performance and usability than connecting the same BI platform directly to Clarity’s normalized tables.
Connection Architecture and HIPAA Requirements
Connecting a BI platform to Caboodle requires a read-only database connection – a SQL Server connection string with a service account that has SELECT-only permissions on the Caboodle schema. On-premises connections use the BI tool’s SQL Server connector. Azure Synapse connections use Azure-native connectors that do not require an on-premises data gateway.
HIPAA compliance requires that the BI platform is covered by a Business Associate Agreement (BAA) with the health system. Microsoft Power BI’s enterprise tier (Power BI Premium or Power BI for Government) supports BAA coverage. Tableau can be covered via BAA when deployed in a HIPAA-eligible configuration. Consumer-tier BI tools without enterprise compliance certification must not be connected to Caboodle if the data contains PHI. The privacy officer must confirm BAA status for any external platform before the Caboodle connection is established. Epic Reporting Workbench’s relationship to Caboodle and external BI platforms is covered in the Epic Reporting Workbench guide.
DirectQuery vs Import Mode in Power BI
When connecting Power BI to Caboodle, analysts must choose between DirectQuery mode (Power BI sends queries to Caboodle in real time when users interact with dashboards) and Import mode (Power BI imports a scheduled snapshot of Caboodle data into its own storage). DirectQuery provides up-to-date data but generates Caboodle compute load with every dashboard interaction. Import mode reduces Caboodle query load but introduces a data freshness lag equal to the import schedule. For Caboodle (which is already one ETL cycle behind Chronicles), a daily import schedule in Power BI adds another data freshness layer – Import mode with a daily refresh at 7:00 AM and a Caboodle ETL that completes at 6:00 AM produces a tolerable one-day maximum data lag for most analytical use cases.
Caboodle Implementation: What Teams Consistently Underestimate
Caboodle implementations are consistently underestimated in three areas – ETL validation, derived metric validation, and data governance – and these are precisely the areas where shortcuts cause an implementation to deliver less analytical value than expected.
ETL Validation Takes Longer Than Expected
Validating that the Caboodle ETL correctly populated fact and dimension tables from Clarity is not a quick exercise. EncounterFact record counts must be reconciled against Clarity PAT_ENC. Dimension table values must be checked against Clarity reference tables. External data joins must be validated against the source system’s own record counts. A thorough ETL validation for a medium-sized health system (300-500 beds, 3-5 years of historical data) typically takes 4-8 weeks when done correctly.
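One such reconciliation check can be sketched as a per-period count comparison. The counts below are invented; in practice each dict would be populated by a COUNT(*) grouped by month against Clarity PAT_ENC and against EncounterFact, and the 0.1% tolerance is an assumed threshold, not an Epic standard:

```python
# Hypothetical monthly encounter counts from both platforms.
clarity_counts  = {"2025-01": 41_250, "2025-02": 38_900, "2025-03": 42_110}
caboodle_counts = {"2025-01": 41_250, "2025-02": 38_900, "2025-03": 41_800}

TOLERANCE = 0.001  # assumed 0.1% reconciliation threshold

def reconcile(source, target, tolerance=TOLERANCE):
    """Return {period: (expected, actual, drift)} for every period whose
    relative count drift exceeds the tolerance."""
    failures = {}
    for period, expected in source.items():
        actual = target.get(period, 0)
        drift = abs(actual - expected) / expected
        if drift > tolerance:
            failures[period] = (expected, actual, round(drift, 4))
    return failures

failures = reconcile(clarity_counts, caboodle_counts)
```

Here March fails with a 0.74% shortfall – 310 encounters missing from the warehouse, each one a potential filter in the ETL load logic that needs to be explained before the table is trusted.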
Governance Must Be Built Before the First Report Is Published
The same report proliferation and metric inconsistency problem that affects Reporting Workbench (47 versions of the census report) will occur with Caboodle-based analytics if governance is not in place before the first dashboard goes live. Establish canonical metric definitions, report certification processes, and BI platform access governance before publishing any Caboodle-based analytics to clinical or executive stakeholders. The investment in governance structure pays back immediately in analytical credibility.
Roles, Skills, and Career Path for Caboodle Work
| Role | Certification | Key Skills | Salary Range (2026) |
|---|---|---|---|
| Caboodle Data Analyst | Epic Cogito | Dimensional SQL, fact/dimension model, metric validation | $90,000 – $130,000 |
| Data Architect / EDW Lead | Epic Cogito + Azure/cloud cert | ETL design, external integration, Azure Synapse, BI architecture | $120,000 – $165,000 |
| Clinical Informatics Analyst | Epic Cogito + clinical background | Metric governance, CMS quality, analytical credibility | $105,000 – $145,000 |
| BI Developer (Caboodle-connected) | Power BI / Tableau cert + Epic Cogito | Star schema BI modeling, DirectQuery, row-level security | $95,000 – $135,000 |
Validate every pre-computed Caboodle derived metric against your organization’s canonical definition before using it in any report or dashboard. Start with the three metrics that appear in the most leadership conversations: readmission rate, length of stay, and patient volume. Run Epic’s pre-computed value alongside your own calculation from first principles. Document the differences. Correct the ones that matter for your use case – either by adjusting the ETL logic or by recalculating in the query layer. Organizations that do this before the first Caboodle dashboard goes live avoid six months of stakeholder trust erosion from metric discrepancies that analysts cannot explain.
Authoritative References
- Microsoft Azure Synapse Analytics – Architecture, Distributed Query Engine, and HIPAA-Eligible Configuration
- CMS – Hospital Readmission Reduction Program (HRRP): Official Readmission Measure Specifications and Risk Adjustment Methodology
