AI and data platforms | 11 Jun 2026 | 7 min read

What is Palantir - Part 2: How the data gets in

Foundry connects to the systems you already run. Part 2 covers the integration layer: connectors and virtual tables, four sync patterns, governed pipelines with LLM nodes, and markings.

Read Part 1 of What is Palantir first.

When I walk technical teams through Foundry's data integration layer, one question comes up more than any other. A CTO or head of data leans back and asks: "We already have Databricks. Or Snowflake. Why would we need another data platform on top of that?" It's the right challenge to raise.

Databricks, Snowflake, and BigQuery are built to store enterprise data at scale and query it fast. Foundry is built for a different job. It turns that data into an operational model of the business, with workflows, permissions, and actions attached.

Take a supplier. In a warehouse, a supplier is a row in a table. You can query it, join it, count how many are overdue. In Foundry's Ontology, that same supplier is an object linked to its purchase orders, production schedules, inventory positions, the people who manage it, and the actions an AI agent is permitted to take on it. Because the Ontology defines both the supplier and the permitted actions, an AI agent can propose a supplier switch, route it to a human for approval, and write the decision back into your ERP. All inside one governed framework.

Palantir calls this approach an "unwalled garden." Foundry works alongside the architecture you already have. It connects directly to Databricks and Snowflake, so you can leave your data where it lives and use Foundry's modelling, governance, and decision layer on top.

Here's how the integration layer works.

What Foundry connects to

Foundry ships with connectors for most common enterprise systems and data platforms.

On the ERP and business application side it connects to SAP (ERP, SLT, HANA, SuccessFactors, Ariba, Concur, and more), NetSuite (SuiteAnalytics, SuiteQL, SuiteTalk), major CRMs, WMS platforms, and financial systems.

It also connects to cloud databases, object stores, streaming platforms, file feeds, IoT sensors, webhooks, and REST APIs, and can ingest structured tables, documents, images, audio, and event streams. On the analytics side, Power BI, Tableau, and Jupyter connect to Foundry data directly, so teams with established BI workflows can keep them.

For organisations already running Databricks, Snowflake, or BigQuery, the integration goes deeper than a connector. Foundry can register tables from those platforms as virtual tables: direct pointers to the source that do not physically move the data. A query against a virtual table can run on the source platform's compute, with Foundry applying its security and lineage model on top.

For environments that cannot open inbound network ports, Foundry's Agent Worker runs inside the customer's network and maintains outbound-only connections to the platform. A CTO with a strict network perimeter can run the full integration stack without exposing any inbound surface.

Choosing between synced datasets and virtual tables comes down to latency, governance, and operational requirements. Syncing gives you full dataset capabilities: versioning, branching, lineage, and health checks. Virtual tables keep data where it lives and avoid duplication, but they depend on live source connectivity and don't carry those same guarantees.

Foundry builds on top of your existing data warehouse.

How the data stays current

Stale data is a fast way to break an operational workflow, so Foundry offers four sync patterns: scheduled full snapshots, incremental pulls of only what changed since the last run, streaming ingestion for event sources and sensor feeds, and change data capture (CDC). CDC reads the database change log directly, which suits high-transaction systems where row-level precision matters.
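To make the difference between a full snapshot and an incremental pull concrete, here is a minimal sketch in plain Python. It is not Foundry's sync API: the `IncrementalSync` class, the `updated_at` field, and the integer watermark are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class IncrementalSync:
    """Toy incremental sync (hypothetical): pulls only rows newer than the last watermark."""
    watermark: int = 0                         # highest `updated_at` value seen so far
    landed: list = field(default_factory=list)  # everything pulled across all runs

    def run(self, source_rows):
        # Select only rows that changed since the previous run.
        new_rows = [r for r in source_rows if r["updated_at"] > self.watermark]
        if new_rows:
            self.watermark = max(r["updated_at"] for r in new_rows)
            self.landed.extend(new_rows)
        return new_rows


# Hypothetical source table with a monotonically increasing change marker.
source = [
    {"id": 1, "updated_at": 10},
    {"id": 2, "updated_at": 20},
]
sync = IncrementalSync()
first = sync.run(source)                    # both rows are new on the first run
source.append({"id": 3, "updated_at": 30})
second = sync.run(source)                   # only the row added since the first run
```

A full snapshot would re-read all three rows on the second run. The incremental pull moves one row, at the cost of depending on a reliable change marker in the source.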

What connects all four is versioning. Every sync run lands as a discrete, versioned transaction, tied back to the sync that produced it. When something breaks upstream (say a source schema changes or a field disappears), you see exactly where it happened and which downstream datasets were affected.

The Data Lineage application shows this as a full graph of upstream and downstream dependencies across the platform.
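The "which downstream datasets were affected" question is a graph traversal. The sketch below shows the idea with a hand-written dependency map. The dataset names and the `downstream_of` helper are hypothetical, and this is not how the Data Lineage application is implemented.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets built directly from it.
LINEAGE = {
    "erp_suppliers_raw": ["suppliers_clean"],
    "suppliers_clean": ["supplier_risk", "supplier_spend"],
    "supplier_risk": ["supplier_dashboard"],
    "supplier_spend": ["supplier_dashboard"],
    "supplier_dashboard": [],
}


def downstream_of(dataset, lineage):
    """Return every dataset reachable downstream of `dataset`, breadth-first."""
    seen, order = set(), []
    queue = deque(lineage.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.add(current)
        order.append(current)
        queue.extend(lineage.get(current, []))
    return order


# A schema change lands in `suppliers_clean`: what needs attention?
affected = downstream_of("suppliers_clean", LINEAGE)
```

Here `affected` lists the two derived datasets and the dashboard they both feed, each exactly once.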

Flow diagram of Foundry's four sync paths, batch, incremental, change data capture, and streaming, carrying data from source systems into a stack of versioned dataset transactions. Below, a data lineage chain marks where a schema change occurred and which downstream datasets were affected. — Sync runs act as versioned transactions. When something breaks, you know exactly where.

The pipeline: from no-code to full code, with AI in the middle

The raw data a source system produces and the clean, connected dataset your Ontology needs are rarely the same thing. Foundry supports both visual and code-based pipeline development to close that gap.

Pipeline Builder is a visual interface covering the full pipeline, from source connection through transformation to output. It lowers the barrier for analysts and domain experts to build and validate transforms, joins, and outputs without writing code, though complex pipelines still tend to end up in the hands of engineers. Version control is built in, and changes move through a branching and validation flow before promotion to production. Pipeline Builder can export transforms directly to Code Repositories, so teams can start visual and graduate to code without rebuilding their pipelines from scratch.

Code Repositories supports Python, Java, SQL, and R transforms against Foundry's Spark-backed compute engine, with lighter-weight options for non-Spark workloads on smaller datasets. For engineers who need full control over transformation logic, or who are building complex multi-step pipelines, it's the right tool. The compute layer scales horizontally to support large datasets and distributed workloads.

Before data reaches anything downstream, health checks validate it against defined thresholds: row counts, null rates, referential integrity, and custom business rules. If validation fails, the pipeline halts and the owning team is notified before downstream datasets are affected.
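As a rough illustration of the halt-on-failure behaviour, the sketch below checks a row count and a null rate before letting data through. Every name in it (`run_health_checks`, `publish`, `HealthCheckError`, the threshold parameters) is an assumption for this example. Foundry's own health checks are configured in the platform rather than written like this.

```python
class HealthCheckError(Exception):
    """Raised to halt the toy pipeline before anything downstream is built."""


def run_health_checks(rows, min_rows, max_null_rate, required_field):
    """Return a list of failure messages; an empty list means the data passed."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} below minimum {min_rows}")
    nulls = sum(1 for r in rows if r.get(required_field) is None)
    null_rate = nulls / len(rows) if rows else 1.0  # treat an empty input as fully null
    if null_rate > max_null_rate:
        failures.append(
            f"null rate {null_rate:.0%} for '{required_field}' exceeds {max_null_rate:.0%}"
        )
    return failures


def publish(rows, **thresholds):
    """Pass rows downstream only if every check succeeds."""
    failures = run_health_checks(rows, **thresholds)
    if failures:
        raise HealthCheckError("; ".join(failures))
    return rows


# Hypothetical thresholds and sample data.
thresholds = dict(min_rows=2, max_null_rate=0.1, required_field="supplier_id")
good = [{"supplier_id": "S1"}, {"supplier_id": "S2"}]
bad = [{"supplier_id": "S1"}, {"supplier_id": None}]

published = publish(good, **thresholds)
try:
    publish(bad, **thresholds)
    halted = False
except HealthCheckError:
    halted = True  # the bad batch never reaches downstream datasets
```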

For pipelines that process data already living in Databricks, Snowflake, or BigQuery, Foundry supports compute pushdown. The transformation logic is defined in Foundry, but the execution runs on the native compute engine of the source system. Lineage and access controls stay in Foundry even when the execution doesn't. For large-scale workloads, pushdown can also reduce or avoid data egress charges, since the data stays in the source environment.
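The principle behind pushdown is easy to show at small scale. In the sketch below an in-memory SQLite database stands in for the external warehouse, which is purely an assumption for illustration; the table and function names are hypothetical too. The same aggregation is computed twice, once by pulling every row out and once by sending the query to the source.

```python
import sqlite3

# An in-memory SQLite database stands in for the external warehouse (assumption).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (supplier TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("acme", 100.0), ("acme", 250.0), ("globex", 75.0), ("globex", 25.0)],
)


def spend_without_pushdown(conn):
    """Pull every row out of the source, then aggregate locally."""
    rows = conn.execute("SELECT supplier, amount FROM orders").fetchall()
    totals = {}
    for supplier, amount in rows:
        totals[supplier] = totals.get(supplier, 0.0) + amount
    return totals, len(rows)  # second value: rows that left the source


def spend_with_pushdown(conn):
    """Send the aggregation to the source; only the result rows come back."""
    rows = conn.execute(
        "SELECT supplier, SUM(amount) FROM orders GROUP BY supplier"
    ).fetchall()
    return dict(rows), len(rows)


local_totals, rows_moved_local = spend_without_pushdown(source)
pushed_totals, rows_moved_pushed = spend_with_pushdown(source)
```

Both paths give the same totals, but the pushdown version moves two result rows instead of four source rows. At warehouse scale that gap is where the compute and egress savings come from.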

Compute pushdown: the governance model stays in Foundry. The heavy lifting happens where the data lives.

Pipeline Builder includes an LLM node. You can insert a language model step directly into a pipeline to classify unstructured text, extract entities from documents, or enrich structured data with AI-generated fields. It runs under the same governance as every other step, and it complements conventional transforms rather than replacing them: use it for the work deterministic logic can't do.

Palantir calls this the Multimodal Data Plane. The same pipeline infrastructure processes structured tables, documents, images, audio, and streaming events, and governs them the same way regardless of data type.

Pipeline Builder graph in which supplier documents are parsed, fan out to two LLM nodes for entity extraction and clause classification, then are filtered, combined, and output to the Ontology. An optional export path leads to Code Repositories for Python, Java, SQL, and R. — The same governed pipeline infrastructure from drag-and-drop to full code.

Markings: security that travels with the data

Most enterprise systems handle security at the application layer. Foundry attaches it to the data itself. A marking is a mandatory access control label applied to a dataset. When a marked dataset feeds into a transformation, the output inherits the marking automatically, and the label propagates through every subsequent transformation and derived dataset. Markings are conjunctive: a user must hold every marking on the data to access it. Deployments under government classification rules can add classification markings (CBAC), where releasability works disjunctively: any one marking in the release-to set is enough. Security teams can model real-world classification schemes directly in the data layer.
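The three rules in that paragraph (inheritance, conjunctive access, disjunctive releasability) reduce to simple set operations. The sketch below is a toy model, not Foundry's security API: the function names, marking names, and group names are all hypothetical.

```python
def output_markings(*input_marking_sets):
    """A derived dataset inherits the union of the markings on all its inputs."""
    return frozenset().union(*input_marking_sets)


def can_access(user_markings, data_markings):
    """Conjunctive rule: the user must hold every marking on the data."""
    return set(data_markings) <= set(user_markings)


def can_release(user_groups, release_to):
    """Disjunctive rule: belonging to any one release-to entry is enough."""
    return bool(set(user_groups) & set(release_to))


# Hypothetical markings on two input datasets.
suppliers = frozenset({"COMMERCIAL"})
payroll = frozenset({"HR", "FINANCE"})

# Joining the two produces an output that carries all three markings.
joined = output_markings(suppliers, payroll)

analyst_ok = can_access({"HR", "FINANCE", "COMMERCIAL"}, joined)  # holds every marking
analyst_blocked = can_access({"HR", "FINANCE"}, joined)           # missing COMMERCIAL
partner_ok = can_release({"UK"}, {"UK", "US"})                    # one match suffices
```

The asymmetry is the point: adding a marking to data narrows the audience, while adding an entry to a release-to set widens it.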

Markings travel with the data. A classified input always produces a classified output.

Open data: Iceberg tables

Foundry's native datasets are managed through a Foundry-specific transaction layer, so external tools can't read them directly. Some architectures need that data accessible from outside Palantir's ecosystem, and Iceberg tables address this.

Apache Iceberg is an open table format supported natively by Databricks, Snowflake, Trino, Spark, and a growing list of tools. Foundry now supports Iceberg as an alternative storage format, both for data managed inside Foundry and for virtual tables backed by external catalogues.

Data stored as a Foundry Iceberg table can be read by Databricks or Snowflake without an export step, with access available via REST, JDBC, and S3-compatible interfaces. Iceberg also supports row-level edits (DELETE, UPDATE, MERGE), which the standard Foundry dataset format doesn't. That matters for workflows that need conditional row modifications without re-snapshotting the entire table.

Iceberg table support in Foundry was in beta at time of writing. Check the current platform release notes before building production workflows against it.

The trade-off

None of this is free. Foundry adds an operational layer to your architecture, and another platform for the team to learn and pay for. If all you need is analytics, Databricks or Snowflake on their own may be enough. The case for Foundry starts when workflows, governance, and decisions need to sit directly on top of the data, when the supplier needs to be more than a row in a table.

The handoff to the Ontology

The pipeline output is a curated, versioned, governed dataset, ready to be modelled into objects and relationships in the Ontology. Part 3 covers that modelling: what it means to describe your business as objects and relationships, and how that layer lets applications and AI agents operate across connected business objects.

