Data 3.0 and the Future of Data Infrastructure: Beyond Databricks vs. Snowflake
In the rapidly evolving landscape of data, the battle between Databricks and Snowflake is more than just a clash of platforms—it’s a window into a broader transformation we’re witnessing in how enterprises manage and derive value from data. Welcome to Data 3.0, where AI-native ingestion, dynamic metadata, next-gen compute engines, and blurred engineering boundaries are reshaping everything from data pipelines to developer workflows.
Thesis 1: AI-Native Ingestion and Transformation
In the Data 3.0 era, the traditional tools for data pipelines, ETL, and orchestration are being upended. Rather than relying on rigid, drag-and-drop solutions that were once sufficient for internal analytics, organizations are now demanding tools that can handle real-time, production-grade use cases with the agility that AI workloads require.
Real-Time, Scalable Pipelines:
Modern AI-driven workflows necessitate pipelines that are not only scalable but also capable of transforming data on the fly. Products like Prefect, Windmill, and dltHub are pioneering a code-native approach, enabling pipelines to be treated as modular components that can be easily orchestrated, monitored, and evolved.
Dynamic Data Workflows:
The emergence of frameworks such as Tobiko (from the creators of SQLMesh) illustrates the move toward automating SQL queries, tracking metrics, and mapping data lineage with minimal manual intervention. Meanwhile, innovations like Anthropic’s Model Context Protocol (MCP) are setting the stage for context-aware AI interactions, preserving the integrity and governance of data across every transformation step.
Streaming-First Architectures:
Batch processing will remain critical, but the trend is clear—data processing is shifting closer to real time. Technologies like Apache Kafka and Apache Flink are pivotal in this transition, enabling organizations to support continuous model training and inference, which is key for applications that depend on instantaneous decision-making.
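To make the streaming idea concrete, here is a minimal sketch of a tumbling-window count, one of the basic building blocks of stream processing. It is plain Python and does not use the Kafka or Flink APIs; the event shape and the `tumbling_window_counts` name are assumptions made for illustration, and it assumes events arrive in timestamp order.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical event shape: (timestamp_seconds, key). Not tied to Kafka or Flink.
Event = Tuple[float, str]


def tumbling_window_counts(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Tuple[float, Dict[str, int]]]:
    """Yield (window_start, counts_per_key) each time a window closes.

    Assumes events arrive in non-decreasing timestamp order. Real stream
    processors also handle late and out-of-order data, which this sketch omits.
    Windows that receive no events are not emitted.
    """
    current_start: Optional[float] = None
    counts: Dict[str, int] = defaultdict(int)
    for timestamp, key in events:
        window_start = (timestamp // window_seconds) * window_seconds
        if current_start is None:
            current_start = window_start
        elif window_start != current_start:
            # The previous window is complete, so emit it before moving on.
            yield current_start, dict(counts)
            counts = defaultdict(int)
            current_start = window_start
        counts[key] += 1
    if current_start is not None:
        # Flush the final, possibly partial, window at end of input.
        yield current_start, dict(counts)
```

Because the function is a generator, results are produced as soon as each window closes rather than after the whole input has been read, which is the practical difference between this style and a batch job.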
Thesis 2: Metadata as the New Source of Truth
As the volume and complexity of data continue to grow, so does the importance of managing “data about the data.” The metadata layer is no longer a passive repository; it’s emerging as the strategic cornerstone of modern data infrastructure.
Active Metadata Management:
Historically, metadata served as an afterthought, a reflective layer capturing schema updates and lineage information. Today, it’s at the forefront, driving actions around data governance, optimization, and real-time decision-making. New lakehouse-native data catalogs, such as Datastrato and Vakamo, are setting standards by ensuring that metadata isn’t just observed, but actively managed.
Governance and Compliance in the AI Era:
With AI systems demanding a nuanced understanding of data relationships, the metadata layer now plays a crucial role in ensuring consistency, lineage, and security. Tools like Acryl Data are building unified data catalogs that provide granular control over access and maintain robust records for compliance, even as agents (both human and AI) interact with data in real time.
Optimization for Performance:
Innovations in metadata management, from caching to data versioning, are enhancing the performance of AI workloads. Startups like Flarion.io and Greybeam are pushing boundaries by creating new primitives that help organizations optimize cost, time, and resource consumption—making AI-native infrastructure more efficient than ever.
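One way to picture the difference between passive and active metadata is a catalog that reacts to a schema change instead of merely recording it. The sketch below is a hypothetical in-memory catalog, not the API of any product named in this section; `ActiveCatalog`, its methods, and the table names are all invented for illustration.

```python
from typing import Callable, Dict, List

Schema = Dict[str, str]  # column name -> type name
Hook = Callable[[str, Schema, Schema], None]


class ActiveCatalog:
    """Hypothetical in-memory catalog that fires hooks on schema changes."""

    def __init__(self) -> None:
        self._schemas: Dict[str, Schema] = {}
        self._downstream: Dict[str, List[str]] = {}
        self._hooks: List[Hook] = []

    def add_lineage(self, upstream: str, downstream: str) -> None:
        self._downstream.setdefault(upstream, []).append(downstream)

    def on_schema_change(self, hook: Hook) -> None:
        self._hooks.append(hook)

    def downstream_of(self, table: str) -> List[str]:
        """Return every table reachable from `table` via lineage edges."""
        seen: List[str] = []
        stack = list(self._downstream.get(table, []))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.append(node)
                stack.extend(self._downstream.get(node, []))
        return seen

    def register_schema(self, table: str, schema: Schema) -> None:
        old = self._schemas.get(table)
        self._schemas[table] = dict(schema)
        # Hooks fire only on a change, not on first registration.
        if old is not None and old != schema:
            for hook in self._hooks:
                hook(table, old, schema)


# Usage: flag downstream tables for review when an upstream schema changes.
catalog = ActiveCatalog()
flagged: List[str] = []
catalog.add_lineage("raw.orders", "staging.orders")
catalog.add_lineage("staging.orders", "marts.revenue")
catalog.on_schema_change(
    lambda table, old, new: flagged.extend(catalog.downstream_of(table))
)
catalog.register_schema("raw.orders", {"id": "int", "amount": "float"})
catalog.register_schema("raw.orders", {"id": "int", "amount": "decimal"})
```

In a real system the hook might pause a pipeline, open a ticket, or invalidate a cache; the point here is only that lineage plus change events is enough to drive an action.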
Thesis 3: A New Era for Compute and Query Engines
While Databricks and Snowflake have dominated the data compute landscape—each generating billions in revenue—the next wave of AI-native startups is poised to unlock unprecedented interoperability in the compute and query layer.
Hybrid Workloads and Interoperability:
The traditional dichotomy between batch and streaming is fading. We’re witnessing a co-existence where engines like DuckDB, ClickHouse, and Druid are optimizing specific workloads, while legacy frameworks like Spark and Ray remain foundational. Next-gen query engines and federated compute platforms are emerging, designed to handle AI-first workloads with ease.
Breaking Vendor Lock-In:
The rise of lakehouse architectures is reducing dependence on monolithic systems. By enabling an unbundled approach, enterprises can pick and choose best-of-breed solutions that precisely meet their operational needs without being tied to one vendor’s ecosystem.
Optimized for AI:
New compute frameworks are being designed from the ground up to cater to the demands of AI-driven processes. These frameworks not only process data more efficiently but also support continuous learning pipelines that blend historical batch data with real-time signals, providing more accurate predictions and recommendations.
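As a small illustration of blending historical batch data with real-time signals, the sketch below seeds a running mean from a precomputed batch aggregate and then updates it one event at a time. The `BlendedMean` class and its field names are assumptions for this example; production feature pipelines involve far more (windowing, decay, backfills), which this deliberately leaves out.

```python
from dataclasses import dataclass


@dataclass
class BlendedMean:
    """Running mean seeded from a batch aggregate and updated per event."""

    batch_count: int      # number of historical observations
    batch_sum: float      # sum of historical observations
    stream_count: int = 0
    stream_sum: float = 0.0

    def update(self, value: float) -> None:
        """Fold one real-time observation into the streaming side."""
        self.stream_count += 1
        self.stream_sum += value

    @property
    def value(self) -> float:
        """Mean over batch and streaming observations combined."""
        total = self.batch_count + self.stream_count
        if total == 0:
            raise ValueError("no observations yet")
        return (self.batch_sum + self.stream_sum) / total
```

Keeping the batch and streaming sides separate means the batch aggregate can be recomputed and swapped in later without losing the events seen since the last batch run.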
Thesis 4: Blurring the Lines Between Data and Software Engineering
One of the most profound shifts in Data 3.0 is the convergence of data engineering and software engineering. The era of specialized silos is coming to an end, driven by the need for agile, full-stack proficiency in building and maintaining AI-driven applications.
Integrated Engineering Workflows:
The rise of “AI Engineer” as a job title underscores the demand for professionals who can straddle both worlds. Tools from companies like dbt Labs have democratized data development by introducing software engineering best practices, such as version control, testing, and CI/CD, into data workflows.
Unified Tooling and Collaboration:
Platforms like Gable, Temporal, and Inngest are reimagining data pipeline orchestration by abstracting infrastructure complexity and providing application-like reliability for distributed workflows. This shift is making it easier for teams to collaborate, regardless of whether they come from a traditional software or data background.
Open Source and LLM Integration:
With enterprise reliance on open source growing and AI models like large language models (LLMs) increasingly integrated into development workflows, the collaboration between data and software engineering is accelerating. Open source contributions in data-focused repositories are surging, paving the way for tools that are not only powerful but also widely accessible and supported by a vibrant community.
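The testing practice mentioned earlier in this section can be pictured as assertions about data rather than about code. The sketch below is loosely modeled on the not-null and uniqueness checks common in data testing tools; it is a plain Python analogy with hypothetical function names, not dbt's actual interface, and it assumes column values are hashable.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def not_null(rows: Iterable[Row], column: str) -> List[Row]:
    """Return the rows that fail the check (column missing or None)."""
    return [row for row in rows if row.get(column) is None]


def unique(rows: Iterable[Row], column: str) -> List[Any]:
    """Return each value of `column` that appears more than once."""
    seen = set()
    duplicates: List[Any] = []
    for row in rows:
        value = row.get(column)
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates


# Hypothetical sample data with one null amount and one duplicated id.
orders = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 2, "amount": 5.0},
]
null_failures = not_null(orders, "amount")
duplicate_ids = unique(orders, "id")
```

Each check returns the offending rows or values rather than a boolean, so a CI job can fail the build and also report exactly what broke.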
The Future of Data Infrastructure
The transformation underway in Data 3.0 is not simply about choosing between Databricks and Snowflake. Rather, it’s about rethinking the entire data ecosystem to meet the challenges of a world where AI is at the core of business operations.
AI-native pipelines are reshaping how we ingest, transform, and orchestrate data, moving closer to real-time, production-grade workflows.
The metadata layer is emerging as the critical "source of truth," essential for governance, optimization, and dynamic data interactions.
New compute and query engines are breaking free from traditional paradigms, offering unprecedented interoperability and specialized performance for AI workloads.
The convergence of data and software engineering is fostering a new breed of full-stack AI engineers, poised to drive innovation in ways we’ve never seen before.
Snowflake and Databricks are positioned to capture each of these tailwinds. They provide the underlying infrastructure for capturing value from data, and organizations continue to invest billions of dollars in cloud and data infrastructure to capitalize on these opportunities. While either company could be limited to its existing total addressable market (data warehousing or data analytics), the future is much brighter than that. Snowflake recently articulated its opportunity based on the workloads it can address.

