The Cloud Native Geospatial Ecosystem in 2025: From Integrated Systems to Generative Intelligence
Produced with the help of Gemini Deep Research.
The “Industry Context” sections are taken from the web; their contents were not part of the conference.
See also the NotebookLM podcast based on this document.
Executive Summary
If the Cloud Native Geospatial (CNG) 2025 conference had a single message, it was one of maturation. The community’s focus has decisively shifted, moving beyond the optimization of individual files and toward the architecture of integrated, transactional, and intelligent systems. As conference organizer Jed Sundwall observed in his Welcome Address, this was a gathering of builders—a diverse, vendor-neutral community no longer just experimenting with the cloud’s toolkit, but actively exploiting it to solve geospatial problems at scale.
Two powerful, synergistic currents drove this evolution. The first is the rapid emergence of the geospatial data lakehouse as the new standard for data management—an architecture promising the scale of a data lake with the discipline of a data warehouse. Foundational formats like Cloud Optimized GeoTIFF (COG) and Zarr are no longer the only stars of the show; they are now essential components within larger systems built on open table formats like Apache Iceberg. This shift, quickly validated by adoption from all major cloud providers, directly addresses long-standing challenges in data versioning and consistency. The second current is the rise of generative artificial intelligence, which is reshaping the very experience of data analysis by replacing complex code with natural language.
This report synthesizes the key technological trends from CNG 2025 and some post-conference developments, concluding that while the technical barrier to entry for large-scale geospatial analysis is falling, a new competitive frontier is emerging. The race is no longer about who can build the most complex infrastructure, but about how fast an organization can move from raw data to a crucial insight. As agentic AI and natural language become the primary bridge between vast data repositories and human decision-makers, the ultimate advantage will belong to those who can most effectively span that gap.
The Evolving Data Foundation: Formats, Standards, and the Lakehouse
The report begins with talks on the foundational layers, which are moving toward a more robust, database-like paradigm. As Aimee Barciauskas (Development Seed) said in her introductory talk Cloud-Native Geospatial in Practice, the community is collectively building an ecosystem to “discover, order, and deliver” data more efficiently—to improve the metabolism of our geospatial systems.
A core philosophy in this effort is the move toward Analysis-Ready Data (ARD). In The Future of ARD: Composable Building Blocks, Matthias Mohr detailed how the Committee on Earth Observation Satellites (CEOS) is rethinking ARD, shifting from monolithic products to a flexible framework of small, reusable, and machine-readable “building blocks.” This composable model promises to let users construct precise data products on-demand, while simplifying the authoring and validation process for data providers through a suite of open-source tooling.
The Great Raster Debate: COG’s Dominance vs. Zarr’s Ascendance
A healthy tension defined the conference’s “Great Raster Debate”: the established dominance of Cloud Optimized GeoTIFFs (COG) versus the rapid ascendance of Zarr. COG remains the pragmatic choice for massive archives like the USGS Landsat collection, as Tonian Robinson (USGS) explained in her talk Accessing and Processing Landsat Data in the Cloud. Its maturity, broad ecosystem support, and simple file-per-band model offer a reliable on-ramp. Yet even as she affirmed COG’s present-day dominance, Robinson revealed that the USGS is actively exploring a move to Zarr for its next major reprocessing effort, Landsat Collection 3.
This debate was further explored in the talk Zarr: A Landsat Trade Study at Scale by Thomas Maiersperger and Zachariah Dicus (KBR representing USGS), who quantified both the significant storage savings (up to 28%) and the complex performance trade-offs of migrating the entire Landsat archive to Zarr v3.
Read performance, especially for full data bands, will improve with Zarr, though only the improved sharding capability of Zarr v3 makes it suitable for the vast Landsat archive.
The study proposes storing Level 1 and Level 2 data as individual scenes in a “spectral cube” format rather than a spatio-temporal cube, due to the sparse nature of the data.
Level 3 products would be structured spatio-temporally. Bands would be stored as separate variables grouped by resolution, with metadata stored in Zarr attributes, aligning with the developing GeoZarr specification.
The “biggest negative” is the current lack of mature, out-of-the-box visualization tools for Zarr compared to the widespread support for COGs in GIS software like QGIS and ArcGIS.
Naomi Provost (CTrees) explained in Moving from Science to Product how adopting Zarr as the backbone of their science-to-product pipeline rescued it from “operational chaos”, solving critical challenges with time-series data versioning and inconsistent metadata that had plagued their previous GeoTIFF-based workflows. The solution involved standardizing on Zarr for their time-series carbon data and leveraging Icechunk, a transactional storage engine for Zarr provided via Earthmover’s Arraylake platform. Icechunk provides database-like features such as ACID transactions and Git-like version control (commits, branches, and tags) directly on object storage, which was critical for ensuring data integrity and reproducibility in a collaborative scientific environment. To make this accessible to their scientists, CTrees’ engineering team developed a simple Python package (ctreeskit) that abstracted away the complexity.
This sentiment was echoed by Lindsay Nield and Deepak Cherian (Earthmover) in Zarr, Icechunk & Xarray for Cloud-Native Geospatial Data-cube Analysis, who argued that the “versioned global data cube,” built on Zarr and their Icechunk technology, is a superior abstraction for team-based science, moving beyond the chaos of file-based workflows.
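To make the Icechunk model concrete, here is a minimal sketch of a transactional, versioned Zarr write; the repository path, array names, and sizes are invented, and the API may differ slightly across Icechunk releases:

```python
# Minimal sketch of a transactional Zarr write with Icechunk.
# Assumes icechunk >= 1.0 with zarr-python 3; paths and names are invented.
import icechunk
import zarr

# An Icechunk repository lives on object storage (or a local path for testing)
# and behaves like a Git repo: commits, branches, and tags over Zarr chunks.
storage = icechunk.local_filesystem_storage("/tmp/carbon-repo")
repo = icechunk.Repository.create(storage)

# Open a writable session on the "main" branch and write through zarr as usual.
session = repo.writable_session("main")
root = zarr.open_group(session.store, mode="a")
agb = root.create_array(
    "above_ground_biomass",
    shape=(365, 1024, 1024),
    chunks=(1, 256, 256),
    dtype="f4",
)
agb[0] = 0.0  # write one daily slice

# Readers never see partial writes; the change lands atomically on commit.
snapshot_id = session.commit("Add day 0 of above-ground biomass")
print("committed snapshot:", snapshot_id)
```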
Yet Sarah Zwiep (Floodbase) presented a counter-argument in her talk Why not Zarr (yet)?, explaining that for their production flood monitoring systems, the maturity, ubiquity, and simple tooling of the COG ecosystem provide critical reliability that outweighs Zarr’s current advantages.
Further extending the capabilities of Zarr, Tom Nicholas (Earthmover) presented a novel approach in his talk on VirtualiZarr + Icechunk: Build Cloud-Optimised DataCube of Archival Files. He demonstrated how archival data in older formats like NetCDF or HDF5 could be virtually mapped into a cloud-optimized Zarr data cube without rewriting the underlying data, offering a powerful migration path for legacy datasets.
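A rough sketch of the pattern Nicholas described, assuming the VirtualiZarr library (function names and file paths are illustrative and vary by release):

```python
# Sketch: build a virtual Zarr datacube over existing NetCDF files without
# rewriting any data. Assumes virtualizarr + xarray; paths are illustrative.
import xarray as xr
from virtualizarr import open_virtual_dataset

# Each "virtual dataset" holds byte-range references into the original file
# rather than copies of the chunks themselves.
files = ["s3://archive/sst_2020.nc", "s3://archive/sst_2021.nc"]
virtual = [open_virtual_dataset(f) for f in files]

# Concatenate the references along time to form one logical datacube.
combined = xr.concat(virtual, dim="time", coords="minimal", compat="override")

# Persist the reference set (here as Kerchunk JSON; an Icechunk store is the
# transactional alternative) so users can open the archive as one Zarr store.
combined.virtualize.to_kerchunk("sst_refs.json", format="json")
```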
The practical challenges of using these formats at scale were a recurring theme. In CNG in Earth Engine COGs and Beyond, Sai Cheemalapati (Google) shared lessons from Google Earth Engine’s extensive experience, noting that even COGs can be tricky to parse efficiently, forcing them to rely on aggressive caching. He also warned of Zarr v2’s tendency to create a problematic “small file” explosion in object storage.
The question of scale was also tackled by Jeff Albrecht (Regrow) in his talk, Is COG Scalable? His nuanced answer: the format itself holds up, but much of the software ecosystem fails to unlock its potential. Achieving true scalability, he argued, requires a shift toward asynchronous, parallelized tooling. A glimpse of that future came from Julia Wagemann (thriveGEO), who, in The Sentinel’s EOPF Toolkit, detailed an ambitious ESA-led project to transition the entire Copernicus Sentinel archive to Zarr, a massive undertaking that includes building the open-source community resources needed to support such a monumental format shift.
Industry context
The release of zarr-python 3.0 in January 2025 provided full support for the Zarr v3 specification. A key feature of this release is the sharding codec, which addresses one of Zarr’s primary operational drawbacks: the potential for a massive number of small files when using small chunk sizes. Sharding allows multiple logical chunks to be stored within a single physical object, decoupling the analytical chunking strategy from the physical storage layout and making Zarr more efficient on object storage. Also, the xarray library, a cornerstone of scientific data analysis in Python, released updates in August 2025 that improved support for Zarr v3 and introduced asynchronous data loading capabilities, making it more robust for production workflows.
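A short sketch against the zarr-python 3 API shows the idea; the array name, shape, and sizes are made up:

```python
# Sketch: decoupling analytical chunk size from physical object size with
# Zarr v3 sharding. Assumes zarr-python >= 3.0; sizes are made up.
import zarr

store = zarr.storage.LocalStore("example.zarr")

# 256x256 chunks suit partial reads, but one object per chunk would mean
# roughly 150,000 objects for this array. Grouping a 4x4 block of chunks
# into each shard cuts the object count 16-fold without changing how the
# data is read analytically.
arr = zarr.create_array(
    store,
    shape=(100_000, 100_000),
    chunks=(256, 256),     # logical unit of compression and partial reads
    shards=(1024, 1024),   # physical object holds a 4x4 grid of chunks
    dtype="uint16",
)
```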
Columnar Dominance: GeoParquet, COPC, and Specialized Formats
While the raster debate still rages, the vector world has largely coalesced around GeoParquet as the de facto standard for large-scale analytics.
In Building a Geospatial Platform with CNG Technologies, Charlie Savage (Orbital Insight) explained how Privateer’s Terascope platform migrated its massive ship-tracking datasets to GeoParquet with GeoArrow encoding. This shift, combined with an OGC Features API that serves GeoParquet directly, enabled them to move from slow, server-side visualizations to highly interactive, client-side analytics by streaming large volumes of data to be rendered in the browser with deck.gl. Their new architecture uses STAC for unified discovery of both imagery and geolocation data, COGs for imagery, and the OGC Tiles API to serve various tile sets.
The potential of this client-side ecosystem was also demonstrated by Florent Gravin (camptocamp) in Leverage Cloud-Native Vector Data in your Webmapping Application, SIMPLY, where he queried Overture Maps’ GeoParquet data directly in a web browser using DuckDB. Yet, he also revealed the current frontier: the persistent performance bottlenecks that arise when running live, large-scale queries in a client-side environment—a clear area of intense ongoing development.
In a talk focused on the foundational layer of data access, Kyle Barron (Development Seed) introduced Fast Cloud Storage Operations with Obstore, a high-performance Python library designed to offer a simpler and faster alternative for accessing cloud object storage. He contrasted Obstore with the widely used fsspec library, highlighting several key architectural differences. While fsspec emulates a stateful file system, Obstore provides a stateless, HTTP-like API, resulting in a smaller, more predictable API surface with zero required Python dependencies, achieved by building on a shared Rust core. Barron emphasized Obstore’s strengths in native asynchronous streaming and concurrency, which he demonstrated led to significant performance gains, such as a three-fold speed increase when loading large Zarr datasets. He noted the library’s recent adoption as an official backend in Zarr-Python as a key milestone.
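A brief sketch of the stateless, HTTP-like style Barron described, assuming the obstore package; the bucket, region, and keys are hypothetical:

```python
# Sketch: stateless object access with obstore. Bucket, region, and keys
# are hypothetical.
import obstore
from obstore.store import S3Store

store = S3Store("example-landsat-bucket", region="us-west-2", skip_signature=True)

# One request for one byte range: no filesystem emulation, no open/seek state.
header_bytes = obstore.get_range(store, "scenes/B04.tif", start=0, end=16_384)

# Listing yields batches of plain metadata records, mirroring HTTP semantics.
for batch in obstore.list(store, prefix="scenes/"):
    for meta in batch:
        print(meta["path"], meta["size"])
```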
Raphael Hagen and Shane Loeffler (CarbonPlan) continued this exploration in their talk Visualizing GeoParquet in the Browser by directly comparing multiple strategies for rendering large GeoParquet datasets. They benchmarked the trade-offs of different partitioning schemes, from a single large file to spatially partitioned files using quadkeys, providing practical guidance on how to balance initial metadata load times against the efficiency of subsequent spatial queries in a client-side environment.
For other data types, specialized formats continue to fill important niches. In his presentation Data Gravity and Point Clouds, Howard Butler (Hobu), a key figure in the point cloud community, introduced the Cloud Optimized Point Cloud (COPC) format. He explained that while various cloud-native point cloud formats exist, the vast archives of existing data are in the widely-used LAZ format. COPC embraces this reality with a design philosophy he described as “COG for point clouds”: it adds a lightweight EPT spatial index on top of the existing LAZ format rather than requiring a full data rewrite. This pragmatic approach allows for efficient partial reads and spatial windowing of massive point cloud datasets directly over HTTP, making huge legacy archives immediately more accessible to cloud-native workflows.
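As a hedged illustration of what this enables, a PDAL pipeline can stream a single spatial window out of a remote COPC file over HTTP; the URL and bounds below are invented:

```python
# Sketch: windowed read from a remote COPC file via PDAL's readers.copc.
# Assumes the pdal Python bindings; the URL and bounds are invented.
import json
import pdal

pipeline_def = [
    {
        "type": "readers.copc",
        "filename": "https://example.com/lidar/city.copc.laz",
        # Only octree nodes overlapping these bounds are fetched over HTTP.
        "bounds": "([635000, 636000], [848000, 849000])",
    }
]

pipeline = pdal.Pipeline(json.dumps(pipeline_def))
count = pipeline.execute()   # issues ranged reads, not a full-file download
points = pipeline.arrays[0]  # numpy structured array of points
print(count, points.dtype.names)
```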
At the far end of the specialization spectrum, Brandon Liu’s (Protomaps) talk on Minimum Viable Cloud-Native Geo described a format hyper-optimized for a single, critical use case: serverless web map visualization. PMTiles packages an entire pyramid of vector or raster tiles into a single file. This design dramatically reduces request overhead by allowing a client to fetch multiple tiles with a single HTTP range request, making it ideal for low-cost, high-performance web mapping applications that do not require a dynamic tile server.
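The mechanism is plain HTTP. A hedged sketch (the URL and byte offsets are invented; a real client derives offsets from the PMTiles header and directories) shows the pattern:

```python
# Sketch: the HTTP range-request pattern behind PMTiles. URL and offsets are
# invented; a real client computes them from the header and tile directory
# stored at the front of the archive.
import requests

url = "https://example.com/tiles/world.pmtiles"

# 1. Fetch the fixed-size header plus root directory from the file's start.
head = requests.get(url, headers={"Range": "bytes=0-16383"}).content

# 2. The directory maps tile IDs to byte ranges; tiles that are adjacent in
#    the file can be retrieved together in a single ranged GET.
tiles = requests.get(url, headers={"Range": "bytes=120000-180000"}).content
print(len(head), len(tiles))
```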
The evolution of these formats reveals a bifurcation in the vector data ecosystem, driven by the distinct requirements of analysis versus visualization.
Analytical Formats (GeoParquet): These are optimized for server-side, large-scale computation. The columnar structure, high compression, and rich attribute support are paramount for complex filtering and aggregation, as seen in Privateer’s use case.
Visualization Formats (PMTiles, Vector Tiles): These are optimized for client-side rendering, where low latency and small payloads are the primary concerns. Geometries are simplified, clipped to tile boundaries, and attributes are often reduced to only what is necessary for styling.
This split is a necessary consequence of network latency and client-side processing limitations. The key industry challenge is not to find a single vector format to replace all others, but to build efficient and automated pipelines that can transform data from its analytical source of truth (GeoParquet) into lightweight visualization formats (PMTiles) on demand.
Overture launched its Global Entity Reference System (GERS) in June 2025. As explained by Drew Breunig (Loqai and Overture Maps) in his talk Making Place as Easy as Time, GERS provides stable, unique UUIDs for geospatial features, creating a foundational “standard for places” analogous to how standardized time enabled railways, allowing for the linking of disparate datasets.
Addressing these performance challenges at a fundamental level, Lukas Bindreiter (Tilebox) dove into the complexities of data indexing in his talk, Beyond Points: Efficient Spatio-Temp Polygon Indexing for Scalable Catalogs. He explored the advanced indexing strategies needed for performant queries on massive, complex vector datasets, using Google’s S2 geometry library, which projects the Earth onto a cube and then recursively subdivides each face into a hierarchy of cells. Bindreiter demonstrated the system by querying the entire Copernicus Sentinel-2 metadata catalog, containing over 100 million granules: a complex query for all scenes intersecting the U.S. coastline for the year-to-date returned over 1,600 results in under one second.
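To make the S2 approach concrete, a small sketch with the s2sphere Python port covers a bounding box with hierarchical cell IDs that can be stored as an index column; the coordinates and coverer settings are arbitrary:

```python
# Sketch: covering a region with S2 cells via the s2sphere library. Cell IDs
# are 64-bit integers, so spatial lookups become integer-range scans.
import s2sphere

coverer = s2sphere.RegionCoverer()
coverer.min_level = 8
coverer.max_level = 14
coverer.max_cells = 32

# Rough box over part of the U.S. West Coast (arbitrary numbers).
rect = s2sphere.LatLngRect.from_point_pair(
    s2sphere.LatLng.from_degrees(32.0, -125.0),
    s2sphere.LatLng.from_degrees(42.0, -117.0),
)

cell_ids = coverer.get_covering(rect)
# Storing coverings like this per footprint turns "which polygons intersect
# my region?" into cheap ID-range intersection before any exact geometry test.
print([c.to_token() for c in cell_ids][:5])
```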
Shifting from the CPU to the GPU, Tom Augspurger (NVIDIA) discussed the next frontier of performance with his presentation on GPU Accelerated Cloud-Native Geospatial, demonstrating how leveraging GPUs can dramatically accelerate analytical workflows on cloud-native formats, unlocking new possibilities for interactive, large-scale analysis. The key approaches are: 1) focus on compute-bound problems first, 2) use pipelining to keep all your resources busy, and 3) get the data to the GPU and leave it there for as long as possible, performing as many compute operations as you can for each byte that is transferred. For those getting started, Augspurger recommended beginning with high-level, “drop-in” acceleration libraries like RAPIDS, CuPy, PyTorch, or JAX, which abstract away much of this complexity. As performance needs increase, developers can then progressively move down the stack, even to the level of writing custom CUDA kernels, to extract every last bit of performance.
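The “drop-in” starting point can be as small as swapping an import. A hedged NDVI example with CuPy (the arrays are random stand-ins for bands read from a COG or Zarr store):

```python
# Sketch: "drop-in" GPU acceleration with CuPy, a near drop-in replacement
# for numpy. Arrays are random stand-ins for red and near-infrared bands.
import cupy as cp

red = cp.random.random((10_000, 10_000)).astype(cp.float32)
nir = cp.random.random((10_000, 10_000)).astype(cp.float32)

# Keep the data on the GPU across as many operations as possible
# (Augspurger's third rule) before moving anything back to the host.
ndvi = (nir - red) / (nir + red)
mean_ndvi = float(ndvi.mean())  # only a single scalar crosses back
print(mean_ndvi)
```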
Alfonso Ladino (UIUC) discussed Efficient Radar Data Management in Practice through the lens of FAIR (Findable, Accessible, Interoperable, and Reusable) data principles, arguing that adherence to such principles is a prerequisite for building the trust and resilience required for a truly global ecosystem.
The U.S. NEXRAD archive, one of the world’s largest and longest-running weather radar networks, contains nearly a petabyte of data on AWS, comprising over 360 million individual files. While technically “open,” the data’s raw binary format and operational complexity have made it practically unusable for climatological studies and machine learning without cumbersome, large-scale data wrangling.
The Core Challenge: Variable Scanning Strategies. The primary obstacle to creating a unified dataset is the radar’s use of different Volume Coverage Patterns (VCPs). A radar will use a dense, multi-angle scanning strategy during a severe storm but a different, more sensitive, and sparser strategy in clear-air conditions to detect things like atmospheric boundaries or even bird migrations. This variability has historically prevented the creation of a single, consistent processing pipeline.
The Solution: The Radar Data Tree Model. The proposed solution organizes the raw data into a hierarchical structure that preserves the integrity of each VCP while unifying the entire archive.
Hierarchical Structure: The model, built on Xarray’s new DataTree data structure, treats each VCP as a distinct “node” in a larger tree. All radar scans belonging to a specific VCP are concatenated along that node, creating a clean, continuous time series for each operational mode.
Standardization: This model is based on the World Meteorological Organization’s (WMO) modern radar data standard (FM301), ensuring the final data is compliant with the Climate Forecast (CF) conventions and rich with metadata.
Cloud-Native Output: Once structured in this tree, the entire dataset can be saved as a single, analysis-ready Zarr store. The Zarr hierarchy directly mirrors the data tree, with each VCP as a separate group, allowing for efficient, parallel-friendly access and subsetting, as sketched below.
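A compact sketch of the pattern, assuming xarray’s DataTree; the VCP names and variables are made up:

```python
# Sketch: one DataTree node per Volume Coverage Pattern (VCP), written out
# as a single Zarr hierarchy. Datasets and names are made up.
import numpy as np
import xarray as xr

def fake_vcp_dataset(n_times: int) -> xr.Dataset:
    # Stand-in for radar sweeps concatenated along time within one VCP.
    return xr.Dataset(
        {"reflectivity": (("time", "range"), np.zeros((n_times, 100)))},
        coords={"time": np.arange(n_times)},
    )

# Each VCP keeps its own scanning geometry intact in its own node, while
# the archive as a whole becomes a single, coherent store.
tree = xr.DataTree.from_dict({
    "/VCP-212": fake_vcp_dataset(500),  # dense severe-weather scanning mode
    "/VCP-31": fake_vcp_dataset(120),   # sparser clear-air scanning mode
})

# The Zarr hierarchy mirrors the tree: one group per VCP.
tree.to_zarr("nexrad_example.zarr", mode="w")
```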
A 210-Fold Performance Increase. To demonstrate the model’s power, Ladino reproduced a scientific analysis of a storm’s vertical structure—a task that required processing two hours of radar data. The Traditional Workflow (downloading individual files, decompressing, and processing) took nearly 6 minutes. The Radar Data Tree Workflow (querying the Zarr store directly in the cloud) took only 1.5 seconds.
The Central Role of STAC for Data Discovery
If formats like GeoParquet and COG are the books in the cloud’s library, the SpatioTemporal Asset Catalog (STAC) is the card catalog that makes them findable. Matthew Hanson (Element 84) laid the groundwork with a concise Intro to STAC, framing it as the lingua franca for indexing geospatial assets. The sessions that followed demonstrated STAC’s power as a unifying glue.
In STACing Geoparquet, Ben Clark (Meta) showed how STAC can index and expose vast collections of analysis-ready vector data in Overture Maps, consisting of hundreds of GeoParquet files that make up the half-terabyte monthly data drop. Here’s the structure they ended up with:
Asset: A single GeoParquet file.
Item: A metadata record for a single GeoParquet file. The crucial piece of information—the file’s geographic bounding box—is defined at the item level. This allows users to discard 95% of the files immediately through a simple spatial query.
Collection: A grouping of items that share a similar data schema. For Overture, this maps to their data “types” (e.g., “buildings,” “places,” “addresses”). This design choice prevents users from having to perform complex filtering across dissimilar data types.
Sub-Catalog: A grouping of related collections, corresponding to Overture’s data “themes” (e.g., transportation, base).
Clark discussed the trade-offs between the three primary ways to publish a STAC catalog, ultimately highlighting why the emerging STAC GeoParquet format is the ideal fit for Overture’s cloud-native philosophy:
Static JSON Catalog: The simplest to produce, but impractical at Overture’s scale. It would result in gigabytes of JSON that users would have to download and parse in its entirety to perform any queries.
STAC API: The most user-friendly option for consumption, providing a powerful, queryable endpoint. However, this requires the data publisher to stand up, maintain, and pay for a database and an active API service, which runs counter to Overture’s preference for serverless, blob-storage-based data delivery.
STAC GeoParquet: This approach involves creating the STAC index itself as a GeoParquet file. It is highly compressed, self-describing, and—most importantly—can be efficiently queried remotely using tools like DuckDB. This format provides the “best of both worlds”: it offers the query capabilities of an API without the operational overhead, aligning perfectly with a cloud-native, “put it on blob storage and walk away” strategy, as the sketch below shows.
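A sketch of what “query without an API” looks like in practice, assuming a hypothetical stac-geoparquet file laid out with the usual bbox struct column:

```python
# Sketch: spatial filtering of a STAC GeoParquet index with DuckDB and no
# API server. The URL and the bbox struct layout are assumptions.
import duckdb

con = duckdb.connect()
con.install_extension("httpfs")
con.load_extension("httpfs")

# Only the referenced columns and matching row groups travel over HTTP.
results = con.sql("""
    SELECT id, collection, datetime
    FROM read_parquet('https://example.com/overture/stac-items.parquet')
    WHERE bbox.xmin < -73.9 AND bbox.xmax > -74.1
      AND bbox.ymin < 40.8 AND bbox.ymax > 40.6
""").fetchall()
print(len(results), "items intersect the query box")
```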
Julia Signell (Element 84) discussed the integration of STAC and Zarr in the raster world, offering several primary models based on the “shape” of the data:
The “One Big Zarr” Model for Aligned Data Cubes
This approach is designed for large, aligned datasets typical of climate models or Level 3/4 processed data, where the data exists as a single, massive Zarr store.
The recommended structure is a single, standalone STAC Collection. This collection contains an asset that points directly to the root of the Zarr store. Crucially, this model does not use STAC Items.
The focus is on enabling discovery. The most useful and lowest-effort step is to populate the STAC Collection with an asset link. To enhance discoverability, producers can use the STAC Data Cube extension to summarize the variables (e.g., “sea surface temperature”) and their dimensions at the collection level. This allows users to search across different collections to find datasets relevant to their scientific needs without needing to open the Zarr store itself.
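A hedged pystac sketch of this “one big Zarr” pattern; the IDs, the Zarr URL, and the extension version string are placeholders:

```python
# Sketch: a standalone STAC Collection whose single asset points at the root
# of a Zarr store, summarized with the Data Cube extension. IDs, URLs, and
# the extension version string are placeholders.
from datetime import datetime, timezone

import pystac

collection = pystac.Collection(
    id="global-sst",
    description="Gridded sea surface temperature as one aligned data cube.",
    extent=pystac.Extent(
        spatial=pystac.SpatialExtent([[-180.0, -90.0, 180.0, 90.0]]),
        temporal=pystac.TemporalExtent(
            [[datetime(2000, 1, 1, tzinfo=timezone.utc), None]]),
    ),
)

# No Items: the collection-level asset is the whole cube.
collection.add_asset(
    "data",
    pystac.Asset(
        href="s3://example-bucket/sst.zarr",
        media_type="application/vnd+zarr",
        roles=["data"],
    ),
)

# Data Cube extension fields make the cube searchable without opening it.
collection.stac_extensions.append(
    "https://stac-extensions.github.io/datacube/v2.2.0/schema.json")
collection.extra_fields["cube:variables"] = {
    "sst": {"type": "data", "dimensions": ["time", "lat", "lon"]},
}
```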
The “Many Small Zarrs” Model for Unaligned Scenes
This model is suited for collections of unaligned data, such as individual satellite scenes (Level 1/2 data), where each scene is stored in its own Zarr store. This pattern is expected to become more common, as exemplified by the European Space Agency’s (ESA) plan to distribute Sentinel-2 data as Zarr.
This follows the traditional STAC paradigm. Each individual Zarr store is treated as an asset, and a corresponding STAC Item is created to describe it, complete with its specific spatial and temporal extent.
To enable powerful searching, it is highly recommended to create collection-level summaries of the key metadata found in the items. This allows users to perform a “Collection Search” to find all scenes that contain specific variables or have certain properties, without having to iterate through every single item.
The “Wildcard”: Cataloging Virtual Zarr with STAC
A third use case involves using STAC to catalog virtual Zarr datasets created with technologies like Kerchunk. This allows legacy data stored in formats like NetCDF or HDF5 to be accessed as if it were a cloud-optimized Zarr store, without duplicating the data. In this scenario, individual STAC Items would point to the original NetCDF files, while a single asset at the Collection level would point to the Kerchunk reference file. This file contains the consolidated metadata needed to construct a “virtual” data cube, enabling highly efficient, cloud-native access to the entire collection.
But as Pete Gadomski (Element 84) warned in his talk on Right-Sizing STAC, implementation matters as much as the specification. He compared the performance and cost of a traditional PGStac (PostgreSQL-backed) API against a new STAC FastAPI implementation that queries STAC metadata stored in a single GeoParquet file.
The biggest advantage emerged from GeoParquet’s columnar nature. When querying for only a subset of fields (e.g., just the geometry and ID for a web map visualization), data transfer is dramatically reduced, leading to significantly faster responses. Direct access to the GeoParquet file, bypassing the API layer entirely, proved extremely powerful for bulk data access patterns.
However, the GeoParquet approach is not ideal for finding a single, specific item within a very large collection. A relational database with proper indexing is far superior for this task, as the GeoParquet implementation timed out when searching for a single item in the 2.2 million-item Sentinel-2 catalog. While the system can be scaled, the naive implementation presented is not designed to replace databases for catalogs with billions of items. Rewriting a GeoParquet file for a small collection of 20,000 items is trivial, but this approach is less suited for workflows with constant, high-frequency data appends.
The Paradigm Shift: From Data Lake to Geospatial Lakehouse
A structurally significant theme of the conference was the codification of the “geospatial data lakehouse.” This architecture represents a paradigm shift away from managing collections of files in a data lake towards a more structured, transactional, and queryable system that combines the scalability of a data lake with the features of a data warehouse.
Jia Yu (Wherobots) delivered a foundational talk on this topic: Building Scalable Geospatial Lakehouses with Apache Sedona and Iceberg, defining the architecture and its benefits. He explained how open table formats, principally Apache Iceberg, provide a metadata layer over data stored in open formats like GeoParquet in object storage. This enables ACID transactions, schema evolution, and time-travel queries—features traditionally associated with relational databases—directly on cloud data. This model brings compute to the data, avoids vendor lock-in by decoupling storage from specific query engines, and allows multiple tools like Spark, DuckDB, or Trino to safely operate on the same tables concurrently. The recent addition of native GEOMETRY and GEOGRAPHY types to the Iceberg specification, a major community effort in late 2024 and early 2025, was the final piece needed to make this architecture a first-class citizen for geospatial workloads.
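From the query engine’s side, the experience is plain SQL. The sketch below assumes a Spark session with Sedona registered and an Iceberg catalog named demo already configured; table names are invented, and GEOMETRY-column support depends on the Iceberg V3 rollout in each engine:

```python
# Sketch: lakehouse-style spatial SQL on an Iceberg table. Catalog/session
# configuration is omitted; names are invented, and native GEOMETRY support
# depends on the Iceberg V3 rollout in each engine.
from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()  # Iceberg catalog configured elsewhere
sedona = SedonaContext.create(config)

# Storage is open Parquet files on object storage; Iceberg supplies
# transactions, schema evolution, and snapshots on top.
sedona.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.buildings (
        id STRING,
        height DOUBLE,
        geom GEOMETRY
    ) USING iceberg
""")

# A spatial predicate evaluated through the table format.
sedona.sql("""
    SELECT id, height
    FROM demo.db.buildings
    WHERE ST_Intersects(geom,
          ST_GeomFromText('POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))'))
""").show()

# Time travel: the table exactly as it existed at a past instant.
sedona.sql("""
    SELECT count(*) FROM demo.db.buildings
    TIMESTAMP AS OF '2025-06-01 00:00:00'
""").show()
```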
This trend was reinforced by Noah Slocum’s (ESRI) presentation Spatial Analysis at Scale with ArcGIS GeoAnalytics Engine and Apache Spark, discussing how integrating ArcGIS and Apache Spark provides scalable spatial analysis.
In his lightning talk, Warehouse Your Pixels, Michal Migurski (CARTO) introduced Raquet, a new format for storing raster pixels directly within data warehouses, making them native citizens of the analytical ecosystem. He emphasized that for data warehouses, having data “in-DB” is a critical feature for performance and usability. He also announced a collaboration with Wherobots to align Raquet with their Havasu format, signaling a move toward a community-standardized way of bridging the raster and lakehouse worlds.
On the services side, José Eduardo Macchi, in his talk Unleashing Massive Cloud Power, showed how traditional OGC-compliant services can be built on this new lakehouse foundation. He argued that while direct data access is powerful, many business and government use cases still rely on standard OGC services. He then demonstrated a fully cloud-native, auto-scaling deployment of GeoServer Cloud on Kubernetes. This setup was able to serve Overture’s GeoParquet data directly from S3 as standard WMS and WFS services, proving that organizations can modernize their backend data architecture to a scalable, cloud-native model without disrupting the established, standards-compliant services their users depend on.
Industry context
The importance of this shift cannot be overstated, as it directly solves one of the most persistent and costly problems in collaborative data science: data versioning and provenance. The lakehouse architecture addresses this at a fundamental level. In an Iceberg table, every transaction creates a new, immutable snapshot of the table’s state, identified by a unique ID and timestamp. “The latest version” is no longer an ambiguous filename but an atomic, queryable state. This enables perfectly reproducible analyses, as a user can query a table AS OF TIMESTAMP ‘...’ and be guaranteed to get the exact same data every time. Versioning is elevated from a messy, ad-hoc convention to a core, transactional feature of the data platform itself.
The momentum behind the geospatial lakehouse has accelerated dramatically since the conference. In a clear signal of industry-wide consensus, both Snowflake and Databricks announced official support for geospatial types on Apache Iceberg at their respective user conferences in June 2025. Google followed with its own announcement of active work toward support, and an August 2025 Google blog post detailing the Iceberg V3 specification explicitly highlighted the new geospatial types as a key feature. The query engine ecosystem is also keeping pace. Apache Sedona, a key Spark extension for geospatial analysis, released version 1.8.0 in September 2025 with support for Spark 4.0. It also introduced SedonaDB, a new single-node analytical engine written in Rust, which aims to lower the barrier to entry for working with these modern data architectures.
The Intelligence Layer: AI’s Remaking of Geospatial Interaction
The New Interface: Agentic AI and Natural Language
Perhaps the most profound shift on display was the rise of AI not as an analysis tool, but as a new conversational partner for geospatial inquiry. In a striking talk AI Agent & MCP Servers for Jupyter, Eric Charles (Datalayer) gave an AI agent a single, high-level prompt about sea-level rise; in response, the agent autonomously wrote and executed a complete Jupyter Notebook to perform the analysis. This “vibe engineering,” as Charles called it, inverts the traditional workflow by shifting the burden of technical complexity from the scientist to the machine. The magic lies in frameworks like the Model Context Protocol (MCP), a standard that acts as a universal adapter, allowing an LLM to plug into a suite of external “tools”—from the Jupyter kernel to the NASA Earthdata API—and manipulate the digital world as easily as it manipulates words. Simon Ilyushchenko (Google) presented a similar vision in Earth Agents, demonstrating his “Functionsmith” and “EE Companion” agents for Earth Engine, which can dynamically create their own tools (Python functions) to solve problems incrementally, mimicking how a human would explore a dataset.
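For readers new to MCP, a minimal sketch of a tool server built with the official Python SDK’s FastMCP helper; the tool itself is a made-up stub rather than a real Earthdata integration:

```python
# Sketch: a minimal MCP tool server using the Python SDK's FastMCP helper.
# The tool body is a made-up stub; a real one might query a STAC API.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("geo-tools")

@mcp.tool()
def search_scenes(bbox: list[float], start: str, end: str) -> list[str]:
    """Return scene IDs intersecting bbox between two ISO dates."""
    # Hypothetical stub: an LLM agent discovers and calls this tool by its
    # signature and docstring, the same way it calls any other MCP tool.
    return [f"scene-{start}-demo"]

if __name__ == "__main__":
    mcp.run()  # serves the Model Context Protocol over stdio by default
```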
Pushing this concept further, Antoine Dolant’s (UIUC) research on an Agentic LLM for Adaptive Decision Discourse presented a simulated town hall where AI agents, each assigned a persona such as a mayor or an environmental scientist, collaborated to draft a flood response plan. While exploratory, the work points toward a future where AI assists not just in technical analysis, but in modeling complex, multi-stakeholder deliberations that have traditionally resisted formal modeling.
This rise of agentic AI creates a voracious appetite for high-quality metadata. An agent, to be effective, must know what tools and datasets are available and what they are good for. In her talk Questioning Candor, Erin Trochim (University of Alaska, Fairbanks) addressed this need directly, proposing an AI-driven framework to generate “Data Candor”—a richer layer of metadata that moves beyond dry technical specifications to include crucial, user-focused context on a dataset’s suitability, limitations, and intended use. The result is a virtuous cycle: more capable AI demands better metadata, which in turn makes the AI more powerful.
Beyond Keywords: Semantic Search with Geospatial Foundation Models
The next generation of geospatial search was compellingly demonstrated by Kwin Keuter and Brad Andrick (Earth Genome) in their presentation Scaling Earth Index: AI Meets Cloud-Native Geospatial. They detailed their shift from traditional supervised learning models to the use of Geospatial Foundation Models (GeoFMs). These large, pre-trained models generate vector embeddings for satellite image chips, capturing their semantic meaning in a high-dimensional space. This approach enables powerful “find more like this” similarity searches that transcend simple keyword filtering. A user can provide a few visual examples of a feature of interest—such as illegal gold mines or specific agricultural patterns—and the system can rapidly find similar features across vast geographic areas and time periods.
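The core of “find more like this” is simple once embeddings exist. A toy numpy sketch (random vectors stand in for GeoFM embeddings of image chips; production systems use an approximate-nearest-neighbor index):

```python
# Sketch: similarity search over GeoFM embeddings with cosine similarity.
# Random vectors stand in for embeddings of satellite image chips.
import numpy as np

rng = np.random.default_rng(0)
chips = rng.normal(size=(100_000, 768)).astype(np.float32)
chips /= np.linalg.norm(chips, axis=1, keepdims=True)

# Average a few user-selected examples (say, known mining sites) into a query.
query = chips[[10, 42, 99]].mean(axis=0)
query /= np.linalg.norm(query)

# On unit vectors, cosine similarity reduces to a dot product.
scores = chips @ query
top = np.argsort(scores)[::-1][:20]  # the 20 most similar chips
print(top[:5], scores[top[:5]])
```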
This was complemented by two practical demonstrations of Retrieval-Augmented Generation (RAG). In Using AWS Open Data with Amazon Bedrock, Chris Stoner (AWS) showed how to build a knowledge base from the tabular GHCN daily climate dataset, making decades of weather station records queryable with natural language. Similarly, in Scalable Web-Based Visualization & RAG-enhanced Insights for NASA’s Data, Aashish Panta (University of Utah) presented a system for making petabyte-scale NASA climate simulation data accessible. His approach combined a scalable web visualization front-end with a RAG-based backend, allowing scientists to ask complex questions in plain English without needing to understand the underlying data formats or file structures.
The Foundational Work: Preparing Data for the AI Revolution
An important reality check was provided by Hans Mohrmann (Brightband), whose talk A Cloud-Native Dataset of Atmospheric Observations for ML Applications focused on the extensive data engineering required to fuel modern AI models. He described the “Ninja AI” project, a large-scale effort to make historical atmospheric observation data suitable for training next-generation, observation-driven weather models. This involved a complex ETL (Extract, Transform, Load) pipeline to convert decades of data from archaic formats like BUFR—which he aptly called “the fax machine of data formats”—into cloud-native Parquet files. This work is essential for developing AI weather models that are not dependent on traditional reanalysis products like ERA5. Mohrmann’s presentation was a reminder that the success of the AI revolution depends entirely on the availability of large, clean, analysis-ready training datasets, and that much of the foundational work is in data engineering, not just model architecture.
Broadening this perspective, Alex Merose (Open Athena) argued in his talk Why Machine Learning People Should Think about Databases for a fundamental re-framing of modern AI systems. Using the recent explosion in AI weather model performance as a case study, he posited that these systems are not merely applications that use data, but are themselves a new type of “differentiable database.” He contended that the entire cloud-native geospatial ecosystem—from object storage like S3 to query engines like Dask and catalogs like STAC—can be seen as the disaggregated components of a modern database architecture. His central thesis was that AI/ML models represent a new, fundamental component in this architecture: a “semantic engine” that allows users to “query” the latent space of the data. Citing the fact that the revolution in weather AI was unlocked by the availability of cloud-optimized data rather than a new algorithm, Merose concluded that data has primacy over compute, and that we should understand and build these powerful new AI systems with the conceptual rigor of database systems.
Bridging Worlds: From Scientific Code to Societal Impact
A central challenge highlighted at the conference was the “last mile” problem: translating powerful cloud-native technologies into tangible value for end-users and society. This requires bridging the gaps between expert toolmakers and domain specialists, scientific research and operational products, and experienced developers and newcomers. This theme was powerfully articulated in Julia Lowndes’s (Openscapes) keynote, which used the “Crossing the Chasm” framework to describe the organizational and cultural strategies needed to move innovations from early adopters to the mainstream majority, emphasizing community, mentorship, and solving painful, practical problems.
The “Science-to-Product” Pipeline
Several presentations offered concrete models for navigating the difficult transition from scientific research to scalable, operational products.
Nathan Swain (Aquaveo) in Bridging Earth Science and Cloud-Native Geospatial Technology presented the Tethys Platform, an open-source framework designed to empower water resources scientists to build and deploy their own scientific web applications. Tethys provides a scaffold that handles the complexities of web development, allowing domain experts with Python scripting experience to turn their models and analyses into interactive tools without needing to become full-stack developers.
In his presentation, Build Solutions using Satellite Imagery Data at Scale, Damian Wylie (Wherobots) explained how the Wherobots platform aims to lower the barrier for building such solutions. He highlighted their raster inference product, which abstracts away the complexities of preparing imagery, managing models, and scaling compute. By providing a STAC-aware reader, an open model catalog (including models from TorchGeo), and a distributed inference engine, the platform enables developers familiar with SQL and Python to run large-scale computer vision tasks on satellite imagery without needing deep expertise in MLOps or inference infrastructure.
In Tooling Designed for the Complexities of Space Data, Lukas Bindreiter (Tilebox) introduced the Tilebox workflow orchestrator, a system designed to manage complex processing pipelines from end to end. He argued that from the moment a new image is captured, a series of granular tasks—from locating and reading data to regridding, analysis, and visualization—must be executed. The orchestrator allows developers to define these steps as a graph of dependent tasks. Its multi-environment, multi-language task runners can then execute these tasks in the most efficient location, moving the compute to the data, whether that data is on a traditional cloud, a specialized high-performance computing cluster, or even still on the satellite for in-orbit processing before downlink.
The challenge of building these robust pipelines was addressed from multiple architectural perspectives. Greg Van Gaans (South Australia Government) detailed the Deliverance of Enterprise Geo Services using a Cloud-Native Geo Approach, showcasing how to build and maintain scalable, OGC-compliant services on top of cloud-native backends for large organizational use cases. For more agile and event-driven workflows, Maxime Lenormand (TETIS) demonstrated how Serverless Python can be used to make the most out of cloud-native formats, creating efficient, cost-effective processing pipelines that scale on demand. Recognizing that many organizations rely on existing tools, Dean Hintz (FME) presented on FME Workflows to Support Loading & Automation Pipelines for Cloud-Native, illustrating how established ETL (Extract, Transform, Load) platforms can serve as a powerful bridge, enabling users to orchestrate complex data movement and transformation into and out of cloud-native formats without deep coding expertise.
Closing the “Nerdville-to-Fieldville” Gap
Arvind Mohan (Los Alamos National Laboratory) delivered a powerful lightning talk Bridging Nerdville and Fieldville on the critical disconnect between the world of academic research (“Nerdville”) and the world of operational users like first responders (“Fieldville”). He argued that in high-stakes, time-constrained environments, complex tools often fail because they increase, rather than decrease, the user’s cognitive load. A first responder in the “fog of war” needs concise, actionable intelligence, not a dashboard with dozens of layers and options. He called for the development of a “cognitive middleware” capable of synthesizing complex scientific and geospatial data into simple, direct answers.
This same real-world need drove the World Bank’s Space2Stats project, which Benjamin Stewart explained is building a system to translate complex geospatial datasets into the simple tabular statistics that economists and policymakers actually use. It aims to solve the problem of the instability of administrative boundaries, which change over time and vary in quality, making long-term, subnational analysis difficult. The project’s innovative solution is to use a global hexagonal grid as a stable, intermediate geography.
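A sketch of the stable-grid idea, assuming an H3-style hexagonal index (the h3 package’s v4 API); coordinates and values are arbitrary:

```python
# Sketch: aggregating observations to a stable hexagonal grid so statistics
# survive administrative-boundary changes. Assumes h3-py v4; the points and
# values are arbitrary.
from collections import defaultdict

import h3

observations = [  # (lat, lon, population)
    (6.52, 3.37, 1200),
    (6.53, 3.38, 800),
    (6.60, 3.30, 450),
]

cells = defaultdict(int)
for lat, lon, pop in observations:
    cells[h3.latlng_to_cell(lat, lon, 6)] += pop  # resolution-6 hexagons

# Hex IDs never change, so time series keyed on them stay comparable even
# when district boundaries are redrawn; results can be re-aggregated to any
# administrative geography afterwards.
for cell, total in cells.items():
    print(cell, total)
```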
This call to translate complex data into actionable insight was answered by numerous application-focused talks.
Jessie Pechmann’s (OSM) lightning talk on Documenting the Impact of War with HOTOSM provided a powerful example of using geospatial data for critical humanitarian response. The Humanitarian OpenStreetMap Team (HOT) organizes and supports volunteers—often including people from the affected communities themselves—to meticulously digitize building footprints from high-resolution aerial imagery. Pechmann noted that these manually validated OSM datasets have been found to outperform automated machine learning outputs, as humans are currently better at discerning the precise shapes of individual buildings.
In the environmental domain, Martha Morrissey (Pachama) discussed how cloud-native approaches can help Restore Confidence in Forest Carbon Markets by enabling more transparent and scalable monitoring: a single, cooperative repository of LIDAR data and an open-source benchmarking framework to evaluate different biomass models against the same ground-truth data using real project boundaries.
The technology’s impact on urban environments was highlighted by Guyu Ye (AWS) in Modeling Sustainable Urban Spaces on AWS. She presented a practical workshop workflow for analyzing urban heat islands in New York City by combining global Landsat surface temperature data with local datasets on population and geographic boundaries. The workflow demonstrated how to use AWS Step Functions to parallelize the processing of hundreds of satellite scenes, moving from a slow, iterative notebook analysis to a scalable, automated pipeline.
In Make your Drone Imagery Open and Cloud-Native, Jeffrey Gillan (University of Arizona) addressed the rapidly growing but highly fragmented world of drone data. He made a compelling case for the community to actively share drone imagery on open, cloud-native platforms like OpenForest Observatory and D2S, which serve data via STAC APIs using COGs and COPCs, thereby creating a richer, more accessible data ecosystem for training, validation, and large-scale analysis.
A critical component of closing this gap involves meeting users where they are. David Wright’s (ESRI) presentation, Cloud-Native Geospatial and ArcGIS, addressed this directly, detailing how the industry’s most widely used GIS platform is integrating with and leveraging cloud-native formats and workflows. This integration is essential for bringing the power of the cloud-native ecosystem to hundreds of thousands of existing GIS professionals, providing a crucial on-ramp for organizations deeply invested in established geospatial software.
The Educational Imperative: Lowering the Barrier to Entry
The rapid pace of innovation in the CNG ecosystem has created a significant educational challenge. This theme was central to the “On-ramp” track, which Julia Wagemann (thriveGEO) framed in her introduction, On-ramp to Cloud-Native Geospatial, as a dedicated space for professionals starting their journey. She outlined a program designed to build foundational knowledge on standards like STAC and formats like COG and Zarr before moving on to practical workflows and real-world case studies, providing a structured pathway for newcomers to join the community.
Tonian Robinson presented a new USGS-managed GitLab repository of Jupyter notebook tutorials in Accessing and Processing Landsat Data in the Cloud, explicitly designed as an on-ramp for users with little to no Python or cloud experience. The tutorials provide practical, step-by-step examples for common tasks like querying the Landsat STAC API, decoding pixel quality bands, and performing a basic time-series analysis. In a similar vein, Emma Marshall’s presentation, Cloud-Native Geospatial DataCube Workflows with Xarray and Zarr, served as a model educational resource itself. Drawing on principles of “coding to learn,” she demonstrated best practices for building data cubes from disparate data sources, emphasizing the importance of connecting the in-memory data model (the Xarray cube) with the physical meaning of the data and the underlying storage format (Zarr) to build a deeper, more intuitive understanding for new users.
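A small sketch of the kind of first query those tutorials teach, using pystac-client; the endpoint and collection ID reflect the public LandsatLook service but should be treated as assumptions, and the area and dates are arbitrary:

```python
# Sketch: a first STAC search of the kind the USGS tutorials walk through.
# The endpoint and collection ID are assumptions about the LandsatLook STAC
# service; bbox and dates are arbitrary.
from pystac_client import Client

catalog = Client.open("https://landsatlook.usgs.gov/stac-server")

search = catalog.search(
    collections=["landsat-c2l2-sr"],   # Collection 2 Level-2 surface reflectance
    bbox=[-111.0, 40.5, -110.5, 41.0],
    datetime="2024-06-01/2024-08-31",
    query={"eo:cloud_cover": {"lt": 20}},
)

items = search.item_collection()
print(f"{len(items)} scenes found")
for item in list(items)[:3]:
    print(item.id, item.assets["red"].href)  # COG asset URL; nothing downloaded yet
```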
Tyler Erickson’s (Vorgeo) lightning talk Building Better Tutorials for CNG Education voiced a common frustration with the state of many open-source tutorials, which are often outdated, difficult to run, or lacking in context, and he called for more structured, collaborative approaches to creating high-quality educational materials. Echoing this sentiment, Matt Fisher spoke on Exploring More Approachable Geospatial Data Workflows as a Community. He introduced the GeoJupyter community’s effort to create more intuitive and interactive experiences, motivated by a common user pain point he termed the “GIS Bounce”—the tedious cycle of writing code in a notebook, exporting a file, and switching to a desktop GIS to visually validate results. He demonstrated the community’s flagship project, Jupyter GIS, which embeds a QGIS-like interface directly within JupyterLab, aiming to combine the approachability of desktop GIS with the reproducibility of code and the collaborative nature of the Jupyter ecosystem.
The importance of community-driven resources was exemplified by Samapriya Roy’s (DRI & SpatialBytes) talk on Cloud-Native Data Commons, a five-year effort that has created a massive, bottom-up Google Earth Engine Community Catalog making thousands of datasets accessible and analysis-ready.
The evolution of tooling also plays a role, as Alex Leith (Auspatious) explained in his talk Open Data Cube is Dead. Long Live Open Data Cube, where the original Open Data Cube’s complexity is being superseded by the much simpler ODC-STAC library, lowering the barrier to entry for creating data cubes from STAC APIs.
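A hedged sketch of how small that on-ramp has become with ODC-STAC; the endpoint, collection, and band names carry over the assumptions from the Landsat example above:

```python
# Sketch: from a STAC search to a lazy, analysis-ready xarray datacube in one
# call with odc-stac. Endpoint, collection, and band names are assumptions.
import odc.stac
from pystac_client import Client

catalog = Client.open("https://landsatlook.usgs.gov/stac-server")
items = catalog.search(
    collections=["landsat-c2l2-sr"],
    bbox=[-111.0, 40.5, -110.5, 41.0],
    datetime="2024-06-01/2024-08-31",
).item_collection()

cube = odc.stac.load(
    items,
    bands=["red", "nir08"],
    crs="EPSG:32612",
    resolution=30,
    chunks={},  # Dask-backed: nothing is read until computation
)

ndvi = (cube.nir08 - cube.red) / (cube.nir08 + cube.red)
print(ndvi)
```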
Conclusion: The Dawn of the Global Spatial Data Ecosystem
Chris Holmes’s (Planet) keynote Towards a Global Spatial Data Ecosystem was central to this narrative, proposing a fundamental philosophical shift away from the top-down, government-centric model of “Spatial Data Infrastructure” (SDI) and toward a more organic, resilient, and collaborative “Global Spatial Data Ecosystem”. This ecosystem functions more like a decentralized open-source community, composed of a diverse set of actors and projects that both collaborate and compete, driving innovation from the bottom up.
Concrete examples of this new paradigm in action were presented at the conference. Frédéric Leclerc (Flanders Marine Institute) described the ambitious vision for the EU Public Infrastructure for the European Digital Twin Ocean (EDITO), a collaborative, continental-scale project that embodies the principles of an open, interconnected data ecosystem.
EDITO’s infrastructure is built on two main pillars:
The Data Lake: In its initial phase, the data lake is being populated with data from Copernicus and EMODnet. A significant technical challenge is harmonizing the diverse data formats and standards from these sources, which range from Darwin Core for biology to NetCDF for physics. The long-term vision is to onboard data from other institutes and projects, with a strong emphasis on quality control.
The Engine: This component encompasses not just the computing resources but also the advanced models needed to run simulations. A sister project, “EDITO Model Lab,” is dedicated to developing these models. The underlying infrastructure, managed by Mercator Ocean, is designed to be highly flexible, with the ability to connect to various European High-Performance Computing (HPC) resources, such as the LUMI supercomputer in Finland and the MareNostrum in Barcelona.
The EDITO platform provides accessible user interfaces, including on-demand virtual development environments like Jupyter Notebooks and RStudio Server. These tools are connected to the data lake via APIs, allowing scientists to work directly with the data in a collaborative, cloud-based environment without needing deep expertise in HPC systems.
Jed Sundwall’s welcome address framed the conference itself as a physical manifestation of this idea—a vendor-neutral space for this community to convene and build together. Brianna Pagan’s (Development Seed) introduction to the “Resilience” track, Building Resilient Data Ecosystems, reinforced this, urging the community to build systems and relationships that can withstand political and economic shifts, moving beyond dependence on single funding sources.
Providing a crucial external perspective, industry analyst Lynne Schneider (IDC) in Turning Geospatial Tech into Stakeholder Wins questioned whether the demand for geospatial insights can keep pace with the rapidly growing supply of data and technology. She stressed the importance of creating “stakeholder wins” by focusing on tangible business outcomes and telling compelling stories that translate technical capabilities into value for non-expert decision-makers. This serves as a vital counter-balance: the ecosystem cannot thrive on technical elegance alone; it must be relentlessly focused on solving real-world problems.
In his lightning talk Why Don’t We Have Good Markets for Data? Jed Sundwall further underscored the need for better economic models. He argued that functioning markets require clear “units of consumption” and sustainable business models, both of which are underdeveloped for data. He proposed that the community focus on creating better URLs as a standard unit of consumption and explore “cooperative data utility” models—where users collectively fund shared data infrastructure—as a sustainable business model to ensure the ecosystem’s long-term resilience.
The detailed track recaps delivered by Julia Wagemann, Aimee Barciauskas, and Brianna Pagan served to synthesize these diverse threads, reinforcing the conference’s core themes of architectural maturation, applied AI, and community-driven resilience.
Methodology
How was this summary created?
I used the YouTube API to list all the talks in the conference playlist, then to save the transcript of each video into a text file.
Then I loaded the transcripts into https://gemini.google.com and asked it to run Deep Research with this prompt: “Create an extended cohesive summary of these talks from a recent cloud native geospatial conference for a technical audience broadly familiar with the geospatial domain. do research to add brief context to explain new terms and concepts like icechunk, but don’t feel you need to explain all the fundamentals (assume everyone knows what a geotiff is). Group talks into sections as appropriate. mention any recent developments that happened after the conference in April 2025 (it’s now October 2025). Don’t be afraid to add your commentary to the talks if appropriate, just clearly separate 1) the content of the talks themselves, 2) the additional information gleaned from deep research, 3) your own opinions”
The result was okay, but had a lot of AI slop text. I spent about 4 hours on manual edits, rearranging and rewriting sections, sometimes asking LLMs to summarize certain talks in more detail.
