---
title: "Getting started examples"
output: rmarkdown::html_vignette
description: Reading/writing with sfarrow and how it works.
vignette: >
  %\VignetteIndexEntry{example_sfarrow}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

`sfarrow` is designed to help read/write spatial vector data in "simple feature" format from/to Parquet files while maintaining coordinate reference system information. Essentially, this tool connects `R` objects in [`sf`](https://r-spatial.github.io/sf/) and in [`arrow`](https://arrow.apache.org/docs/r/), and it relies on those packages for its internal work.

A key goal is to support interoperability of spatial data in Parquet files. R objects (including `sf`) can be written to files with `arrow`; however, those files do not necessarily maintain the spatial information, nor can they be read by Python. `sfarrow` implements a metadata format also used by Python `GeoPandas`, described here: [https://github.com/geopandas/geo-arrow-spec](https://github.com/geopandas/geo-arrow-spec). Note that these metadata are not yet stable, and `sfarrow` will warn you that they may change.

```{r setup}
# install from CRAN with install.packages('sfarrow')
# or install with devtools::install_github("wcjochem/sfarrow@main")

# load the library
library(sfarrow)
library(dplyr, warn.conflicts = FALSE)
```

## Reading and writing single files

A Parquet file (with a `.parquet` extension) can be read using `st_read_parquet()` by pointing it to a path on the file system. This creates an `sf` spatial data object in memory, which can then be used as normal with functions from `sf`.

```{r}
# read an example dataset created from Python using geopandas
world <- st_read_parquet(system.file("extdata", "world.parquet", package = "sfarrow"))

class(world)
world
plot(sf::st_geometry(world))
```

Similarly, a Parquet file can be written from an `sf` object using `st_write_parquet()` and specifying a path for the new file. Non-spatial objects cannot be written with `sfarrow`; users should instead use `arrow`.

```{r}
# output the file to a new location
# note the warning about possible future changes in metadata
st_write_parquet(world, dsn = file.path(tempdir(), "new_world.parquet"))
```

## Partitioned datasets

While reading/writing a single Parquet file is useful, the real power of `arrow` comes from splitting big datasets into multiple files, or partitions, based on criteria that make them faster to query. There is currently basic support in `sfarrow` for multi-file spatial datasets. For additional dataset querying options, see the `arrow` [documentation](https://arrow.apache.org/docs/r/articles/dataset.html).

### Querying and reading Datasets

`sfarrow` uses `arrow`'s `dplyr` interface to explore partitioned Arrow Datasets. For this example we will use a dataset which was created by randomly splitting the `nc.shp` file first into three groups and then further partitioning each into two more random groups. This creates a nested set of files.

```{r}
list.files(system.file("extdata", "ds", package = "sfarrow"), recursive = TRUE)
```

The file tree shows that the data were partitioned by the variables "split1" and "split2", the column names that were used for the random splits. This partitioning is in ["Hive style"](https://hive.apache.org/), where the partitioning variables and their values appear in the file paths.

The first step is to open the Dataset using `arrow`.
```{r}
ds <- arrow::open_dataset(system.file("extdata", "ds", package = "sfarrow"))
```

For small datasets (as in this example) we can read the entire set of files into an `sf` object.

```{r}
nc_ds <- read_sf_dataset(ds)

nc_ds
```

With large datasets, we will more often want to query them and return a reduced set of the partitioned records. The easiest way to create a query is to use `dplyr::filter()` on the partitioning (and/or other) variables to subset the rows and `dplyr::select()` to subset the columns. `read_sf_dataset()` will then take the `arrow_dplyr_query`, call `dplyr::collect()` to extract the records, and process the resulting Arrow Table into `sf`.

```{r, tidy=FALSE}
nc_d12 <- ds %>%
  filter(split1 == 1, split2 == 2) %>%
  read_sf_dataset()

nc_d12
plot(sf::st_geometry(nc_d12), col = "grey")
```

When using `select()` to read only a subset of columns, if the geometry column is not returned, the default behaviour of `sfarrow` is to throw an error from `read_sf_dataset()`. If you do not need the geometry column for your analyses, then using `arrow` rather than `sfarrow` should be sufficient. However, setting `find_geom = TRUE` in `read_sf_dataset()` will read any geometry columns listed in the metadata in addition to the selected columns.

```{r}
# this command will throw an error
# no geometry column selected for read_sf_dataset
# nc_sub <- ds %>%
#   select('FIPS') %>% # subset of columns
#   read_sf_dataset()

# set find_geom
nc_sub <- ds %>%
  select('FIPS') %>% # subset of columns
  read_sf_dataset(find_geom = TRUE)

nc_sub
```

### Writing to Datasets

To write an `sf` object into multiple files, we can again construct a query using `dplyr::group_by()` to define the partitioning variables. The result is then passed to `sfarrow`.

```{r, tidy=FALSE}
world %>%
  group_by(continent) %>%
  write_sf_dataset(file.path(tempdir(), "world_ds"),
                   format = "parquet",
                   hive_style = FALSE)
```

In this example we are not using Hive style, so the partitioning variable does not appear in the folder paths.

```{r}
list.files(file.path(tempdir(), "world_ds"))
```

To read this style of Dataset, we must specify the partitioning variables when it is opened.

```{r, tidy=FALSE}
arrow::open_dataset(file.path(tempdir(), "world_ds"),
                    partitioning = "continent") %>%
  filter(continent == "Africa") %>%
  read_sf_dataset()
```
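
Since preserving the coordinate reference system is a key goal of `sfarrow`, one quick way to convince yourself it survives the round trip is to read back the single file written earlier and compare its CRS with the original object. A minimal sketch, assuming `new_world.parquet` is still present in `tempdir()` from the example above:

```{r}
# read back the file written earlier in this vignette
new_world <- st_read_parquet(file.path(tempdir(), "new_world.parquet"))

# the CRS should match the original `world` object
sf::st_crs(new_world) == sf::st_crs(world)
```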