Querying data from large cloud databases with R and DuckDB
A few years ago I became fascinated with cloud-friendly data formats that allow us to query large online databases without having to download the entire thing. Here is a post from 2021 on accessing spatial data from online cloud-optimised geotiffs using R, terra and gdal. In this post I will show how to access, manage, and retrieve data from large tabular databases hosted in the cloud, without having to download them first. As you will see, thanks to recent efforts from many parties, the process is easy and very fast.
Example dataset
For this post I am going to use online data on New York City taxi trips from July 2025. The data come in ‘parquet’ format, which is great for tabular/rectangular data: chunked and column-oriented (so that searching is very efficient), and having much smaller file sizes than plain text formats (like CSV), among other advantages.
The dataset on yellow taxi trips from July 2025 contains nearly 4 million records (rows) and 20 variables (columns). This is the direct link to the parquet file:
url <- "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-07.parquet"
Accessing cloud databases with {duckplyr}
{duckplyr} is a newish R package that allows one to query and analyse large datasets pretty fast and easily following {dplyr} syntax (which is very convenient from users’ point of view) but using DuckDB internally.
For example, to get all the taxi trips with fewer than 3 passengers, it is as easy as:
library(duckplyr)
read_parquet_duckdb(url) |>
filter(passenger_count < 3) |>
collect()
## # A tibble: 2,656,706 × 20
## VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count
## * <int> <dttm> <dttm> <dbl>
## 1 1 2025-07-01 00:29:37 2025-07-01 00:45:30 1
## 2 1 2025-07-01 00:23:28 2025-07-01 01:07:44 1
## 3 2 2025-07-01 00:53:50 2025-07-01 01:27:12 1
## 4 2 2025-07-01 00:58:49 2025-07-01 01:15:55 1
## 5 2 2025-07-01 00:09:22 2025-07-01 00:23:54 1
## 6 1 2025-07-01 00:39:14 2025-07-01 00:55:21 1
## 7 2 2025-07-01 00:15:26 2025-07-01 00:29:39 1
## 8 2 2025-07-01 00:40:58 2025-07-01 00:44:15 1
## 9 2 2025-07-01 00:28:12 2025-07-01 00:39:49 2
## 10 2 2025-07-01 00:38:17 2025-07-01 00:55:44 1
## # ℹ 2,656,696 more rows
## # ℹ 16 more variables: trip_distance <dbl>, RatecodeID <dbl>,
## # store_and_fwd_flag <chr>, PULocationID <int>, DOLocationID <int>,
## # payment_type <dbl>, fare_amount <dbl>, extra <dbl>, mta_tax <dbl>,
## # tip_amount <dbl>, tolls_amount <dbl>, improvement_surcharge <dbl>,
## # total_amount <dbl>, congestion_surcharge <dbl>, Airport_fee <dbl>,
## # cbd_congestion_fee <dbl>
Literally three lines of code: one to read the online parquet file, one to filter cases, and the last one to actually download (collect) the selected records. The output is a tibble which we can of course keep working on.
And it’s not only easy but also extremely fast!
library(tictoc)
tic()
read_parquet_duckdb(url) |>
filter(passenger_count < 3) |>
collect()
## # A tibble: 2,656,706 × 20
## VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count
## * <int> <dttm> <dttm> <dbl>
## 1 1 2025-07-01 00:29:37 2025-07-01 00:45:30 1
## 2 1 2025-07-01 00:23:28 2025-07-01 01:07:44 1
## 3 2 2025-07-01 00:53:50 2025-07-01 01:27:12 1
## 4 2 2025-07-01 00:58:49 2025-07-01 01:15:55 1
## 5 2 2025-07-01 00:09:22 2025-07-01 00:23:54 1
## 6 1 2025-07-01 00:39:14 2025-07-01 00:55:21 1
## 7 2 2025-07-01 00:15:26 2025-07-01 00:29:39 1
## 8 2 2025-07-01 00:40:58 2025-07-01 00:44:15 1
## 9 2 2025-07-01 00:28:12 2025-07-01 00:39:49 2
## 10 2 2025-07-01 00:38:17 2025-07-01 00:55:44 1
## # ℹ 2,656,696 more rows
## # ℹ 16 more variables: trip_distance <dbl>, RatecodeID <dbl>,
## # store_and_fwd_flag <chr>, PULocationID <int>, DOLocationID <int>,
## # payment_type <dbl>, fare_amount <dbl>, extra <dbl>, mta_tax <dbl>,
## # tip_amount <dbl>, tolls_amount <dbl>, improvement_surcharge <dbl>,
## # total_amount <dbl>, congestion_surcharge <dbl>, Airport_fee <dbl>,
## # cbd_congestion_fee <dbl>
toc()
## 0.804 sec elapsed
Query a several-million rows online database in roughly one second! Amazing.
Of course, you can make more complex queries and operations using {dplyr}, like
read_parquet_duckdb(url) |>
filter(passenger_count < 3, fare_amount > 100) |>
select(passenger_count, trip_distance, fare_amount) |>
collect()
## # A tibble: 10,991 × 3
## passenger_count trip_distance fare_amount
## * <dbl> <dbl> <dbl>
## 1 1 21.4 103.
## 2 1 16.9 120
## 3 2 19.8 120.
## 4 1 32.5 136.
## 5 1 27.0 102.
## 6 1 34.9 180.
## 7 1 21.0 130.
## 8 1 26.9 220
## 9 1 0 104
## 10 1 27.4 146.
## # ℹ 10,981 more rows
You can summarise data, mutate, join, etc. You only need to remember to ‘collect’ the final data at the end.
Note: to be able to work with parquet files served through ‘https’, you may need to install the appropriate extension first (only once), like this:
duckplyr::db_exec("INSTALL httpfs")
Accessing cloud databases with {duckdb}
Alternatively to {duckplyr}, one could get the same results with a few more lines of code using {duckdb} and {dplyr}:
library(duckdb)
library(dplyr)
con <- dbConnect(duckdb())
db <- tbl(con, paste0("read_parquet('", url, "')"))
db |>
filter(passenger_count < 3) |>
collect()
dbDisconnect(con)
Again, first-time users might need to run the line below (only once) to install a needed extension first:
DBI::dbExecute(con, "INSTALL httpfs;")
Accessing cloud databases with {arrow}
Although this post is focused on DuckDB, one can get pretty similar functionality with {arrow}. At the time of writing this post, it seems a bit more complicated to query online parquet databases served through https than with DuckDB, but in the end working with large databases via {arrow} feels quite similar (e.g. see this tutorial).