Get a lot of data from APIs

✅ Learning objectives

  • Find information about pagination in API docs and descriptions.
  • Retry API requests respectfully.
  • Retrieve multiple pages of data from an API.
  • Process lists of {httr2} responses.
library(httr2)
library(tibblify)

What is pagination?

Why paginate?

  • Network traffic is slow & expensive
  • Bigger transfers ➡️ more corruption chance ➡️ resends
  • So APIs often limit results per call
  • 1 set of results = 1 page

What are some pagination strategies?

  • Shared across strategies: most APIs offer a perPage or pageSize param

How can I determine how an API handles pagination?

Unfortunately, no standard.

  • Almost always: Mentioned in relevant paths
  • Often: Separate “pagination” section near top of docs
  • Fallbacks if the docs don't say:
    • Look for page, perPage, cursor parameters
    • Look at a response
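
When the docs are silent, parsing one response and listing its fields usually reveals the scheme. A minimal sketch using a mocked response so that it runs offline; the field names (results, pagination, count, per_page) are assumptions for illustration, not any particular API's documented shape:

```r
library(httr2)

# Mocked response standing in for a real API call (field names are assumptions).
resp <- response_json(
  body = list(
    results = list(list(id = 1), list(id = 2)),
    pagination = list(count = 173, per_page = 20, page = 1)
  )
)
content <- resp_body_json(resp)

# Top-level names often hint at the scheme ("pagination", "next", "cursor", ...).
field_names <- names(content)
field_names

# With a total count and a page size, the page count follows directly.
total_pages <- ceiling(content$pagination$count / content$pagination$per_page)
total_pages
```

With a real API, swap the mocked response for the result of req_perform() on a single request.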

How do I perform pagination?

Aside: Retries and API consumer etiquette

  • Pagination ➡️ repeated API calls
  • Be nice!
  • httr2::req_retry()
    • Must set max_tries or max_seconds
    • Other options usually ok as-is
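
As a rough illustration of the options, the sketch below builds (but does not send) a request with a retry limit and a custom wait between attempts. The URL and user-agent string are hypothetical placeholders:

```r
library(httr2)

# Hypothetical endpoint; nothing is sent until the request is performed.
polite_request <- request("https://example.com/api/items") |>
  # Identify yourself so the API owner can contact you about problems.
  req_user_agent("my-analysis-script (me@example.com)") |>
  req_retry(
    max_tries = 3,                 # Give up after 3 attempts per request
    backoff = \(attempt) 2^attempt # Seconds to wait, based on failed attempts so far
  )

polite_request
```

The backoff function is optional; leaving it out keeps the package defaults, which is usually fine.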

Aside: Caching responses

  • httr2::req_cache() = save responses locally if the server says it's OK
  • You choose (and manage) the cache path
  • Automatically checks standard caching headers
  • Very helpful if data doesn’t change
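
A small sketch of a cached request, assuming a throwaway cache directory under tempdir() and a hypothetical URL. The max_age and debug arguments are optional extras:

```r
library(httr2)

# Throwaway cache location for this example; a real project would use a stable path.
cache_dir <- file.path(tempdir(), "example-api-cache")

cached_request <- request("https://example.com/api/items") |>
  req_cache(
    path = cache_dir,
    max_age = 60 * 60 * 24, # Prune cached files older than one day (in seconds)
    debug = TRUE            # Emit messages about cache hits and misses
  )

cached_request
```

Caching only kicks in when the server sends the relevant caching headers, so debug = TRUE is a quick way to check whether a given API supports it.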

Aside: Our helper function

We’re going to req_retry() + req_cache() a lot, so:

req_friendly <- function(req, cache_name, max_tries = 3) {
  req |> 
    req_retry(max_tries = max_tries) |> # Tries *per page*
    req_cache(path = rappdirs::user_cache_dir(cache_name))
}

req_perform_iterative()

  • Added in {httr2} 1.0.0.
  • Use in place of req_perform()
normal_request |> 
  req_perform_iterative(
    next_req = function_to_choose_next_req,
    max_reqs = 20 # Maximum separate page calls
  )
  • “Iteration helpers” = built-in next_req functions
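
A next_req function takes the previous response and request, and returns either the next request or NULL to stop. The sketch below is a hand-written example of that contract, tested against mocked responses; the has_more and page fields are assumptions for illustration:

```r
library(httr2)

# Hypothetical next_req function: "has_more" and "page" are assumed field names.
next_page <- function(resp, req) {
  body <- resp_body_json(resp)
  if (!isTRUE(body$has_more)) {
    return(NULL) # NULL tells req_perform_iterative() to stop
  }
  req_url_query(req, page = body$page + 1)
}

base_req <- request("https://example.com/items")

# Mocked responses let us check the function without touching the network.
middle_resp <- response_json(body = list(page = 1, has_more = TRUE))
last_resp <- response_json(body = list(page = 2, has_more = FALSE))

next_url <- next_page(middle_resp, base_req)$url
after_last <- next_page(last_resp, base_req)
next_url
```

Checking a next_req function this way before passing it to req_perform_iterative() avoids burning real API calls on a logic error.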

Iteration helpers: iterate_with_offset()

  • param_name = "page" or whatever API calls it
  • start, offset = almost always leave as-is
  • resp_pages = function to convert resp to page count
    • max_reqs <- min(max_reqs, total_pages)
  • resp_complete = function to check if resp is last page
    • E.g.: \(resp) !length(resp_body_json(resp))
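
To see what the helper does, the sketch below calls the function returned by iterate_with_offset() directly on mocked responses. The results field and the URL are assumptions for illustration:

```r
library(httr2)

# Build the iterator: bump "page" by 1 until a response has no results.
next_req <- iterate_with_offset(
  "page",
  resp_complete = \(resp) !length(resp_body_json(resp)$results)
)

req <- request("https://example.com/items")

# Mocked pages: one with results, one empty (the assumed "last page" signal).
full_page <- response_json(body = list(results = list(1, 2)))
empty_page <- response_json(body = list(results = list()))

after_full <- next_req(full_page, req)   # A request for the next page
after_empty <- next_req(empty_page, req) # NULL: iteration is complete
after_full$url
```

In normal use the iterator is never called by hand; req_perform_iterative() calls it after each response.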

Offset example: OpenFEC

candidates_request <- 
  request("https://api.open.fec.gov/v1/candidates") |> 
  req_friendly(cache_name = "fecapi") |> 
  req_url_query(api_key = "DEMO_KEY", election_year = 2020, office = "P") |> 
  req_url_query(has_raised_funds = TRUE)


candidates_single <- 
  candidates_request |> 
  req_perform() |> 
  resp_body_json()
length(candidates_single$results)
#> [1] 20
candidates_single$pagination$count
#> [1] 173
candidates_multi <- 
  candidates_request |> 
  req_perform_iterative(
    next_req = iterate_with_offset(
      "page",
      resp_pages = \(resp) {
        content <- resp_body_json(resp)
        content$pagination$pages
      }
    ),
    max_reqs = Inf
  )
length(candidates_multi)
#> [1] 9

Offset example: crossref.org

crossref_request <- 
  request("https://api.crossref.org/works") |> 
  req_friendly(cache_name = "crossrefapi") |> 
  req_url_query(query = "apis")


crossref_single <- 
  crossref_request |> 
  req_perform() |> 
  resp_body_json()
length(crossref_single$message$items)
#> [1] 20
crossref_single$message$`total-results`
#> [1] 14114
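crossref_multi <- 
  crossref_request |> # req_friendly() already set up retries
  req_perform_iterative(
    next_req = iterate_with_offset(
      "offset",
      start = 0,   # Crossref's offset counts items to skip, not pages
      offset = 20, # Step by the 20 items returned per call
      resp_pages = \(resp) {
        content <- resp_body_json(resp)$message
        # Convert the total item count to a page count
        ceiling(content$`total-results` / content$`items-per-page`)
      }
    ),
    max_reqs = Inf
  )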
length(crossref_multi)
#> [1] 706

Iteration helpers: iterate_with_cursor()

  • param_name = "cursor" or whatever API calls it
  • resp_param_value = function to convert resp to next cursor
    • NULL if no more pages
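
The same kind of offline check works for cursors. This sketch calls the function returned by iterate_with_cursor() on mocked responses; the next_cursor and items field names are assumptions for illustration:

```r
library(httr2)

# Build the iterator: read the next cursor from the body, NULL when absent.
next_req <- iterate_with_cursor(
  "cursor",
  resp_param_value = \(resp) resp_body_json(resp)$next_cursor
)

req <- request("https://example.com/items")

# Mocked pages: one that points to a next cursor, one that does not.
middle_page <- response_json(body = list(items = list(1, 2), next_cursor = "abc123"))
last_page <- response_json(body = list(items = list(3)))

after_middle <- next_req(middle_page, req) # Request with cursor=abc123
after_last <- next_req(last_page, req)     # NULL: no more pages
after_middle$url
```

Some APIs keep returning a cursor even on an empty page, in which case resp_param_value also needs to check whether any items came back.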

Cursor example: Crossref

crossref_request_cursor <- 
  crossref_request |> 
  req_url_query(cursor = "*")


crossref_single <- 
  crossref_request_cursor |> 
  req_perform() |> 
  resp_body_json()
names(crossref_single$message)
#> [1] "facets"         "next-cursor"    "total-results"  "items"          "items-per-page"
#> [6] "query" 
crossref_multi <- 
  crossref_request_cursor |> # req_friendly() already set up retries
  req_perform_iterative(
    next_req = iterate_with_cursor(
      "cursor",
      resp_param_value = \(resp) {
        content <- resp_body_json(resp)
        if (!length(content$message$items)) {
          return(NULL)
        }
        content$message$`next-cursor`
      }
    ),
    max_reqs = Inf
  )
length(crossref_multi)
#> [1] 706

Roll-your-own iteration

Roll-your-own example: DnD Monsters

dnd_request <- 
  request("https://api.open5e.com/monsters/?limit=100") |> 
  req_friendly(cache_name = "dndapi")


dnd_single <- 
  dnd_request |> 
  req_perform() |> 
  resp_body_json()
length(dnd_single$results)
#> [1] 100
dnd_single$count
#> [1] 2439
dnd_multi <- 
  dnd_request |> # req_friendly() already set up retries
  req_perform_iterative(
    next_req = function(resp, req) {
      url <- resp_body_json(resp)$`next`
      if (!is.null(url)) {
        req_url(req, url)
      }
    },
    max_reqs = Inf
  )
length(dnd_multi)
#> [1] 25

What else can I do with lists of requests?

Other multi-request situations

  • Looking up multiple, related things
  • Enriching a dataset
  • Manual pagination
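
For manual pagination, one option is to build every page request up front once the page count is known. A minimal sketch, assuming a hypothetical endpoint with a page parameter and 9 pages:

```r
library(httr2)

base_request <- request("https://example.com/items") # Hypothetical endpoint
page_numbers <- 1:9                                  # Assumed known page count

# One request per page, differing only in the "page" query parameter.
page_requests <- lapply(
  page_numbers,
  \(page) req_url_query(base_request, page = page)
)

page_urls <- vapply(page_requests, \(req) req$url, character(1))
length(page_requests)
```

The resulting list can be passed to req_perform_sequential() or req_perform_parallel() like any other list of requests.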

Create a list of requests

free_dictionary_request <- 
  request("https://api.dictionaryapi.dev/api/v2/entries/en/") |> 
  req_friendly(cache_name = "free_dictionary")
words <- c("hello", "world", "goodbye")
free_dictionary_reqs <- 
  purrr::map(words, \(word) req_url_path_append(free_dictionary_request, word))
free_dictionary_reqs |> purrr::map_chr("url")
#> [1] "https://api.dictionaryapi.dev/api/v2/entries/en/hello"  
#> [2] "https://api.dictionaryapi.dev/api/v2/entries/en/world"  
#> [3] "https://api.dictionaryapi.dev/api/v2/entries/en/goodbye"

Perform requests sequentially

free_dictionary_resps <- req_perform_sequential(free_dictionary_reqs)
length(free_dictionary_resps)
#> [1] 3

Perform requests in parallel

httr2::req_perform_parallel() performs multiple requests simultaneously

  • Pros: (much?) faster
  • Cons:
    • Might over-tax server
    • Older {httr2} versions silently ignore req_retry() (and some other things)
    • Harder to debug
free_dictionary_resps <- req_perform_parallel(free_dictionary_reqs)
length(free_dictionary_resps)
#> [1] 3

How do I parse paginated responses?

resps_data()

  • req_perform_*() returns list of responses
  • resps_data() takes that list + a resp_data parsing function and merges the parsed pieces
  • Usually the parser needs to do more than just resp_body_json()
    • Grab just the actual results
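
The merge step can be tried offline with mocked responses. In this sketch, fake_api() is a hypothetical stand-in for a paginated API that returns two results per page plus some metadata; the parser keeps only the results:

```r
library(httr2)

page_requests <- lapply(
  1:2,
  \(page) req_url_query(request("https://example.com/items"), page = page)
)

# Hypothetical mock: two results per page, plus metadata we do not want.
fake_api <- function(req) {
  page <- as.integer(url_parse(req$url)$query$page)
  response_json(
    body = list(
      results = list(list(id = page * 10 + 1), list(id = page * 10 + 2)),
      pagination = list(page = page)
    )
  )
}

resps <- with_mocked_responses(
  fake_api,
  req_perform_sequential(page_requests, progress = FALSE)
)

# Parse each response down to its results, then merge across pages.
all_results <- resps_data(resps, \(resp) resp_body_json(resp)$results)
result_ids <- vapply(all_results, \(x) x$id, integer(1))
result_ids
```

The merged list has one element per result rather than one per page, which is the shape tibblify::tibblify() expects.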

resps_data() example: OpenFEC

fec_resp_parser <- function(resp) {
  resp_body_json(resp)$results
}
candidates_multi_results <- resps_data(
  candidates_multi, fec_resp_parser
)
tibblify::tibblify(candidates_multi_results)
#> # A tibble: 173 × 24
#>    active_through candidate_id candidate_inactive candidate_status cycles    
#>             <int> <chr>        <lgl>              <chr>            <list>    
#>  1           2020 P00014241    FALSE              N                <list [3]>
#>  2           2024 P00008193    FALSE              N                <list [4]>
#>  3           2020 P00006296    FALSE              P                <list [4]>
#>  4           2020 P00010900    FALSE              N                <list [1]>
#>  5           2020 P00012104    FALSE              P                <list [3]>
#>  6           2020 P40003170    FALSE              N                <list [5]>
#>  7           2020 P00008128    FALSE              N                <list [4]>
#>  8           2020 P80004369    FALSE              N                <list [2]>
#>  9           2020 P00010090    FALSE              N                <list [3]>
#> 10           2024 P00015297    FALSE              N                <list [3]>
#> # ℹ 163 more rows
#> # ℹ 19 more variables: district <chr>, district_number <int>,
#> #   election_districts <list>, election_years <list>, federal_funds_flag <lgl>,
#> #   first_file_date <chr>, has_raised_funds <lgl>,
#> #   inactive_election_years <list>, incumbent_challenge <chr>,
#> #   incumbent_challenge_full <chr>, last_f2_date <chr>, last_file_date <chr>,
#> #   load_date <chr>, name <chr>, office <chr>, office_full <chr>, …

resps_data() friends

  • req_perform_iterative() on_error arg: “stop” (error) or “return” (keep responses so far + the error)
  • resps_successes() extracts just the successes from a resps list
  • resps_failures() does the same for failures
  • resps_requests() shows how they were called
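
A sketch of splitting successes from failures, using a mocked API so that one request fails on purpose. The fake_api() function, the URL, and the word list are hypothetical; on_error = "continue" is the req_perform_sequential() option that keeps going after a failure:

```r
library(httr2)

words <- c("hello", "notaword", "goodbye")
word_requests <- lapply(
  words,
  \(word) req_url_path_append(request("https://example.com/entries"), word)
)

# Hypothetical mock: returns a 404 for one word so there is a failure to inspect.
fake_api <- function(req) {
  if (grepl("notaword", req$url, fixed = TRUE)) {
    response_json(status_code = 404, body = list(error = "not found"))
  } else {
    response_json(body = list(word = basename(req$url)))
  }
}

resps <- with_mocked_responses(
  fake_api,
  req_perform_sequential(word_requests, on_error = "continue", progress = FALSE)
)

successes <- resps_successes(resps) # Response objects only
failures <- resps_failures(resps)   # Error objects only
c(successes = length(successes), failures = length(failures))
```

Failures come back as error objects rather than responses, so they can be inspected or retried separately from the data that did arrive.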