Get a lot of data from APIs

✅ Learning objectives

Find information about pagination in API docs and descriptions.
Retry API requests respectfully.
Retrieve multiple pages of data from an API.
Process lists of {httr2} responses.

library(httr2)
library(tibblify)

What is pagination?

Why paginate?

Network traffic is slow & expensive
Bigger transfers ➡️ more corruption chance ➡️ resends
So APIs often limit results per call
1 set of results = 1 page

What are some pagination strategies?

Offset: page param
- OpenFEC
- Crossref.org
Cursor: cursor or token param
- Crossref.org “deep paging”

Header link: Link: to next in response header
- GitHub
- MTG cards
Body link: nextUrl in response body
- open5e.com monsters

Shared: Most also offer a perPage or pageSize param
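As a rough sketch, the two query-parameter strategies differ only in what the request carries. The host api.example.com and the parameter values are made up for illustration; real APIs choose their own names. Header-link and body-link APIs hand back the next URL themselves, so there is no parameter to build.

```r
library(httr2)

# Hypothetical endpoint, used only to show how the URLs differ.
base_req <- request("https://api.example.com/items")

# Offset: ask for a numbered page, optionally with a page size.
offset_req <- base_req |>
  req_url_query(page = 2, perPage = 50)

# Cursor: send back an opaque token taken from the previous response.
cursor_req <- base_req |>
  req_url_query(cursor = "abc123")

offset_req$url
cursor_req$url
```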

How can I determine how an API handles pagination?

Unfortunately, no standard.

Almost always: Mentioned in relevant paths
Often: Separate “pagination” section near top of docs
Fallbacks:
- Look for page, perPage, or cursor parameters
- Look at a response
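When the docs are silent, a single response usually gives the strategy away. The sketch below inspects a stand-in response built locally with response_json(); the Link header and the count, next, and results fields are hypothetical, and a real response would come from req_perform().

```r
library(httr2)

# A stand-in response; the header and body fields are invented for this sketch.
resp <- response_json(
  headers = list(Link = '<https://api.example.com/items?page=2>; rel="next"'),
  body = list(
    count = 45,
    `next` = "https://api.example.com/items?page=2",
    results = list()
  )
)

# Clue 1: a Link header suggests header-link pagination.
resp_header_exists(resp, "Link")
resp_link_url(resp, "next")

# Clue 2: top-level body fields such as count, next, or a cursor.
names(resp_body_json(resp))
```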

How do I perform pagination?

Aside: Retries and API consumer etiquette

Pagination ➡️ repeated API calls
Be nice!
httr2::req_retry()
- Must set max_tries or max_seconds
- Other options usually ok as-is
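A minimal sketch of a polite retry policy, assuming a made-up endpoint. Nothing is sent here; the policy only takes effect when the request is performed.

```r
library(httr2)

# Hypothetical endpoint; building the request does not contact it.
req <- request("https://api.example.com/items") |>
  req_retry(
    max_tries = 3,    # stop after three attempts per request
    max_seconds = 60  # and also cap the total time spent retrying
  )

req
```

Setting both limits is optional; either one on its own is enough.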

Aside: Caching responses

httr2::req_cache() = save responses locally if the server says it's OK
You choose the cache path
Automatically checks standard caching headers
Very helpful if data doesn’t change
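A minimal caching sketch, assuming a made-up endpoint and a throwaway directory under tempdir(). A real project would likely pick a stable location so the cache survives between sessions.

```r
library(httr2)

# Throwaway cache directory for this sketch only.
cache_dir <- file.path(tempdir(), "example-api-cache")

# Hypothetical endpoint; responses are only cached once the request is performed
# and the server's headers allow it.
req <- request("https://api.example.com/items") |>
  req_cache(
    path = cache_dir,
    max_age = 60 * 60 * 24 # prune cached files older than one day
  )

req
```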

Aside: Our helper function

We’re going to req_retry() + req_cache() a lot, so:

req_friendly <- function(req, cache_name, max_tries = 3) {
  req |> 
    req_retry(max_tries = max_tries) |> # Tries *per page*
    req_cache(path = rappdirs::user_cache_dir(cache_name))
}

req_perform_iterative()

Added in {httr2} 1.0.0.
Use it in place of req_perform() at the end of the pipe

normal_request |> 
  req_perform_iterative(
    next_req = function_to_choose_next_req,
    max_reqs = 20 # Maximum separate page calls
  )

“Iteration helpers” = built-in next_req functions
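To make the next_req contract concrete: it is a function of resp and req that returns the request for the next page, or NULL to stop. The sketch below runs against a local mock rather than a real server; fake_api(), next_page(), and the page and last_page fields are all invented for illustration.

```r
library(httr2)

# Hypothetical API answered locally: pages 1 to 3, then nothing more.
fake_api <- function(req) {
  page <- url_parse(req$url)$query$page
  page <- if (is.null(page)) 1L else as.integer(page)
  response_json(body = list(page = page, last_page = page >= 3L))
}

# next_req gets the latest response and the request that produced it.
next_page <- function(resp, req) {
  body <- resp_body_json(resp)
  if (isTRUE(body$last_page)) {
    return(NULL) # NULL tells req_perform_iterative() to stop
  }
  req_url_query(req, page = body$page + 1L)
}

resps <- with_mocked_responses(
  fake_api,
  request("https://api.example.com/items") |>
    req_perform_iterative(next_req = next_page, max_reqs = 20)
)

length(resps)
```

With this mock, iteration should stop after the third page, well before max_reqs is reached.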

Iteration helpers: iterate_with_offset()

param_name = "page" or whatever API calls it
start, offset = almost always leave as-is
resp_pages = function to convert resp to page count
- max_reqs <- min(max_reqs, total_pages)
resp_complete = function to check if resp is last page
- E.g.: \(resp) !length(resp_body_json(resp))
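A sketch of the resp_complete route, for APIs that report no page count. It runs against a local mock; fake_api() and its five items served two per page are assumptions made for this example.

```r
library(httr2)

# Hypothetical API answered locally: 5 items, 2 per page, bare JSON array body.
all_items <- as.list(1:5)
fake_api <- function(req) {
  page <- url_parse(req$url)$query$page
  page <- if (is.null(page)) 1L else as.integer(page)
  idx <- seq(2 * page - 1, 2 * page)
  idx <- idx[idx <= length(all_items)]
  response_json(body = all_items[idx])
}

resps <- with_mocked_responses(
  fake_api,
  request("https://api.example.com/items") |>
    req_perform_iterative(
      next_req = iterate_with_offset(
        "page",
        # No page count available, so stop at the first empty page.
        resp_complete = \(resp) !length(resp_body_json(resp))
      ),
      max_reqs = 20
    )
)

length(resps)
```

Note the cost of this approach: the empty page has to be fetched before iteration can stop, so the list should end with one response that carries no items.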

Offset example: OpenFEC

candidates_request <- 
  request("https://api.open.fec.gov/v1/candidates") |> 
  req_friendly(cache_name = "fecapi") |> 
  req_url_query(api_key = "DEMO_KEY", election_year = 2020, office = "P") |> 
  req_url_query(has_raised_funds = TRUE)

candidates_single <- 
  candidates_request |> 
  req_perform() |> 
  resp_body_json()
length(candidates_single$results)
#> [1] 20
candidates_single$pagination$count
#> [1] 173

candidates_multi <- 
  candidates_request |> 
  req_perform_iterative(
    next_req = iterate_with_offset(
      "page",
      resp_pages = \(resp) {
        content <- resp_body_json(resp)
        content$pagination$pages
      }
    ),
    max_reqs = Inf
  )
length(candidates_multi)
#> [1] 9

Offset example: crossref.org

crossref_request <- 
  request("https://api.crossref.org/works") |> 
  req_friendly(cache_name = "crossrefapi") |> 
  req_url_query(query = "apis")

crossref_single <- 
  crossref_request |> 
  req_perform() |> 
  resp_body_json()
length(crossref_single$message$items)
#> [1] 20
crossref_single$message$`total-results`
#> [1] 14114

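crossref_multi <- 
  crossref_request |> 
  req_retry(max_tries = 3) |> 
  req_perform_iterative(
    next_req = iterate_with_offset(
      "offset",
      start = 0, offset = 20, # Crossref's offset counts rows, not pages
      resp_pages = \(resp) {
        content <- resp_body_json(resp)
        # Convert the total row count to a page count (20 rows per page)
        ceiling(content$message$`total-results` / 20)
      }
    ),
    max_reqs = Inf
  )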
length(crossref_multi)
#> [1] 706

Iteration helpers: iterate_with_cursor()

param_name = "cursor" or whatever API calls it
resp_param_value = function to convert resp to next cursor
- NULL if no more pages
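A sketch of cursor iteration against a local mock. fake_api(), the next_cursor lookup table, and the next_cursor body field are invented for illustration; the last page simply omits the field, so resp_param_value returns NULL there.

```r
library(httr2)

# Hypothetical cursor chain: first page -> "b" -> "c" -> done.
next_cursor <- list(start = "b", b = "c")

fake_api <- function(req) {
  cursor <- url_parse(req$url)$query$cursor
  if (is.null(cursor)) cursor <- "start"
  body <- list(cursor = cursor)
  # Assigning NULL leaves the field absent, which marks the last page.
  body$next_cursor <- next_cursor[[cursor]]
  response_json(body = body)
}

resps <- with_mocked_responses(
  fake_api,
  request("https://api.example.com/items") |>
    req_perform_iterative(
      next_req = iterate_with_cursor(
        "cursor",
        resp_param_value = \(resp) resp_body_json(resp)$next_cursor
      ),
      max_reqs = 20
    )
)

length(resps)
```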

Cursor example: Crossref

crossref_request_cursor <- 
  crossref_request |> 
  req_url_query(cursor = "*")

crossref_single <- 
  crossref_request_cursor |> 
  req_perform() |> 
  resp_body_json()
names(crossref_single$message)
#> [1] "facets"         "next-cursor"    "total-results"  "items"          "items-per-page"
#> [6] "query"

crossref_multi <- 
  crossref_request_cursor |> 
  req_retry(max_tries = 3) |> 
  req_perform_iterative(
    next_req = iterate_with_cursor(
      "cursor",
      resp_param_value = \(resp) {
        content <- resp_body_json(resp)
        if (!length(content$message$items)) {
          return(NULL)
        }
        content$message$`next-cursor`
      }
    ),
    max_reqs = Inf
  )
length(crossref_multi)
#> [1] 706

Iteration helpers: iterate_with_link_url()

rel = "next" or whatever API calls it
Reads the next URL from the Link response header
- Stops when the response has no link with that rel
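A sketch of header-link iteration against a local mock. fake_api() and its three pages are assumptions for this example; the mock puts a rel="next" entry in the Link header on every page except the last.

```r
library(httr2)

# Hypothetical API answered locally: pages 1 and 2 advertise the next page
# in a Link header, and page 3 sends none.
fake_api <- function(req) {
  page <- url_parse(req$url)$query$page
  page <- if (is.null(page)) 1L else as.integer(page)
  headers <- list()
  if (page < 3L) {
    next_url <- paste0("https://api.example.com/items?page=", page + 1L)
    headers$Link <- sprintf('<%s>; rel="next"', next_url)
  }
  response_json(headers = headers, body = list(page = page))
}

resps <- with_mocked_responses(
  fake_api,
  request("https://api.example.com/items") |>
    req_perform_iterative(
      next_req = iterate_with_link_url(rel = "next"),
      max_reqs = 20
    )
)

length(resps)
```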

Link-URL example: MTG

mtg_cards_request <- 
  request("http://api.magicthegathering.io/v1") |> 
  req_friendly(cache_name = "mtgapi") |> 
  req_url_path_append("cards") |> 
  req_url_query(name = "bee", pageSize = 20)

mtg_cards_single <- 
  mtg_cards_request |> 
  req_perform()
resp_header(mtg_cards_single, "link")
#> <https://api.magicthegathering.io/v1/cards?name=bee&page=5&pageSize=20>; rel="last", 
#> <https://api.magicthegathering.io/v1/cards?name=bee&page=2&pageSize=20>; rel="next"
mtg_cards_content <- mtg_cards_single |> 
  resp_body_json()

mtg_cards_multi <- 
  mtg_cards_request |> 
  req_retry(max_tries = 3) |> 
  req_perform_iterative(
    next_req = iterate_with_link_url(),
    max_reqs = Inf
  )
length(mtg_cards_multi)
#> [1] 5

Roll-your-own iteration

People love to reinvent this wheel.
Body link: nextUrl in response body
- open5e.com monsters

Roll-your-own example: DnD Monsters

dnd_request <- 
  request("https://api.open5e.com/monsters/?limit=100") |> 
  req_friendly(cache_name = "dndapi")

dnd_single <- 
  dnd_request |> 
  req_perform() |> 
  resp_body_json()
length(dnd_single$results)
#> [1] 100
dnd_single$count
#> [1] 2439

dnd_multi <- 
  dnd_request |> 
  req_retry(max_tries = 3) |> 
  req_perform_iterative(
    next_req = function(resp, req) {
      url <- resp_body_json(resp)$`next`
      if (!is.null(url)) {
        req_url(req, url)
      }
    },
    max_reqs = Inf
  )
length(dnd_multi)
#> [1] 25

What else can I do with lists of requests?

Other multi-request situations

Looking up multiple, related things
Enriching a dataset
Manual pagination
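Manual pagination, for instance, can be as simple as building one request per page up front. A minimal sketch, assuming a made-up endpoint whose page count (3) is already known:

```r
library(httr2)

# Hypothetical endpoint with a known page count.
base_req <- request("https://api.example.com/items")

# One request per page, built in advance rather than discovered one at a time.
page_reqs <- lapply(1:3, \(page) req_url_query(base_req, page = page))

vapply(page_reqs, \(req) req$url, character(1))
```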

Create a list of requests

free_dictionary_request <- 
  request("https://api.dictionaryapi.dev/api/v2/entries/en/") |> 
  req_friendly(cache_name = "free_dictionary")
words <- c("hello", "world", "goodbye")
free_dictionary_reqs <- 
  purrr::map(words, \(word) req_url_path_append(free_dictionary_request, word))
free_dictionary_reqs |> purrr::map_chr("url")

#> [1] "https://api.dictionaryapi.dev/api/v2/entries/en/hello"  
#> [2] "https://api.dictionaryapi.dev/api/v2/entries/en/world"  
#> [3] "https://api.dictionaryapi.dev/api/v2/entries/en/goodbye"

Perform requests sequentially

free_dictionary_resps <- req_perform_sequential(free_dictionary_reqs)
length(free_dictionary_resps)
#> [1] 3

Perform requests in parallel

httr2::req_perform_parallel() performs multiple requests simultaneously

Pros: (much?) faster
Cons:
- Might over-tax server
- Silently ignores req_retry() (and some other things)
- Harder to debug

free_dictionary_resps <- req_perform_parallel(free_dictionary_reqs)
length(free_dictionary_resps)
#> [1] 3

How do I parse paginated responses?

resps_data()

req_perform_*() returns list of responses
resps_data() takes that list + resp_data parsing function and merges
Usually you need more than just something like resp_body_json
- Grab just the actual results
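A small offline sketch of why the parser matters. The two stand-in pages are built locally with response_json(), and the count and results fields are hypothetical; the parser keeps only the records and drops the pagination wrapper.

```r
library(httr2)

# Two stand-in pages; real ones would come from a req_perform_*() call.
page_1 <- response_json(
  body = list(count = 3, results = list(list(id = 1), list(id = 2)))
)
page_2 <- response_json(
  body = list(count = 3, results = list(list(id = 3)))
)

# Keep just the actual results from each page.
parse_results <- function(resp) {
  resp_body_json(resp)$results
}

records <- resps_data(list(page_1, page_2), parse_results)
length(records)
```

The merged result should be one flat list of records across both pages, ready for something like tibblify::tibblify().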

resps_data() example: OpenFEC

fec_resp_parser <- function(resp) {
  resp_body_json(resp)$results
}

candidates_multi_results <- resps_data(
  candidates_multi, fec_resp_parser
)
tibblify::tibblify(candidates_multi_results)

#> # A tibble: 173 × 24
#>    active_through candidate_id candidate_inactive candidate_status cycles    
#>             <int> <chr>        <lgl>              <chr>            <list>    
#>  1           2020 P00014241    FALSE              N                <list [3]>
#>  2           2024 P00008193    FALSE              N                <list [4]>
#>  3           2020 P00006296    FALSE              P                <list [4]>
#>  4           2020 P00010900    FALSE              N                <list [1]>
#>  5           2020 P00012104    FALSE              P                <list [3]>
#>  6           2020 P40003170    FALSE              N                <list [5]>
#>  7           2020 P00008128    FALSE              N                <list [4]>
#>  8           2020 P80004369    FALSE              N                <list [2]>
#>  9           2020 P00010090    FALSE              N                <list [3]>
#> 10           2024 P00015297    FALSE              N                <list [3]>
#> # ℹ 163 more rows
#> # ℹ 19 more variables: district <chr>, district_number <int>,
#> #   election_districts <list>, election_years <list>, federal_funds_flag <lgl>,
#> #   first_file_date <chr>, has_raised_funds <lgl>,
#> #   inactive_election_years <list>, incumbent_challenge <chr>,
#> #   incumbent_challenge_full <chr>, last_f2_date <chr>, last_file_date <chr>,
#> #   load_date <chr>, name <chr>, office <chr>, office_full <chr>, …

resps_data() friends

req_perform_iterative on_error arg: “stop” or “return”
resps_successes() extracts just successes from resps list
resps_failures() same, but failures
resps_requests() to see how they were called
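A sketch of splitting a mixed list into successes and failures. It uses req_perform_sequential() with on_error = "continue" against a local mock, so one request can fail without stopping the rest; fake_api() and the three paths are invented for this example.

```r
library(httr2)

# Hypothetical API answered locally: the "missing" path returns a 404.
fake_api <- function(req) {
  if (grepl("missing$", req$url)) {
    response(status_code = 404)
  } else {
    response_json(body = list(ok = TRUE))
  }
}

base_req <- request("https://api.example.com")
reqs <- lapply(
  c("first", "missing", "third"),
  \(path) req_url_path_append(base_req, path)
)

# "continue" records the failure and moves on to the next request.
resps <- with_mocked_responses(
  fake_api,
  req_perform_sequential(reqs, on_error = "continue")
)

length(resps_successes(resps))
length(resps_failures(resps))
```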
