Parallel and serial pagination

Paged operations

Some GitHub API operations return their results one page at a time. For instance, there are many thousands of gists, but if we call list_public we only see the first 30:

from ghapi.all import *
api = GhApi()
gists = api.gists.list_public()
len(gists)
30

That's because this operation takes two optional parameters, per_page and page:

api.gists.list_public

gists.list_public(since, per_page, page): List public gists

This is a common pattern for list_* operations in the GitHub API. One way to get more results is to increase per_page:

len(api.gists.list_public(per_page=100))
100

However, per_page has a maximum of 100, so if you want more, you'll have to pass page= to get pages beyond the first. An easy way to iterate through all pages is to use paged.
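For comparison, here's a sketch of the manual approach, using the page parameter from the signature above (output omitted, since the public gist list changes constantly):

second_page = api.gists.list_public(per_page=100, page=2)
assert len(second_page) <= 100  # each page holds at most per_page items

paged removes the need to track page numbers like this yourself.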


paged(oper, *args, per_page=30, max_pages=9999, **kwargs)

Convert operation oper(*args,**kwargs) into an iterator

We'll demonstrate this using the repos.list_for_org method:

api.repos.list_for_org

repos.list_for_org(org, type, sort, direction, per_page, page): List organization repositories

repos = api.repos.list_for_org('fastai')
len(repos),repos[0].name
(30, 'fast-image')

To convert this operation into a Python iterator, pass the operation itself, along with any arguments (either keyword or positional), to paged:

repos = paged(api.repos.list_for_org, 'fastai')

You can now iterate through repos using Python, e.g.:

for page in repos: print(len(page), page[0].name)
30 fast-image
30 fastforest
30 .github
3 tweetrel
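Each iteration yields one page, and each page is a list of results, so flattening everything into a single list is just a nested comprehension. A sketch (the counts and names above will drift over time):

all_repos = [repo for page in paged(api.repos.list_for_org, 'fastai') for repo in page]
len(all_repos)  # 30 + 30 + 30 + 3 = 93 at the time the pages above were fetched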

GitHub tells us how many pages are available using the link header. Unfortunately the pypi LinkHeader library appears to no longer be maintained, so GhApi includes a refactored version of it, parse_link_hdr.

parse_link_hdr(header)

Parse an RFC 5988 link header, returning a dict from rels to a tuple of URL and attrs dict

Here's an example of a link header with just one link:

parse_link_hdr('<http://example.com>; rel="foo bar"; type=text/html')
{'foo bar': ('http://example.com', {'type': 'text/html'})}
from fastcore.test import test_eq
links = parse_link_hdr('<http://example.com>; rel="foo bar"; type=text/html')
link = links['foo bar']
test_eq(link[0], 'http://example.com')
test_eq(link[1]['type'], 'text/html')

Let's test it on the headers we received from our last call to GitHub. You can access the last call's headers in recv_hdrs:

api.recv_hdrs['Link']
'<https://api.github.com/organizations/20547620/repos?per_page=30&page=4>; rel="prev", <https://api.github.com/organizations/20547620/repos?per_page=30&page=4>; rel="last", <https://api.github.com/organizations/20547620/repos?per_page=30&page=1>; rel="first"'

Here's what happens when we parse that:

parse_link_hdr(api.recv_hdrs['Link'])
{'prev': ('https://api.github.com/organizations/20547620/repos?per_page=30&page=4',
  {}),
 'last': ('https://api.github.com/organizations/20547620/repos?per_page=30&page=4',
  {}),
 'first': ('https://api.github.com/organizations/20547620/repos?per_page=30&page=1',
  {})}
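GhApi can do the next step for us too, but pulling the last page number out of that 'last' link only needs the standard library. A sketch (the variable names are just illustrative):

from urllib.parse import urlsplit, parse_qs
last_url = parse_link_hdr(api.recv_hdrs['Link'])['last'][0]
int(parse_qs(urlsplit(last_url).query)['page'][0])
4

last_page, shown in the next section, handles this for you.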

Getting pages in parallel

Rather than requesting each page one at a time, we can save some time by getting all the pages we need in parallel.


GhApi.last_page()

Parse RFC 5988 link header from most recent operation, and extract the last page

To help us know the number of pages needed, we can use last_page, which uses the link header we just looked at to extract the number of the last page available.

We will need multiple pages to get all the repos in the github organization, even if we get 100 at a time:

api.repos.list_for_org('github', per_page=100)
api.last_page()
4


pages(oper, n_pages, *args, n_workers=None, per_page=100, **kwargs)

Get n_pages pages from oper(*args,**kwargs)

pages by default passes per_page=100 to the operation.

Let's look at some examples. To get all the pages for the repos in the github organization in parallel, we can use this:

gh_repos = pages(api.repos.list_for_org, api.last_page(), 'github').concat()
len(gh_repos)
367
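Since concat flattens the pages into one list, the result behaves just like a single page and supports the same attribute access as before. For example, you could tally how many of those repos are archived. A sketch (archived is a standard field in GitHub's repository responses, and the counts will change over time):

from collections import Counter
Counter(repo.archived for repo in gh_repos)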

If you already know the number of pages required ahead of time, there's no need to call last_page. For instance, the GitHub docs specify that we can get at most 3000 gists, i.e. 30 pages of 100:

gists = pages(api.gists.list_public, 30).concat()
len(gists)
3000

GitHub ignores the per_page parameter for some API calls, such as listing public events, which it limits to 8 pages of 30 items each. To retrieve all the pages in these cases, you need to pass the lower per_page limit explicitly:

api.activity.list_public_events()
api.last_page()
8
evts = pages(api.activity.list_public_events, api.last_page(), per_page=30).concat()
len(evts)
232
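If you find yourself repeating this probe-then-fetch pattern, it is easy to wrap it in a small helper. The function below is hypothetical, not part of GhApi; it assumes the api object from this page, costs one extra probe request, and doesn't handle edge cases such as results that fit on a single page:

def fetch_all(oper, *args, per_page=100, **kwargs):
    "Probe once to find the page count, then fetch every page in parallel"
    oper(*args, per_page=per_page, **kwargs)  # probe call; fills api.recv_hdrs with the link header
    return pages(oper, api.last_page(), *args, per_page=per_page, **kwargs).concat()

evts = fetch_all(api.activity.list_public_events, per_page=30)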