Issue 49944

Overview

pandas provides functions for reading data from formats such as CSV, Excel, JSON, and HTML. Many of these functions accept either the data itself or a URL as input. When given a URL, functions such as pandas.read_json, pandas.read_excel, and pandas.read_html make a GET request to the URL, fetch the response, process it, and (attempt to) return it to the user as a DataFrame (a list of DataFrames in the case of pandas.read_html).

However, pandas.read_html does not support storage_options. As a result, it is currently not possible to pass headers with the HTTP request when pandas.read_html is used with a URL, so pandas.read_html cannot fetch data from any endpoint that requires special information in the request headers, such as authentication credentials or other special instructions.
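To make "passing headers with the HTTP request" concrete, here is a minimal sketch using only the standard library. It builds a request object with custom headers but does not send it. The URL and header values are placeholders, not taken from the issue. The pandas documentation describes storage_options for HTTP(S) URLs as key-value pairs forwarded to urllib.request.Request as header options, so this is roughly the object a reader ends up constructing.

```python
import urllib.request

# Placeholder endpoint and header values, chosen only for illustration.
url = "https://example.com/protected/table.html"
headers = {
    "User-Agent": "pandas-example",
    "Auth": "Bearer <token>",
}

# Build the request object without sending it, so no network access is needed.
request = urllib.request.Request(url, headers=headers)

# urllib normalizes header names with str.capitalize(), e.g. "User-agent".
for name, value in request.header_items():
    print(f"{name}: {value}")
```

Without a way to hand such a headers mapping to pandas.read_html, the request it makes to a URL cannot carry these values.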

Problem Analysis

Description

pandas.read_html does not support the storage_options keyword, while many other functions such as pandas.read_json do, because storage_options is not accounted for in the class hierarchy and infrastructure of pandas.io.html.

storage_options is not expected or considered in the following places:

pandas.io.html:
- the imports from pandas._typing
- _setup_build_doc (for each of the two parsers in pandas.io.html)
- _build_doc
- the constructor for _HtmlFrameParser
- _read()
- _validate_flavor()
- read_html()

pandas.io.common:
- urlopen()
- get_handle()

Because storage_options is not considered at any of these locations, headers cannot be propagated to the HTTP requests made in urlopen() and get_handle(). Whenever headers are necessary to fetch data, for example from an API that requires an authentication token in the request headers, pandas.read_html fails for this reason.
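The propagation problem can be sketched with a toy version of the call chain. Everything below is hypothetical: toy_read_html, ToyHtmlParser, and toy_urlopen are illustrative stand-ins, not the real pandas.io.html or pandas.io.common code. The opener parameter exists only so the example can run without network access. The sketch shows that headers reach the request only if every layer accepts storage_options and forwards it.

```python
import urllib.request


def toy_urlopen(url, storage_options=None, opener=urllib.request.urlopen):
    """Hypothetical stand-in for a urlopen wrapper that accepts headers."""
    request = urllib.request.Request(url, headers=dict(storage_options or {}))
    return opener(request)


class ToyHtmlParser:
    """Hypothetical parser; not the real _HtmlFrameParser."""

    def __init__(self, io, storage_options=None, opener=urllib.request.urlopen):
        self.io = io
        self.storage_options = storage_options  # kept so _build_doc can use it
        self._opener = opener

    def _build_doc(self):
        # The headers only reach the request because they are forwarded here.
        return toy_urlopen(self.io, self.storage_options, self._opener)


def toy_read_html(io, storage_options=None, opener=urllib.request.urlopen):
    """Hypothetical entry point that threads storage_options to the parser."""
    parser = ToyHtmlParser(io, storage_options=storage_options, opener=opener)
    return parser._build_doc()


# A fake opener that records the request instead of touching the network.
captured = []


def recording_opener(request):
    captured.append(request)
    return b"<table></table>"


result = toy_read_html(
    "https://example.com/table.html",  # placeholder URL
    storage_options={"Auth": "Bearer <token>"},
    opener=recording_opener,
)
print(captured[0].header_items())
```

If any one layer in this chain dropped the storage_options argument, the recorded request would carry no headers, which mirrors the situation described for pandas.read_html.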

Desired vs. Actual Output

Desired Output

The example below shows how pandas.read_json supports storage_options natively:

import pandas as pd

url = 'https://www.sump.org/notes/request/'

# Define headers
headers = {
    'User-Agent': 'Mozilla Firefox v14.0',
    'Accept': 'application/json',
    'Connection': 'keep-alive',
    'Auth': 'Bearer 2*/f3+fe68df*4',
}

# Pass headers in the HTTP request with the optional param "storage_options"
df = pd.read_json(url, storage_options=headers)
print(df)