pandas provides functions for reading data from many formats, including CSV, Excel, JSON, and HTML. Many of these functions accept either the data itself or a URL as input. When given a URL, functions such as pandas.read_json, pandas.read_excel, and pandas.read_html make a GET request to that URL, fetch the response, and then (attempt to) parse it and return it to the user as a DataFrame (or, in the case of pandas.read_html, a list of DataFrames).
However, pandas.read_html does not support storage_options, the keyword that other readers use to pass headers along with the HTTP request. As a result, it is currently not possible to send headers when pandas.read_html is used with a URL, so pandas.read_html cannot fetch data from any endpoint that requires specific information in the request headers, such as authentication or other special instructions.
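To make the failure mode concrete, here is a minimal sketch that uses only the Python standard library, not pandas. It starts a throwaway local server that serves an HTML table only when an Authorization header is present, then requests it with and without that header. The handler class, the token, and the helper name are all hypothetical and exist only for this illustration.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

HTML_TABLE = b"<table><tr><th>id</th></tr><tr><td>1</td></tr></table>"
TOKEN = "Bearer example-token"  # hypothetical credential for this demo


class AuthOnlyHandler(BaseHTTPRequestHandler):
    """Serve the table only when the expected Authorization header is sent."""

    def do_GET(self):
        authorized = self.headers.get("Authorization") == TOKEN
        body = HTML_TABLE if authorized else b"missing or invalid token"
        self.send_response(200 if authorized else 401)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


# An opener with an empty ProxyHandler ignores any proxy set in the environment.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def get_status(url, headers=None):
    """Return the HTTP status code of a GET request sent with optional headers."""
    request = urllib.request.Request(url, headers=headers or {})
    try:
        with opener.open(request) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


server = HTTPServer(("127.0.0.1", 0), AuthOnlyHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

status_without_headers = get_status(url)
status_with_headers = get_status(url, {"Authorization": TOKEN})
print(status_without_headers, status_with_headers)

server.shutdown()
server.server_close()
```

A reader that cannot attach headers is always in the position of the first request: the server rejects it before any HTML reaches the parser.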
pandas.read_html doesn’t support the storage_options keyword, while many other functions such as pandas.read_json do, because storage_options is not accounted for in the class hierarchy and infrastructure of pandas.io.html.
storage_options is not expected or considered as an argument in the following parts of the API:
pandas.io.html:
  pandas._typing
  _setup_build_doc (for each of the two parsers in pandas.io.html)
  _build_doc
  _HtmlFrameParser
  _read()
  _validate_flavor()
  read_html()
pandas.io.common:
  urlopen()
  get_handle()
Because storage_options is not handled in any of these locations, headers cannot be propagated to the HTTP requests made in urlopen() and get_handle(). Whenever headers are required to fetch data, for example from an API that expects an authentication token in the request headers, pandas.read_html will fail for this reason.
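The propagation problem can be illustrated with a toy model. The functions below are hypothetical stand-ins, not the actual pandas implementations: they only mimic a reader that calls a handle helper, which in turn builds the HTTP request. The assumption is that headers reach the request only if every layer accepts storage_options and passes it down. No network request is made; the sketch just inspects the request object that would be sent.

```python
import urllib.request


def toy_urlopen(url, storage_options=None):
    """Lowest layer: build the request, attaching storage_options as headers."""
    return urllib.request.Request(url, headers=storage_options or {})


def toy_get_handle(url, storage_options=None):
    """Middle layer: must accept storage_options and pass it down."""
    return toy_urlopen(url, storage_options=storage_options)


def toy_reader_with_options(url, storage_options=None):
    """A reader that forwards storage_options through every layer."""
    return toy_get_handle(url, storage_options=storage_options)


def toy_reader_without_options(url):
    """A reader with no storage_options parameter: nothing to pass down."""
    return toy_get_handle(url)


headers = {"Authorization": "Bearer example-token"}  # hypothetical token
url = "https://example.com/table"  # placeholder URL, never contacted

with_options = toy_reader_with_options(url, storage_options=headers)
without_options = toy_reader_without_options(url)

print(with_options.header_items())     # headers attached to the request
print(without_options.header_items())  # no headers to send
```

In this model, fixing only the top-level reader is not enough: if any intermediate layer drops the argument, the request is built without headers.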
The example below shows how pandas.read_json supports storage_options natively:
import pandas as pd

url = 'https://www.sump.org/notes/request/'

# Define headers
headers = {
    'User-Agent': 'Mozilla Firefox v14.0',
    'Accept': 'application/json',
    'Connection': 'keep-alive',
    'Auth': 'Bearer 2*/f3+fe68df*4'
}

# Pass headers in the HTTP request with the optional param "storage_options"
df = pd.read_json(url, storage_options=headers)
print(df)
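
Where pandas.read_html does not accept storage_options, one possible workaround is to make the HTTP request yourself and hand the downloaded HTML to pandas.read_html as a file-like object. The sketch below is an illustration rather than a pandas feature: build_request and read_html_with_headers are hypothetical helper names, the URL and token are placeholders, and the final call is left commented out because it needs network access, pandas, and an HTML parser such as lxml.

```python
import io
import urllib.request


def build_request(url, headers):
    """Create a GET request that carries the given headers."""
    return urllib.request.Request(url, headers=headers)


def read_html_with_headers(url, headers):
    """Fetch the page manually, then pass the HTML text to pandas.read_html."""
    import pandas as pd  # imported lazily so build_request works without pandas

    with urllib.request.urlopen(build_request(url, headers)) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset)
    # read_html parses the in-memory text, so pandas makes no HTTP request itself
    return pd.read_html(io.StringIO(html))


headers = {"Authorization": "Bearer example-token"}  # hypothetical token
url = "https://example.com/protected-table"  # placeholder URL

request = build_request(url, headers)
print(request.get_method(), request.header_items())

# tables = read_html_with_headers(url, headers)  # returns a list of DataFrames
```

This keeps header handling entirely outside pandas, at the cost of duplicating the fetch step that readers with storage_options support perform internally.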