
Reuse connectionpool in HTTPFileSystem? #240


Closed
rabernat opened this issue Feb 25, 2020 · 2 comments

Comments

@rabernat
Contributor

This issue originally came up in zarr-developers/zarr-python#536.

HTTPFileSystem is slow to fetch many files because it does not reuse a connection pool. Example:

from fsspec.implementations.http import HTTPFileSystem
fs = HTTPFileSystem()

url_base = 'https://mur-sst.s3.us-west-2.amazonaws.com/zarr/time'
def get_chunk_fsspec(n):
    return fs.cat(url_base + f'/{n}')

%prun all_data = [get_chunk_fsspec(i) for i in range(20)]

As you can see, SSL_do_handshake is called 20 times and takes most of the time.

         205967 function calls (205947 primitive calls) in 7.177 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       20    2.726    0.136    2.730    0.137 {built-in method _openssl.SSL_do_handshake}
       20    1.871    0.094    1.871    0.094 {method 'connect' of '_socket.socket' objects}
       40    1.853    0.046    1.853    0.046 {built-in method _openssl.SSL_read}
       20    0.270    0.014    0.270    0.014 {built-in method _openssl.SSL_CTX_load_verify_locations}
...

If we do essentially the same thing with requests, but reuse a Session:

import requests

s = requests.Session()
def get_chunk_http(n):
    r = s.get(url_base + f'/{n}')
    r.raise_for_status()
    return r.content

%prun all_data = [get_chunk_http(i) for i in range(20)] 

In this case, because we reused the session, it goes much faster:

 
         91599 function calls (91579 primitive calls) in 2.114 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       40    1.699    0.042    1.699    0.042 {built-in method _openssl.SSL_read}
        1    0.142    0.142    0.142    0.142 {built-in method _openssl.SSL_do_handshake}
        1    0.069    0.069    0.069    0.069 {method 'connect' of '_socket.socket' objects}
       20    0.025    0.001    0.025    0.001 {built-in method _socket.gethostbyname}
        1    0.016    0.016    0.016    0.016 {built-in method _openssl.SSL_CTX_load_verify_locations}
       20    0.007    0.000    0.007    0.000 {built-in method _scproxy._get_proxy_settings}
...

Could HTTPFileSystem be configured to reuse a session in a similar way?
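
In the meantime, a rough workaround is to route the fetches through a shared requests.Session on a single instance. This is only a sketch of my own (cat_with_session and the monkeypatching are illustrative, not fsspec API):

import requests

shared = requests.Session()

def cat_with_session(url, **kwargs):
    # reuse one Session (and its connection pool) for every fetch
    r = shared.get(url, **kwargs)
    r.raise_for_status()
    return r.content

fs.cat = cat_with_session  # patch this HTTPFileSystem instance only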

@martindurant
Member

This is a bug in the implementation; it is supposed to work as you suggest. The following change fixes it.

--- a/fsspec/implementations/http.py
+++ b/fsspec/implementations/http.py
@@ -107,7 +107,7 @@ class HTTPFileSystem(AbstractFileSystem):
             return list(sorted(out))

     def cat(self, url):
-        r = requests.get(url, **self.kwargs)
+        r = self.session.get(url, **self.kwargs)
         r.raise_for_status()
         return r.content

Note that reusing sessions does present a small thread safety issue: it is possible for the pool to run out of connections when called from multiple threads simultaneously, resulting in connections being evicted and closed.
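
One way to soften that (a sketch only; the pool sizes here are illustrative, not what fsspec sets) is to mount a requests HTTPAdapter with a larger pool on the shared session:

import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# keep more connections alive so concurrent threads are less likely to evict them
adapter = HTTPAdapter(pool_connections=16, pool_maxsize=64)
session.mount("https://", adapter)
session.mount("http://", adapter)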

@martindurant
Member

Sneaked it in here: d80034e
