HTTP 1.1 implements chunking as a way for servers to tell clients how much content is still to come: the body is sent as a series of chunks, each prefixed with its size in hex, so a server can send more than one response over a single HTTP connection without knowing the total length up front. Unfortunately for me, the site I was trying to access has a buggy chunking implementation, and that causes the somewhat fragile Python urllib2 code to throw an exception:
    Traceback (most recent call last):
      File "./mythingie.py", line 55, in ?
        xml = remote.readlines()
      File "/usr/lib/python2.4/socket.py", line 382, in readlines
        line = self.readline()
      File "/usr/lib/python2.4/socket.py", line 332, in readline
        data = self._sock.recv(self._rbufsize)
      File "/usr/lib/python2.4/httplib.py", line 460, in read
        return self._read_chunked(amt)
      File "/usr/lib/python2.4/httplib.py", line 499, in _read_chunked
        chunk_left = int(line, 16)
    ValueError: invalid literal for int():
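To see where that ValueError comes from, here is a rough sketch of how a chunked body gets decoded. This is not httplib's actual code, just a simplified illustration in modern Python; `read_chunked` is a name I made up, and it ignores trailers. The empty literal in the error message is consistent with the server sending a blank or missing chunk-size line, which is the failure the second example provokes.

```python
import io


def read_chunked(stream):
    """Decode a chunked HTTP body from a binary file-like object.

    Hypothetical helper for illustration only; it skips trailers and
    does no validation beyond parsing the chunk size.
    """
    body = []
    while True:
        line = stream.readline()
        # A size line looks like b"1a\r\n", optionally with ";extension".
        size_field = line.split(b";", 1)[0].strip().decode("ascii")
        # This is the step that blows up on a blank or non-hex size line.
        chunk_left = int(size_field, 16)
        if chunk_left == 0:
            break  # a zero-length chunk marks the end of the body
        body.append(stream.read(chunk_left))
        stream.read(2)  # discard the CRLF that follows each chunk's data
    return b"".join(body)


# A well-formed body: two chunks, then the terminating zero-length chunk.
good = io.BytesIO(b"5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n")
print(read_chunked(good))

# A truncated body: the stream ends where a size line should be.
bad = io.BytesIO(b"5\r\nhello\r\n")
try:
    read_chunked(bad)
except ValueError as err:
    print("decode failed:", err)
```

The point is that the decoder trusts every size line to be valid hex, so one sloppy line from the server is enough to kill the whole read.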
I muttered about this earlier today, including finding the bug tracking the problem in pythonistan. However, finding a bug marked “will not fix” wasn’t satisfying enough…
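It turns out you can just have urllib2 lie to the server about which HTTP version it speaks, and therefore turn off chunking. Chunked encoding only exists in HTTP 1.1, so a server answering a 1.0 request has to send the body unchunked. Here’s my sample code for how to do that: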
    import httplib
    import urllib2

    class HTTP10Connection(httplib.HTTPConnection):
        """HTTP10Connection -- a HTTP connection which is forced to ask
        for HTTP 1.0
        """
        # This string is what httplib puts on the request line
        _http_vsn_str = 'HTTP/1.0'

    class HTTP10Handler(urllib2.HTTPHandler):
        """HTTP10Handler -- don't use HTTP 1.1"""

        def http_open(self, req):
            return self.do_open(HTTP10Connection, req)

    # ...

    request = urllib2.Request(feed)
    request.add_header('User-Agent', 'mythingie')
    # Passing our handler replaces the default HTTP handler in the opener
    opener = urllib2.build_opener(HTTP10Handler())

    remote = opener.open(request)
    content = remote.readlines()
    remote.close()
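For anyone on Python 3, the same trick should carry over more or less directly, since httplib became `http.client` and urllib2 became `urllib.request`. The sketch below is my guess at a port, not the code I actually ran above. It assumes the private `_http_vsn` and `_http_vsn_str` attributes keep their current meaning, and `build_http10_opener` and `fetch_lines` are helper names invented for the example.

```python
import http.client
import urllib.request


class HTTP10Connection(http.client.HTTPConnection):
    """An HTTP connection which is forced to ask for HTTP 1.0."""

    # Private attributes of http.client.HTTPConnection; relying on them
    # is an assumption that could break in a future Python release.
    _http_vsn = 10
    _http_vsn_str = "HTTP/1.0"


class HTTP10Handler(urllib.request.HTTPHandler):
    """Don't use HTTP 1.1."""

    def http_open(self, req):
        return self.do_open(HTTP10Connection, req)


def build_http10_opener():
    """Hypothetical helper: an opener whose plain-HTTP requests use 1.0."""
    # Passing a subclass instance replaces the default HTTPHandler.
    return urllib.request.build_opener(HTTP10Handler())


def fetch_lines(url, user_agent="mythingie"):
    """Hypothetical helper: fetch url over HTTP 1.0, return its lines."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with build_http10_opener().open(request) as remote:
        return remote.readlines()
```

Usage would be something like `content = fetch_lines(feed)`. Note this only covers plain http URLs; https would need the same treatment applied to `HTTPSConnection` and `HTTPSHandler`.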
I hereby declare myself Michael Still, bringer of the gross python hacks.