HTTP
HyperText Transfer Protocol
In this tutorial, we look at the usage of the following methods. These are methods for files that need to be retrieved by HTTP.
data = urllib.request.urlopen() # open an url
data.decode() # decode to python UTF-8 readable informationSource: py4e.com
The network protocol that powers the web is actually quite simple and there is built-in support in Python called sockets which makes it very easy to make network connections and retrieve data over those sockets in a Python program. A socket is much like a file, except that a single socket provides a two-way connection between two programs. You can both read from and write to the same socket. If you write something to a socket, it is sent to the application at the other end of the socket. If you read from the socket, you are given the data which the other application has sent. But if you try to read a socket when the program on the other end of the socket has not sent any data, you just sit and wait. If the programs on both ends of the socket simply wait for some data without sending anything, they will wait for a very long time, so an important part of programs that communicate over the Internet is to have some sort of protocol.
The HyperText Transfer Protocol is described in the following document: https://www.w3.org/Protocols/rfc2616/rfc2616.txt. For example to request a document from a the web server bioinf.nl , we make a connection to the bioinf.nl server on port 80, and then send a line of the form
GET https://bioinf.nl/~fennaf/poem.txt HTTP/1.0\r\n\r\n'The HTTP protocol says we must send the GET command followed by a blank line. \r\n signifies an EOL (end of line), so \r\n\r\n signifies nothing between two EOL sequences. That is the equivalent of a blank line. Since the internet does not speak 'unicode' but needs UTF-8 coded strings we use the method .encode() to encode the get command into a UTF-8 readable format for the server. (Which is more efficient to process)
Once we send that blank line, we write a loop that receives data in 512-character chunks from the socket and prints the data out until there is no more data to read (i.e., the recv() returns an empty string). To make it a string in python we need to decode the data from UTF-8 to unicode using the method .decode(). When finished we can close the socket.
import socket
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('bioinf.nl', 80))
cmd = 'GET https://bioinf.nl/~fennaf/poem.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)
while True:
data = mysock.recv(512)
if len(data) < 1:
break
print(data.decode(),end='')
mysock.close()The output starts with headers which the web server sends to describe the document. For example, the Content-Type header indicates that the document is a plain text document ( text/plain ). After the server sends us the headers, it adds a blank line to indicate the end of the headers, and then sends the actual data of the file romeo.txt. This example shows how to make a low-level network connection with sockets. Sockets can be used to communicate with a web server or with a mail server or many other kinds of servers. All that is needed is to find the document which describes the protocol and write the code to send and receive the data according to the protocol. However, since the protocol that we use most commonly is the HTTP web protocol, Python has a special library specifically designed to support the HTTP protocol for the retrieval of documents and data over the web. One of the requirements for using the HTTP protocol is the need to send and receive data as bytes objects, instead of strings. In the preceding example, the encode() and decode() methods convert strings into bytes objects and back again.
using urllib
While we can manually send and receive data over HTTP using the socket library, there is a much simpler way to perform this common task in Python by using the urllib library. Using urllib , you can treat a web page much like a file. You simply indicate which web page you would like to retrieve and urllib handles all of the HTTP protocol and header details. The equivalent code to read the romeo.txt file from the web using urllib is as follows:
Once the web page has been opened with urllib.urlopen , we can treat it like a file and read through it using a for loop.
In the example above the textfile is a simple txt format, easy to read and handle. A txt file is difficult to send data in a format of an object or a table. For that reason, protocols like HTML XML and JSON are invented.
Last updated
Was this helpful?
