In this tutorial we look at the following methods, which are used to retrieve files over HTTP.
data = urllib.request.urlopen(url) # open a URL
text = data.read().decode() # decode the received bytes (UTF-8 by default) into a Python string
Source: py4e.com
The network protocol that powers the web is actually quite simple and there is built-in support in Python called sockets which makes it very easy to make network connections and retrieve data over those sockets in a Python program. A socket is much like a file, except that a single socket provides a two-way connection between two programs. You can both read from and write to the same socket. If you write something to a socket, it is sent to the application at the other end of the socket. If you read from the socket, you are given the data which the other application has sent. But if you try to read a socket when the program on the other end of the socket has not sent any data, you just sit and wait. If the programs on both ends of the socket simply wait for some data without sending anything, they will wait for a very long time, so an important part of programs that communicate over the Internet is to have some sort of protocol.
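The two-way nature of a socket can be tried out without any network at all. The following is a minimal sketch, assuming that socket.socketpair() is available on your platform; the variable names and the messages are made up for illustration.

```python
import socket

# socketpair() returns two sockets that are already connected to each
# other, so no server or network connection is needed here.
left, right = socket.socketpair()

# Each end can write to its socket...
left.sendall(b'hello from left')
right.sendall(b'hello from right')

# ...and each end can read what the other end has sent.
received_by_right = right.recv(1024)
received_by_left = left.recv(1024)

print(received_by_right.decode())
print(received_by_left.decode())

left.close()
right.close()
```

If neither end had sent anything before calling recv(), both calls would block and wait, which is exactly the situation a protocol is meant to prevent.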
The HyperText Transfer Protocol is described in the following document: . For example, to request a document from the web server bioinf.nl, we make a connection to the bioinf.nl server on port 80 and then send a line of the form
GET https://bioinf.nl/~fennaf/poem.txt HTTP/1.0\r\n\r\n
The HTTP protocol says we must send the GET command followed by a blank line. \r\n signifies an EOL (end of line), so \r\n\r\n signifies nothing between two EOL sequences, which is the equivalent of a blank line. A socket transmits bytes, not Python strings, so we use the method .encode() to convert the GET command from a Unicode string into a bytes object (UTF-8 is the default encoding) before sending it to the server.
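To see what is actually sent, the request can be built and inspected before any connection is made. This is only a sketch: build_get_request is a hypothetical helper written for this illustration, not a standard library function.

```python
# Hypothetical helper for illustration, not part of the standard library.
def build_get_request(url):
    """Return the bytes of a minimal HTTP/1.0 GET request for url."""
    request_line = 'GET ' + url + ' HTTP/1.0'
    # The first '\r\n' ends the request line, the second one is the blank line.
    return (request_line + '\r\n\r\n').encode()

cmd = build_get_request('https://bioinf.nl/~fennaf/poem.txt')
print(type(cmd))
print(cmd)
```

Printing a bytes object shows the \r\n sequences literally, which makes it easy to check that the request ends with a blank line.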
Once we send that blank line, we write a loop that receives data from the socket in chunks of at most 512 bytes and prints the data out until there is no more data to read (i.e., recv() returns an empty bytes object). To turn each chunk into a Python string we decode it from UTF-8 using the method .decode(). When finished, we close the socket.
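import socket
# Open a TCP connection to the web server on port 80
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('bioinf.nl', 80))
# The request must be sent as bytes and must end with a blank line
cmd = 'GET https://bioinf.nl/~fennaf/poem.txt HTTP/1.0\r\n\r\n'.encode()
mysock.send(cmd)
while True:
    data = mysock.recv(512)  # read at most 512 bytes
    if len(data) < 1:  # nothing left to read
        break
    print(data.decode(), end='')
mysock.close()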
HTTP/1.1 200 OK
Date: Tue, 23 Apr 2019 07:11:36 GMT
Server: Apache/2.4.25 (Debian)
Last-Modified: Mon, 22 Apr 2019 17:59:14 GMT
ETag: "705-5872239fbaee7"
Accept-Ranges: bytes
Content-Length: 1797
Vary: Accept-Encoding
Connection: close
Content-Type: text/plain
The Eagle soars in the summit of Heaven,
The Hunter with his dogs pursues his circuit.
O perpetual revolution of configured stars,
O perpetual recurrence of determined seasons,
O world of spring and autumn, birth and dying!
The Eagle soars in the summit of Heaven,
The Hunter with his dogs pursues his circuit.
O perpetual revolution of configured stars,
O perpetual recurrence of determined seasons,
O world of spring and autumn, birth and dying!
The endless cycle of idea and action,
Endless invention, endless experiment,
Brings knowledge of motion, but not of stillness;
Knowledge of speech, but not of silence;
Knowledge of words, and ignorance of the Word.
All our knowledge brings us nearer to death,
But nearness to death no nearer to God.
Where is the Life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?
The cycles of heaven in twenty centuries
Brings us farther from God and nearer to the Dust.
The lot of man is ceaseless labor,
Or ceaseless idleness, which is still harder,
Or irregular labour, which is not pleasant.
I have trodden the winepress alone, and I know
That it is hard to be really useful, resigning
The things that men count for happiness, seeking
The good deeds that lead to obscurity, accepting
With equal face those that bring ignominy,
The applause of all or the love of none.
All men are ready to invest their money
But most expect dividends.
I say to you: Make perfect your will.
I say: take no thought of the harvest,
But only of proper sowing.
The world turns and the world changes,
But one thing does not change.
In all of my years, one thing does not change,
However you disguise it, this thing does not change:
The perpetual struggle of Good and Evil.
from "The Rock"
by T.S. Eliot
The output starts with headers which the web server sends to describe the document. For example, the Content-Type header indicates that the document is a plain text document (text/plain). After the server sends us the headers, it adds a blank line to indicate the end of the headers, and then sends the actual data of the file poem.txt.
This example shows how to make a low-level network connection with sockets. Sockets can be used to communicate with a web server or with a mail server or many other kinds of servers. All that is needed is to find the document which describes the protocol and write the code to send and receive the data according to the protocol.
However, since the protocol that we use most commonly is the HTTP web protocol, Python has a special library specifically designed to support the HTTP protocol for the retrieval of documents and data over the web. One of the requirements for using the HTTP protocol is the need to send and receive data as bytes objects, instead of strings. In the preceding example, the encode() and decode() methods convert strings into bytes objects and back again.
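The layout of the response (headers, blank line, data) can be taken apart with ordinary bytes methods. The sketch below works on a shortened, made-up response rather than on a live connection; the variable names are chosen for illustration only.

```python
# A shortened, made-up response with the same layout as the real output.
raw = (b'HTTP/1.1 200 OK\r\n'
       b'Content-Type: text/plain\r\n'
       b'Connection: close\r\n'
       b'\r\n'
       b'The Eagle soars in the summit of Heaven,\n')

# partition() splits at the first blank line, i.e. two EOLs in a row.
header_bytes, _, body_bytes = raw.partition(b'\r\n\r\n')

header_lines = header_bytes.decode().split('\r\n')
status_line = header_lines[0]

# Turn the remaining "Name: value" lines into a dictionary.
headers = {}
for line in header_lines[1:]:
    name, _, value = line.partition(':')
    headers[name.strip()] = value.strip()

print(status_line)
print(headers['Content-Type'])
print(body_bytes.decode(), end='')
```

With a real socket the whole response would first have to be collected from the recv() loop before it can be split this way.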
Using urllib
While we can manually send and receive data over HTTP using the socket library, there is a much simpler way to perform this common task in Python by using the urllib library. Using urllib, you can treat a web page much like a file. You simply indicate which web page you would like to retrieve and urllib handles all of the HTTP protocol and header details. The equivalent code to read the poem.txt file from the web using urllib is as follows:
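import urllib.request
import ssl
# These settings switch off certificate verification for this example;
# do not use them with servers you do not trust
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
fhand = urllib.request.urlopen('https://bioinf.nl/~fennaf/poem.txt', context=ctx)
for line in fhand:
    # each line arrives as bytes; decode it and strip the newline
    print(line.decode().strip())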
The Eagle soars in the summit of Heaven,
The Hunter with his dogs pursues his circuit.
O perpetual revolution of configured stars,
O perpetual recurrence of determined seasons,
O world of spring and autumn, birth and dying!
The Eagle soars in the summit of Heaven,
The Hunter with his dogs pursues his circuit.
O perpetual revolution of configured stars,
O perpetual recurrence of determined seasons,
O world of spring and autumn, birth and dying!
The endless cycle of idea and action,
Endless invention, endless experiment,
Brings knowledge of motion, but not of stillness;
Knowledge of speech, but not of silence;
Knowledge of words, and ignorance of the Word.
All our knowledge brings us nearer to death,
But nearness to death no nearer to God.
Where is the Life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?
The cycles of heaven in twenty centuries
Brings us farther from God and nearer to the Dust.
The lot of man is ceaseless labor,
Or ceaseless idleness, which is still harder,
Or irregular labour, which is not pleasant.
I have trodden the winepress alone, and I know
That it is hard to be really useful, resigning
The things that men count for happiness, seeking
The good deeds that lead to obscurity, accepting
With equal face those that bring ignominy,
The applause of all or the love of none.
All men are ready to invest their money
But most expect dividends.
I say to you: Make perfect your will.
I say: take no thought of the harvest,
But only of proper sowing.
The world turns and the world changes,
But one thing does not change.
In all of my years, one thing does not change,
However you disguise it, this thing does not change:
The perpetual struggle of Good and Evil.
from "The Rock"
by T.S. Eliot
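Once the web page has been opened with urllib.request.urlopen, we can treat it like a file and read through it using a for loop. For example, the following program retrieves poem.txt and counts how often each word occurs: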
import urllib.request
import ssl
# Certificate verification is switched off for this example only
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
fhand = urllib.request.urlopen('https://bioinf.nl/~fennaf/poem.txt', context=ctx)
counts = dict()
for line in fhand:
    words = line.decode().split()
    for word in words:
        # start at 0 for a new word, then add 1
        counts[word] = counts.get(word, 0) + 1
print(counts)
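Because the object returned by urlopen behaves like a file that yields bytes, the counting logic can be tried out without a network connection. The following is a sketch under that assumption: count_words is a hypothetical helper, and io.BytesIO stands in for the response object.

```python
import io

# Hypothetical helper for illustration.
def count_words(fhand):
    """Count words in any binary file-like object, one line at a time."""
    counts = dict()
    for line in fhand:
        for word in line.decode().split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# io.BytesIO stands in for the object returned by urlopen(); the two
# lines of text are taken from the poem.
sample = io.BytesIO(b'The world turns and the world changes,\n'
                    b'But one thing does not change.\n')
counts = count_words(sample)

# Show the most frequent words first.
for word, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:3]:
    print(word, n)
```

Note that split() only separates on whitespace, so 'The' and 'the' are counted separately and punctuation stays attached to the words.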
In the example above the text file is in plain txt format, which is easy to read and handle. Plain text, however, makes it difficult to send structured data such as an object or a table. For that reason, formats such as HTML, XML and JSON were invented.
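As a small taste of what such a format offers, the sketch below uses the json module from the standard library to turn a table into bytes and back again. The table contents are made up for illustration.

```python
import json

# A small made-up table: a list of rows, each row a dictionary.
table = [{'word': 'world', 'count': 2}, {'word': 'thing', 'count': 1}]

# Serialise to text and then to bytes, as it would travel over HTTP.
payload = json.dumps(table).encode()

# The receiving side decodes the bytes and rebuilds the same structure.
received = json.loads(payload.decode())
print(received[0]['word'], received[0]['count'])
```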