API stands for Application Programming Interface. It is the interface that allows software applications to communicate with one another. An API is a software-to-software interface, not a user interface. With APIs, applications talk to each other without any user knowledge or intervention.
An example is the Twitter API: a web-based JSON API that allows developers to programmatically interact with Twitter data. Because it is web-based, it must be accessed by making requests over the Internet to services that Twitter hosts. With a web-based API such as Twitter's, your application sends an HTTP request, just like a web browser does, but instead of the response being delivered as a webpage meant for human reading, it is returned in a format that applications can easily parse. Various formats exist for this purpose, and Twitter uses a popular and easy-to-use format called JSON.
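As a rough illustration, here is a minimal sketch of what calling a JSON web API looks like from Python with the requests library; the URL and the query parameters are hypothetical placeholders, not a real Twitter endpoint.

import requests

# Hypothetical endpoint and parameters, for illustration only
response = requests.get("https://api.example.com/tweets",
                        params={"q": "Sunday", "count": 10})

print(response.status_code)   # e.g. 200 on success
data = response.json()        # parse the JSON body into Python dicts/lists
print(type(data))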
In order to access the Twitter Streaming API, we need to get four pieces of information from Twitter: an API key, an API secret, an access token, and an access token secret. If you go to https://apps.twitter.com/ and log in with your Twitter credentials, you can create a new app and obtain these credentials for yourself.
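One way to keep these credentials out of the source code is to read them from environment variables. The sketch below does this and defines the variable names used in the streaming script further down; the environment variable names themselves are just an assumed convention.

import os

# Assumed environment variable names; set them in your shell before running
api_key = os.environ["TWITTER_API_KEY"]
api_secret_key = os.environ["TWITTER_API_SECRET_KEY"]
token = os.environ["TWITTER_ACCESS_TOKEN"]
secret_token = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]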
In the example below we see a piece of code that downloads the tweets into a JSON file.
# Source: http://adilmoujahid.com/posts/2014/07/twitter-analytics/
# Import the necessary methods from the tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

# Variables that contain the user credentials to access the Twitter API
access_token = token
access_token_secret = secret_token
consumer_key = api_key
consumer_secret = api_secret_key

# This is a basic listener that just stores tweets in a JSON file
class StdOutListener(StreamListener):

    def on_data(self, data):
        with open('data/result2.json', 'a') as f:
            f.write(data)
        print(data)
        return True

    def on_error(self, status):
        print(status)

if __name__ == '__main__':
    # This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    # This line filters the Twitter stream to capture tweets containing the keyword 'Sunday'
    stream.filter(track=['Sunday'])
Since this is JSON, we can process the data accordingly. If we open the JSON file we see that it contains one record per line in the { } object format, but the records are not enclosed in a [ ] array. Therefore we apply a little trick: we read the file line by line, parse each record, and collect the records in a Python list (the [ ] format). After that we can put the data into a pandas DataFrame and process it further.
The loading code (shown further below) produces a DataFrame with one row per tweet and one column per JSON field. A transposed preview, with long values truncated (column | row 0 | row 1 | row 2):

contributors | None | None | None
coordinates | None | None | None
created_at | Sun May 19 17:53:12 +0000 2019 | Sun May 19 17:56:17 +0000 2019 | Sun May 19 17:59:11 +0000 2019
display_text_range | NaN | [12, 130] | NaN
entities | {'hashtags': [], 'urls': [{'expanded_url': 'ht... | {'hashtags': [], 'urls': [], 'symbols': [], 'u... | {'hashtags': [], 'urls': [{'expanded_url': 'ht...
extended_tweet | {'full_text': 'Can we put this out to the spor... | NaN | {'full_text': 'Mechanical efficiency of high v...
favorite_count | 0 | 0 | 0
favorited | False | False | False
filter_level | low | low | low
geo | None | None | None
... | ... | ... | ...
quoted_status_id_str | 1130144770255413248 | NaN | NaN
quoted_status_permalink | {'expanded': 'https://twitter.com/girlontheriv... | NaN | NaN
reply_count | 0 | 0 | 0
retweet_count | 0 | 0 | 0
retweeted | False | False | False
source | <a href="http://twitter.com/download/iphone" r... | <a href="http://twitter.com/download/iphone" r... | <a href="http://twitter.com" rel="nofollow">Tw...
text | Can we put this out to the sports med communit... | @mboyle1959 I might clarify that the “low” end... | Mechanical efficiency of high versus moderate ...
timestamp_ms | 1558288392774 | 1558288577756 | 1558288751178
truncated | True | False | True
user | {'listed_count': 49, 'following': None, 'defau... | {'listed_count': 3, 'following': None, 'defaul... | {'listed_count': 18, 'following': None, 'defau...

3 rows × 34 columns
For instance we can select a column. The code below first reads the JSON file line by line into a list of records, loads that list into a pandas DataFrame, and then selects the text column.
import json

tweets_data_path = 'data/result2.json'

tweets_data = []
tweets_file = open(tweets_data_path, "r")
for line in tweets_file:
    # each line of the file holds one tweet as a JSON object
    tweet = json.loads(line)
    tweets_data.append(tweet)

print(len(tweets_data))
3
import pandas as pd

# one row per tweet, one column per JSON field: the 3 × 34 DataFrame previewed above
tweets = pd.DataFrame(tweets_data)
tweets
tweets['text']
0 Can we put this out to the sports med communit...
1 @mboyle1959 I might clarify that the “low” end...
2 Mechanical efficiency of high versus moderate ...
Name: text, dtype: object

Extract by regex
We can apply a regex to filter interesting information. In the examples below we first extract the hyperlinks, then we search for the word 'sport'.
import re

def extract_link(text):
    # match the first http(s):// or www. link in the tweet text
    regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
    match = re.search(regex, text)
    if match:
        return match.group()
    return ''
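One possible way to use this helper on the DataFrame built above is sketched below; the new column names are only illustrative, and the original article may organise this step differently.

# add a column with the first link found in each tweet (illustrative column names)
tweets['link'] = tweets['text'].apply(extract_link)

# flag the tweets whose text mentions the word 'sport'
tweets['about_sport'] = tweets['text'].str.contains('sport', case=False)

print(tweets[['text', 'link', 'about_sport']])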