APIs
Application Programming Interface
Retrieving data through APIs
API stands for Application Programming Interface. It is the interface that allows software applications to communicate with one another. An API is a software-to-software interface, not a user interface. With APIs, applications talk to each other without any user knowledge or intervention.
An example is the Twitter API, a web-based API that allows developers to programmatically interact with Twitter data. It is accessed by making HTTP requests over the Internet to services that Twitter hosts. With a web-based API such as Twitter's, your application sends an HTTP request just like a web browser does, but instead of the response being delivered as a webpage for a human to read, it is returned in a format that applications can easily parse. Various formats exist for this purpose; Twitter uses a popular and easy-to-use format called JSON.
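To illustrate this request/response pattern, the short sketch below calls a public JSON endpoint with the requests library and parses the response into a Python dictionary. GitHub's public API is used here purely as an illustration of a JSON response; any JSON endpoint would do.
#A minimal sketch of calling a web-based JSON API, assuming the
#requests library is installed (pip install requests).
import requests

response = requests.get('https://api.github.com')
print(response.status_code)   #HTTP status code, e.g. 200
data = response.json()        #parse the JSON body into a Python dict
print(list(data.keys())[:5])  #inspect a few of the returned fields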
In order to access the Twitter Streaming API, we need four pieces of information from Twitter: an API key, an API secret, an access token, and an access token secret. If you go to https://apps.twitter.com/ and log in with your Twitter credentials, you can create a new app and generate these credentials for yourself.
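Rather than hard-coding the credentials in a script, one option is to read them from environment variables. The sketch below is only one way of doing this; the environment variable names are an arbitrary choice, and the resulting variables (api_key, api_secret_key, token, secret_token) are the ones referenced in the code further down.
#A minimal sketch: load the four credentials from environment variables
#(the variable names below are an arbitrary choice)
import os

api_key = os.environ['TWITTER_API_KEY']
api_secret_key = os.environ['TWITTER_API_SECRET']
token = os.environ['TWITTER_ACCESS_TOKEN']
secret_token = os.environ['TWITTER_ACCESS_TOKEN_SECRET']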
To work with the Twitter API we need the tweepy library; see https://tweepy.readthedocs.io/en/latest/ for its documentation.
The example below shows a piece of code that streams tweets and appends them to a JSON file.
#source: http://adilmoujahid.com/posts/2014/07/twitter-analytics/
#Import the necessary methods from the tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

#Variables that contain the user credentials to access the Twitter API
access_token = token
access_token_secret = secret_token
consumer_key = api_key
consumer_secret = api_secret_key

#This is a basic listener that just appends incoming tweets to a JSON file
class StdOutListener(StreamListener):
    def on_data(self, data):
        with open('data/result2.json', 'a') as f:
            f.write(data)
        print(data)
        return True

    def on_error(self, status):
        print(status)

if __name__ == '__main__':
    #This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    #This line filters the Twitter stream to capture data by keyword
    stream.filter(track=['Sunday'])
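By default the stream keeps running until you interrupt it. As a rough sketch (not part of the original example), the listener can also stop itself after a fixed number of tweets, because returning False from on_data makes tweepy disconnect the stream; the class name and the limit of 100 below are arbitrary choices.
#Variation on the listener above that disconnects after a fixed number of tweets
class LimitedListener(StreamListener):
    def __init__(self, limit=100):
        super().__init__()
        self.limit = limit
        self.count = 0

    def on_data(self, data):
        with open('data/result2.json', 'a') as f:
            f.write(data)
        self.count += 1
        #Returning False tells tweepy to disconnect the stream
        return self.count < self.limit

    def on_error(self, status):
        print(status)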
Since the data is in JSON format we can process it accordingly. If we open the file we see that it contains one { } record per line, rather than a single JSON array enclosed in [ ]. Therefore we read the file line by line and parse each record separately. After that we can put the data into a pandas DataFrame and process it further.
import json

tweets_data_path = 'data/result2.json'

tweets_data = []
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        #Skip any empty lines in the file
        if not line.strip():
            continue
        tweet = json.loads(line)
        tweets_data.append(tweet)

print(len(tweets_data))
3
import pandas as pd
tweets = pd.DataFrame(tweets_data)
tweets
[Output truncated: a DataFrame of 3 rows × 34 columns, one row per collected tweet. The displayed columns include contributors, coordinates, created_at, display_text_range, entities, extended_tweet, favorite_count, favorited, filter_level, geo, ..., quoted_status_id_str, quoted_status_permalink, reply_count, retweet_count, retweeted, source, text, timestamp_ms, truncated and user.]
For instance, we can select a single column:
tweets['text']
0 Can we put this out to the sports med communit...
1 @mboyle1959 I might clarify that the “low” end...
2 Mechanical efficiency of high versus moderate ...
Name: text, dtype: object
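Some columns, such as user and entities, contain nested dictionaries rather than plain values. As a small sketch (the new column names user_name and followers are simply chosen here), individual fields can be pulled out of such a column with .apply; screen_name and followers_count are standard fields of a tweet's user object.
#Pull individual fields out of the nested 'user' dictionaries
tweets['user_name'] = tweets['user'].apply(lambda u: u.get('screen_name'))
tweets['followers'] = tweets['user'].apply(lambda u: u.get('followers_count'))
tweets[['user_name', 'followers', 'text']]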
Extract with regex
We can apply regular expressions to extract interesting information. In the examples below we first extract the hyperlinks from the tweet text, then we search for the word 'sport'.
import re

def extract_link(text):
    #Return the first hyperlink found in the text, or an empty string
    regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
    match = re.search(regex, text)
    if match:
        return match.group()
    return ''
tweets['link'] = tweets['text'].apply(lambda tweet: extract_link(tweet))
tweets['link']
0 https://t.co/4ne1XxaIVi
1
2 https://t.co/74eXuY4uG0
Name: link, dtype: object
def word_in_text(word, text):
    #Return True if the word occurs (case-insensitively) in the text
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)
    if match:
        return True
    return False
tweets['sport'] = tweets['text'].apply(lambda tweet: word_in_text('sport', tweet))
tweets['sport']
0 True
1 False
2 False
Name: sport, dtype: bool
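The new boolean column can be used directly to filter the DataFrame, for instance to count the matching tweets and show only those rows (a small usage sketch, not part of the original example):
#Count the tweets that mention 'sport' and show only those rows
print(tweets['sport'].sum())
tweets[tweets['sport']][['text', 'link']]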