API stands for Application Programming Interface. It is the interface that allows software applications to communicate with one another. An API is a software-to-software interface, not a user interface. With APIs, applications talk to each other without any user knowledge or intervention.
An example is the Twitter API: a web-based JSON API that allows developers to programmatically interact with Twitter data. Because it is web-based, it must be accessed by making requests over the Internet to services that Twitter hosts. With a web-based API such as Twitter's, your application sends an HTTP request, just like a web browser does, but instead of the response being delivered as a webpage meant for human reading, it is returned in a format that applications can easily parse. Various formats exist for this purpose, and Twitter uses a popular and easy-to-use format called JSON.
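As a rough illustration, here is a minimal sketch of what calling a JSON web API looks like from Python with the requests library; the URL and the query parameters are hypothetical placeholders, not a real Twitter endpoint.

import requests

# Hypothetical endpoint and parameters, for illustration only
response = requests.get("https://api.example.com/tweets",
                        params={"q": "Sunday", "count": 10})

print(response.status_code)   # e.g. 200 on success
data = response.json()        # parse the JSON body into Python dicts/lists
print(type(data))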
In order to access the Twitter Streaming API, we need to get four pieces of information from Twitter: an API key, an API secret, an access token, and an access token secret. If you go to https://apps.twitter.com/ and log in with your Twitter credentials, you can create a new app and obtain these credentials for yourself.
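One way to keep these credentials out of the source code is to read them from environment variables. The sketch below does this and defines the variable names used in the streaming script further down; the environment variable names themselves are just an assumed convention.

import os

# Assumed environment variable names; set them in your shell before running
api_key = os.environ["TWITTER_API_KEY"]
api_secret_key = os.environ["TWITTER_API_SECRET_KEY"]
token = os.environ["TWITTER_ACCESS_TOKEN"]
secret_token = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]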
In the example below we see a piece of code that downloads the tweets into a JSON file.
# Source: http://adilmoujahid.com/posts/2014/07/twitter-analytics/
# Import the necessary methods from the tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

# Variables that contain the user credentials to access the Twitter API
access_token = token
access_token_secret = secret_token
consumer_key = api_key
consumer_secret = api_secret_key

# This is a basic listener that just stores tweets in a JSON file
class StdOutListener(StreamListener):

    def on_data(self, data):
        with open('data/result2.json', 'a') as f:
            f.write(data)
        print(data)
        return True

    def on_error(self, status):
        print(status)

if __name__ == '__main__':
    # This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    # This line filters the Twitter stream to capture tweets containing the keyword 'Sunday'
    stream.filter(track=['Sunday'])
Since this is JSON, we can process the data accordingly. If we open the JSON file we see that it contains one record per line in the { } object format, but the records are not enclosed in a [ ] array. Therefore we apply a little trick: we read the file line by line, parse each record, and collect the records in a Python list (the [ ] format). After that we can put the data into a pandas DataFrame and process it further.
The loading code (shown further below) produces a DataFrame with one row per tweet and one column per JSON field. A transposed preview, with long values truncated (column | row 0 | row 1 | row 2):

contributors | None | None | None
coordinates | None | None | None
created_at | Sun May 19 17:53:12 +0000 2019 | Sun May 19 17:56:17 +0000 2019 | Sun May 19 17:59:11 +0000 2019
display_text_range | NaN | [12, 130] | NaN
entities | {'hashtags': [], 'urls': [{'expanded_url': 'ht... | {'hashtags': [], 'urls': [], 'symbols': [], 'u... | {'hashtags': [], 'urls': [{'expanded_url': 'ht...
extended_tweet | {'full_text': 'Can we put this out to the spor... | NaN | {'full_text': 'Mechanical efficiency of high v...
favorite_count | 0 | 0 | 0
favorited | False | False | False
filter_level | low | low | low
geo | None | None | None
... | ... | ... | ...
quoted_status_id_str | 1130144770255413248 | NaN | NaN
quoted_status_permalink | {'expanded': 'https://twitter.com/girlontheriv... | NaN | NaN
reply_count | 0 | 0 | 0
retweet_count | 0 | 0 | 0
retweeted | False | False | False
source | <a href="http://twitter.com/download/iphone" r... | <a href="http://twitter.com/download/iphone" r... | <a href="http://twitter.com" rel="nofollow">Tw...
text | Can we put this out to the sports med communit... | @mboyle1959 I might clarify that the “low” end... | Mechanical efficiency of high versus moderate ...
timestamp_ms | 1558288392774 | 1558288577756 | 1558288751178
truncated | True | False | True
user | {'listed_count': 49, 'following': None, 'defau... | {'listed_count': 3, 'following': None, 'defaul... | {'listed_count': 18, 'following': None, 'defau...

3 rows × 34 columns
For instance we can select a column. The code below first reads the JSON file line by line into a list of records, loads that list into a pandas DataFrame, and then selects the text column.
import json

tweets_data_path = 'data/result2.json'

tweets_data = []
tweets_file = open(tweets_data_path, "r")
for line in tweets_file:
    # each line of the file holds one tweet as a JSON object
    tweet = json.loads(line)
    tweets_data.append(tweet)

print(len(tweets_data))
3
import pandas as pd

# one row per tweet, one column per JSON field: the 3 × 34 DataFrame previewed above
tweets = pd.DataFrame(tweets_data)
tweets
tweets['text']
0 Can we put this out to the sports med communit...
1 @mboyle1959 I might clarify that the “low” end...
2 Mechanical efficiency of high versus moderate ...
Name: text, dtype: object

Extract by regex
We can apply a regex to filter interesting information. In the examples below we first extract the hyperlinks, then we search for the word 'sport'.
import re

def extract_link(text):
    # match the first http(s):// or www. link in the tweet text
    regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
    match = re.search(regex, text)
    if match:
        return match.group()
    return ''
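One possible way to use this helper on the DataFrame built above is sketched below; the new column names are only illustrative, and the original article may organise this step differently.

# add a column with the first link found in each tweet (illustrative column names)
tweets['link'] = tweets['text'].apply(extract_link)

# flag the tweets whose text mentions the word 'sport'
tweets['about_sport'] = tweets['text'].str.contains('sport', case=False)

print(tweets[['text', 'link', 'about_sport']])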