Pulling User Data from Fosstodon


I stopped using Facebook a while ago. Aside from its problems with privacy and free expression, it was a time sink, and the news feed served mostly as a source of unnecessary anger. Lately, however, I have been using Fosstodon, a Mastodon instance. It's largely devoid of international news, I enjoy the community, and its rules suit my preferences.

I'm interested in how Fosstodon operates: when its members are most active and how often they interact with one another. To find out, I've been playing with the Mastodon API to pull statuses and user information from the server. While I'm not much of a Python user, the Mastodon.py library makes interacting with the API easy. It only requires registering an application and logging in to your server.


from mastodon import Mastodon

# create an application
Mastodon.create_app(
    'myapp',
    api_base_url = 'https://fosstodon.org',
    to_file = 'myapp_clientcred.secret'
)

# log in to our server
mastodon = Mastodon(
    client_id = 'myapp_clientcred.secret',
    api_base_url = 'https://fosstodon.org'
)
mastodon.log_in(
    'parker@pbanks.net',
    'aReallyGreatPassword',
    to_file = 'pytooter_usercred.secret'
)

After connecting to your instance, the mastodon.timeline() function pulls statuses from the home, local, or public timelines, or statuses with a given tag, starting from max_id. Each call returned the last 40 statuses for me, so I used the following loop to pull the last 50,000 toots from the local timeline, covering May 28, 2020 to February 7, 2021. Keep in mind there is a limit of 300 requests per 5 minutes, so the time.sleep() function can be used to space out requests.


import time

# request statuses from the local timeline, starting at max_id
myFile = mastodon.timeline(timeline='local', max_id=105687600003001040, limit=40)
output = myFile

# use the id of the last status as the starting point for the next request
myId = myFile[-1]["id"]
for x in range(1249):
    myFile = mastodon.timeline(timeline='local', max_id=myId, limit=40)
    if not myFile:      # stop early if the timeline runs out
        break
    myId = myFile[-1]["id"]
    output.extend(myFile)
    time.sleep(1)       # stay under the limit of 300 requests per 5 minutes
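The pagination logic can also be wrapped in a reusable function. The sketch below is my own framing, not part of Mastodon.py (fetch_all and FakeClient are hypothetical names); it runs against a stand-in client so it can be tested offline, but the real call would pass the mastodon object instead:

```python
import time

def fetch_all(client, start_id, pages, page_size=40, pause=0.0):
    """Page backwards through a timeline, starting each request
    at the id of the last status from the previous page."""
    statuses = client.timeline(timeline='local', max_id=start_id, limit=page_size)
    output = list(statuses)
    for _ in range(pages - 1):
        if not statuses:           # timeline exhausted
            break
        last_id = statuses[-1]["id"]
        time.sleep(pause)          # space out requests to respect the rate limit
        statuses = client.timeline(timeline='local', max_id=last_id, limit=page_size)
        output.extend(statuses)
    return output

# stand-in client: serves statuses with descending ids, like a real timeline
class FakeClient:
    def timeline(self, timeline, max_id, limit):
        return [{"id": i} for i in range(max_id - 1, max_id - 1 - limit, -1)]

toots = fetch_all(FakeClient(), start_id=200, pages=3)
```

With the stand-in client, three pages of 40 statuses each come back in descending id order, mirroring how max_id pagination walks backwards through the real timeline.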

The end result is a list of dictionaries that contain the time and content of each status, its tags, number of replies, boosts, etc. Nested within each one is a dictionary describing the associated account (creation date, total followers, etc.), which I pulled into a separate list. Since I am only interested in the frequency and timing of toots, I also removed status tags, content, and other unnecessary information. Finally, I put each list into a pandas data frame and exported these to Excel files.


# put account information into a separate list
accounts = [d['account'] for d in output]

# remove unwanted fields
for d in output:
    for e in ['in_reply_to_account_id', 'spoiler_text', 'uri', 'favourited', 'reblogged',
              'muted', 'bookmarked', 'content', 'account', 'reblog', 'media_attachments',
              'mentions', 'emojis', 'card', 'poll', 'tags']:
        d.pop(e, None)
for d in accounts:
    for e in ['username','display_name','note','url','avatar','avatar_static',
        'header','header_static','last_status_at','emojis','fields']: 
        d.pop(e, None)

# convert lists to data frames
import pandas
dfPosts = pandas.DataFrame(output)
dfAccounts = pandas.DataFrame(accounts)

# delete time zone from status and account creation dates
dfPosts['created_at'] = dfPosts['created_at'].astype(str).str[:-6]
dfAccounts['created_at'] = dfAccounts['created_at'].astype(str).str[:-6]

# cast status ids to strings to prevent rounding errors
dfPosts['id'] = dfPosts['id'].astype(str)
dfPosts.to_excel("foss50k.xlsx")
dfAccounts.to_excel("account50k.xlsx")

Files containing the Fosstodon statuses from May 28, 2020 to February 7, 2021 and their associated accounts are available here.
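Once the data is in a data frame, the original question of when members are most active reduces to grouping statuses by the hour of created_at. A minimal sketch with made-up timestamps standing in for the exported data (the real analysis would load foss50k.xlsx instead):

```python
import pandas

# made-up timestamps standing in for the exported statuses
dfPosts = pandas.DataFrame({
    'created_at': pandas.to_datetime([
        '2020-05-28 09:15:00', '2020-05-28 09:45:00',
        '2020-05-28 14:05:00', '2020-05-29 09:30:00',
    ])
})

# count toots by hour of day to see when posting peaks
by_hour = dfPosts['created_at'].dt.hour.value_counts().sort_index()
```

Here by_hour is indexed by hour of day, so its largest value marks the busiest posting hour in the sample.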