Analyzing 760,000 Messages From the /r/EngineeringStudents Discord Chatserver

The /r/EngineeringStudents Discord server has been up since December of 2016, and has been public since May of 2017. In that time, the user count has risen to over 1400, and those users have sent more than 760,000 messages in the #general chat alone. Today I will show a bit of analysis of this big data set, using an engineering student's favorite tool, Excel.


First, I scraped the chat logs off of Discord using this handy tool, which took a bit over an hour to scrape all the messages. This script, while very helpful and the only easy way to collect this data, outputs in a format that Excel can barely read. To start, I could only save the data as a text file consisting of a single, extremely long line. All these messages and the associated metadata took up about 67 megabytes all told. Adding some logical line breaks and a little bit of find and replace gave us two distinct data sets.
One set was all of the messages sent in the channel, and the other set was a key of sorts that matched each username with both its Discord UUID and a separate numerical ID, which corresponded to the order in which the original scraping script first saw that username. The end result was one Excel sheet that looked like this, and one Excel sheet that looked like this.
These two data sets could then work together to output a readable spreadsheet of chat data that might actually be useful to me.
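For anyone who would rather script this lookup than do it in Excel, here is a minimal Python sketch of joining the two data sets. The exact layout of the scraper's output isn't shown in this post, so the tuple layout, the function name `build_readable_rows`, and the sample values are all assumptions made for illustration.

```python
from datetime import datetime

def build_readable_rows(messages, user_key):
    """Replace each message's numeric author ID with a readable username.

    Assumed layout (hypothetical, not the scraper's real format):
      messages: list of (numeric_id, iso_timestamp, text)
      user_key: dict mapping numeric_id -> (discord_id, username)
    """
    rows = []
    for numeric_id, iso_timestamp, text in messages:
        # Fall back to a placeholder if an ID is missing from the key.
        _discord_id, username = user_key.get(numeric_id, (None, "unknown"))
        rows.append({
            "timestamp": datetime.fromisoformat(iso_timestamp),
            "username": username,
            "text": text,
        })
    return rows

# Made-up sample data, only to show the shape of the result.
user_key = {0: ("1001", "alice"), 1: ("1002", "bob")}
messages = [
    (0, "2017-05-01T12:00:00", "hello"),
    (1, "2017-05-01T12:01:00", "hi alice"),
]
rows = build_readable_rows(messages, user_key)
```

Each entry in `rows` then plays the role of one spreadsheet row: a timestamp, a username, and the message text.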


Total Messages Over Time

Every message sent in our corpus contains a time stamp, and we can use it to see just how much the server has grown in activity. By taking this list of 760,000 timestamps, we can sort them into how many messages occurred before a certain date.
Repeat this for every day the server has been online, and you can visualize how many messages have been sent in the server as a function of time.
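The same running total can be sketched in Python. This is only an illustration of the idea, not the spreadsheet formula I used; the function name and the three sample timestamps are made up.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

def cumulative_messages(timestamps):
    """Map each day to the number of messages sent before that day ended."""
    stamps = sorted(timestamps)
    day, last_day = stamps[0].date(), stamps[-1].date()
    totals = {}
    while day <= last_day:
        next_midnight = datetime.combine(day + timedelta(days=1), datetime.min.time())
        # bisect_left counts the timestamps strictly before next_midnight.
        totals[day] = bisect_left(stamps, next_midnight)
        day += timedelta(days=1)
    return totals

# Hypothetical sample: two messages on May 1, none on May 2, one on May 3.
sample = [
    datetime(2017, 5, 1, 10, 0),
    datetime(2017, 5, 3, 9, 0),
    datetime(2017, 5, 1, 11, 0),
]
totals = cumulative_messages(sample)
```

Because the timestamps are sorted once, each day's total is a binary search rather than a full pass over all 760,000 rows.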

Messages Sent Per Day Over Time

Going further with the previous concept, we can compute how many messages were sent between any two points in time. If we compute this value every day, we can visualize the upward trend of messages sent per day as the server grows.
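As a rough Python equivalent, counting the messages between two points in time is the difference of two binary searches on the sorted timestamps. The function names and sample data below are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

def count_between(sorted_stamps, start, end):
    """Count messages with start <= timestamp < end."""
    return bisect_left(sorted_stamps, end) - bisect_left(sorted_stamps, start)

def messages_per_day(timestamps):
    """Map each day (including empty days) to its message count."""
    stamps = sorted(timestamps)
    midnight = datetime.combine(stamps[0].date(), datetime.min.time())
    per_day = {}
    while midnight <= stamps[-1]:
        next_midnight = midnight + timedelta(days=1)
        per_day[midnight.date()] = count_between(stamps, midnight, next_midnight)
        midnight = next_midnight
    return per_day

# Hypothetical sample: two messages on May 1, none on May 2, one on May 3.
sample = [
    datetime(2017, 5, 1, 10, 0),
    datetime(2017, 5, 1, 11, 0),
    datetime(2017, 5, 3, 9, 0),
]
per_day = messages_per_day(sample)
```

Days with no messages are kept as zeros so that the plotted line does not skip quiet days.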

Messages Per Day of Week

Using the data from above, we can then collapse the list of messages per day into a list of how many messages were sent on each day of the week. Add in a little Excel pivot table magic, and we can determine the most popular days of the week for the server's activity. The same pivot table can be configured to show messages sent per month as well.
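Outside of Excel, the pivot table boils down to grouping timestamps by a key. A minimal Python sketch, with made-up sample timestamps and hypothetical function names:

```python
from collections import Counter
from datetime import datetime

# Fixed English names so the result does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def messages_per_weekday(timestamps):
    """Count messages for each day of the week (Monday first)."""
    counts = Counter(ts.weekday() for ts in timestamps)
    return {name: counts[i] for i, name in enumerate(DAY_NAMES)}

def messages_per_month(timestamps):
    """Count messages for each calendar month, keyed as 'YYYY-MM'."""
    return dict(Counter(ts.strftime("%Y-%m") for ts in timestamps))

# Hypothetical sample timestamps.
sample = [
    datetime(2017, 5, 1, 10),   # Monday
    datetime(2017, 5, 8, 9),    # Monday
    datetime(2017, 5, 3, 22),   # Wednesday
    datetime(2017, 6, 2, 8),    # Friday
]
by_weekday = messages_per_weekday(sample)
by_month = messages_per_month(sample)
```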

Total Messages Per User

Right next to the list of dates in our data set is a list of usernames, with each username appearing every time that user sends a message. We can count how many times each username appears in this list to get an all-time leaderboard of top chatters.
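In Python, this count is a few lines with `collections.Counter`. The usernames below are invented for the example.

```python
from collections import Counter

def top_chatters(usernames, n=10):
    """Return the n most frequent usernames with their message counts."""
    return Counter(usernames).most_common(n)

# Hypothetical author column: one entry per message sent.
authors = ["alice", "bob", "alice", "carol", "alice", "bob"]
leaderboard = top_chatters(authors, n=3)
```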

User Activity Scores (Average Messages Per User Per Day)

The graph above is interesting to have, but it doesn't paint the most accurate picture because every user joined at a different point in the server's life. People who have been here for over a year will have lots more messages than people who joined 2 months ago, even if the newer user is much more active. To remedy this, we can calculate the messages sent per user per day.
For any given user, we can filter our data so that only their messages are included. In that single user's data, we can find the row with the earliest date (their first appearance on the server) and the row with the latest date (their last or most recent appearance on the server). We can also count how many messages (rows) fall between those two dates. This gives us data like this. Dividing the message count by the number of days between those two dates gives that user's messages per day. Using the magic of Excel data tables (and lots of time waiting for Excel to chug through all that data ~1000 times), we can compile a list of the most active users on the server by messages/day.
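A single pass in Python can do what the data table does per user. One detail here is my own assumption for the sketch: the active span is counted inclusively (last day minus first day, plus one), so a user who only ever posted on one day does not cause a division by zero.

```python
from datetime import date

def activity_scores(rows):
    """rows: iterable of (username, date). Returns {username: messages per day}."""
    first, last, count = {}, {}, {}
    for user, day in rows:
        first[user] = min(first.get(user, day), day)
        last[user] = max(last.get(user, day), day)
        count[user] = count.get(user, 0) + 1
    # Assumption: inclusive day span, so a one-day user has a span of 1.
    return {u: count[u] / ((last[u] - first[u]).days + 1) for u in count}

# Hypothetical sample: alice posts 4 times over 4 days, bob 3 times in 1 day.
rows = [
    ("alice", date(2017, 5, 1)), ("alice", date(2017, 5, 1)),
    ("bob", date(2017, 5, 2)), ("bob", date(2017, 5, 2)), ("bob", date(2017, 5, 2)),
    ("alice", date(2017, 5, 3)), ("alice", date(2017, 5, 4)),
]
scores = activity_scores(rows)
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In this made-up sample the newer, burstier user ranks first even though they have fewer total messages, which is exactly the correction this metric is meant to make.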

Number of Mentions Per User

Using similar techniques to the stats above, we can search through every message ever sent on the server, and count how many times each user is mentioned.
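Here is a hedged Python sketch of the mention count. It assumes mentions appear in the exported text in Discord's raw form, `<@ID>` or `<@!ID>`; the scraper's actual output may store them differently, in which case the pattern would need to change. The IDs and names are invented.

```python
import re
from collections import Counter

# Assumption: mentions are stored in raw Discord form, <@ID> or <@!ID>.
MENTION_PATTERN = re.compile(r"<@!?(\d+)>")

def count_mentions(texts, id_to_name):
    """Count how many times each known user is mentioned across all messages."""
    mentions = Counter()
    for text in texts:
        for user_id in MENTION_PATTERN.findall(text):
            if user_id in id_to_name:
                mentions[id_to_name[user_id]] += 1
    return mentions

# Hypothetical sample messages and ID key.
id_to_name = {"1001": "alice", "1002": "bob"}
texts = [
    "hi <@1001>",
    "<@!1001> and <@1002> should look at this",
    "no mention here",
]
mention_counts = count_mentions(texts, id_to_name)
```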

Keyword Tracking and Activity

Extending the previous code a bit further, it can be used to track the popularity of a certain word or topic over time, in line with current events. For this to mean anything, though, we have to normalize it to correct for the growth of the server over time. The end result is that, for a given keyword or string, we can count how many messages containing it were sent each day and divide that by the total number of messages sent that day, in effect normalizing the data. For example, we can track the bitcoin boom via chat popularity, or see when politics has been in the news. We can even use this same approach for more esoteric purposes, like finding what fraction of messages over time contained the letter "e".
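The normalization step is simple enough to sketch in Python: for each day, divide the number of matching messages by the total for that day. The case-insensitive substring match is my assumption for the sketch, and the sample messages are made up.

```python
from collections import defaultdict
from datetime import date

def keyword_share(rows, keyword):
    """rows: iterable of (date, text). Returns {date: fraction of messages containing keyword}."""
    needle = keyword.lower()  # assumption: matching is case-insensitive
    total = defaultdict(int)
    hits = defaultdict(int)
    for day, text in rows:
        total[day] += 1
        if needle in text.lower():
            hits[day] += 1
    return {day: hits[day] / total[day] for day in sorted(total)}

# Hypothetical sample: 2 of 4 messages mention bitcoin on day one, 0 of 2 on day two.
rows = [
    (date(2017, 12, 1), "bitcoin is up"),
    (date(2017, 12, 1), "homework is due"),
    (date(2017, 12, 1), "Bitcoin again"),
    (date(2017, 12, 1), "lab report"),
    (date(2017, 12, 2), "exam tomorrow"),
    (date(2017, 12, 2), "statics is hard"),
]
share = keyword_share(rows, "bitcoin")
```

Because each day is divided by its own total, a busy day and a quiet day can be compared directly.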

Popularity (or Divisiveness) Index

The problem with just counting the number of mentions of each user in chat is that some users are more active than others, and have been on the server for a longer time. We can correct for this by instead finding the ratio between how many times a user was mentioned in chat, and how many messages they have actually sent in chat.
This gives us a more balanced popularity index. However, some of the less active users have to be dropped from this data, because at small message sample sizes the popularity index becomes too extreme. For example, a user who enters the server, says hello, and never speaks again could have a popularity index of 5 if 5 people mention them when they say hello back. Furthermore, the name "Popularity Index" is a bit of a misnomer, because this metric is more accurately a measure of how well a user's messages elicit a response, positive or otherwise.
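A small Python sketch of the ratio with a cutoff for inactive users. The threshold of 10 messages is purely a placeholder for illustration, not the cutoff I used, and the counts are invented.

```python
def popularity_index(mentions, messages_sent, min_messages=10):
    """Ratio of times mentioned to messages sent, skipping low-activity users.

    min_messages is a hypothetical cutoff chosen for this example.
    """
    index = {}
    for user, sent in messages_sent.items():
        if sent < min_messages:
            continue  # too few messages for the ratio to mean anything
        index[user] = mentions.get(user, 0) / sent
    return index

# Hypothetical counts; carol said hello once and was greeted by 5 people.
mentions = {"alice": 30, "bob": 5, "carol": 5}
messages_sent = {"alice": 100, "bob": 50, "carol": 1}
index = popularity_index(mentions, messages_sent)
```

Without the cutoff, carol's index would be 5.0 and she would top the chart after a single message.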

A User's Activity By Time of Day

Reusing lots of the code we wrote for the messages per user per day worksheet, we can find a user's daily activity; that is, the times of day when a user is most active. All that happens is we take a user's list of time stamps for all their messages and sort them into bins based on the time of the day they were sent. I used 5 minute increment bins for this. This leads to some of the most interesting statistics we can draw from this chat history. For example, we can find:

A User's Activity Over Time

Taking the same code and formulas used to calculate the activity by time of day graphs, we can convert this to a user's activity over time by changing the bins we sort the data into. Instead of sorting the messages into 5 minute time increments, we sort them into 1 day increments. Using this, we can find a user's activity as a function of date. You can see that the data when sorted into daily bins is rather noisy/jagged, so a weekly moving average works best to visualize the values. Like before, this works for any user you input into the spreadsheet; everyone from the dearly departed, to someone who should maybe slow down a bit.
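Both binning schemes, plus the smoothing, can be sketched in Python as follows. The function names and sample timestamps are hypothetical, and the moving average here is a trailing one that averages over however many days are available at the start of the series, which may differ from how a spreadsheet handles the first few rows.

```python
from collections import Counter
from datetime import datetime, timedelta

def time_of_day_bins(timestamps, bin_minutes=5):
    """Count messages per time-of-day bin (288 bins for 5-minute increments)."""
    bins = [0] * (24 * 60 // bin_minutes)
    for ts in timestamps:
        bins[(ts.hour * 60 + ts.minute) // bin_minutes] += 1
    return bins

def daily_counts(timestamps):
    """Count messages per calendar day, including days with zero messages."""
    per_day = Counter(ts.date() for ts in timestamps)
    first, last = min(per_day), max(per_day)
    n_days = (last - first).days + 1
    return [per_day[first + timedelta(days=i)] for i in range(n_days)]

def moving_average(values, window=7):
    """Trailing moving average; early entries average over the days available."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Hypothetical timestamps for one user.
stamps = [
    datetime(2017, 5, 1, 0, 3),
    datetime(2017, 5, 1, 0, 4),
    datetime(2017, 5, 1, 23, 58),
    datetime(2017, 5, 3, 12, 0),
]
by_time = time_of_day_bins(stamps)
by_day = daily_counts(stamps)
smoothed = moving_average(by_day, window=2)
```

Switching from time-of-day to activity-over-time only changes the key each timestamp is sorted by; `window=7` gives the weekly average described above.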
I am no doubt just scratching the surface of how much a rich collection of data like a Discord chat can be manipulated, but I thought these were a few cool examples of the information hidden within such an unrefined mass of data.
submitted by labtec901 to excel
