I’m a data nerd (ergo the dual CS/Math major). I’m also very interested in the interactions among differing subsets of individuals (hence the Org. Com. minor).
A few months ago, I saw a similar project to the one presented below. Since I had a free weekend, I thought I’d recreate it - with a few added twists.
I’ve been backing up my texts for years. The internationally popular messaging service WhatsApp also allows a user to download all messages sent between two users on that platform in a plain text format. These two sources supplied the data for this project; however, it wouldn’t be too much work to add other sources such as email or Facebook Messages.
Hopefully you’ll get a kick out of this.
All times Mountain Standard/Daylight
The first few recorded chats between Sam and Micah
Each raw source file (WhatsApp text file, SMS dump, etc.) was first parsed using a Java application. In addition to standardizing and cleaning the initial data, the application computed several statistics, such as the word count of each chat. Once that information was compiled, it was exported into a pipe-separated file (PSV). The pipe character (|) was chosen to delimit fields because chats frequently contained commas, and escaping them seemed far more bother than it was worth. That PSV was then loaded into R, a programming language used for statistical analysis. This document is the output of that R program.
Below are all the fields imported, as well as example data drawn from the first few rows of the PSV file. To save screen space, the content of the Message field has been truncated to the first 30 characters in the table below.
Time | Sender.First.Name | Sender.Last.Name | Type | Word.Count | Misspelled.Words | Sentiment | Numeric.Sentiment | Message |
---|---|---|---|---|---|---|---|---|
2016-08-18 22:22:00 | Sam | Schwartz | SMS | 16 | 0 | Positive | 5 | Hey, I hope you get home safel |
2016-08-18 22:24:00 | Micah | Orton | SMS | 9 | 0 | Positive | 5 | Thanks, you too! I’m just ente |
2016-08-18 22:58:00 | Micah | Orton | SMS | 15 | 0 | Neutral | 0 | Made it back. I realized that |
2016-08-18 22:58:00 | Micah | Orton | SMS | 12 | 0 | Positive | 5 | And guaranteed tomorrow I wake |
2016-08-18 22:59:00 | Sam | Schwartz | SMS | 6 | 0 | Neutral | 0 | Because of the memory of today |
2016-08-18 22:59:00 | Micah | Orton | SMS | 1 | 0 | Neutral | 0 | Yup |
Each field is either self-explanatory or defined later in this document.
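For the curious, here is a minimal sketch of reading such a file. It is in Python rather than the Java and R actually used for this project, and the exact on-disk layout (header row, no quoting) is an assumption. The field names and sample rows come from the table above; `load_chats` is a hypothetical helper name.

```python
import csv
import io
from datetime import datetime

# Assumed PSV layout: a header row followed by one chat per line.
# The rows are copied from the example table; messages are already truncated.
PSV_TEXT = """Time|Sender.First.Name|Sender.Last.Name|Type|Word.Count|Misspelled.Words|Sentiment|Numeric.Sentiment|Message
2016-08-18 22:22:00|Sam|Schwartz|SMS|16|0|Positive|5|Hey, I hope you get home safel
2016-08-18 22:24:00|Micah|Orton|SMS|9|0|Positive|5|Thanks, you too! I’m just ente
2016-08-18 22:59:00|Micah|Orton|SMS|1|0|Neutral|0|Yup
"""

def load_chats(text):
    """Parse pipe-separated text into a list of dicts with typed fields."""
    # QUOTE_NONE keeps commas and quote characters inside messages untouched.
    reader = csv.DictReader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    chats = []
    for row in reader:
        row["Time"] = datetime.strptime(row["Time"], "%Y-%m-%d %H:%M:%S")
        for field in ("Word.Count", "Misspelled.Words", "Numeric.Sentiment"):
            row[field] = int(row[field])
        chats.append(row)
    return chats

chats = load_chats(PSV_TEXT)
for chat in chats:
    print(chat["Time"], chat["Sender.First.Name"], chat["Word.Count"], chat["Message"])
```

Because the delimiter is a pipe, the comma in the first message needs no escaping, which is the whole point of choosing PSV over CSV here.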
Total Chats Sent: 3301
The most common question: how often do we communicate? Below is the time-series of chat counts as grouped by day.
Protip: Click on the name label to toggle the overlay.
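Grouping by day is simple enough to sketch. The snippet below is an illustration in Python, not the R code behind the plot; it counts chats per calendar day and sender using the six example rows shown earlier. `daily_counts` is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime

# (timestamp, sender) pairs copied from the example rows shown earlier.
rows = [
    ("2016-08-18 22:22:00", "Sam"),
    ("2016-08-18 22:24:00", "Micah"),
    ("2016-08-18 22:58:00", "Micah"),
    ("2016-08-18 22:58:00", "Micah"),
    ("2016-08-18 22:59:00", "Sam"),
    ("2016-08-18 22:59:00", "Micah"),
]

def daily_counts(rows):
    """Count chats per (calendar day, sender)."""
    counts = Counter()
    for stamp, sender in rows:
        day = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").date()
        counts[(day, sender)] += 1
    return counts

counts = daily_counts(rows)
for (day, sender), n in sorted(counts.items()):
    print(day, sender, n)
```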
The second most common question: when do we talk?
In order to know anything about a data set, it’s often wise to know what time-frame you’re working in. Based on the chats uploaded, here is the time interval information:
Property | Date and Time |
---|---|
First Chat Timestamp | 2016-08-18 22:22:00 |
Last Chat Timestamp | 2017-01-21 11:44:00 |
Average Chat Timestamp | 2016-11-04 09:29:07 |
Median Chat Timestamp | 2016-11-05 13:39:00 |
Additionally, here is a summary of the length of time between chats:
Property | Time in Seconds |
---|---|
Shortest Time Between Consecutive Chats | 0 |
Longest Time Between Consecutive Chats | 2262000 |
Average Time Between Consecutive Chats | 4074 |
*Median Time Between Consecutive Chats | 60 |
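*The median is the best heuristic for determining response time, since the average is skewed by a handful of very long gaps (the longest is 2262000 seconds, roughly 26 days).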
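To see why the median and the average can disagree so much, here is a small Python sketch (an illustration, not the R code used for the table) that computes the gaps between consecutive chats for the six example rows shown earlier.

```python
from datetime import datetime
from statistics import mean, median

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the example rows shown earlier.
stamps = [
    "2016-08-18 22:22:00",
    "2016-08-18 22:24:00",
    "2016-08-18 22:58:00",
    "2016-08-18 22:58:00",
    "2016-08-18 22:59:00",
    "2016-08-18 22:59:00",
]

times = sorted(datetime.strptime(s, FMT) for s in stamps)

# Seconds elapsed between each chat and the one immediately before it.
gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]

summary = {
    "shortest": min(gaps),
    "longest": max(gaps),
    "average": mean(gaps),
    "median": median(gaps),
}
print(summary)
```

Even in this tiny sample, one 34-minute pause pulls the average well above the median.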
Sender | Chat Count |
---|---|
Micah Orton | 1707 |
Sam Schwartz | 1594 |
Different individuals may have stronger preferences for one platform over another. Here’s the breakdown.
Platform Type | Chat Count |
---|---|
SMS | 793 |
WhatsApp | 2508 |
While it is customary for individuals to respond on the same platform on which they were contacted, this may not always be the case. Furthermore, some individuals may send chats at different rates depending on the platform. The table below gives some insight into those cases.
Sender Platform Type | Chat Count |
---|---|
From Micah Orton via SMS | 435 |
From Micah Orton via WhatsApp | 1272 |
From Sam Schwartz via SMS | 358 |
From Sam Schwartz via WhatsApp | 1236 |
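As a sanity check, the sender-by-platform counts above should add back up to the per-sender and per-platform totals. A short Python sketch (illustrative only; the numbers are copied from the tables in this document):

```python
from collections import defaultdict

# Chat counts per (sender, platform), copied from the table above.
counts = {
    ("Micah Orton", "SMS"): 435,
    ("Micah Orton", "WhatsApp"): 1272,
    ("Sam Schwartz", "SMS"): 358,
    ("Sam Schwartz", "WhatsApp"): 1236,
}

by_sender = defaultdict(int)
by_platform = defaultdict(int)
for (sender, platform), n in counts.items():
    by_sender[sender] += n
    by_platform[platform] += n

total = sum(counts.values())
print(dict(by_sender), dict(by_platform), total)
```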
The quantity of chats shows only part of the story: the amount of information conveyed in each chat also matters.
[Figure: word count per chat, in three panels: All Chats, Chats sent by Sam Schwartz, Chats sent by Micah Orton]
In general, chats tend to follow an exponential decay distribution with respect to word count.
What may be more interesting is how the chatters compare to each other.
What happens when we ignore the chats entirely? When we take the aggregated word counts across all chats in the data set and partition them by sender, the following results.
Sender | Total Word Count |
---|---|
Sam Schwartz | 24347 |
Micah Orton | 11578 |
Much like the “how often do we communicate?” section, the following shows how much is said over time.
Some individuals may send many short messages. Others may send fewer chats, but those chats are correspondingly longer. This section tries to normalize those differences so that two individuals’ chat styles can be more readily compared.
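One simple way to normalize is average words per chat. That particular choice is an assumption for illustration (the plots here may use a different normalization), sketched in Python with the totals reported in this document.

```python
# Totals copied from the tables in this document.
chat_counts = {"Sam Schwartz": 1594, "Micah Orton": 1707}
word_counts = {"Sam Schwartz": 24347, "Micah Orton": 11578}

# Assumed normalization: total words divided by total chats, per sender.
words_per_chat = {
    sender: word_counts[sender] / chat_counts[sender]
    for sender in chat_counts
}

for sender, avg in words_per_chat.items():
    print(f"{sender}: {avg:.2f} words per chat")
```

By this measure, Sam sends fewer chats but packs more than twice as many words into each one.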
Quantity is important, but so is quality. Each message was spell-checked using the library provided by LanguageTool. The message was parsed using English (American) and Spanish (Spain) dictionaries. A message’s misspelled word count was the minimum of its misspelled English word count and its misspelled Spanish word count.
For my coding/math friends, the pseudo-code is something like
misspelledWords = min{numberOfMisspelledEnglishWords(message), numberOfMisspelledSpanishWords(message)}
Unfortunately, messages that code-switch within the same chat may be flagged as having a high number of misspelled words.
Several common chatspeak entries, such as “lol” or “haha”/“jaja” have also been added as correctly spelled words for the purpose of this count. My #1 pet peeve, “u”, is most definitely considered a misspelling.
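Here is a toy version of that rule in Python. It does not use LanguageTool; the tiny word sets stand in for real dictionaries, and every name in it (`ENGLISH`, `SPANISH`, `misspelled`, `misspelled_words`) is hypothetical. It is only meant to show how the minimum behaves, including the code-switching problem.

```python
import re

# Stand-in dictionaries (assumptions for illustration, not LanguageTool's).
CHATSPEAK = {"lol", "haha", "jaja"}
ENGLISH = {"see", "you", "tomorrow"} | CHATSPEAK
SPANISH = {"nos", "vemos", "mañana", "amigo"} | CHATSPEAK

def misspelled(message, dictionary):
    """Count words in the message that are missing from the dictionary."""
    words = re.findall(r"[^\W\d_]+", message.lower())  # runs of letters only
    return sum(1 for word in words if word not in dictionary)

def misspelled_words(message):
    """Take the smaller of the English and Spanish misspelling counts."""
    return min(misspelled(message, ENGLISH), misspelled(message, SPANISH))

for text in ["see u tomorrow", "jaja nos vemos mañana", "see you mañana amigo"]:
    print(text, "->", misspelled_words(text))
```

The last message mixes both languages, so neither dictionary fully covers it and it gets flagged even though every word is spelled correctly.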
[Figure: misspelled words per chat, in three panels: All Chats, Chats sent by Sam Schwartz, Chats sent by Micah Orton]
This plot will also often follow a decaying exponential distribution.
So who’s the worse speller?
Sender | Total Misspelled Word Count |
---|---|
Sam Schwartz | 236 |
Micah Orton | 85 |
Hopefully this will be going down with time!
Hold up though! If one individual writes a lot more than another, it’s fair to expect the long-winded individual to make more mistakes.
By dividing by the total word count, we can see the probability that a given word is misspelled.
Sender | Probability (as a percentage) That a Randomly Selected Word is Misspelled |
---|---|
Sam Schwartz | 0.9693186 |
Micah Orton | 0.7341510 |
Also hoping this goes down with time!
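The percentages above are just misspelled words divided by total words, times 100. A quick Python check using the totals reported in this document (illustrative; the real computation was done in R):

```python
# Totals copied from the tables in this document.
misspelled_totals = {"Sam Schwartz": 236, "Micah Orton": 85}
word_totals = {"Sam Schwartz": 24347, "Micah Orton": 11578}

# Percentage chance that a randomly selected word is misspelled.
misspelled_pct = {
    sender: 100 * misspelled_totals[sender] / word_totals[sender]
    for sender in word_totals
}

for sender, pct in misspelled_pct.items():
    print(f"{sender}: {pct:.7f}%")
```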
Sentiment Analysis is one of the new sexy buzzwords on the forefront of machine learning and natural language processing today. While the field as a whole is still in its infancy, some guys at Stanford have developed a classification neural-net for assigning sentiment to a particular utterance. While still not quite perfect, their model is the best on the market for now.
It’s really cool, and after checking out the rest of this document take a look at their stuff here.
In a nutshell, each chat was processed through Stanford’s software and given a classification: Very negative, Negative, Neutral, Positive, Very positive.
Examples of positive phrases include “I love you”, versus negative phrases like “I hate you”. You can try out your own phrases through Stanford’s live demo page.
In order to quantify this information, the following arbitrarily chosen numeric weights were assigned to each classification:
Likert Classification | Numeric Weight |
---|---|
Very negative | -10 |
Negative | -5 |
Neutral | 0 |
Positive | 5 |
Very positive | 10 |
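In code, the weighting is just a lookup table. The Python sketch below is an illustration and does not call Stanford’s software; it maps the sentiment labels of the six example rows shown earlier to their weights and averages them. Averaging is one possible way to summarize the scores, assumed here for the example.

```python
from statistics import mean

# Numeric weights copied from the table above.
WEIGHTS = {
    "Very negative": -10,
    "Negative": -5,
    "Neutral": 0,
    "Positive": 5,
    "Very positive": 10,
}

# Sentiment labels of the six example rows shown earlier, in order.
labels = ["Positive", "Positive", "Neutral", "Positive", "Neutral", "Neutral"]

scores = [WEIGHTS[label] for label in labels]
average_sentiment = mean(scores)
print(scores, average_sentiment)
```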
With all that in mind, here’s the initial breakdown of feeling expressed in the data set.
[Figure: sentiment breakdown, in three panels: All Chats, Chats sent by Sam Schwartz, Chats sent by Micah Orton]
Thank you for checking out this document. If you have any questions, recommendations, or concerns, shoot me an email at samorschwartz@gmail.com.