Introduction

I’m a data nerd (ergo the dual CS/Math major). I’m also very interested in the interactions among differing subsets of individuals (hence the Org. Com. minor).

A few months ago, I saw a project similar to the one presented below. Since I had a free weekend, I thought I’d recreate it, with a few added twists.

I’ve been backing up my texts for years. The internationally popular messaging service WhatsApp also lets a user download all messages exchanged between two users on that platform in plain text format. These two sources provided the data for this project; however, it wouldn’t be too much work to add other sources such as email or Facebook Messages.

Hopefully you’ll get a kick out of this.

All times are in Mountain Standard/Daylight Time.

Overview of Data

Hey, I hope you get home safely! Text me when you’re safely there. Remember: you’re amazing!
Sent by Sam Schwartz | 2016-08-18 22:22:00


Thanks, you too! I’m just entering the canyon now.
Sent by Micah Orton | 2016-08-18 22:24:00

The first few recorded chats between Sam and Micah

Each raw source file (WhatsApp text file, SMS dump, etc.) was first parsed using a Java application. In addition to standardizing and cleaning the initial data, the application computed several statistics, such as the word count of each chat. Once that information was compiled, it was exported into a pipe-separated values (PSV) file. The pipe character (|) was chosen as the field delimiter because chats frequently contained commas, and escaping them seemed far more bother than it was worth. That PSV file was then loaded into R, a programming language used for statistical analysis. This document is the output of that R program.

Below are all the fields imported, as well as example data drawn from the first few rows of the PSV file. To save screen space, the content of the Message field has been truncated to the first 30 characters in the table below.

Example Input Data

Time | Sender.First.Name | Sender.Last.Name | Type | Word.Count | Misspelled.Words | Sentiment | Numeric.Sentiment | Message
2016-08-18 22:22:00 | Sam | Schwartz | SMS | 16 | 0 | Positive | 5 | Hey, I hope you get home safel
2016-08-18 22:24:00 | Micah | Orton | SMS | 9 | 0 | Positive | 5 | Thanks, you too! I’m just ente
2016-08-18 22:58:00 | Micah | Orton | SMS | 15 | 0 | Neutral | 0 | Made it back. I realized that
2016-08-18 22:58:00 | Micah | Orton | SMS | 12 | 0 | Positive | 5 | And guaranteed tomorrow I wake
2016-08-18 22:59:00 | Sam | Schwartz | SMS | 6 | 0 | Neutral | 0 | Because of the memory of today
2016-08-18 22:59:00 | Micah | Orton | SMS | 1 | 0 | Neutral | 0 | Yup

Each field is either self-explanatory or defined later in this document.
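For the curious, here is a minimal Python sketch of what a pipe-separated export and re-import can look like. It is not the Java application used for this project: the `word_count`, `to_psv`, and `from_psv` helpers are hypothetical names, only a subset of the columns is used, and splitting on whitespace is an assumed definition of a "word" (it does reproduce the counts of 16 and 9 listed for the first two chats).

```python
import csv
import io

# A subset of the columns listed above, kept small for illustration.
FIELDS = ["Time", "Sender.First.Name", "Sender.Last.Name", "Type", "Word.Count", "Message"]


def word_count(message):
    """Count whitespace-separated tokens (an assumed definition of a word)."""
    return len(message.split())


def to_psv(chats):
    """Serialize chat dicts to pipe-separated text; commas need no escaping."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="|", lineterminator="\n")
    writer.writeheader()
    for chat in chats:
        writer.writerow({**chat, "Word.Count": word_count(chat["Message"])})
    return buffer.getvalue()


def from_psv(text):
    """Parse pipe-separated text back into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text), delimiter="|"))


# The first two chats from the data set.
chats = [
    {"Time": "2016-08-18 22:22:00", "Sender.First.Name": "Sam", "Sender.Last.Name": "Schwartz",
     "Type": "SMS",
     "Message": "Hey, I hope you get home safely! Text me when you’re safely there. Remember: you’re amazing!"},
    {"Time": "2016-08-18 22:24:00", "Sender.First.Name": "Micah", "Sender.Last.Name": "Orton",
     "Type": "SMS",
     "Message": "Thanks, you too! I’m just entering the canyon now."},
]

psv_text = to_psv(chats)
parsed = from_psv(psv_text)
```

Because the delimiter is a pipe, the commas inside each message pass through untouched. A message containing a literal pipe would still need quoting, which the `csv` module handles on its own.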

Summary Information

Total Chats Sent: 3301

Number of Chats Sent Per Day

The most common question: how often do we communicate? Below is the time series of chat counts, grouped by day.
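Grouping by day amounts to dropping the time of day from each timestamp and counting what is left. Here is a rough Python sketch of that idea using the six example timestamps from the input table; the actual plots were produced in R, and `chats_per_day` is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"

# Timestamps of the six example rows shown earlier.
timestamps = [
    "2016-08-18 22:22:00",
    "2016-08-18 22:24:00",
    "2016-08-18 22:58:00",
    "2016-08-18 22:58:00",
    "2016-08-18 22:59:00",
    "2016-08-18 22:59:00",
]


def chats_per_day(timestamps):
    """Return a Counter mapping each calendar date to its number of chats."""
    days = (datetime.strptime(t, TIMESTAMP_FORMAT).date() for t in timestamps)
    return Counter(days)


counts = chats_per_day(timestamps)
```

Splitting the same count by sender only requires counting `(date, sender)` pairs instead of bare dates.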

Number of Chats Sent Per Day by Sender

Protip: Click on the name label to toggle the overlay.

Histogram of Chat Times

The second most common question: when do we talk?

Histogram of Chat Times by Sender

Time

In order to know anything about a data set, it’s often wise to know what time-frame you’re working in. Based on the chats uploaded, here is the time interval information:

Property | Date and Time
First Chat Timestamp | 2016-08-18 22:22:00
Last Chat Timestamp | 2017-01-21 11:44:00
Average Chat Timestamp | 2016-11-04 09:29:07
Median Chat Timestamp | 2016-11-05 13:39:00

Additionally, here is a summary of the length of time between chats:

Property | Time in Seconds
Shortest Time Between Consecutive Chats | 0
Longest Time Between Consecutive Chats | 2262000
Average Time Between Consecutive Chats | 4074
*Median Time Between Consecutive Chats | 60

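*The median is a better heuristic for response time than the average, because a handful of very long gaps (the longest is 2,262,000 seconds, or roughly 26 days) pull the average far above a typical reply time.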
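To see why the average and the median can disagree so much, here is a small Python sketch that computes the gaps between consecutive chats for the six example rows from the input table. The helper name `gaps_in_seconds` is hypothetical, and this only illustrates the idea; it is not the R code behind the table above.

```python
from datetime import datetime
from statistics import mean, median

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"

# Timestamps of the six example rows shown earlier.
timestamps = [
    "2016-08-18 22:22:00",
    "2016-08-18 22:24:00",
    "2016-08-18 22:58:00",
    "2016-08-18 22:58:00",
    "2016-08-18 22:59:00",
    "2016-08-18 22:59:00",
]


def gaps_in_seconds(timestamps):
    """Return the time in seconds between each pair of consecutive chats."""
    times = sorted(datetime.strptime(t, TIMESTAMP_FORMAT) for t in timestamps)
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


gaps = gaps_in_seconds(timestamps)   # [120.0, 2040.0, 0.0, 60.0, 0.0]
average_gap = mean(gaps)             # 444.0, dominated by the single 34-minute gap
median_gap = median(gaps)            # 60.0
```

Even in this tiny sample, one long pause drags the average up to over seven minutes while the median stays at one minute.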

Chats by Sender

Sender | Chat Count
Micah Orton | 1707
Sam Schwartz | 1594

Chats by Platform Type

Different individuals may have stronger preferences for one platform over another. Here’s the breakdown.

Platform Type | Chat Count
SMS | 793
WhatsApp | 2508

Chats by Sender and Platform Type

While it is customary for individuals to respond on the same platform on which they were contacted, this may not always be the case. Furthermore, some individuals may send chats at different rates depending on the platform. The breakdown below gives some insight into those cases.

Sender | Platform Type | Chat Count
Micah Orton | SMS | 435
Micah Orton | WhatsApp | 1272
Sam Schwartz | SMS | 358
Sam Schwartz | WhatsApp | 1236

Chats By Word Count

The quantity of chats shows only part of the story: the amount of information conveyed in each chat also matters.

Word Count Distribution Summary

No. of Words in Chat | All Chats | Chats sent by Sam Schwartz | Chats sent by Micah Orton
Min. | 1.00 | 1.00 | 1.000
1st Qu. | 4.00 | 6.00 | 2.000
Median | 7.00 | 10.00 | 5.000
Mean | 10.88 | 15.27 | 6.783
3rd Qu. | 13.00 | 19.00 | 9.000
Max. | 247.00 | 247.00 | 87.000

Histogram of Word Count

In general, chats tend to follow an exponential decay distribution with respect to word count.

Histogram of Word Count by Sender

What may be more interesting is how the chatters compare to each other.

Total Word Count

What happens when we ignore the chats entirely? When we take the aggregated word counts across all chats in the data set and partition them by sender, the following results.

Sender | Total Word Count
Sam Schwartz | 24347
Micah Orton | 11578

Number of Words Sent Per Day

Closely related to the question of how often we communicate, the following shows how much is said over time.

Number of Words Sent Per Day by Sender

Number of Words Per Chat Per Day

Some individuals may send many short messages. Others may send fewer chats, but those chats are correspondingly longer. This section tries to normalize those differences so that two individuals’ chat styles can be more readily compared.
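One way to read "words per chat per day" is: for each day and sender, divide the total words sent by the number of chats sent. The Python sketch below works under that assumption, using the six example rows from the input table; `words_per_chat_per_day` is a hypothetical helper, not the R code that produced the plots.

```python
from collections import defaultdict

# (timestamp, sender, word count) taken from the example input rows.
chats = [
    ("2016-08-18 22:22:00", "Sam Schwartz", 16),
    ("2016-08-18 22:24:00", "Micah Orton", 9),
    ("2016-08-18 22:58:00", "Micah Orton", 15),
    ("2016-08-18 22:58:00", "Micah Orton", 12),
    ("2016-08-18 22:59:00", "Sam Schwartz", 6),
    ("2016-08-18 22:59:00", "Micah Orton", 1),
]


def words_per_chat_per_day(chats):
    """Map (day, sender) to total words divided by number of chats."""
    totals = defaultdict(lambda: [0, 0])  # (day, sender) -> [words, chat count]
    for timestamp, sender, words in chats:
        day = timestamp[:10]  # the "YYYY-MM-DD" part of the timestamp
        totals[(day, sender)][0] += words
        totals[(day, sender)][1] += 1
    return {key: words / count for key, (words, count) in totals.items()}


ratios = words_per_chat_per_day(chats)
```

On that first evening, Sam sent 2 chats averaging 11 words each, while Micah sent 4 chats averaging 9.25 words each.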

Number of Words Per Chat Per Day by Sender

Chats By Misspelled Words

Quantity is important, but so is quality. Each message was spell checked using the library provided by LanguageTool. Each message was checked against both the English (American) and Spanish (Spain) dictionaries. A message’s misspelled word count is the smaller of its misspelled English word count and its misspelled Spanish word count.

For my coding/math friends, the pseudo-code is something like

misspelledWords = min{numberOfMisspelledEnglishWords(message), numberOfMisspelledSpanishWords(message)}

Unfortunately, messages that code-switch within the same chat may be flagged as having a high number of misspelled words.

Several common chatspeak entries, such as “lol” or “haha”/“jaja” have also been added as correctly spelled words for the purpose of this count. My #1 pet peeve, “u”, is most definitely considered a misspelling.
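Here is a toy Python version of that rule. It does not call LanguageTool: the three tiny word sets are stand-in dictionaries made up for illustration, and `tokenize`, `count_misspelled`, and `misspelled_words` are hypothetical helper names.

```python
import re

# Toy stand-ins for real dictionaries (assumptions for illustration only).
ENGLISH = {"see", "you", "tomorrow"}
SPANISH = {"hola", "nos", "vemos", "mañana"}
# Chatspeak accepted as correctly spelled in either language.
CHATSPEAK = {"lol", "haha", "jaja"}


def tokenize(message):
    """Lowercase the message and extract runs of letters, accents included."""
    return re.findall(r"[^\W\d_]+", message.lower())


def count_misspelled(message, dictionary):
    """Count words found in neither the dictionary nor the chatspeak list."""
    return sum(
        1 for word in tokenize(message)
        if word not in dictionary and word not in CHATSPEAK
    )


def misspelled_words(message):
    """Take the smaller of the English and Spanish misspelling counts."""
    return min(count_misspelled(message, ENGLISH), count_misspelled(message, SPANISH))


english_chat = misspelled_words("lol see u tomorrow")      # only "u" is flagged
spanish_chat = misspelled_words("jaja nos vemos mañana")   # nothing is flagged
mixed_chat = misspelled_words("hola see you mañana")       # two words flagged either way
```

The last example shows the code-switching problem: every word is spelled correctly in one of the two languages, yet the chat is still charged with two misspellings because each dictionary rejects the other language’s words.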

Word Misspelling Distribution Summary

No. of Misspelled Words in Chat | All Chats | Chats sent by Sam Schwartz | Chats sent by Micah Orton
Min. | 0.00000 | 0.0000 | 0.00000
1st Qu. | 0.00000 | 0.0000 | 0.00000
Median | 0.00000 | 0.0000 | 0.00000
Mean | 0.09724 | 0.1481 | 0.04979
3rd Qu. | 0.00000 | 0.0000 | 0.00000
Max. | 5.00000 | 5.0000 | 2.00000

Histogram of Misspelled Word Count

This plot will also often follow a decaying exponential distribution.

Histogram of Misspelled Word Count by Sender

Total Misspelled Word Count by Sender

So who’s the worse speller?

Sender | Total Misspelled Word Count
Sam Schwartz | 236
Micah Orton | 85

Total Number of Misspelled Words Sent Per Day

Hopefully this will be going down with time!

Total Number of Misspelled Words Sent Per Day by Sender

Probability a Word is Misspelled by Sender

Hold up though! If one individual writes a lot more than another, it’s fair to expect that the long-winded individual will make more mistakes in total.

By dividing by the total word count, we can see the probability that a given word is misspelled.

Sender | Probability (as a percentage) That a Randomly Selected Word Is Misspelled
Sam Schwartz | 0.9693186
Micah Orton | 0.7341510
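These percentages can be checked directly from the totals reported earlier: misspelled words divided by total words, times 100. A quick Python sketch (the function name `misspelling_rate_percent` is just for illustration):

```python
# (total misspelled words, total words) per sender, from the tables above.
totals = {
    "Sam Schwartz": (236, 24347),
    "Micah Orton": (85, 11578),
}


def misspelling_rate_percent(misspelled, words):
    """Return the share of misspelled words as a percentage."""
    return 100 * misspelled / words


rates = {
    sender: misspelling_rate_percent(misspelled, words)
    for sender, (misspelled, words) in totals.items()
}
```

So although Sam has almost three times as many misspelled words in total, the per-word rates are much closer: just under 1% for Sam versus about 0.73% for Micah.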

Probability a Word is Misspelled Per Day

Also hoping this goes down with time!

Probability a Word is Misspelled Per Day by Sender

Sentiment Analysis

Sentiment analysis is one of the new sexy buzzwords at the forefront of machine learning and natural language processing today. While the field as a whole is still in its infancy, researchers at Stanford have developed a neural network classifier for assigning sentiment to a particular utterance. Their model is not quite perfect, but to my knowledge it was among the best available at the time of writing.

It’s really cool, and after checking out the rest of this document take a look at their stuff here.

In a nutshell, each chat was processed through Stanford’s software and given a classification: Very negative, Negative, Neutral, Positive, Very positive.

Examples of positive phrases include “I love you”, versus negative phrases like “I hate you”. You can try out your own phrases through Stanford’s live demo page.

In order to quantify this information, the following arbitrarily chosen numeric weights were assigned to each classification:

Likert Classification | Numeric Weight
Very negative | -10
Negative | -5
Neutral | 0
Positive | 5
Very positive | 10
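In code, this weighting is just a lookup table followed by an average. The Python sketch below applies it to the sentiment labels of the six example rows from the input table; `numeric_sentiment` is a hypothetical helper name, and the classification itself (done by Stanford’s software) is not reproduced here.

```python
from statistics import mean

# The arbitrarily chosen weights from the table above.
WEIGHTS = {
    "Very negative": -10,
    "Negative": -5,
    "Neutral": 0,
    "Positive": 5,
    "Very positive": 10,
}

# Sentiment labels of the six example rows shown earlier.
labels = ["Positive", "Positive", "Neutral", "Positive", "Neutral", "Neutral"]


def numeric_sentiment(label):
    """Convert a sentiment label to its numeric weight (KeyError if unknown)."""
    return WEIGHTS[label]


scores = [numeric_sentiment(label) for label in labels]   # [5, 5, 0, 5, 0, 0]
average_sentiment = mean(scores)                          # 2.5
```

Because the weights are arbitrary, the averages are only meaningful relative to each other (one sender versus another, or one day versus the next), not as absolute measurements.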

Sentiment Summary

With all that in mind, here’s the initial breakdown of feeling expressed in the data set.

Sentiment Rating | All Chats | Chats sent by Sam Schwartz | Chats sent by Micah Orton
Min. | -10.00000 | -10.00000 | -10.0000
1st Qu. | -5.00000 | -5.00000 | 0.0000
Median | 0.00000 | 0.00000 | 0.0000
Mean | -0.06362 | 0.00941 | -0.1318
3rd Qu. | 0.00000 | 5.00000 | 0.0000
Max. | 10.00000 | 10.00000 | 10.0000

Chats By Sentiment

Chats by Sentiment and Sender

Average Sentiment Per Day

Average Sentiment Per Day By Sender

Conclusion

Thank you for checking out this document. If you have any questions, recommendations, or concerns, shoot me an email at samorschwartz@gmail.com.