I'm Christopher Tarry, a second-year computer science student at the University of Michigan and an intern at Sia. The opinions and work here are my own and are not sponsored by my university or employer. I made this site using a couple hundred lines of Python and some extra time I had on the weekends.

It works using Burrows' Delta. Essentially, you find the n most common words in a dataset, then for each user compute their frequency for each of those words. For example, if n=3, your most common words are ["the", "a", "I"], and your corpus is "the fish in the ocean is a large one", then your frequency array would be [2/9, 1/9, 0/9]. You then compare each user's normalized frequency array against every other user's with cosine similarity and print the 20 other users with the highest similarity scores.
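The frequency-and-similarity step above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual code; the function names are my own for this example, and tokenization here is just a lowercase whitespace split (so "I" matches as "i").

```python
from collections import Counter
import math

def frequency_vector(text, top_words):
    # Relative frequency of each common word in the text.
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in top_words]

def cosine_similarity(a, b):
    # Cosine of the angle between two frequency vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

top_words = ["the", "a", "i"]
corpus = "the fish in the ocean is a large one"
print(frequency_vector(corpus, top_words))  # [2/9, 1/9, 0.0]
```

In practice you would build one such vector per user from all of their comments, then rank every other user by cosine similarity against it.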

I tested it by splitting each user's comments in half: a test set with one half and a training set with the other. My tests on a subset of the comments with 5,000 users found about 90% accuracy when looking only at the most similar user, and 99% when the correct user appeared anywhere in the top 20 by similarity score. The current database has about 78,000 users.
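The two accuracy numbers above are what's usually called top-1 and top-k accuracy. A hypothetical sketch of how you might compute them, assuming you already have, for each test user, a list of (candidate, similarity) pairs against the training set:

```python
def top_k_accuracy(similarities, k):
    # similarities: dict mapping each test user to a list of
    # (candidate_id, score) pairs against the training set.
    # A "hit" means the user's own training profile ranks in the top k.
    hits = 0
    for user, scored in similarities.items():
        ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
        top_candidates = [cid for cid, _ in ranked[:k]]
        if user in top_candidates:
            hits += 1
    return hits / len(similarities)

# Toy example with two users and made-up scores:
sims = {
    "alice": [("alice", 0.9), ("bob", 0.5)],
    "bob": [("alice", 0.7), ("bob", 0.6)],
}
print(top_k_accuracy(sims, 1))  # 0.5 (only alice is ranked first)
print(top_k_accuracy(sims, 2))  # 1.0 (both appear in the top 2)
```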

My main motivation behind making this site was to make people aware of how easy this is. I'm just some guy who knows a little bit of Python and has some time on his hands. Imagine how much more accurate a company or government with millions of dollars and 50 PhD linguists on staff would be.

You can email c\hrisjtarry @ gmail.com if you have any questions.