A Patch for Social Media Bankruptcy

As evidenced in blog posts predating this WordPress account, I have a keen interest in social media. In particular, I would like to make it less noisy and more usable “at a glance” than it is today.

A recent post from tacit reminded me that, in some respects, this problem has already been solved. Email messages are remarkably similar to posts on Twitter, Tumblr, and Facebook. That isn’t a huge surprise given their Usenet and BBS roots, which are themselves rooted in much older methods of human communication. Email messages are, furthermore, just as disposable, just as informative, and just as easy to abuse as the sites we now use as aggregators for our communication.

So, I wonder: why don’t we see more spam filtering in social media? For that matter, why aren’t we using the same class of machine learning algorithms to determine legitimate interest instead of “driving engagement”?

Is it because the gatekeepers have a vested interest in serving advertisements? Well, in part, yes: by maximizing the amount of time people spend on the site, reading duplicate but engaging information, they get more opportunities to gather data and serve advertising. But there’s a second side to this: pattern recognition is difficult to get right generically. For a simple example, try Google Voice transcription or Google Translate. They’re serviceable but often hilariously wrong.*

But there’s nothing stopping client authors from writing their own classifier for each user. Imagine this for a moment: a theoretical Twitter client where each tweet gives you an “interested” and an “uninterested” button. “Interested” brings in more content similar to what you selected; “uninterested” brings in less. Both affect future tweet selection. Both also have a configurable fall-off and the ability to randomly bypass the filter as desired, so even as your interests change over time, you’re guaranteed not to get a perfect echo chamber (unless you want one, of course).
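To make the fall-off and the random bypass concrete, here is a minimal sketch. The names `decayed_weight` and `should_show`, the 30-day half-life, and the 10% bypass rate are all assumptions of mine for illustration; nothing here is a real client API.

```python
import random

def decayed_weight(age_days, half_life_days=30.0):
    """Weight of a vote cast age_days ago; it halves every half_life_days.

    The exponential form and the 30-day default are assumptions; any
    configurable fall-off would serve the same purpose.
    """
    return 0.5 ** (age_days / half_life_days)

def should_show(interest, threshold=0.5, bypass_rate=0.1, rng=random):
    """Decide whether to display a post with the given interest score.

    interest is assumed to be a probability in [0, 1] from some classifier.
    With probability bypass_rate the filter is skipped entirely, which is
    what keeps the feed from becoming a perfect echo chamber.
    """
    if rng.random() < bypass_rate:
        return True  # random bypass: let the post through regardless
    return interest >= threshold
```

Setting `bypass_rate=0.0` gives you the echo chamber, if you want one; raising it toward 1.0 turns the filter off by degrees.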

And I’m not even thinking of anything fancy here. Under the hood is just a naive Bayes classifier trained on your selections. If you want to be fancy, consider using information entropy as a low-pass filter. Maybe you want to use your favorite machine learning algorithm instead? Sure, go for it.
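For the curious, a naive Bayes classifier really is small. What follows is a rough sketch rather than a finished design: the tokenizer, the class name `NaiveBayes`, and the add-one smoothing are my own assumptions, chosen for brevity.

```python
import math
import re
from collections import Counter

LABELS = ("interested", "uninterested")

def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9#@']+", text.lower())

class NaiveBayes:
    """Two-class multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = {label: Counter() for label in LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        """Record one vote: label is 'interested' or 'uninterested'."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def log_odds(self, text):
        """Log-odds of interested vs. uninterested; 0.0 means neutral."""
        vocab = set(self.word_counts["interested"]) | set(self.word_counts["uninterested"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in LABELS:
            total_words = sum(self.word_counts[label].values())
            # Smoothed prior, so a class with no votes never yields log(0).
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for word in tokenize(text):
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            scores[label] = score
        return scores["interested"] - scores["uninterested"]

    def probability(self, text):
        """Probability in (0, 1) that the text is interesting."""
        # Clamp so math.exp cannot overflow on very long messages.
        z = max(-50.0, min(50.0, self.log_odds(text)))
        return 1.0 / (1.0 + math.exp(-z))
```

An untrained classifier returns exactly 0.5 for everything, which is a convenient definition of "neutral". Each button press is one call to `train`.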

My point is, of the gamut of social media clients available, very few bother to give us the tools necessary to manage information. I find this incredibly strange: either we’ve forgotten why these tools were relevant in the first place, or social media has forced us to recontextualize what is fundamentally the same problem as spam in email.

So, I intend to try this for my own purposes. Since I interact with the majority of short-form social media through IRC (using the truly wonderful irssi and BitlBee), I’m going to write this into a plugin that color-codes incoming messages by how interesting they are. At any time, I can add a message’s text to my classifier by just passing the message through it. All unclassified incoming messages push the classification closer to neutral, while all “interested” and “uninterested” votes push similar messages closer to bold or invisible (but never so invisible that mouse highlighting can’t unmask them).
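Roughly, the display side could look like the sketch below. It is written in Python with ANSI 256-color escapes purely for illustration; an actual irssi plugin would use irssi's own scripting and color facilities, which I am not showing here. The function names, the drift rate, the bold cutoff, and the grey floor of 236 are all assumptions.

```python
def nudge_toward_neutral(score, rate=0.01, neutral=0.5):
    """Move a score a small fraction of the way back toward neutral.

    This is one possible reading of "unclassified messages push
    classification closer to neutral"; the rate is an arbitrary assumption.
    """
    return score + (neutral - score) * rate

def style(message, score, min_grey=236, max_grey=255):
    """Wrap a message in ANSI escapes according to its interest score.

    score is assumed to be in [0, 1]. Colors 232-255 are the greyscale ramp
    of the xterm 256-color palette. min_grey is a floor so that the least
    interesting text stays faint rather than truly invisible.
    """
    score = max(0.0, min(1.0, score))
    grey = min_grey + round(score * (max_grey - min_grey))
    bold = "\x1b[1m" if score >= 0.9 else ""
    return f"{bold}\x1b[38;5;{grey}m{message}\x1b[0m"
```

On a dark background, a score near 0 renders as dim grey and a score near 1 as bold white, with everything else falling on the ramp in between.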

And, we’ll see how that goes. If it works, I might advocate for other people to try it (or patch it into clients until more do).


* Conversely, this is why products like Dragon NaturallySpeaking are so good. They train on a set exclusive to the user, which makes them better at picking up deviations in patterns of speech, accents, etc. If Google or any of the other large organizations trained on a set per user, their results could improve.
