Accuracy Improved - who you interact with


Posted by Si Dawson on 10/12/09 in Improvements

I've been wondering for some time now how to deal with the "Hey! My friend is on the report!" issue.

The Twit Cleaner looks at a limited set of data - it's simply not practical to go back through every tweet someone has ever written, for example. As it is, some reports require us to download & analyse up to 3 Gigs of data. It can get pretty crazy.

What I have done instead is the following:

1. Look at who @mentions you. Obviously spammers do this all the time, so it's not foolproof. However, an @mention will now take someone completely off the "never interacts" part of the report. If they are spammers, they'll likely show up elsewhere.

2. Look at who you @reply to or RT. If you're RTing or @replying to someone, then obviously they're significant to you, so that person is now removed from the report completely.
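To make the two rules above concrete, here's a minimal sketch in Python. It is not The Twit Cleaner's actual code: the `Interactions` class, the `apply_interaction_rules` function, and the idea of a report as a mapping from category name to usernames are all assumptions made up for illustration. Only the "never interacts" label comes from the report itself.

```python
from dataclasses import dataclass, field


@dataclass
class Interactions:
    """Hypothetical container for the interaction data described above."""
    # Accounts that have @mentioned you.
    mentioned_you: set = field(default_factory=set)
    # Accounts you have @replied to or RTed.
    you_replied_or_rt: set = field(default_factory=set)


def apply_interaction_rules(report, interactions):
    """Return a filtered copy of `report` (category name -> set of usernames)."""
    filtered = {}
    for category, users in report.items():
        # Rule 2: anyone you @reply to or RT is dropped from every category.
        kept = set(users) - interactions.you_replied_or_rt
        # Rule 1: an @mention only clears someone from "never interacts",
        # so a spammer who @mentions you can still show up elsewhere.
        if category == "never interacts":
            kept -= interactions.mentioned_you
        filtered[category] = kept
    return filtered


# Example with made-up usernames and a made-up second category ("other").
example_report = {
    "never interacts": {"alice", "bob", "spambot"},
    "other": {"spambot", "carol"},
}
example_interactions = Interactions(
    mentioned_you={"bob", "spambot"},
    you_replied_or_rt={"carol"},
)
example_result = apply_interaction_rules(example_report, example_interactions)
```

In this example "spambot" is cleared from "never interacts" because it @mentioned you, but it stays in the other category, while "carol" disappears from the report entirely because you replied to or RTed her.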

Unfortunately, this isn't quite as awesome as I'd like, mainly due to Twitter flakiness, but it's a definite improvement. What flakiness? Here's a typical conversation I recorded earlier:

Me: I'd like the last few hundred tweets this person made please!

Twitter: The last 57 you say? Sure thing! Here they are!

I have a plan in place to work around this, but it involves rewriting the entire back-end in a different language & moving it all to a different operating system altogether. This might take a little time. Heh.

Accuracy Improved - removing The Secretive


Posted by Si Dawson on 09/12/09 in Improvements

It was brought to my attention recently that people were using the bit.ly & ow.ly URL shortening services to track who clicked through from their profiles.

I have to admit this is something that hadn't occurred to me when I first wrote The Twit Cleaner (I didn't realise some URL shorteners did this).

I investigated a bit more deeply, & my conclusion is that spammers in general aren't using shortened URLs in their profiles so much these days. In fact, the number of innocent people appearing on the report, compared to the number of actual "bad guys", was way too high.

So, I've removed that sub category from the report altogether. No point in just showing a bunch of shy people (or stats geeks like myself) on there. I've also removed the "No bio no url" sub category, since there was a similarly high false positive rate there. This means the entire "Secretive" category has now disappeared. One or two profiles will still show for the next little while, as they drain slowly from the cache, but their incidence will be drastically reduced.

In general, any spammers or other dodgy people will be adequately caught with the other criteria on the report.

I have more significant things in the pipeline to improve accuracy (particularly identifying people you care most about), but I have some technical hiccups (broken 3rd party libraries) to work around first. More on that later.