Did Digg game its own system to benefit publisher partners?
Digg recently published a blog post titled “Digg’s Algorithmic Mystery Tour” on October 15th. While a Digg blog post is a normal thing, a post about the algorithm was very surprising to me. Why did Digg, which never bothers to blog about very visible changes or its numerous bugs and issues, decide to blog about its secret “algorithm”? They had never even discussed it verbally in public. Since the announcement, the front page stories have changed. A lot of sites that had rarely been seen on the front page of Digg since the v4 fiasco started to resurface frequently. Many diggers noticed this, and there has been a lot of chatter about it. Many diggers wondered how such a small ‘tweak’ to the algorithm could cause so much change to the front page. I wondered this as well.
Because I found this so strange, I wanted to find an answer, at least for my own edification. The main reason for my curiosity was that most of these sites never got enough diggs prior to the ‘algo’ change to make it anywhere close to “top news.” I play all the time with the various social network APIs (Digg, Twitter, Google, Facebook, etc.); this is more of a hobby for me. What started as a casual search for an answer has turned out to be what I think is a major revelation, big enough for me to go public about it. Essentially, what I think I have discovered is that someone has created dozens of accounts to make sure that Digg’s publishing partners get front pages on the site, so that those sites get Digg referrals. They certainly had not been getting many Digg referrals in the several weeks before the recent ‘algo’ change.
Some disclaimers & notes: I am no seasoned writer, so please pardon my poor language usage. All data I refer to here can be made available by tweeting me. The data I am using is current (as of 11pm CST, October 23rd); however, I will continue to pull and maintain the data from the API for a while. The API has a lot of limitations and I have tried my best to work around them. I will be writing in the same order in which I drilled into the data, and any inferences or opinions I make will be clearly identified so that the difference between facts and opinions is obvious. Some data/graphs I present may be irrelevant to the crux of the matter under discussion, but interesting nevertheless. In this report’s original form, I divided it into several pages, not just because it was very long, but because I would like your comments on the various pages, as they each show a different set of data. I would really like to hear whether you agree with my viewpoints or not.
Did the new ‘algo’ really change anything?
I began with downloading the details about all “top news stories” in the month of October, to see how many stories have “popped” and to get an idea of the variety in the stories that have reached the front page during the course of the month.
There is nothing out of the ordinary to be noticed in the chart above. And here is a bit more information on the top news domains. Interactive Data
Now, what are the top 20 domains (by the number of stories in a given time period) in three different time ranges?
The items highlighted in yellow in the third column are new entrants into the “top 20 league” since the ‘algo’ change.
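For anyone who wants to repeat this comparison on their own copy of the data, here is a minimal sketch of the idea. It assumes each top news story has already been reduced to a `(domain, promote_time)` pair; the function names, the sample records and the Oct 15th cut-off windows are illustrative assumptions, not the exact ranges or code behind the table.

```python
from collections import Counter
from datetime import datetime

def top_domains(stories, start, end, n=20):
    """Return the n most frequent domains among stories promoted in [start, end)."""
    counts = Counter(domain for domain, promoted in stories if start <= promoted < end)
    return [domain for domain, _ in counts.most_common(n)]

def new_entrants(earlier_top, later_top):
    """Domains in the later ranking that were absent from the earlier one."""
    earlier = set(earlier_top)
    return [d for d in later_top if d not in earlier]

# Made-up sample records for illustration only, not real Digg data.
sample = [
    ("example-a.com", datetime(2010, 10, 3)),
    ("example-a.com", datetime(2010, 10, 9)),
    ("example-b.com", datetime(2010, 10, 12)),
    ("example-a.com", datetime(2010, 10, 17)),
    ("example-c.com", datetime(2010, 10, 18)),
    ("example-c.com", datetime(2010, 10, 20)),
]

cutoff = datetime(2010, 10, 15)  # assumed split point: the day of the blog post
before = top_domains(sample, datetime(2010, 10, 1), cutoff)
after = top_domains(sample, cutoff, datetime(2010, 10, 24))
print(new_entrants(before, after))
```

Only the domains that appear in the later ranking but not in the earlier one are reported, which corresponds to the highlighted rows.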
Did only the domains change or much more?
Now that it is obvious that the ‘algo’ change on Oct 15th has affected the chance of certain domains popping, I went on to download ALL the UPCOMING diggs (227,936 diggs) made to any story (2390 stories) that eventually entered top news since Oct 1st. I should note here that, as explained in the Digg blog post, certain stories have had the “promote date” timestamp updated, and thus a few diggs made between the 1st pop and the 2nd pop are included as well, as there is no means to exclude them.
As I mentioned before, out of curiosity to see who was digging the upcoming stories, I probed the API to find the top 100 people casting upcoming diggs on these top news stories. I also tried the same three time ranges as in step 1. Now is when interesting data started to pour in.
As in the previous table, I have highlighted the new “entrants” in column 3. Suddenly, new, dedicated users have appeared. All those new domains in the previous table might not have been suspicious. But what we now see is that these new users are responsible ONLY for voting on the publisher domains. These users all have similar profiles, similar names, and much more (similarities to follow in further pages).
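The “ONLY voting on the publisher domains” check can be sketched roughly as follows. This is an illustration under assumptions: the upcoming diggs are taken to be available as `(username, domain)` pairs, and `exclusive_voters`, the `min_diggs` threshold and the sample names are all hypothetical, not taken from the Digg API or from my actual scripts.

```python
from collections import defaultdict

def exclusive_voters(diggs, target_domains, min_diggs=2):
    """Users with at least min_diggs diggs, all of them on target_domains."""
    domains_by_user = defaultdict(list)
    for user, domain in diggs:
        domains_by_user[user].append(domain)
    return sorted(
        user
        for user, domains in domains_by_user.items()
        if len(domains) >= min_diggs and all(d in target_domains for d in domains)
    )

# Made-up sample data for illustration only.
partner_domains = {"partner-one.example", "partner-two.example"}
sample_diggs = [
    ("user_a", "partner-one.example"),
    ("user_a", "partner-two.example"),
    ("user_b", "partner-one.example"),
    ("user_b", "some-blog.example"),
    ("user_c", "partner-two.example"),
]

print(exclusive_voters(sample_diggs, partner_domains))
```

The `min_diggs` threshold keeps a user with a single stray digg from being flagged; a user who also votes on other domains drops out of the list entirely.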
At this point, if you are still with me, I have a request for you. Go ahead and download some of the data from the Digg servers and store it, as I fear that the data might vanish. I am only listing a small sample of the data, which has the mystery buried in it. All you need to do is right-click each of the following links and save the (XML) files to your computer. Should the data vanish “mysteriously,” you will have proof that the data did exist. If you fear for the security of the links, note that all of them point to Digg servers.
Who and how many are they?
Doing some pattern matching, 159 users are suspicious. And here they are with links to their profiles:
Now that 159 suspicious users have been found, note the similarities in their profiles. If you have not visited their profiles, please do so now, to see that all of them are “new” and that they do nothing but digg (no comments, submissions, etc.). Sample profile screenshots:
Look at the last profile, no followers, no followings, but digging a very select set of stories.
(From now, the 159 suspicious users will be called suspects)
So, what have they been digging? Maybe just spammers!
How are these suspects’ diggs spread across the various domains in the 2390 stories we are analyzing? The data used is from Oct 1st onward. This “operation” appears to have begun only after Oct 15th.
As can be seen in the table above, this seems to be a very widespread attempt, not targeting any single domain. One thing they do have in common though is that they are Digg publishing partners. Did you notice one notable absentee on the domain list? Hint… it starts with a “mash” and ends with “able” ;) I know they caused controversy when they were all over the front page during the transition to Version 4. Maybe that is the reason why they are not included in this front page effort at this time.
These accounts are not “spammers.” What have they achieved?
How many pops did these domains get because of diggs from these suspect accounts? 229. However, just one digg from one of these IDs should not by itself make a story suspicious, so I am now going to list all of the 229 stories with the number of suspect diggs and non-suspect diggs for each. There’s no real way to know whether Digg is responsible for this or not. Remember that due to the ‘promote_date confusion’ in Digg data, the total number of upcoming diggs for a few stories might not be accurate. Also remember that you are only seeing data as of 11pm CST on Oct 23rd, while this is still continuing to happen.
Link to interactive and detailed version of this data.
Now that each of the stories has been given an “ID”, we will use it for reference. Did you notice that the story with ID 1 got only 1 actual digg? Yes, all it took was for guardian.co.uk to submit the story, and the rest was taken care of (but by whom?). Any story with 100 or more upcoming diggs surely has the promote_date problem in it, so let’s leave those stories aside for now and crunch a few numbers. Stories 209, 219 and 221 were also excluded, as they are clear outliers. For the rest of the stories (leaving out 31+3), 10016 suspect diggs were cast. They also had about 4055 non-suspect diggs, but this 4055 is far higher than the reality, due to the promote_date bug/feature. To get a more reasonable estimate of the problem, let’s now consider only stories which needed 60 or fewer upcoming diggs, as these stories are clearly not affected by the promote_date bug. In this case, 986 out of 1257 diggs were suspicious; that is, 78.44% of the diggs on these stories are suspicious.
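The percentage above is easy to re-check, and the same filter can be applied to any copy of the story table. The sketch below assumes each story is represented as a `(suspect_diggs, non_suspect_diggs)` pair; `suspect_share` and the three sample rows are made up for illustration, while the 986 and 1257 figures are the ones quoted above.

```python
def suspect_share(stories, max_upcoming=60):
    """Sum suspect and total diggs over stories with at most max_upcoming upcoming diggs.

    Returns (suspect_diggs, total_diggs, suspect_percentage).
    """
    kept = [(s, n) for s, n in stories if s + n <= max_upcoming]
    suspect = sum(s for s, _ in kept)
    total = sum(s + n for s, n in kept)
    percentage = 100.0 * suspect / total if total else 0.0
    return suspect, total, percentage

# Made-up rows: (suspect diggs, non-suspect diggs). The third row exceeds
# the 60-digg cap and is dropped, mimicking the promote_date exclusion.
sample_rows = [(40, 10), (30, 20), (20, 150)]
print(suspect_share(sample_rows))

# Re-checking the figure quoted in the text: 986 suspect diggs out of 1257.
print(round(100 * 986 / 1257, 2))
```

The cap of 60 is a parameter, so it is simple to see how sensitive the share is to where the promote_date cut-off is drawn.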
There are a few interesting domains, submitters and stories to note here, which are discussed in a later page.
Is there a pattern to their digging?
So, is there a time pattern to these suspicious diggs? How would these stories compare to other regular stories? I am now showing some charts, with all of the stories in them needing 63 diggs to enter the top news. The 63 is just arbitrary, but useful in comparing the data. There are 8 suspicious stories with 63 upcoming diggs, so I am randomly picking 8 non-suspicious stories as well.
X Axis below is the number of diggs (until reaching front page) and Y axis is the number of minutes for each of those diggs — for the 16 selected stories.
As can be seen above, 5 out of the 8 suspect stories are obviously distinct from the other stories. Though it is not obvious from the chart, once the suspect users get into action, the respective story enters the top news section in about 100 minutes on average. This graph will be used for a later discussion.
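One way to measure “minutes from the first suspect digg until the pop” is sketched below. It rests on assumptions that should be stated loudly: each story is a list of `(minutes_since_submission, is_suspect)` digg events, the time of the last upcoming digg is used as a stand-in for the promotion time, and the function name and sample numbers are invented for illustration.

```python
from statistics import mean

def minutes_from_first_suspect_to_pop(digg_events):
    """Minutes between the first suspect digg and the last upcoming digg.

    digg_events: list of (minutes_since_submission, is_suspect) tuples.
    Returns None when the story has no suspect diggs.
    """
    suspect_times = [t for t, is_suspect in digg_events if is_suspect]
    if not suspect_times:
        return None
    pop_time = max(t for t, _ in digg_events)  # assumption: last digg ~ promotion
    return pop_time - min(suspect_times)

# Made-up stories for illustration only.
story_a = [(0, False), (300, True), (320, True), (390, True)]
story_b = [(0, False), (50, False), (200, True), (310, True)]
story_c = [(0, False), (45, False), (600, False)]

gaps = [
    gap
    for gap in map(minutes_from_first_suspect_to_pop, [story_a, story_b, story_c])
    if gap is not None
]
print(gaps, mean(gaps))
```

Stories without any suspect diggs return `None` and are left out of the average, so they do not drag the figure toward zero.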
The 16 stories used for this graph are:
Who could have been doing it?
Without pointing fingers directly at anyone, let’s get into some analysis.
- As seen earlier, all of the domains are widely spread-out with no clear link between them all, except for the FACT that they are all “preferred publishing partners” of Digg.
- Why did Digg out of nowhere decide to do a “Digg’s Algorithmic Mystery Tour” blog post, when it had never spoken about the algorithm before? How did it even expect that everyone would accept that post at face value?
- Why is it that the blog post came on Oct 15th and this started immediately?
- Now see the list of 229 stories: the story with ID 72, from a tumblr blog, was submitted by “daryapino”. This is none other than the girlfriend of the ‘Founder & Seller of the Soul of Digg’, Kevinrose. Note that the story is heavily dugg and commented on by Digg staff.
- Now see story 192, submitted by Kevinrose. Recently Kevin himself was getting only around 30 diggs per story (examples: http://digg.com/news/business/bezos_backed_doxo_launches_paperless_billing_service_2, http://digg.com/news/lifestyle/a_house_by_the_park ), but now ….. you know he is the founder.
- Why did Digg decide to stop showing who dugg a story? They never responded to user feedback on this. When you hide anything, people will think you are behind any problems connected to your act of hiding.
Clearly the above facts point the finger at Digg, with no one else as a suspect. However, there is no concrete evidence to say that Digg is 100% responsible, so I will only say that Digg is the one and only prime suspect here. This also fits well with their urge to get Diggable ads out … well, Digg, we just realized that most stories in the “top news” are ads, thanks!
So now what?
I could keep writing about this forever …. but nothing is going to change. This is happening even this minute (4:27 pm CST, Oct 25th), but I must end it.
I am going to split this into two pieces – The piece titled “By Digg” is meant to be read if you think Digg has a direct involvement in this (as I do) and the piece titled “Not by Digg” is meant to be read if you think digg is not involved in this.
And what a coincidence: I would like to point you to an audio quote (http://www.youtube.com/watch?v=Ay8_cKWrOqw#t=62m50s ) by none other than myself, on the SocialBlade show – just wow to myself ;)
If you think Digg did this:
I am sure someone from Digg is going to read this. So, I will address my points here to Digg itself.
Digg, to put it very simply, this is like the US Treasury printing fake dollars, exactly the same. You have lost every iota of credibility you may have had with your users. Good job!
You messed up V4, you failed to listen to your users, and only after a long time did you agree that you had messed it up. You promised to listen to your users and are pretending to listen. Except for minor changes here and there, there is still so much to be done. Instead of really working on those changes and coming back with integrity, why did you choose to use such a cheesy method? Did you assume that all of your “several million” users are idiots? You have now failed not only traffic-wise, user-wise, etc.; you have failed as a business.
Integrity in a business is the first step towards success. A few small tricks here and there to keep things running may be seen as “clever,” but cheating at the core of your business is an absolute crime. What caused this? VC pressure? The urge not to fail? Given that I have spent hours and hours getting this out, be bold and give me a reply.
To the VC funders of Digg, I think you just lost your last hope!
If you think Digg didn’t do this:
So you think this has not all been done by Digg, and I see that you would give them the benefit of the doubt. But this has been happening daily since Oct 15th. Why did Digg not spot this? Can you answer that? The chart I showed earlier shows that the curves for the suspicious stories are clearly way off. Digg keeps boasting about its complicated “algo” and monitoring system. This is so widespread, and they could not catch or stop it. Now how would you trust their algo or monitoring systems? Why would you believe them? Answer for yourself, or post your answers as comments here.
Giving a Fair Chance
Now that I am accusing Digg of something huge, I am going to give them a fair chance to explain their side before I publish this. However, I strongly suspect that data destruction might happen. So, I am going to record a video (don’t watch it, unless you have nothing else to do - http://www.youtube.com/watch?v=aQH5oC-iVnc ) showing the data being downloaded from Digg servers and stored (229 XML files). I will also upload the files to a public server ( http://www.megaupload.com/?d=LUP0WFJ4 ) as proof. That way, if they ever delete data, you can trust my copy.
If I hear back from Digg, one more page will be added.
Passionate Digg User,
I have contacted digg ….
I asked for a phone number for the Communication Director, but was told that they could handle this by email. I sent the email below:
Here is the link:
Runs several pages, I would appreciate a comment within about 30 mins, as I mention in the article – I fear data destruction will happen.
I then got a reply in 20 mins:
That is a lot of information to assess in such a short period of time. Unfortunately, we’re not going to be able to get back to you with a comment within 30 minutes.
Realizing that 30 mins might have been too short, I responded after 15 minutes with:
Is there a reasonable time you want me to wait for?
I am going to wait until 6:34 CST, that is, 1 hr from when Digg got a first chance to read it. If by then they do not give me a reasonable time to wait, I will go ahead and make this link public.