Dumplings from this Panda!

Oct 25

One Page Version of - Mystery Behind the “Digg’s Algorithmic Mystery Tour”

Digg published a blog post titled “Digg’s Algorithmic Mystery Tour” on October 15th. While a digg blog post is nothing unusual, a post about the algorithm was very surprising to me. How come Digg, which never bothers to blog about very visible changes, numerous bugs & issues, decided to blog about its most hidden “algorithm”? They had never even discussed it verbally in public. And since then the front page stories have changed. A lot of sites that were rarely found on the front page of digg since the v4 fiasco started to resurface frequently. Many diggers noticed this and there has been a lot of chatter about it. Many diggers wonder how a small tweak to the algo could change the front page so much, and so “did” I.

Finding this strange, I wanted to find an answer to this question, at least for myself. The main reason for my curiosity was that most of these sites did not get enough diggs prior to the algo change to make it anywhere close to the “top news”. I play a lot with various APIs (Digg, Twitter, Google, Facebook etc.), mostly as a hobby. What I started as a casual search for an answer has now turned into a major revelation, big enough for me to go public about it.

Some disclaimers & notes: I am no seasoned writer, so pardon my poor language usage. All data I refer to here is available on request by tweeting me. The data I am using is current (as of 11pm CST, October 23rd); however, I will continue to pull and maintain the data from the API for a while. The API has a lot of limitations and I have tried my best to work around them. I will be writing in the same order in which I drilled into the data, and any inferences or opinions I make will be clearly identified so that the facts remain facts. Some data/graphs I present may be irrelevant to the crux of the matter under discussion, but interesting nevertheless. I have separated this into several pages, not just because it is going to be very long, but because I would like your comments on the various pages, as they each show a different set of data. I would really like to hear whether you agree with my viewpoints or not.


Did the new algo really change anything?

I began by downloading the details of all “top news stories” in the month of October, to see how many stories have “popped” and how the popped domains are spread across the month.


There is nothing in particular to notice in the chart above. Here is a bit more information on the top news domains: Interactive Data

Now, what are the top 20 domains (by the number of stories in a given time period) in three different time ranges:

The items highlighted in yellow in the third column are new entrants into the “top 20 league” since the algo change.
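The comparison above can be sketched roughly as follows. This is only an illustration, not the script used for this post: the record layout, the field names (`domain`, `promoted`), the window boundaries and the sample data are all assumptions made up for the example.

```python
from collections import Counter
from datetime import date

def top_domains(stories, start, end, n=20):
    """Return the n domains with the most promoted stories in [start, end]."""
    counts = Counter(
        s["domain"] for s in stories if start <= s["promoted"] <= end
    )
    return [domain for domain, _ in counts.most_common(n)]

def new_entrants(before, after):
    """Domains in the 'after' ranking that were absent from 'before'."""
    seen = set(before)
    return [d for d in after if d not in seen]

# Made-up sample records (hypothetical domains and dates)
stories = [
    {"domain": "example-a.com", "promoted": date(2010, 10, 3)},
    {"domain": "example-a.com", "promoted": date(2010, 10, 9)},
    {"domain": "example-b.com", "promoted": date(2010, 10, 12)},
    {"domain": "example-a.com", "promoted": date(2010, 10, 17)},
    {"domain": "example-c.com", "promoted": date(2010, 10, 18)},
    {"domain": "example-c.com", "promoted": date(2010, 10, 20)},
]

# Assumed split: before vs. from the Oct 15th blog post onward
before = top_domains(stories, date(2010, 10, 1), date(2010, 10, 14))
after = top_domains(stories, date(2010, 10, 15), date(2010, 10, 23))
entrants = new_entrants(before, after)
print(entrants)
```

The same two helpers would work for ranking diggers instead of domains by counting a user field rather than `domain`.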

Did only the domains change or much more?

Now that it is obvious that the algo change on Oct 15th has affected the chance of certain domains popping, I went on to download ALL the UPCOMING diggs (227,936 diggs) made to any story (2390 stories) which eventually entered top news since Oct 1st. I should note here that, as explained in the digg blog post, certain stories have had the “promote date” timestamp updated, and thus a few diggs made between the 1st pop and the 2nd pop are included as well, as there is no means to exclude them.

Curious to see who is digging the upcoming stories, I queried for the top 100 people casting upcoming diggs on these top news stories. I also tried the same three time ranges as in step 1, and this is when interesting data started to pour in.


Much like the previous table, I have highlighted the new “entrants” in column 3. But wait: new entrants in the previous table, about the domains with the most stories after an algo change, might make sense, but how do so many new users, from exactly the day after the “mysterious” blog post, make sense? If you can connect the dots and convince me that there is nothing suspicious in this FACT that the algo change has brought in a big new bunch of dedicated users, all with similar profiles, similar names, and much more (similarities to follow in further pages) …. I bow to you.

At this point, if you are still with me, I have a request for you. Go ahead and download some of this data from the digg servers and store it, as I fear that the data might vanish. I am only listing a small sample of the data, which has the mystery buried in it. All you have to do is right click each of the following links and save the (xml) files to your computer. Should the data vanish “mysteriously”, you will have proof that the data did exist. If you fear for the security of the links, note that all of them point to digg servers.

1
2
3
4
5
6
7
8
9
10

Who and how many are they?

After some pattern matching, I found 159 suspicious users. Here they are, with links to their profiles:

a1
a3
a5
d10
d11
d12
d13
d14
d15
d16
d17
d2
d4
d5
d6
d8
d9
dd1
dd13
dd14
dd15
dd16
dd17
dd18
dd19
dd2
dd20
dd21
dd23
dd26
dd27
dd28
dd3
dd30
dd33
dd34
dd35
dd36
dd37
dd38
dd39
dd4
dd41
dd42
dd43
dd45
dd46
dd47
dd5
dd6
dd7
dd8
dd9
diggerz10
diggerz11
diggerz13
diggerz14
diggerz16
diggerz17
diggerz18
diggerz19
diggerz20
diggerz21
diggerz22
diggerz23
diggerz24
diggerz25
diggerz26
diggerz27
diggerz29
diggerz30
diggerz31
diggerz32
diggerz33
diggerz34
diggerz35
diggerz36
diggerz37
diggerz38
diggerz39
diggerz40
diggerz41
diggerz42
diggerz43
diggerz44
diggerz45
diggerz46
diggerz47
diggerz5
diggerz55
diggerz6
diggerz7
diggerz8
diggerz9
s1
s10
s11
s12
s13
s14
s3
s4
s5
s6
s7
s9
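The kind of name matching that surfaces accounts like these can be sketched as below. The prefixes are simply read off the list above; the exact rule behind the count of 159 is not spelled out in this post, so the regular expression is an assumption, and a name match by itself proves nothing about an account.

```python
import re

# Assumed pattern: one of the prefixes seen in the list, followed by digits only.
SUSPECT_NAME = re.compile(r"(?:a|s|d|dd|diggerz)\d+")

def suspicious_names(usernames):
    """Return the usernames whose whole name matches the assumed pattern."""
    return [u for u in usernames if SUSPECT_NAME.fullmatch(u)]

# Mixed sample: four names from the list plus three made-up non-matching names
sample = ["dd13", "diggerz47", "s9", "a1", "regularuser", "d10x", "digger12"]
matches = suspicious_names(sample)
print(matches)
```

Using `fullmatch` rather than `search` keeps ordinary usernames that merely contain a letter and a digit from being flagged.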

Now that 159 suspicious users have been found, note the similarities in their profiles. If you have not visited their profiles, please do so now, to see that all of them are “new” and that they do nothing but digg (no comments, submissions etc.). Sample profile screenshots:

Look at the last profile: no followers, no following, but digging a very select set of stories.

(From now, the 159 suspicious users will be called suspects)

So, what have they been digging? Maybe just spammers!

How are these suspects’ diggs spread across the various domains in the 2390 stories we are analyzing? The data used is from Oct 1st onward; however, this “operation” only began after Oct 15th.

Domain(count)
newsfeed.time.com (644)
dailymail.co.uk (578)
boingboing.net (461)
techcrunch.com (440)
telegraph.co.uk (408)
youtube.com (395)
huffingtonpost.com (378)
collegehumor.com (331)
slate.com (331)
wired.com (311)
arstechnica.com (295)
cbsnews.com (280)
bbc.co.uk (235)
maximumpc.com (232)
rawstory.com (190)
space.com (174)
gawker.com (168)
theonion.com (165)
news.discovery.com (159)
washingtonpost.com (147)
voices.washingtonpost.com (143)
newsweek.com (128)
livescience.com (122)
physorg.com (120)
news.nationalgeographic.com (118)
tpmlivewire.talkingpointsmemo. (118)
motherjones.com (115)
businessinsider.com (114)
engadget.com (113)
alternet.org (112)
i.imgur.com (112)
torrentfreak.com (110)
news.yahoo.com (108)
gizmodo.com (102)
funnyordie.com (102)
thedailybeast.com (99)
xkcd.com (91)
jalopnik.com (89)
news.cnet.com (86)
bloomberg.com (78)
greencarreports.com (65)
teamcoco.com (65)
news.com.au (65)
blogs.techrepublic.com.com (65)
tech.fortune.cnn.com (65)
abcnews.go.com (65)
novafm.com.au (64)
foxnews.com (64)
aolnews.com (63)
tuaw.com (63)
businessweek.com (63)
ucbcomedy.com (63)
io9.com (62)
buzzfeed.com (62)
guardian.co.uk (62)
holytaco.com (62)
scientificamerican.com (62)
spacefellowship.com (61)
salon.com (61)
ktla.com (60)
thefoxnation.com (60)
life.com (60)
msnbc.msn.com (60)
symmetrymagazine.org (60)
boston.com (60)
upi.com (59)
psychologytoday.com (58)
muslimswearingthings.tumblr.co (58)
myfoxdc.com (58)
reuters.com (57)
thelocal.se (56)
newgrounds.com (56)
tpmdc.talkingpointsmemo.com (55)
readwriteweb.com (55)
popsci.com (55)
expressjetpilots.com (55)
flickr.com (55)
bits.blogs.nytimes.com (54)
blogs.forbes.com (54)
indiareport.com (53)
religion.blogs.cnn.com (53)
warlogs.wikileaks.org (51)
cnn.com (50)
theappleblog.com (50)
kottke.org (49)
breitbart.com (48)
tokeofthetown.com (48)
generic1.tumblr.com (47)
blogs.discovermagazine.com (47)
theatlanticwire.com (47)
jezebel.com (46)
examiner.com (45)
npr.org (43)
treehugger.com (43)
zdnet.com (42)
spiegel.de (40)
holykaw.alltop.com (37)
blastr.com (35)
howtogeek.com (27)
ccinsider.comedycentral.com (26)
thesmokingjacket.com (25)
edition.cnn.com (22)
hollywoodreporter.com (5)
buzzll.com (1)

As can be seen in the table above, this is very clearly a widespread attempt, not targeting any single domain. My only conclusion/inference here is that the diggs have mostly gone towards “publishing partners”. Did you notice one notable absentee in this domain list? Hint: it starts with “mash” and ends with “able” ;) I know they did some “advice” posts to digg, and their being all over the front page was the trigger point for the recent fiasco. I am not sure why that particular blog is missing from this list, but its absence is obvious.

They are not spammers, what have they achieved?

How many pops did these domains get with diggs from these suspects? 229. However, just one digg from one of these IDs should not by itself make a story suspicious, so I am now going to list all of the 229 stories with the number of suspect diggs and non-suspect diggs. While it is clear whether a digg is suspect or non-suspect, remember that due to the promote_date confusion in the digg data, the total number of upcoming diggs for a few stories might not be accurate. Also remember that you are only seeing data as of 11pm CST on Oct 23rd, while this is still continuing to happen.

Link to interactive and detailed version of this data.




Now that each story has been given an “ID”, we will use it for reference. Did you notice that the story with ID 1 got only 1 actual digg? Yes, all it took was for guardian.co.uk to submit the story and the rest was taken care of (by whom?). Any story with 100 or more upcoming diggs surely has the promote_date problem in it, so let’s leave those stories out for now and crunch a few numbers. Stories 209, 219 and 221 were also excluded as they are clear outliers. For the rest of the stories (leaving out 31+3), 10016 suspect diggs were cast; they also had about 4055 non-suspect diggs, but this 4055 is far higher than the reality, due to the promote_date bug/feature. To get a more reasonable estimate of the problem, let’s now consider only stories which needed 60 or fewer upcoming diggs, as these stories are clearly not affected by the promote_date bug. In this case, 986 out of 1257 diggs were suspicious; that is, 78.44% of the diggs on these stories are suspicious.
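The percentage quoted above is simple arithmetic, and a small sketch makes it easy to re-check. Only the two totals from the paragraph are used here, since the per-story breakdown is not reproduced on this page; the helper name is made up for the example.

```python
def suspect_share(suspect_diggs, total_diggs):
    """Percentage of upcoming diggs that were cast by suspect accounts."""
    if total_diggs <= 0:
        raise ValueError("total_diggs must be positive")
    return 100.0 * suspect_diggs / total_diggs

# Totals for stories that needed 60 or fewer upcoming diggs
share = suspect_share(986, 1257)
print(f"{share:.2f}%")
```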

There are a few interesting domains, submitters and stories to note here, which are discussed in a later page.

Is there a pattern to their digging?

So, is there any time pattern among these suspicious diggs? How do these stories compare to other regular stories? I am now showing some charts in which every story needed 63 diggs to enter the top news. The 63 is arbitrary, but useful for comparing the data. There are 8 suspicious stories with 63 upcoming diggs, so I am randomly picking 8 non-suspicious stories as well.

The X axis below is the number of diggs (until reaching the front page) and the Y axis is the number of minutes for each of those diggs, for the 16 selected stories.



As can be seen above, 5 out of the 8 suspect stories are very clearly distinct from the other stories. Though it is not obvious from the graph, once the suspect users get into action, the respective story enters the top news in about 100 minutes on average. This graph will be used for a later discussion.
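One way to build curves like these is to turn each story's digg timestamps into minutes elapsed since the first digg, in digg order. This is a sketch under assumptions: the timestamps are taken to be Unix epoch seconds, the function names are made up, and the sample values are invented for illustration rather than taken from the 16 stories.

```python
def minutes_per_digg(timestamps):
    """Map digg timestamps (epoch seconds) to (digg number, minutes since first digg)."""
    ordered = sorted(timestamps)
    if not ordered:
        return []
    first = ordered[0]
    return [(i, (t - first) / 60.0) for i, t in enumerate(ordered, start=1)]

def minutes_to_reach(timestamps, needed):
    """Minutes from the first digg until the 'needed'-th digg, or None if never reached."""
    if needed < 1:
        raise ValueError("needed must be at least 1")
    curve = minutes_per_digg(timestamps)
    return curve[needed - 1][1] if len(curve) >= needed else None

# Invented timestamps: a slow start followed by a burst of diggs
diggs = [0, 600, 1800, 7200, 7260, 7320, 7380]
curve = minutes_per_digg(diggs)
print(curve[-1])
print(minutes_to_reach(diggs, 4))
```

Plotting the digg number against the minutes for each story then gives one line per story, so a sudden burst shows up as a nearly flat stretch of the curve.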

The 16 stories used for this graph are:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

Who could have been doing it?

Without pointing fingers directly at anyone, let’s get into some analysis.

  1. As seen earlier, the domains are widely spread out with no clear link between them, except for the FACT that they are all “preferred publishing partners” of digg.
  2. Why did digg out of nowhere decide to do a “Digg’s Algorithmic Mystery Tour” blog post, when it had never spoken about the algorithm before? How did it even expect everyone to accept that post at face value?
  3. Why is it that the blog post came on Oct 15th and this started immediately after?
  4. Now see the list of 229 stories: the story with ID 72, from a tumblr blog, was submitted by “daryapino”. This is none other than the girlfriend of the ‘Founder & Seller of the Soul of Digg’, Kevinrose. Note that the story is heavily dugg and commented on by digg staff.
  5. Now see story 192, submitted by Kevinrose. Recently Kevin himself was getting only around 30 diggs per story (examples: http://digg.com/news/business/bezos_backed_doxo_launches_paperless_billing_service_2, http://digg.com/news/lifestyle/a_house_by_the_park ), but now ….. you know he is the founder.
  6. Why did digg decide to stop showing who dugg a story? It never responded to user feedback regarding this request. When you hide anything, people will think you are behind any problems connected to your act of hiding.

Clearly the above facts point the finger at digg, with no one else as a suspect. However, there is no concrete evidence that digg is 100% responsible, so I will only say that digg is the one and only prime suspect here. This also fits well with their urge to get Diggable ads out … well digg, we just realized that most stories in the “top news” are ads, thanks!

So now what?

I can keep writing about this forever …. but nothing is going to change. This is happening even this minute (4.27 pm CST, Oct 25th), but I have got to conclude.

I am going to split this into two pieces – The piece titled “By Digg” is meant to be read if you think digg has a direct involvement in this (as I do) and the piece titled “Not by Digg” is meant to be read if you think digg is not involved in this.

And what a coincidence: I would like to point you to an audio quote (http://www.youtube.com/watch?v=Ay8_cKWrOqw#t=62m50s ) by none other than myself, on the SocialBlade show. Just wow to myself ;)

By Digg:

I am sure someone from digg is going to read this. So, I will address my points here to digg itself.

Digg, putting it very simply, this is like the US treasury printing fake dollars, just exactly the same. You have lost any iota of credibility you may have had with your users. Good job!

You messed up V4, you failed to listen to your users, and after a long time you agreed that you messed it up. You promised to listen to your users and are pretending to listen. Except for minor changes here and there, there is still so much to be done. Instead of really working on those changes and coming back in an honest way, why did you choose to use such a cheesy method? Did you assume that all of your “several million” users are idiots? You have now failed not only traffic-wise, user-wise etc.; you have failed as a business.

Integrity in a business is the first step towards success. A few small tricks here and there to keep things running may be seen as a “clever” thing, but cheating with the core of your business is an absolute crime. What caused this? VC pressure? The urge to not fail? Since I have spent hours and hours getting this out, be bold and give me a reply.

To the VC funders of Digg, I think you just lost your last hope!

Not By Digg:

So you think this is not by digg, and I see that you would give them the benefit of the doubt. But this has been happening daily since Oct 15th, so why did they not spot it? Can you answer that? The chart I showed earlier shows that the curves for the suspicious stories are clearly way off. Digg keeps boasting about its complicated “algo” and monitoring system. This is so widespread, and they could not catch or stop it. Now how would you trust their algo or monitoring systems? Why would you believe them? Answer for yourself or post a comment here.

Giving a Fair Chance


Now that I am accusing digg of something huge, I am going to give them a fair chance to explain their side before I publish this. However, I strongly suspect that data destruction might happen. So, I am going to record a video (don’t watch it unless you have nothing else to do: http://www.youtube.com/watch?v=aQH5oC-iVnc ) showing the data being downloaded from digg servers and stored (229 XML files). I will also upload the files to a public server ( http://www.megaupload.com/?d=LUP0WFJ4 ) as proof. That way, if they ever delete data, you can trust my copy.

If I hear back from digg, one more page will be added.

Until then,

Passionate Digg User,

LtGenPanda

I have contacted digg ….

I asked for a phone number for the Communication Director, but was told that they could handle this by email. I sent the email below:

————————————————————

Here is the link:

http://ltgenpanda.tumblr.com/post/1399805023/mystery-behind-the-diggs-algorithmic-mystery-tour

Runs several pages, I would appreciate a comment within about 30 mins, as I mention in the article – I fear data destruction will happen.

————————————————————

I then got a reply in 20 mins:

————————————————————

That is a lot of information to assess in such a short period of time. Unfortunately, we’re not going to be able to get back to you with a comment within 30 minutes. 

————————————————————

Realizing that 30 mins might have been too short, I responded after 15 minutes with:

————————————————————

Is there a reasonable time you want me to wait for?

————————————————————

I am going to wait until 6:34 CST, that is, 1 hr from when digg got a first chance to read it. If by then they do not give me a reasonable time to wait, I will go ahead and make this link public.

