Tuesday, 15 March 2011

Twitter. It's doomed, you know.

The Problems of Twitter-scale data architecture.


All data at Twitter is big data. I have a previous life in databases (quite a long time ago), and back then large data was of the order of a terabyte. Twitter’s data is growing at a rate of half a gigabyte a second - and that mostly in 16 bit chunks.
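
To put that growth rate next to the old "a terabyte is large" yardstick, here is a back-of-envelope sketch. It uses only the half-a-gigabyte-a-second figure quoted above; the decimal units (1 TB = 1000 GB) and the function names are my own assumptions.

```python
# Back-of-envelope only. The 0.5 GB/s figure is the one quoted in the post;
# decimal units (1 TB = 1000 GB) are an assumption made for round numbers.
GROWTH_GB_PER_SECOND = 0.5
GB_PER_TB = 1000
SECONDS_PER_DAY = 86_400


def seconds_to_fill(terabytes: float) -> float:
    """Seconds needed to accumulate the given number of terabytes."""
    return terabytes * GB_PER_TB / GROWTH_GB_PER_SECOND


def terabytes_per_day() -> float:
    """Terabytes added in one day at the quoted growth rate."""
    return GROWTH_GB_PER_SECOND * SECONDS_PER_DAY / GB_PER_TB


if __name__ == "__main__":
    print(seconds_to_fill(1) / 60, "minutes per terabyte")
    print(terabytes_per_day(), "terabytes per day")
```

On those assumptions, what used to count as a large database arrives roughly every 33 minutes, or about 43 TB a day.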

I have extensive notes from the Twitter talk, detailed numbers and code examples, and you know what, I don’t think much of it is very interesting.


There are all sorts of interesting problems with data at this level, but none of them really needs a room full of PhDs to sort out.

I use Twitter, and I enjoy it. But I don’t think that if it went down the pan tomorrow it would really upset anyone very much. Sure, I’m glad it’s around, but beyond that it’s slightly meh.

It’s a bit better than LinkedIn for finding business contacts. It’s a bit better than Google Groups for getting in touch with people who have a particular political motivation, celebrity crush or sexual perversion (I expect). But it’s not as good as Facebook for keeping you in touch with friends, and it’s almost totally unsearchable. If you have over 300 friends – and the average is 700 – try to find an old tweet from someone you follow or, worse, find that friend again in your follower list if you can’t quite remember their Twitter name.

These problems come straight from the architecture. How do you index billions upon billions of tweets? You can’t index by keyword; username (or at least user_id) is presumably already a primary key; geolocation is meaningless.

The only factor left is temporality. Basically, Twitter is *consumed* at the point of generation and tweets are partitioned according to age. One database per month.
Databases are searched newest to oldest until LIMIT is reached, and in practical terms, LIMIT is always reached in the first or second DB (this month and last month).
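
For the curious, a scan that starts at the newest monthly partition and works backwards until LIMIT is hit might look something like the sketch below. This is not Twitter's code. The in-memory dictionary keyed by "YYYY-MM", the `(timestamp, text)` row shape and the function name `search_newest_first` are all made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: one list of (timestamp, text) rows per "YYYY-MM" key.
Partition = List[Tuple[int, str]]


def search_newest_first(
    partitions: Dict[str, Partition], keyword: str, limit: int
) -> Tuple[List[str], int]:
    """Scan monthly partitions from newest to oldest, stopping at `limit` hits.

    Returns the matching tweets and the number of partitions touched.
    """
    hits: List[str] = []
    touched = 0
    # "2011-03" sorts after "2011-02", so reverse order is newest first.
    for month in sorted(partitions, reverse=True):
        touched += 1
        # Newest rows first within the month as well.
        for _, text in sorted(partitions[month], reverse=True):
            if keyword in text:
                hits.append(text)
                if len(hits) == limit:
                    return hits, touched
    return hits, touched


# Made-up sample data, purely for illustration.
SAMPLE = {
    "2011-01": [(1, "old tweet about cats")],
    "2011-02": [(5, "cats again"), (6, "dogs")],
    "2011-03": [(9, "cats today"), (10, "more cats")],
}

if __name__ == "__main__":
    print(search_newest_first(SAMPLE, "cats", 2))
    print(search_newest_first(SAMPLE, "cats", 4))
```

With a small limit the loop never leaves the newest partition, which is the whole trick. Only an unusually large limit, or an unusually rare keyword, drags the older months in.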

Like I said. Not what you need to pay a room full of PhDs for. And the belief that this structure was informed by usage is precisely why I wouldn’t want PhDs looking at the problem. I want UX to look at the problem, to understand the data available and to do something exciting with it. As soon as the übergeeks take on the problem, they solve it for current usage, with no apparent appreciation that current usage might be an effect of, not the cause of, the current architecture.

Which, by itself, renders the archival of every single tweet ever both pointless and unsustainable. What is the point of archiving every tweet? Especially when the act of archiving every tweet renders the system incapable of the simplest of actions.

A database is NOT for storing information. A database is for retrieving information, and for extracting relationships and metadata from it. Data mining on the Twitter scale will never be more interesting than

SELECT * FROM user LIMIT 20

Never? No, never. This is the current limit. You cannot do anything more interesting than that.

Twitter data is currently growing faster than Moore’s Law allows us to do anything interesting with it. This growth will continue exponentially. It may top out at some level, but at a level years away from anyone's ability to process it.

Twitter therefore has two options.
The first, extremely likely one is to be doomed eternally never to grow beyond its current interesting factor.
The second, extremely unlikely one is that they hire a bunch of UX to create something impossible and a considerably smarter bunch of geeks to implement it.


-colophon-
If anyone wants a more in-depth analysis of the talk, facts and figures, comment below saying so and I will be happy to post away.
Here are some to keep you geeked for now, though.

The first “running out of disk space” panic happened at 80GB – roughly 3 billion tweets
Conflicts are all dealt with by a “last edit wins” policy.
This means that if the entire database were randomized and rerun, all social graphs would be replicated in their entirety.
Tweets average at 2000 / second.
Follow / unfollow activity rocks in at 20000 / second
Timelines are updated 4.8 million times a second (this was 21k in 2008)
Data for realtime queries (this month and last month) are all kept in RAM. Lots of it.
Some searches are precomputed.
All timelines are precomputed.
Tier 1 downtime SLA is 0.
Any system can be “darkmoded”. Cool word for it.
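
One of the points above deserves a quick illustration: why “last edit wins” means you can shuffle the whole log of edits, replay it, and still end up with the same social graph. The sketch below is mine, not Twitter's. The operation tuple, the names and the `replay` function are hypothetical, and it assumes timestamps are unique per follower/followee pair (ties would need an extra tie-break rule).

```python
import random
from typing import Dict, List, Tuple

Edge = Tuple[str, str]           # (follower, followee)
Op = Tuple[int, str, str, bool]  # (timestamp, follower, followee, is_following)


def replay(ops: List[Op]) -> Dict[Edge, bool]:
    """Apply follow/unfollow edits in whatever order they arrive.

    For each edge only the edit with the highest timestamp is kept, so the
    final graph does not depend on arrival order. Assumes unique timestamps
    per edge.
    """
    latest: Dict[Edge, Tuple[int, bool]] = {}
    for ts, follower, followee, is_following in ops:
        edge = (follower, followee)
        if edge not in latest or ts > latest[edge][0]:
            latest[edge] = (ts, is_following)
    return {edge: state for edge, (_, state) in latest.items()}


# Made-up edit log, purely for illustration.
OPS: List[Op] = [
    (1, "alice", "bob", True),
    (2, "alice", "bob", False),
    (3, "alice", "bob", True),
    (1, "bob", "carol", True),
    (4, "bob", "carol", False),
]

if __name__ == "__main__":
    shuffled = OPS[:]
    random.shuffle(shuffled)
    print(replay(OPS) == replay(shuffled))
```

The comparison holds for any shuffle, because each edge's final state is decided by its newest timestamp and nothing else.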
