Archive for the ‘wikipedia’ Category

Spikes are not fun anymore

Thursday, August 20th, 2009

English Wikipedia just scored “three million articles”, so I thought I’d give some more numbers and perspectives :) Four years ago we observed an impressive +50% traffic spike on Wikipedia – people came in to read about the new pope. Back then it was probably twenty additional page views a second, and we were quite happy to sustain that additional load :)

Nowadays big media events can cause some trouble, but generally they don’t bring huge traffic spikes anymore. Say, Michael Jackson’s English Wikipedia article had a peak hour of one million page views (2009-06-25 23:00-24:00) – and that was merely a 10% increase on one of our projects (English Wikipedia got 10.4m pageviews that hour). Our problems back then were caused by the complexity of page content – and costs got inflated because of the lack of concurrency control in the rendering farm.

Other interesting sources of attention are custom Google logos leading to search results leading to Wikipedia (of course!). The latest ones, for the Perseids and Hans Christian Ørsted, sent over 1.5m daily visitors each – but that’s a mere 20 article views a second or so.
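The rates above are easy to sanity-check. A minimal sketch of the arithmetic, using only the figures quoted in these paragraphs (the variable and function names are made up for illustration, and the "share" is simply the article's views divided by the project's views in that hour):

```python
# Back-of-the-envelope check of the traffic figures quoted above.
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def per_second(count, seconds):
    """Average rate for `count` events spread evenly over `seconds`."""
    return count / seconds


mj_peak_hour_views = 1_000_000      # Michael Jackson article, peak hour
enwiki_hour_views = 10_400_000      # all of English Wikipedia, same hour
logo_daily_visitors = 1_500_000     # one custom Google logo, per day

mj_rate = per_second(mj_peak_hour_views, SECONDS_PER_HOUR)
mj_share = mj_peak_hour_views / enwiki_hour_views
logo_rate = per_second(logo_daily_visitors, SECONDS_PER_DAY)

print(f"Michael Jackson article: {mj_rate:.0f} views/s, "
      f"{mj_share:.1%} of English Wikipedia that hour")
print(f"Google logo traffic: {logo_rate:.1f} views/s")
```

So a million views in an hour is a few hundred requests a second for one article, while 1.5m visitors spread over a day comes out just under the “20 a second or so” mentioned above.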

What makes those spikes boring nowadays is simply the length of the long tail. Our projects serve over five million different articles over the course of an hour (and 20m article views) – around 3.5m articles are opened just once. If our job were serving just hot news, our cluster setup and software infrastructure would be very very very different – but as it is, we have to accommodate millions of articles that aren’t just stored in archives, but are also constantly read, even if only once an hour (and the daily hot set is much larger too).

All this viewership data is available in raw form, as well as in nice visualizations at trendingtopics, wikirank and stats.grok.se. It is amazing to hear about all the research that is built on this kind of data, and I guess it already needs some improved interfaces and APIs for all the future uses ;-)

Board again (perhaps)

Monday, July 27th, 2009

Tomorrow voting in the Wikimedia Foundation Board of Trustees election starts – and yours truly is a candidate.

You can find most of my views on various issues in our question pages (I was somewhat boiling when answering the What will you do about the WMF mishandling it’s funding? one – it probably takes great effort to phrase such a bad question, and it is so easy to answer it :), as well as in the Wikipedia Signpost ‘interview’.

I was appointed to the Board back in January 2008, after holding various other volunteer (at some point in time – ‘officer’) positions within the organization since 2004 – and brought in the core technology and operational efficiency skill set there. The appointment was supposed to be somewhat temporary, but the board restructure turned out to be a much longer process than we expected – both the chapters part and the nomination committee work. As a community member, after the restructure I was in a ‘community-elected’ seat, though I never participated in any election – so that wasn’t too fair to the actual community, and I need to fix that :)

So, even though I wasn’t too visible to the actual community (people would notice me mostly when things go wrong, and I’m usually not in the best mood then :-), I feel that the values I’ve worked on, evangelized and supported for all these years – efficiency and general availability of our projects – can win the mindshare not only of the read-only users I mostly work for, but also of eligible voters.

And I do think that internal technology expertise has to be represented on the board, as the things we’ve been doing, and the methods we’ve been using, are very much unique in the technology world. Oh, and as I mentioned somewhere, our technology spending is close to 50% – that has to be represented too :-)

Knight’s Cross!

Sunday, July 5th, 2009

Celebrating the Knight's Cross
We had a very special State Award Ceremony today. Special, as it happens in the year we celebrate “thousand years of Lithuania”, special as it is the last one given by our very special President Valdas Adamkus.

Though for me, it was really special – Alvydas Mituzas, my Dad, got a Knight’s Cross, for his lifetime merits and service for our country, spanning forty years of creativity and dedication. Anyone who knows my Dad really knows what the award is for, but for me it is also a symbol of support for virtues that I’ve seen my whole life – hard work and imagination combined for public service and public good, no matter how difficult it is to start, or finish. Stories of his past are fascinating, and even I learn more and more of them over the years. We’re proud of him, and we’re lucky to be his family.

embarrassment

Friday, June 26th, 2009

So, we had a major embarrassment last night. It consisted of multiple factors:

  • We don’t have a parallelism coordinator for our most CPU-intensive task at Wikipedia, so the cluster can work on the same job in ten, a hundred, a thousand threads at the same time.
  • Some parts of our parsing process ended up extremely CPU-intensive, and that happened not in our code, but in ‘templates’, which live in user-space. We don’t have profiling for templates, so we can only guess which one is slow and which one is fast, and we have no overall aggregates either.
  • Some parts of pages are extremely template-heavy, making page rendering cost a lot (e.g. citations – see this discussion).
  • In order to avoid content integrity race conditions, the editing process releases locks and invalidates objects early, separately from the ‘virgin parse’ that populates caches.
  • It takes quite some time to refill the cache, as rendering is CPU-bound for quite a while in certain cases.
  • During that short time when caches are empty, a stampede of users on a single article causes lots of redundant work across the cluster/grid/cloud.
  • The Michael Jackson article on English Wikipedia alone had a million views in one hour.

So, in summary, we had havoc in our cluster because a stampede of heavy requests between cache purge and cache population was consuming all available CPU resources, mostly working on rendering the references section of the Michael Jackson article.

Oh well, the quick operations hack looked like this:

Index: ParserCache.php
===================================================================
--- ParserCache.php	(revision 52088)
+++ ParserCache.php	(working copy)
@@ -63,6 +63,7 @@
  if ( is_object( $value ) ) {
    wfDebug( "Found.\n" );
    # Delete if article has changed since the cache was made
    // temp hack!
+   if( $article->mTitle->getPrefixedText() != 'Michael Jackson' ) {
    $canCache = $article->checkTouched();
    $cacheTime = $value->getCacheTime();
    $touched = $article->mTouched;
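
The idea behind the hack can be sketched outside of PHP. This is an illustration in Python, not the actual ParserCache code: `CachedRender`, `HOT_TITLES` and `usable_cached_render` are hypothetical names, and it assumes that skipping the ‘touched’ check means the cached copy keeps being served even if the article was edited after it was rendered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedRender:
    html: str
    cache_time: float  # when this render was stored in the cache


# Hypothetical list of titles whose freshness check is skipped,
# mirroring the hardcoded 'Michael Jackson' in the diff above.
HOT_TITLES = {"Michael Jackson"}


def usable_cached_render(title: str,
                         cached: Optional[CachedRender],
                         touched: float) -> Optional[str]:
    """Return cached HTML if it may be served, or None to force a re-render.

    `touched` is the time the article was last modified.
    """
    if cached is None:
        return None
    if title in HOT_TITLES:
        # Temp hack: serve whatever is cached, even if it is stale,
        # instead of letting every reader trigger an expensive re-render.
        return cached.html
    if cached.cache_time >= touched:
        return cached.html  # cache is newer than the last edit
    return None  # article changed since the cache was made
```

The trade-off is plain: readers of the hot article may see a slightly outdated version, but the cluster stops re-rendering it for every request that arrives right after an edit.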

It is embarrassing, as the actual pageview count was way below our usual capacity; whenever we have problems, it is because of some narrow expensive problem, not because of an overall unavoidable resource shortage. We can afford many more edits, many more pageviews. We could have handled this load way better if our users weren’t creating complex logic in articles. We could have handled this way better if we had more aggressive redundant job elimination.
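
What ‘redundant job elimination’ could look like, as a minimal sketch: concurrent requests for the same key wait for a single render instead of each starting their own. This is not what runs on the cluster; `SingleFlight`, `_Call` and `render_article` are hypothetical names, and the sketch only coordinates threads inside one Python process, whereas the real problem needs coordination across many machines.

```python
import threading
import time


class _Call:
    """State shared by everyone waiting on one in-flight computation."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    """Coalesce concurrent calls for the same key into one computation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> _Call

    def do(self, key, fn):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call
        if leader:
            # Only the first caller does the expensive work.
            try:
                call.result = fn()
            except BaseException as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            # Everyone else just waits for the leader's result.
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.result


if __name__ == "__main__":
    renders = []

    def render_article():
        renders.append(1)
        time.sleep(0.2)  # stand-in for a CPU-heavy parse
        return "<html>...</html>"

    flight = SingleFlight()
    threads = [
        threading.Thread(
            target=flight.do, args=("Michael Jackson", render_article))
        for _ in range(10)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("requests:", len(threads), "renders:", len(renders))
```

Requests that arrive after the leader has finished would start a new render, so in practice this sits in front of the cache rather than replacing it.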

That’s the real story of operations: though headlines like “High profile event brought down Wikipedia” may sound nice, the real story is “shit happens”.

I loved Encarta

Monday, March 30th, 2009

That happened long before Wikipedia. I loved Encarta. Well, before Encarta, I used to read this thing a lot:

But then Encarta arrived and I loved it. It fit on a single CD and didn’t take too much space on disk. I could look up all these articles in it quickly, without having to use expensive dialup. I remember my school buddies coming over and watching those tiny movies in it. I could rip it off for my school work, and look incredibly smart (now people rip off Wikipedia and don’t get too much credit for that :).

It is dead.

People on the interwebs suggest that employees at Wikipedia and Encyclopaedia Britannica will be throwing parties tonight. Oh well, Wikipedia is already up to date about this. Every encyclopedia out there was an inspiration for Wikipedia, more so than any technology or “web-two-oh” hype. There’s not much joy seeing good things die.

Ten years ago I imagined that once I had my own home, I’d have a place to put a full set of dead-tree Britannica, like my parents had the “Lithuanian Soviet Encyclopaedia”. Wikipedia changed my plans (now there’re two flat panels staring at Wiki, inside and outside), but it seems it is already changing the world around it way more. RIP Encarta. You were inspiring, and really too young to die. If it was us, we didn’t mean it, really. By the way, that content of yours, I’d be glad to see it free. *wink*

I’m a creative commoner

Saturday, March 28th, 2009

Lately Creative Commons is becoming a very dominant topic in my life. First of all, I see all the people in the free culture world holding their breath and waiting for Wikipedia to switch to a CC license. I’m waiting for that too – and personally I really endorse it. Though people usually do not notice licenses on web content, they really do once they see something they want to reuse. Wikipedia ends up being an isolated island if it doesn’t go after sharing and exchanging information with other projects.

It takes time to understand one is a ‘creative commoner’. I do have a t-shirt with such a caption, but it is much more comfortable once you start feeling the real power of use and reuse of information. A few anecdotes…

Tim is now vocal

Tuesday, December 16th, 2008

Tim at the datacenter
Tim is one of the most humble and intelligent developers I’ve ever met – and we’re extremely happy to have him at Wikimedia. Now he has a blog, where the first entry is already epic by any standards. I mentioned the IE bug, and Tim has done a thorough analysis of this one, and of similar problems.

I hope he continues to disclose the complexity of real web applications – and that will always be a worthy read.

Knol

Thursday, July 24th, 2008

There isn’t much to say about Knol technology – it is either nicely engineered or missing (they probably thought that search is the main tool for collaboration). Of course, many issues are already covered by others, but…

My first look was at the featured articles. What was wrong?

  • It features ‘closed collaboration’. Actually, that’s no different from a blog, then…
  • It doesn’t care much about the licensing – featured articles had images with “all rights reserved”, or images taken from Wikipedia, with attribution but without a share-alike clause. Also, the lack of a share-alike license forbids importing content from many other places, but as we see it – nobody cares. ;-)
  • It doesn’t care about linking. Google search was based on the web links. Wikipedia was built on top of lots of broken links (oh, and working ones too). And nobody is going to type a Knol URL.
  • It doesn’t seem to have community tools. It just doesn’t.
  • WYSIWYG editing leads to articles without structure, just some text parts bolder than the others.

So for now, it seems to be a pure-engineering approach to the problem, without looking at actual work done, social implications or properly respecting copyrights.

One needs a community for that. A community helps not only with content, but with style, metadata, organizing, and most of all – ensures that the project maintains its values and spirit.

Wikipedia at Velocity conference

Thursday, June 19th, 2008

Next Monday I’ll be presenting (if jetlag doesn’t kill me) at Velocity 2008 – a webops and performance conference. It won’t be my first time talking about Wikipedia infrastructure, but this time people will know the technology and scaling methods anyway.

As I see it, in such a context Wikipedia is more interesting as a case of an operations underdog – lean non-profit budgets, brave approaches in infrastructure, conservative feature development, and lots of cheating and cheap tricks (caching! caching! caching!).

Also, I’ll be able to share (making the audience jealous) how great it is to be on a non-profit ops team (and one example perk – we can be cheap about getting conference passes too ;-)

The best part (for the audience, not for me) – I will be forced to be honest. Nearly the whole tech team will be at the event, and if I fail to attribute any developments, or start talking crap – not only can they throw rotten tomatoes, but they can also disable my login access and claim they never knew me, without me being able to fight back :) I haven’t presented publicly in front of these guys since 2005 – it will be tough.

Board

Wednesday, February 13th, 2008

Exciting times, I’m joining the Wikimedia Foundation Board of Trustees. That means lots of work on what is a strong community organization, supporting the modern day wonders.