[geeklog-devel] [Fwd: Re: Geeklog optimalisations]

Fri Mar 12 09:40:30 EST 2004

This is an FYI.  We'll be discussing this on the development lists over 
the next few days (I hope).  It's important we help Groklaw as best we 
can as they are one of our bigger sites and by them pushing the limits 
of Geeklog we can address their issues and make Geeklog a better product 
at the same time.

--Tony

---------------------

Niels,

I think a bit of background is in order before you can understand how
Geeklog got where it is.  First, nearly all the code you are referring
to is legacy code.  It was there before I managed the project and it is
still there under Dirk's management.  In its infancy, Geeklog was only
servicing smaller sites so performance was never really an issue and,
frankly, I was a bit young and dumb when I first got started with
Geeklog so performance tuning PHP scripts wasn't even a consideration
and my focus was on the feature set.

Under Dirk's management, the feature set has continued to grow to the
point that we have a large userbase and what you are encountering with
Geeklog is only natural.  Groklaw is clearly one of the biggest sites to
run Geeklog.  I have posted questions to our mailing lists asking about
performance issues related to bigger Geeklog sites but got no responses
back, so your email was a pleasant surprise.

The long and the short of it is Geeklog has matured to a point where
bigger sites are using it and we are pushing the performance limits it
has.  Geeklog's database interaction has always been an issue for me and
is a large part of why I have chosen to get a new codebase up (i.e.
Geeklog 2) while the 1.3.x line continues.  You are right, we need to address the
performance issues and given the amount of work you have put into
troubleshooting Groklaw I think you can play a critical part in that.

What I would like to do is see us work closely with you to begin
addressing these issues.  A starting point would be to have a place
where we can install a development version of Groklaw's database and
run tests.  Dirk and I don't have access to a database of that size, and
while we could fudge together some data, using a real-world example
would sure be nice.  Once we have a test bed, I'd
be open to suggestions on how we might work on this to resolve your
immediate issues *and* begin addressing performance tuning as a whole.

#geeklog is where I dwell (though not always at the keyboard).  If
possible I'd like to see us discuss this on geeklog-devtalk.  Niels, if
you could join that list at http://lists.geeklog.net/listinfo we can
carry this on there.  In the meantime if you can catch Dirk or myself in
IRC feel free to do so.  FYI I'm out of town this weekend (FWIW I'm GMT
-6) so I may not seem too responsive until I get back on Sunday.

Thanks for contacting us, I'm sure we can address these issues.

--Tony

Niels Leenheer wrote:
> Hi guys,
> 
> First of all. What were you guys thinking? Sorry to be so rude, but I simply
> had to get that off my chest. I feel better now. I'm okay. Really.
> 
> As some of you may be aware, Groklaw is using Geeklog. It has turned into
> quite a busy website and stories with more than 700 comments are not out of
> the ordinary. In addition to this, being slashdotted has become normal. This
> is where the problems started. The server can't handle much more. On busy
> days the website turns into a crawling slow pile of ..
> 
> As a regular reader and volunteer of Groklaw I offered to take a look at the
> Geeklog source code and try to find some places that could benefit from
> optimisation. After some testing I've noticed that most of the problems
> are due to load on the database server.
> 
> The first thing I started working on is the code that generates all the
> comments. It turns out that for every comment at least two queries are
> executed. For a story with more than 700 comments this would mean more than
> 1400 queries to generate the page.
> 
> I've modified this code extensively and now we use one query to fetch all
> the user details of all the people involved in posting. One query is used to
> fetch all the comments that have no parent. One query to fetch all the
> comments that do have parents. And if needed, one query to fetch the parent.
> All this data is then turned into one big nested array, which is passed by
> reference to the functions that actually print the data. Depending on how
> many comments there are this could result in a speed improvement of about
> 0% - 1000%. As you can imagine, if you only have about 10 comments it would
> not mean much, but with 500 comments it would reduce the number of queries
> needed by about 1000. It's a very big improvement.
> 
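
A minimal sketch of the batching idea described above, written in Python with sqlite3 as a stand-in for Geeklog's PHP and MySQL. The table and column names (comments, users, cid, pid, uid, sid) and the function name are assumptions made for illustration; they are not Geeklog's actual schema or the Groklaw patch. For brevity it fetches top-level comments and replies with a single query rather than the two described above, so treat it as a variation on the approach.

```python
import sqlite3
from collections import defaultdict


def build_comment_tree(conn, sid):
    """Return (tree, query_count) for story `sid` using a fixed number of queries."""
    queries = 0

    # Query 1: every comment of the story, top-level comments and replies alike.
    rows = conn.execute(
        "SELECT cid, pid, uid, body FROM comments WHERE sid = ? ORDER BY cid",
        (sid,),
    ).fetchall()
    queries += 1

    # Query 2: all authors at once instead of one lookup per comment.
    uids = sorted({uid for _, _, uid, _ in rows})
    users = {}
    if uids:
        marks = ",".join("?" * len(uids))
        users = dict(
            conn.execute(
                f"SELECT uid, username FROM users WHERE uid IN ({marks})", uids
            )
        )
        queries += 1

    # Build the nested structure in memory. A pid of 0 marks a top-level
    # comment (an assumption of this sketch). Each comment's "replies" list
    # is the same list object that later children of that comment append to.
    children = defaultdict(list)
    for cid, pid, uid, body in rows:
        children[pid].append(
            {
                "cid": cid,
                "author": users.get(uid, "?"),
                "body": body,
                "replies": children[cid],
            }
        )
    return children[0], queries
```

The query count stays at two no matter how many comments the story has, which is where the saving over a per-comment lookup comes from.
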
> One other problem I've identified is table locking of the story table. The
> statistics are stored in the same table as the actual content of the story.
> So each time a story is displayed, it will use an UPDATE query and a SELECT
> query on the same table. With a lot of requests the table is constantly
> locked by the UPDATE queries and the SELECT queries are waiting. We've
> disabled the statistics for now, but we are investigating the possibility of
> moving the statistics to a separate table.
> 
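
One way the separate statistics table could look, again as a hedged Python/sqlite3 sketch. The table names stories and story_stats, the hits column and both function names are hypothetical. The benefit described above assumes a database engine that locks whole tables on write; SQLite is used here only so the example runs anywhere, and it does not reproduce that locking behaviour.

```python
import sqlite3

# Hypothetical schema: content and counters live in different tables.
SCHEMA = """
CREATE TABLE stories (sid TEXT PRIMARY KEY, title TEXT, body TEXT);
CREATE TABLE story_stats (sid TEXT PRIMARY KEY, hits INTEGER NOT NULL DEFAULT 0);
"""


def record_hit(conn, sid):
    """Count one page view. Only story_stats is written to."""
    cur = conn.execute(
        "UPDATE story_stats SET hits = hits + 1 WHERE sid = ?", (sid,)
    )
    if cur.rowcount == 0:
        # First view of this story: create its counter row.
        conn.execute("INSERT INTO story_stats (sid, hits) VALUES (?, 1)", (sid,))


def load_story(conn, sid):
    """Read the story content. Only stories is read from."""
    return conn.execute(
        "SELECT title, body FROM stories WHERE sid = ?", (sid,)
    ).fetchone()
```

With this split, the frequent UPDATE and the frequent SELECT no longer target the same table.
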
> Next is the database layer. The mysql_fetch_array() function has two
> arguments. The second determines what the function returns. Either an
> associative array, a numbered array or both. By default the function returns
> both. This is what Geeklog does. In most of the code only the associative
> array is used. Only in a couple of small instances does the code require a
> numbered array. What we have done is to instruct the mysql_fetch_array()
> function to return only an associative array by default. Only when the code
> requires a numbered array do we request both. This should lower the amount of
> memory needed by Geeklog.
> 
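
To illustrate why the fetch mode matters, here is a small Python model of the three result shapes. It is not PHP and not Geeklog's database layer; fetch_row and the ASSOC/NUM/BOTH constants are made-up names that only mirror the idea of PHP's MYSQL_ASSOC, MYSQL_NUM and MYSQL_BOTH modes.

```python
# Hypothetical mode names modelled on the PHP fetch modes.
ASSOC, NUM, BOTH = "assoc", "num", "both"


def fetch_row(columns, values, mode=ASSOC):
    """Build one result row, keyed by name, by position, or by both."""
    row = {}
    if mode in (NUM, BOTH):
        row.update(enumerate(values))      # 0, 1, 2, ... -> value
    if mode in (ASSOC, BOTH):
        row.update(zip(columns, values))   # column name -> value
    return row
```

In "both" mode every row carries twice as many entries as in associative mode, so defaulting to the associative form and asking for both only where positional access is needed keeps each row smaller.
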
> The SEC_getUserGroups() function is also quite expensive. It is called
> throughout the generation of a page and it does not cache the information.
> We've added a simple cache for the data that is fetched from the database,
> which eliminates another 30 or so queries.
> 
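
The kind of simple cache described above could be sketched as follows. This is Python with sqlite3, not the actual change to SEC_getUserGroups(); the group_assignments table, its columns and the function name are assumptions for illustration.

```python
import sqlite3

# Per-process cache: uid -> list of group names. In a PHP page request this
# would live only for the duration of the request; a long-running process
# would also need some way to invalidate it.
_group_cache = {}


def get_user_groups(conn, uid):
    """Return the user's group names, querying the database at most once per uid."""
    if uid not in _group_cache:
        rows = conn.execute(
            "SELECT grp_name FROM group_assignments WHERE uid = ? ORDER BY grp_name",
            (uid,),
        )
        _group_cache[uid] = [name for (name,) in rows]
    return _group_cache[uid]
```

Every call after the first for a given user is answered from memory, which is how repeated calls during one page build stop costing a query each.
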
> Next is the index page. The COM_featuredCheck() function is executed every
> time the frontpage is requested. I've changed the loop that actually
> displays the stories on the frontpage and included a check to see if there
> is more than one featured story. If there is, the second story is not
> displayed as such and the featuredCheck() function is called. This again
> saves a couple of queries and the end result is the same.
> 
> We are also using the mycal extension which I've almost completely
> rewritten. Mycal uses a query for every day that is displayed and after my
> modifications it only uses one query. That is a reduction of 27-34 queries.
> 
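
The calendar change follows the same pattern: one range query for the whole displayed period, grouped by day in memory. A rough Python/sqlite3 sketch, assuming a hypothetical events table with ISO-formatted event_date strings (this is not the mycal code):

```python
import sqlite3
from collections import defaultdict


def events_by_day(conn, first_day, last_day):
    """Fetch all events between two ISO dates (inclusive) with a single query."""
    rows = conn.execute(
        "SELECT event_date, title FROM events "
        "WHERE event_date BETWEEN ? AND ? ORDER BY event_date, title",
        (first_day, last_day),
    )
    # Group in memory instead of issuing one query per calendar cell.
    by_day = defaultdict(list)
    for event_date, title in rows:
        by_day[event_date].append(title)
    return dict(by_day)
```

The code that draws the calendar then looks up each day in the returned dictionary rather than going back to the database.
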
> Now back to my first paragraph. I was pretty impressed with how easy it was
> to get used to the way everything works in Geeklog. It was pretty easy to
> understand and it looks like it was designed pretty well. But I was also
> horrified when I saw the enormous number of queries that are used, but I
> guess Geeklog wasn't really designed with this kind of traffic and these
> enormous amounts of comments in mind.
> 
> Most of the changes we've made are not yet running on the production server.
> Once we've properly tested everything and everything is stable, I'm willing
> to look at how we can give these changes back to Geeklog. A simple patch
> between the current version of Geeklog and Groklaw will be difficult,
> because we are using 1.3.8-1sr4 and it also includes a lot of Groklaw
> specific modifications. If you are interested in these modifications, please
> let me know and we'll work something out.
> 
> If you want to talk to me about this you can e-mail me. In addition to this
> I'll try to visit #geeklog as often as I can.
> 
> Niels Leenheer
> -- project manager phpAdsNew
> 
> 
>