
Database optimisation for developers

Apr 23, 2010
Just about any web application with dynamic data uses a database server to store that data. While some are used as little more than a way to store text efficiently, most relational database management systems are capable of doing far more than that.

While it is of course great to have so much functionality available, an unfortunate side effect is that it becomes less clear how to use all these abilities in the most optimal way. For example, a query can be written in such a way that it will either take a fraction of a second to execute, or several minutes - yet yield the exact same result in both cases.

I will describe some easy tips and tricks to lower the stress on your database system - most of these are written with MySQL in mind, but some general tips apply to virtually any RDBMS currently available. Written from a developer's point of view, they do not so much elaborate on administering a database (which is what the almighty DBAs are for) as on using it efficiently.

1. Thou Shalt Not Use Too Many Queries
Say you have a table containing grouped data and you wish to know some general property about this data. For example, you have a table of news reports with a limited set of categories and you wish to know the number of reports per category. One way of doing this is to obtain a list of all categories and then use a for loop to obtain the count for each category - something like this:
Code (sql):
SELECT COUNT(*) FROM `news` WHERE `category` = [index]

With a dozen categories, this means a dozen queries in rapid succession - a sure way to stress your database. Using GROUP BY, however, you can retrieve the same data with a single query!
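A minimal sketch of the single-query version, reusing the news table from the example above:

Code (sql):
-- One query returns the count for every category at once
SELECT `category`, COUNT(*) AS `report_count`
FROM `news`
GROUP BY `category`;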

In most languages, firing off an SQL query is relatively cheap. You generally need not worry about adding an extra query, especially not if it is a simple one. But SQL queries within a for(each) loop are almost never necessary, and a dozen excess queries will bring down performance. Avoid them if you can, but remember rule 2:

2. Thou Shalt Not Make Thy Queries Overly Complex
The other end of the spectrum of using too many queries is using too few - or to be more exact, making your queries do too much. This does not apply often, and in fact means you are already using a fairly large subset of what your DBMS is capable of, but it can happen.

Most notorious are subqueries and in-query calculations. Using a structure like "WHERE POW(`a`, `b`) > `c`", for example, means your DBMS has to calculate POW(a, b) for every row in the table to check whether it exceeds c. This is a fairly cheap calculation, but imagine performing it a million times for each query! Sure, some caching mechanisms apply, but your results will come back a lot faster if you add a column with the result already calculated - then your DBMS can use an index to find the rows you need, which is much, much faster. This leads us to:
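A rough sketch of such a precalculated column (the table and column names are made up for illustration):

Code (sql):
-- Hypothetical example: store the result of POW(a, b) once, then index it
ALTER TABLE `measurements` ADD COLUMN `a_pow_b` DOUBLE;
UPDATE `measurements` SET `a_pow_b` = POW(`a`, `b`);
CREATE INDEX `idx_a_pow_b` ON `measurements` (`a_pow_b`);

-- The comparison can now use the index instead of calculating POW for every row
SELECT `id` FROM `measurements` WHERE `a_pow_b` > 42;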

3. Thou Shalt Remember Thine Indices
Need to search for text? Use a fulltext index. Indexes allow your database to quickly jump to the data it needs instead of having to examine every row in the table. Especially as your tables grow larger, the benefit of indices is undeniable. Of course, there is a trade-off - in memory or disk space - but that is almost always a very small price to pay.

Note that even an index cannot save you when you attempt something like "WHERE `description` LIKE '%query%'". When searching through text, always try to use MATCH() AGAINST() - it exists for a very good reason.
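A minimal sketch of such a fulltext search, assuming a MyISAM news table with a `description` column (the default storage engine at the time; InnoDB gained fulltext support only later):

Code (sql):
-- Add a fulltext index once...
ALTER TABLE `news` ADD FULLTEXT INDEX `ft_description` (`description`);

-- ...then search with MATCH() AGAINST() instead of LIKE '%query%'
SELECT `id`, `title` FROM `news` WHERE MATCH(`description`) AGAINST('query');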

4. Thou Shalt Ask Only For What Thou Needest
Yes, we all know the temptation: use an asterisk to select every column in your dataset, instead of going through the tedious task of specifying each and every column you need. Heck, what if you need another column later - then you have to go back and add it. So why not do a "SELECT *"?

First of all, because it can lead to ambiguity problems. Most of your table's columns will have unique names, but some might not - for example, what if your primary key column is always called "id"? Some people stick to conventions like "tablename_id", but that does not entirely prevent duplicate names between tables either. If you do not specify your column names and later on add a JOIN to your query, the ID you have been selecting on could suddenly change to that of a completely different table - and suddenly your results will be a heap of garbage and you will spend the next day finding out how it could have happened (or less, if you used your source control well).

By specifying exactly what you need, you need not worry about duplicate names when adding a JOIN to your query: your database will warn you! MySQL will throw a nice ambiguity error as soon as a name you specified is no longer unique - which tells you directly what to change!
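For example (the joined table and its columns are invented for illustration), table-qualified columns stay unambiguous even after a JOIN is added:

Code (sql):
-- Explicit, table-qualified columns: no ambiguity between the two `id` columns
SELECT `news`.`id`, `news`.`title`, `category`.`name` AS `category_name`
FROM `news`
JOIN `category` ON `category`.`id` = `news`.`category_id`;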

The other good reason to always specify only what you need is a simple matter of resources: if you do a select on a table with a dozen fields, and maybe join it to a few other, similarly sized tables, your result set can easily span fifty or sixty columns - and chances are good you need far fewer than that. Retrieving five times as much data won't take five times as long, but it will increase memory consumption considerably - not just in your application, but in your database server's query cache as well. For these reasons, you should never do a global select unless you are really sure you need every column in the table and you will never expand your query.

5. Measure And Thou Shalt See
You did your homework, your queries are ticking along nicely, your indexes are in place (and have a high hit rate), so why is your application still sluggish?

The best way to find out is simply to ask your database server! MySQL allows you to log slow queries (you can determine yourself what counts as a 'slow' query, in terms of execution time), so with a bit of logging and analyzing you can see exactly which queries are causing the trouble. Alternatively, with the 'SHOW FULL PROCESSLIST' command you can see a slow query as it runs - though most queries will take less than a second and are therefore hard to catch using this method. Still, for anything taking more than a few seconds it is a quick way to determine the bottleneck.
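A minimal sketch of switching this on at runtime, assuming MySQL 5.1 or later (older versions need the log-slow-queries option in my.cnf instead):

Code (sql):
-- Log every query that takes longer than two seconds
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 2;

-- Watch currently running queries, including how long they have been going
SHOW FULL PROCESSLIST;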

Having found your slow query, the next step is of course determining why it is so slow:

6. Ask For An Explanation And Thou Shalt Receive
The EXPLAIN command yields detailed information on your query - in particular, you should be wary when "key" is empty or "Extra" notes the query is using filesort - these are signs of a bad query.
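Using it is as simple as prefixing a query; a sketch, reusing the hypothetical news table from earlier:

Code (sql):
-- Prefix any SELECT with EXPLAIN to see how MySQL plans to execute it
EXPLAIN SELECT `id`, `title`
FROM `news`
WHERE `category` = 3
ORDER BY `date_posted` DESC;
-- Inspect the "key", "possible_keys", "rows" and "Extra" columns of the result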

If no key is used, you should check the "possible_keys" field - is the index you intended to be used listed? If it is, why is it not being used? If not, how come your index is not applicable? These questions are complex, as are their answers - too much so, in fact, for me to answer here. Note though that MySQL offers syntax to help it use keys; use it wisely.
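The hint syntax referred to here is USE INDEX / FORCE INDEX; a sketch with an assumed index name:

Code (sql):
-- Suggest (USE INDEX) or insist on (FORCE INDEX) a particular index
SELECT `id`, `title`
FROM `news` USE INDEX (`idx_category`)
WHERE `category` = 3;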

When the "Extra" field notes that your query is using filesort, something different is happening: you have most likely made your query too complex, or are attempting to order / limit your resultset in a way that is not inherent in the table. Adding a column with cached results of calculations can sometimes help - for example, if you have a table with coordinates and want to know the distance from a central location to these coordinates, it is faster to calculate the distances once and store them in an extra column than to calculate them again and again for each query - of course at the expanse of diskspace. Sometimes however, when for example searching through text without match/against this will not help you. In these cases, search the net for common solutions - chances are you are not the first to encounter a particular problem.

7. Connection Pools Are Thine Friends
A common misconception that seems to arise every now and again is that you should use mysql_close() at the end of each PHP script. While it is (very rarely) not a bad idea to use mysql_free_result() to garbage-collect large result sets when you are executing a particularly long script, the same cannot be said for mysql_close(). Opening a database connection takes time and effort. So does closing it. So why would you want to do that for no other purpose than to free a few kilobytes of memory between page loads? In fact, if performance is an issue you will most likely have concurrent page loads, meaning several concurrent database connections at all times anyway. I can really think of no reason why you would need to close the connection manually halfway through a script.

MySQL uses 150 as its default max_connections limit these days - in other words, it expects to handle 150 connections simultaneously. But this is by no means an upper limit - any decent database server is easily capable of handling several thousand connections with no ill effects. The few dozen opened by your script are by no means a problem. For this reason Apache will already keep a connection pool ready if configured to do so: whenever a script needs a connection it can use an existing one, speeding things up nicely. You can enforce this behavior with mysql_pconnect(), but in my experience it is often better to trust your webserver and database administrator: they will already have arranged for your connections to be kept alive between pageviews.
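To see how close you actually come to that limit, a couple of standard status queries suffice:

Code (sql):
-- The configured connection limit
SHOW VARIABLES LIKE 'max_connections';

-- Connections currently open, and the peak since the server started
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Max_used_connections';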

In conclusion, every performance issue requires its own approach: measure what the bottleneck is, determine which part of the query causes it and adjust either your query or your database to accommodate. This is sometimes easy, but more often quite difficult - much has been written about it already, so do not forget to search for existing answers. Do not rule out external factors either: if your database has to swap to disk because your server runs out of memory, you can optimize for eternity and still not achieve significantly better performance. This is especially true for most shared hosting accounts. Good, fast hardware (in particular, hard disks with low access times, SSDs for preference), plenty of memory and a well-configured database can sometimes do far more than a change in SQL can - and often at far lower cost.

FragFrog out!

Apr 25, 2010 Willem

When using persistent connections you will have to keep some things in mind or you might experience weird effects. For example, temporary tables, the connection's encoding, variables and prepared statements all persist on the same database connection. If one PHP script changes any of those, the changes also become visible to other PHP scripts.

So effects from one PHP script may carry over to another script. Be careful when you use these features, or unexpected things might happen.

May 12, 2010 Vynsan

Have you got rss set up for this?

May 13, 2010 Matthijs

A good point, Willem, though I cannot imagine you would often wish to switch connection encoding between requests, for example. Prepared statements and, more importantly, table locks are more of a concern, though I have not encountered problems with them in production environments so far.

Vynsan: of course, check the RSS icon in your browser ;)

May 14, 2010 Vynsan

Thanks, found it - it didn't show up in Chrome; Firefox for the win, my old friend.

May 18, 2010 Dries

It's catEgory, not catAgory.

May 19, 2010 Matthijs

Woops, fixed that, thanks! Though I fear there'll probably be a few more typos of the sort in my writings..
