MySQL is the most frequently used database engine used in online applications. Only Oracle DB is used more extensively, but is not a unlike MySQL, it's not free. When a programmer starts using SQL and uses just small tables, performance is not the priority. When used in real-life applications, databases start to grow fast. With a database consisting of hundreds of thousands or even millions of records and being hammered with tens or hundreds connections simultaneously, performance optimization comes in. In this article I'll present some basic and more advanced techniques to optimize SQL-databases, more specific MySQL-databases. This article is for DBA just starting out or senior administrators who want to refresh their knowledge about optimization.
I’ve written this article relying on my experience developed when developing the online party search engine Party Locator. This platform was powered by a mediocre web server using dual XEON hexacores 16GB of ram and running on a raid of hard drives. When I just started the project, only several hundreds of entries where in the events table. The platform grew exponentially and after just five months, the platform stored more than one hundred thousand events in the database. This was the first stage when database optimization became necessary. Just one month later the crawler had indexed over half a million events, the second time optimization was desired. Using the techniques described below, I managed to reduce the average response time from 5.24 seconds to just 1.02 seconds (including abnormal peaks). The practical response time for a more complex query is now just 0.78 seconds, a performance boost of roughly 413%. Just in time, one year after launch, the Party Locator contained about 2.3 milion events.
Side note: Party Locator is no longer active due to Facebook changing it's access restrictions to it's search and page/event endpoints. If someone has a way to apply for an access token to access these endpoints, please let me now!
Not every tip I’ll give you will solve all your problems and will magically let your database double its performance. Optimization is all about testing and benchmarking. I’ll failed hundreds of times when trying to improve overall performance. But after testing multiple days, the solution finally had the desired impact. And at last, I'm not a professional ofcourse.
Try to replace subqueries with JOINS as subqueries do not use indices and are evaluated for every row. Use JOIN-structures instead. A subquery, though more reader friendly, can easily be rewritten to a regular query using JOINS. This was the first optimization I tried and it had immediate effect. Almost cutting the response time in half, this was one of the most influential techniques I used.
WHERE t1.b NOT IN (
could be rewritten to
FROM t1 LEFT JOIN t2 ON t1.b = t2.b
WHERE t2.b IS NULL
Every accessed field in the WHERE-clause should have an index. But keep in mind to not index every column as MySQL will not use more than two indices when it evaluates a query. Using many indices will also have an impact on INSERT and UPDATE statements as indices (most of the time BTREES) need to be rebuilt or updated. When 'writing queries' are not executed very often, using multiple indices is the way to go.
Note that it is possible to define indices on multiple fields. The order of these fields has a role as well.
Try not to cast where not necessary. For example, when comparing strings, converting a string to lowercase (or uppercase) is a bottleneck as it is to be executed for every row. String comparisons are expensive. Try using another character set or other operands.
Another form of casting can be in date functions, e.g. when you compare dates but have a stored timestamp, you'll use the cast DATE(timestamp), but this function is also executed for every row. If you use more date or time functions, try using DATETIME format instead of timestamp.
Defined field lengths
Field lengths have an impact on string comparisons. Using varchar with a specified maxium length insted of a text block (e.g. tinytext), the comparison will run much faster.
Do not use larger types than necessary. When you now that names will never be longer than 100 characters, don't allocate more space for the field. Note that it is good practise to specify maximum lengths and let your application check these requirements before sending a query to the database.
In the SELECT clause, almost never use the * selector. This selector selects all columns from the result. Only specify the columns that you need. If you select columns which are not needed, useless data will be retreived from disk and sent to the application which is ofcourse slower.
Optimize the table
MySQL haq a build in optimize command which works pretty well after many insertions or updates. Give it a try! The command is:
OPTIMIZE TABLE tablename
Another option is to defragment the table. Assuming you are using the innoDB engine:
ALTER TABLE tablename ENGINE = InnoDB
Order of expressions
The general rule is that expressions that evaluate to true for most expressions are placed last, expressions that accept the least rows are placed first. e.g.
1 = 1
a LIKE 'something'
Note that the first expression '1 = 1' will be evaluated for every row in the table (smart DBMS will ignore this row though) and then the field 'a' will be compared to the string 'something'.
When this query is restructured to:
a LIKE 'something'
1 = 1
At first the field 'a' will be compared with the string 'something'. Not all fields in the column 'a' will equal to 'something'. That means that the '1 = 1' expression will be executed less times.
Some number: lets say that there are 1000 enteries in the table and 10 enteries in the table have 'something' in column 'a'. In the first code example, there will be approx. 1000 + 1000 = 2000 evaluations while there will be 1000 + 10 evaluations in the second.
From my experience, innoDB is faster when the database is extensively used for writing and MyISAM is faster when reading is the most frequent task. Thus the Party Locator API queries a MyISAM database. The crawler runs in the background and has a less strict responstime requirement.
The public API consists mostly out of search endpoints. These endpoints are case insensitive. One way to handle this is to convert you search query to lowercase character and use LOWER(str) in the SQL. However this function has a performance impact on the query. It's better to use a characterset which is case insensitive for this purpose.
In my queries I also like to ignore special characters, for me a 'é' is the same as a regular 'e' and visa-versa. A naive approach for this problem is to convert a search query to only use ASCII characters, but this cannot be done for the data in the table. What if these special characters are important in the data? For this purpose there exists a characterset (or collation) (which is also case insensitive) and threats an 'é' als an 'e' and visa-versa (and ofcourse for all the variants), e.g. utf8_general_ci.
Connection keep alive
When you need the result of multiple queries, keep the connection with the DBMS open for further requests. Connection to the database takes some time and only increases the response time of your application.
When you only need a certain amount of enteries, limit the amount of rows using the LIMIT clause. Sending unnecessary data to the application will only increase latency.
Using LIMIT does not always have a positive performance impact. In some cases, using LIMIT did have a negative impact when sorting was enabled. E.g. when you have a ORDER BY clause and a LIMIT clause, all the rows will first be sorted before the LIMIT clause will take effect.
Mind your paging!
If one is sure that no doubles can exists in a table, do not use the DISTINCT clause. This clause will iterate over all selected rows and make sure every row only appears once. When returning many rows, this can take some time.
Do not use ORDER BY when sorting isn't needed. Most of the time, MySQL will sort it's data clusters on disk using the primary key. However this is not always the case. Not using ORDER BY does not guarantee the order of the records.
I've encountered som situations where only a very small number of rows was selected (without LIMIT) about 10 but when ODER BY was added, the query took about 10 times longer than without it. In that scenario, it was better to do the sorting of the litte rows on application level.
Sometimes all these optimizations aren't enough and the only possiblity is to buy faster hardware. If you want to check if your DBMS is eating up all the system resources, use the SQL command:
SHOW FULL PROCESSLIST