Thought this was cool: InnoDB’s gap locks

One of the most important features of InnoDB is the row level locking. This feature provides better concurrency under heavy write load but needs additional precautions to avoid phantom reads and to get a consistent Statement based replication. To accomplish that, row level locking databases also acquire gap locks.

What is a Phantom Read

A Phantom Read happens when in a running transaction, two identical statements get different values, because some other transaction has modified the table’s rows. For example:


transaction1> START TRANSACTION;
transaction1> SELECT * FROM t WHERE i > 20 FOR UPDATE;
+------+
| i |
+------+
| 21 |
| 25 |
| 30 |
+------+


transaction2> START TRANSACTION;
transaction2> INSERT INTO t VALUES(26);
transaction2> COMMIT;

transaction1> select * from t where i > 20 FOR UPDATE;
+------+
| i |
+------+
| 21 |
| 25 |
| 26 |
| 30 |
+------+

Phantom reads do not occur if you’re simply doing a SELECT. They only occur if you do UPDATE or DELETE or SELECT FOR UPDATE. InnoDB provides REPEATABLE READ for read-only SELECT, but it behaves as if you use READ COMMITTED for all write queries, in spite of your chosen transaction isolation level (considering only the two most common isolation levels, REPEATABLE READ and READ COMMITTED).

What is a gap lock?

A gap lock is a lock on the gap between index records. Thanks to this gap lock, when you run the same query twice, you get the same result, regardless other session modifications on that table. This makes reads consistent and therefore makes the replication between servers consistent. If you execute SELECT * FROM id > 1000 FOR UPDATE twice, you expect to get the same value twice. To accomplish that, InnoDB locks all index records found by the WHERE clause with an exclusive lock and the gaps between them with a shared gap lock.

This lock doesn’t only affect to SELECT … FOR UPDATE. This is an example with a DELETE statement:

transaction1 > SELECT * FROM t;
+------+
| age |
+------+
| 21 |
| 25 |
| 30 |
+------+

Start a transaction and delete the record 25:

transaction1 > START TRANSACTION;
transaction1 > DELETE FROM t WHERE age=25;

At this point we suppose that only the record 25 is locked. Then, we try to insert another value on the second session:

transaction2 > START TRANSACTION;
transaction2 > INSERT INTO t VALUES(26);
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
transaction2 > INSERT INTO t VALUES(29);
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
transaction2 > INSERT INTO t VALUES(23);
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
transaction2 > INSERT INTO t VALUES(31);
Query OK, 1 row affected (0.00 sec)

After running the delete statement on the first session, not only the affected index record has been locked but also the gap before and after that record with a shared gap lock preventing the insertion of data to other sessions.

How to troubleshoot gap locks?

Is possible to detect those gap locks using SHOW ENGINE INNODB STATUS:

---TRANSACTION 72C, ACTIVE 755 sec
4 lock struct(s), heap size 1248, 3 row lock(s), undo log entries 1
MySQL thread id 3, OS thread handle 0x7f84a78ba700, query id 163 localhost msandbox
TABLE LOCK table `test`.`t` trx id 72C lock mode IX
RECORD LOCKS space id 19 page no 4 n bits 80 index `age` of table `test`.`t` trx id 72C lock_mode X
RECORD LOCKS space id 19 page no 3 n bits 80 index `GEN_CLUST_INDEX` of table `test`.`t` trx id 72C lock_mode X locks rec but not gap
RECORD LOCKS space id 19 page no 4 n bits 80 index `age` of table `test`.`t` trx id 72C lock_mode X locks gap before rec

If you have lot of gaps locks in your transactions affecting the concurrency and the performance you can disable them in two different ways:

1- Change the ISOLATION level to READ COMMITTED. In this isolation level, it is normal and expected that query results can change during a transaction, so there is no need to create locks to prevent that from happening.
2- innodb_locks_unsafe_for_binlog = 1. Disables the gap locks except for foreign-key constraint checking or duplicate-key checking.

The most important difference between these two options is that the second one is a global variable that affects all sessions and needs a server restart to change its value. Both options cause phantom reads (non repeatable reads) so in order to prevent problems with the replication you should change the binary log format to “row”.

Depending on the statement, the behavior of these locks can be different. In the following link there is a good source of information:

http://dev.mysql.com/doc/refman/5.1/en/innodb-locks-set.html.

Conclusion

MySQL uses REPEATABLE READ as the default isolation level so it needs to lock the index records and the gaps to avoid phantom reads and to get a consistent Statement based replication. If your application can deal with phantom reads and your binary log is in row format, changing the ISOLATION to READ COMMITTED will help you to avoid all those extra locks. As a final advice, keep your transactions short :)

from MySQL Performance Blog: http://www.mysqlperformanceblog.com/2012/03/27/innodbs-gap-locks/

Thought this was cool: Multi Range Read (MRR) in MySQL 5.6 and MariaDB 5.5

This is the second blog post in the series of blog posts leading up to the talk comparing the optimizer enhancements in MySQL 5.6 and MariaDB 5.5. This blog post is aimed at the optimizer enhancement Multi Range Read (MRR). Its available in both MySQL 5.6 and MariaDB 5.5

Now let’s take a look at what this optimization actually is and what benefits it brings.

Multi Range Read

With traditional secondary index lookups, if the columns that are being fetched do not belong to the secondary index definition (and hence covering index optimization is not used), then primary key lookups have to be performed for each secondary key entry fetched. This means that secondary key lookups for column values that do not belong to the secondary index definition can result in a lot of Random I/O. The purpose of MRR is to reduce this Random I/O and make it more sequential, by having a buffer in between where secondary key tuples are buffered and then sorted by the primary key values, and then instead of point primary key lookups, a range lookup is performed on the primary key by using the sorted primary key values.

Let me give you a simple example. Suppose you have the following query executed on the InnoDB table:

SELECT non_key_column FROM tbl WHERE key_column=x

This query will roughly be evaluated in following steps, without MRR:

  1. SELECT key_column, pk_column FROM tbl WHERE key_column=x ORDER BY key_column
    (Note that secondary keys in InnoDB contain primary key columns)
  2. For each pk_column value in step 1 do:
    SELECT non_key_column FROM tbl WHERE pk_column=val

As you can see that the values returned from Step 1 are sorted by the secondary key column ‘key_column’, and then for each value of ‘pk_column’ which is a part of the secondary key tuple, a point primary key lookup is made against base table, the number of these point primary key lookups will be depend on the number of rows that match the condition ‘key_column=x’. You can see that there are a lot of random primary key lookups made.

With MRR, then steps above are changed to the following:

  1. SELECT key_column, pk_column FROM tbl WHERE key_column=x ORDER BY key_column
    (Note that secondary keys in InnoDB contain primary key columns)
  2. Buffer each pk_column value fetched from step 1, and when the buffer is full sort them by pk_column, and do a range primary key lookup as follows:
    SELECT non_key_column from tbl WHERE pk_column IN (…)

As you can see by utilizing the buffer for sorting the secondary key tuples by pk_column, we have converted a lot of point primary key lookups to one or more range primary key lookup. Thereby, converting Random access to one or more sequential access. There is one another interesting thing that has come up here, and that is the importance of the size of the buffer used for sorting the secondary key tuples. If the buffer size is large enough only a single range lookup will be needed, however if the buffer size is small as compared to the combined size of the secondary key tuples fetched, then the number of range lookups will be:

CEIL(S/N)
where,
S is the combined size of the secondary key tuples fetched, and
N is the buffer size.

In MySQL 5.6 the buffer size used by MRR can be controlled by the variable read_rnd_buffer_size, while MariaDB introduces a different variable to control the MRR buffer size mrr_buffer_size. Both buffer sizes default to 256K in MySQL 5.6 and MariaDB 5.5 respectively, which might be low depending on your scenario.

You can read more about the MRR optimization available in MySQL 5.6 here:
http://dev.mysql.com/doc/refman/5.6/en/mrr-optimization.html
and as available in MariaDB 5.5 here:
http://kb.askmonty.org/en/multi-range-read-optimization

Now let’s move on to the benchmarks, to see the difference in numbers.

Benchmark results

For the purpose of this benchmark, I have used TPC-H Query #10 and ran it on TPC-H dataset (InnoDB tables) with a Scale Factor of 2 (InnoDB dataset size ~5G). I did not use Scale Factor of 40 (InnoDB dataset size ~95G), because the query was taking far too long to execute, ~11 hours in case of MySQL 5.5 and ~5 hours in case of MySQL 5.6 and MariaDB 5.5. Note that query cache is disabled during these benchmark runs and that the disks are 4 5.4K disks in Software RAID5.

Also note that the following changes were made in the MySQL config:
optimizer_switch=’index_condition_pushdown=off’
optimizer_switch=’mrr=on’
optimizer_switch=’mrr_sort_keys=on’ (only on MariaDB 5.5)
optimizer_switch=’mrr_cost_based=off’
read_rnd_buffer_size=4M (only on MySQL 5.6)
mrr_buffer_size=4M (only on MariaDB 5.5)

We have turned off ICP optimization for the purpose of this particular benchmark, because we want to see the individual affect of an optimization (where possible). Also note that we have turned off mrr_cost_based, this is because the cost based algorithm used to calculate the cost of MRR when the optimizer is choosing the query execution plan, is not sufficiently tuned and it is recommended to turn this off.

The query used is:

select
        c_custkey,
        c_name,
        sum(l_extendedprice * (1 - l_discount)) as revenue,
        c_acctbal,
        n_name,
        c_address,
        c_phone,
        c_comment
from
        customer,
        orders,
        lineitem,
        nation
where
        c_custkey = o_custkey
        and l_orderkey = o_orderkey
        and o_orderdate >= '1993-08-01'
        and o_orderdate < date_add( '1993-08-01' ,interval '3' month)
        and l_returnflag = 'R'
        and c_nationkey = n_nationkey
group by
        c_custkey,
        c_name,
        c_acctbal,
        c_phone,
        n_name,
        c_address,
        c_comment
order by
        revenue desc
LIMIT 20;

In-memory workload

Now let’s see how effective is MRR when the workload fits entirely in memory. For the purpose of benchmarking in-memory workload, the InnoDB buffer pool size is set to 6G and the buffer pool was warmed up, so that the relevant pages were already loaded in the buffer pool. Note that as mentioned at the start of the benchmark results section, the InnoDB dataset size is ~5G. Ok so now let’s take a look at the graph:

MRR doesn’t really make any positive difference to the query times for MySQL 5.6, when the workload fits entirely in memory, because there is no extra cost for memory access at random locations versus memory access at sequential locations. In fact there is extra cost added by the buffering step introduced by MRR, and hence, there is a slight increase in query time for MySQL 5.6, increase of 0.02s. But the query times for MariaDB 5.5 are greater than both MySQL 5.5 and MySQL 5.6

IO bound workload

Now let’s see how effective is MRR when the workload is IO bound. For the purpose of benchmarking IO bound workload, the InnoDB buffer pool size is set to 1G and the buffer pool was not warmed up, so that it does not have the relevant pages loaded up already:

MRR does make a lot of difference when the workload is IO bound, the query time is decreased from ~11min to under a minute. The query time is reduced further when the buffer size is set to 4M. Note also that query time for MariaDB is still a little higher by a couple of seconds, when compared to MySQL 5.6.

Now let’s take a look at the status counters.

MySQL Status Counters

These status counters were captured when performing the benchmark on IO bound workload, mentioned above.

Counter Name MySQL 5.5 MySQL 5.6 MySQL 5.6 w/ read_rnd_bufer_size=4M MariaDB 5.5 MariaDB 5.5 w/ mrr_buffer_size=4M
Created_tmp_disk_tables 1 1 1 1 1
Created_tmp_tables 1 1 1 1 1
Handler_mrr_init N/A 0 0 1 1
Handler_mrr_rowid_refills N/A N/A N/A 1 0
Handler_read_key 508833 623738 622184 508913 507516
Handler_read_next 574320 574320 572889 574320 572889
Handler_read_rnd_next 136077 136094 136366 136163 136435
Innodb_buffer_pool_read_ahead 0 20920 23669 20920 23734
Innodb_buffer_pool_read_requests 1361851 1264739 1235472 1263290 1235781
Innodb_buffer_pool_reads 120548 102948 76882 102672 76832
Innodb_data_read 1.84G 1.89G 1.53G 1.89G 1.53G
Innodb_data_reads 120552 123872 100551 103011 77213
Innodb_pages_read 120548 123868 100551 123592 100566
Innodb_rows_read 799239 914146 912318 914146 912318
Select_scan 1 1 1 1 1
Sort_scan 1 1 1 1 1
  • As you can see from the status counters above that both MySQL 5.6 and MariaDB 5.5 are reporting high numbers for Innodb_buffer_pool_read_ahead which shows that the access pattern was sequential and hence InnoDB decided to do read_ahead, while in MySQL 5.5 no read_ahead was done because the access pattern was not sequential. Another thing to note is that more read_ahead is done when the buffer size used by MRR, is set to 4M, which obviously means that the more index tuples that can fit in the buffer the more sequential the access pattern will be.
  • There is one MRR related variable introduced in MySQL 5.6 and MariaDB 5.5 Handler_mrr_init and another additional one introduced in MariaDB 5.5 Handler_mrr_rowid_refills. Handler_mrr_init is incremented when a MRR range scan is performed, but we can see its only incremented in MariaDB 5.5 and not in MySQL 5.6, is that because of a bug in MySQL 5.6 code? As MRR was used in both MySQL 5.6 and MariaDB 5.5. Handler_mrr_rowid_refills counts how many times the buffer used by MRR had to be reinitialized, because the buffer was small and not all index tuples could fit in the buffer. If this is > 0 then it means Handler_mrr_rowid_refills + 1 MRR range scans had to be performed. As in the table above you can with default buffer size of 256K, MariaDB 5.5 shows that Handler_mrr_rowid_refills = 1, which means the buffer is small and there were 2 MRR range scans needed. But with a buffer size of 4M, we can see that Handler_mrr_rowid_refills = 0, which means that the buffer was big enough and only 1 MRR range scan was needed, which is as efficient as it can be. This is also evident in the query times, which is lower by a couple of seconds when buffer size of 4M is used.
  • Another interesting thing to note is that MySQL 5.6 and MariaDB 5.5 are both reading more rows than MySQL 5.5, as can be seen by the numbers reported for the status counter Innodb_rows_read. While MySQL 5.6 is also reporting increased numbers for the counter Handler_read_key. This is because of how status counter values are incremented when index lookup is performed. As I explained at the start of the post that traditional index lookup (for non-index-only columns) involves, reading an index record, and then using the PK column value in the index record to make a lookup in the PK. Both these lookups are performed in a single call to the storage engine and the counters Handler_read_key and Innodb_rows_read are incremented by ONE. However, when MRR is used then there are two separate calls made to the storage engine to perform the index record read and then to perform the MRR range scan on the PK. This causes the counters Handler_read_key and Innodb_rows_read to be incremented by TWO. It does not actually mean that queries with MRR are performing badly. The interesting thing is that though both MariaDB and MySQL 5.6 are reporting high numbers for Innodb_rows_read, which is completely in line with how the counters behave with MRR, but the value for counter Handler_read_key is more or less the same for MariaDB 5.5 when compared to MySQL 5.5, and this does not make sense to me. Probably its due to a bug in how counter is calculated inside MariaDB?

Other Observations

Sometimes both for MariaDB 5.5 and MySQL 5.6, the optimizer chooses the wrong query execution plan. Let’s take a look at what are the good and bad query execution plans.

a. Bad Plan

id      select_type     table   type    possible_keys   key     key_len ref     rows    filtered        Extra
1       SIMPLE  nation  ALL     PRIMARY NULL    NULL    NULL    25      100.00  Using temporary; Using filesort
1       SIMPLE  customer        ref     PRIMARY,i_c_nationkey   i_c_nationkey   5       dbt3.nation.n_nationkey 2123    100.00
1       SIMPLE  orders  ref     PRIMARY,i_o_orderdate,i_o_custkey       i_o_custkey     5       dbt3.customer.c_custkey 7       100.00  Using where
1       SIMPLE  lineitem        ref     PRIMARY,i_l_orderkey,i_l_orderkey_quantity      PRIMARY 4       dbt3.orders.o_orderkey  1       100.00  Using where

b. Good Plan

id      select_type     table   type    possible_keys   key     key_len ref     rows    filtered        Extra
1       SIMPLE  orders  range   PRIMARY,i_o_orderdate,i_o_custkey       i_o_orderdate   4       NULL    232722  100.00  Using where; Rowid-ordered scan; Using temporary; Using filesort
1       SIMPLE  customer        eq_ref  PRIMARY,i_c_nationkey   PRIMARY 4       dbt3.orders.o_custkey   1       100.00  Using where
1       SIMPLE  nation  eq_ref  PRIMARY PRIMARY 4       dbt3.customer.c_nationkey       1       100.00
1       SIMPLE  lineitem        ref     PRIMARY,i_l_orderkey,i_l_orderkey_quantity      PRIMARY 4       dbt3.orders.o_orderkey  2       100.00  Using where

So during cold query runs the optimizer would switch to using plan ‘a’, which does not involve MRR, and the query time for MySQL 5.6 and MariaDB 5.5 jumps to ~11min (this is the query time for MySQL 5.5) While when it sticks to plan ‘b’ for MySQL 5.6 and MariaDB 5.5, then query times remain under a minute. So when the correct query execution plan is not used, there is no difference in query times between MySQL 5.5 and MySQL 5.6/MariaDB 5.5 This is another area of improvement in the optimizer, as it is clearly a part of the optimizer’s job to select the best query execution plan. I had noted a similar thing when benchmarking ICP, the optimizer made a wrong choice. It looks like that there is still improvement and changes needed in the optimizer’s cost estimation algorithm.

MariaDB 5.5 expands the concept of MRR to improve the performance of secondary key lookups as well. But this works only with joins and specifically with Block Access Join Algorithms. So I am not going to cover it here, but will cover it in my next post which will be on Block Access Join Algorithms.

Conclusion

There is a huge speedup when the workload is IO bound, the query time goes down from ~11min to under a minute. The query time is reduced further when buffer size is set large enough so that the index tuples fit in the buffer. But there is no performance improvement when the workload is in-memory, in fact MRR adds extra sorting overhead which means that the queries are just a bit slower as compared to MySQL 5.5 MRR clearly changes the access pattern to sequential, and hence InnoDB is able to do many read_aheads. Another thing to take away is that MariaDB is just a bit slower as compared to MySQL 5.6, may be something for the MariaDB guys to look at.

from MySQL Performance Blog: http://www.mysqlperformanceblog.com/2012/03/21/multi-range-read-mrr-in-mysql-5-6-and-mariadb-5-5/

Thought this was cool: What Are Full, Incremental, and Differential Backups?

Sometimes you might hear people talk about full backups, and differential backups versus incremental backups. What is the difference?

A full backup is pretty self-explanatory. It makes a copy of all of your MySQL data.

A differential backup, on the other hand, simply records the differences since your last full backup. The advantage of taking a differential backup is usually the space savings. Most databases have a lot of data that does not change from one backup to the next. Not copying this data into your backups can result in significantly smaller backups. In addition, depending on the backup tool used, a differential backup can be less labor-intensive for the server. If a differential backup does not have to scan all of the data to determine what has changed, the differential backup process can be significantly more efficient.

An incremental backup is a modification of a differential backup. An incremental backup records the changes since the last backup of any type, be it a differential or full backup. The advantages of incremental backups are similar to those of differential backups.

Here is an example that might explain this more clearly. Suppose that you take a full backup on Sunday. On Monday, you make a backup of all the changes since Sunday. This is a differential backup.

On Tuesday is when you begin to see the differences between the backup strategies. If you back up the changes since Sunday, then you have made a differential backup. If you back up the changes since Monday, it is an incremental backup.

Why would you choose an incremental versus a differential backup? That is a little bit out of scope for this blog post, because there are a lot of subtleties to consider. However, perhaps the biggest difference is in the way that you would restore a backup. Suppose that you need to restore your database on Friday. If you have taken differential backups all week long, you only need to restore Sunday, and then apply the changes that have happened since Sunday. If you have taken incremental backups, you must restore Sunday’s backup, and then apply changes repeatedly until you reach Friday. This can be more labor intensive, error-prone, and risky. It can also take longer.

Percona XtraBackup is capable of taking incremental backups, and when you specify the base backup as the previous incremental or full backup, you can easily make differential backups.

from MySQL Performance Blog: http://www.mysqlperformanceblog.com/2012/01/23/what-are-full-incremental-and-differential-backups/

Thought this was cool: Percona XtraDB Cluster Feature 2: Multi-Master replication

This is about the second great feature – Multi-Master replication, what you get with Percona XtraDB Cluster.

It is recommended you get familiar with general architecture of the cluster, described on the previous post.

By Multi-Master I mean the ability to write to any node in your cluster and do not worry that eventually you get out-of-sync situation, as it regularly happens with regular MySQL replication if you imprudently write to the wrong server.

This is long-waited feature, I’ve seen growing demand for it for last two years or even more.
Percona XtraDB Cluster provides it, and let’s see how it works.

With our Cluster you can write to any node, and the Cluster guarantees consistency of writes. That is the write is either committed on all nodes or not committed at all.

Let me draw some diagram there. For the simplicity I will use a two-nodes example, but the same logic is applied when you have N nodes.

As you see all queries are executed locally on the node, and only on COMMIT there is special handling.
When you issue COMMIT, your transaction has to pass certification on all nodes. If it does not pass, you
will receive “ERROR” as response on your query. After that transaction is applied on the local node.

That is response time of COMMIT consist of several parts:

  • Network roundtrip
  • Certification time
  • Local applying

Please note that applying the transaction on remote nodes does not affect the response time of COMMIT,
as it happens in the background after the response on certification.

The two important consequences of this architecture.

  • First: we can have several appliers working in parallel. This gives us true parallel replication. Slave can have many parallel threads, and you can tune it by variable wsrep_slave_threads
  • Second: There might be a small period of time when the slave is out-of-sync from master. This happens because the master may apply event faster than a slave. And if you do read from the slave, you may read data, that has not changes yet. You can see that from diagram. However you can change this behavior by using variable wsrep_causal_reads=ON. In this case the read on the slave will wait until event is applied (this however will increase the response time of the read. This gap between slave and master is the reason why this replication named “virtually synchronous replication”, not real “synchronous replication”

The described behavior of COMMIT also has the second serious implication.
If you run write transactions to two different nodes, the cluster will use an optimistic locking model.
That means a transaction will not check on possible locking conflicts during individual queries, but rather on the COMMIT stage. And you may get ERROR response on COMMIT. I am highlighting this, as this is one of incompatibilities with regular InnoDB, that you may experience. In InnoDB usually DEADLOCK and LOCK TIMEOUT errors happen in response on particular query, but not on COMMIT. Well, if you follow a good practice, you still check errors code after “COMMIT” query, but I saw many applications that do not do that.

So, if you plan to use Multi-Master capabilities of XtraDB Cluster, and run write transactions on several nodes, you may need to make sure you handle response on “COMMIT” query.

from MySQL Performance Blog: http://www.mysqlperformanceblog.com/2012/01/19/percona-xtradb-cluster-feature-2-multi-master-replication/

Thought this was cool: How Does Semisynchronous MySQL Replication Work?

With the recent release of Percona XtraDB Cluster, I am increasingly being asked about MySQL’s semi-synchronous replication. I find that there are often a number of misconceptions about how semi-synchronous replication really works. I think it is very important to understand what guarantees you actually get with semi-synchronous replication, and what you don’t get.

The first thing to understand is that despite the name, semi-synchronous replication is still asynchronous. Semi-synchronous is actually a pretty bad name, because there is no strong coupling between a commit on the master and a commit on the replicas. To understand why, let’s look at what truly synchronous replication means. In truly synchronous replication, when you commit a transaction, the commit does not complete until all replicas have also committed successfully. In MySQL’s semi-synchronous replication, the commit completes before the transaction is even sent to any of the replicas. Therefore, by definition the transaction cannot have committed on any of the replicas. If there’s any problem after the commit happens on the master, it’s possible that the replicas won’t get the transaction, and even after they do, there’s no guarantee they can apply and commit it successfully themselves (duplicate key error, anyone?). If any of these problems happens, it’s too late–the commit is already permanent on the master, and can’t be rolled back.

What should semi-synchronous replication be called instead? I believe that it should be called delayed-acknowledgment commits, because this is what actually happens. When a transaction commits on the master, the commit proceeds as normal, and the transaction is sent to the replicas as normal, but the client connection to the master is not told that the commit has completed until after at least one replica has acknowledged receiving the transaction.

Another way to look at the same thing is that semi-synchronous replication actually forces the client to be synchronized, not the replicas. The client is forced to wait until the transaction has been sent to one of the replicas, but the commit on the master is not forced to wait at all, nor are replicas forced to do anything. The commit has already happened on the master, so the cat’s out of the bag and there’s no way to force replicas to do anything. As a result, the effect is that the client’s activity is throttled so that it cannot outpace the replica’s ability to fetch updates from the master. Have you seen the bumper sticker that says “don’t drive faster than your Guardian Angel can fly?” That is the effect of this throttling.

Semi-synchronous replication also does not guarantee that your replicas will not become delayed. The client connection is forced to wait until at least one of the replicas has retrieved the transaction, but not until the transaction has actually been applied to the replica. As you probably know, it is perfectly possible to send a very long transaction to the replica in a matter of milliseconds. The replica will take a long time to apply this transaction to its own data, and during that time, it will be delayed relative to the master. However, other transactions can continue committing and sending their changes to the replica, because the process of retrieving changes from the master and applying them run in separate threads on the replica.

Finally, semi-synchronous replication does not provide strong guarantees against data loss. What do I mean by a strong guarantee against data loss? I consider the safety of my data strongly guaranteed when at least one other server must have a copy of the data before it can be committed on the master. However, that is not what happens in semi-synchronous replication. And if there is an error in semi-synchronous replication, such as a crash at the wrong moment, or a timeout, then even the throttling is abandoned, and everything defaults back to the traditional mode of replication.

What does semi-synchronous replication guarantee me then? If there are no errors or timeouts, then the guarantee is essentially that only one transaction per client is likely to be lost if the master crashes.

I do not mean to sound negative, or to send the message that semi-synchronous replication is not useful. It is useful, but if you misunderstand it, you could be relying on a strong guarantee that is not actually provided.

If you want to learn more about this, then I encourage you to read the relevant section of the MySQL manual. But read carefully, for example, the following sentences:

When a commit returns successfully, it is known that the data exists in at least two places (on the master and at least one slave). If the master commits but a crash occurs while the master is waiting for acknowledgment from a slave, it is possible that the transaction may not have reached any slave.

Finally, I would be interested to hear how many people are actually running semi-synchronous replication in production. I have a feeling that very few people are, even though a lot of people seem to have heard about it. What are your experiences?

from MySQL Performance Blog: http://www.mysqlperformanceblog.com/2012/01/19/how-does-semisynchronous-mysql-replication-work/

Thought this was cool: Percona XtraDB Cluster Feature 1: High Availability

There and in coming posts I am going to cover main features of Percona XtraDB Cluster. The first feature is High Availability.

But before jumping to HA, let’s review general architecture of the Percona XtraDB Cluster.

1. The Cluster consists of Nodes. Recommended configuration is to have at least 3 nodes, but you can make it running with 2 nodes too.
2. Each Node is regular MySQL / Percona Server setup. The point is that you can convert your existing MySQL / Percona Server into Node and roll Cluster using it as base. Or otherwise – you can detach Node from Cluster and use it as just a regular server.
3. Each Node contains the full copy of data. That defines XtraDB Cluster behavior in many ways. And obviously there are benefits and drawbacks.

Cluster architecture

Benefits of such approach:

  • When you execute a query, it is executed locally on the node. All data is available locally, no need for remote access.
  • No central management. You can loose any node at any point of time, and the cluster will continue to function
  • Good solution for scaling a read workload. You can put read queries to any of the nodes

Drawbacks:

  • Overhead of joining new node. The new node has to copy full dataset from one of existing node. If it is 100GB, it copies 100GB.
  • This can’t be used as an effective write scaling solution. There might be some improvements in write throughput when you run write traffic to 2 nodes vs all traffic to 1 node, but you can’t expect a lot. All writes still have to go on all nodes.
  • You have several duplicates of data. For 3 nodes – 3 duplicates

This basically defines how Percona XtraDB Cluster can be used for High Availability.

Basic setup: you run 3-nodes setup.
The Percona XtraDB Cluster will continue to function when you take any of nodes down.
At any point of time you can shutdown any Node to perform maintenance or make configuration changes.
Or Node may crash or become network unavailable. The Cluster will continue to work, you can continue to run queries on working nodes.

The biggest question there, what will happen when the Node joins the cluster back, and there were changes to data while the node
was down.

Let’s focus on this with details.
There is two ways that Node may use when it joins the cluster: State Snapshot Transfer (SST) and Incremental State Transfer (IST).

  • SST is the full copy if data from one node to another. SST is used when new node joins the cluster, it has to transfer data from existing node.
    There is three methods of SST available in Percona XtraDB Cluster: mysqldump, rsync and xtrabackup ( Percona XtraBackup with support of XtraDB Cluster will be released soon, so far you need to use our source code repository).
    The downside of mysqldump and rsync is that your cluster becomes READ-ONLY for time that takes to copy data from one node to another (SST applies FLUSH TABLES WITH READ LOCK command).
    Xtrabackup SST does not require READ LOCK for full time, only for syncing .frm files (the same as with regular backup).
  • Even with that, SST may be intrusive, that’s why there is IST mechanism. If down your node for short period of time, and then start it, the node is able to fetch only changes made during period it was down.
    This is done using caching mechanism on nodes. Each node contains a cache, ring-buffer, (the size is configurable) of last N changes, and the node is able to transfer part of this cache. Obviously IST can be done only if amount of changes needed to transfer is less than N. If it exceeds N, then the joining node has to perform SST.

You can monitor current state of Node by using
SHOW STATUS LIKE 'wsrep_local_state_comment', when it is ‘Synced (6)’, the node is ready to handle traffic.

from MySQL Performance Blog: http://www.mysqlperformanceblog.com/2012/01/17/xtradb-cluster-feature-1-high-availability/

Thought this was cool: Optimize Your SQL With Percona’s Online Query Advisor!

Wouldn’t it be nice if you could get expert advice on your SQL queries to find problems in them, the same way that programmers can use lint-check tools to warn about bugs in their C?

if ( execute = 1 ) {
   launch_missile();
}


Such a simple mistake, but it’s the kind of thing that James Bond movies are made of, isn’t it? Well, a lot of SQL queries have similar bugs, and thanks to Miguel Trias, now there’s a tool to help you find them. This is the second addition to our online suite of tools for MySQL users. You paste a query, it tells you what’s wrong with it. Simple as that.

Find the bug in this query:

select * from t1
   left join t2 using(id)
where t2.created_date < 2012-02-15;


Do you see it? Congratulations! I’ve analyzed that query with the tool, and shared the results with you. Click here to see if you were right.

That demonstrates another feature of the tool: you can share queries so they’re public, and other people can see them. The tool uses the same account that you might have already created for our online MySQL Configuration Wizard, which helps you choose good settings for your server.

Give it a try, and tell your friends! Enjoy hunting for bugs and problems in your SQL queries!

from MySQL Performance Blog: http://www.mysqlperformanceblog.com/2012/02/15/optimize-your-sql-with-perconas-online-query-advisor/

Thought this was cool: Troubleshooting MySQL Upgrade Performance Regressions

So lets say you upgraded from MySQL 5.1 to Percona Server 5.5 and instead of expected performance improvement you see your performance being worse. What should you do ?
First if you followed MySQL upgrade best practices such as testing your workload with pt-upgrade the chances of this happening are rather slim. But lets assume you have not followed these recommendations to the book or some things just slipped through.

First lets talk about what you should not do. You should not panic and go change all kind of configuration options in case you get lucky. Sometimes you do but more often you just waste your time. Instead try to understand what has changed and what exactly changed for worse.

I am speaking about MySQL upgrade – version change but in a lot of cases this will happen when settings are changed at the same time, operating system is upgraded or system is moved to “better” hardware. It is best if you can isolate the problem to one of these – make sure changing only MySQL version on the same hardware with same settings causes regression. You can always improve them later after you got well performing baseline. Do not change storage engine at the same time eather.

Make sure you’re looking at fully warmed up system. Rather often people panic and what thought to be slow down with version upgrade is rather system warming up. You need to compare apples to apples such as both systems after restart or even better both systems fully warmed up.

To spot what exactly is causing the problem it is good to use pt-query-digest to compare workloads on servers. Best if you can analyze full query logs but if this is not possible at least look at network traffic from tcpdump. This should help you to spot queries which are a lot slower on new MySQL version.

Once you have the queries you will most likely able to repeat the problem by running query on old and new MySQL Server version and observing the difference. Sometimes though it might not show performance difference ran alone on idle system – in this case you might be dealing with concurrency problem and you might need to create a more complicated test which runs this query concurrently many times or even workload consisting of several query types. Though it is rather rare for this to be needed so lets not go there.

It is ideal if you can repeat the problem on “test server” where you can analyze it without side load. Setting up 2 MySQL Servers side by side (for example with MySQL Sandbox) can especially be helpful.

Once you have spotted the query which performs differently between MySQL Server versions you should:

Check Query Plan Run EXPLAIN to see if plans for the query are the same. Changing Query Execution plans is the most common regression problem. If plan has changed check
if you can get the plan manually to the old one by using hints such as STRAIGHT_JOIN, FORCE INDEX, BIG_RESULT/SMALL_RESULT. Check whatever stats are the same (run SHOW INDEXES FROM

for tables involved and check cardinality) Different stats can often cause different plans. Run ANALYZE TABLE on both MySQL Versions to see if you can get statistics to be same or close. I should warn you though. In some cases the faster plan may be wrong plan from optimizer standpoint and updating stats may cause old MySQL Version to get slower plan instead of new one getting faster. Check stats sampling settings such as innodb-stats-method and myisam-stats-method. For Innodb storage engines it is possible for stats to be different for the same data because it uses random sampling to update the stats. You may consider disabling innodb_stats_on_metadata and increasing innodb-stats-sample-pages to get more accurate results. In recent Percona Server versions and MySQL 5.6 you can also store innodb stats in table so you have more control over them.

In the end the tricky thing about Query Plan is you might not be able to get new version to pick the old plan no matter how you try and you might be forced to change your application by changing how query is written and provide the hints.

Check Query Execution If query execution plan looks the same it is good to check whenever it seems to be executing same on low level. FLUSH STATUS; SHOW STATUS can show whenever internal operation is different, you can also check SHOW PROFILE to see where differences are. Running query in the loop and getting o_profile sample can also be a good idea. If you found query is executing differently even if the plan is the same it could be because changes to the options (might be even defaults) or some more subtle plan changes which are not seen in explain. It also could be some additional internal overhead being added. If you have query with same plan taking a lot longer to execute you might be up for deep investigation. You may also wish to create a repeatable test case which does not use any sensitive data and file the bug at this point.

Check Binaries Make sure you’re not using debug binary or differently optimized binaries.

When you’re about to file a bug or report the problem through MySQL support channel if often helps to know whenever it is the problem in stick MySQL changes between versions on Percona enhancements. Before reporting bug to Percona Server bug tracking system check whenever the bug is in Percona Server indeed by running the same test on MySQL server of the same version. If you’re Support customer you do not need to do it as Support Engineer can do it for you provided we have complete test case provided.

from MySQL Performance Blog: http://www.mysqlperformanceblog.com/2012/02/15/troubleshooting-mysql-upgrade-performance-regressions/

Thought this was cool: Comment on INSERT INTO … SELECT Performance with Innodb tables. by josh

+1 for @adelainev. I have the same question whether this behavior still applies in MySQL 5.5+. This post was written in 2006, so just want to see where things stand currently.

from Comments for MySQL Performance Blog: http://www.mysqlperformanceblog.com/2006/07/12/insert-into-select-performance-with-innodb-tables/comment-page-1/#comment-865383

Thought this was cool: Hyder: Transactional Indexed Record Manager for Shared Flash Storage

If you work in the database world, you already
know
Phil
Bernstein
. He’s the author of Principles
of Transaction Processing
and has
a long track record as a successful and prolific database researcher.  Past
readers of this blog may remember Phil’s guest blog posting on
Google
Megastore
. Over the past few years,
Phil has been working on an innovative
NoSQL
system
based
upon flash storage. I like the work because it pushes the limit of what can be done
on a single server with transaction rates approaching 400,000, leverages the characteristics
of flash storage in a thought provoking way, and employs interesting techniques such
as log-only storage.

 

Phil presented Hyder at the Amazon ECS series
a couple of weeks back (a past ECS presentation at:
High
Availability for Cloud Computing Database Systems
.

 

In the Hyder system, all cores operate on
a single shared transaction log. Each core (or thread) processes
Optimistic
Concurrency Control
(OCC) database
transactions one at a time. Each transaction posts its after-image to the shared log.
One core does OCC and rolls forward the log. The database is a
binary
search tree
serialized into the
log (A
B-tree would
work equally well in this application). Because the log is effectively a no-overwrite,
log-only datastore, a changed node require that the parent must now point to this
new node which forces the parent to be updated as well. Now its parent needs updating
and this cascading set of changes proceeds to the root on each update.

 

The tree is maintained via copy-on-write
semantics where updates are written to the front of the log with references to unchanged
tree nodes pointing back to the appropriate locations in the log. Whenever a node
changes, the changed node is written to the front of the log. Consequently all database
changes result in changes to all nodes to the top of the search tree.

 

This has the downside of requiring
many tree nodes to be updated on each database update but has the upside of the writes
all being sequential at the front of the log. Since it is a no-overwrite store, when
an update is made, the old nodes remain so transactional time travel is easy. The
old search tree root still point to a complete tree that was current as of the point
in time when that root was the current root of the search tree.  As
new nodes are written, some old nodes are no longer part of the current search tree
and can be garbage collected over time.

Transactions are implemented by
writing an intention log record to the front of the log with all changes required
by this transaction and these tree nodes point either to other nodes within the intention
record or to unchanged nodes further back in the log. This can be done quickly and
all updates can proceed  in parallel without
need for locking or synchronization.

 

Before the transaction can be completed, it
must now be checked for conflict using
Optimistic
Concurrency Control
.
If there are no conflicts, the root of the search tree is atomically moved to point
to the new root and the transaction is acknowledged as successful. If the transaction
is in conflict, it is failed and the tree root is not advanced and the intention record
becomes garbage.

 

Most of the transactional update
work can be done concurrently without locks but two issues come to mind quickly:

 

1)      Garbage
collection
: because the systems is constantly rewriting large
portions of the search tree, old versions of the tree a spread throughout the log
and need to be recovered.

2)      Transaction
Rate:
The transaction rate is limited by the rate at which
conflicts can be checked and the tree root advanced.

 

The latter is the biggest concern
and the rest of the presentation focuses on the rate with which this bottleneck can
be processed.  The presenter showed that
rates in 400,000 transaction per second where obtained in performance testing so this
is a hard limit but it is a fairly high hard limit. This design can go a long way
before partitioning is required.

 

If you want to dig deeper, the
Hyder presentation is at:

http://mvdirona.com/jrh/TalksAndPapers/Hyder4Amazon5Dec2011.pdf

 

More detailed papers can
be found at:

 

Philip A. Bernstein,
Colin W. Reid, Sudipto Das: Hyder – A Transactional Record Manager for Shared Flash.
CIDR 2011: 9-20

http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper2.pdf

 

Philip A. Bernstein,
Colin W. Reid, Ming Wu, Xinhao Yuan: Optimistic Concurrency Control by Melding Trees.
PVLDB 4(11): 944-955 (2011)

http://www.vldb.org/pvldb/vol4/p944-bernstein.pdf

 

Colin W. Reid, Philip
A. Bernstein: Implementing an Append-Only Interface for Semiconductor Storage. IEEE
Data Eng. Bull. 33(4): 14-20 (2010)

http://sites.computer.org/debull/A10dec/hyder.pdf

 

Mahesh Balakrishnan,
Philip A. Bernstein, Dahlia Malkhi, Vijayan Prabhakaran, Colin W. Reid: Brief Announcement:
Flash-Log – A High Throughput Log. DISC 2010: 401-403

http://www.springerlink.com/content/c732l27h3mrn3170/

 

James
Hamilton

e: jrh@mvdirona.com

w: http://www.mvdirona.com

b: http://blog.mvdirona.com / http://perspectives.mvdirona.com

 



From Perspectives.
from Perspectives: http://perspectives.mvdirona.com/2011/12/14/HyderTransactionalIndexedRecordManagerForSharedFlashStorage.aspx