Saturday, July 16, 2016

Galera awareness and ProxySQL Scheduler

One of the most common request for ProxySQL it to support Galera cluster.
Support for Galera requires a lot of features to be added to ProxySQL , from checking if a node is in sync state, if it is a garbd process, auto-detect of new nodes, send writes to only one node,  remote DC support, and more.
Furthermore, as Galera is a clustering solution and not all setups have the same requirements, to support all the possible configurations seemed to be a too complex task.

Support for Galera in all its possible configurations wasn't the only request that seems to be complex to implement in the core of ProxySQL, but also additional requests not related to Galera, like changing the weight on a node while overloaded (as measured by different metrics! CPU usage, disk IO, number of threads) or removing a node from a pool while a backup was running.

I always advised that ProxySQL is easily configurable, and anybody can write a script that configures it at runtime through the Admin interface. The script can have any complex logic or algorithm, and then configures ProxySQL accordingly. In short, I always advised that ProxySQL should be a core component, and tools can be built around it.
But there was still a missing part: a glue to link ProxySQL and the external script(s).
For this reason, recently ProxySQL supported a new feature: jobs Scheduler.
Scheduler is a cron-like solution embedded directly inside ProxySQL. For more details, please have a look at the documentation.

Scheduler can run any sort of external executable, therefore we can write a script that monitor all backends of a specific hostgroup (a Galera cluster), remove hosts that are not in sync and re-add them once in sync.
A ready to use script is distributed with ProxySQL itself: proxysql_galera_checker.sh .
The script is inspired by Percona clustercheck, and accepts 3 arguments mandatory and one optional:

$ ./proxysql_galera_checker.sh
Usage: ./proxysql_galera_checker.sh <hostname> <port> <hostgroup_id> <log_file>

When running the script we need to pass ProxySQL's hostname and the port where the Admin interface is running. The script will connect (using hardcoded credential) to the Admin interface, it will retrieves credentials to connect to the backends and a list of backends for a specific hostgroup, it will check the value of wsrep_local_state on all these backends, and if they change status will automatically reconfigure ProxySQL.

As an example, we have a Galera cluster with 3 nodes:



Admin> SELECT hostgroup_id,hostname,port,status FROM mysql_servers;
+--------------+-------------+------+--------+
| hostgroup_id | hostname    | port | status |
+--------------+-------------+------+--------+
| 0            | 10.10.10.11 | 3306 | ONLINE |
| 0            | 10.10.10.12 | 3306 | ONLINE |
| 0            | 10.10.10.13 | 3306 | ONLINE |
+--------------+-------------+------+--------+
3 rows in set (0.00 sec)




Next, we need to configure the scheduler to run the script at regular interval. In this example, we will run the job every 10000 milliseconds (10 seconds):



Admin> SHOW CREATE TABLE scheduler\G
*************************** 1. row ***************************
       table: scheduler
Create Table: CREATE TABLE scheduler (
id INTEGER NOT NULL,
interval_ms INTEGER CHECK (interval_ms>=100 AND interval_ms<=100000000) NOT NULL,
filename VARCHAR NOT NULL,
arg1 VARCHAR,
arg2 VARCHAR,
arg3 VARCHAR,
arg4 VARCHAR,
arg5 VARCHAR,
PRIMARY KEY(id))
1 row in set (0.00 sec)

Admin> INSERT INTO scheduler(id,interval_ms,filename,arg1,arg2,arg3,arg4) VALUES
(1,'10000','/var/lib/proxysql/proxysql_galera_checker.sh','127.0.0.1','6032','0',
'/var/lib/proxysql/proxysql_galera_checker.log');
Query OK, 1 row affected (0.00 sec)

Admin> select * from scheduler\G
*************************** 1. row ***************************
         id: 1
interval_ms: 10000
   filename: /var/lib/proxysql/proxysql_galera_checker.sh
       arg1: 127.0.0.1
       arg2: 6032
       arg3: 0
       arg4: /var/lib/proxysql/proxysql_galera_checker.log
       arg5: NULL
1 row in set (0.00 sec)

Admin> SELECT * FROM runtime_scheduler;
Empty set (0.00 sec)


At this stage the job is not loaded at runtime yet. So next we must run LOAD SCHEDULER TO RUNTIME:

Admin> LOAD SCHEDULER TO RUNTIME;
Query OK, 0 rows affected (0.00 sec)

Admin> select * from runtime_scheduler\G
*************************** 1. row ***************************
         id: 1
interval_ms: 10000
   filename: /var/lib/proxysql/proxysql_galera_checker.sh
       arg1: 127.0.0.1
       arg2: 6032
       arg3: 0
       arg4: /var/lib/proxysql/proxysql_galera_checker.log
       arg5: NULL
1 row in set (0.00 sec)


The first test we will run is to shutdown a node:

vagrant@n3:~$ sudo service mysql stop
 * Stopping MySQL (Percona XtraDB Cluster) mysqld


From the error log of this check, we can see that the script detected that a node is not reachable, and set it to offline:

vagrant@n1:~$ sudo tail /var/lib/proxysql/proxysql_galera_checker.log -n 20
Sat Jul 16 18:41:33 UTC 2016 Check server 10.10.10.12:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:34 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:44 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:44 UTC 2016 Check server 10.10.10.12:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:44 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:54 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:54 UTC 2016 Check server 10.10.10.12:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:41:54 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:42:04 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:42:04 UTC 2016 Check server 10.10.10.12:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:42:04 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:42:14 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:42:14 UTC 2016 Check server 10.10.10.12:3306 , status ONLINE , wsrep_local_state
Sat Jul 16 18:42:14 UTC 2016 Changing server 10.10.10.12:3306 to status OFFLINE_SOFT
Sat Jul 16 18:42:14 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:42:25 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:42:25 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:42:25 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4


Also mysql_servers confirms the change:

Admin> SELECT hostgroup_id,hostname,port,status FROM mysql_servers;
+--------------+-------------+------+--------------+
| hostgroup_id | hostname    | port | status       |
+--------------+-------------+------+--------------+
| 0            | 10.10.10.11 | 3306 | ONLINE       |
| 0            | 10.10.10.12 | 3306 | OFFLINE_SOFT |
| 0            | 10.10.10.13 | 3306 | ONLINE       |
+--------------+-------------+------+--------------+
3 rows in set (0.00 sec)


To continue our tests, we will set a host in desync mode:

mysql> show global variables like 'wsrep_desync';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_desync  | OFF   |
+---------------+-------+
1 row in set (0.00 sec)

mysql> set global wsrep_desync=ON;
Query OK, 0 rows affected (0.05 sec)

mysql> show global variables like 'wsrep_desync';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_desync  | ON    |
+---------------+-------+
1 row in set (0.00 sec)


Also in this case the script will notice it:

vagrant@n1:~$ sudo tail /var/lib/proxysql/proxysql_galera_checker.log -n 20
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:48:01 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:48:01 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:48:11 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:48:11 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:48:11 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:48:21 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:48:21 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:48:21 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:48:31 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 2
Sat Jul 16 18:48:31 UTC 2016 Changing server 10.10.10.11:3306 to status OFFLINE_SOFT
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:48:31 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:48:31 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:48:41 UTC 2016 Check server 10.10.10.11:3306 , status OFFLINE_SOFT , wsrep_local_state 2
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:48:42 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:48:42 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4


And mysql_servers will confirm the change:

Admin> SELECT hostgroup_id,hostname,port,status FROM mysql_servers;
+--------------+-------------+------+--------------+
| hostgroup_id | hostname    | port | status       |
+--------------+-------------+------+--------------+
| 0            | 10.10.10.11 | 3306 | OFFLINE_SOFT |
| 0            | 10.10.10.12 | 3306 | OFFLINE_SOFT |
| 0            | 10.10.10.13 | 3306 | ONLINE       |
+--------------+-------------+------+--------------+
3 rows in set (0.00 sec)


To continue this proof of concept, we will now put the last node in sync:

mysql> show global variables like 'wsrep_desync';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_desync  | ON    |
+---------------+-------+
1 row in set (0.00 sec)

mysql> set global wsrep_desync=OFF;
Query OK, 0 rows affected (0.00 sec)

mysql> show global variables like 'wsrep_desync';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_desync  | OFF   |
+---------------+-------+
1 row in set (0.00 sec)


The script will detect it that the node is back ONLINE:

vagrant@n1:~$ sudo tail /var/lib/proxysql/proxysql_galera_checker.log -n 20
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:49:53 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:49:53 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:50:03 UTC 2016 Check server 10.10.10.11:3306 , status OFFLINE_SOFT , wsrep_local_state 2
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:50:03 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:50:03 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:50:13 UTC 2016 Check server 10.10.10.11:3306 , status OFFLINE_SOFT , wsrep_local_state 2
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:50:13 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:50:13 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:50:23 UTC 2016 Check server 10.10.10.11:3306 , status OFFLINE_SOFT , wsrep_local_state 2
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:50:24 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:50:24 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:50:34 UTC 2016 Check server 10.10.10.11:3306 , status OFFLINE_SOFT , wsrep_local_state 4
Sat Jul 16 18:50:34 UTC 2016 Changing server 10.10.10.11:3306 to status ONLINE
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:50:34 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:50:34 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4


And correctly reconfigure mysql_servers:

Admin> SELECT hostgroup_id,hostname,port,status FROM mysql_servers;
+--------------+-------------+------+--------------+
| hostgroup_id | hostname    | port | status       |
+--------------+-------------+------+--------------+
| 0            | 10.10.10.11 | 3306 | ONLINE       |
| 0            | 10.10.10.12 | 3306 | OFFLINE_SOFT |
| 0            | 10.10.10.13 | 3306 | ONLINE       |
+--------------+-------------+------+--------------+
3 rows in set (0.01 sec)


Finally, we will restart the node shutdown previously:

vagrant@n3:~$ sudo service mysql start
 * Starting MySQL (Percona XtraDB Cluster) database server mysqld                                                                                                                                             [ OK ]


Once again, the script will detect that a node is now healthy:

vagrant@n1:~$ sudo tail /var/lib/proxysql/proxysql_galera_checker.log -n 20
Sat Jul 16 18:51:15 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:51:25 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:51:25 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:51:25 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:51:35 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:51:35 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:51:35 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:51:45 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
ERROR 2003 (HY000): Can't connect to MySQL server on '10.10.10.12' (111)
Sat Jul 16 18:51:45 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state
Sat Jul 16 18:51:45 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:51:55 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:51:55 UTC 2016 Check server 10.10.10.12:3306 , status OFFLINE_SOFT , wsrep_local_state 4
Sat Jul 16 18:51:55 UTC 2016 Changing server 10.10.10.12:3306 to status ONLINE
Sat Jul 16 18:51:55 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:52:06 UTC 2016 Check server 10.10.10.11:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:52:06 UTC 2016 Check server 10.10.10.12:3306 , status ONLINE , wsrep_local_state 4
Sat Jul 16 18:52:06 UTC 2016 Check server 10.10.10.13:3306 , status ONLINE , wsrep_local_state 4


And correctly reconfigure mysql_servers:

Admin> SELECT hostgroup_id,hostname,port,status FROM mysql_servers;
+--------------+-------------+------+--------+
| hostgroup_id | hostname    | port | status |
+--------------+-------------+------+--------+
| 0            | 10.10.10.11 | 3306 | ONLINE |
| 0            | 10.10.10.12 | 3306 | ONLINE |
| 0            | 10.10.10.13 | 3306 | ONLINE |
+--------------+-------------+------+--------+
3 rows in set (0.00 sec)


Finally, do not forget to save the job to disk:

Admin> SAVE SCHEDULER TO DISK;
Query OK, 0 rows affected (0.01 sec)


Conclusion

Complete Support for Galera Cluster in ProxySQL is still not ready.
Although, the ability of ProxySQL to be easily reconfigurable at runtime, together with its Scheduler, will quickly extend ProxySQL capabilities, include a better integration with Galera in complex setups.

Saturday, June 18, 2016

Query rewrite with ProxySQL : use case scenario

One of the interesting features of ProxySQL is the support of query rewrite.
In fact, this functionality was one of the main motivations for writing a proxy that sits between the application and the database server to rewrite queries on the fly, without the need to have a developer involved in rewriting the query ASAP (if ever possible!).


In this short post I will describe a use case scenario on how to use ProxySQL to troubleshoot performance and quickly rewrite queries if needed.
The setup in this specific blog post is Nylas platform, a sharded environment where there are over 80 ProxySQL instances running.

After noticing some odd workload originated from Nylas' open source sync engine , as per the attached graph on InnoDB rows read:


We could ask ProxySQL to get the list of the most time consuming queries:


Or even ask which is the query with the highest average execution time:



Note that we could have get the same information from information_schema.events_statements_summary_by_digest , but we already knew that the load was coming from a specific node so we queried stats.stats_mysql_query_digest inside ProxySQL itself.

From the output above, it was obvious that the problem was with a specific query type running at that time.

Without copying again the whole query, the query could be simplified as:

SELECT list_of_columns
    FROM tablename
WHERE
    PK >= ? 
    AND idx_col1 IN (? , ?)
    AND col2 = ?
    AND col3 = ?
    ORDER BY PK ASC
LIMIT ? 


col2 and col3 are not indexes because do not provide enough cardinality, while idx_col1 does, so it is indexed.
Because of the WHERE clauses and the ORDER BY , mysqld processes it performing a range scan on the primary key.
The optimizer could be smart enough to optimize this query, but it is not.

The query can be rewritten as:

SELECT * FROM (
(SELECT list_of_columns
    FROM tablename
WHERE
    PK >= ? 
    AND idx_col1 = v1
    AND col2 = ?
    AND col3 = ?
    ORDER BY PK ASC
LIMIT ?)
UNION ALL
(SELECT list_of_columns
    FROM tablename
WHERE
    PK >= ? 
    AND idx_col1 = v1
    AND col2 = ?
    AND col3 = ?
    ORDER BY PK ASC
LIMIT ?)
) t ORDER BY PK LIMIT ?

How to do this with ProxySQL? Extremely easy!
The snippet below shows how to create a rule to rewrite the query, load the new rule into runtime, and persist the new rule on disk:



From the following graph of InnoDB rows read it is immediately clear the effect of the new query rewrite:



In conclusion: query rewrite is a feature that, depending from the queries running against the database server, quickly allows to isolate and correct problematic queries and improve performance.