-
Curious case of PXC node that refused to start due to SSL
In this blog, I am going to share a real-world debugging case study where a routine Percona XtraDB Cluster node restart led to an unexpected failure. I will walk through what we observed, what we checked, and how we ultimately identified the root cause.
Let’s see how the maintenance went. It was supposed to be a simple restart. The kind you’ve done a hundred times. You SSH in, run the maintenance, bring the node back up, and go grab a coffee. Except this time, the coffee went cold on the desk… because MySQL refused to start.
The Problem
The error log of Percona XtraDB Cluster (8.0) contained the following:
2025-11-05T05:26:10.982984Z 0 [ERROR] [MY-000059] [Server] SSL error: Unable to get certificate from '/var/lib/mysql/server-cert.pem'.
2025-11-05T05:26:10.983030Z 0 [Warning] [MY-013595] [Server] Failed to initialize TLS for channel: mysql_main. See below for the description of exact issue.
2025-11-05T05:26:10.983045Z 0 [Warning] [MY-010069] [Server] Failed to set up SSL because of the following SSL library error: Unable to get certificate
2025-11-05T05:26:10.983052Z 0 [Note] [MY-000000] [WSREP] New joining cluster node configured to use specified SSL artifacts
2025-11-05T05:26:10.983083Z 0 [Note] [MY-000000] [Galera] Loading provider /usr/lib64/galera4/libgalera_smm.so initial position: 07c67757-0d18-11ef-b5a9-ee5d87b39aa8:4147053897
2025-11-05T05:26:10.983098Z 0 [Note] [MY-000000] [Galera] wsrep_load(): loading provider library '/usr/lib64/galera4/libgalera_smm.so'
2025-11-05T05:26:10.983742Z 0 [Note] [MY-000000] [Galera] wsrep_load(): Galera 4.22(f6c0465) by Codership Oy <info@codership.com> (modified by Percona <https://percona.com/>) loaded successfully.
2025-11-05T05:26:10.983771Z 0 [Note] [MY-000000] [Galera] Resolved symbol 'wsrep_node_isolation_mode_set_v1'
2025-11-05T05:26:10.983784Z 0 [Note] [MY-000000] [Galera] Resolved symbol 'wsrep_certify_v1'
2025-11-05T05:26:10.983807Z 0 [Note] [MY-000000] [Galera] CRC-32C: using 64-bit x86 acceleration.
2025-11-05T05:26:10.983995Z 0 [Note] [MY-000000] [Galera] not using SSL compression
2025-11-05T05:26:10.984341Z 0 [ERROR] [MY-000000] [Galera] Bad value '/var/lib/mysql/server-cert.pem' for SSL parameter 'socket.ssl_cert': 336245135: 'error:140AB18F:SSL routines:SSL_CTX_use_certificate:ee key too small'
at /mnt/jenkins/workspace/pxc80-autobuild-RELEASE/test/rpmbuild/BUILD/Percona-XtraDB-Cluster-8.0.42/percona-xtradb-cluster-galera/galerautils/src/gu_asio.cpp:ssl_prepare_context():471
2025-11-05T05:26:10.984401Z 0 [ERROR] [MY-000000] [Galera] Failed to create a new provider '/usr/lib64/galera4/libgalera_smm.so' with options 'gcache.size=1G;gcache.recover=yes;socket.ssl=yes;socket.ssl_ca=/data00/mysqldata/ca.pem;socket.ssl_cert=/data00/mysqldata/server-cert.pem;socket.ssl_key=/data00/mysqldata/server-key.pem;socket.ssl_key=/var/lib/mysql/server-key.pem;socket.ssl_ca=/var/lib/mysql/ca.pem;socket.ssl_cert=/var/lib/mysql/server-cert.pem': Failed to initialize wsrep provider
2025-11-05T05:26:10.984434Z 0 [ERROR] [MY-000000] [WSREP] Failed to load provider
2025-11-05T05:26:10.984448Z 0 [ERROR] [MY-010119] [Server] Aborting
2025-11-05T05:26:10.984602Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.42-33.1) Percona XtraDB Cluster (GPL), Release rel33, Revision 6673f8e, WSREP version 26.1.4.3.
2025-11-05T05:26:10.985473Z 0 [ERROR] [MY-010065] [Server] Failed to shutdown components infrastructure.
MySQL was down, and the maintenance clock was running. The certificate file sitting at /var/lib/mysql/server-cert.pem was the same file that had been working perfectly fine before the restart!
We knew from history that the following commands had been executed successfully on this very cluster node:
SET GLOBAL ssl_ca = '/var/lib/mysql/ca.pem';
SET GLOBAL ssl_cert = '/var/lib/mysql/server-cert.pem';
SET GLOBAL ssl_key = '/var/lib/mysql/server-key.pem';
ALTER INSTANCE RELOAD TLS;
Clients connected over TLS. Galera nodes communicated securely. There were zero complaints in the error log.
In other words, the runtime SSL reload ran inside a process that still carried the environment from when MySQL originally booted. Everything was smooth then, but after a restart, MySQL complained and declined to start. So what had changed?
Checking Usual Suspects
File permissions
We checked the PEM files.
Ownership: mysql:mysql.
Permissions: 644 for the cert, 600 for the key.
We compared them against the other Galera nodes, and they were identical. This didn’t look like a permissions problem.
Is SELinux to blame here?
SELinux has eaten enough DBA time to earn a top spot on any such checklist – but here it was permissive.
$ getenforce
Permissive
That means it logs security violations without blocking them. And there were no AVC denials related to MySQL or the PEM files in /var/log/audit/audit.log or dmesg!
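For reference, a quick way to double-check is to query the audit log directly; this is just an illustrative sketch using ausearch (part of the audit package), not a command from the original investigation:
# Look for SELinux AVC denials mentioning mysqld in today's audit log
$ ausearch -m avc -ts today | grep -i mysqld
# Fallback if ausearch is not installed: grep the raw audit log
$ grep -i 'avc.*denied.*mysql' /var/log/audit/audit.log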
File corruption
Did the files get corrupted or replaced before or during the MySQL restart?
$ openssl x509 -in /var/lib/mysql/server-cert.pem -noout -text
# Output looked perfectly valid when compared to the output from other nodes
$ openssl rsa -in /var/lib/mysql/server-key.pem -check
RSA key ok
The files were fine. They parsed cleanly. OpenSSL could read them. So why couldn’t MySQL?
More log review
We scanned /var/log/messages and journalctl for anything unusual around the time of the restart. No disk errors. No OOM kills. No kernel panics. Nothing that screamed “I am the Dhurandhar that destroyed your node.” At this point, most of the usual suspects were guilt-free, staring at us, asking, “Who did it?”
The Clue
It is good to communicate with stakeholders, and we did. We asked the client, “Was there any recent change on your side?” and they uttered the golden words: “Last week the crypto-policy was updated on all of the DB servers to comply with PCI.”
PCI > crypto-policy – let’s go and check it!
$ update-crypto-policies --show
FUTURE
The system was running RHEL’s FUTURE cryptographic policy.
For those unfamiliar (including me at the time), Red Hat Enterprise Linux (and its derivatives, such as Rocky, Alma, and Oracle Linux) ships with a system-wide cryptographic policy framework. It’s a centralized way to enforce minimum standards for TLS versions, cipher suites, key lengths, and signature algorithms across all applications on the system that use the system crypto libraries, OpenSSL included. And yes, anything that links against those libraries… like MySQL.
Here’s a table that shows information about the crypto-policy levels:
Policy  | RSA Minimum | TLS Minimum | SHA-1 Signatures | Use Case
LEGACY  | 1024-bit    | TLS 1.0     | Allowed          | Old systems compatibility
DEFAULT | 2048-bit    | TLS 1.2     | Allowed          | Standard operations
FUTURE  | 3072-bit    | TLS 1.2     | Blocked          | Forward-looking hardening
FIPS    | 2048-bit    | TLS 1.2     | Blocked          | FIPS 140 compliance
So FUTURE demands at least a 3072-bit RSA key; anything smaller is rejected. What did we have?
$ openssl rsa -in server-key.pem -text -noout | head -1
RSA Private Key: (2048 bit, 2 primes)
2048 bits! C’mon! And now I recall the error log again… The hint was there:
error:140AB18F:SSL routines:SSL_CTX_use_certificate:ee key too small
Now we have our story straight.
On restart, our PXC cluster node started a new process linked against OpenSSL, which now enforced the FUTURE policy. OpenSSL looked at the 2048-bit RSA certificate and said: “Nope. Too small.”
The Fix
The quick fix here would be to set the policy back to DEFAULT:
$ sudo update-crypto-policies --set DEFAULT
This allows the existing 2048-bit certificates to be accepted, and the node will join the cluster readily.
Alternatively, to stay compliant and adhere to the stricter security policy, the proper fix is to:
Generate new certificates that satisfy the policy (see the sketch after this list)
Deploy the keys/certs to all Galera nodes
Perform a rolling restart
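For illustration, here is a minimal sketch of generating FUTURE-compliant 3072-bit certificates with openssl. The file names, subjects, and validity below are placeholder assumptions, so adapt them to your environment (and reuse your existing CA if you have one):
# Hypothetical self-signed CA with a 3072-bit key
$ openssl genrsa -out ca-key.pem 3072
$ openssl req -new -x509 -nodes -days 365 -key ca-key.pem -out ca.pem -subj "/CN=Example_CA"
# 3072-bit server key and certificate signed by that CA
$ openssl genrsa -out server-key.pem 3072
$ openssl req -new -key server-key.pem -out server-req.pem -subj "/CN=pxc-node1"
$ openssl x509 -req -in server-req.pem -days 365 -CA ca.pem -CAkey ca-key.pem -set_serial 01 -out server-cert.pem
# Confirm the key size before copying the files to all nodes
$ openssl rsa -in server-key.pem -text -noout | head -1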
Conclusion
This was a classic case of a problem hiding at the boundary between two domains: database administration and operating system security. The DBA saw valid certificates and a correct MySQL configuration. The sysadmin saw a properly hardened system with a strong crypto policy. Neither was wrong. But the intersection of their two correct configurations produced a failure.
This incident reinforces the importance of cross-domain awareness, where resolving database issues sometimes requires understanding and challenging system-level security decisions.
-
Building Query Analysis and Insights Dashboard in PMM
Percona Monitoring and Management (PMM) is a great open source database monitoring, observability, and management tool. Query Analytics is one of its most prominent features, one that DBAs use actively to trace incidents and identify query performance issues.
We all know and love the Query Analytics (QAN) dashboard… It’s the first place we look when an incident alert fires or when a developer asks, “Why is the app slow?” or “What was going on during the midnight production outage?”
But sometimes the standard dashboards just don’t tell the whole story, or aren’t clear enough. QAN is great, but shouldn’t we have more? If you have PMM running, you already have a Ferrari engine under the hood: ClickHouse. Most of us just drive it in first gear using the default UI.
In this post, we are going to take the training wheels off. We will bypass the standard QAN interface and talk directly to the ClickHouse backend to build highly specialised dashboards. We aren’t just looking for “slow” queries anymore; we are hunting for inefficiency, volatility, and the “silent killers” that standard monitoring often misses.
This is a hands-on blog, so grab your coffee and let’s turn that PMM instance into a deep-dive forensic tool.
Create a New Dashboard in PMM
Connect to PMM > Dashboards > Create New Dashboard
Save it with name “Slow Query Analysis” and Description “Slow Query Analysis from PMM’s QAN database (clickhouse)”
Click on add visualisation & select datasource “ClickHouse”
Choose SQL Builder
Paste the following query to get the top 10 slow queries from the database
SELECT fingerprint
FROM pmm.metrics
WHERE service_type = 'mysql'
AND $__timeFilter(period_start)
GROUP BY fingerprint
ORDER BY sum(m_query_time_sum) DESC
LIMIT 10
Choose “Table View” on the top to view the list
When you click “Run Query” you will see the top 10 slow queries in the chosen time period.
Let’s save the dashboard after updating the Panel Options as follows:
7.1 Change Panel Name and Description to “Slow Query Analysis”
7.2 Set Legend Placement to “Bottom” and Values to “min”, “max”, “mean”
7.3 Change the axis Scale to “Logarithmic”. A logarithmic scale compresses large ranges of data, making it ideal for visualizing metrics with vastly different magnitudes. This gives a good visualisation for queries with very different execution times.
7.4 Save Dashboard
Alright, we’re at our first step. This first result set shows the top 10 slow query fingerprints across all MySQL services tracked by PMM for the selected time range. It provides a quick, environment-wide view of the most expensive query patterns. But it does not yet paint a clear picture. Let’s refine the dashboard to focus on specific queries and servers, and observe their performance over time.
Now, let’s introduce a variable to filter the data.
Click on Settings on the dashboard’s home page
8.1 Choose the “Variables” tab and click on “Add Variable”
8.2 Add the variable configuration (a sample query for the variable is sketched below) and Save Dashboard
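The exact variable definition isn’t shown here, but as an example, a Query-type variable backed by the ClickHouse data source could be populated with something along these lines (assuming the same pmm.metrics columns used in the queries below):
-- Sketch: populate the "queryid" variable with distinct query IDs from QAN data
SELECT DISTINCT queryid
FROM pmm.metrics
WHERE service_type = 'mysql'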
Go Back to Dashboard and Edit “Slow Query Analysis” Panel.
Now you should see the Query ID filter on the top.
Change the query to the following
SELECT
period_start AS time,
left(fingerprint, 80) AS query_text,
sum(m_query_time_sum/m_query_time_cnt) AS query_time
FROM
pmm.metrics
WHERE
service_type = 'mysql'
AND $__timeFilter(period_start)
AND fingerprint IN (
SELECT fingerprint
FROM pmm.metrics
WHERE service_type = 'mysql'
AND $__timeFilter(period_start)
AND ($queryid = '' OR queryid = $queryid)
GROUP BY fingerprint
ORDER BY sum(m_query_time_sum) DESC
LIMIT 10
)
GROUP BY
time,
fingerprint
ORDER BY
time,
query_time DESC
Basically, the query fetches the start time, query text, and average query time for the top 10 queries in the selected time frame.
There is also a filter on the “queryid” variable, which you can use to narrow the results down to a specific query ID.
Choose “Time Series” as “Query Type”
Adjust Panel Options
11.1 Choose “Standard options” > “Unit” as “Time / Seconds (s)” from the drop-down.
11.2 Choose “Standard options” > “Display name” as “${__field.labels.query_text}”
11.3 Click on “Save Dashboard”
Your dashboard should be ready
Now, by default this dashboard plots the top 10 queries. If you have a query fingerprint handy, you can filter the search down to that specific query. That said, it still plots queries across all the monitored instances. Let’s move on and add the service_name filter.
Adding service_name filter
Add Variable
Create new variable named “service_name”
Use variable type “Query”
Use Data Source as “ClickHouse”
Query:
select distinct service_name from pmm.metrics where service_type = 'mysql';
Unselect all checkboxes in “Selection options”
Save Dashboard
Update Query
SELECT
period_start AS time,
left(fingerprint, 80) AS query_text,
sum(m_query_time_sum/m_query_time_cnt) AS query_time
FROM
pmm.metrics
WHERE
(service_name = '' OR service_name = '$service_name')
AND service_type = 'mysql'
AND $__timeFilter(period_start)
AND fingerprint IN (
SELECT fingerprint
FROM pmm.metrics
WHERE service_type = 'mysql'
AND $__timeFilter(period_start)
AND (service_name = '' OR service_name = '$service_name')
GROUP BY fingerprint
ORDER BY sum(m_query_time_sum) DESC
LIMIT 10
)
GROUP BY
time,
left(fingerprint, 80)
ORDER BY
time,
query_time DESC
I know many of you are naturally curious and enjoy experimenting with PMM and Grafana… so you’ve probably already started thinking about how far this can be taken. Feel free to share your ideas or custom dashboards in the comments.
Sample Dashboards:
The Query Analysis and Insights Dashboard
Okay, for those looking for quick results, I’ve prepared the complete Query Analysis and Insights Dashboard for you to import and use instantly.
By importing the JSON file, you’ll get the full working dashboard with all panels preconfigured, including:
Slow Query Analysis
Latency Distribution Heatmap
Query Volatility (P99 vs Average)
Lock Wait Ratio Over Time (Top Contended Queries)
Temporary Table Usage (Disk & Memory)
Query Efficiency (Rows Examined vs Rows Sent)
Error Rate vs Throughput
Workload Distribution by User
Query Volume by Client Host
Execution Time vs Lock Wait Time
This allows you to instantly explore PMM Query Analytics data, adjust time ranges and filters, and correlate query performance, contention, and workload behavior without recreating the dashboard from scratch.
Dashboard JSON available here:
Grafana: https://grafana.com/grafana/dashboards/24896
GitHub: https://github.com/Percona-Lab/pmm-dashboards/query_analysis_insights.json
Give it a go and let me know if you have suggestions or requests. Also consider sharing if you create something interesting.
Cheers.
-
Run an ALTER TABLE for a huge table in Aurora
Recently, we received an alert for one of our Managed Services customers indicating that the auto_increment value for the table was 80% of its maximum capacity. The column was INT UNSIGNED, which has a limit of 4,294,967,295.
At 80%, we have enough time to change it to BIGINT… Right? Let’s see.
So we used pt-online-schema-change to perform the alter.
It started running at a good pace but slowed over time.
Why?
Well, let’s look at the definition of the table:
mysql> show create table myschema.mytable\G
*************************** 1. row ***************************
Table: mytable
Create Table: CREATE TABLE `mytable` (
`id` int unsigned NOT NULL AUTO_INCREMENT,
`long_column` varchar(1000) NOT NULL,
`state` tinyint unsigned NOT NULL,
`created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`short_column` varchar(30) NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_long_column` (`long_column`,`state`),
KEY `idx_short_column` (`short_column`,`state`),
KEY `idx_short_col2` (`short_column`)
) ENGINE=InnoDB AUTO_INCREMENT=4009973818 DEFAULT CHARSET=utf8mb3
NOTE1: The index on long_column is for a varchar column with a length of 1000; it may not be required, and an index prefix may be more helpful here.
NOTE2: The index idx_short_col2 is duplicated, as it is covered by the index idx_short_column.
Those changes require testing and are out of scope for this emergency, but they are worth mentioning.
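Just to illustrate what those out-of-scope changes might look like, here is a hypothetical sketch; the 191-character prefix is an arbitrary assumption and would need to be validated against the real data’s selectivity before use:
-- Hypothetical: replace the full varchar(1000) index with a prefix index
ALTER TABLE myschema.mytable
  DROP INDEX idx_long_column,
  ADD INDEX idx_long_column (long_column(191), state);
-- The redundant idx_short_col2 could simply be dropped
ALTER TABLE myschema.mytable DROP INDEX idx_short_col2;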
Table size:
+---------------+------------+------------+---------+----------+---------+----------+--------+
| TABLE_SCHEMA | TABLE_NAME | TABLE_ROWS | DATA_GB | INDEX_GB | FREE_GB | TOTAL_GB | ENGINE |
+---------------+------------+------------+---------+----------+---------+----------+--------+
| myschema | mytable | 3906921584 | 1118 | 1790 | 0 | 2907 | InnoDB |
+---------------+------------+------------+---------+----------+---------+----------+--------+
Look at the indexes being way bigger than the data.
mysql> SELECT database_name, table_name, index_name, ROUND(stat_value * @@innodb_page_size / 1024 / 1024, 2) AS size_in_mb FROM mysql.innodb_index_stats WHERE stat_name = 'size' AND index_name != 'PRIMARY' AND database_name='myschema' AND table_name='mytable' ORDER BY size_in_mb DESC;
+---------------+------------+-------------------+------------+
| database_name | table_name | index_name | size_in_mb |
+---------------+------------+-------------------+------------+
| myschema | mytable | idx_long_column | 1583538.95 |
| myschema | mytable | idx_short_column | 126432.98 |
| myschema | mytable | idx_short_col2 | 122699.95 |
+---------------+------------+-------------------+------------+
3 rows in set (0.01 sec)
While pt-online-schema-change runs, it copies the data to a new table. As the data is being copied, the secondary indexes must be maintained.
NOTE the huge index on the varchar(1000) column, which is ~1.5 TB in size. Maintaining such an index becomes increasingly expensive as the data size grows.
The pt-online-schema-change had been running for ~8 days, and its latest estimate was 53 more days, which we couldn’t afford, since the maximum auto_increment value would be exceeded in ~15 days.
Copying `myschema`.`mytable`: 12% 53+16:48:01 remain
Copying `myschema`.`mytable`: 12% 53+16:48:30 remain
Copying `myschema`.`mytable`: 12% 53+16:48:59 remain
Copying `myschema`.`mytable`: 12% 53+16:49:26 remain
Copying `myschema`.`mytable`: 12% 53+16:49:53 remain
Copying `myschema`.`mytable`: 12% 53+16:50:19 remain
Copying `myschema`.`mytable`: 12% 53+16:50:49 remain
Copying `myschema`.`mytable`: 12% 53+16:51:17 remain
Copying `myschema`.`mytable`: 12% 53+16:51:45 remain
So what do we do now?
We suggested canceling the pt-online-schema-change and creating an Aurora blue-green deployment.
Then perform the direct ALTER on the green cluster. And finally, when ready, do the failover.
Sounds good, doesn’t it?
First, we need to ensure that the new (green) cluster has the replica_type_conversions parameter in its cluster parameter group set to “ALL_NON_LOSSY,ALL_UNSIGNED” so that it can replicate from an INT UNSIGNED column into a BIGINT UNSIGNED column.
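As an illustration (the parameter group name below is hypothetical), that parameter can be set with the AWS CLI on the parameter group attached to the green cluster:
# Hypothetical parameter group name; use the one attached to the green cluster
$ aws rds modify-db-cluster-parameter-group \
    --db-cluster-parameter-group-name my-aurora-green-params \
    --parameters '[{"ParameterName":"replica_type_conversions","ParameterValue":"ALL_NON_LOSSY,ALL_UNSIGNED","ApplyMethod":"immediate"}]'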
So we tried that. It started out fast, at ~0.036% per minute, which works out to about 2 days. Great!
We left the process running over the weekend, but we noticed it started to slow down again… By Monday, it was advancing at ~0.01% every 5 mins, which gives an ETA of 34 days.
Why?
Again: with the direct ALTER, MySQL copies the data to a temporary table, and the bigger the data set, the more expensive it is to maintain the secondary indexes during the copy.
Again, unacceptable.
Note that with the above 2 approaches, we lost ~12 days of precious time, and the deadline for auto_increment exhaustion was approaching.
Then we thought: What if we drop the secondary indexes, do the alter, and then add the indexes back?
In theory, it should be faster, as:
Dropping the indexes is a metadata-only operation with ONLINE DDL.
Altering the column data type from INT to BIGINT is not an ONLINE operation, but since there are no secondary indexes to update while rows are copied to the new temporary table, the slowdown is avoided.
Adding back the secondary indexes is an ONLINE DDL operation:
“Online DDL support for adding secondary indexes means that you can generally speed the overall process of creating and loading a table and associated indexes by creating the table without secondary indexes, then adding secondary indexes after the data is loaded.”
https://dev.mysql.com/doc/refman/8.4/en/innodb-online-ddl-operations.html
So let’s do this:
The deletion of the indexes was really quick, as expected (metadata-only operation):
mysql> ALTER TABLE myschema.mytable DROP INDEX idx_long_column, DROP INDEX idx_short_column, DROP INDEX idx_short_col2;
Query OK, 0 rows affected (49.40 sec)
Records: 0 Duplicates: 0 Warnings: 0
Then the change of the data type:
mysql> ALTER TABLE myschema.mytable CHANGE COLUMN id id bigint unsigned NOT NULL AUTO_INCREMENT;
Query OK, 4058047205 rows affected (13 hours 9 min 10.62 sec)
Records: 4058047205 Duplicates: 0 Warnings: 0
Looks very promising!!!
The final step: add back the indexes.
mysql> ALTER TABLE myschema.mytable ADD INDEX `idx_long_column` (`long_column`,`state`), ADD INDEX `idx_short_column` (`short_column`,`state`), ADD INDEX `short_col2` (`short_column`);
ERROR 1878 (HY000): Temporary file write failure.
Why?
Well, the INPLACE operation uses the tmp dir to write sort files. In Aurora, there are certain limits for the temporary space based on the instance type.
In a regular MySQL instance, we could point innodb_tmpdir to another location with enough disk space; in Aurora, however, the parameter is not modifiable. Being able to change it would have made the whole process much easier.
Even with a larger instance type, it’s hard to create the ~1.5 TB index without breaking open the piggy bank.
Last resort: add the indexes back with the COPY algorithm:
mysql> ALTER TABLE myschema.mytable ALGORITHM=COPY, ADD INDEX `idx_long_column` (`long_column`,`state`), ADD INDEX `idx_short_column` (`short_column`,`state`), ADD INDEX `idx_short_col2` (`short_column`);
Query OK, 4147498819 rows affected (6 days 1 hour 55 min 57.00 sec)
Records: 4147498819 Duplicates: 0 Warnings: 0
Why does this work? Because ALTER TABLE with the COPY algorithm builds the new table in the datadir and copies the rows there, so it is not subject to the temporary-directory limitation mentioned above.
We made it with about four days to spare before auto_increment exhaustion, preventing downtime.
In retrospect, we could have used the following approach to avoid the blue/green deployment altogether (a command sketch follows the list):
Perform a pt-online-schema-change on the main table, dropping the indexes and changing the column type to BIGINT (with --no-swap-tables --no-drop-old-table --no-drop-new-table --no-drop-triggers).
Add the secondary indexes using the direct alter with the COPY algorithm in the _new table.
Once the alter finishes, swap the tables and drop the triggers.
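A rough sketch of what that first step might look like; the connection details and chunking options are omitted, and the --alter payload is an assumption based on this table’s definition:
# Hypothetical: drop the secondary indexes and widen the id column in one pass,
# keeping the _new table, the old table, and the triggers for the follow-up steps
$ pt-online-schema-change \
    --alter "DROP INDEX idx_long_column, DROP INDEX idx_short_column, DROP INDEX idx_short_col2, MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT" \
    --no-swap-tables --no-drop-old-table --no-drop-new-table --no-drop-triggers \
    D=myschema,t=mytable --execute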
Conclusion:
What initially looked like an easy task with pt-online-schema-change ended up being more complex.
You need to check the table definition, the index sizes, the Aurora limits, and how the different ALTER algorithms work in order to decide on the best way to proceed, especially in situations like this, where auto_increment exhaustion is looming and there is a risk of downtime if the change is not completed in time.
And of course, monitor auto_increment exhaustion for your tables, and use a reasonable threshold that gives you enough time to plan and change the table definition. You can use Percona Monitoring and Management for this, specifically on the MySQL > MySQL Table Details dashboard.
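If you want an ad hoc check as well, a query along these lines (a sketch; it assumes INT UNSIGNED columns, so extend the maximum value per data type as needed) reports how close each AUTO_INCREMENT counter is to the INT UNSIGNED ceiling:
-- Sketch: percentage of the INT UNSIGNED range consumed by each AUTO_INCREMENT counter
SELECT t.table_schema, t.table_name, c.column_type, t.auto_increment,
       ROUND(t.auto_increment / 4294967295 * 100, 2) AS pct_used
FROM information_schema.tables t
JOIN information_schema.columns c
  ON  c.table_schema = t.table_schema
  AND c.table_name   = t.table_name
  AND c.extra LIKE '%auto_increment%'
WHERE t.auto_increment IS NOT NULL
ORDER BY pct_used DESC;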
-
MySQL Tuning on OCI HeatWave: What Still Matters, What OCI Manages, and What You Should Actually Tune
Once you move from self-managed MySQL to a MySQL DB System with HeatWave on OCI, the tuning story changes in an important way. On a self-managed server, you worry about two layers: MySQL and the operating system. On OCI MySQL DB Systems with HeatWave, Oracle runs your MySQL instance as a fully-managed service and explicitly […]
-
No More Silent Foreign Key Cascades: MySQL 9.7 Lets Child Triggers Speak Up
MySQL 9.7 introduces a long-requested improvement: Child table triggers are executed during SQL-layer foreign key cascades. Historically, cascades executed inside InnoDB did not invoke child table triggers, which created gaps in auditing, derived data maintenance, and observability. When a parent row change triggered cascading changes in child tables, those child table triggers were not executed. This […]