-
More than Flushing (also Caching) for innodb_flush_method, and Missing Release Candidate
Something changed in MySQL 8.4 related to caching, and it is easy to miss, so it deserves a post. And a subject adjacent to this is the missing Release Candidate for MySQL 8.4 LTS, with my hope that the next LTS will have a Release Candidate, so I also cover this topic below.
(if you are not interested in Caching and Flushing, you can jump directly to the section about Release Candidate)
-
MySQL 8.0 JSON Functions: Practical Examples and Indexing
This post covers a hands-on walkthrough of MySQL 8.0's JSON functions. JSON support has been in MySQL since 5.7, but 8.0 added a meaningful set of improvements — better indexing strategies, new functions, and multi-valued indexes — that make working with JSON data considerably more practical. The following documents several of the most commonly needed patterns, including EXPLAIN output and performance observations worth knowing about.
This isn't a "JSON vs. relational" debate post. If you're storing JSON in MySQL, you probably already have your reasons. The goal here is to make sure you're using the available tooling effectively.
Environment
mysql> SELECT @@version, @@version_comment\G
*************************** 1. row ***************************
@@version: 8.0.36
@@version_comment: MySQL Community Server - GPL
Testing was done on a VM with 8GB RAM and innodb_buffer_pool_size set to 4G. One housekeeping note worth mentioning: query_cache_type is irrelevant in 8.0 since the query cache was removed entirely. If you migrated a 5.7 instance and still have that variable in your my.cnf, remove it — MySQL 8.0 will throw a startup error.
Setting Up a Test Table
The test table simulates a fairly common pattern — an application storing user profile data and event metadata as JSON blobs:
CREATE TABLE user_events (
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
user_id INT UNSIGNED NOT NULL,
event_data JSON NOT NULL,
created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (id),
INDEX idx_user (user_id)
) ENGINE=InnoDB;
INSERT INTO user_events (user_id, event_data) VALUES
(1, '{"action":"login","ip":"192.168.1.10","tags":["mobile","vpn"],"score":88}'),
(1, '{"action":"purchase","ip":"192.168.1.10","tags":["desktop"],"score":72,"amount":49.99}'),
(2, '{"action":"login","ip":"10.0.0.5","tags":["mobile"],"score":91}'),
(3, '{"action":"logout","ip":"10.0.0.9","tags":["desktop","vpn"],"score":65}'),
(2, '{"action":"purchase","ip":"10.0.0.5","tags":["mobile"],"score":84,"amount":129.00}');
Basic Extraction: JSON_VALUE vs. JSON_EXTRACT
JSON_VALUE() was introduced in MySQL 8.0.21 and is the cleaner way to extract scalar values with built-in type casting. Before that, you were using JSON_EXTRACT() (or the -> shorthand) and casting manually, which works but adds noise to your queries.
-- Pre-8.0.21 approach
SELECT user_id,
JSON_EXTRACT(event_data, '$.action') AS action,
CAST(JSON_EXTRACT(event_data, '$.score') AS UNSIGNED) AS score
FROM user_events;
-- Cleaner 8.0.21+ approach
SELECT user_id,
JSON_VALUE(event_data, '$.action') AS action,
JSON_VALUE(event_data, '$.score' RETURNING UNSIGNED) AS score
FROM user_events;
Output from the second query:
+---------+----------+-------+
| user_id | action | score |
+---------+----------+-------+
| 1 | login | 88 |
| 1 | purchase | 72 |
| 2 | login | 91 |
| 3 | logout | 65 |
| 2 | purchase | 84 |
+---------+----------+-------+
5 rows in set (0.00 sec)
The RETURNING clause is genuinely useful. It eliminates the awkward double-cast pattern and makes intent clearer when reading query code later.
Multi-Valued Indexes: The Real Game Changer
This is where 8.0 actually moved the needle for JSON workloads. Multi-valued indexes, available since MySQL 8.0.17, let you index array elements inside a JSON column directly. Here's what that looks like in practice:
ALTER TABLE user_events
ADD INDEX idx_tags ((CAST(event_data->'$.tags' AS CHAR(64) ARRAY)));
Here is what EXPLAIN shows before and after on a query filtering by tag value:
-- Without the multi-valued index:
EXPLAIN SELECT * FROM user_events
WHERE JSON_CONTAINS(event_data->'$.tags', '"vpn"')\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: user_events
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 5
filtered: 100.00
Extra: Using where
-- After adding the multi-valued index:
EXPLAIN SELECT * FROM user_events
WHERE JSON_CONTAINS(event_data->'$.tags', '"vpn"')\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: user_events
partitions: NULL
type: range
possible_keys: idx_tags
key: idx_tags
key_len: 67
ref: NULL
rows: 2
filtered: 100.00
Extra: Using where
Full table scan down to a range scan. On 5 rows this is trivial, but on a table with millions of rows and frequent tag-based filtering, that difference is significant. The improvement scales directly with table size and query frequency.
One important gotcha: MEMBER OF() and JSON_OVERLAPS() also benefit from multi-valued indexes, but JSON_SEARCH() does not. This matters when choosing your query pattern at design time:
-- This WILL use the multi-valued index:
SELECT * FROM user_events
WHERE 'vpn' MEMBER OF (event_data->'$.tags');
-- This will NOT use it:
SELECT * FROM user_events
WHERE JSON_SEARCH(event_data->'$.tags', 'one', 'vpn') IS NOT NULL;
Aggregating and Transforming JSON
A few aggregation functions worth knowing well:
-- Build a JSON array of actions per user
SELECT user_id,
JSON_ARRAYAGG(JSON_VALUE(event_data, '$.action')) AS actions
FROM user_events
GROUP BY user_id;
+---------+----------------------+
| user_id | actions |
+---------+----------------------+
| 1 | ["login","purchase"] |
| 2 | ["login","purchase"] |
| 3 | ["logout"] |
+---------+----------------------+
3 rows in set (0.01 sec)
-- Summarize into a JSON object keyed by action
SELECT user_id,
JSON_OBJECTAGG(
JSON_VALUE(event_data, '$.action'),
JSON_VALUE(event_data, '$.score' RETURNING UNSIGNED)
) AS score_by_action
FROM user_events
GROUP BY user_id;
+---------+--------------------------------+
| user_id | score_by_action |
+---------+--------------------------------+
| 1 | {"login": 88, "purchase": 72} |
| 2 | {"login": 91, "purchase": 84} |
| 3 | {"logout": 65} |
+---------+--------------------------------+
3 rows in set (0.00 sec)
JSON_OBJECTAGG() will throw an error if there are duplicate keys within a group. This is worth knowing before you encounter it in a production ETL pipeline. In that case, you'll need to deduplicate upstream or handle it in application logic before the data reaches this aggregation step.
Checking SHOW STATUS After JSON-Heavy Queries
When evaluating query patterns, checking handler metrics is a useful habit:
FLUSH STATUS;
SELECT * FROM user_events
WHERE JSON_VALUE(event_data, '$.score' RETURNING UNSIGNED) > 80;
SHOW STATUS LIKE 'Handler_read%';
+----------------------------+-------+
| Variable_name | Value |
+----------------------------+-------+
| Handler_read_first | 1 |
| Handler_read_key | 0 |
| Handler_read_last | 0 |
| Handler_read_next | 4 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 0 |
| Handler_read_rnd_next | 6 |
+----------------------------+-------+
7 rows in set (0.00 sec)
The Handler_read_rnd_next value confirms a full scan — no surprise since there's no functional index on the score value. For score-based filtering at scale, a generated column with an index is the right answer:
ALTER TABLE user_events
ADD COLUMN score_val TINYINT UNSIGNED
GENERATED ALWAYS AS (JSON_VALUE(event_data, '$.score' RETURNING UNSIGNED)) VIRTUAL,
ADD INDEX idx_score (score_val);
After adding that, the same query drops to a proper index range scan. Generated columns on JSON fields are available in both MySQL 8.0 and Percona Server 8.0, and they remain the most reliable path for scalar JSON field filtering at any meaningful scale.
If you're running Percona Server, pt-query-digest from the Percona Toolkit is still the most practical way to identify which JSON-heavy queries are actually causing pain in production before you start adding indexes speculatively.
Practical Observations
Multi-valued indexes (8.0.17+) are a long overdue improvement and work well when your query patterns align with JSON_CONTAINS() or MEMBER OF()
JSON_VALUE() with RETURNING (8.0.21+) is cleaner than the old cast-after-extract pattern and worth adopting consistently
Generated columns plus indexes remain the most reliable path for scalar JSON field filtering at scale
Watch for JSON_OBJECTAGG() duplicate key errors in grouped data — it surfaces as a hard error in ETL pipelines and can be easy to miss in testing if your sample data happens to be clean
Always verify index usage with EXPLAIN — the optimizer doesn't always pick up multi-valued indexes in complex WHERE clauses, and it's worth confirming rather than assuming
Summary
MySQL 8.0's JSON improvements are genuinely useful, particularly multi-valued indexes and JSON_VALUE() with type casting. They don't replace good schema design, but for cases where JSON storage is appropriate or inherited, you now have real tools to work with rather than just hoping the optimizer figures it out. The generated column pattern in particular is worth evaluating early if you know certain JSON fields will be used in WHERE clauses regularly.
Useful references:
MySQL 8.0 JSON Function Reference
Multi-Valued Indexes Documentation
JSON_VALUE() Function Reference
Percona Toolkit
-
An Open Letter to Oracle: Let’s Talk About MySQL’s Future
What Happened at the Summits We just wrapped up two MySQL Community Summits – one in San Francisco in January, and one in Brussels right before FOSDEM. The energy in the rooms: a lot of people who care deeply about MySQL got together, exchanged ideas, and left with a clear sense that we need to […]
-
New Era of MySQL Community Engagement
As we mark the milestone of MySQL’s 30-year anniversary, the celebrations around the globe have been a testament to the widespread impact and popularity of the Dolphin. This week’s annual preFOSDEM MySQL Belgian Days in Brussels are not only an opportunity to celebrate, but also a chance for Oracle to share some key updates on […]
-
Making MySQL AI-Ready: How MyVector and ProxySQL Work Together
As AI workloads become standard in modern applications, engineering teams face a familiar dilemma: MySQL is already the system of record, but vector search typically requires bolting on a separate database. That means two security models, two observability stacks, and inevitable data consistency headaches.
A concept presented at Pre-FOSDEM 2026 by René Cannaò (ProxySQL founder) and Alkin Tezuysal (Altinity) proposes a cleaner path: keep everything in MySQL, but add vector capabilities through a plugin called MyVector, with ProxySQL serving as the unified control plane.
The Core Idea
MyVector is a MySQL plugin (targeting 8.0/8.4) that adds vector storage, distance functions, and HNSW indexing directly inside MySQL — without requiring MySQL’s native VECTOR type. HNSW (Hierarchical Navigable Small World) enables approximate nearest neighbor search at O(log N) speed, making it practical for millions of vectors.
ProxySQL sits in front and handles the hard operational problems: classifying queries as OLTP vs. vector workloads, routing them to separate hostgroups, enforcing concurrency limits, and triggering circuit breakers when vector bursts threaten to impact production latency.
Why This Matters
The primary use cases are RAG pipelines over documentation and knowledge bases, incident/runbook search, and code semantic search — workloads where your source of truth is already in MySQL and you don’t want to fragment your data stack.
The architecture keeps OLTP writes going to a primary hostgroup while vector similarity queries route to dedicated replicas with the MyVector plugin loaded. If vector traffic spikes and replica lag climbs, ProxySQL can automatically shed vector load before OLTP P99 is affected.
The RAG Ingest Pipeline
The presentation also covers a CLI tool (rag_ingest) that handles the full ingestion pipeline: fetching rows from backend MySQL sources incrementally using watermarks, chunking and transforming text, generating embeddings via any OpenAI-compatible API in configurable batches, and storing results back as vectors. This gives you a complete, observable pipeline without external orchestration infrastructure.
Migration Path
The proposed adoption path is pragmatic: start with ProxySQL for routing and observability alone (no vectors yet), then add vector columns with simple distance UDFs, then graduate to HNSW indexing with dedicated vector replicas, and finally operationalize freshness, backfills, and tenant-aware QoS.
The Honest Tradeoff
This is explicitly a WIP/concept — the authors are looking for design partner feedback to shape the roadmap. The approach is conservative by design: InnoDB stability comes first, vector capabilities are additive, and failure modes are meant to be boring. When retrieval breaks, OLTP stays healthy.
For teams heavily invested in MySQL who want AI retrieval without fragmenting their data operations, this is a compelling direction worth watching.
Repos: github.com/askdba/myvector and github.com/ProxySQL/proxysql-vec.
Download the Making MySQL AI-Ready: How MyVector and ProxySQL Work Together Presentation given by
René Cannaò at preFOSDEM MySQL Belgian Days 2026 in Brussels.
The post Making MySQL AI-Ready: How MyVector and ProxySQL Work Together appeared first on ProxySQL.
|