I recently wanted to select a random row from a database. It seemed like a pretty trivial thing to do – but, as is often the case, it wasn’t as easy as I first expected!
I googled for a solution, and there are plenty of sites that cover it. However, a lot of them give bad information, so after a while I gave up looking and thought I’d tackle the problem myself.
My situation is complicated by the fact that the data I want to select random rows from might at some point be sharded, and I certainly wanted to be able to scale to at least a couple of million rows in a single table. Not only that, but I might also want to limit my random rows to those matching additional criteria, such as excluding rows marked as ‘disabled’.
The first, most obvious solution is:

SELECT * FROM random_test ORDER BY RAND() LIMIT 1;
This solution does indeed do exactly what is needed – it will select one random row from the table. The problem, however, is that it does not scale. This method makes the SQL server build a temporary table, assign each row a random number, sort by that number, and then select the top row. For tables with only a few thousand rows, this perhaps isn’t so bad – but look how long it took my server on a 1-million-row table:
mysql> SELECT * FROM random_test ORDER BY RAND() LIMIT 1;
+--------+-----------+
| id     | something |
+--------+-----------+
| 123239 | 123238    |
+--------+-----------+
1 row in set (1 min 9.57 sec)
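To make clear why this is slow, here is a rough sketch of the work the server is doing, written as plain Python over an in-memory list (my own illustration, not MySQL’s actual implementation):

```python
import random

def order_by_rand_limit_1(rows):
    """Mimic ORDER BY RAND() LIMIT 1: tag every row with a random
    number, sort the entire set by it, and take the first row."""
    tagged = [(random.random(), row) for row in rows]  # one RAND() per row
    tagged.sort(key=lambda pair: pair[0])              # full sort of all rows
    return tagged[0][1]

# A stand-in for the 1-million-row table (smaller here, same shape).
rows = [{"id": i, "something": i - 1} for i in range(1, 10001)]
print(order_by_rand_limit_1(rows))
```

The cost is dominated by generating n random numbers and sorting all n rows – the LIMIT 1 doesn’t save any work.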
The second solution results in a much quicker execution time, but has a flaw. Here’s the SQL:
SELECT * FROM random_test, (SELECT FLOOR(MAX(id) * RAND()) AS rId FROM random_test) AS t1 WHERE random_test.id = t1.rId;
This method works by generating a random number between 0 and the maximum id, and then selecting the matching row from the table. Selecting MAX(id) will not result in a full table scan as some people might expect – in fact, with MyISAM, you’ll see that it has no overhead at all:
mysql> EXPLAIN SELECT MAX(id) FROM random_test\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
1 row in set (0.00 sec)
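The same idea can also be done as two queries from application code – one cheap MAX(id) lookup, then one primary-key lookup. Here’s a sketch using Python’s sqlite3 as a stand-in for MySQL (the table setup is illustrative, and note it suffers exactly the same gap problem discussed below):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE random_test (id INTEGER PRIMARY KEY, something INTEGER)")
conn.executemany("INSERT INTO random_test VALUES (?, ?)",
                 [(i, i - 1) for i in range(1, 1001)])

# Step 1: cheap MAX(id) lookup (the equivalent of the subquery).
max_id = conn.execute("SELECT MAX(id) FROM random_test").fetchone()[0]

# Step 2: pick a random id and fetch it via the primary key.
rand_id = random.randint(1, max_id)
row = conn.execute("SELECT * FROM random_test WHERE id = ?", (rand_id,)).fetchone()
print(row)
```

Because the IDs in this toy table are consecutive, the lookup always finds a row – which is precisely the assumption that breaks in the next paragraph.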
The problem with this approach is that if your IDs are not consecutive, the query will not always return a result. You can change the query like so:
SELECT * FROM random_test, (SELECT FLOOR(MAX(id) * RAND()) AS rId FROM random_test) AS t1 WHERE random_test.id >= t1.rId LIMIT 1;
This works – however, it has its own downside. If you’re missing, for example’s sake, rows 1000–2000, any random number generated within that range will result in row 2001 being selected. This causes certain rows within your dataset to be favored.
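A quick simulation (my own sketch, with a made-up gap) shows just how strong this bias is:

```python
import bisect
import random
from collections import Counter

# Suppose IDs 1-999 and 2001-3000 exist: rows 1000-2000 were deleted.
ids = sorted(set(range(1, 3001)) - set(range(1000, 2001)))

hits = Counter()
for _ in range(100_000):
    r = random.randint(1, 3000)  # stands in for FLOOR(MAX(id) * RAND())
    # WHERE id >= r ... LIMIT 1 returns the first existing id at or above r.
    first_at_or_above = ids[bisect.bisect_left(ids, r)]
    hits[first_at_or_above] += 1

# Every draw landing in the 1000-2001 range maps to row 2001,
# so it absorbs roughly a third of all picks here.
print(hits[2001], hits[1])
```

Row 2001 is selected around a thousand times more often than a typical row – the bigger the gap, the worse the skew.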
The final idea I came up with, and will probably stick with, is to use solution 1 to generate a batch of random row IDs, and then pop an ID out of this pool every time I need a random row. This allows one important optimization – we can change the query to use a covering index. MySQL (and, I imagine, any other SQL server) can avoid a full table scan if the data being selected is covered by an index. The query below only selects the id from the table, which – being the primary key – is of course covered.
SELECT id FROM random_test ORDER BY RAND() LIMIT 100;
So, let’s have a look at how it performs:
mysql> SELECT id FROM random_test ORDER BY RAND() LIMIT 100;
...
100 rows in set (1.40 sec)

mysql> SELECT id FROM random_test ORDER BY RAND() LIMIT 1000;
...
1000 rows in set (1.45 sec)

mysql> EXPLAIN SELECT id FROM random_test ORDER BY RAND() LIMIT 1000\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: random_test
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 1000000
        Extra: Using index; Using temporary; Using filesort
1 row in set (0.00 sec)
This query is far too slow to use directly in production – but, as you can see from the first two queries above, it scales very well with the batch size. These random IDs can be stored in a separate table, or cached within your application. Your application can generate more when the pool falls below a certain threshold, or it can be cronned, or handled however you see fit.
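As a minimal sketch of what such a pool might look like in application code – the class name, batch size, and threshold are all my own invention, and sqlite3 (whose random function is RANDOM() rather than RAND()) stands in for MySQL:

```python
import sqlite3

class RandomIdPool:
    """Keeps a batch of random row IDs, refilling from the database
    whenever the pool runs low."""

    def __init__(self, conn, batch_size=100, low_water=10):
        self.conn = conn
        self.batch_size = batch_size
        self.low_water = low_water
        self.pool = []

    def _refill(self):
        # Covering-index query: only the primary key is selected.
        rows = self.conn.execute(
            "SELECT id FROM random_test ORDER BY RANDOM() LIMIT ?",
            (self.batch_size,)).fetchall()
        self.pool.extend(r[0] for r in rows)

    def next_id(self):
        if len(self.pool) <= self.low_water:
            self._refill()
        return self.pool.pop()

# Illustrative setup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE random_test (id INTEGER PRIMARY KEY, something INTEGER)")
conn.executemany("INSERT INTO random_test VALUES (?, ?)",
                 [(i, i - 1) for i in range(1, 501)])

pool = RandomIdPool(conn)
rand_id = pool.next_id()  # cheap: pops from the cache, refilling if needed
row = conn.execute("SELECT * FROM random_test WHERE id = ?", (rand_id,)).fetchone()
print(row)
```

In a real deployment the refill would more likely happen out-of-band (a cron job or background worker) rather than inline on a request, for the reasons given above.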
This method also allows you to run the query on each of your shards if you have sharded your data – meaning it’s easily scalable. Various techniques can be applied to the pool so that you don’t have a single lock preventing concurrent access; how you do this is outside the scope of this article.
As always, it’s a compromise – the third solution is more complicated and requires extra code to handle pre-generating the IDs, whereas the first option is easy, but slow. If you’re only dealing with a few hundred, or maybe even a few thousand, rows, go with the first or second option. The important thing to take on board, though, is that there is no single solution. It is very important that you understand the pros and cons of each, so you can best fit them to your own situation – and remember to test against real data.
Just in case it wasn’t obvious, I thought I’d point out that the third method above can also be used without generating a pool, as follows:
SELECT * FROM random_test, (SELECT id FROM random_test ORDER BY RAND() LIMIT 1) as t1 WHERE random_test.id = t1.id;
You should be aware, though, that this query isn’t that fast – for 1 million rows it takes 1.5 seconds to run for me. If you’re only dealing with tens of thousands of rows, it may well be a good enough compromise for you.
There are other methods for selecting random rows not shown here. One alternative is to pre-generate a random column, order by it when selecting, then update the selected rows with a new random number – this seemed like it could be a bit too write-heavy for me, though. If you have any other suggestions, please do leave a comment!