Selecting a random database row

I recently wanted to select a random row from a database. It seemed like a pretty trivial thing to do, but as is often the case, it's not as easy as I first expected!

I googled for a solution to this, and there are plenty of sites that cover it. However, a lot of them give bad information, so I gave up looking after a while, and thought I’d tackle the problem myself.

My situation is complicated by the fact that the data I want to select random rows from might at some point be sharded, and I certainly want to be able to scale to at least a couple of million rows in a single table. Not only that, but I might also want to limit my random rows to those matching additional criteria, such as excluding rows marked as 'disabled'.

Solution 1

SELECT * FROM random_test ORDER BY RAND() LIMIT 1;

This solution does indeed do exactly what is needed – it will select one random row from the table. The problem, however, is that it does not scale. The SQL server builds a temporary table, assigns each row a random number, sorts by that number, and then selects the top row. For tables with only a few thousand rows this perhaps isn't so bad, but look how long it took my server on a 1 million row table:

mysql> SELECT * FROM random_test ORDER BY RAND() LIMIT 1;
+--------+-----------+
| id     | something |
+--------+-----------+
| 123239 | 123238    |
+--------+-----------+
1 row in set (1 min 9.57 sec)
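The temporary-table mechanism described above can be sketched in Python. This is a conceptual model of what the server does, not what MySQL literally executes, but it makes the cost obvious: O(n log n) work across every row just to return one of them.

```python
import random

# Conceptual model of ORDER BY RAND() LIMIT 1:
# tag every row with a random key, sort the whole table on that key,
# then take the first row.
rows = list(range(1_000_000))                      # stand-in for a 1M-row table
tagged = [(random.random(), row) for row in rows]  # the temporary table
tagged.sort()                                      # the filesort over all rows
picked = tagged[0][1]
```

Every row is touched and sorted regardless of the `LIMIT`, which is why the runtime grows with the table rather than with the number of rows requested.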

Solution 2

This solution results in a much quicker execution time, but has a flaw. Here’s the SQL:

SELECT * FROM random_test, (SELECT FLOOR(MAX(id) * RAND()) AS rId FROM random_test) AS t1 WHERE random_test.id = t1.rId;

This method works by generating a random number between 0 and the maximum id, and then selecting the matching row from the table. Selecting MAX(id) will not result in a full table scan as some people might expect; in fact, with MyISAM, you'll see that it has no overhead at all:

mysql> explain SELECT MAX(id) FROM random_test\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away
1 row in set (0.00 sec)

The problem here is that if your IDs are not consecutive, this query will not always return a result. You can change the query like so:

SELECT * FROM random_test, (SELECT FLOOR(MAX(id) * RAND()) AS rId FROM random_test) AS t1 WHERE random_test.id >= t1.rId LIMIT 1;

This works, but it has its own downside. If you're missing, say, rows 1000-2000, any random number generated within that range will result in row 2001 being selected. This causes certain rows within your dataset to be favored.
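The bias is easy to demonstrate with a quick simulation. The sketch below (assumed gap of rows 1000-2000, IDs otherwise sequential) models the `WHERE id >= rId LIMIT 1` behaviour: the first existing ID at or above the random number wins, so the row just past the gap absorbs every draw that lands inside it.

```python
import bisect
import random

# Table with IDs 1-999 and 2001-3000; rows 1000-2000 have been deleted.
ids = list(range(1, 1000)) + list(range(2001, 3001))
max_id = ids[-1]

counts = {}
random.seed(42)  # deterministic for repeatability
for _ in range(100_000):
    r = random.randint(1, max_id)                # FLOOR(MAX(id) * RAND())
    chosen = ids[bisect.bisect_left(ids, r)]     # WHERE id >= r LIMIT 1
    counts[chosen] = counts.get(chosen, 0) + 1

# Row 2001 collects every draw from the 1000-2000 gap, so it is selected
# roughly a thousand times more often than its neighbours.
print(counts[2001], counts.get(2002, 0))
```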

Solution 3

The final idea I came up with, and will probably stick with, is to use solution 1 to generate a batch of random row IDs, and then pop an ID out of this pool every time I need a random row. This allows one important optimization: we can change the query to use a covering index. MySQL (and, I imagine, any other SQL server) can avoid doing a full table scan if the data we are selecting is covered by an index. The query below selects only the id from the table, which of course, being the primary key, is covered.

SELECT id FROM random_test ORDER BY RAND() LIMIT 100;

So, let's have a look at how it performs:

mysql> SELECT id FROM random_test ORDER BY RAND() LIMIT 100;
...
100 rows in set (1.40 sec)

mysql> SELECT id FROM random_test ORDER BY RAND() LIMIT 1000;
1000 rows in set (1.45 sec)

mysql> EXPLAIN SELECT id FROM random_test ORDER BY RAND() LIMIT 1000\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: random_test
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 1000000
        Extra: Using index; Using temporary; Using filesort
1 row in set (0.00 sec)

This query is far too slow to use directly in production, but as you can see from the first two queries above, it scales very well. These random IDs can be stored in a separate table, or cached within your application. Your application can generate more when the pool falls below a certain threshold, it can be run from cron, or it can be handled however you see fit.
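The pool mechanism can be sketched as follows. This is a hypothetical illustration, not the article's exact implementation: it uses Python's `sqlite3` as a stand-in for MySQL (so `RANDOM()` plays the role of `RAND()`), and the pool size and refill threshold are assumed values.

```python
import sqlite3

# In-memory stand-in for the random_test table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE random_test (id INTEGER PRIMARY KEY, something TEXT)")
conn.executemany("INSERT INTO random_test (id) VALUES (?)",
                 [(i,) for i in range(1, 10_001)])

POOL_SIZE = 100   # IDs fetched per refill (assumed value)
REFILL_AT = 10    # refill when the pool drops to this size (assumed value)
pool = []

def refill_pool():
    # The covering-index query: only the primary key is read.
    rows = conn.execute(
        "SELECT id FROM random_test ORDER BY RANDOM() LIMIT ?", (POOL_SIZE,)
    ).fetchall()
    pool.extend(r[0] for r in rows)

def random_row():
    if len(pool) <= REFILL_AT:
        refill_pool()
    rid = pool.pop()
    # Cheap primary-key lookup to fetch the full row.
    return conn.execute(
        "SELECT * FROM random_test WHERE id = ?", (rid,)
    ).fetchone()

row = random_row()
```

The expensive `ORDER BY RANDOM()` query runs once per hundred requests rather than once per request; everything in between is a primary-key lookup.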

This method also allows you to run this query on each of your shards if you have sharded your data, meaning it's easily scalable. Various methods can be applied to the pool so that you don't have a single lock preventing concurrent access; how you do this is outside the scope of this article.

Conclusion

As always, it's a compromise: the third solution is more complicated and requires extra code to handle pre-generating the IDs, whereas the first option is easy but slow. If you're only dealing with a few hundred, maybe even a few thousand, rows, go with the first or second option. The important thing to take on board, though, is that there is no single solution. It is very important that you understand the pros and cons of each, so you can best fit them to your own situation, and remember to test against real data.

Update

Just in case it wasn't obvious to some, I thought I'd point out that the third method above can also be used without generating a pool, as follows:

SELECT * FROM random_test, (SELECT id FROM random_test ORDER BY RAND() LIMIT 1) as t1 WHERE random_test.id = t1.id;

You should be aware, though, that this query isn't that fast; for 1 million rows it takes 1.5 seconds to run for me. If you're only dealing with tens of thousands of rows, it may well be a good enough compromise for you.

There are other methods not shown here for selecting random rows. One alternative is to pre-generate a random column, order by it when selecting, then update the selected rows with a new random number; this seemed like it could be a bit too write-heavy for me, though. If you have any other suggestions, please do leave a comment!
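For completeness, the random-column alternative can be sketched like this. Again a hypothetical illustration using `sqlite3` in place of MySQL: each row carries an indexed random value, selection orders by it, and the chosen row gets a fresh value. The per-selection `UPDATE` is the write cost mentioned above.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE random_test (id INTEGER PRIMARY KEY, rand_val REAL)")
conn.execute("CREATE INDEX idx_rand ON random_test (rand_val)")
conn.executemany("INSERT INTO random_test VALUES (?, ?)",
                 [(i, random.random()) for i in range(1, 1001)])

def random_row():
    # Indexed read: smallest rand_val wins, no full-table sort needed.
    row = conn.execute(
        "SELECT id FROM random_test ORDER BY rand_val LIMIT 1"
    ).fetchone()
    # The write that makes this approach heavy: re-randomize the winner
    # so it isn't immediately picked again.
    conn.execute("UPDATE random_test SET rand_val = ? WHERE id = ?",
                 (random.random(), row[0]))
    return row[0]

rid = random_row()
```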

8 thoughts on “Selecting a random database row”

  1. Also worth remembering that the query cache in MySQL can't cache anything that uses RAND() (for obvious reasons).

    But with your third method, the second round queries could be cached, as my interpretation is that they wouldn’t need RAND() and would just retrieve on primary key.

  2. Why is
    SELECT * FROM random_test ORDER BY RAND() LIMIT 1;
    so much slower than
    SELECT id FROM random_test ORDER BY RAND() LIMIT 100;

  3. This is just a suggestion, but would it not be best to SELECT COUNT(*) on that table (or indeed on a separate table that is incremented by 1 for every new insert), then run PHP's rand() from 1 to that count, and select the row with the generated index?

    eg:

    $total = "SELECT total FROM table_insert_counter";

    ////// PHP RANDOM GENERATOR //////
    $seed = (float) microtime( ) * 100000000;

    srand($seed);

    $random_index = rand(1, $total);
    //////////////////////////////////////////////

    /////// MYSQL TABLE WITH AUTO-INCREMENT ///////
    $query = "SELECT row FROM main_table WHERE id = $random_index";
    ///////////////////////////////////////////////////////////////

    1. Your suggestion relies on sequential IDs and no holes in the table. It also
      means all inserts into that table would now need wrapping in a transaction to ensure consistency.
