Improving the blog tags (categories) by switching SQL query function from LIKE to REGEXP

Written by - 0 comments

Published on - Listed in MariaDB MySQL Databases Personal PHP


Visitors of this blog surely have noticed the so-called "Blog Tags" on the right side of this website. If you're reading on a mobile device, these tags are shown below a blog post. 

These tags are there for categorizing a blog post. It helps visitors to quickly find other blog posts related to a specific topic. 

In the background, my self-coded CMS saves the tags in a column "tags" in the MariaDB database:

MariaDB > SELECT title,tags FROM blog WHERE id = 1488;
+-----------------------------------------------+---------------+
| title                                         | tags          |
+-----------------------------------------------+---------------+
| Keepalived VIP not responding on RHEL 9 / EL9 | Linux Network |
+-----------------------------------------------+---------------+
1 row in set (0.000 sec)

As you can see, multiple tags are space separated entries in the same (varchar) column.

Querying tags using LIKE string function

Until recently I used the LIKE string function to search for blog posts matching a specific tag:

mysqli_query($dbh, "SELECT * FROM blog WHERE tags LIKE '%$tag%' ORDER BY timeanddate DESC");

This query worked fine - in most situations.

The used LIKE function in the SQL query can be understood like a grep command (without addtional parameters). For easily identifiable tags, such as Linux or Monitoring this works fine.

A problem however are tags which "contain" the same word. For example: The string "DB" shows up in "MariaDB" and in "MongoDB". The current SQL query using LIKE finds all these blog articles, similar to what a grep would do on the command line:

MariaDB > SELECT title,tags FROM blog WHERE tags LIKE '%DB%' ORDER BY timeanddate DESC LIMIT 0,5;
+----------------------------------------------------------------------------------------+-------------------------------------------+
| title                                                                                  | tags                                      |
+----------------------------------------------------------------------------------------+-------------------------------------------+
| How to change Icingaweb2 password for icingaadmin user in backend database             | Icinga Monitoring Databases MySQL MariaDB |
| Travis CI builds fail with expired key error on MongoDB APT repository                 | Cloud Coding Linux MongoDB                |
| How to reset MariaDB root login to use unix socket authentication                      | MariaDB Linux Databases                   |
| How to list and backup only specific MySQL or MariaDB databases using Ansible Playbook | Ansible Linux MySQL MariaDB               |
| Analyzing and fixing Internal Server Error in PowerDNS API and Opera DNS UI            | DNS PowerDNS MySQL MariaDB Linux          |
+----------------------------------------------------------------------------------------+-------------------------------------------+
5 rows in set (0.007 sec)

I could argue and say that it's not relevant for the "DB" string. But for the "Proxy" string it is. I wanted to clearly distinguish topics between "Proxy" and "HAProxy" - yet both posts showed up under the "Proxy" tag:

Blog posts using tag filtering with LIKE query function in the background

It works - but it's not 100% accurate. 

Switching to REGEXP string function

While I was debugging a Wordpress issue for a customer, I came across some SQL queries which used the REGEXP string function. I personally wasn't aware that MySQL/MariaDB could do this so I had to look it up. Using grep and sed and other commands on the command line I'm familiar with regular expressions. That could actually be a great improvement in my "blog tag" issue!

My first attempt failed, as I wanted to use a query with whitespaces (using \s) as separator between the different words in the string. But after more research I found out that REGEXP supports special characters to indicate "word start" and "word end" - or better said "word boundaries":

[[:<:]], [[:>:]]
These markers stand for word boundaries. They match the beginning and end of words, respectively. A word is a sequence of word characters that is not preceded by or followed by word characters. A word character is an alphanumeric character in the alnum class or an underscore (_). 

Perfect, that's even better than trying to use whitespaces in the regular expression.

I adjusted the SQL query to use REGEXP (with the word boundaries) instead of LIKE:

MariaDB > SELECT title,tags FROM blog WHERE tags REGEXP '[[:<:]]Proxy[[:>:]]' ORDER BY timeanddate DESC LIMIT 0,5;
+---------------------------------------------------------------+-----------------------------------------+
| title                                                         | tags                                    |
+---------------------------------------------------------------+-----------------------------------------+
| How to verify and monitor Tinyproxy as a HTTP forward proxy   | Monitoring Proxy Linux Internet Network |
| Follow-up on using NTLM on Windows 7 for Proxy Authentication | Windows Proxy Network                   |
| Using NTLM on Windows 7                                       | Windows Proxy                           |
| Hot times, summer in the city                                 | Personal Bluecoat Proxy                 |
+---------------------------------------------------------------+-----------------------------------------+
4 rows in set (0.008 sec)

Only the Proxy tags now show up in the results (no more HAProxy posts)! 

After adjusting the query in the PHP code to use REGEXP instead of LIKE:

mysqli_query($dbh, "SELECT * FROM blog WHERE tags REGEXP '[[:<:]]${tag}[[:>:]]' ORDER BY timeanddate DESC");

And now, finally, my blog shows only the posts with the distinguished blog tag:

I had been thinking of improving this for quite a while. Luckily I came across these REGEXP queries during one of my troubleshooting sessions. :-)


Add a comment

Show form to leave a comment

Comments (newest first)

No comments yet.

RSS feed

Blog Tags:

  AWS   Android   Ansible   Apache   Apple   Atlassian   BSD   Backup   Bash   Bluecoat   CMS   Chef   Cloud   Coding   Consul   Containers   CouchDB   DB   DNS   Databases   Docker   ELK   Elasticsearch   Filebeat   FreeBSD   Galera   Git   GlusterFS   Grafana   Graphics   HAProxy   HTML   Hacks   Hardware   Icinga   Influx   Internet   Java   KVM   Kibana   Kodi   Kubernetes   LVM   LXC   Linux   Logstash   Mac   Macintosh   Mail   MariaDB   Minio   MongoDB   Monitoring   Multimedia   MySQL   NFS   Nagios   Network   Nginx   OSSEC   OTRS   Observability   Office   OpenSearch   PHP   Perl   Personal   PostgreSQL   PowerDNS   Proxmox   Proxy   Python   Rancher   Rant   Redis   Roundcube   SSL   Samba   Seafile   Security   Shell   SmartOS   Solaris   Surveillance   Systemd   TLS   Tomcat   Ubuntu   Unix   VMWare   VMware   Varnish   Virtualization   Windows   Wireless   Wordpress   Wyse   ZFS   Zoneminder    Linux