Rewrite Perl script with HTML::Strip to use newer HTML::Restrict module


As this is July 2024, we are in the first month after the end of official EPEL 7 (RHEL 7, CentOS 7) support. Some of the EPEL 7 servers have been running for a long time, so it would not be a surprise if certain applications or scripts stopped working after a distribution upgrade.

In this particular example, I've come across a problematic Perl script which uses HTML::Strip and used to work fine under RHEL 7. But once the server was upgraded, the script would fail.

The purpose of HTML::Strip

The purpose of the HTML::Strip module is to look for HTML tags (e.g. <a....) and remove the HTML code from a given string. In the script below, that string is passed as the first command line argument:

root@rhel7 ~ $ cat /home/ck/

use HTML::Strip;

my $tf = HTML::Strip->new();
my $html_dirty=$ARGV[0];
my $html_clean = $tf->parse($html_dirty);
print $html_clean."\n";

On a normal text input, this would simply show the same text again:

root@rhel7 ~ $ /home/ck/ "Text output"
Text output

But if the text is enclosed in angle brackets, it is treated as an HTML tag and removed completely:

root@rhel7 ~ $ /home/ck/ "<Text output>"
root@rhel7 ~ $

See? No output.
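
For a script that cleans more than one string, a minimal sketch along these lines may be useful. It assumes HTML::Strip is installed (as on the RHEL 7 server); the strip_all helper and the sample strings are made up for illustration and are not part of the original script. The module's eof method resets the parser state, so it is called between separate inputs.

```perl
use strict;
use warnings;

# Strip tags from each string in turn, reusing a single HTML::Strip object.
# strip_all is a hypothetical helper name, not part of the module.
sub strip_all {
    my @inputs = @_;
    require HTML::Strip;    # loaded lazily; dies if the module is missing
    my $hs = HTML::Strip->new();
    my @clean;
    for my $html (@inputs) {
        push @clean, $hs->parse($html);
        $hs->eof;           # reset parser state before the next string
    }
    return @clean;
}

# Hypothetical sample inputs for illustration.
my @samples = ('Text output', '<Text output>', '<b>Text</b> output');

if (eval { require HTML::Strip; 1 }) {
    print "[$_]\n" for strip_all(@samples);
} else {
    print "HTML::Strip is not installed on this system\n";
}
```

The brackets around each result only make empty output visible.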

Can't locate HTML/ in @INC

But after the OS upgrade from RHEL7 to RHEL8, the script no longer worked and failed with the following error:

root@rhel8 ~ $ /home/ck/ "Text output"
Can't locate HTML/ in @INC (you may need to install the HTML::Strip module) (@INC contains: /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5) at ./ line 15.
BEGIN failed--compilation aborted at ./ line 15.

Annoying, but it can happen. Seems like not all installed packages, the Perl packages specifically, were upgraded. That's what I thought.

But as it turns out, the HTML::Strip RPM package is called perl-HTML-Strip and - surprise! - only exists for EPEL 7.

A dnf search did not return the wanted perl-HTML-Strip package either, but it hinted at another package (perl-HTML-Restrict):

root@rhel8 ~ $ dnf search perl | grep strip
Red Hat CodeReady Linux Builder for RHEL 8 x86_  24 MB/s | 9.7 MB     00:00
perl-HTML-Restrict.noarch : Perl module to strip unwanted HTML tags and attributes

Note: I also found packages named "perl-HTML-StripScripts" and "perl-HTML-StripScripts-Parser" using dnf; however, they did not provide the needed HTML/ file.

The reason for the "missing" perl-HTML-Strip package in newer RHEL (EPEL) versions seems to be a bug in HTML::Strip with UTF-8 encoded text. Or the package maintainer just wasn't up to it anymore. Who knows.
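
To check quickly which of the two modules a given server can actually load, something like the following sketch could help. It uses only core Perl; module_available is a hypothetical helper name introduced here, and the list of modules is just the two discussed in this article.

```perl
use strict;
use warnings;

# Return 1 if the named module can be loaded from the current @INC, else 0.
# module_available is a hypothetical helper for illustration.
sub module_available {
    my ($module) = @_;
    # Turn Some::Module into Some/Module.pm, the form require looks up in @INC.
    (my $file = "$module.pm") =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module (qw(HTML::Strip HTML::Restrict)) {
    printf "%-15s %s\n", $module,
        module_available($module) ? 'available' : 'not found in @INC';
}

# Show the directories Perl searches, as listed in the error message.
print "  $_\n" for @INC;
```

Running this before and after a distribution upgrade shows whether a module went missing, without having to trigger the failure in the real script.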

Shifting to HTML::Restrict module

The dnf search above already pointed to another Perl module (perl-HTML-Restrict) which is available as a package. A quick look at the HTML::Restrict documentation showed that it is very similar to the old HTML::Strip module.

Rewriting the Perl script to use HTML::Restrict instead of HTML::Strip turned out to be easier than trying to somehow get the old HTML::Strip module onto the RHEL8 system!

Installing the perl-HTML-Restrict package installed a bunch of other Perl modules from the codeready and epel repositories as well:

root@rhel8 ~ $ dnf install perl-HTML-Restrict.noarch
Install  23 Packages

Total download size: 1.0 M
Installed size: 1.8 M
Is this ok [y/N]: y

The Perl script was then rewritten to use the newer HTML::Restrict module:

root@rhel8 ~ $ cat /home/ck/

use HTML::Restrict;

my $hr = HTML::Restrict->new();
my $html_dirty=$ARGV[0];
my $html_clean = $hr->process($html_dirty);
print $html_clean."\n";

Told you it looks very similar! ;-)

And execution of the script works again!

root@rhel8 ~ $ /home/ck/ "Text output"
Text output

root@rhel8 ~ $ /home/ck/ "<Text output>"
root@rhel8 ~ $
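
With no arguments, HTML::Restrict removes all tags, which is why it works as a drop-in replacement here. As the package description hints ("strip unwanted HTML tags and attributes"), it can also keep selected tags. The following is a minimal sketch, assuming the rules option as described in the HTML::Restrict documentation; restrict_html and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Clean $html, keeping only <b> (without attributes) and <a> with its href.
# restrict_html is a hypothetical helper name, not part of the module.
sub restrict_html {
    my ($html) = @_;
    require HTML::Restrict;    # loaded lazily; dies if the module is missing
    my $hr = HTML::Restrict->new(
        rules => {
            b => [],           # tag allowed, no attributes kept
            a => [qw(href)],   # tag allowed, only href kept
        },
    );
    return $hr->process($html);
}

# Hypothetical sample input for illustration.
my $sample = '<b>Text</b> <i>output</i> '
           . '<a href="https://example.com/" onclick="x()">link</a>';

if (eval { require HTML::Restrict; 1 }) {
    print restrict_html($sample), "\n";
} else {
    print "HTML::Restrict is not installed on this system\n";
}
```

In this sketch the i tag and the onclick attribute are not in the rules and should therefore be dropped, while the text inside them stays.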

