Perl

Search for files using Perl

I needed to find a file in my eBooks collection, but due to various issues, the collection is only half organised at this point. I tried the Vista Search box in the top right corner (stolen right from Mac OS), but because the files are on a mapped drive, the search would not work. Worse, it returned "No items match your search" rather than admitting it couldn't search there. So I decided to write a Perl script that would let me search for files with a keyword in their filename.

Now, don't get me wrong, I'm not just being lazy for the sake of being lazy. I would gladly look through my eBooks folder by hand if it wasn't overly complex. However, when you consider that the eBooks folder is 17.6GB in size, with 2607 files, you might start to see why I wanted to be able to search. Additionally, I have sorted the books into folders named after their publishers, but there are still 1396 items outside of these folders, because there is no easy way to tell who published them other than opening each file and eyeballing the first or second page, which takes a long time for so many files. Even within each folder (O'Reilly for example), the naming is inconsistent: some files have the year at the start of the filename, some have just the book name, some have the book name with full stops instead of spaces, and some have the publisher's name at the start (and even then, some say Oreilly and some say O'Reilly). I have recently written a script that goes through the filenames, removes the publisher and replaces dots with spaces, but that is a story for another post.

So as you can see, it's not just laziness (although laziness is one of the three virtues of great Perl programmers). I threw together a script that takes the keyword I am searching for and prints the path to each file that matches. This resulted in:

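#!/usr/bin/perl

use strict;
use warnings;
use File::Find;

my $dir = ".";             # start in the current directory
my $searchTerm = shift;    # the keyword is the first command-line argument
die "Usage: $0 keyword\n" unless defined $searchTerm;

# Print the path of every entry whose name contains the keyword (ignoring case)
find ( sub { print "$File::Find::name\n" if $_ =~ /$searchTerm/i; }, $dir);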

Now, I had trouble understanding the find function (part of File::Find) myself, so I'll give you a quick rundown. Basically, find walks the directory given in its second argument ($dir in this case), including all subdirectories, and calls the subroutine passed as its first argument once for every file and directory it comes across. In this case, I have used an anonymous subroutine. Inside it, $_ holds the bare name of the current entry, and $File::Find::name holds its path, starting from $dir. The rest is fairly simple: the regular expression checks whether the name matches the keyword, ignoring case, and if it does, the path is printed.
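To see what find actually hands to the subroutine, here is a minimal sketch that builds a small throwaway tree and prints the three File::Find variables for each file. The folder and file names are made up for illustration, and the temporary directory is only there so the example can be run anywhere.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);
use File::Path qw(make_path);

# Build a small throwaway tree (hypothetical names) to search through.
my $root = tempdir( CLEANUP => 1 );
make_path("$root/OReilly");
for my $path ("$root/OReilly/Perl Cookbook.pdf", "$root/notes.txt") {
    open my $fh, '>', $path or die "Cannot create $path: $!";
    close $fh;
}

my @seen;
find( sub {
    return unless -f $_;    # find() visits directories too; skip them here
    # $_ is the bare name, $File::Find::dir the directory it lives in,
    # and $File::Find::name the two joined with a slash.
    push @seen, [ $_, $File::Find::dir, $File::Find::name ];
    print "name=$_  dir=$File::Find::dir  path=$File::Find::name\n";
}, $root );
```

The -f test works on the bare name because find changes into each directory before calling the subroutine (its default behaviour).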

This worked perfectly fine, but then I got the idea that I might want to limit the results even more by allowing for multiple keywords. For example, if I just write "Cookbook" as the keyword, I get 64 results (all but two being in the O'Reilly directory). If I'm looking specifically for where Perl Cookbook is, this can be a few too many results to easily skim. But if I can put in the arguments Cookbook and Perl for my script, and only print the result if both match, then I'll be much better off.

The second version of the script came out like this:

#!/usr/bin/perl
 
#
# Date: 13/02/2008
#
# Author: Adrian Pavone
#
# This script takes any number of keywords as arguments, and then
# searches the current directory, and all subdirectories, for those
# keywords. It only returns a match if it finds all keywords in the
# same filename.
#
 
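use strict;
use warnings;
use File::Find;

my $dir = ".";
my @searchTerms = @ARGV;

find ( sub {
          # Skip this entry as soon as one keyword fails to match
          for my $term (@searchTerms) {
              return unless $_ =~ /\Q$term/i;
          }
          # Every keyword matched, so print the path
          print "$File::Find::name\n";
       },
       $dir);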

This script now has a comment header, which will help if I have to come back to it later (with such a short script it's not necessarily required, but with a name like find.pl or search.pl, the filename could mean a number of different things). As for the modifications, the script now accepts any number of terms as an array and loops over them. Each term is checked against the filename; the subroutine returns as soon as one fails to match, so the filename is printed only if all of them match. The \Q protects against accidental insertion of regular expression special characters, and you can read more about this at http://perldoc.perl.org/perlfaq6.html#How-can-I-quote-a-variable-to-use-in-a-regex%3f.
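To show why the \Q matters, here is a small sketch with made-up filenames and search terms. It compares a quoted and an unquoted match, and uses quotemeta, the function form of \Q, to show what the escaped term looks like.

```perl
use strict;
use warnings;

# Hypothetical filename and search term, chosen to show the difference.
my $filename = "Perl 125 Tips.pdf";
my $term     = "1.5";

# Unquoted, the "." means "any character", so "1.5" happily matches "125".
my $unquoted = $filename =~ /$term/i   ? 1 : 0;

# With \Q the term is matched as literal text, so there is no match.
my $quoted   = $filename =~ /\Q$term/i ? 1 : 0;

# quotemeta() does the same escaping as \Q.
my $escaped  = quotemeta($term);

# Some terms are not even valid patterns on their own: an unbalanced
# parenthesis makes the match die, which eval catches here.
my $bad      = "(2nd";
my $compiled = eval { my $m = ( "Perl Cookbook (2nd Edition).pdf" =~ /$bad/i ); 1 };

print "unquoted: $unquoted, quoted: $quoted, escaped term: $escaped\n";
print "'$bad' is not a valid pattern without \\Q\n" unless $compiled;
```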

The new script is now much more versatile, and has been added to my Perl utility toolkit as a replacement for the find utility, primarily because the Unix find utility is not available on Windows (by default anyway), whereas Perl is installed on every box that I use for any considerable amount of time. It also means that I have something that I can easily call from my own scripts whenever needed.
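For calling the search from other scripts, one option is to wrap the same logic in a subroutine that returns the matches instead of printing them. This is only a sketch: the name find_by_keywords is my own invention, not something provided by File::Find.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: returns the path of every entry under $dir whose
# name contains all of @terms (ignoring case, matched as literal text).
sub find_by_keywords {
    my ($dir, @terms) = @_;
    my @matches;
    find( sub {
        for my $term (@terms) {
            return unless $_ =~ /\Q$term/i;
        }
        push @matches, $File::Find::name;
    }, $dir );
    return @matches;
}

# Same behaviour as the script above: search the current directory
# for the keywords given on the command line.
print "$_\n" for find_by_keywords( ".", @ARGV );
```

Returning a list keeps the printing out of the helper, so a calling script can count, sort or filter the results before doing anything with them.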