Perl

Where My Nerds At?

Quickie: Data Mining on the Web with Perl

This is a simple web mining script written in Perl.

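#!/usr/bin/perl
use strict;
use warnings;

use LWP::Simple;

# Number of pages to crawl, taken from the command line
my $numPages = $ARGV[0] or die "Usage: $0 numPages\n";

open(my $out, '>', '/home/user/out.html') or die "Cannot open output file: $!";
for my $i (1 .. $numPages) {
    print "$i\n";
    # get() returns undef on failure, so skip pages that could not be fetched
    my $content = get("http://coderswasteland.com/node/$i");
    next unless defined $content;
    print $out $content;
    print $out "******************\n";
}
close $out;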

This code is useful for pulling content from a Drupal website that serves its articles at sequential /node/N URLs. It takes the number of pages to crawl as a command-line argument and counts up to it, fetching one node per iteration. All of the content is saved to a single output file, with each page followed by a line of asterisks, so you can then apply whatever regex or parsing you need to get your desired result.

I have a much longer version of this which does parsing and builds feeds. You may also wish to skip the output file entirely, rather than writing it and reading it back later, and just process the information from within $content, split it into logical pieces, etc. This script is merely to get you started. If you're looking into web mining, I assume you already know about parsing and other topics, and this is just a quick way to grab web content.
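As a rough illustration of that post-processing step, here is a minimal sketch that splits a dump on the asterisk separator and pulls a title out of each page. The sample data and the sub names split_pages and page_title are made up for this example, and the regex-based title extraction is only an assumption about what "parsing" might mean for your pages.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the contents of the output file.
my $dump = "<html><title>First</title></html>\n"
         . "******************\n"
         . "<html><title>Second</title></html>\n"
         . "******************\n";

# Split the dump on the line of asterisks written between pages.
sub split_pages {
    my ($text) = @_;
    return grep { /\S/ } split /^\*+\n/m, $text;
}

# Return the <title> text of one page, or undef if it has none.
sub page_title {
    my ($html) = @_;
    return $html =~ m{<title>(.*?)</title>}is ? $1 : undef;
}

for my $page (split_pages($dump)) {
    my $title = page_title($page);
    $title = '(no title)' unless defined $title;
    print "$title\n";
}
```

The same two subs could be applied to $content directly inside the crawl loop if you would rather not write the intermediate file at all.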

A spiffier version would follow the links on each page and push what it finds onto a tree, rather than auto-incrementing a URL. This is left as an exercise for the reader :)
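For anyone who wants a head start on that exercise, here is one possible starting point, not the longer version mentioned above. It uses a plain queue (breadth-first) rather than a tree, and the sub name crawl, the $fetch callback and the in-memory %site are all assumptions made so the sketch can be tried without a network connection. Passing \&LWP::Simple::get as the callback should work for real pages once LWP::Simple is loaded.

```perl
use strict;
use warnings;

# Breadth-first crawl. $fetch is a code ref that returns the HTML for a
# URL, or undef on failure. $max_pages caps how many pages are visited.
sub crawl {
    my ($start, $fetch, $max_pages) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);
    my @visited;

    while (@queue && @visited < $max_pages) {
        my $url  = shift @queue;
        my $html = $fetch->($url);
        next unless defined $html;
        push @visited, $url;

        # Naive href extraction; a real crawler would use an HTML parser
        # and resolve relative links against the current URL.
        while ($html =~ /href="([^"]+)"/g) {
            push @queue, $1 unless $seen{$1}++;
        }
    }
    return @visited;
}

# Hypothetical in-memory "site" used in place of real pages.
my %site = (
    '/node/1' => '<a href="/node/2">two</a> <a href="/node/3">three</a>',
    '/node/2' => '<a href="/node/1">one</a>',
    '/node/3' => 'no links here',
);

print "$_\n" for crawl('/node/1', sub { $site{ $_[0] } }, 10);
```

The %seen hash is what keeps the crawl from looping when pages link back to each other.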

Quickie: Multiple File Search and Replace

If you've ever needed to change the same data across multiple files, you know how tedious it is to go through every file and update it by hand or, at best, run an individual regex on each. Here's how you can make the changes all in one fell swoop.

perl -pi -w -e 's/searchregex/replaceregex/g;' *.fileextension

-w enable warnings
-i edit files in place
-p loop over every line of the input files, printing each line after the code runs
-e execute the following code

Here's an example:

perl -pi -w -e 's/\x0D//g;' *.txt

This looks for those pesky ^M characters (carriage returns, hex 0D) and replaces them with nothing.
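If the one-liner ever needs to grow into a script, the same in-place edit can be sketched with the special variable $^I and the <> operator. This is only an approximation of what the switches do. The sub name replace_in_files is made up, and unlike the bare -i above this version keeps a .bak copy of each file.

```perl
use strict;
use warnings;

# Apply a search-and-replace to each named file in place.
# $pattern is used as a regex; $replacement is inserted as plain text,
# so capture variables such as $1 inside it are not expanded.
sub replace_in_files {
    my ($pattern, $replacement, @files) = @_;
    return unless @files;    # with no files, <> would read STDIN instead

    local @ARGV = @files;    # <> reads the files named in @ARGV
    local $^I   = '.bak';    # like -i.bak: edit in place, keep a backup

    while (my $line = <>) {
        $line =~ s/$pattern/$replacement/g;
        print $line;         # while $^I is set, print goes into the file
    }
    return;
}
```

Calling replace_in_files('\x0D', '', glob('*.txt')) would then do roughly what the example one-liner does, leaving a .bak file next to each edited file.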

Quickie: Strip ^M (Control-M) Characters from Input File with Perl

For anyone who does file I/O and sometimes has to work with Windows-generated files in Linux, I feel your pain. Windows has these little nuances that sometimes make our GNU/Linux world a fun place to live. Luckily, Perl has a simple little system in place that allows us to remove control characters: regular expressions. Those not familiar will find great references at http://www.regular-expressions.info/ (a fantastic place to begin) and http://www.regextester.com/ (where you can test your brilliant work).

People who just want a quick piece of code, look no further:

If you want to do it all in one line from the CLI (of course replace *.txt with whatever extension):

perl -pi -w -e 's/\x0D//g;' *.txt

If you'd rather do it inline in a Perl script:

#Good Code
$yourLine =~ s/\x0D//g; #strips ^M characters

Simply trying to strip ^M characters with

#Bad Code
$yourLine =~ s/\^M//g; #matches a literal caret followed by M, not the ^M character

unfortunately does not work, because \^M matches a literal caret followed by the letter M, while ^M is only how editors and terminals display the carriage return character (hex 0D). The hex value above works great for me. I've run into this problem many times while taking third-party data feeds, which are sometimes generated in Windows, and trying to preprocess them in my GNU/Linux environment. The ^M gets interpreted as a new line and wreaks havoc on feeds where you expect all the information to be on one line in a fixed number of columns.
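To see the difference for yourself, here is a small sketch that runs both patterns over the same string and compares lengths. The sub names and the sample record are made up for illustration.

```perl
use strict;
use warnings;

# The "bad" pattern: matches a caret character followed by a capital M.
sub strip_caret_m {
    my ($s) = @_;
    $s =~ s/\^M//g;
    return $s;
}

# The "good" pattern: matches the carriage return character itself.
sub strip_cr {
    my ($s) = @_;
    $s =~ s/\x0D//g;
    return $s;
}

# A made-up Windows-style record ending in carriage return + line feed.
my $line = "field1,field2\x0D\n";

printf "original:   %d characters\n", length($line);
printf "after \\^M:  %d characters\n", length(strip_caret_m($line));
printf "after \\x0D: %d characters\n", length(strip_cr($line));
```

Only the \x0D version should come out one character shorter than the original.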

For more information on control characters, please go to: http://www.cs.tut.fi/~jkorpela/chars.html where Jukka Korpela explains in-depth what control characters are, what issues you may have, and how you may go about resolving them.
