

When RSS and Atom Are Missing

Feeds are perfect for watching interesting web sites for changes. But what if you want to watch a site that does not offer any feeds? Manual checking is inconvenient and unreliable. One of the easiest solutions is my primitive script, webWatch.

This script is meant to be executed periodically (most likely using the cron daemon). When executed for the first time, it downloads the given page and remembers its content. On each subsequent run, the current page content is compared to the previously stored one. Whenever a change is detected, you are notified by e-mail.
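The cycle described above can be sketched in bash roughly as follows. This is not the actual webWatch source, only a minimal illustration under stated assumptions: the function name check_page, the FETCH hook and the state-file path are hypothetical, and curl is assumed as the downloader.

```shell
#!/bin/bash
# Hypothetical sketch of the download / remember / compare cycle.
# Not the real webWatch script; names and paths are assumptions.

# Command used to download a page (assumed hook, defaults to curl).
FETCH=${FETCH:-"curl -fsS"}

# check_page URL STATE_FILE
#   returns 0 and prints a unified diff when the page has changed,
#   returns 1 on the first run or when nothing has changed,
#   returns 2 when the download fails.
check_page() {
    local url=$1 state=$2 new
    new=$(mktemp) || return 2
    if ! $FETCH "$url" > "$new"; then
        rm -f "$new"
        return 2
    fi
    if [ ! -f "$state" ]; then
        mv "$new" "$state"      # first run: just remember the content
        return 1
    fi
    if diff -u "$state" "$new"; then
        rm -f "$new"            # identical: nothing to report
        return 1
    fi
    mv "$new" "$state"          # differs: the diff was printed above
    return 0
}

# Example wiring (the address and the state path are placeholders):
#   if report=$(check_page "$url" "$HOME/.webWatch/page1"); then
#       printf '%s\n' "$report" | mail -s "webWatch: $url changed" "$address"
#   fi
```

Keeping the download command in a variable makes it easy to try the function on local files (FETCH=cat) before pointing it at a real site.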

At the moment, changes are detected and reported by a unified diff. It’s not the best tool in the world for processing HTML, but the output is usually readable enough. Replacing diff with another program would be simple, of course.

The script accepts three arguments:

  1. the URL to watch
  2. the e-mail address where to send the reports
  3. an AWK program to filter the HTML before it is diffed (optional)
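For illustration, the three arguments could be handled along these lines. This is only a sketch, not the code from webWatch itself: the function name parse_args, the variable names and the pass-through default for a missing filter are all my assumptions.

```shell
#!/bin/bash
# Hypothetical argument handling; not taken from the real webWatch.

# parse_args URL ADDRESS [AWK_PROGRAM]
#   sets the globals url, address and filter;
#   returns 64 when the number of arguments is wrong.
parse_args() {
    if [ $# -lt 2 ] || [ $# -gt 3 ]; then
        echo "usage: webWatch URL ADDRESS [AWK_PROGRAM]" >&2
        return 64
    fi
    url=$1
    address=$2
    filter='{ print }'          # assumed default: keep every line
    if [ $# -eq 3 ]; then
        filter=$3               # user-supplied AWK program
    fi
    return 0
}
```

With such a default, the downloaded page can always be piped through awk "$filter", whether or not a third argument was given.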

The third argument is there to tackle web sites that often change in places you are not interested in. For instance, some pages display the current date, so they would set off false positives every day. A quick’n’dirty one-liner to extract the contents of a <div id="main">…</div> could be:

open&&/<div/{l=$0;open+=gsub(/<div/,"",l)}; open; open&&/<\/div/{l=$0;open-=gsub(/<\/div/,"",l)}; /<div id="main">/{open=1;print $0};
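To see what the one-liner keeps and what it throws away, it can be tried on a small hand-made page. The sample HTML and the names FILTER and extract_main below are made up for this demonstration; only the AWK program itself comes from the article.

```shell
#!/bin/bash
# Demonstration of the AWK filter on a made-up sample page.

# The one-liner from the article, stored in a variable.
FILTER='open&&/<div/{l=$0;open+=gsub(/<div/,"",l)}; open; open&&/<\/div/{l=$0;open-=gsub(/<\/div/,"",l)}; /<div id="main">/{open=1;print $0};'

# Hypothetical helper: filter standard input through the program.
extract_main() {
    awk "$FILTER"
}

extract_main <<'EOF'
<html><body>
<div id="header">Today is Monday</div>
<div id="main">
<p>News</p>
<div class="box">inner</div>
</div>
<div id="footer">bye</div>
</body></html>
EOF
# Prints:
# <div id="main">
# <p>News</p>
# <div class="box">inner</div>
# </div>
```

The header with the date and the footer are dropped, while the nested div inside the main block survives because the program counts opening and closing div tags line by line.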

This is what an example crontab could look like:

# Check every ten minutes, send reports to <somebody@example.net>
*/10 * * * *	/home/zephyr/bin/webWatch http://www.example.com/page1 somebody@example.net

# Check every hour, send reports to <example@example.com>. Consider only the contents of <div id="main">…</div>.
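0 * * * *		/home/zephyr/bin/webWatch http://www.example.com/page2 example@example.com 'open&&/<div/{l=$0;open+=gsub(/<div/,"",l)}; open; open&&/<\/div/{l=$0;open-=gsub(/<\/div/,"",l)}; /<div id="main">/{open=1;print $0};'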

The timing can obviously be set in any other way. Just bear in mind that it is not very polite to poll a web server every single minute or so. I’d say that a 10-minute interval is a sensible minimum.

Feel free to give webWatch a try. I’ll be happy to hear your opinions and suggestions.

Download: webWatch (472B)

September 8, MMVIII — Linux and bash.
