When RSS and Atom Are Missing
Feeds are perfect for watching interesting web sites for changes. But what if you want to watch a site that does not offer any feeds? Checking manually is inconvenient and unreliable. One of the easiest solutions is my primitive script, webWatch.
This script is meant to be executed periodically (most likely by the cron daemon). When executed for the
first time, it downloads the given page and remembers its content. On each subsequent run, the current page content is compared
with the stored copy. Whenever a change is detected, you are notified by e-mail.
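The cycle described above can be sketched in a few lines of shell. This is only an illustration of the idea, not the actual webWatch source: the function name report_change, the state file, and the curl and mail commands in the usage comment are all assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical sketch of one check cycle; NOT the real webWatch code.
# Usage: report_change STATE_FILE NEW_FILE
#   STATE_FILE  copy of the page saved by the previous run
#   NEW_FILE    freshly downloaded copy of the page
# Prints a unified diff and returns 0 if the page changed; returns 1 otherwise.
report_change() {
    state=$1
    new=$2

    # First run: nothing to compare with yet, just remember the content.
    if [ ! -f "$state" ]; then
        cp "$new" "$state"
        return 1
    fi

    # diff exits with status 0 when both files are identical.
    if changes=$(diff -u "$state" "$new"); then
        return 1
    fi

    # Report the change and remember the new content for the next run.
    printf '%s\n' "$changes"
    cp "$new" "$state"
    return 0
}

# Possible use from a cron job (assumed commands and variable names):
#   curl -s "$url" > "$new" || exit 1
#   if report=$(report_change "$state" "$new"); then
#       printf '%s\n' "$report" | mail -s "Page changed: $url" "$address"
#   fi
```

Keeping the comparison separate from the download and the mail delivery makes it easy to try the logic on local files first.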
At the moment, changes are detected and reported as a unified diff. It’s not the best tool in the
world for processing HTML, but the output is usually readable enough. Replacing diff with another program
would be simple, of course.
The script accepts three arguments:
- the URL to watch
- the e-mail address to send the reports to
- an AWK program to filter the HTML before it is diffed (optional)
The third argument is there to tackle web sites that often change in places you are not interested in.
For instance, some pages display the current date, so they would trigger a false positive every day. A quick’n’dirty
one-liner to extract the contents of a <div id="main">…</div> could be:
open&&/<div/{l=$0;open+=gsub(/<div/,"",l)}; open; open&&/<\/div/{l=$0;open-=gsub(/<\/div/,"",l)}; /<div id="main">/{open=1;print $0};
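To see what the filter keeps, the one-liner above can be run on a small test page. The page below and the helper name sample_page are made up for this demonstration; only the AWK program is taken from the text.

```shell
#!/bin/sh
# The AWK one-liner from the article, stored in a variable for readability.
filter='open&&/<div/{l=$0;open+=gsub(/<div/,"",l)}; open; open&&/<\/div/{l=$0;open-=gsub(/<\/div/,"",l)}; /<div id="main">/{open=1;print $0};'

# Hypothetical helper: prints a made-up page with a date in the header,
# a main block containing a nested div, and a timestamp in the footer.
sample_page() {
    cat <<'EOF'
<html><body>
<div id="header">Today is Monday</div>
<div id="main">
<p>Article text</p>
<div class="note">A nested block</div>
</div>
<div id="footer">Generated at 12:00</div>
</body></html>
EOF
}

# Only the lines from <div id="main"> to its matching </div> get through.
sample_page | awk "$filter"
```

This prints the four lines of the main block, from <div id="main"> through its closing </div>, and drops the header and footer. The variable open counts how many div elements are currently open, so the nested block does not end the output early. Since the program works line by line, it relies on the opening and closing tags of the main block sitting on separate lines.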
This is what an example crontab could look like:
# Check every ten minutes, send reports to <somebody@example.net>
*/10 * * * * /home/zephyr/bin/webWatch http://www.example.com/page1 somebody@example.net

# Check every hour, send reports to <example@example.com>.
# Consider only the contents of <div id="main">…</div>.
0 * * * * /home/zephyr/bin/webWatch http://www.example.com/page2 example@example.com 'open&&/<div/{l=$0;open+=gsub(/<div/,"",l)}; open; open&&/<\/div/{l=$0;open-=gsub(/<\/div/,"",l)}; /<div id="main">/{open=1;print $0};'

The timing can obviously be set in any other way. Just bear in mind that it is not very polite to poll a web server every single minute or so. I’d say that a 10-minute interval is a sensible minimum.
Feel free to give webWatch a try. I’ll be happy to hear your opinions and suggestions.
Download: webWatch (472B)