So I came across a fun assignment this week that I’m sure has been done by many different people in many different programming languages. The challenge was to “scrape” a website for information autonomously and save it off to a file.
I accomplished this by first using a wrapper class around .NET’s own HttpWebRequest object that simplified posting to a web site and retrieving the result. I then used regular expressions to find the data I wanted, stored it in a string, and later wrote it to a file.
I’m not going to provide the specific program I wrote, as it’s still proprietary, but I will give a small example of how this can be done. The example covers: posting to a website, retrieving the results (the HTML for the page), and parsing the resulting page to find what you want.
The class I used to post to the site was done by Robert May and can be found here: http://geekswithblogs.net/rakker/archive/2006/04/21/76044.aspx
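If you’d rather not pull in the full class, here is a minimal sketch of the kind of wrapper it provides, built directly on HttpWebRequest. The class and method names below are my own illustrations, not Robert May’s actual API.

```csharp
using System;
using System.IO;
using System.Net;
using System.Web; // HttpUtility lives here; add a reference to System.Web

// A minimal sketch of the kind of wrapper PostSubmitter provides.
// Names are illustrative, not the actual PostSubmitter API.
class SimpleGetSubmitter
{
    // Builds a URL-encoded query string from a single field/value pair.
    public static string BuildQuery(string name, string value)
    {
        return HttpUtility.UrlEncode(name) + "=" + HttpUtility.UrlEncode(value);
    }

    // Issues a GET request and returns the raw HTML of the response.
    public static string Get(string url, string name, string value)
    {
        HttpWebRequest request =
            (HttpWebRequest)WebRequest.Create(url + "?" + BuildQuery(name, value));
        request.Method = "GET";

        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The real PostSubmitter class also handles POST requests and multiple parameters, which is why it’s worth grabbing for anything beyond a one-off.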
Here is an example of using this class to perform a search on Craigslist under the ‘for sale’ category and retrieve the results:
```csharp
// Create the post object
PostSubmitter post = new PostSubmitter("http://provo.craigslist.org/search/sss");

// Add our parameters
post.PostItems.Add("query", "Ford Truck");

// Specify our action type (Post | Get)
post.Type = PostSubmitter.PostTypeEnum.Get;

// Retrieve the results
string result = post.Post();
```
Here is an example of parsing out the items returned by the search and printing them to the command line:
```csharp
// This expression looks for a sub-string in the form of
// <a ...>ItemText</a> and captures the item text in the
// named group "item".
string expression = "<a[^>]*>(?<item>[^<]*)</a>";

// This executes the regular expression and returns all
// matches found.
MatchCollection matches = Regex.Matches(
    result,
    expression,
    RegexOptions.Singleline | RegexOptions.Multiline | RegexOptions.IgnoreCase);

// This illustrates how to retrieve the ItemText from the
// collection of matches.
foreach (Match regexMatch in matches)
{
    Console.WriteLine(regexMatch.Groups["item"].Value);
}
```
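Since the HTML of the results page isn’t shown above, here is a tiny self-contained demonstration of the same named-group technique. The HTML fragment is made up to resemble a search-result row; the real Craigslist markup will differ, but the extraction works the same way.

```csharp
using System;
using System.Text.RegularExpressions;

class NamedGroupDemo
{
    static void Main()
    {
        // A made-up fragment shaped like a search-result row.
        string html = "<p><a href=\"/sss/123.html\">1995 Ford Truck - $2000</a></p>";

        // The named group (?<item>...) captures just the link text
        // between the opening and closing anchor tags.
        Match m = Regex.Match(html, "<a[^>]*>(?<item>[^<]*)</a>", RegexOptions.IgnoreCase);

        Console.WriteLine(m.Groups["item"].Value); // prints: 1995 Ford Truck - $2000
    }
}
```

Named groups like `(?<item>...)` are what let the code above ask for `Groups["item"]` by name instead of counting capture positions.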
And here’s the entire program. To compile, you need to include the PostSubmitter class in your project and add a reference to System.Web.
```csharp
using System;
using System.Text.RegularExpressions;
using Snowball.Common;

namespace TestPost
{
    class Program
    {
        // The url to post to. Must include the 'http://' part.
        static string target_url = "http://provo.craigslist.org/search/sss";

        static void Main(string[] args)
        {
            // Create the post object
            PostSubmitter post = new PostSubmitter(target_url);

            // Add our parameters
            post.PostItems.Add(
                "query",      // The name of the input field for the search.
                              // This can be found by viewing the target page's
                              // source in a browser.
                "Ford Truck"  // The value to submit for this field
            );

            // Specify we're doing a GET. Could also perform a POST this way.
            // It just depends on what the target site uses. Again, you can
            // find out by viewing the page source.
            post.Type = PostSubmitter.PostTypeEnum.Get;

            // Retrieve the results
            string result = post.Post();

            /* Parse the result and pull out the items listed for sale. */

            // This expression looks for a sub-string in the form of
            // <a ...>ItemText</a> and captures the item text in the
            // named group "item".
            string expression = "<a[^>]*>(?<item>[^<]*)</a>";

            // This executes the regular expression and returns all
            // matches found.
            MatchCollection matches = Regex.Matches(
                result,
                expression,
                RegexOptions.Singleline | RegexOptions.Multiline | RegexOptions.IgnoreCase);

            // This illustrates how to retrieve the ItemText from the
            // collection of matches.
            foreach (Match regexMatch in matches)
            {
                Console.WriteLine(regexMatch.Groups["item"].Value);
            }

            Console.ReadKey();
        }
    }
}
```