So I came across a fun assignment this week that I’m sure has been done by many different people in many different programming languages. The challenge was to “scrape” a website for information autonomously and save it off to a file.
I accomplished this by first using a wrapper class around .NET's own HttpWebRequest object that simplifies posting to a website and retrieving the result. I then used regular expressions to find the data I wanted, stored it in a string, and later wrote it out to a file.
I'm not going to provide the specific program I wrote, as it's still proprietary, but I will give a small example of how this can be done. The example will include: posting to a website, retrieving the results (the HTML for the page), and parsing the resulting page to find what you want.
The class I used to post to the site was written by Robert May and can be found here: http://geekswithblogs.net/rakker/archive/2006/04/21/76044.aspx
Here is an example of using this class to perform a search on Craigslist under the 'for sale' category and retrieve the results:
// Create the post object
PostSubmitter post = new PostSubmitter("http://provo.craigslist.org/search/sss");

// Add our parameters
post.PostItems.Add("query", "Ford Truck");

// Specify our action type (Post | Get)
post.Type = PostSubmitter.PostTypeEnum.Get;

// Retrieve the results (the HTML of the results page)
string result = post.Post();
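That covers the posting and retrieving steps. For the parsing step, here is a minimal sketch of how you might pull the listing links out of the returned HTML with a regular expression and write them off to a file. The pattern and the "listings.txt" file name are just assumptions for illustration; Craigslist's actual markup will differ, so you'd adjust the expression to match whatever page you're scraping.

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

class ScrapeExample
{
    static void Main()
    {
        // In the real program, 'result' would come from post.Post() above;
        // a stub string is used here so the sketch stands on its own.
        string result = "<a href=\"/sss/12345.html\">1995 Ford Truck - $2000</a>";

        // NOTE: this pattern is an assumption, not Craigslist's actual markup.
        // It grabs each anchor tag's href and link text into named groups.
        Regex linkPattern = new Regex(
            "<a href=\"(?<url>[^\"]+)\">(?<title>[^<]+)</a>",
            RegexOptions.IgnoreCase);

        // Collect each match as a tab-separated line of title and URL
        StringBuilder output = new StringBuilder();
        foreach (Match m in linkPattern.Matches(result))
        {
            output.AppendLine(m.Groups["title"].Value + "\t" + m.Groups["url"].Value);
        }

        // Write the extracted listings out to a file
        File.WriteAllText("listings.txt", output.ToString());
    }
}

Regular expressions work fine for quick one-off scrapes like this, though keep in mind they get brittle fast if the site's HTML changes.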