C#: Retrieve data from webpage

September 29th, 2008 by Mel

So I came across a fun assignment this week that I’m sure has been done by many different people in many different programming languages. The challenge was to “scrape” a website for information autonomously and save it off to a file.

I accomplished this by first using a wrapper class for .NET’s own HttpWebRequest class that simplifies posting to a website and retrieving the result. I then used regular expressions to find the data I wanted, stored it in a string, and later wrote it to a file.

I’m not going to provide the specific program I wrote, as it’s still proprietary, but I will give a small example of how this can be done. The example includes posting to a website, retrieving the results (the HTML for the page), and parsing the resulting page to find what you want.

The class I used to post to the site was written by Robert May and can be found here: http://geekswithblogs.net/rakker/archive/2006/04/21/76044.aspx

Here is an example of using this class to perform a search on Craigslist under the ‘for sale’ category and retrieve the results:

// Create the post object
PostSubmitter post =
    new PostSubmitter("http://provo.craigslist.org/search/sss");
 
// Add our parameters
post.PostItems.Add(
    "query",
    "Ford Truck"
);
 
// Specify our action type (Post | Get)
post.Type = PostSubmitter.PostTypeEnum.Get;
 
// Retrieve the results
string result = post.Post();
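
If you are curious what a GET submission like this boils down to, here is a minimal sketch that does not use PostSubmitter at all. The class and method names (QueryUrl, Build, Fetch) are my own hypothetical helpers, not part of Robert May's class, and the exact encoding PostSubmitter produces may differ from what Uri.EscapeDataString gives you. The idea is simply to append URL-encoded name=value pairs to the address and read the response body as a string.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper for illustration only; not part of PostSubmitter.
public static class QueryUrl
{
    // Appends URL-encoded name=value pairs to baseUrl as a query string.
    public static string Build(
        string baseUrl,
        IEnumerable<KeyValuePair<string, string>> parameters)
    {
        StringBuilder url = new StringBuilder(baseUrl);

        // Use '&' if the base URL already carries a query string.
        char separator = baseUrl.Contains("?") ? '&' : '?';

        foreach (KeyValuePair<string, string> parameter in parameters)
        {
            url.Append(separator);
            url.Append(Uri.EscapeDataString(parameter.Key));
            url.Append('=');
            url.Append(Uri.EscapeDataString(parameter.Value));
            separator = '&';
        }

        return url.ToString();
    }

    // Performs a GET request and returns the response body (the page HTML).
    // Assumes the server answers with text; error handling is omitted.
    public static string Fetch(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";

        using (WebResponse response = request.GetResponse())
        using (StreamReader reader =
            new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Calling QueryUrl.Build with the search URL above and a single "query" / "Ford Truck" pair yields http://provo.craigslist.org/search/sss?query=Ford%20Truck, which can then be passed to QueryUrl.Fetch to get the page source.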

Here is an example of parsing out the items returned by the search and printing them to the command line:

// This expression looks for a sub-string in the form of
// "<p>...<a...>...ItemText...</a>" and returns the item text.
string expression = "<p>[^<]*<a[^>]*>(?<item>[^<]*)</a>";
 
// This executes the regular expression and returns all 
// matches found.
MatchCollection matches =
    Regex.Matches(
        result,
        expression,
        RegexOptions.Singleline |
            RegexOptions.Multiline |
            RegexOptions.IgnoreCase
);
 
// This illustrates how to retrieve the ItemText from the
// collection of matches.
foreach (Match regexMatch in matches)
{
    Console.WriteLine(regexMatch.Groups["item"].Value);
}
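
I mentioned at the start that the scraped data eventually gets written to a file, which the snippets here don't show. Below is a rough sketch of that last step. The ItemExporter class, its method names, and the sample HTML in the usage note are made up for illustration; only the regular expression is the one from this post.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class ItemExporter
{
    // Same pattern as above: "<p>...<a...>ItemText</a>".
    private const string Expression =
        @"<p>[^<]*<a[^>]*>(?<item>[^<]*)</a>";

    // Returns the text of every item link found in the HTML.
    public static List<string> ExtractItems(string html)
    {
        List<string> items = new List<string>();

        foreach (Match regexMatch in
            Regex.Matches(html, Expression, RegexOptions.IgnoreCase))
        {
            // Trim stray whitespace around the link text.
            items.Add(regexMatch.Groups["item"].Value.Trim());
        }

        return items;
    }

    // Writes one item per line, overwriting any existing file at path.
    public static void SaveItems(string path, List<string> items)
    {
        File.WriteAllLines(path, items.ToArray());
    }
}
```

For example, passing the string "<p> <a href=\"/1.html\">Ford F-150</a></p>" to ExtractItems returns a list containing the single entry "Ford F-150", and SaveItems("items.txt", items) would then write it to items.txt in the working directory.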

And here’s the entire program. To compile it, you need to include the PostSubmitter class in your project and add a reference to System.Web.

using System;
using System.Text.RegularExpressions;
 
using Snowball.Common;
 
namespace TestPost
{
    class Program
    {
        // The url to post to. Must include the 'http://' part
        static string target_url = "http://provo.craigslist.org/search/sss";
 
 
        static void Main(string[] args)
        {
            // Create the post object
            PostSubmitter post = new PostSubmitter(target_url);
 
            // Add our parameters. The field name ("query") can be found by
            // viewing the target page's source in a browser.
            post.PostItems.Add(
                "query",     // The name of the input field for the search
                "Ford Truck" // The value to submit for this field
            );
 
            // Specify we're doing a GET. Could also perform a POST this way.
            // It just depends on what the target site uses. Again, you can
            // find out by viewing the page source.
            post.Type = PostSubmitter.PostTypeEnum.Get;
 
            // Retrieve the results
            string result = post.Post();
 
 
            /*
               Parse the result and pull out the items listed for sale.
            */
 
            // This expression looks for a sub-string in the form of
            // "<p>...<a...>...ItemText...</a>" and returns the item text.
            string expression = @"<p>[^<]*<a[^>]*>(?<item>[^<]*)</a>";
 
            // This executes the regular expression and returns all 
            // matches found.
            MatchCollection matches =
                Regex.Matches(
                    result,
                    expression,
                    RegexOptions.Singleline |
                        RegexOptions.Multiline |
                        RegexOptions.IgnoreCase
            );
 
            // This illustrates how to retrieve the ItemText from the
            // collection of matches.
            foreach (Match regexMatch in matches)
            {
                Console.WriteLine(regexMatch.Groups["item"].Value);
            }
 
            Console.ReadKey();
        }
    }
}

6 comments

  1. Jennifer says:

    Hey, I need a little assistance. I am working on something similar: I have developed a web browser, and now I want to add extra functionality so that if I am on some page in that specific browser and hit save/import or any button, it can get the required data and export it into an Excel file.

    Can you please help? I’ll extremely appreciate it; it’s so important for me. Please! Waiting for a reply.

  2. juan matos says:

    Thanks Mel.  I’ve been out of work for 3 years and finally have a job interview coming up Thursday.  The job’s main responsibility is to write programs that extract data off the internet.  I applied for the job and was asked to come in for an interview.  But my only experience with data extraction from the web for a program to use has been via copy and paste.  So I started searching the web for help so that I can familiarize myself with this type of coding asap so that I don’t look like a total non-contender.  Your code fits the bill perfectly as far as what it accomplishes, and I’ve been to many sites.  Super thanks for putting the tutorial together.  If I get the job I’ll be sure to show you my appreciation through PayPal or something.  Thanks.

  3. juan matos says:

    Hi there.  Thanks for the tutorial.  But it doesn’t work for me.  I have copied and pasted the complete program along with the PostSubmitter class and nothing gets printed to the screen.  I printed the post.count and I get 0 matches.  It appears that the regular expression isn’t matching anything.  Is the Craigslist format different now?  Is the problem the regular expression?  Does this code as-is still work for you now?  I’m just learning regular expressions so this is tough for me.  Would you please help me to get this program to work?  I’ll greatly appreciate it.  I can link your website to mine if you want.  My site comes and goes but should be nice and stable soon.  What do you say?

    • Mel says:

      Hey Juan,

      Thanks for bringing this to my attention! As you noticed, I wrote this post over 3 years ago and I’m sure that Craigslist has altered their website format since then. I’ll take a look at it as soon as I get a chance and update the code as necessary. Thanks!

  4. Erin says:

    Here is a similar example in Python.  This one actually uses IE to do most of the heavy lifting since it was a JavaScript-intensive site.

    http://phlik.ishpeck.net/index.....76600phlik

    It’s on my old website, which is just horrid to navigate.
