Regular expressions in C#
October 20, 2009 by Conor
Before I started my Web Crawler project I always used to wonder how you’d go about pulling data from raw HTML. After reading up on the inner workings of a web crawler it became clear that I should be using regular expressions and quit fooling around with String methods such as IndexOf and Substring.
So basically, a regular expression is an ‘expression’ that is used to find patterns in text that are of interest to the user. I’m not going to go into specifics here as I am still learning the basics and don’t want to misinform anybody. If you’re interested you should visit the links below, or if you’re already familiar with them and would like to see how to use them in C#, read on. Here is a link to the MSDN documentation on the Regex class.
The example below is a simplified version of the example on the MSDN site. It finds anchor tags in HTML and prints the matches to the screen.
using System;
using System.Text.RegularExpressions;
String html = "<a href=\"http://www.conorhackett.com\">Conor Hackett</a>";
// Matches "<a", then everything up to the next "<", then the closing "</a>".
Regex expression = new Regex("<a\\s*[^<]+\\s*</a>");
// Run expression on html and store all matches in matches.
// MatchCollection stores all found matches in a collection of objects of type Match.
MatchCollection matches = expression.Matches(html);
// Print out all the found matches.
foreach (Match match in matches)
{
    Console.WriteLine(match.Value + "\n");
}
The above snippet should output the complete anchor element:
<a href="http://www.conorhackett.com">Conor Hackett</a>
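Matching the whole anchor is only half the job for a crawler, since what you usually want is the URL and the link text. Here is a minimal sketch of how that might look with named capture groups. The LinkExtractor class, its Extract method and the pattern itself are my own illustration, not part of the MSDN example, and the pattern assumes simple anchors with a double-quoted href and no nested tags.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Assumed pattern: captures the href value as "url" and the link text as "text".
    private static readonly Regex Anchor = new Regex(
        "<a\\s+[^>]*href=\"(?<url>[^\"]*)\"[^>]*>(?<text>[^<]*)</a>",
        RegexOptions.IgnoreCase);

    // Returns one (url, text) pair for every anchor found in the html.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match match in Anchor.Matches(html))
        {
            links.Add(new KeyValuePair<string, string>(
                match.Groups["url"].Value,
                match.Groups["text"].Value));
        }
        return links;
    }
}
```

Calling LinkExtractor.Extract on the html string from the example above gives a single pair, with http://www.conorhackett.com as the key and Conor Hackett as the value.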
Links:
http://www.ultrapico.com/Expresso.htm
http://www.codeproject.com/KB/dotnet/regextutorial.aspx
http://regexlib.com/
Basic usage of WebClient class in C#.NET
October 18, 2009 by Conor
Ok, so I’ve been learning both VB & C# (.NET 3.5) for about a month now. I’m progressing slowly with VB but C# is going a lot smoother and I feel a lot more comfortable with it. This is probably down to the fact that I’ve come from Java and the syntax is very similar.
My first big project is a web crawler that will be able to find me rapidshare.com links. I’m probably being a bit too ambitious and I haven’t blogged about it until now just in case it was a complete failure. It wasn’t, and I think I have made some decent progress that is worthy of at least one post anyway! Enough rambling, let’s get into some code!
I’m using the WebClient class, documentation available here.
To access this class we need to import its namespace like so:
using System.Net;
After that there are really only a few lines. All you need to do is create a new WebClient object and then call its DownloadString method with the URL as a parameter. Take note that the URL must begin with http:// or similar or else you’ll get an exception. In fact you should really have this in a try/catch block. The DownloadString method returns a String that you can process as you like. It’s really that simple.
Here goes:
WebClient client = new WebClient();
String result = client.DownloadString("http://www.conorhackett.com");
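Since I said the call really belongs in a try/catch block, here is a rough sketch of what that could look like. The PageDownloader class and its TryDownload method are names I made up for illustration, and returning null on failure is just one possible choice. It catches WebException, which is what WebClient throws when a download fails.

```csharp
using System;
using System.Net;

public static class PageDownloader
{
    // Hypothetical helper: returns the page body, or null if the download fails.
    public static string TryDownload(string url)
    {
        try
        {
            // WebClient is disposable, so release it once the download is done.
            using (WebClient client = new WebClient())
            {
                return client.DownloadString(url);
            }
        }
        catch (WebException ex)
        {
            // Report the problem and let the caller decide what to do next.
            Console.Error.WriteLine("Download failed: " + ex.Message);
            return null;
        }
    }
}
```

A crawler could then call PageDownloader.TryDownload("http://www.conorhackett.com") and simply skip any page that comes back as null.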
I’m currently working on a Crawler class of my own that will provide methods to download a page as HTML and grab links from it using regular expressions. At the moment I’m finalising a little app that will demonstrate these features.
WP Plugins!
October 4, 2009 by Conor
As far as looks and function are concerned I can finally say that my blog is coming along nicely. By way of content there is still much room for improvement. I just thought I’d tell you about the various plugins that I have installed.
My favourite is WP-Cumulus, which takes all your tags and/or categories and displays them in a nice revolving ‘cloud’. It’s made with Flash and looks pretty nice! Maintenance is a simple plugin that takes your blog offline if you need to do some urgent maintenance on it. I only used it once, when I first installed WordPress and had no posts or theme.
This blog is going to contain a lot of code snippets, so I knew I would need something to display code as if you were reading it from an IDE. That is where SyntaxHighlighter Plus comes in. This plugin is pretty cool because it’s not just for WordPress. You could install it on any site that needs proper code highlighting. Follow the link to get a list of supported languages.
I am also using Akismet to monitor all comments posted and check them for spam. It is actually quite good, and even though my blog is pretty new it has already caught 2 rogue comments. Although it is installed by default when you install WP, it is worth noting that you need to get an API key from here before you can activate it and start protecting yourself from spam comments.
I should really start blogging about programming now... I promise my next post will be more exciting!!
Automatically login to Eircom stats site
Ok, so it’s pretty annoying having to log in to the eircom site each time you want to check your usage. I figured there must be an easier way. What I needed to do was automatically submit the login form and have it take me to the stats page, quite simple really! There are two parts: the login form, and the onload event that tells the browser to submit the form when the page has loaded.
Just replace the opening body tag with this:
<body onload="window.document.stats.submit()">
And put this inside the body tags:
<form method="POST" name="stats" action="http://broadbandsupport.eircom.net/stats.asp">
<input type="hidden" name="username" value="YOUR-PHONE NUMBER" />
<input type="hidden" name="password" value="YOUR ACCOUNT NUMBER" />
</form>
*NOTE: Your phone number must be in the format [areacode-number]
I plan on making an iGoogle gadget for this. I will also try to support other ISPs. I must also put my PHP skills to the test and try to dynamically generate the HTML and let users download the complete HTML file for use with their account.
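To give an idea of what generating the page dynamically might involve, here is a small sketch. It is written in C# rather than PHP, purely to match the other code on this blog. The StatsPageGenerator class and its Build method are invented for illustration. The form markup is the same as above, and the surrounding html and body tags are my assumption about the rest of the page.

```csharp
using System.Net;
using System.Text;

public static class StatsPageGenerator
{
    // Hypothetical helper: builds the auto-submitting login page for one account.
    public static string Build(string phoneNumber, string accountNumber)
    {
        // Encode the values so characters like quotes cannot break the markup.
        string user = WebUtility.HtmlEncode(phoneNumber);
        string pass = WebUtility.HtmlEncode(accountNumber);

        var page = new StringBuilder();
        page.AppendLine("<html>");
        page.AppendLine("<body onload=\"window.document.stats.submit()\">");
        page.AppendLine("<form method=\"POST\" name=\"stats\" action=\"http://broadbandsupport.eircom.net/stats.asp\">");
        page.AppendLine("<input type=\"hidden\" name=\"username\" value=\"" + user + "\" />");
        page.AppendLine("<input type=\"hidden\" name=\"password\" value=\"" + pass + "\" />");
        page.AppendLine("</form>");
        page.AppendLine("</body>");
        page.AppendLine("</html>");
        return page.ToString();
    }
}
```

The returned string could then be written to a file with File.WriteAllText, or served as a download.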
STAY TUNED!
UPDATE: Here is the full HTML document. Just right-click and “save target as”. You’ll then need to open the file in Notepad to enter your phone number and account number.
About Me
Just a quick post to tell you a little bit about me and the type of content you should expect to find on this blog as it grows.
My name is Conor Hackett and I’m currently studying Computer Science in Griffith College Dublin. I am one week into third year of this course and I have loved every bit of it so far, except for the odd bit of Maths/repeat exam related stress!
At the moment my main interests are in anything computer related. I’m loving web development right now, using CodeIgniter with PHP and MySQL. I would consider myself to be more of a backend, behind-the-scenes developer as opposed to a designer. I really don’t have a good eye for colour and I hate using CSS because of the whole area of cross-browser compatibility! Having said all of that, I would really like to take a course in pure web design and Photoshop some day and change all that!
On the software programming side of things I have been using Java for the last two years in college, but this year I will be learning Visual Basic.NET. I’m not too happy about that as I would prefer C#, but maybe I’ll try to learn that on my own alongside VB. To date I haven’t made any GUI programs; it has all been on the console unfortunately, but this year should change all that with the use of VB!
That’s really it for the moment. Hopefully, if I am disciplined and post here regularly, that’s what you should find yourself reading about if you do decide to come back.
P.S. Some time in the near future I will maintain a proper “About” page and keep it up to date, as this post will become outdated very quickly (hopefully!)