
Since we are writing a program that makes use of the public infrastructure that is the Internet, it makes sense to play fair and make our programs behave properly, so that we can avoid clashes with webmasters, or, even worse, retaliation.

Webmasters may not want their site to be spidered by other computers, and they have a way to say so clearly: the robots.txt file. Inside a robots.txt file, which is a simple text file stating one or more rules, a webmaster can declare whether scraping is allowed at all, limit it to only some files and folders, or allow it only for specific spiders.
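For example, a robots.txt with a couple of hypothetical rules might look like this, telling a spider called BadBot to stay away entirely while every other spider only has to avoid the /private/ folder:

User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/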

Our plan for playing fair is to read the site's robots file and stay well clear of the actual content if we are not allowed to read it. The wisest thing to do here is to stand on the shoulders of giants, since other people have already solved this problem, and luckily some of them have shared their efforts for everyone else to use. This means that we do not have to reinvent the wheel, and only need to write a few lines of code to read the robots.txt file and make our program compliant.

What we are going to do is use the package manager inside Visual Studio to add and reuse the existing RobotsTxt code by its author, Çağdaş Tekin. So first open the solution and from the Project menu choose the item that says ‘Manage NuGet Packages’. A new window will open. Choose the online option on the left, enter RobotsTxt in the ‘Search Online’ box and press enter. A list with the relevant packages will be shown:

Install the one that says ‘RobotsTxt: A robots.txt parser for .Net’. When prompted to select the project to install to, just click OK since at the moment we only have one project. The package will be installed, and a green circle with a checkmark will confirm this. We can close the package manager window and get back to our task at hand.
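As a side note, if you prefer the Package Manager Console over the graphical window, the same package can most likely be installed with a single command (assuming the package ID is simply RobotsTxt, as it appears in the search results):

Install-Package RobotsTxt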

We will now add a new function called CheckRobots, that uses RobotsTxt to determine if we are allowed to spider the page that the user requested. Add the following code at the bottom of Form1.cs:

private bool CheckRobots(string url)
{
    var robotsFileLocation = new Uri(url).GetLeftPart(
        UriPartial.Authority) + "/robots.txt";
    var robotsFileContent = client.DownloadString(robotsFileLocation);
    Robots robots = Robots.Load(robotsFileContent);
    return robots.IsPathAllowed("keywordChecker", url);
}

This function takes a url as a string parameter and returns a boolean value that signifies whether this url may be accessed by a robot or not. In the function's first line, we concatenate the robots.txt filename to the site's base url and store the result in the robotsFileLocation variable; this is the address where the robots file lives. Then we download the actual robots.txt file from the website and store it in robotsFileContent, and finally we ask the RobotsTxt library to scan the file and tell us whether we are allowed to read the page. This last part is done in two steps: first we load the file with Robots.Load, and then we return the result of robots.IsPathAllowed.

This should work on sites that explicitly state their conditions with a robots.txt file. Of course, on the web we can run into a whole lot of other situations that cause an exception while reading the file. If the robots.txt file does not exist or is unreadable, we will assume that we are allowed to read anything on that particular website. To handle this, we embed our code inside a try catch statement like so:

var robotsFileLocation = new Uri(url).GetLeftPart(
    UriPartial.Authority) + "/robots.txt";
try
{
    var robotsFileContent = client.DownloadString(
        robotsFileLocation);
    Robots robots = Robots.Load(robotsFileContent);
    return robots.IsPathAllowed("keywordChecker", url);
}
catch
{
    return true;
}

An exception is raised when the program's flow at run time deviates from the outcome we planned for. For example, in our case we try to read the robots.txt file but for some reason fail to do so; this causes the code to throw an exception. We have not used any exception handlers so far, but it is wise to use them in your code. A good programmer always tries to think about what unexpected situations can arise while the end user is working with the program, and adds code so that exceptions do not blow up in the user's face.

If we expect that some part of our code can cause an exception, we embed that code inside a try block. The code that we want to run once an exception is triggered goes inside a catch block. In our case, if we cannot read the file, an exception is thrown, we catch it, and the function returns true to signify that the url is allowed. Another structure that we are not using here is the finally block, which runs regardless of whether an exception occurred or not, and is normally used for cleanup code such as closing opened streams or disposing of unwanted objects. You do not need to know this in depth right now.
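Just so you can picture it, here is a standalone sketch of how the three blocks fit together (this is not part of our program, just an illustration using a throwaway WebClient and an example url):

var exampleClient = new WebClient();
try
{
    // Code that might throw an exception goes inside the try block.
    var content = exampleClient.DownloadString(
        "http://www.example.com/robots.txt");
}
catch (WebException ex)
{
    // This runs only when an exception of the matching type is thrown.
    Console.WriteLine("Download failed: " + ex.Message);
}
finally
{
    // This runs whether an exception was thrown or not; handy for cleanup.
    exampleClient.Dispose();
}

With that out of the way, let's move on.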

The last thing we need to do is call the new CheckRobots function from our button click event, wrap the code from the previous iteration inside an if statement so that it only runs if we are allowed, and otherwise display a message to the user that the site does not allow spidering.

private void btnCheck_Click(object sender, EventArgs e)
{
    client = new WebClient();
    var url = txtUrl.Text;
    url = !string.IsNullOrEmpty(url) && Uri.IsWellFormedUriString(url,
        UriKind.Absolute) ? url : "http://www.gametrailers.com";
    var keywords = txtKeywords.Text;
    keywords = !string.IsNullOrEmpty(keywords) ? keywords :
        "final fantasy";

    if (CheckRobots(url))
    {
        var pageContent = client.DownloadString(url);
        var keywordLocation = pageContent.IndexOf(keywords,
            StringComparison.InvariantCultureIgnoreCase);
        StringBuilder sb = new StringBuilder();
        if (keywordLocation >= 0)
        {
            var pageIds = Regex.Matches(pageContent, @"id=""\s*?\S*?""");
            string matchedId = closestId(keywordLocation, pageIds);
            string idTag = matchedId.Substring(4, matchedId.Length - 5);
            brwPreview.Navigate(url + "#" + idTag);
            sb.AppendFormat("{0} are talking about {1} today.", url,
                keywords);
            sb.Append("\n\nSnippet:\n" + pageContent.Substring(
                keywordLocation, 100));
            sb.AppendFormat("\n\nClosest id: {0}", idTag);
        }
        else
        {
            sb.Append("Keyword not found!");
        }
        lblResult.Text = sb.ToString();
    }
    else
    {
        lblResult.Text = "Blocked by robots.txt!";
    }
}

We can observe that the program is becoming more modular now. What this means is that the program is split into different modules, each with a specific function that can be reused in this and any future programs. You may have noticed that we are already reaping the benefits of modularity by reusing code that someone else developed for a function we needed, but we will expand more on modularity in the next installment.

Once again, the full source code for this tutorial is available at GitHub.

Last time we left off with a GUI for our keyword checker program. What would be the next logical building block to add now that we have our checker with a nice preview pane?

Well, the preview pane is not too handy if it does not show the part that we are looking for, so let us improve that today. We will use regular expressions to achieve this. In short, a regular expression acts like a text search, but instead of searching for one specific keyword, it uses a pattern that matches a whole set of possible text results. So our first piece of code is to add a using directive for System.Text.RegularExpressions at the top of our form code.
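That directive is a single line, placed near the other usings in Form1.cs:

using System.Text.RegularExpressions;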

But first, let’s diverge a bit to describe how we are going to scroll the preview pane to our desired location. Since the web page is in HTML format, we can find an HTML element on the page that we can just scroll to. We just need the id of that element. Thankfully, with regular expressions we can get a list of all the ids inside the page, and then scroll to the one that is closest to our content. Simple.
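To make the idea concrete, suppose the page happens to contain an element like the following (a hypothetical id, purely for illustration):

<div id="news">...</div>

Navigating to http://www.example.com/#news would then scroll that element into view, which is exactly the trick we are about to exploit.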

The regex (a shorter term for regular expression) that we are going to use to find all matches of element ids is this: @"id=""\s*?\S*?""". Inside a C# verbatim string a double quote is escaped by doubling it, so the pattern really matches the literal text id=" followed lazily by the characters of the id, up to the closing quote.

And we will use it as follows:

var pageIds = Regex.Matches(pageContent, @"id=""\s*?\S*?""");

This will give us the page ids we wanted and conveniently store them as a list of matches in our pageIds variable. Now we also need a private function that will give us the closest element to our content. A function is a piece of code that does a specific task, and we usually create a function for each simple task we need, so that we can use it in different parts of our program without having to rewrite the same code over and over. It could even be used by other programs, if it weren't for that private adjective I've used (in technical terms called an access modifier). The private access modifier limits the function's use to within the same class, in our case the program's form. We are happy with that, so let's move on.

Here’s our function:

private string closestId(int keywordLocation,
    MatchCollection matchingIds)
{
    int? closestId = null;
    string closestIdName = null;
    foreach (Match id in matchingIds)
    {
        if (closestId != null)
        {
            int idDistance = Math.Abs(id.Index - keywordLocation);
            if (idDistance < closestId.Value)
            {
                closestId = idDistance;
                closestIdName = id.Value;
            }
        }
        else
        {
            closestId = Math.Abs(id.Index - keywordLocation);
            closestIdName = id.Value;
        }
    }
    return closestIdName;
}

The function, which I named closestId, takes two parameters. The first one is the index of our original keyword search (described in the first part of the tutorial), and the second is the list of regex matches. What is important is that this list of matches contains the id and index of each match. The function iterates through the list of matches to find the one closest to our keywordLocation. The distance between each match and the keyword is calculated with the absolute value function Math.Abs (now that is a handy public function!). Every time a new minimum distance is found, we store that distance and the matching id's name, replacing the previous minimum until an even closer match comes along. Initially the closest distance is null, so the first match in the list is always set as the closest during the first iteration. Once the loop ends, we return the name of the closest id that we found. The function is then called from the main function like this:

string matchedId = closestId(keywordLocation, pageIds);

Actually, we just need the id of the element without the id= part, so let’s go ahead and strip it off:

string idTag = matchedId.Substring(4, matchedId.Length - 5);

This last piece of code could also go inside the closestId function, so feel free to put it there. The last piece of the puzzle is to navigate to the page as we did before, but by appending the id to the url (prefixed with a hash sign) we get the nice effect of scrolling the element with this id into view.

brwPreview.Navigate(url + "#" + idTag);

This method is not guaranteed to work 100% of the time, as some websites may not have any id attributes, or the id of the closest element may not be all that close to our content, but it's a start. I also increased the size of the window from the previous tutorial so that we have more space for the preview pane. The full source code for this tutorial is available on GitHub. Here is a sample screenshot.

It is all good to have the basics of a program working on the console, but we can do better than that. Today we shall add a graphical user interface (GUI) to our keyword checker.

Let us start a new project like we did in part one, only this time choose Windows Forms Application as the project type. This will allow us to add an interface with clickable buttons. Once the project is ready, we will have a number of components that can be dragged and dropped from our toolbox (shown below) onto our form (the user window).

Add the following components to the form as follows:

Component type   Name          Text content
label            lblUrl        URL
label            lblKeywords   Keywords
textbox          txtUrl
textbox          txtKeywords
button           btnCheck      Check
label            lblResult
webBrowser       brwPreview

The resulting form should look as shown here below. We have two text controls so that our users can enter the Url and keywords, a button that will trigger the search process, a label to show a snippet of the website content with the selected keywords, and a web browser to show a short preview of the page that the user is searching. There is no doubt that once finished, this interface will look better than the console application that we had before.

Now we need to convert the code from our console application so that it works with our new form. The code should run when the user clicks the button. We create a handler by double-clicking the button, and the editor wires up the necessary handler for us. When the application runs, handlers make sure that the correct code is executed when particular events are triggered, in this case the click event of the Check button. Here is the code that goes in the btnCheck_Click(...) method that was just created for us, followed by the explanation:

var client = new WebClient();
var url = txtUrl.Text;
url = !string.IsNullOrEmpty(url) && Uri.IsWellFormedUriString(url,
    UriKind.Absolute) ? url : "http://www.gametrailers.com";
var keywords = txtKeywords.Text;
keywords = !string.IsNullOrEmpty(keywords) ? keywords : "final fantasy";
var pageContent = client.DownloadString(url);
var keywordLocation = pageContent.IndexOf(keywords,
    StringComparison.OrdinalIgnoreCase);
StringBuilder sb = new StringBuilder();
if (keywordLocation >= 0)
{
    sb.AppendFormat("{0} are talking about {1} today.", url, keywords);
    sb.Append("\n\nSnippet:\n" + pageContent.Substring(keywordLocation,
        100));
    brwPreview.Navigate(url);
}
else
{
    sb.Append("Keyword not found!");
}
lblResult.Text = sb.ToString();

As usual, first we create a web client instance that will be used to fetch the results. Then comes the new part. Instead of reading the user's input from the console, we read the url from the form's textbox. This is done by using the textbox's name (txtUrl) and reading its Text value (which holds the user's input), and assigning it to the url variable. We then check if the url is valid, and use the default one otherwise (to understand how this works, take a look at the previous tutorial). We do likewise with the keywords textbox. Then, as before, we download the page's content, check if the required keywords exist, and display the results on the user's screen.

One difference this time is that we use the StringBuilder class to prepare the output before displaying it (by copying it to lblResult.Text), as opposed to directly writing each part of the result to the console.

The other difference is that now that we have a graphical interface, we can embed a preview of the page inside our form. This can be achieved quickly by using the browser component from our toolbox and pointing it to the selected url (simply done by using the Navigate method). We will improve this in a future tutorial.

Running the project and entering some url and keywords will look as shown here:

Again, a full version of the code is also available on github here.

In the last tutorial, I showed you how to do a quick web client to find out if a particular site contains the keywords that you are interested in. Today, we’ll make a small addition. We are going to add command line parameters, so we can look for the U.S. job situation on a whim.

Command line parameters allow us to add settings to a program when it is being launched, so that our program’s users can choose where and what to look for without having to recompile the program.

If we look at the main method we did earlier, we can see that it accepts the following parameter:

string[] args

This means that any parameters from the command line are contained in this array of strings.

Let’s say that we are going to accept two parameters: the first one for the url to be checked, and the second one for the keywords to find. If only one or no parameters are passed, we will default to our settings that were used in the previous example. The following code is used to read the url parameter:

var url = args.Length > 0 && Uri.IsWellFormedUriString(args[0],
UriKind.Absolute) ? args[0] : "http://www.gametrailers.com";

Here we are creating a new variable called url and setting it to a value. Now comes the interesting part. We are using a ternary operator to set this value. A ternary operation has the following format:

(condition) ? (value when true) : (value when false)

So it is really a shorthand for checking a condition and picking one of two values accordingly. In the conditional statement we check whether there are any command line parameters (args.Length > 0) and (&&) whether the first parameter (args[0]) is a well-formed url. The ternary operator then makes sure that if both conditions are true we use the passed url, and otherwise we fall back to the default gametrailers url.
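If you have never seen the ternary operator before, here is a tiny standalone example with hypothetical values, purely to show its shape:

var resultCount = 3;
var label = resultCount > 0 ? "Results found" : "No results";

Since resultCount is greater than zero, label ends up as "Results found"; otherwise it would become "No results".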

Next we do the same with the keywords parameter:

var keywords = args.Length > 1 ? args[1] : "final fantasy";

This time we only check whether there is a second parameter (args.Length > 1) and use it (args[1]) if there is, falling back to the default keywords otherwise. Now let's end by trying some searches.
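Before doing so, here is the top of our Main method with both lines in context. This is just a sketch; the rest of the body stays exactly as it was in the previous part:

static void Main(string[] args)
{
    var client = new WebClient();
    var url = args.Length > 0 && Uri.IsWellFormedUriString(args[0],
        UriKind.Absolute) ? args[0] : "http://www.gametrailers.com";
    var keywords = args.Length > 1 ? args[1] : "final fantasy";
    // ... the download, keyword check and output from the previous part
    // continue here unchanged.
}

From a command prompt the program can then be started with something like keywordCheck.exe http://www.gametrailers.com "final fantasy" (the exact executable name depends on what you named your project).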

The full version of the code is also available on github here.

In this short introduction to C#, I tried to do something different by making a program that scrapes a website to report whether it is currently showing content that we want to see. Scraping means having an automated program retrieve a website's content in order to extract some useful information from it. The example that we will build will check if our favourite website (e.g. gametrailers.com) is posting some information about our favourite game (e.g. Final Fantasy) and any of its modern versions and updates. If so, we can then visit the Gametrailers website safe in the knowledge that we will see something about Final Fantasy.

First, open Visual Studio and create a new console application. I named mine keywordCheck, but you are free to choose your own name.

This will create a standard program class containing a Main method that will be executed every time that we run the program. It is currently empty, so let us fix that.
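The generated file will contain something along these lines (usings and namespace trimmed for brevity):

class Program
{
    static void Main(string[] args)
    {
    }
}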

Since we will be using the system's web client library to connect to and fetch the required page, let us add a using directive for it at the top of our class:

using System.Net;

Now let us try to fetch the page that we require, using the following code:

static void Main(string[] args)
{
    var client = new WebClient();
    var url = "http://www.gametrailers.com";
    Console.Write(client.DownloadString(url));
    Console.ReadKey();
}

Here, we are first initialising a new instance of a web client and assigning it to the client variable. Then we set the url variable to the required url to fetch, and finally we instruct the client to fetch this url for us and output the page's HTML to the console. When we run the program, we can confirm that we are indeed fetching the page:

That’s great, but we’re still not there yet. Let us add a new variable to hold the keywords in. Then we can make a check to see if these keywords are included in the downloaded web page. If the website includes the text that we are looking for, we display a confirmation message:

var client = new WebClient();
var url = "http://www.gametrailers.com";
var keywords = "final fantasy";
var pageContent = client.DownloadString(url);
if (pageContent.IndexOf(keywords, StringComparison.OrdinalIgnoreCase)
    >= 0)
{
    Console.WriteLine(url + " are talking about " + keywords +
        " today.");
}
Console.ReadKey();

The IndexOf method returns the zero-based position in the page where the keywords were found, or -1 if they are not found. We also instruct this method to ignore case when comparing strings, so we still find the keywords even if they appear in a different case. The if statement displays a message whenever the returned number is zero or greater.
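As a quick standalone illustration of how IndexOf behaves (hypothetical strings, not taken from our page):

var position = "Final Fantasy news".IndexOf("fantasy",
    StringComparison.OrdinalIgnoreCase);
// position is 6, the zero-based index where the match starts.
// Searching for text that is not there would return -1 instead.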

To finish off this tutorial, we will also display a snippet of the text where the keywords are included in the fetched website. Nothing big and fancy, but it will give us a general idea of what the page’s content is.

static void Main(string[] args)
{
    var client = new WebClient();
    var url = "http://www.gametrailers.com";
    var keywords = "final fantasy";
    var pageContent = client.DownloadString(url);
    var keywordLocation = pageContent.IndexOf(keywords,
        StringComparison.OrdinalIgnoreCase);
    if (keywordLocation >= 0)
    {
        Console.WriteLine(url + " are talking about " + keywords +
            " today.");
        Console.WriteLine("\nSnippet:\n" + pageContent.Substring(
            keywordLocation, 100));
    }
    Console.ReadKey();
}

And here is the result:

Next time we will see how to improve upon this code, such as by adding command line parameters or a GUI. A full version of the code is also available on GitHub here, and part 2 of the tutorial is here.