Still Life

A Series of Mental Snapshots

Posts Tagged ‘Scraping’

4th Year Project: Scraping Pages!

Posted by Steve on October 27, 2008

The first step to my project was to get the main information that I was going to display on my web page. As mentioned in the previous post, the two pages I want information from are the schedule of classes and the undergraduate calendar. I used Ruby to accomplish this, if you want more info on scraping with Ruby drop me a comment.

The information I needed to pull from the schedule of classes was the basics for each class including the department, the class name and number, the enrollment cap and total, what day/time it was schedule for and the instructor.  A quick inspection of the page demonstrated that it was very poorly design, no information had tagged IDs or anything useful in that way. After a little digging and with the help of this command:

ie.show_tables

I was able to ascertain that the schedule of classes page had 2 tables, with the second table containing ALL of the desired information on the page; now if this isn’t bad web design I don’t know what is!

Alright I was getting somewhere, the information is in a table, now it was just a matter of finding the right logic to catch the information I wanted. Again after a little investigation I found that the department name would appear in the first column of a given row, and after I caught the department name there was a concrete pattern to where the rest of the information I wanted was placed. The logic I ended up using (and excuse the lack of indenting, wordpress is being a pain) was:


size.times do

if(classArray[count][0] == dept)

Write desired information to an excel form (for the time being)

end

count = count+1 #Increase count so that it checks the next row

end

So this basically iterates through each department page and pulls down all of the information I need! I apply very similar logic to pull down all of the information from the Undergraduate calendar. As things stand I have all of the current information available to students in an excel sheet, this is only temporary, I will shortly be putting it in to a MySQL database so that I could start putting it with the Django framework!

As far as next steps, well that would be to start learning Python/Django and getting a development environment up to play with, should be fun!

–Steve

Advertisements

Posted in Fourth Year Project | Tagged: , , , , , , , , , , , | Leave a Comment »