Wednesday 9 November 2011

Today you will explore the use of regular expressions to scrape information from a web page. You will use the regular expression sandpit. Read the instructions on the sandpit, and experiment to make sure you understand how it works.

The regex crib sheet may be helpful for reference.

The regular expression sandpit lets you enter a regular expression, and then tag the strings that match it with HTML markup. For example, here is some text scraped from http://uktv.co.uk/food/recipe/aid/513585.
The first part of this exercise uses this text.

Ingredients
8 tbsp olive oil
450 g cooked peeled prawns, (ideally freshly shelled), drained and patted dry
1 medium onion, finely chopped
4 stick celery
2 cloves garlic, finely chopped
1 red chilli, de-seeded and finely chopped
1 bunch of fresh mint, leaves chopped
100 g cherry tomatoes, halved
juice of 1 lemons
4 courgettes, grated
150 ml white wine
black pepper
500 g fresh tagliarini
200 ml fresh vegetable stock
25 g unsalted butter
40 g parmesan, freshly grated
Method
Heat the olive oil in a large, heavy-based frying pan. 
Add the prawns and fry until lightly browned. 
Add the onion, celery, garlic, chilli and mint. Fry until the prawns turn nutty brown. 
Add the cherry tomatoes and cook for 2 minutes. 
Stir in the lemon juice, then the courgettes. Cook for 5 minutes. 
Pour in the white wine, bring to the boil and cook for 2 minutes. Remove from direct heat. 
Bring a large pan of salted water to the boil. Add the fresh tagliarini and cook until al dente, around 2 minutes. Strain. 
Add the strained tagliarini to the prawn mixture, followed by the vegetable stock, butter and Parmesan cheese. 
Mix well together over a medium heat with a wooden spoon until glossy. Season with salt and freshly ground pepper and serve at once.

Most recipes on this site have a similar format.

Paste the text into the regular expression sandpit.

Task 1

First, put the regular expression [0-9].* into the regexp box. Then put li into the tag box. This shows you the result of enclosing each match for the regular expression between tags <li> and </li>.

This almost lists all the ingredients, but not quite. Try using the following regular expressions, one by one. Try to understand what is happening.

  • [0-9].*
  • \n[0-9].*
  • \n([0-9]|juice).*
  • \n([0-9]|juice|black).*
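If you want to check your understanding outside the sandpit, the four expressions can be tried in Python. This is only a rough sketch: the helper tag_matches is a hypothetical stand-in for the sandpit (it assumes the sandpit simply wraps every match in the chosen tag), and the sample is a shortened version of the recipe text above.

```python
import re

def tag_matches(pattern: str, tag: str, text: str) -> str:
    """Wrap every match of pattern in <tag>...</tag>.

    Hypothetical stand-in for the sandpit's "tag text" button.
    """
    return re.sub(pattern, lambda m: f"<{tag}>{m.group(0)}</{tag}>", text)

# Shortened copy of the recipe text (a few lines only).
sample = (
    "Ingredients\n"
    "8 tbsp olive oil\n"
    "450 g cooked peeled prawns\n"
    "juice of 1 lemons\n"
    "black pepper\n"
    "Method\n"
    "Add the cherry tomatoes and cook for 2 minutes. \n"
)

patterns = [
    r"[0-9].*",                   # any digit, then the rest of the line
    r"\n[0-9].*",                 # only lines that START with a digit
    r"\n([0-9]|juice).*",         # ...or start with "juice"
    r"\n([0-9]|juice|black).*",   # ...or start with "black"
]

# Count how many <li> items each expression produces.
counts = [tag_matches(p, "li", sample).count("<li>") for p in patterns]
for p, n in zip(patterns, counts):
    print(f"{p!r:32} -> {n} matches")
```

The first expression picks up "1 lemons" and "2 minutes. " in the middle of lines, which is why anchoring on the preceding newline helps.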

Once you are satisfied that you have successfully picked out all the ingredients, click tag text.

To finish the markup of the ingredients, try the regular expression (<li>\n.*)*</li> first with b as the tag, and then ul. Again click tag text.
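The same step can be sketched in Python to see what the expression matches. The string tagged below is an assumption about what the text looks like after the li step (each match, including its leading newline, wrapped in li tags), shortened to three ingredients.

```python
import re

# Assumed state of the text after the <li> tagging step (shortened).
tagged = (
    "Ingredients"
    "<li>\n8 tbsp olive oil</li>"
    "<li>\n4 stick celery</li>"
    "<li>\nblack pepper</li>"
    "\nMethod\n"
)

# The group (<li>\n.*) repeats once per list item; the final </li>
# forces the match to end at the close of the last item.
list_pattern = r"(<li>\n.*)*</li>"

wrapped = re.sub(list_pattern, lambda m: f"<ul>{m.group(0)}</ul>", tagged)
print(wrapped)
```

Because the whole run of items is one match, only a single pair of ul tags is added around the list.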

Now use similar methods to tag each step in the method with li and the whole sequence of steps with ol.

Complete the markup by tagging the words "Ingredients" and "Method" with the tag h2.

Task 2

Now find a recipe online (choose one that is well-structured). Scrape the text using copy and paste and use appropriate regular expressions and tags to mark it up in HTML.

Task 3

Find the Amazon UK page listing Digital Cameras. Use View Source in your browser to examine the HTML for this page. How are the prices tagged?

Copy the text from the web page (not from the View Source page) and paste it into the regular expression sandpit, replacing the Jabberwocky poem.

Can you write a regular expression that selects all the prices on this page and nothing else?
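One possible starting point, sketched in Python. The listing text here is invented for illustration (the product names and prices are not from the real page), and the sketch assumes prices are written as a pound sign, digits with optional thousands commas, and two decimal places. Check it against the real text, which may differ.

```python
import re

# Made-up listing text, standing in for text copied from the page.
listing = (
    "ExampleCam A10 14 Megapixel Compact £89.99\n"
    "ExampleCam Z900 18 Megapixel Bridge, 24x Optical Zoom "
    "£1,249.00 (was £1,399.00)\n"
    "Delivery in 2 days\n"
)

# Pound sign, a digit, more digits or commas, a dot, exactly two digits.
price_pattern = r"£[0-9][0-9,]*\.[0-9]{2}"

prices = re.findall(price_pattern, listing)
print(prices)
```

Requiring the pound sign and the two decimal places is what keeps other numbers, such as "14 Megapixel" or "2 days", out of the result.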


The Amazon page also has model numbers, details of resolution (megapixels), zoom factor (e.g. 8x Optical) and screen size (e.g. 2.7 inch), but these are not tagged specially in the page source.

To tag them you would first have to find them. Use the regular expression sandpit to find a regular expression for each data type that matches all occurrences of that data type.
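As a hedged illustration, here are candidate expressions for three of the data types, tried in Python. The sample text is invented, and the wording assumed for each type ("Megapixel", "x Optical", "inch") follows the examples given above. The real page may phrase things differently, so treat these as first drafts to refine.

```python
import re

# Made-up listing text, standing in for text copied from the page.
listing = (
    "ExampleCam A10: 14 Megapixel, 5x Optical Zoom, 2.7 inch LCD\n"
    "ExampleCam Z900: 12.1 Megapixel, 8x Optical Zoom, 3 inch LCD\n"
)

# A whole or decimal number, reused in each expression below.
number = r"[0-9]+(?:\.[0-9]+)?"

spec_patterns = {
    "resolution": number + r"\s*[Mm]egapixels?",
    "zoom": number + r"x\s+Optical",
    "screen": number + r"\s*inch",
}

# Collect every match for each data type.
found = {name: re.findall(p, listing) for name, p in spec_patterns.items()}
for name, matches in found.items():
    print(name, matches)
```

Model numbers are harder, because their format varies from one manufacturer to another.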

Task 4

For extra fun, scrape the text from this page at www.parliament.uk and tag each address in bold.
