Thursday, 14 October 2010

Tagging with regular expressions

For this exercise you will find a recipe online, scrape the text from the web page, and then tag it with HTML markup. For example, here is some text scraped from
You should work through the exercise with this text first.

8 tbsp olive oil
450 g cooked peeled prawns, (ideally freshly shelled), drained and patted dry
1 medium onion, finely chopped
4 stick celery
2 cloves garlic, finely chopped
1 red chilli, de-seeded and finely chopped
1 bunch of fresh mint, leaves chopped
100 g cherry tomatoes, halved
juice of 1 lemons
4 courgettes, grated
150 ml white wine
black pepper
500 g fresh tagliarini
200 ml fresh vegetable stock
25 g unsalted butter
40 g parmesan, freshly grated
Heat the olive oil in a large, heavy-based frying pan. 
Add the prawns and fry until lightly browned. 
Add the onion, celery, garlic, chilli and mint. Fry until the prawns turn nutty brown. 
Add the cherry tomatoes and cook for 2 minutes. 
Stir in the lemon juice, then the courgettes. Cook for 5 minutes. 
Pour in the white wine, bring to the boil and cook for 2 minutes. Remove from direct heat. 
Bring a large pan of salted water to the boil. Add the fresh tagliarini and cook until al dente, around 2 minutes. Strain. 
Add the strained tagliarini to the prawn mixture, followed by the vegetable stock, butter and Parmesan cheese. 
Mix well together over a medium heat with a wooden spoon until glossy. Season with salt and freshly ground pepper and serve at once.

Most recipes on this site have a similar format.

Paste the text into the regular expression sandpit.
You will use this tool to mark up the text in HTML. Read the instructions on the sandpit, and experiment to make sure you understand how it works.

First, put the regular expression [0-9].* into the regexp box. Then put li into the tag box. This shows you the result of enclosing each match for the regular expression between tags <li> and </li>.
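The same step can be tried outside the sandpit. The sketch below is a rough Python imitation, assuming the sandpit simply wraps every match in the chosen tag; the helper name `tag` and the four-line sample are made up for illustration and are not part of the sandpit.

```python
import re

def tag(pattern: str, tag_name: str, text: str) -> str:
    """Wrap every match of pattern in <tag_name>...</tag_name> (assumed sandpit behaviour)."""
    return re.sub(pattern, lambda m: f"<{tag_name}>{m.group(0)}</{tag_name}>", text)

# A few lines taken from the ingredient list above.
sample = "8 tbsp olive oil\njuice of 1 lemons\nblack pepper\n25 g unsalted butter"

tagged = tag(r"[0-9].*", "li", sample)
print(tagged)
```

Here the lines that start with a digit are tagged whole, "black pepper" is missed, and in "juice of 1 lemons" the match starts at the digit, so only "1 lemons" ends up inside the tags.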

This almost lists all the ingredients, but not quite. Try using the following regular expressions, one by one. Try to understand what is happening.

  • [0-9].*
  • \n[0-9].*
  • \n([0-9]|juice).*
  • \n([0-9]|juice|black).*
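To compare the four expressions side by side, here is a small Python sketch. It assumes the pasted text has an "Ingredients" line before the first ingredient and a "Method" line before the steps (the sample above does not show them), and the helper `matches` is a hypothetical name used only here.

```python
import re

# Shortened sample; the "Ingredients" and "Method" lines are an assumption.
sample = (
    "Ingredients\n"
    "8 tbsp olive oil\n"
    "juice of 1 lemons\n"
    "black pepper\n"
    "25 g unsalted butter\n"
    "Method\n"
    "Add the cherry tomatoes and cook for 2 minutes."
)

patterns = [
    r"[0-9].*",
    r"\n[0-9].*",
    r"\n([0-9]|juice).*",
    r"\n([0-9]|juice|black).*",
]

def matches(pattern: str, text: str) -> list[str]:
    """Return each full match, with the leading newline stripped for display."""
    return [m.group(0).strip() for m in re.finditer(pattern, text)]

for p in patterns:
    print(p, matches(p, sample))
```

The first expression also picks up "2 minutes." from the method and only part of the lemon line. Requiring a newline before the match restricts it to the start of a line, and the alternatives `juice` and `black` bring back the ingredients that do not start with a number.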

Once you are satisfied that you have successfully picked out all the ingredients, click tag text.

To finish the markup of the ingredients, try the regular expression (<li>\n.*)*</li> first with b as the tag, and then ul. Again click tag text.
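The effect of this second pass can be sketched in Python as well. This assumes that clicking tag text replaces the working text with the tagged result, so the second expression runs over text that already contains the li tags; `tag` is the same hypothetical helper as before, and the sample is shortened.

```python
import re

def tag(pattern: str, tag_name: str, text: str) -> str:
    """Wrap every match of pattern in <tag_name>...</tag_name> (assumed sandpit behaviour)."""
    return re.sub(pattern, lambda m: f"<{tag_name}>{m.group(0)}</{tag_name}>", text)

sample = (
    "Ingredients\n"
    "8 tbsp olive oil\n"
    "juice of 1 lemons\n"
    "black pepper\n"
    "25 g unsalted butter\n"
    "Method"
)

# First pass: each ingredient line (with its leading newline) becomes a list item.
items = tag(r"\n([0-9]|juice|black).*", "li", sample)

# Second pass: the unbroken run of list items is wrapped once in <ul>.
listed = tag(r"(<li>\n.*)*</li>", "ul", items)
print(listed)
```

Because each ingredient match included its leading newline, every item now begins with `<li>` followed by a newline, which is exactly what `(<li>\n.*)*` repeats over. The final `</li>` closes the run, so the whole list is one match.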

Now use similar methods to tag each step in the method with li and the whole sequence of steps with ol.
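One possible approach to the method steps, sketched in Python, is shown below. It is only a suggestion: the expression `\n[A-Z].*\.` (a line that starts with a capital letter and contains a full stop) is an assumption that happens to fit this recipe, the text is limited to the method part with an assumed "Method" line on top, and `tag` is again a hypothetical helper.

```python
import re

def tag(pattern: str, tag_name: str, text: str) -> str:
    """Wrap every match of pattern in <tag_name>...</tag_name> (assumed sandpit behaviour)."""
    return re.sub(pattern, lambda m: f"<{tag_name}>{m.group(0)}</{tag_name}>", text)

method = (
    "Method\n"
    "Heat the olive oil in a large, heavy-based frying pan.\n"
    "Add the prawns and fry until lightly browned.\n"
    "Add the cherry tomatoes and cook for 2 minutes."
)

# Each step starts on a new line with a capital letter and ends at the last full stop.
steps = tag(r"\n[A-Z].*\.", "li", method)

# The run of list items is then wrapped once in <ol>.
ordered = tag(r"(<li>\n.*)*</li>", "ol", steps)
print(ordered)
```

The "Method" line is left alone because it comes first (no newline before it) and has no full stop. A different recipe may need a different expression for the steps.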

Complete the markup by tagging the words "Ingredients" and "Method" with the tag h2.

Now find a recipe online (choose one that is well-structured). Scrape the text using copy and paste and use appropriate regular expressions and tags to mark it up in HTML.

For extra fun, scrape the text from this page at and tag each address in bold.
