Thursday, 25 November 2010

regex review

Find the Amazon UK page listing Digital Cameras. Use View Source in your browser to examine the HTML for this page. How are the prices tagged?

Explain briefly how XSLT could be used to extract a table of prices from this page.

On the printouts provided, circle the (non-empty) strings matched by each of the following regular expressions. You can use a separate sheet for each expression.



Can you write a regular expression that selects all the model numbers on this page and nothing else? Use the regular expression sandpit to check your answer – copy text from the web page to replace the Jabberwocky poem.

The Amazon page also has details of resolution (megapixels), zoom factor (e.g. 8x Optical) and screen size (e.g. 2.7 inch), but these are not tagged specially in the page source.

To tag them you would first have to find them. Use the regular expression sandpit to find three regular expressions – one for each data type – to find all occurrences of each of these three data types.
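
As a starting point, here is a minimal Python sketch of what such expressions might look like. The sample text and product name are invented for illustration, and the assumed formats ("14.1 MP", "4x Optical", "2.7 inch") may not cover every variant on the real page, so treat these as drafts to refine in the sandpit.

```python
import re

# Invented sample text; the real Amazon listing may be worded differently.
sample = "ExampleCam X100 Digital Camera (14.1 MP, 4x Optical Zoom) 2.7 inch LCD"

# Assumed formats: "14.1 MP" or "14 megapixels", "4x Optical", "2.7 inch".
megapixels = re.compile(r"\d+(?:\.\d+)?\s*(?:MP|[Mm]egapixels?)")
zoom = re.compile(r"\d+(?:\.\d+)?x\s+[Oo]ptical")
screen = re.compile(r"\d+(?:\.\d+)?[ -]?inch")

print(megapixels.findall(sample))  # ['14.1 MP']
print(zoom.findall(sample))        # ['4x Optical']
print(screen.findall(sample))      # ['2.7 inch']
```

Note that the model number "X100" contains digits but is not matched by any of the three patterns, because each pattern requires the unit that follows the number.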

Friday, 12 November 2010

Assignment 3: The Tablet

Deadline: Monday 29 November, 16:00

You can submit your report via TurnItIn. See email of 26th November for signup details. Once you have registered, you can sign in at the following url:

The deadline for your submission is 1600 UTC on Monday 29th November. Unless you have good reason for not meeting this deadline, and we have agreed an extension, late work will not be marked.

Electronic submission is preferred, but if for any reason you cannot submit your work using TurnItIn, you should submit two hard copies of your report to the ITO on the 4th floor of the Appleton Tower before the deadline.

First, download and read a document published in 1988 entitled,
"TABLET: the personal computer of the year 2000".

The abstract says,
"The University of Illinois design extends the freedom of pen and notepad with a machine that draws on the projected power of 21st century technology. Without assuming any new, major technological breakthroughs, it seeks to balance the promises of today’s growing technologies with the changing role of computers in tomorrow’s education, research, security, and commerce.

"The design is simple, yet sleek. Roughly the size and weight of a notebook, the machine has no moving parts and resembles the dark, featureless monolith from a well known movie."
For this assignment you should write a report (1,700-2,500 words), entitled "Predicting the Future", using this paper as a case study. Your report should discuss the extent to which the future of technology can be predicted 10 to 20 years ahead.
  1. Briefly summarize the key features of the personal computers available in 1988.
  2. You should examine the technological trends on which the authors base their predictions (p.28). Select three (or more) key technology metrics and report on how well their development over the past 22 years has matched the predictions of the paper. 
  3. Use the internet to identify, and report on, current predictions for changes in these metrics over the coming decade.
  4. How do devices available today match the predictions made 22 years ago? Which features of the "Tablet" are now available on mass-market personal devices, and when were they introduced to the market? What, if any, current technologies did the authors fail to predict?
  5. Use Google to find out what you can of what has become of the authors, and report briefly.
  6. For the final section of your report you should take a stab at making your own predictions, based on the technology forecasts you reported in part 3, for a device that might be in common use in 10-20 years time.
Your report should be between 1,700 and 2,500 words in length. The School's standard guidelines on plagiarism apply.

Further reading

As we may think, Vannevar Bush (July 1945)

Thursday, 4 November 2010

Assignment 2 – Who owns information?

(hand-in 4pm 15 Nov)

For the second assignment, you should write a report on the way digital technologies are challenging traditional notions of privacy and ownership of information.

You can choose to focus on one of two areas:
  1. One concerns issues of copyright and 'digital rights management'. How do works stored digitally differ from works embodied in media such as books or CDs? How do rights of 'fair use' fare in the face of digital rights management? Should you be able to lend items from your digital library to friends, just as you can lend embodied media? How can society best encourage creativity in the production of 'digital assets'? How can we ensure that future generations will have access to works being produced today?
  2. The second issue concerns personal data. How and why do entities such as Google and Facebook collect data? What are the personal risks and benefits of this activity? What are the risks and benefits for society? How can the risks be mitigated and the benefits enjoyed? To what extent is it desirable or practical for government to limit such activities? Is government also collecting such data – if so, why and how?
Whichever focus you choose, your report should address the following issues:
  • What is the value of the information in question? How will it be used? Who benefits financially? Who enjoys other benefits of ownership or use?
  • What is the cost of the information in question? What efforts contribute to its production, and how are they stimulated or funded?
  • What new risks and opportunities are created by the use of digital technologies to store, process and communicate this information?
  • You should also include a list of references you have consulted in the preparation of your report.
The following Wikipedia articles may be useful as background reading, but you should use the web to find original sources to inform your report.


Your report should be between two and three thousand words in length. The School's standard guidelines on plagiarism apply.

Lab session – advanced search

Google offers an advanced search page – for details see this article or this note on search operators, or just search for "google advanced search tips". Google also offers many specialised searches.

Your task for this lab is to use these tools to find information on ownership of information. You will need this for the second assignment.

You should experiment with use of the boolean operators and synonyms.

As an exercise, try to find a query that returns as few documents as possible – while still returning at least one document. This is a competition: in case of a tie on number of documents, the shortest query wins. In case of a tie on that criterion also, the query using the shortest 'words' wins – and any string enclosed in quotes (") counts as one 'word'.

Thursday, 28 October 2010

Eliza Lab

Today you'll experiment with using regular expressions to mimic Eliza's responses. Using this tool, type in search patterns to match the kinds of sentences that people often type as input to Eliza. Then in the replacement pattern box, write a pattern to produce an Eliza-like response. For example, the search pattern
  • I am ([a-z]+)
with the replacement pattern
  • How long have you been $1?
will match the sentence
  • I am bored
typed in the input text box and produce
  • How long have you been bored?
in the output text box.
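
The same search-and-replace idea can be sketched in Python. This is only an illustration and assumes the search pattern has a single capture group. Python writes a reference to the first group as \1 where the tool uses a $-numbered reference, and the function name respond is made up for this example.

```python
import re

# One capture group holds the word that follows "I am ".
SEARCH = re.compile(r"I am ([a-z]+)")

# \1 is Python's way of referring to the first capture group.
REPLACEMENT = r"How long have you been \1?"

def respond(sentence):
    """Replace the matched part of the sentence with an Eliza-like reply."""
    return SEARCH.sub(REPLACEMENT, sentence)

print(respond("I am bored"))  # How long have you been bored?
```

Only the matched part of the input is replaced, so any text before or after the match is carried through unchanged.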
So, now you try it. Write a regular expression search pattern to match sentences of the form:
  • I am sick
  • I am tired
  • I am hungry
Does your search pattern match the following strings and produce the kind of response you would expect?
  • Today I am bored
  • I feel very happy
  • I feel awfully tired
If not, fix your search pattern so that the system generates sensible outputs for each of these. Now try these
  • I feel cold and I want a nap
  • I feel cold and I want a hot drink
  • I like food and drink
  • I like to sing and dance
What if you wanted to generate the response:
  • Have you always enjoyed eating and drinking
in response to the input:
  • I like to eat and drink
Would your search pattern also work for
  • I like to sing and dance
Now write a search pattern that will match sentences like:
  • I am tired and hungry
and produce the response:
  • What do you think makes you hungry and tired?
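
One possible approach, sketched in Python, is to capture the two words in separate groups and refer to them in reverse order in the replacement. This is an illustrative sketch rather than the only answer. Python's \1 and \2 stand in for the tool's group references.

```python
import re

# Two capture groups: the word before "and" and the word after it.
search = re.compile(r"I am ([a-z]+) and ([a-z]+)")

# Referring to the groups in reverse order swaps the two words.
replacement = r"What do you think makes you \2 and \1?"

reply = search.sub(replacement, "I am tired and hungry")
print(reply)  # What do you think makes you hungry and tired?
```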
Now play with this version of Eliza. Find cases where it doesn't respond very naturally and try and write search and replacement patterns that solve the problems you encounter. (Note: they may not all be solvable just by this simple search and replace method.)

Thursday, 14 October 2010

Tagging with regular expressions

For this exercise you will find a recipe online, scrape the text from the web page, and then tag it with HTML markup. For example, here is some text scraped from
You should work through the exercise with this text first.

8 tbsp olive oil
450 g cooked peeled prawns, (ideally freshly shelled), drained and patted dry
1 medium onion, finely chopped
4 stick celery
2 cloves garlic, finely chopped
1 red chilli, de-seeded and finely chopped
1 bunch of fresh mint, leaves chopped
100 g cherry tomatoes, halved
juice of 1 lemons
4 courgettes, grated
150 ml white wine
black pepper
500 g fresh tagliarini
200 ml fresh vegetable stock
25 g unsalted butter
40 g parmesan, freshly grated
Heat the olive oil in a large, heavy-based frying pan. 
Add the prawns and fry until lightly browned. 
Add the onion, celery, garlic, chilli and mint. Fry until the prawns turn nutty brown. 
Add the cherry tomatoes and cook for 2 minutes. 
Stir in the lemon juice, then the courgettes. Cook for 5 minutes. 
Pour in the white wine, bring to the boil and cook for 2 minutes. Remove from direct heat. 
Bring a large pan of salted water to the boil. Add the fresh tagliarini and cook until al dente, around 2 minutes. Strain. 
Add the strained tagliarini to the prawn mixture, followed by the vegetable stock, butter and Parmesan cheese. 
Mix well together over a medium heat with a wooden spoon until glossy. Season with salt and freshly ground pepper and serve at once.

Most recipes on this site have a similar format.

Paste the text into the regular expression sandpit.
You will use this tool to mark up the text in html. Read the instructions on the sandpit, and experiment to make sure you understand how it works.

First, put the regular expression [0-9].* into the regexp box. Then put li into the tag box. This shows you the result of enclosing each match for the regular expression between tags <li> and </li>.
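
For comparison, here is a minimal Python sketch of the same operation, assuming the sandpit simply wraps each match in an opening and closing tag. The function name tag and the three-line sample are made up for this illustration.

```python
import re

def tag(text, regexp, tag_name):
    """Wrap every match of regexp in <tag_name>...</tag_name>."""
    return re.sub(
        regexp,
        lambda m: f"<{tag_name}>{m.group(0)}</{tag_name}>",
        text,
    )

# Three lines taken from the recipe text above.
sample = "8 tbsp olive oil\njuice of 1 lemons\nblack pepper"
print(tag(sample, r"[0-9].*", "li"))
# <li>8 tbsp olive oil</li>
# juice of <li>1 lemons</li>
# black pepper
```

The pattern starts each match at the first digit on a line and runs to the end of that line, so a line with a digit in the middle is only partly tagged and a line with no digit is not tagged at all.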

This almost lists all the ingredients, but not quite. Try using the following regular expressions, one by one. Try to understand what is happening.

  • [0-9].*
  • \n[0-9].*
  • \n([0-9]|juice).*
  • \n([0-9]|juice|black).*

Once you are satisfied that you have successfully picked out all the ingredients, click tag text.
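
To see what each of the four expressions picks out, the following Python sketch runs them over a shortened sample. The sample is an abbreviated version of the recipe text, it is assumed to begin with a newline (as it would if a blank line were pasted before the first ingredient), and the helper name matches is made up for this example.

```python
import re

# Abbreviated recipe text, assumed to start with a newline.
sample = (
    "\n8 tbsp olive oil"
    "\njuice of 1 lemons"
    "\n4 courgettes, grated"
    "\nblack pepper"
    "\nHeat the olive oil in a large pan. Cook for 2 minutes."
)

def matches(regexp, text):
    """Return every full match of regexp, with surrounding whitespace removed."""
    return [m.group(0).strip() for m in re.finditer(regexp, text)]

for regexp in [
    r"[0-9].*",
    r"\n[0-9].*",
    r"\n([0-9]|juice).*",
    r"\n([0-9]|juice|black).*",
]:
    print(regexp, matches(regexp, sample))
```

The first expression also matches "1 lemons" and "2 minutes.", which are not whole ingredient lines. Anchoring on the newline removes those, and the alternatives juice and black then bring back the ingredient lines that do not start with a digit.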

To finish the markup of the ingredients, try the regular expression (<li>\n.*)*</li> first with b as the tag, and then ul. Again click tag text.
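
The effect of this second pass can be sketched in Python, again assuming the sandpit wraps each match in the chosen tag. The helper tag and the short sample are made up for this illustration.

```python
import re

def tag(text, regexp, tag_name):
    """Wrap every match of regexp in <tag_name>...</tag_name>."""
    return re.sub(
        regexp,
        lambda m: f"<{tag_name}>{m.group(0)}</{tag_name}>",
        text,
    )

# Two ingredient lines followed by the first line of the method.
sample = "\n8 tbsp olive oil\nblack pepper\nHeat the olive oil."

# First pass: tag each ingredient line as a list item.
items = tag(sample, r"\n([0-9]|juice|black).*", "li")

# Second pass: wrap the whole run of list items in a single list.
marked = tag(items, r"(<li>\n.*)*</li>", "ul")
print(marked)
# <ul><li>
# 8 tbsp olive oil</li><li>
# black pepper</li></ul>
# Heat the olive oil.
```

Because each first-pass match begins with the newline, the closing tag of one item sits directly before the opening tag of the next, which is what lets the second expression match the whole run of items as one string.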

Now use similar methods to tag each step in the method with li and the whole sequence of steps with ol.

Complete the markup by tagging the words "Ingredients" and "Method" with the tag h2.

Now find a recipe online (choose one that is well-structured). Scrape the text using copy and paste and use appropriate regular expressions and tags to mark it up in HTML.

For extra fun, scrape the text from this page at and tag each address in bold.

Tuesday, 28 September 2010

Monday, 27 September 2010

Data and information

The slides are available as pdf (with notes).
Slogan for this lecture: "The more data you have the easier it is to separate the information from the noise."


Mapping the World's Photos

Monday, 20 September 2010

Introduction 20 September 2010

The slides are available as pdf (with notes) – just click the title above.
By the end of this lecture you should be able to give examples of analogue and digital storage of information.

You should know about the basic units of information (bits and bytes), and be able to do some basic calculations in powers of 2.
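
As a small worked example of the kind of calculation meant here, the following Python sketch uses the fact that n bits can represent 2 to the power n distinct values. The function name values is made up for this illustration.

```python
def values(n_bits):
    """Number of distinct values that n_bits bits can represent."""
    return 2 ** n_bits

BITS_PER_BYTE = 8

print(values(1))                  # 2
print(values(BITS_PER_BYTE))      # 256
print(values(10))                 # 1024
print(values(16) // values(10))   # 64, because 2**16 / 2**10 = 2**6
```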

Please book tickets for guest lectures on
Thursday 23 September and Monday 4 October